Import Large CSVs into SQL Server Using PowerShell
A buddy of mine was running into issues importing a 1 GB CSV of longitude and latitude data available from geonames.org. The CSV contains over 9 million rows, and no matter what he did, DTS/SSIS would not import the data because of conversion issues. That's been happening to me a lot lately, too, so I find myself relying on PowerShell for quick imports.
I wrote this easily reusable PowerShell script, which can import an effectively unlimited number of rows. For the data I'm working with, it takes about 31 seconds to insert 1 million rows into an indexed table!
Based on the information in the geonames readme, I created the SQL table using the most precise datatypes I could. After the import, however, I found that although the readme suggests varchar across the board, two columns (Name and AlternateNames) actually require nvarchar.
CREATE TABLE allCountries (
    GeoNameId int PRIMARY KEY,
    Name nvarchar(200),
    AsciiName varchar(200),
    AlternateNames nvarchar(max),
    Latitude float,
    Longitude float,
    FeatureClass char(1),
    FeatureCode varchar(10),
    CountryCode char(2),
    Cc2 varchar(60),
    Admin1Code varchar(20),
    Admin2Code varchar(80),
    Admin3Code varchar(20),
    Admin4Code varchar(20),
    Population bigint,
    Elevation varchar(255),
    Dem int,
    Timezone varchar(40),
    ModificationDate date
)
For the PowerShell portion, I used SqlBulkCopy and [System.IO.File]::OpenText because they're the fastest for this type of operation, and a DataTable because it's the easiest to work with, in my opinion. I explored using an IDataReader and streaming, but it got too convoluted.
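(As a side note, [System.IO.File]::OpenText simply hands back a StreamReader, so either of the lines below gives you essentially the same reader object the script works with:)

# Both return a System.IO.StreamReader over the CSV file
$reader = [System.IO.File]::OpenText($csvfile)
$reader = New-Object System.IO.StreamReader($csvfile)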
Initially, I loaded all 9 million rows into memory, but that used all my RAM and caused a ton of paging. To get around this, I performed the bulk import after every 100,000 rows, then emptied the DataTable.
A huge thanks goes out to Stevie Nova, whose code showed me how to optimize DataTable row inserts by using $row.ItemArray instead of populating each column manually ($row.Item("GeoNameId") = $data[0]). That change made my inserts roughly four times faster!
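To make the difference concrete, here's the shape of both approaches; the column-by-column version assumes the DataTable columns were added with names, as in the example at the end of this post:

# Slower: populate each column by hand
$data = $line.Split($csvdelimiter)
$row = $datatable.NewRow()
$row.Item("GeoNameId") = $data[0]
$row.Item("Name") = $data[1]
# ...and so on for every column...
$datatable.Rows.Add($row)

# Faster: assign the whole split line to ItemArray in one shot
$row = $datatable.NewRow()
$row.ItemArray = $line.Split($csvdelimiter)
$datatable.Rows.Add($row)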
The basic idea behind this script came from that same buddy, who had split his CSV into separate physical files. I thought it would be better to do the splitting within PowerShell itself, and figured something like this would do the trick:
if ($dt.Rows.Count -eq 100000) {
    $bulkCopy.WriteToServer($dt)
    $dt.Clear()
}
That reads nicely, but I wanted the code to be as fast as possible, and I figured repeatedly querying $dt.Rows.Count would cause a performance hit. Sure enough, checking $dt.Rows.Count took 11 seconds longer per 1 million rows than incrementing a counter and checking $i % $batchsize, so I went with the counter.
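Here's the counter version as it appears in the full script below:

# Increment a counter and check it against the batch size
# instead of asking the DataTable for its row count
$i++
if (($i % $batchsize) -eq 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}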
After much tweaking, I got it down to about 30-40 seconds per 1 million rows. I was thrilled when I saw this output after running the import against a non-indexed table in a database using the bulk-logged recovery model:
Script complete. 9,000,000 rows have been inserted into the database. Total Elapsed Time: 00:04:45.9362223
The indexed table took 00:05:16 to run.
Here's the script in its entirety:
####################################################
#
# PowerShell CSV to SQL Import Script
#
####################################################

# Database variables
$sqlserver = "sqlserver"
$database = "locations"
$table = "allcountries"

# CSV variables
$csvfile = "C:\temp\allCountries.txt"
$csvdelimiter = "`t"
$firstrowcolumnnames = $false

####################################################
#
# No additional changes are required below unless
# you want to modify your sqlbulkcopy options or
# your SQL authentication details
#
####################################################

Write-Output "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()

# 100k worked fastest and kept memory usage to a minimum
$batchsize = 100000

# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring)
$bulkcopy.DestinationTableName = $table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize
$bulkcopy.EnableStreaming = 1

# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable

# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$line = $reader.ReadLine()
$columns = $line.Split($csvdelimiter)

if ($firstrowcolumnnames -eq $false) {
    foreach ($column in $columns) {
        $null = $datatable.Columns.Add()
    }
    # start the reader over so the first line is imported as data
    $reader.DiscardBufferedData()
    $reader.BaseStream.Position = 0
}
else {
    foreach ($column in $columns) {
        $null = $datatable.Columns.Add($column)
    }
}

# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
    $row = $datatable.NewRow()
    $row.ItemArray = $line.Split($csvdelimiter)
    $datatable.Rows.Add($row)

    # Once you reach your batch size, write to the db,
    # then clear the datatable from memory
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        Write-Output "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}

# Close the CSV file
$reader.Close()

# Add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

# Sometimes the Garbage Collector takes too long.
[System.GC]::Collect()

Write-Output "Script complete. $i rows have been inserted into the database."
Write-Output "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
In my last trial run, the script imported more than 1.7 million rows a minute while a semi-optimized SSIS insert performed at only 300,000 rows per minute!
If you're interested in going deeper into SqlBulkCopy performance, check out this PDF report from SQLBI.com.
Oh, one last thing: the way this script is written requires that the SQL table's column order and the CSV's column order be the same. If you need to rearrange your columns, you can do so by explicitly specifying them when populating your DataTable, though doing so takes up to 400% longer.
Alternatively, as Joel Reinford suggested below, you can create a view whose column order matches your CSV and bulk copy into that instead.
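Here's a rough sketch of that idea; the table and view names are made up for illustration, and the point is simply that the view's SELECT list matches the CSV's column order:

-- Hypothetical table whose physical column order does NOT match the CSV
CREATE TABLE dbo.Cities (
    CityId     int PRIMARY KEY,
    Population bigint,
    Name       nvarchar(200)
)
GO

-- View that re-exposes the columns in the CSV's order: CityId, Name, Population
CREATE VIEW dbo.CitiesImport
AS
SELECT CityId, Name, Population
FROM dbo.Cities
GO

SqlBulkCopy can insert through a simple single-table view like this, so pointing $table at "CitiesImport" should be the only change the script needs.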
If you go the explicit column route instead, the code would look something like this:
$dt = New-Object System.Data.DataTable
$null = $dt.Columns.Add("GeoNameId")
$null = $dt.Columns.Add("Name")
$null = $dt.Columns.Add("AsciiName")
$null = $dt.Columns.Add("AlternateNames")
$null = $dt.Columns.Add("Latitude")
$null = $dt.Columns.Add("Longitude")
$null = $dt.Columns.Add("FeatureClass")
$null = $dt.Columns.Add("FeatureCode")
$null = $dt.Columns.Add("CountryCode")
$null = $dt.Columns.Add("Cc2")
$null = $dt.Columns.Add("Admin1Code")
$null = $dt.Columns.Add("Admin2Code")
$null = $dt.Columns.Add("Admin3Code")
$null = $dt.Columns.Add("Admin4Code")
$null = $dt.Columns.Add("Population")
$null = $dt.Columns.Add("Elevation")
$null = $dt.Columns.Add("Dem")
$null = $dt.Columns.Add("Timezone")
$null = $dt.Columns.Add("ModificationDate")

# yadda yadda

while (($line = $reader.ReadLine()) -ne $null) {
    $data = $line.Split("`t")
    $row = $dt.NewRow()
    $row.Item("GeoNameId") = $data[0]
    $row.Item("Name") = $data[1]
    $row.Item("AsciiName") = $data[2]
    $row.Item("AlternateNames") = $data[3]
    $row.Item("Latitude") = $data[4]
    $row.Item("Longitude") = $data[5]
    $row.Item("FeatureClass") = $data[6]
    $row.Item("FeatureCode") = $data[7]
    $row.Item("CountryCode") = $data[8]
    $row.Item("Cc2") = $data[9]
    $row.Item("Admin1Code") = $data[10]
    $row.Item("Admin2Code") = $data[11]
    $row.Item("Admin3Code") = $data[12]
    $row.Item("Admin4Code") = $data[13]
    $row.Item("Population") = $data[14]
    $row.Item("Elevation") = $data[15]
    $row.Item("Dem") = $data[16]
    $row.Item("Timezone") = $data[17]
    $row.Item("ModificationDate") = $data[18]
    $dt.Rows.Add($row)
}

# and so on