If you’ve ever tried to use PowerShell’s Import-Csv with large files, you know that it can exhaust all of your RAM. Previously, I created a script on ScriptCenter that used an alternative technique to import large CSV files, and even imported them into SQL Server at a rate of 1.7 million rows a minute. I’ve now gotten that number up to a massive 5.35 million rows a minute for non-indexed tables and 4.35 million rows a minute for tables with clustered indexes.
Importing into SQL Server
Two of the fastest ways to import data are bcp and SqlBulkCopy. bcp is finicky and requires a separate download. My intention is to create a script that can be used on any machine running PowerShell v3 and above, so these scripts use SqlBulkCopy. The catch is that SqlBulkCopy.WriteToServer() requires a DataTable, an array of DataRows, or an IDataReader as input. This means your plain-text CSV data must be transformed into one of these objects before it is passed to WriteToServer().
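For orientation, here’s a minimal sketch of that pattern; the connection string, table name, and column are placeholder assumptions, and the three scripts below all build on this same skeleton.

# Minimal SqlBulkCopy sketch -- server, database, table, and column are placeholders.
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
$connectionstring = "Data Source=localhost;Integrated Security=true;Initial Catalog=mydb;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring)
$bulkcopy.DestinationTableName = "dbo.mytable"

# WriteToServer() accepts a DataTable (shown here), an array of DataRows, or an IDataReader.
$datatable = New-Object System.Data.DataTable
$null = $datatable.Columns.Add("Name")
$null = $datatable.Rows.Add("Example row")
$bulkcopy.WriteToServer($datatable)
$bulkcopy.Close()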
I’ve tested three different methods for transforming CSV data into a consumable object. The first uses SqlBulkCopy with StreamReader and DataTable batches. The second streams an OleDbConnection query result set to SqlBulkCopy. The third uses SqlBulkCopy with Microsoft.VisualBasic.FileIO.TextFieldParser and DataTable batches. I actually tested LogParser, too, but while LogParser is amazing (and way, way faster) at parsing files, it doesn’t take advantage of SqlBulkCopy, so these approaches actually outperform it (and my tests with SSIS, for that matter).
Setup
My lab is a VMware virtualized lab which now consists of a Windows 10 machine with 6GB RAM and a SQL Server 2014 instance with 12GB RAM (8GB dedicated to SQL Server). Both are on the same ESX host, which sports a directly connected Samsung 850 EVO 500GB SSD. The database’s recovery model is set to SIMPLE.
The 1.2 GB CSV in this example contains 10 million+ rows of tab-delimited longitude and latitude data. A zip containing the CSV is available for download at geonames.org.
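If you want to sanity-check the delimiter and column count before building the destination table, a quick peek at the first few lines does the trick (the path below is just where I extracted the file).

# Show the first rows of the tab-delimited file and count the columns in the header row.
$csvfile = "C:\temp\allcountries.txt"
Get-Content $csvfile -First 3
((Get-Content $csvfile -First 1) -split "`t").Count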
First, I had to create the SQL table.
CREATE TABLE allCountries (
    GeoNameId int PRIMARY KEY,
    Name nvarchar(200),
    AsciiName nvarchar(200),
    AlternateNames nvarchar(max),
    Latitude float,
    Longitude float,
    FeatureClass char(1),
    FeatureCode varchar(10),
    CountryCode char(2),
    Cc2 varchar(255),
    Admin1Code varchar(20),
    Admin2Code varchar(80),
    Admin3Code varchar(20),
    Admin4Code varchar(20),
    Population bigint,
    Elevation varchar(255),
    Dem int,
    Timezone varchar(40),
    ModificationDate smalldatetime
)
Then I tested the various methods. You’ll probably notice that these scripts don’t have much error handling. I wanted to keep the code easy to read and straightforward. Finalized scripts with error handling and all that will be placed on ScriptCenter shortly.
StreamReader to DataTable batches
This one is by far the fastest, at up to 5.35 million rows per minute on a non-indexed table.
Realistically, you’ll probably be importing to an indexed table, which performs up to 4.35 million rows per minute on optimized systems.
Basically, the script performs the following steps:
- Creates the SQL Bulk Copy connection
- Creates the PowerShell datatable, along with its columns
- Reads the CSV file using System.IO.StreamReader
- Using ReadLine(), loads the CSV data row by row into the datatable
- Performs the bulk import every x number of rows
- Empties the datatable
- Rinse repeat starting at step 4 until the end of the CSV file
In order to avoid having to specify column names each time you import a new CSV, the SQL table column order and CSV column order must be the same. If you need to rearrange your columns, I suggest creating a view and importing to the view, as sketched below.
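If you go the view route, here’s a rough sketch of what I mean; the view name (allCountries_import) is made up, and the column list simply mirrors the allCountries table above, so reorder it to match your CSV.

# Run once: create a view whose column order matches the CSV, then bulk copy into the view.
# The view name is a hypothetical example.
$conn = New-Object System.Data.SqlClient.SqlConnection("Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;")
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = @"
CREATE VIEW dbo.allCountries_import AS
SELECT GeoNameId, Name, AsciiName, AlternateNames, Latitude, Longitude, FeatureClass,
       FeatureCode, CountryCode, Cc2, Admin1Code, Admin2Code, Admin3Code, Admin4Code,
       Population, Elevation, Dem, Timezone, ModificationDate
FROM dbo.allCountries
"@
$null = $cmd.ExecuteNonQuery()
$conn.Dispose()

# Then point the bulk copy at the view instead of the base table:
# $bulkcopy.DestinationTableName = "dbo.allCountries_import"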
If you’re wondering why batching was used, it’s because the datatable seems to get exponentially slower as it grows above a few hundred thousand rows. I actually wasn’t able to import all 9 million rows into the datatable before my memory was exhausted. I tested various batch sizes, and 50k worked best for me.
The downside to using this script is that it doesn’t handle embedded delimiters well. So if you have a comma delimiter and your data contains "This Works","This, doesn't", then it will fail on "This, doesn't". You can address this with a regex (as seen below, after the code).
The Script
# Database variables
$sqlserver = "sqlserver"
$database = "locations"
$table = "allcountries"

# CSV variables
$csvfile = "C:\temp\allcountries.txt"
$csvdelimiter = "`t"
$firstRowColumnNames = $false

################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")

# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000

# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize

# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable

# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$columns = (Get-Content $csvfile -First 1).Split($csvdelimiter)
if ($firstRowColumnNames -eq $true) { $null = $reader.ReadLine() }
foreach ($column in $columns) { $null = $datatable.Columns.Add() }

# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
    $null = $datatable.Rows.Add($line.Split($csvdelimiter))
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}

# Add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

# Clean Up
$reader.Close(); $reader.Dispose()
$bulkcopy.Close(); $bulkcopy.Dispose()
$datatable.Dispose()

Write-Host "Script complete. $i rows have been inserted into the database."
Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
As I mentioned earlier, you can use regex if you have delimiters embedded in quotes. This slows down the script by a fair degree, but here’s some code to get you started.
# from http://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$fieldsEnclosedInQuotes = $true

...

if ($fieldsEnclosedInQuotes) {
    $csvSplit = "($csvdelimiter)"
    $csvSplit += '(?=(?:[^"]|"[^"]*")*$)'
} else {
    $csvSplit = $csvdelimiter
}
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture

...

$columns = [regex]::Split($firstline, $csvSplit, $regexOptions)

...

$null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
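Here’s a quick way to see what that split actually does, using the "This Works"/"This, doesn't" example from above.

# Demo: split a comma-delimited line without breaking on the comma inside the quotes.
$csvdelimiter = ","
$csvSplit = "($csvdelimiter)" + '(?=(?:[^"]|"[^"]*")*$)'
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
[regex]::Split('"This Works","This, doesn''t"', $csvSplit, $regexOptions)
# Two fields come back: "This Works" and "This, doesn't" (quotes still attached).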
Streaming a 32-bit OleDbConnection
What I like about this method is that it’s still high performance (~2 million rows/minute) and you can select subsets of the CSV data if you wish. Just change the CommandText to include a WHERE clause, for example: $cmd.CommandText = "SELECT count(*) FROM [$tablename] WHERE Category='Cajun'". A filtered import example follows the script below. If your CSV doesn’t have column names, you can use the driver’s built-in column names F1, F2, F3, and so on.
Here are the results from a million row subset:
The downside to this script is that you have to use an x86 shell, but because the data is streamed rather than held in memory, performance isn’t impacted. Switching to another shell can get a little complex if you embed this script in a function; I’ve mitigated this to the best of my ability by having the script relaunch itself in the 32-bit shell with the current script name and arguments.
The Script
# The Text OleDB driver is only available in PowerShell x86. Start x86 shell if using x64.
# This has to be the first check this script performs.
if ($env:Processor_Architecture -ne "x86") {
    Write-Warning "Switching to x86 shell"
    $ps32exe = "$env:windir\syswow64\windowspowershell\v1.0\powershell.exe"
    &$ps32exe "$PSCommandPath $args"; return
}

# Database variables
$sqlserver = "sqlserver"
$database = "locations"
$table = "allcountries"

# CSV variables
$csvfile = "C:\temp\million.csv"
$csvdelimiter = ","
$firstRowColumnNames = $true

################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")

# Setup bulk copy
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0

# Setup OleDB using Microsoft Text Driver.
$datasource = Split-Path $csvfile
$tablename = (Split-Path $csvfile -Leaf).Replace(".", "#")

switch ($firstRowColumnNames) {
    $true { $firstRowColumnNames = "Yes" }
    $false { $firstRowColumnNames = "No" }
}

$connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties='text;HDR=$firstRowColumnNames;FMT=Delimited($csvdelimiter)';"
$conn = New-Object System.Data.OleDb.OleDbConnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDb.OleDbCommand
$cmd.Connection = $conn

# Perform select on the CSV file, then stream the results straight to SqlBulkCopy using ExecuteReader
$cmd.CommandText = "SELECT * FROM [$tablename]"
$bulkcopy.WriteToServer($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))

# Get Totals
$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds, 2)
$conn.Close(); $conn.Open()
$cmd.CommandText = "SELECT count(*) FROM [$tablename]"
$totalrows = $cmd.ExecuteScalar()
Write-Host "Total Elapsed Time: $totaltime seconds. $totalrows rows added." -ForegroundColor Green

# Clean Up
$cmd.Dispose(); $conn.Dispose(); $bulkcopy.Dispose()
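As mentioned above, importing a subset is just a matter of swapping the SELECT * for a filtered query before calling WriteToServer(). The column name below is hypothetical; use a header from your own CSV (or F1, F2, F3 if it has none).

# Hypothetical filter -- replace Category with a real column from your CSV header.
$cmd.CommandText = "SELECT * FROM [$tablename] WHERE Category = 'Cajun'"
$bulkcopy.WriteToServer($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))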
VisualBasic.FileIO.TextFieldParser to DataTable batches
This technique is the slowest, but if you’ve got embedded delimiters and need to stay in x64, you may want to consider it.
The Script
# Database variables
$sqlserver = "sqlserver"
$database = "locations"
$table = "allcountries"

# CSV variables
$parserfile = "C:\temp\million.csv"
$parserdelimiter = ","
$firstRowColumnNames = $false
$fieldsEnclosedInQuotes = $true

################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
[void][Reflection.Assembly]::LoadWithPartialName("Microsoft.VisualBasic")

# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000

# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.bulkcopyTimeout = 0
$bulkcopy.batchsize = $batchsize

# Open text parser for the column names
$columnparser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile)
$columnparser.TextFieldType = "Delimited"
$columnparser.HasFieldsEnclosedInQuotes = $fieldsEnclosedInQuotes
$columnparser.SetDelimiters($parserdelimiter)

Write-Warning "Creating datatable"
$datatable = New-Object System.Data.DataTable
foreach ($column in $columnparser.ReadFields()) { [void]$datatable.Columns.Add() }
$columnparser.Close(); $columnparser.Dispose()

# Open text parser again from the start (there's no reset)
$parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($parserfile)
$parser.TextFieldType = "Delimited"
$parser.HasFieldsEnclosedInQuotes = $fieldsEnclosedInQuotes
$parser.SetDelimiters($parserdelimiter)
if ($firstRowColumnNames -eq $true) { $null = $parser.ReadFields() }

Write-Warning "Parsing CSV"
while (!$parser.EndOfData) {
    try { $null = $datatable.Rows.Add($parser.ReadFields()) }
    catch { Write-Warning "Row $i could not be parsed. Skipped." }

    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}

# Add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}

Write-Host "Script complete. $i rows have been inserted into the database."

# Clean Up
$parser.Dispose(); $bulkcopy.Dispose(); $datatable.Dispose()

$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds, 2)
Write-Host "Total Elapsed Time: $totaltime seconds. $i rows added." -ForegroundColor Green
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
VB.NET Equivalent
For comparison, a compiled VB.NET exe imports about 4 million rows a minute.
Sub Main()
    Dim oTimer As New System.Diagnostics.Stopwatch
    oTimer.Start()
    Dim iRows As Int64 = 0

    Using oBulk As New Data.SqlClient.SqlBulkCopy("server=.;database=DATABASE;trusted_connection=yes", SqlClient.SqlBulkCopyOptions.TableLock) With {.DestinationTableName = "AllCountries", .BulkCopyTimeout = 0, .BatchSize = 100000}
        Using oSR As New IO.StreamReader("C:\allcountries.txt")
            Using oDT As New DataTable
                With oDT.Columns
                    .Add("GeoNameID", GetType(System.Int32))
                    .Add("Name", GetType(System.String))
                    .Add("AsciiName", GetType(System.String))
                    .Add("AlternateNames", GetType(System.String))
                    .Add("Latitude", GetType(System.Decimal))
                    .Add("Longitude", GetType(System.Decimal))
                    .Add("FeatureClass", GetType(System.String))
                    .Add("FeatureCode", GetType(System.String))
                    .Add("CountryCode", GetType(System.String))
                    .Add("Cc2", GetType(System.String))
                    .Add("Admin1Code", GetType(System.String))
                    .Add("Admin2Code", GetType(System.String))
                    .Add("Admin3Code", GetType(System.String))
                    .Add("Admin4Code", GetType(System.String))
                    .Add("Population", GetType(System.Int64))
                    .Add("Elevation", GetType(System.String))
                    .Add("Dem", GetType(System.Int32))
                    .Add("Timezone", GetType(System.String))
                    .Add("ModificationDate", GetType(System.DateTime))
                End With

                Dim iBatchsize As Integer = 0
                Do While Not oSR.EndOfStream
                    Dim sLine As String() = oSR.ReadLine.Split(vbTab)
                    oDT.Rows.Add(sLine)
                    iBatchsize += 1
                    If iBatchsize = 100000 Then
                        oBulk.WriteToServer(oDT)
                        oDT.Rows.Clear()
                        iBatchsize = 0
                        Console.WriteLine("Flushing 100,000 rows")
                    End If
                    iRows += 1
                Loop

                oBulk.WriteToServer(oDT)
                oDT.Rows.Clear()
            End Using
        End Using
    End Using

    oTimer.Stop()
    Console.WriteLine(iRows.ToString & " records imported in " & oTimer.Elapsed.TotalSeconds & " seconds.")
    Console.ReadLine()
End Sub
Conclusion
There are a number of tools that can be used to import CSV files into SQL Server. If you’re looking to use PowerShell, the StreamReader method will give you the fastest results, but if your data has embedded delimiters, it may not be the method for you. The OleDbConnection method is a powerful, fast way to import entire CSVs or subsets of CSV data, but it requires the use of an x86 shell, unless you download the Microsoft Access Database Engine 2010 Redistributable, which provides a 64-bit text driver. Microsoft.VisualBasic.FileIO.TextFieldParser is slower, but it works in 64-bit and handles embedded delimiters well.
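For reference, if you do install the Access Database Engine redistributable, the 64-bit route is just a different provider in the connection string; this is a sketch that mirrors the Jet connection string above, assuming a default install of the 2010 redistributable.

# 64-bit text driver via the ACE provider (requires the Access Database Engine redistributable).
$connstring = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$datasource;Extended Properties='text;HDR=$firstRowColumnNames;FMT=Delimited($csvdelimiter)';"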