Quickly Find Duplicates in Large CSV Files using PowerShell

Update 1/18/2015: This native PowerShell script can process over 165,000 rows a second. I'm so pumped. My million-row CSV with 9k dupes processed in 6.4 seconds. I actually tried this using LINQ in PowerShell, but it seemed exponentially slower as the dataset grew.

A user on reddit's PowerShell subreddit asked for the fastest way to search for duplicates in a CSV. This got me thinking that perhaps I could employ a technique similar to the one that I used in Import-CSVtoSQL.ps1. With this script, I'm able to import more than 1.7 million rows per minute from CSV to SQL.

I soon realized, however, that because the technique emptied the dataset, I wouldn't be able to find duplicates within the entire CSV. So I wondered if it was possible to search a CSV using a set-based method rather than RBAR (row by agonizing row).
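
For reference, here's a rough sketch of that batched bulk-load pattern. This is illustrative only; the batch size, destination table, column handling, and connection string are assumptions, not the actual Import-CSVtoSQL.ps1 code:

[void][Reflection.Assembly]::LoadWithPartialName("System.Data")

# Illustrative names: destination server/table and batch size are placeholders
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy("Data Source=localhost;Integrated Security=true;Initial Catalog=tempdb")
$bulkcopy.DestinationTableName = "csvimport"
$batchsize = 50000

$reader = New-Object System.IO.StreamReader("C:\temp\million-commas.txt")
$dt = New-Object System.Data.DataTable

# Build the DataTable columns from the first row (this sketch assumes a headerless file)
$first = $reader.ReadLine()
$first.Split(",") | ForEach-Object { [void]$dt.Columns.Add() }
[void]$dt.Rows.Add($first.Split(","))

while (($line = $reader.ReadLine()) -ne $null) {
    [void]$dt.Rows.Add($line.Split(","))
    if ($dt.Rows.Count -eq $batchsize) {
        $bulkcopy.WriteToServer($dt)
        # Emptying the datatable keeps memory flat, but it's also why a whole-file dupe check isn't possible here
        $dt.Clear()
    }
}
if ($dt.Rows.Count -gt 0) { $bulkcopy.WriteToServer($dt) }
$reader.Close()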

Finding duplicates in SQL Server using GROUP BY and HAVING is super fast because the query is set-based. Basically, set-based queries make SQL Server do work only once, whereas row-by-row based queries (such as CURSORS and UDFs) make SQL Server do work for every row. So how do I accomplish this set-based query natively, without SQL Server?
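
In T-SQL, that set-based dupe check is just a GROUP BY with a HAVING clause. Here's what it looks like against a hypothetical imported table (the table and column names are illustrative):

-- One pass over the table; no cursor required
SELECT Column2, Column3, COUNT(*) AS DupeCount
FROM dbo.ImportedCsv
GROUP BY Column2, Column3
HAVING COUNT(*) > 1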

I thought it may be possible to perform a SELECT on this CSV data using XML or a datatable. At the end of the day, I ended up playing with:

  • bulkcopy
  • StreamReader
  • StringCollection
  • Dictionary
  • XmlTextWriter
  • XDocument
  • HashSet
  • Datatable
  • DataView
  • $dt.DefaultView.ToTable
  • notcontains
  • Select-Unique
  • Get-Content -ReadCount 0

Each of these either wasn't fast enough or just didn't work.

I read PowerShell Deep Dives which actually gave me some additional ideas on how to stream text, so I returned to Import-CSVtoSQL.ps1 to see if I could increase the performance even more using some streaming (think $bulkCopy.WriteToServer($sqlcmd.ExecuteReader())).
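
Here's a minimal sketch of that streaming idea, assuming the CSV is read through the OleDb Text driver and the reader is handed straight to SqlBulkCopy (the connection strings, table names, and paths are illustrative):

# Read the CSV through the OleDb Text driver (illustrative data source and file name)
$oleconn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\temp;Extended Properties='text;HDR=No;FMT=Delimited(,)';")
$oleconn.Open()
$olecmd = $oleconn.CreateCommand()
$olecmd.CommandText = "SELECT * FROM [million-commas#txt]"

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy("Data Source=localhost;Integrated Security=true;Initial Catalog=tempdb")
$bulkcopy.DestinationTableName = "csvimport"

# WriteToServer accepts an IDataReader, so rows stream through without an intermediate DataTable
$bulkcopy.WriteToServer($olecmd.ExecuteReader())
$oleconn.Close()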

I ended up figuring out how to stream directly to $bulkcopy.WriteToServer(), but was hugely disappointed when it actually decreased performance (go StreamReader!). But then I realized that I had actually come up with a way to run the set-based query against the CSV itself, and the results are fantastic. Here's a screenshot of the results after parsing a one million row CSV file:

[Screenshot: dupe check results against the one million row CSV]

Ultimately, the 32-bit Text drivers, used through OdbcConnection and OleDbConnection, did the trick. 64-bit drivers are available, but you have to download them separately, and they weren't even faster. No thanks! We'll just use the 32-bit version of PowerShell.

Note that a couple of things will impact the speed of this script. The first is the number of dupes returned: because the duplicates are added to a datatable, the more dupes you have, the longer the fill will take. Executing just the dupecheck query completed in 5.56 seconds.
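
If you want to time just the query, one way (an assumption about how that number was measured) is to consume the reader without loading a DataTable, using the same $conn and $cmd from the script below:

$sw = [System.Diagnostics.Stopwatch]::StartNew()
$reader = $cmd.ExecuteReader()
while ($reader.Read()) { }   # walk the results without storing them
$reader.Close()
Write-Host "Query only: $([math]::Round($sw.Elapsed.TotalSeconds,2)) seconds"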

[Screenshot: the dupecheck query alone completing in 5.56 seconds]

Also, the type of data it's comparing seems to matter. Text fields were super fast (6 secs), whereas number fields were much slower (14 secs). This can possibly be addressed by typing the columns via schema.ini, but this is fast enough for me.
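
If you do want to experiment with typing, a schema.ini in the same directory as the CSV is how the Text driver picks up column definitions. Here's a hedged sketch that writes one (F1-F3 are the driver's default column names for a headerless file; the types are just an example):

# Writes a schema.ini next to the CSV; adjust the column names and types to match your file
@"
[million-commas.txt]
Format=CSVDelimited
ColNameHeader=False
Col1=F1 Text
Col2=F2 Long
Col3=F3 Text
"@ | Set-Content "C:\temp\schema.ini"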

The Script

# The Text OleDB driver is only available in PowerShell x86. Start the x86 shell if using x64.
# This has to be the first check this script performs.
if ($env:Processor_Architecture -ne "x86") {
    Write-Warning "Switching to x86 shell"
    &"$env:windir\syswow64\windowspowershell\v1.0\powershell.exe" "$PSCommandPath $args"; return
}

# Change to your CSV file name, must end in .csv or .txt
$csvfile = "C:\temp\million-commas.txt"

# Does the first row contain column names?
$firstRowColumns = $false

# What's the delimiter? Use `t for tabbed.
$csvdelimiter = ","

# By default, OleDbConnection columns are named F1, F2, F3, etc. unless $firstRowColumns = $true
# Alternatively, you could make it check all columns. I'll add that to the script later and post it.
$checkColumns = "F2, F3"

################### No need to modify anything below ###################
$datasource = Split-Path $csvfile
$tablename = (Split-Path $csvfile -Leaf).Replace(".","#")

switch ($firstRowColumns) {
    $true { $firstRowColumns = "Yes" }
    $false { $firstRowColumns = "No" }
}

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")

# Set up OleDb using the Microsoft Text Driver.
$connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties='text;HDR=$firstRowColumns;FMT=Delimited($csvdelimiter)';"

$conn = New-Object System.Data.OleDb.OleDbConnection
$conn.ConnectionString = $connstring
$conn.Open()
$cmd = New-Object System.Data.OleDb.OleDbCommand
$cmd.Connection = $conn

# Perform the select on the CSV file, then add the results to a datatable using ExecuteReader
$sql = "SELECT $checkColumns, COUNT(*) as DupeCount FROM [$tablename] GROUP BY $checkColumns HAVING COUNT(*) > 1"
$cmd.CommandText = $sql
$dt = New-Object System.Data.DataTable
$dt.BeginLoadData()
$dt.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))
$dt.EndLoadData()
$totaltime = [math]::Round($elapsed.Elapsed.TotalSeconds,2)

# Get the total row count
$conn.Open()
$cmd.CommandText = "SELECT COUNT(*) as TotalRows FROM [$tablename]"
$totalrows = $cmd.ExecuteScalar()
$conn.Close()

# Output some stats
$dupecount = $dt.Rows.Count
Write-Host "Total Elapsed Time: $totaltime seconds. $dupecount duplicates found out of $totalrows total rows. You can access these dupes using `$dt." -ForegroundColor Green

Note that if you get a "Cannot update. Database or object is read-only" error, your delimiter is probably wrong, or the file does not exist. Make sure you use the full path and that your file extension is .txt or .csv.

I used a million row, comma delimited subset of allcountries.zip. If you want to use OdbcConnection instead, you'll have to modify your $connstring to "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=$datasource;"

Note that OdbcConnection's connection string does not support Extended Properties, so your first row must be your headers unless you use schema.ini. Also, I saw no performance gains or losses using OdbcConnection.
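
Here's a minimal sketch of the Odbc variant. It assumes the only changes from the script above are the connection string and swapping the OleDb objects for their Odbc counterparts (that object swap is my assumption; the query and $tablename are reused as-is):

$connstring = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=$datasource;"
$conn = New-Object System.Data.Odbc.OdbcConnection($connstring)
$conn.Open()
$cmd = $conn.CreateCommand()
# Same dupe check query as the OleDb version
$cmd.CommandText = "SELECT $checkColumns, COUNT(*) as DupeCount FROM [$tablename] GROUP BY $checkColumns HAVING COUNT(*) > 1"
$dt = New-Object System.Data.DataTable
$dt.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))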

What's cool about this is that it's also an efficient, fast way to get subsets of large CSV files. Just change the $sql statement to get whatever results you're looking for.
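
For example, reusing the same connection setup, something like this (the WHERE clause and column names are purely illustrative):

# Pull a subset instead of the dupe list; F1-F3 are the driver's default column names
$conn.Open()
$cmd.CommandText = "SELECT F1, F2, F3 FROM [$tablename] WHERE F3 LIKE 'A%'"
$subset = New-Object System.Data.DataTable
$subset.Load($cmd.ExecuteReader([System.Data.CommandBehavior]::CloseConnection))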

I plan to detail this technique, formalize the script, and add more automation and error handling, then make it available on ScriptCenter shortly.