Quickly Find Duplicates in Large CSV Files using PowerShell

Update 1/18/2015: This native PowerShell script can process over 165,000 rows a second. I’m so pumped. My million-row CSV with 9k dupes processed in 6.4 seconds. I actually tried this using LINQ in PowerShell, but it seemed exponentially slower as the dataset grew.

A user on reddit’s PowerShell subreddit asked for the fastest way to search for duplicates in a CSV. This got me thinking that perhaps I could employ a technique similar to the one that I used in Import-CSVtoSQL.ps1. With this script, I’m able to import more than 1.7 million rows per minute from CSV to SQL.

I soon realized, however, that because the technique emptied the dataset, I wouldn’t be able to find duplicates within the entire CSV. So I wondered if it was possible to search a CSV using a set-based method rather than RBAR (row by agonizing row).

Finding duplicates in SQL Server using GROUP BY and HAVING is super fast because the query is set-based. Basically, set-based queries make SQL Server do work only once, whereas row-by-row based queries (such as CURSORS and UDFs) make SQL Server do work for every row. So how do I accomplish this set-based query natively, without SQL Server?
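
For reference, the set-based dupe check I’d run in SQL Server looks like this (the table and column names are just illustrative):

    SELECT geonameid, COUNT(*) AS dupecount
    FROM places
    GROUP BY geonameid
    HAVING COUNT(*) > 1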

I thought it might be possible to perform a SELECT on this CSV data using XML or a datatable. At the end of the day, I ended up playing with bulkcopy, StreamReader, StringCollection, Dictionary, XmlTextWriter, XDocument, HashSet, DataTable, DataView, $dt.DefaultView.ToTable, notcontains, Select-Unique, Get-Content -ReadCount 0, and other methods. Each of these either wasn’t fast enough or just didn’t work.

I read PowerShell Deep Dives which actually gave me some additional ideas on how to stream text, so I returned to Import-CSVtoSQL.ps1 to see if I could increase the performance even more using some streaming (think $bulkCopy.WriteToServer($sqlcmd.ExecuteReader())).

I ended up figuring out how to stream directly to $bulkcopy.WriteToServer(), but was hugely disappointed when it actually decreased performance (go StreamReader!). But then I realized that I had actually come up with a way to process the CSV itself using SQL, and the results are fantastic. Here’s a screenshot of the results after parsing a one million row CSV file:


Ultimately, the 32-bit versions of OdbcConnection and OleDbConnection Text drivers did the trick. 64-bit drivers are available but you have to download them separately, and they weren’t even faster. No thanks! We’ll just use the 32-bit version of PowerShell.
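
On 64-bit Windows, the 32-bit console lives under SysWOW64, so launching the script in 32-bit PowerShell looks like this (the script name is just a placeholder):

    & "$env:windir\SysWOW64\WindowsPowerShell\v1.0\powershell.exe" -File .\Find-CsvDupes.ps1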

Note that a couple of things will impact the speed of this script. The first is the number of dupes returned: because the duplicates are added to a datatable, the more dupes you have, the longer it will take to fill. Executing just the dupecheck query completed in 5.56 seconds.


Also, the type of data it’s comparing seems to matter. Text fields were super fast (6 seconds), whereas number fields were much slower (14 seconds). This can possibly be addressed by typing the columns via schema.ini, but this is fast enough for me.
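
If you want to try typing the columns, the Text driver reads a schema.ini file placed in the same folder as the CSV. A minimal sketch, with made-up column names:

    [million.csv]
    Format=CSVDelimited
    ColNameHeader=True
    Col1=geonameid Long
    Col2=name Text
    Col3=latitude Double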

The Script
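
Here’s a minimal sketch of the technique rather than the finished script: it assumes the 32-bit Jet 4.0 Text provider, a comma-delimited file with a header row, and a hypothetical geonameid column. Swap in your own path, file name, and column.

    # Query the CSV in place using the 32-bit Jet Text provider.
    $datasource = "C:\temp"      # full path to the FOLDER containing the CSV
    $file       = "million.csv"  # hypothetical file name; must end in .txt or .csv

    $connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$datasource;Extended Properties='text;HDR=Yes;FMT=Delimited';"

    # The set-based dupe check: the file name acts as the table name.
    $sql = "SELECT geonameid, COUNT(*) AS dupecount FROM [$file] GROUP BY geonameid HAVING COUNT(*) > 1"

    $conn = New-Object System.Data.OleDb.OleDbConnection $connstring
    $cmd  = New-Object System.Data.OleDb.OleDbCommand($sql, $conn)

    # Fill the dupes into a datatable; the adapter opens and closes the connection itself.
    $dt = New-Object System.Data.DataTable
    $adapter = New-Object System.Data.OleDb.OleDbDataAdapter $cmd
    [void]$adapter.Fill($dt)

    $dt | Format-Table -AutoSize
    "Total duplicated rows: $($dt.Rows.Count)"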

Note that if you get a “Cannot update. Database or object is read-only” error, your delimiter is probably wrong, or the file does not exist. Make sure you use the full path and that your file extension is .txt or .csv.

I used a million row, comma-delimited subset of allcountries.zip. If you want to use OdbcConnection instead, you’ll have to modify your $connstring to "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=$datasource;". Note that OdbcConnection’s connection string does not support Extended Properties, so your first row must be your headers unless you use schema.ini. Also, I saw no performance gains or losses using OdbcConnection.
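
In practice, only the connection objects change; a sketch using the same hypothetical variables as above:

    $connstring = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=$datasource;"
    $conn = New-Object System.Data.Odbc.OdbcConnection $connstring
    $cmd  = New-Object System.Data.Odbc.OdbcCommand($sql, $conn)
    # ...then fill a datatable with System.Data.Odbc.OdbcDataAdapter as before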

What’s cool about this is that it’s also an efficient, fast way to get subsets of large CSV files. Just change the $sql statement to get whatever results you’re looking for.
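
For instance, a WHERE clause pulls a subset instead of dupes (countrycode is a made-up column name here):

    $sql = "SELECT * FROM [$file] WHERE countrycode = 'BE'"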

I plan to detail this technique, formalize the script, and add more automation and error handling, then make it available on ScriptCenter shortly.

Chrissy is a PowerShell MVP who has worked in IT for nearly 20 years, and currently serves as a Sr. Database Engineer in Belgium. Always an avid scripter, she attended the Monad session at Microsoft’s Professional Developers Conference in Los Angeles back in 2005 and has worked and played with PowerShell ever since. Chrissy is currently pursuing an MS in Systems Engineering at Regis University and helps maintain RealCajunRecipes.com in her spare time. She holds a number of certifications, including those relating to SQL Server, SuSE Linux, SharePoint and network security. She recently became co-lead of the SQL PASS PowerShell Virtual Chapter. You can follow her on Twitter at @cl.

Posted in PowerShell
3 comments on “Quickly Find Duplicates in Large CSV Files using PowerShell”
  1. Chuck Staples says:

    Just ran across this after viewing some of your other content. Talk about your real-world solutions! Can’t wait to try this one.

    (Kudos for making the script VERY readable at a first glance.)

  2. James Brown says:

    G’day Chrissy,
Bit of a PowerShell beginner here, but trying to work out how to actually get it to write the de-duped lines to a new file. This just seems to output the number of lines that are duplicated, but not do anything with them. Is there a way to get this to export/write out the de-duped lines?

    • Alex Crichton says:

I am running this on a file with hundreds of millions of rows, so it’s still running 25 minutes later. However, Chrissy LeMaire says in her description that the dupe rows are in an object called $dt. Have you tried $dt | Out-File 'pathwhereyouwantthefile.fileextension'?
