Find Duplicates in PowerShell Datatable using LINQ
Update 1/4: Removed conversion of datatable to list. That just wasted memory, as this works on all enumerables.
I don't have a practical application for this method, but I wanted to figure out how to do it anyway. Doing so was frustrating but kinda fun. It forced me to learn about lamdas in PowerShell which I'll probably use at some point in the future. Nivot Ink had some straightforward syntax that helped me, along with Matt Graeber's post on PowerShellMagazine.
Esentially, I was looking to recreate the following SQL statement in LINQ: SELECT ColumnName, COUNT(*) as DupeCount FROM TableName GROUP BY ColumnName HAVING COUNT(*) > 1. The PowerShell LINQ Equivalent is this basically:
# Group By $groupbyquery = [System.Func[System.Data.DataRow, string]] { param($row) $row.Artist, $row.Album } $groupby = [System.Linq.Enumerable]::GroupBy($list,$groupbyquery)
Where, can probably use Any, too.
$wherequery = [System.Func[System.Object, bool]] { param($row) $row.Artist.count -gt 1 -and $row.Album -gt 1 } $where = [System.Linq.Enumerable]::Where($groupby,$wherequery)
Select distinct, and add a count
$selectquery = [system.func[System.Object,System.Object]] { param($row) $null = $row.Item(0)["DupeCount"] = $row.count $row.Item(0) } $select = [System.Linq.Enumerable]::Select($where,$selectquery)
Sample Project
In this sample project, we'll create some fake data, then query it using LINQ. I tested this on larger data sets and found it got exponentially slower. The OleDBConnection method I wrote about earlier still smokes this technique. Then again, when it comes to .NET, I have little idea of what I'm doing, so maybe it can be optimized. I'm trying to figure out a way to query this list with CreateQuery, too.
Create Datatable with Dupes
# Create Datatable $dt = New-Object System.Data.Datatable "Music" [void]$dt.Columns.Add("Artist") [void]$dt.Columns.Add("Genre") [void]$dt.Columns.Add("Album") [void]$dt.Columns.Add("ReleaseYear") [void]$dt.Columns.Add("AddedToCatalog")
Add data
[void]$dt.Rows.Add("Poison","Glam Metal","Open Up and Say... Ahh!",1988,1/1/2015) [void]$dt.Rows.Add("Cinderella","Night Songs","Flesh & Blood",1986,1/1/2015) [void]$dt.Rows.Add("Eazy-E","Gangsta Rap","Eazy-Duz-It",1988,1/1/2015) [void]$dt.Rows.Add("Tim Smooth","Southern Rap","I Gotsta' Have It",1/1/2015)
<# Add duplicates, but note the changed dates on AddedToCatalog. Oops, Poison's album was added again on 1/20. Eazy-E, too. Dupes were added on 1/20, and 1/21. #> [void]$dt.Rows.Add("Poison","Glam Metal","Open Up and Say... Ahh!",1988,1/20/2015) [void]$dt.Rows.Add("Eazy-E","Gangsta Rap","Eazy-Duz-It",1988,1/20/2015) [void]$dt.Rows.Add("Eazy-E","Gangsta Rap","Eazy-Duz-It",1988,1/21/2015)
Prep the dupe check
First we have to decide what columns to examine. In this example, duplicate artist+album constitutes a duplicate.
# Perform the dupe check [void][Reflection.Assembly]::LoadWithPartialName("System.Data.DataSetExtensions") [void][Reflection.Assembly]::LoadWithPartialName("System.Data.Linq")
Add a count column prior to making the dataset a LINQ list (~17 ms)
[void]$dt.Columns.Add("DupeCount")
Cast as enumerable (7ms) then convert to list (192ms)
Don't do this, it's unnecessary: $list = [System.Linq.Enumerable]::ToList($source)
$list = [System.Data.DataTableExtensions]::AsEnumerable($dt)
Group By (8ms to bind)
$groupbyquery = [System.Func[System.Data.DataRow, string]] { param($row) $row.Artist, $row.Album } $groupby = [System.Linq.Enumerable]::GroupBy($list,$groupbyquery)
Where, can probably use Any, too. (6ms to bind)
$wherequery = [System.Func[System.Object, bool]] { param($row) $row.Artist.count -gt 1 -and $row.Album -gt 1 } $where = [System.Linq.Enumerable]::Where($groupby,$wherequery)
Select distinct, and add a count (6ms to bind)
$selectquery = [system.func[System.Object,System.Object]] { param($row) $null = $row.Item(0)["DupeCount"] = $row.count $row.Item(0) } $select = [System.Linq.Enumerable]::Select($where,$selectquery)
Populate new datatable with results and clean up
The above code was prep work, now we're going to actually execute the query, and populate a datatable with the results. Unfortunately, there's no good way to convert a Linq list to a Datatable so we'll just use .Clone() to clone build the new table schema, and then perform an ImportRow.
# Build & populate new dupetable (8.22 seconds for 21981 rows with 68 dupes) $dupetable = $dt.Clone() foreach ($row in $select) { $dupetable.ImportRow($row) }
Remove count column from original datatable
[void]$dt.Columns.Remove("DupeCount")
Show your results
$dupetable | Select Artist, Album, Dupecount
And voila, the results
Want to see the code all in one shot? Check out the gist.