Find Duplicates in PowerShell Datatable using LINQ

I don’t have a practical application for this method, but I wanted to figure out how to do it anyway. Doing so was frustrating but kinda fun. It forced me to learn about lambdas in PowerShell, which I’ll probably use at some point in the future. Nivot Ink had some straightforward syntax that helped me, along with Matt Graeber’s post on PowerShellMagazine.

Essentially, I was looking to recreate the following SQL statement in LINQ: SELECT ColumnName, COUNT(*) as DupeCount FROM TableName GROUP BY ColumnName HAVING COUNT(*) > 1. The PowerShell LINQ equivalent is built up step by step in the sample project below.

Sample Project

In this sample project, we’ll create some fake data, then query it using LINQ. I tested this on larger data sets and found it got exponentially slower. The OleDBConnection method I wrote about earlier still smokes this technique. Then again, when it comes to .NET, I have little idea of what I’m doing, so maybe it can be optimized. I’m trying to figure out a way to query this list with CreateQuery, too.

Create Datatable with Dupes
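To keep the sample self-contained, here's a minimal sketch of the fake data; the column names and values are placeholders, not the original dataset:

    # Build a small DataTable with a deliberately duplicated artist+album pair
    $dt = New-Object System.Data.DataTable
    [void]$dt.Columns.Add("Artist", [string])
    [void]$dt.Columns.Add("Album", [string])

    [void]$dt.Rows.Add("Nirvana", "Nevermind")
    [void]$dt.Rows.Add("Nirvana", "In Utero")
    [void]$dt.Rows.Add("Nirvana", "Nevermind")   # the duplicate
    [void]$dt.Rows.Add("Pixies", "Doolittle")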

Prep the dupe check

First we have to decide which columns to examine. In this example, a matching artist+album combination constitutes a duplicate.
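Here's a rough sketch of that prep, assuming the $dt table above. The typed [System.Func[...]] cast is the "lambda in PowerShell" trick from the posts linked above; generic method inference in Windows PowerShell can be finicky, so treat this as a starting point rather than gospel:

    # Cast the row collection to a typed array so the LINQ extension methods can consume it
    $rows = [System.Data.DataRow[]]$dt.Rows

    # Key selector lambda: the grouping key is artist + album
    [System.Func[System.Data.DataRow, string]] $keySelector = { param($row) "$($row['Artist'])|$($row['Album'])" }

    # GROUP BY artist, album ... HAVING COUNT(*) > 1
    $groups = [System.Linq.Enumerable]::GroupBy($rows, $keySelector)
    $dupes  = $groups | Where-Object { [System.Linq.Enumerable]::Count($_) -gt 1 }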

Populate new datatable with results and clean up

The code above was prep work; now we're going to actually execute the query and populate a datatable with the results. Unfortunately, there's no good way to convert a LINQ result list to a datatable, so we'll just use .Clone() to build the new table's schema and then perform an ImportRow for each duplicate row.
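A sketch of that last step, continuing with $dt and $dupes from above:

    # Clone() copies the schema only, giving an empty table with identical columns
    $dupetable = $dt.Clone()

    # Each group is a collection of DataRows sharing the same artist+album key
    foreach ($group in $dupes) {
        foreach ($row in $group) {
            $dupetable.ImportRow($row)
        }
    }

    $dupetable | Format-Table -AutoSize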

And voila, the results:

[Screenshot: duplicate detection results]

Want to see the code all in one shot? Check out the gist.

Posted in PowerShell

High-Performance Techniques for Importing CSV to SQL Server

If you’ve ever tried to use PowerShell’s Import-CSV with large files, you know that it can exhaust all of your RAM. Previously, I created a script on ScriptCenter that used an alternative technique to import large CSV files, and even imported them into SQL Server at a rate of 1.7 million rows a minute. I’ve now gotten that number up to a massive 3.2 million rows a minute for non-indexed tables and 2.6 million rows a minute for tables with clustered indexes.

Note: I’ve added the VB.NET code equivalent at the end of this post, courtesy of a fellow redditor. The VB.NET executable imported about 4 million rows a minute.

Importing into SQL Server

Two of the fastest ways to import data are bcp and SqlBulkCopy. BCP is finicky and requires a separate download. My intention is to create a script that can be used on any machine running PowerShell v3 and above, so these scripts use SqlBulkCopy. The catch is that SqlBulkCopy.WriteToServer() requires a DataTable, DataRow[], or IDataReader as input. This means your plain-text CSV data must be transformed into one of these objects prior to being passed to WriteToServer().

I’ve tested three different methods for transforming CSV data into a consumable object. The first uses SqlBulkCopy, StreamReader, and DataTable batches. The second streams an OleDbConnection query result set to SqlBulkCopy. The third uses SqlBulkCopy, VisualBasic.FileIO.TextFieldParser, and DataTable batches. I actually tested logparser, too, but while logparser is amazing (and way, way faster) at parsing files, it doesn’t take advantage of SqlBulkCopy, so these approaches actually outperform logparser (and my tests with SSIS, for that matter).

[Screenshot: logparser vs. OleDb parsing performance]

Setup

My lab is a VMware virtualized environment consisting of a Windows 8.1 machine with 8GB RAM and a SQL Server 2014 instance with 4GB RAM (poor thing). Both are on the same ESX host, which sports a directly connected Western Digital VelociRaptor 1TB. The database’s recovery model is set to SIMPLE.

The 1.1 GB CSV in this example contains 9.8 million+ rows of tab-delimited longitude and latitude data. A zip containing the CSV is available for download at geonames.org.

First, I had to create the SQL table.
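For reference, the table creation looks something like this; the instance, database, and trimmed-down column list are placeholders rather than the real geonames schema (the actual file has many more fields):

    # Placeholder connection string and schema -- adjust for your environment
    $connstring = "Data Source=localhost;Integrated Security=true;Initial Catalog=mydb"
    $sql = "CREATE TABLE allcountries (GeoNameId int, Name nvarchar(200), Latitude float, Longitude float)"

    $conn = New-Object System.Data.SqlClient.SqlConnection($connstring)
    $conn.Open()
    $cmd = New-Object System.Data.SqlClient.SqlCommand($sql, $conn)
    [void]$cmd.ExecuteNonQuery()
    $conn.Close()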

Then I tested the various methods. You’ll probably notice that these scripts don’t have much error handling. I wanted to keep the code easy to read and straightforward. Finalized scripts with error handling and all that will be placed on ScriptCenter shortly.

StreamReader to DataTable batches

This one is by far the fastest, at up to 3.2 million rows per minute on a non-indexed table.

[Screenshot: import speed against a non-indexed table]

Realistically, you’ll probably be importing to an indexed table, which performs up to 2.6 million rows per minute.

[Screenshot: import speed against a table with a clustered index]

Basically, the script performs the following steps:

  1. Creates the SQL Bulk Copy connection
  2. Creates the PowerShell datatable, along with its columns
  3. Reads the CSV file using System.IO.StreamReader
  4. Using ReadLine(), loads the CSV data row by row into the datatable
  5. Performs the bulk import every x number of rows
  6. Empties the datatable
  7. Rinse and repeat from step 4 until the end of the CSV file

In order to avoid having to specify column names each time you import a new CSV, the SQL table column order and CSV column order must be the same. If you need to rearrange your columns, I suggest creating a view and importing to the view.

If you’re wondering why batching was used, it’s because the datatable seems to get exponentially slower as it grows above a few hundred thousand rows. I actually wasn’t able to import all 9 million rows into the datatable before my memory was exhausted. I tested various batch sizes, and 50k worked best for me.

The downside to using this script is that it doesn’t handle embedded delimiters well. So if you have a comma delimiter, and your data contains “This Works”,”This, doesn’t”, then it will fail on “This, doesn’t.” You can address this with a regex (as seen below, after the code).

The Script
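Here's a trimmed-down sketch of the approach; the connection string, file path, destination table, and 50k batch size are placeholders to adapt:

    # Placeholders -- adjust for your environment
    $csvfile      = "C:\temp\allcountries.txt"
    $csvdelimiter = "`t"
    $connstring   = "Data Source=localhost;Integrated Security=true;Initial Catalog=mydb"
    $tablename    = "allcountries"
    $batchsize    = 50000
    $firstrowcolumnnames = $false

    # 1. Create the SQL Bulk Copy connection (TableLock helps keep the insert minimally logged)
    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
    $bulkcopy.DestinationTableName = $tablename
    $bulkcopy.BatchSize = $batchsize

    # 2. Create the datatable and add one column per field in the first line of the CSV
    $datatable = New-Object System.Data.DataTable
    $reader    = New-Object System.IO.StreamReader($csvfile)
    $columns   = (Get-Content $csvfile -First 1).Split($csvdelimiter)
    foreach ($column in $columns) { [void]$datatable.Columns.Add() }
    if ($firstrowcolumnnames -eq $true) { [void]$reader.ReadLine() }   # skip the header row

    # 3-7. Read the file line by line, bulk importing and emptying the datatable every $batchsize rows
    $i = 0
    while (($line = $reader.ReadLine()) -ne $null) {
        [void]$datatable.Rows.Add($line.Split($csvdelimiter))
        $i++
        if (($i % $batchsize) -eq 0) {
            $bulkcopy.WriteToServer($datatable)
            $datatable.Clear()
        }
    }

    # Import any leftover rows, then clean up
    if ($datatable.Rows.Count -gt 0) {
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
    }
    $reader.Close(); $reader.Dispose()
    $bulkcopy.Close(); $bulkcopy.Dispose()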

As I mentioned earlier, you can use regex if you have delimiters embedded in quotes. This slows down the script by a fair degree, but here’s some code to get you started.
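A pattern like the following splits on commas only when they fall outside double quotes; note the split fields keep their surrounding quotes, so you may still want to trim them:

    # Split on commas that are not inside double quotes (assumes balanced quotes)
    $splitregex = ',(?=(?:[^"]*"[^"]*")*[^"]*$)'
    $row = '"This Works","This, doesn''t"'
    [regex]::Split($row, $splitregex)   # returns two fields instead of three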

Streaming a 32-bit OleDbconnection

What I like about this method is that it’s still high performance (~2 million rows/minute) and you can select subsets of data from the CSV if you wish. Just change the CommandText to include a WHERE clause, for example: $cmd.CommandText = "SELECT count(*) FROM [$tablename] WHERE Category='Cajun'". If your CSV doesn’t have column names, you can use the built-in column names F1, F2, F3, and so on.

Here are the results from a million row subset:
[Screenshot: OleDb query results]

The downside to this script is that you have to use an x86 shell, but because the data is streamed rather than loaded into RAM, performance isn’t impacted. Switching to another shell can get a little complex if you embed this script in a function, so I’ve mitigated that to the best of my ability by having the script call the 32-bit shell with the current script name and arguments.
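The relaunch logic looks roughly like this; it assumes the script has been saved to disk so $MyInvocation has a path to hand off:

    # If we're in a 64-bit shell, re-run this script in the 32-bit shell and exit
    if ($env:Processor_Architecture -ne "x86") {
        $x86shell = "$env:windir\SysWOW64\WindowsPowerShell\v1.0\powershell.exe"
        & $x86shell -NoProfile -File $MyInvocation.MyCommand.Path @args
        return
    }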

The Script
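A condensed sketch of the streaming approach, with placeholder paths and connection strings. The text driver treats the CSV's folder as the database and the file as the table; tab-delimited files generally need a schema.ini entry (Format=TabDelimited), so this sketch assumes a comma-delimited file:

    # Placeholders -- adjust for your environment
    $csvfile       = "C:\temp\million.csv"
    $csvdir        = Split-Path $csvfile
    $csvtable      = Split-Path $csvfile -Leaf
    $sqlconnstring = "Data Source=localhost;Integrated Security=true;Initial Catalog=mydb"
    $sqltable      = "allcountries"

    # 32-bit Jet text driver; HDR=Yes means the first row holds the column names
    $oleconnstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$csvdir;Extended Properties='text;HDR=Yes;FMT=Delimited';"
    $oleconn = New-Object System.Data.OleDb.OleDbConnection($oleconnstring)
    $oleconn.Open()
    $cmd = New-Object System.Data.OleDb.OleDbCommand("SELECT * FROM [$csvtable]", $oleconn)

    # Stream the OleDb reader straight into SqlBulkCopy -- no intermediate datatable
    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($sqlconnstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
    $bulkcopy.DestinationTableName = $sqltable
    $bulkcopy.BatchSize = 50000
    $bulkcopy.WriteToServer($cmd.ExecuteReader())

    $oleconn.Close()
    $bulkcopy.Close()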

VisualBasic.FileIO.TextFieldParser to DataTable batches

This technique is the slowest, but if you’ve got embedded delimiters and need to stay in x64, you may want to consider it.

[Screenshot: TextFieldParser import results]

The Script
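A sketch of the TextFieldParser variation; it copes with quoted, embedded delimiters natively and runs fine in a 64-bit shell. Paths, delimiter, and table names are placeholders:

    # Placeholders -- adjust for your environment
    $csvfile    = "C:\temp\allcountries.txt"
    $connstring = "Data Source=localhost;Integrated Security=true;Initial Catalog=mydb"
    $batchsize  = 50000

    Add-Type -AssemblyName Microsoft.VisualBasic
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser($csvfile)
    $parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
    $parser.SetDelimiters("`t")
    $parser.HasFieldsEnclosedInQuotes = $true    # this is what handles embedded delimiters

    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
    $bulkcopy.DestinationTableName = "allcountries"
    $bulkcopy.BatchSize = $batchsize

    # Build the datatable columns from the first row's field count (assumes no header row)
    $datatable = New-Object System.Data.DataTable
    $firstrow  = $parser.ReadFields()
    foreach ($field in $firstrow) { [void]$datatable.Columns.Add() }
    [void]$datatable.Rows.Add($firstrow)

    $i = 1
    while (-not $parser.EndOfData) {
        [void]$datatable.Rows.Add($parser.ReadFields())
        $i++
        if (($i % $batchsize) -eq 0) {
            $bulkcopy.WriteToServer($datatable)
            $datatable.Clear()
        }
    }
    if ($datatable.Rows.Count -gt 0) { $bulkcopy.WriteToServer($datatable); $datatable.Clear() }

    $parser.Close(); $bulkcopy.Close()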

VB.NET Equivalent

A compiled VB.NET executable imports about 4 million rows a minute.

Conclusion

There are a number of tools that can be used to import CSV files into SQL Server. If you’re looking to use PowerShell, StreamReader will give you the fastest results, but if your data has embedded delimiters, it may not be the method for you. Using OleDbConnection is a powerful, fast way to import entire CSVs or subsets of CSV data, but it requires the use of an x86 shell unless you download the Microsoft Access Database Engine 2010 Redistributable, which provides a 64-bit text driver. Microsoft.VisualBasic.FileIO.TextFieldParser is slower, but works in 64-bit and handles embedded delimiters well.

Posted in PowerShell, SQL Server

Quickly Find Duplicates in Large CSV Files using PowerShell

Update 1/18/2015: This native PowerShell script can process over 165,000 rows a second. I’m so pumped. My million row CSV with 9k dupes processed in 6.4 seconds. I actually tried this using LINQ in PowerShell, but it seemed exponentially slower as the dataset grew.

A user on reddit’s PowerShell subreddit asked for the fastest way to search for duplicates in a CSV. This got me thinking that perhaps I could employ a technique similar to the one that I used in Import-CSVtoSQL.ps1. With this script, I’m able to import more than 1.7 million rows per minute from CSV to SQL.

I soon realized, however, that because the technique emptied the dataset, I wouldn’t be able to find duplicates within the entire CSV. So I wondered if it was possible to search a CSV using a set-based method rather than RBAR (row by agonizing row).

Finding duplicates in SQL Server using GROUP BY and HAVING is super fast because the query is set-based. Basically, set-based queries make SQL Server do work only once, whereas row-by-row based queries (such as CURSORS and UDFs) make SQL Server do work for every row. So how do I accomplish this set-based query natively, without SQL Server?

I thought it may be possible to perform a SELECT on this CSV data using XML or a datatable. At the end of the day, I ended up playing with bulkcopy, StreamReader, StringCollection, Dictionary, XmlTextWriter, XDocument, HashSet, Datatable, DataView, $dt.DefaultView.ToTable, notcontains, Select-Unique, Get-Content -ReadCount 0, and other methods. Each of these either weren’t fast enough or just didn’t work.

I read PowerShell Deep Dives, which gave me some additional ideas on how to stream text, so I returned to Import-CSVtoSQL.ps1 to see if I could increase performance even more using streaming (think $bulkCopy.WriteToServer($sqlcmd.ExecuteReader())).

I ended up figuring out how to stream directly to $bulkcopy.WriteToServer(), but was hugely disappointed when it actually decreased performance (go StreamReader!). But then I realized that I had actually come up with a way to process the CSV with a set-based query, and the results are fantastic. Here’s a screenshot of the results after parsing a one million row CSV file:

[Screenshot: dupe check results for a one million row CSV]

Ultimately, the 32-bit versions of OdbcConnection and OleDbConnection Text drivers did the trick. 64-bit drivers are available but you have to download them separately, and they weren’t even faster. No thanks! We’ll just use the 32-bit version of PowerShell.

Note that a couple of things will impact the speed of this script. The first is the number of dupes returned: because the duplicates are added to a datatable, the more dupes you have, the longer it will take to fill. Executing just the dupe check query completed in 5.56 seconds.

[Screenshot: dupe check query timing]

Also, the type of data being compared seems to matter. Text fields were super fast (6 seconds), whereas number fields were much slower (14 seconds). This can possibly be addressed by typing the columns via schema.ini, but this is fast enough for me.

The Script
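A condensed sketch of the idea, with a hypothetical file path and duplicate-defining column; run it from a 32-bit shell as described above:

    # Placeholders -- point these at your own file and the column(s) that define a duplicate
    $csvfile  = "C:\temp\million.csv"
    $csvdir   = Split-Path $csvfile
    $csvtable = Split-Path $csvfile -Leaf
    $column   = "GeoNameId"

    # 32-bit Jet text driver; HDR=Yes treats the first row as column names
    $connstring = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=$csvdir;Extended Properties='text;HDR=Yes;FMT=Delimited';"
    $conn = New-Object System.Data.OleDb.OleDbConnection($connstring)
    $conn.Open()

    # The same set-based GROUP BY / HAVING query SQL Server would use, run directly against the CSV
    $sql = "SELECT [$column], COUNT(*) AS DupeCount FROM [$csvtable] GROUP BY [$column] HAVING COUNT(*) > 1"
    $cmd = New-Object System.Data.OleDb.OleDbCommand($sql, $conn)

    # Load the duplicates into a datatable for display or further processing
    $dupes = New-Object System.Data.DataTable
    $dupes.Load($cmd.ExecuteReader())
    $conn.Close()

    $dupes | Format-Table -AutoSize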

Note that if you get a “Cannot update. Database or object is read-only” error, your delimiter is probably wrong, or the file does not exist. Make sure you use the full path and that your file extension is .txt or .csv.

I used a million row, comma delimited subset of allcountries.zip. If you want to use OdbcConnection instead, you’ll have to modify your $connstring to "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=$datasource;". Note that OdbcConnection’s connection string does not support Extended Properties, so your first row must be your headers unless you use schema.ini. Also, I saw no performance gains or losses using OdbcConnection.

What’s cool about this is that it’s also an efficient, fast way to get subsets of large CSV files. Just change the $sql statement to get whatever results you’re looking for.

I plan to detail this technique, formalize the script, and add more automation and error handling, then make it available on ScriptCenter shortly.

Posted in PowerShell

Professional PowerShell-based SQL Server SMO Recipes

Last year, I finally started working with SMO (SQL Server Management Objects) within PowerShell and I absolutely love it. Below are a few recipes that I created and use extensively throughout my SQL management scripts, which you can find on ScriptCenter. Some recipes show more than one way to perform a task.

These SMO recipes are particularly useful because of the extensive automation being performed. For instance, in the restore recipe, file structures are automatically determined using information contained within the given .bak file. These scripts were created and tested on Windows 8.1 with PowerShell 4 and SQL Server 2014. PowerShell v3 and above is required for many of the recipes.

SQL Administration

Programmatically change SQL Server startup parameters

Usually when we deal with SMO, we’re using the Microsoft.SqlServer.Smo namespace. This snippet, however, uses Microsoft.SqlServer.SqlWmiManagement, which is what controls everything you usually see in SQL Server Configuration Manager.

[Screenshot: SQL Server Configuration Manager]

It’s important to note that you must use the server’s network name when working with SqlWmiManagement. This script automatically determines your SQL instance’s network name by using $server.ComputerNamePhysicalNetBIOS.
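A sketch of reading and changing the parameters, assuming an SMO $server object is already connected (as in the other recipes); the trace flag added here is only an illustration:

    # Load the WMI management assembly and connect to the machine by its network name
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.SqlWmiManagement")
    $computername = $server.ComputerNamePhysicalNetBIOS
    $wmi = New-Object Microsoft.SqlServer.Management.Smo.Wmi.ManagedComputer $computername

    # Default instances use the service name MSSQLSERVER; named instances use MSSQL$InstanceName
    $sqlservice = $wmi.Services["MSSQLSERVER"]
    $sqlservice.StartupParameters                 # view the current semicolon-separated list

    # Append a startup parameter (trace flag 1118 as an example) and save; takes effect at the next service restart
    $sqlservice.StartupParameters += ";-T1118"
    $sqlservice.Alter()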

Get info similar to SQL Server Management Studio Activity Monitor Processes tab

I used this technique to collect information for my DB/Login Inventory App, Watch-DBLogins.ps1.
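A sketch of pulling that information with SMO; the instance name is a placeholder:

    # EnumProcesses() returns a DataTable similar to the Activity Monitor Processes tab
    [void][System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.Smo")
    $server = New-Object Microsoft.SqlServer.Management.Smo.Server "localhost"

    $processes = $server.EnumProcesses()
    $processes | Select-Object Spid, Login, Host, Database, Command, Status | Format-Table -AutoSize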

In my script, I added the results to a SQL table and collected the information over time.
[Screenshot: Watch-DBLogins.ps1 output]

Get Server Configuration (sp_configure)

$server.Configuration is useful for not only viewing, but modifying server configurations as well. Don’t forget the .Alter() if you make any changes.
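For example, a sketch of viewing the settings and changing one of them (xp_cmdshell is just an illustration), assuming a connected SMO $server object:

    # List every configuration option along with its configured and running values
    $server.Configuration.Properties | Select-Object DisplayName, ConfigValue, RunValue | Format-Table -AutoSize

    # Change a setting, then persist it with Alter()
    $server.Configuration.XPCmdShellEnabled.ConfigValue = 1
    $server.Configuration.Alter()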

To see more recipes, head to the PowerShell SMO Recipe index.

Posted in PowerShell, SQL Server

Enumerate file structure within detached MDF database file using SMO

While researching how to connect to a detached MDF and read its internal file structure, including the associated LDF files, I kept seeing suggestions for the undocumented command DBCC CHECKPRIMARYFILE. Initially, I went down the DBCC route.
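Something along these lines, issuing DBCC CHECKPRIMARYFILE through the SMO connection; the mdf path is a placeholder and this is a reconstruction of the approach rather than the original snippet:

    # Assumes a connected SMO $server object; option 3 returns the name and path of every file in the database
    $mdf = "C:\temp\mydb.mdf"
    $results = $server.ConnectionContext.ExecuteWithResults("DBCC CHECKPRIMARYFILE ('$mdf', 3)")
    $results.Tables[0] | Format-Table -AutoSize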

After some exploration, I discovered that this task can actually be accomplished purely in SMO using EnumDetachedDatabaseFiles and EnumDetachedLogFiles.
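The SMO version, again with a placeholder path and an assumed $server connection:

    # Returns the full paths of the data files and log files referenced by the detached mdf
    $mdf = "C:\temp\mydb.mdf"
    $datafiles = $server.EnumDetachedDatabaseFiles($mdf)
    $logfiles  = $server.EnumDetachedLogFiles($mdf)

    $datafiles
    $logfiles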

Posted in PowerShell, SQL Server

Create Scheduled Task or Scheduled Job to Indefinitely Run a PowerShell Script Every 5 Minutes

Here are two straightforward ways to create scheduled tasks within Windows using PowerShell. One script creates a PowerShell Scheduled Job, and the other creates a classic Windows Scheduled Task. You can read about the differences here. PowerShell v3 or above is required.

Create a Scheduled Job

The following script creates a PowerShell Scheduled Job which shows up in the Task Scheduler GUI, and also generates files located at:

$home\AppData\Local\Microsoft\Windows\PowerShell\ScheduledJobs\\Output

I included the options that I most often use — repeat every X indefinitely, run as specified user, and run with highest privileges. Note that you should always use service accounts or accounts without expiring or changing passwords when setting up Scheduled Jobs and Scheduled Tasks.
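A sketch with placeholder job name, script path, and interval:

    # Repeat every 5 minutes, indefinitely ([TimeSpan]::MaxValue keeps this PowerShell v3 friendly)
    $trigger = New-JobTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5) -RepetitionDuration ([TimeSpan]::MaxValue)

    # Run with highest privileges
    $options = New-ScheduledJobOption -RunElevated

    # Run as a specific account -- ideally a service account with a non-expiring password
    $cred = Get-Credential -Message "Account the job should run as"

    Register-ScheduledJob -Name "MyRecurringJob" -FilePath "C:\scripts\MyScript.ps1" `
        -Trigger $trigger -ScheduledJobOption $options -Credential $cred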

This job will appear under Task Scheduler -> Task Scheduler Library -> Microsoft -> Windows -> PowerShell -> ScheduledJobs.

[Screenshot: Task Scheduler showing the Scheduled Job]

These Scheduled Jobs can be managed within PowerShell using cmdlets.

Create a Scheduled Task

This script will generate a new Scheduled Task. As previously stated, I included the options that I most often use — repeat every X indefinitely, run as specified user, and run with highest privileges. Note that you should always use service accounts or accounts without expiring or changing passwords when setting up Scheduled Jobs and Scheduled Tasks.
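A sketch using the ScheduledTasks module cmdlets (Windows 8 / Server 2012 and later); the task name, script path, and account are placeholders:

    # Launch PowerShell with the script; have the script return $true or $false (see the note below)
    $action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -File C:\scripts\MyScript.ps1"

    # Repeat every 5 minutes, indefinitely (on some newer Windows builds you may need to omit -RepetitionDuration)
    $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5) -RepetitionDuration ([TimeSpan]::MaxValue)

    # Register as a specific account with highest privileges; use a service account rather than a hard-coded password
    Register-ScheduledTask -TaskName "MyRecurringTask" -Action $action -Trigger $trigger `
        -User "DOMAIN\svc-account" -Password "P@ssw0rd" -RunLevel Highest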

[Screenshot: Task Scheduler showing the Scheduled Task]
This job will appear under Task Scheduler -> Task Scheduler Library. Also, don’t forget to return $true or $false. If you don’t, the “Last Run Result” will be 0x1.

Posted in PowerShell, Windows

Workaround for SMO Bug: urn could not be resolved at level OleDbProviderSetting

Microsoft Connect isn’t letting me sign in to post this workaround, but I just ran into a bug with OleDbProviderSettings within SMO.

[Screenshot: SMO urn resolution error]

To get around it, I used $server.Settings.OleDbProviderSettings instead of $server.OleDbProviderSettings.

Posted in PowerShell, SQL Server

NetScaler VPX Kinda Quick Start

Using a NetScaler within a home lab is beneficial, mostly because you can reuse the same IP over and over for different services. Imagine providing VPN, Lync Autodiscover, Exchange Autodiscover, and SharePoint all over port 443. That’s what a NetScaler VPX can do for you, for free.

Below is a tutorial that will help you get started with NetScaler. For everyday admins who haven’t done this before, it takes some time to understand, but here are a few important things to note:

  • The NetScaler VPX is free, but you have to get a license and renew it annually. Getting the license just requires pasting in your FlexNet Host ID (which you’ll grab from the command line later in this tutorial), downloading the license, then uploading it to your VPX.
  • You will need to allocate at least two IPs to your NetScaler: one to manage from (MIP), and one subnet IP (SNIP). The subnet IP isn’t a subnet mask; it’s just another IP within the subnet that allows your NetScaler to monitor the servers within that subnet. In my lab, my management IP is 10.0.0.10 and my subnet IP is 10.0.0.11. If you forget to set up your subnet IP, your monitors will fail, and everything will show as down.

This section is mostly text. If you’d like to see a deployment with step-by-step screenshots, visit Phillip Jones’ NetScaler VPX tutorial.

Download, Deployment, Basic Configuration and License

Download
  • First, go get the Citrix NetScaler VPX Express and License.
  • Scroll down to the middle until you see Download. I downloaded: NetScaler VPX for ESX 10.5.e.
  • Once the file has finished downloading, extract it to C:\temp. Keep this tab open; we’ll return to it later.
Deployment & Basic Configuration
  • Figure out what IP you’ll give your NetScaler, then add it to your local DNS.
  • Deploy your OVF
  • Don’t Power On after Deployment
  • Right-click VM -> Edit Settings -> Remove Network Adapter 2
  • Right-click VM -> Upgrade Virtual Hardware
  • Power on
  • Wait for the NetScaler to boot up, then enter in your IP information.

  • Open up a new tab in your browser, and load up your NetScaler by IP or hostname. My DNS is netscaler.ad.local, so I point to http://netscaler
  • Login with nsroot and password nsroot
  • Next, you’ll be prompted to enter a few things: Subnet IP address, Host/DNS Server/Time Zone and License. Again, my Subnet IP is just the next available IP within my subnet (MIP: 10.0.0.10, SNIP: 10.0.0.11)


License
  • Once you’re finished entering your SNIP and hostname/DNS/time zone, we need to do two things: 1. get to the licensing page within your browser, and 2. get the host information needed to download your license.
  • Go back to the Citrix browser tab that I said to keep open in the first section, and scroll down a little further to License. Expand the + and click Get License.

  • Then click Allocate License.
  • Next, you’ll see a popup with your serial number. Click the serial number, and you’ll be brought to the license allocation page. Click Continue when prompted.
  • See that blank Host ID? We need to go get it from the NetScaler.


  • Fire up PuTTY and connect to your NetScaler via SSH. This is required to get the information needed for your license.
  • Login with username nsroot and password nsroot
  • Type shell and enter.
  • Once at the shell prompt, type lmutil lmhostid -ether
  • Copy the FlexNet Host ID and go back to your browser. Paste it into the form box titled Host ID


  • Go back to the Citrix tab and paste that Host ID.
  • Click Continue -> Confirm -> OK
  • Your license will be downloaded and saved as a .lic file
  • Back to your NetScaler tab (http://netscaler), upload your license.
  • Click reboot
  • Now that your license is set up and you’ve rebooted, HTTPS will work! Log in to your NetScaler via HTTPS and change the nsroot password: System -> User Administration -> Users -> nsroot -> Change Password.
  • Now enable some features: System -> Settings -> Configure Basic Features, then check all these boxes:

    [Screenshot: Configure Basic Features checkboxes]

  • Optional: Configure Advanced Features -> Web Logging, IPv6 Protocol Translation, EdgeSight Monitoring, AppFlow, Responder, vPath.
  • Since the VPX is a virtual machine, it’s prone to time drift. Add an NTP Server to ensure AD authentication will work if you eventually set it up. System -> NTP Servers, add your DC’s NTP server. Once you’re done, hit Save*.
    [Screenshot: NTP Servers configuration]
    * Hit Save a lot. I once lost a few hours worth of configuration because I forgot to save and the machine shut down unexpectedly.

  • Optional: Follow this step-by-step guide to setup Active Directory authentication
  • Optional: If you’d like to replace the SSL cert on the web management interface, use this tutorial which is way easier than the method Citrix provides.

In the next post, Using NetScaler VPX in your Home Lab, I’ll demonstrate how to use Content Switching within NetScaler to support a number of services over one SSL port/IP. It may be a little while, though.

Posted in Security