T-SQL: Parse Top Level Domain from URL
I keep track of all hits to my website RealCajunRecipes.com in a SQL table called hitcounter which has columns for the user's IP, browser, referring URL and the date. Recently, I saw a surge in traffic and wanted to know which domains were sending the traffic our way. After getting tired of issuing ad-hoc queries that included WHERE clauses like "where referer like '%google%'", I created a SQL Server user-defined function (UDF) to extract the domain from the referring URL.
CREATE FUNCTION [dbo].[parseURL] (@strURL varchar(1000)) RETURNS varchar(1000) AS BEGIN IF CHARINDEX('https://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0 -- Ghetto-tastic SELECT @strURL = REPLACE(@strURL,'https://','') SELECT @strURL = REPLACE(@strURL,'https://','') SELECT @strURL = REPLACE(@strURL,'www','') -- Remove everything after "/" if one exists IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))
-- Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1) ELSE -- It's likely a google.co.uk, or google.com.au SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1) RETURN @strURL END This script does the following: 1. Checks to see if the string is an URL (example: str = https://www.search.google.com.au/?q=netnerds) 2. Removes http, https and www (str = search.google.com.au/?q=netnerds) 3. Removes everything after the slash (str = search.google.com.au) 4. Removes excessive subdomains (str = google.com.au)
The script isn't perfect; I saw things like mysearch.myway.com get by but it's good enough for general use. If you'd like to see the entire domain, just remove the 4 line chunk marked "Optional."
To call this using SQL, modify this sample script to suite your environment:
SELECT COUNT(*) as theCount, dbo.parseURL(referer) as referer FROM hitcounter WHERE referer IS NOT NULL GROUP BY dbo.parsedomain(referer) ORDER BY thecount DESC Your results should look something like this
11831 | google.com |
10542 | yahoo.com |
9101 | msn.com |
746 | google.ca |
624 | google.co.uk |
Note: NULLs aren't parsed and thsu won't kill this function..they'll just show up as NULL.