T-SQL: Parse Top Level Domain from URL

I keep track of all hits to my website RealCajunRecipes.com in a SQL table called hitcounter which has columns for the user’s IP, browser, referring URL and the date. Recently, I saw a surge in traffic and wanted to know which domains were sending the traffic our way. After getting tired of issuing ad-hoc queries that included WHERE clauses like “where referer like ‘%google%'”, I created a SQL Server user-defined function (UDF) to extract the domain from the referring URL.

parseurl.sql

CREATE      FUNCTION [dbo].[parseURL]  (@strURL varchar(1000))
RETURNS varchar(1000)
AS
BEGIN
IF CHARINDEX('http://',@strURL) > 0 OR CHARINDEX('https://',@strURL) > 0
-- Ghetto-tastic
SELECT @strURL = REPLACE(@strURL,'https://','')
SELECT @strURL = REPLACE(@strURL,'http://','')
SELECT @strURL = REPLACE(@strURL,'www','')
-- Remove everything after "/" if one exists
IF CHARINDEX('/',@strURL) > 0 (SELECT @strURL = LEFT(@strURL,CHARINDEX('/',@strURL)-1))

-- Optional: Remove subdomains but differentiate between www.google.com and www.google.com.au
IF (LEN(@strURL)-LEN(REPLACE(@strURL,'.','')))/LEN('.') < 3 -- if there are less than 3 periods
SELECT @strURL = PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
ELSE -- It's likely a google.co.uk, or google.com.au
SELECT @strURL = PARSENAME(@strURL,3) + '.' + PARSENAME(@strURL,2) + '.' + PARSENAME(@strURL,1)
RETURN @strURL
END

This script does the following:
1. Checks to see if the string is an URL
    (example: str = http://www.search.google.com.au/?q=netnerds)
2. Removes http, https and www (str = search.google.com.au/?q=netnerds)
3. Removes everything after the slash (str = search.google.com.au)
4. Removes excessive subdomains (str = google.com.au)

The script isn’t perfect; I saw things like mysearch.myway.com get by but it’s good enough for general use. If you’d like to see the entire domain, just remove the 4 line chunk marked “Optional.”

To call this using SQL, modify this sample script to suite your environment:

SELECT COUNT(*) as theCount, dbo.parseURL(referer) as referer FROM hitcounter
WHERE referer IS NOT NULL
GROUP BY dbo.parsedomain(referer)
ORDER BY thecount DESC

Your results should look something like this

11831 google.com
10542 yahoo.com
9101 msn.com
746 google.ca
624 google.co.uk

Note: NULLs aren’t parsed and thsu won’t kill this function..they’ll just show up as NULL.

Chrissy is a PowerShell MVP who has worked in IT for nearly 20 years, and currently serves as a Sr. Database Engineer in Belgium. Always an avid scripter, she attended the Monad session at Microsoft’s Professional Developers Conference in Los Angeles back in 2005 and has worked and played with PowerShell ever since. Chrissy is currently pursuing an MS in Systems Engineering at Regis University and helps maintain RealCajunRecipes.com in her spare time. She holds a number of certifications, including those relating to SQL Server, SuSE Linux, SharePoint and network security. She recently became co-lead of the SQL PASS PowerShell Virtual Chapter. You can follow her on Twitter at @cl.

Posted in SQL Server
16 comments on “T-SQL: Parse Top Level Domain from URL
  1. cm says:

    thanks -works great

  2. Eric P says:

    THis is absolutely fantastic! I am getting a curious result though, and I am not sure what to make of it.

    if I pass the URL from my logs through the function, it is successful at returning the root domain, unless the string was going to the web root.

    For example, If the entered string was http://www.google.com, the function returns a null, if the string was http://www.google.com/calendar?tab=mc , it properly returns google.com. Is the missing slash at the end of the input string the culprit?

  3. Eric P says:

    Never mind…I found that for whatever reason, my logging software is adding scads of whitespace to the right of the site if the entry goes directly to the root website causing the record to exceed the allowed length of the variable. If I use an rtrim(site), everything works as it should. I wll be contacting the programmers of the logging software.

    Thanks a million for this!!

  4. Edward says:

    Thank you very much!

  5. Jerry @ Moonvalley Software says:

    Thanks for the tips and the sample, very helpful.

  6. Robert Lorbeer says:

    Nice. Thank you for your generosity.

  7. R B says:

    merci, but how about ‘http://google.com.au’

  8. Dmitry says:

    the script is required to be improved, because it will not work on names like
    http://md.com.ua or http://mdx.com.ua or any similar…

  9. Jeff says:

    Nice – this is a great function!

  10. Thanks – just what I needed! It’s a nice idea to use PARSENAME to get the subdomains.

  11. TomM says:

    So .. for the part of this where you are triggering the function, my table name is syslogB, and I assume that goes in place of the “hitcounter” tablename, but when I run the SQL part to trigger the function (which created with no problem) I get an error….

    ‘Msg 207, Level 16, State 1, Line 2
    Invalid column name ‘referer’.
    Msg 4121, Level 16, State 1, Line 2
    Cannot find either column “dbo” or the user-defined function or aggregate “dbo.parsedomain”, or the name is ambiguous.’

    Any thoughts would be appreciated…

  12. The Dude says:

    Very cool…

  13. VALVE says:

    thanks..

  14. Grant says:

    very nice thanks.

  15. Andrew says:

    Brilliant, thank you so much for this script.

  16. harvest316 says:

    This works better:

    ALTER FUNCTION dbo.spExtractDomainFromURL ( @strURL NVARCHAR(1000) )
    RETURNS NVARCHAR(100)
    –Posted at http://stackoverflow.com/a/20808097/391101
    AS
    BEGIN
    –Strip Protocol
    SELECT @strURL = SUBSTRING(@strURL, CHARINDEX('://', @strURL) + 3, 999)

    — Strip www subdomain
    IF LEFT(@strURL, 4) = 'www.'
    SELECT @strURL = SUBSTRING(@strURL, 5, 999)

    — Strip Path
    IF CHARINDEX('/', @strURL) > 0
    SELECT @strURL = LEFT(@strURL, CHARINDEX('/', @strURL) – 1)

    –Unless you iterate through a list of TLDs, you can't differentiate between subdomain.example.com and example.com.au
    –I decided against this because it's slower, and the TLD list requires maintenance

    RETURN @strURL
    END

1 Pings/Trackbacks for "T-SQL: Parse Top Level Domain from URL"
  1. […] trying to use this script (which extracts the domain from a […]

Leave a Reply

Your email address will not be published. Required fields are marked *

*