Thursday, July 9, 2015

SQL Server - Comparing Partially Disparate Lists

What's the Situation?

I was given a source list of users and three other lists of users.  I need to find all of the users in all of the lists in order to run some analysis.  This sounds simple, but to do it by hand is a bit tedious when the lists are large and to do it in SQL Server is a bit tricky to get it just right.

The original lists were given as flat files.  Each list comprised of a first name and last name.  I did some data massaging then imported the files into SQL Server using the Import/Export wizard.  The result became 4 tables of two columns (FirstName and LastName).  Now the fun begins...

Test Case Setup

In order to setup a test case so you can follow along, create 4 tables:

CREATE TABLE TableA (
FirstName VARCHAR(255) NULL
, LastName VARCHAR(255) NULL
)
CREATE TABLE TableB (
FirstName VARCHAR(255) NULL
, LastName VARCHAR(255) NULL
)
CREATE TABLE TableC (
FirstName VARCHAR(255) NULL
, LastName VARCHAR(255) NULL
)
CREATE TABLE TableD (
FirstName VARCHAR(255) NULL
, LastName VARCHAR(255) NULL
)

Then load some dummy data:

INSERT INTO TableA VALUES ('John', 'Smith'), ('Bob', 'Gray'), ('Jane', 'Doe'), ('Gary', 'Thompson')
INSERT INTO TableB VALUES ('Bob', 'Gray'), ('Gary', 'Thompson'), ('Theodore', 'Edwards')
INSERT INTO TableC VALUES ('John', 'Smith'), ('Bob', 'Gray'), ('Gary', 'Thompson'), ('Theodore', 'Edwards')
INSERT INTO TableD VALUES ('John', 'Smith'), ('Gary', 'Thompson'), ('Theodore', 'Edwards'), ('Mike', 'Jones')

Making the Report

In order to see who belongs where, we obviously need a join.  The question becomes what join do I use and how to I use it so I get the results I need? (A list of user membership in each table with one row per user).

In this case, I only needed to join on LastName because all last names were unique.

Using Left Joins

A LEFT OUTER JOIN would suffice if we only cared about the users in the source list.  If this were the case, the query below would have been sufficient:

SELECT
   a.FirstName AS TableA_FirstName
,  a.LastName AS TableA_LastName
,  b.FirstName AS TableB_FirstName
,  b.LastName AS TableB_LastName
,  c.FirstName AS TableC_FirstName
,  c.LastName AS TableC_LastName
,  d.FirstName AS TableD_FirstName
,  d.LastName AS TableD_LastName
FROM
   TableA a
   LEFT JOIN TableB b
      ON a.LastName = b.LastName
   LEFT JOIN TableC c
      ON a.LastName = c.LastName
   LEFT JOIN TableD d
      ON a.LastName = d.LastName
ORDER BY
   COALESCE(a.LastName, b.LastName, c.LastName, d.LastName)

The problem with the left join is that we only see 4 rows instead of 6 because two of the names don't exist in TableA.


Using Full Outer Joins

What about users that belong only in one of the other three lists?  A FULL OUTER JOIN can help solve this:

SELECT
   a.FirstName AS TableA_FirstName
,  a.LastName AS TableA_LastName
,  b.FirstName AS TableB_FirstName
,  b.LastName AS TableB_LastName
,  c.FirstName AS TableC_FirstName
,  c.LastName AS TableC_LastName
,  d.FirstName AS TableD_FirstName
,  d.LastName AS TableD_LastName
FROM
   TableA a
   FULL OUTER JOIN TableB b
      ON a.LastName = b.LastName
   FULL OUTER JOIN TableC c
      ON a.LastName = c.LastName
   FULL OUTER JOIN TableD d
      ON a.LastName = d.LastName
ORDER BY
   COALESCE(a.LastName, b.LastName, c.LastName, d.LastName)

The problem now is if the user only belongs in at least two of the other lists (b, c, or d) and not the first list (a), the result becomes skewed creating a new row for every additional entry after the first one found.  Using the example data and the query above we now have 8 rows instead of 6 because Theodore Edwards exists in lists b, c, and d but doesn't exist in list a.  There's an additional row for lists c and d.


Using Full Outer Joins With Complex ON Conditions

To fix the row skew issue we can add ON conditions to TableC and TableD so all possible tables are accounted for.

SELECT
   a.FirstName AS TableA_FirstName
,  a.LastName AS TableA_LastName
,  b.FirstName AS TableB_FirstName
,  b.LastName AS TableB_LastName
,  c.FirstName AS TableC_FirstName
,  c.LastName AS TableC_LastName
,  d.FirstName AS TableD_FirstName
,  d.LastName AS TableD_LastName
FROM
   TableA a
   FULL OUTER JOIN TableB b
      ON a.LastName = b.LastName
   FULL OUTER JOIN TableC c
      ON a.LastName = c.LastName
      OR b.LastName = c.LastName
   FULL OUTER JOIN TableD d
      ON a.LastName = d.LastName
      OR b.LastName = d.LastName
      OR c.LastName = d.LastName
ORDER BY
   COALESCE(a.LastName, b.LastName, c.LastName, d.LastName)

Now the result looks the way we want it to.  Only one row per matching name across all 4 tables (lists).


Summary

Though this trick may not come in handy in day to day operations I find it to be helpful when performing ad-hoc analysis of data.  This type of analytic querying may be helpful for reviewing lists of SQL logins, lists of database users, or lists of active directory security group membership to name a few.

No comments:

Post a Comment