How To Write A Left Anti Join and Why
Previously, we looked at the join functionality of the LEFT JOIN. If we recall from that lesson, a LEFT JOIN returns records that match and don't match on the condition specified. Since we're using the same example throughout these lessons to make the results easy to remember, we'll recall that two of our records in Table1 matched with Table2, but two records didn't match, so we saw NULLs in the result. When there isn't a match (because the records don't exist), we get a NULL in return. What if we only want the records in Table1 that don't match in Table2? In the video, SQL Basics: How To Use A LEFT ANTI JOIN and Why, we look at accomplishing this challenge using the LEFT ANTI JOIN. Since not all SQL languages support LEFT ANTI JOIN, I show this syntax two different ways, one of which is generally supported.
Some questions that are answered in the video:
- Note the tables that we're using, which values are identical based on the column names, and which values differ. We'll be using these tables throughout these videos. As a note on aliasing tables: the first table always gets a "t1", the second table always gets a "t2", and so on and so forth (the 5th table would get a "t5"). Finally, consider that some SQL languages may not support the JOINs that we discuss.
- As a quick reminder, not all SQL languages may support the exact syntax shown.
- In our first example, we use a LEFT ANTI JOIN without directly calling it (like we could in some SQL languages), but by using a NULL filter on the LEFT JOIN condition. What is the purpose of specifying NULLs in this example?
- To accomplish the same results as in the first example, how can we LEFT ANTI JOIN Table1 with Table2? How does this save us time in development?
- What would happen if we used a LEFT ANTI JOIN from Table1 on Table2 for the column Letter? What would our results be? Why?
As we see, a big part of what we're doing here is filtering out records when we have two data sets with similarities and differences. If you've ever completed an assignment that involved comparing and contrasting, then you've done something functionally similar: you had to identify the similarities and the differences, which is what we're doing here with data.
As for a real-world example with LEFT ANTI JOINs, consider a health example where we have populations with different traits, but all with the same disease. We could use a LEFT ANTI JOIN to filter out these differences, which might allow us to take our analysis further (it may or may not be useful). The key with any data set where we use a LEFT ANTI JOIN is that it must have columns that can be used in the join condition where we do the filtering. In other words, we would filter on one trait or multiple traits, depending on how we broke down our data in the example. The same applies if we were to use LEFT ANTI JOINs with sales data, housing data, etc.: we must have columns where we'll apply our filtering condition.
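To make that concrete, here is a minimal sketch of the health example. The table and column names (PopulationA, PopulationB, Trait) are hypothetical, invented purely for illustration; the pattern is the same LEFT JOIN with a NULL filter that we cover below:

-- Hypothetical tables: return subjects in PopulationA whose trait
-- does not appear anywhere in PopulationB
SELECT a.*
FROM PopulationA a
LEFT JOIN PopulationB b ON a.Trait = b.Trait
WHERE b.Trait IS NULL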
T-SQL
For both the T-SQL and Spark SQL examples, we'll use the following data set.
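The INSERT statements below assume that Table1 and Table2 already exist. The table definitions aren't shown here, but based on the columns used throughout this post (ID, Letter, Val), an assumed minimal schema might look like this:

CREATE TABLE Table1 (ID INT, Letter VARCHAR(10), Val VARCHAR(10))
CREATE TABLE Table2 (ID INT, Letter VARCHAR(10), Val VARCHAR(10))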
INSERT INTO Table1 VALUES (1,'A','uni2')
INSERT INTO Table1 VALUES (2,'B','uni4')
INSERT INTO Table1 VALUES (3,'C','uni6')
INSERT INTO Table1 VALUES (4,'D','uni8')
INSERT INTO Table2 VALUES (3,'E','uni1')
INSERT INTO Table2 VALUES (4,'F','uni3')
INSERT INTO Table2 VALUES (5,'G','uni5')
INSERT INTO Table2 VALUES (6,'H','uni7')
As we see in the video, there isn't direct LEFT ANTI JOIN capability in T-SQL. However, we know that a LEFT ANTI JOIN returns data from the initial data set where the values do not exist in the joined data set. Using the ID column as the filter like we see in the example, we'll LEFT ANTI JOIN Table1 on Table2 to get the records from Table1 that don't exist in Table2:
SELECT t1.*
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID
WHERE t2.ID IS NULL
T-SQL requires that we specify the data from Table1 (in the above example, that would be the t1.*); otherwise, if we selected everything from this query (*), the result would include the NULL values from Table2. This is worth noting with T-SQL, as it does not apply to Spark SQL.
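As a side note not covered in the video, the same anti-join semantics are often written with NOT EXISTS, which avoids the NULL filter entirely and is generally supported across SQL languages. A sketch of the equivalent query:

SELECT t1.*
FROM Table1 t1
WHERE NOT EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.ID = t1.ID)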
Spark SQL
Using the same data set and same query as above, we'll see that we can use the LEFT ANTI JOIN functionality directly when we query the data, and we'll also note what happens when we SELECT * versus what we did with T-SQL, where we only selected the data from Table1:
SELECT *
FROM Table1 t1
LEFT ANTI JOIN Table2 t2 ON t1.ID = t2.ID
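Given our data set, Table1's ID values 1 and 2 have no match in Table2, so this should return something like:

ID | Letter | Val
1  | A      | uni2
2  | B      | uni4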
As we see, only data from Table1 returns. In fact, if we tried to select a column from Table2 (e.g., SELECT t2.Letter), we would get the error "[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with the name t2.letter cannot be resolved. Did you mean one of the following? [t1.letter, t1.id, t1.val]". The reason is that this anti join fundamentally acts as a filter: we are looking for records in Table1 that are not in Table2, so we inherently don't want to see records from Table2.
Pandas
We'll do the same thing using Python pandas. In the below code, we first create two pandas dataframes that we'll later left anti join. As we see, we match the data that we have in the SQL tables. This helps keep the demonstration easy to review; you can edit these dataframes to create a more complex join.
import pandas as pd

# Recreate Table1 from the SQL examples as a pandas dataframe
tableOne = {
    "id": [1,2,3,4],
    "letter": ["A","B","C","D"],
    "val": ["uni2","uni4","uni6","uni8"]
}
tableOneDf = pd.DataFrame(tableOne)

# Recreate Table2 from the SQL examples
tableTwo = {
    "id": [3,4,5,6],
    "letter": ["E","F","G","H"],
    "val": ["uni1","uni3","uni5","uni7"]
}
tableTwoDf = pd.DataFrame(tableTwo)
We'll execute two LEFT ANTI JOINs using pandas on the id column, first with tableOneDf as the initial dataframe and then with tableTwoDf. What we'll observe about a LEFT ANTI JOIN is how it acts as a filter and how it returns only the data set of the initial dataframe.
# Left join with indicator=True adds a _merge column showing where each row matched
leftAntiJoinFromTableOne = tableOneDf.merge(tableTwoDf, on="id", how="left", indicator=True)
# Keep only the rows found solely in the left dataframe, then select the left-side columns
leftAntiJoinFromTableOne = leftAntiJoinFromTableOne[leftAntiJoinFromTableOne['_merge'] == 'left_only'][['id','letter_x','val_x']]
print(leftAntiJoinFromTableOne)
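Assuming the dataframes above, the print should show something like:

   id letter_x val_x
0   1        A  uni2
1   2        B  uni4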
In the above code, the indicator will tell us the result of our merge operation, which we'll use as a filter. When we merge two dataframes where the column names are the same, we see _x and _y suffixes added to those similarly named columns. The _x columns are from the initial dataframe, which is why they are selected. Our result returns id values of 1 and 2 from tableOneDf because these values do not exist in tableTwoDf.
Let's do the same, except invert the dataframes so that tableTwoDf is our initial dataframe:
leftAntiJoinFromTableTwo = tableTwoDf.merge(tableOneDf, on="id", how="left", indicator=True) \
.query("_merge == 'left_only'") \
[['id', 'letter_x', 'val_x']]
print(leftAntiJoinFromTableTwo)
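This time tableTwoDf is the initial dataframe, so the id values 5 and 6 (the ones missing from tableOneDf) should print, along the lines of:

   id letter_x val_x
2   5        G  uni5
3   6        H  uni7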
In the above code, we use the same logic, though with a different format. We merge the dataframes, query where the _merge column equals left_only, and produce the columns from the initial dataframe. One difference that we'll note with pandas versus Spark SQL is that in pandas we must specify the columns that should return (only those from the initial dataframe). This is not the case with Spark SQL: if I were to LEFT ANTI JOIN in Spark SQL and try to return columns from the joined data set, I would get an error. This makes sense because those columns would always be NULL.
As we see, this can be a useful technique for filtering on the join condition, though I would be cautious about performance.
PySpark
Unlike T-SQL, we can write LEFT ANTI JOINs directly in PySpark, and we'll see how the syntax differs from what we've grown accustomed to in T-SQL. In the below code, we create two Spark dataframes that we'll later left anti join. As we see, we match the data that we have in the SQL tables. This helps keep the demonstration easy to review; you can edit these dataframes to create a more complex join. As a quick note since we are using PySpark: we can also create a PySpark dataframe from a pandas dataframe (in this example, we won't do this).
# sc (SparkContext) and spark (SparkSession) are provided by the notebook environment
tableOne = sc.parallelize([{"id": 1, "letter": "A", "val": "uni2"},{"id": 2, "letter": "B", "val": "uni4"},{"id": 3, "letter": "C", "val": "uni6"},{"id": 4, "letter": "D", "val": "uni8"}])
tableTwo = sc.parallelize([{"id": 3, "letter": "E", "val": "uni1"},{"id": 4, "letter": "F", "val": "uni3"},{"id": 5, "letter": "G", "val": "uni5"},{"id": 6, "letter": "H", "val": "uni7"}])
# Convert the RDDs of dictionaries into Spark dataframes
tableOneDf = spark.createDataFrame(tableOne)
tableTwoDf = spark.createDataFrame(tableTwo)
Here's where we'll see a big difference between T-SQL and PySpark in terms of syntax, compared to what we would expect from an INNER or LEFT JOIN. In the case of a LEFT ANTI JOIN, we'll specify it directly; note that we're not using a NULL filter here:
leftAntiJoin = tableOneDf.join(tableTwoDf, "id", "leftanti")
display(leftAntiJoin)
We get back id values of 1 and 2 from tableOneDf because the LEFT ANTI JOIN returns all values from the initial dataframe that don't exist in the joined dataframe. In other words, this is the equivalent of a T-SQL NOT IN filter. We'll also see that we only get the columns from tableOneDf; if you've ever seen an error about a missing column, you may have erroneously specified a column from the joined dataframe (not the initial one) that won't return, since a LEFT ANTI JOIN will only return columns from the initial dataframe.
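For reference, that NOT IN equivalent in T-SQL would look roughly like the sketch below. One caution worth adding: NOT IN behaves unexpectedly when the subquery returns NULLs (no rows come back at all), which is another reason the NOT EXISTS form shown earlier is often preferred:

SELECT t1.*
FROM Table1 t1
WHERE t1.ID NOT IN (SELECT t2.ID FROM Table2 t2)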
What happens if we apply the LEFT ANTI JOIN by using the letter column instead of the id column?
leftAntiJoin = tableOneDf.join(tableTwoDf, tableOneDf.letter == tableTwoDf.letter, "leftanti")
display(leftAntiJoin)
All values from tableOneDf return, and they should, because none of the letters that exist in tableOneDf exist in tableTwoDf. We would get the same result if we inverted the dataframes, because none of the letter values that exist in tableTwoDf exist in tableOneDf.
LEFT ANTI JOINs make useful NOT IN filters, but I would caution developers on these because they can introduce performance challenges, especially in SQL languages, where we would otherwise expect a performance boost from well-organized data. It is always better to architect data in a manner that allows us to quickly filter records in and out by using ranges instead of looking at individual values. Consider a phone book that is randomly organized, and imagine if you had to return everyone except people with the last name Doe. That would be extremely difficult because of the random organization. However, if we organized the phone book alphabetically, suddenly returning everyone except the last name Doe becomes very easy. Data modeling is similar: we want to consider how data will be queried so we can organize it in a manner that results in faster query performance.
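To show what filtering by ranges means in practice, here's a hypothetical sketch (the PhoneBook table and its index are invented for illustration). With an ordered index on LastName, excluding a single value can be satisfied by two range seeks rather than checking every individual row:

-- Hypothetical table with an index on LastName
-- The optimizer can treat this as two ranges: < 'Doe' and > 'Doe'
SELECT *
FROM PhoneBook
WHERE LastName < 'Doe' OR LastName > 'Doe'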
In other words, if you find that you're using a lot of NOT INs and LEFT ANTI JOINs, you may want to reconsider how you've architected your data.
General Considerations
In general, I would avoid using Python for analyzing data in sets. While Python has some strengths, like all object-oriented programming languages, it performs extremely poorly relative to SQL languages when it comes to sets of data.
For more SQL lessons, you can continue to watch the SQL Basics series.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyrighted; all rights reserved. None of the written content in this post may be used in any artificial intelligence.