How To Write A LEFT SEMI JOIN
When we INNER JOIN from one table to another table with a column or set of columns and we select all columns, we’ll get the results returned for both tables. This is true for some other join types as well, but the focus of this video will be for INNER JOINs since a LEFT SEMI JOIN is useful in contexts where we want specific results. In some cases with our joins, we only want the result set for one table, not both. Suppose that we wanted the result set for the LEFT table only in the join condition? We could use the standard SQL syntax of selecting all columns in the first table with the * operator filtered on the first table. This would return all columns in the first table and we’d still have the option of selecting columns from the second (and latter) tables. However, if we want to give ourselves the option of only selecting from the first table in our join, we can use a LEFT SEMI JOIN. In the video, SQL Basics: How To Use A LEFT SEMI JOIN and Why, we look at using this functionality.
Some questions that are answered in the video:
- For this video, we use a self join table since our join will only involve one table. Since we join the table onto itself, any column that we use in the join condition will match provided that we don’t change any part of the condition (which we do in our example and I show why).
- As a quick reminder, not all SQL languages may support the exact syntax shown. Different SQL languages may have different approaches to solving the same problem or developing the same solution.
- As we see with the results of the INNER JOIN, what output issue occurs when we join Table1 and Table2 when it comes to columns? For developers with application experience, what would be the issue here? Keep in mind that with INNER JOINs, this is seldom an issue, but worth considering in this case because it does come up.
- How does a LEFT SEMI JOIN solve the above issue?
- When would we consider using a LEFT SEMI JOIN? What could be the possible issue?
Given that fat finger and code issues occur, a LEFT SEMI JOIN can help us restrict the result set. In a business situation where we want to check a table’s values against another table (or set of tables) and whether those values exist in other tables while only returning the results of the first table, a LEFT SEMI JOIN can help us. This functionality absolutely prevents us from possibly returning data in other tables, whereas using an INNER JOIN would allow us to possibly select columns from other tables than the first. For this reason, a LEFT SEMI JOIN can also be a more secure join type, if our business problem is one of filtering values in one tables that exist in other tables.
Spark SQL
Before we start this exercise, recall what we learned in how to write an inner join and why. When we have the matched data return (using T-SQL), we had the ID
, Letter
and Val
columns return from both tables. Even if the values did not match in some of the columns themselves (ie: Letter
and Val
columns did not match on the data level while the ID
column did match for values 2 and 4), they still returned based on the join condition of the ID
column. We also did not have any data return where there wasn’t a match between the tables. This key to understanding a LEFT SEMI JOIN
.
What if we wanted to INNER JOIN
our data, but only return the data from the initial table (in our example case, that would be Table1
)? This is what a LEFT SEMI JOIN
accomplishes.
To practice this exercise with LEFT SEMI JOIN
s, we’ll create the two tables that you see in the video with the test data. You can also use the query below this to create the data.
INSERT INTO Table1 VALUES (1,'A','uni2')
INSERT INTO Table1 VALUES (2,'B','uni4')
INSERT INTO Table1 VALUES (3,'C','uni6')
INSERT INTO Table1 VALUES (4,'D','uni2')
INSERT INTO Table2 VALUES (3,'E','uni1')
INSERT INTO Table2 VALUES (4,'F','uni3')
INSERT INTO Table2 VALUES (5,'G','uni5')
INSERT INTO Table2 VALUES (6,'H','uni7')
Now that we have data in our table, let’s execute a LEFT SEMI JOIN
using the same condition as the INNER JOIN
query we used in that post:
SELECT *
FROM Table1 t1
LEFT SEMI JOIN Table2 t2 ON t1.ID = t2.ID
We only get back the data in Table1
that matches with the records in Table2
. The equivalent of this in SQL languages that don’t support LEFT SEMI JOIN
s using the above code is as follows:
SELECT t1.*
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.ID = t2.ID
(Note: at the time of this post, T-SQL does not support the LEFT SEMI JOIN
syntax. The above SQL is how you would get the same result with T-SQL.)
Pandas
We’ll mirror the SQL data for python pandas. In the below code we first create two pandas’ dataframes that we’ll later join. As we see, we match the data that we have in the SQL tables. This helps keep this demonstration easy to review; you can edit these dataframes to create a more complex join.
tableOne = {
"id": [1,2,3,4],
"letter": ["A","B","C","D"],
"val": ["uni2","uni4","uni6","uni8"]
}
tableOneDf = pd.DataFrame(tableOne)
tableTwo = {
"id": [3,4,5,6],
"letter": ["E","F","G","H"],
"val": ["uni1","uni3","uni5","uni7"]
}
tableTwoDf = pd.DataFrame(tableTwo)
As we saw with Spark SQL, one way to think about a LEFT SEMI JOIN
is return all the data from Table1
where the ID
values in Table1
exist in Table2
. As we’ve practiced with merge, we could do an INNER JOIN
and return all the columns from tableOneDf
:
leftSemiJoin = tableOneDf.merge(tableTwoDf, on="id", how="inner")[["id","letter_x","val_x"]]
However, we could also simply select everything from tableOneDf
where the id
columns is in tableTwoDf
:
leftSemiJoin = tableOneDf[tableOneDf["id"].isin(tableTwo["id"])]
Both are correct and it’s worth seeing the different ways that this can be done because you may see either of these approaches and it helps to see what they do.
Pyspark
We’ll do the same thing using Python pyspark. In the below code we create two spark dataframes from two data sets that we’ll later join. As we see, we’re using the same data that we used in the T-SQL and pandas demo in the above code.
tableOne = sc.parallelize([{"id": 1, "letter": "A", "val": "uni2"},{"id": 2, "letter": "B", "val": "uni4"},{"id": 3, "letter": "C", "val": "uni6"},{"id": 4, "letter": "D", "val": "uni8"}])
tableTwo = sc.parallelize([{"id": 3, "letter": "E", "val": "uni1"},{"id": 4, "letter": "F", "val": "uni3"},{"id": 5, "letter": "G", "val": "uni5"},{"id": 6, "letter": "H", "val": "uni7"}])
tableOneDf = spark.createDataFrame(tableOne)
tableTwoDf = spark.createDataFrame(tableTwo)
The syntax for a LEFT SEMI JOIN
is very similar to an INNER JOIN
— keep in mind that if we used the latter, we would get all the columns from both dataframes but with the former we only get the columns from the initial dataframe of tableOneDf
:
leftSemiJoin = tableOneDf.join(tableTwoDf, "id", "leftsemi")
We see that our result matches what we’ve seen in Spark SQL as well — the id
values of 3 and 4 return along with the data from tableOneDf
and nothing from tableTwoDf
. I understand why some people may ask, “Why not just do an INNER JOIN
and select from the initial table?” and that’s a way in which we can get the same result. However, if we have an automatic column selection in our function and we simply want to return the columns automatically without specifying them, a LEFT SEMI JOIN
allows us to use are automatic column selection without column specification. This functionality comes in handy for these purposes because we can use an automatic column selector and the join itself returns the records that we want.
Keep in mind that if we want columns from the joined dataframe (or dataframes), we want to avoid a LEFT SEMI JOIN
(this would throw an error if there were a different set of columns and we tried to select one of them).
General Considerations
In general, I would avoid using Python for analyzing data in sets. While Python may have some strengths, like all object oriented programming languages, it performs extremely poorly relative to SQL languages when it comes to sets of data.
For more SQL lessons, you can continue to watch the SQL Basics series.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.