How To Write A Left Join and Why
Previously, we looked at INNER JOIN
s and where we may consider using them in development — a quick example might be if we have a set of data that shares a relationship with another data set and we want to return all the data from both data sets where we have matching data. There may be situations where we want to return data that also don’t match, or we may want to use the fact that the data don’t match between data sets as a filter of sorts. There are a variety of techniques to solve this and LEFT JOINs provide us with one way to accomplish this. In the video, SQL Basics: How To Use A LEFT JOIN and Why, we look at several examples of using a LEFT JOIN and why we might consider using this, as opposed to what we learned in an early week with an INNER JOIN.
Some questions that are answered in the video:
- Note the tables that we’re using and what values are identical based on the column names and what values differ based on the column names. We’ll be using these tables throughout these videos. As a note on aliasing tables — the first table always gets a “t1” while the second table always gets a “t2” and so on and so forth (the 5th table would get a “t5”). Finally, consider that some SQL languages may not support the JOINs that we discuss.
- In first our example, what do we LEFT JOIN on? Suppose that we use a different column — how might the result set that’s returned in the first example differ?
- In our second example, we flip the tables that are being joined — how does this affect the result set (if applicable)?
- Based on the results from the first two examples, how are these examples showing the LEFT in the LEFT JOIN — consider what’s happening with the results from both tables being used.
- If records match, do we get these results with a LEFT JOIN? Considering this question, how does LEFT JOIN differ from INNER JOIN?
When it comes to applying LEFT JOINs to a problem (real world use), we can see that LEFT JOINs make it easy to determine based on linking a column or set of columns what data match between two tables and what data don’t match. Suppose that we were comparing genetics of two population groups and we wanted to see the genetic overlap between these populations, but also the differences between these groups. We could accomplish this with LEFT JOINs (not the only way to accomplish this in SQL, but one way). In the world of databases, this comes up regularly — we have a source data set and we want to compare with a destination data set. Often this will involve inserting data that doesn’t exist in the destination from the source or updating information in the destination that already exists in the source.
T-SQL
To practice this exercise, create the two tables that you see in the video with the test data. You can also use the query below this.
INSERT INTO Table1 VALUES (1,'A','uni2')
INSERT INTO Table1 VALUES (2,'B','uni4')
INSERT INTO Table1 VALUES (3,'C','uni6')
INSERT INTO Table1 VALUES (4,'D','uni2')
INSERT INTO Table2 VALUES (3,'E','uni1')
INSERT INTO Table2 VALUES (4,'F','uni3')
INSERT INTO Table2 VALUES (5,'G','uni5')
INSERT INTO Table2 VALUES (6,'H','uni7')
Once you’re done, let’s execute some queries using left join
and see the results we get. Let’s start with the below query:
SELECT *
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID
In the above query, we see all the records in Table1
along with the matched records in Table2
that have the ID
values of 3 and 4. Because the ID
values of 1 and 2 do not exist in Table2
, we get the second set of ID/Letter/Val
columns with NULL values because there’s no match. A LEFT JOIN
will return all values in the initial table along with the values that match in the joined data set. If there are values that don’t match in the joined data set, we get NULL values.
What do you think will happen if we execute the below query?
SELECT *
FROM Table2 t2
LEFT JOIN Table1 t1 ON t1.ID = t2.ID
With this query we see that we get all values from Table2
. We know that ID
values 3 and 4 match between the two tables and we see these values return from Table1
. But Table1
does not have ID
values of 5 and 6, so these values do not return. When we want to return a full data set along with another data set that may match or not, we would use a LEFT JOIN
— we see this behavior in both of these example queries.
We’ll illustrate this point with the below query. What do you think will be the result of this query?
SELECT *
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.Letter = t2.Letter
All values from Table1
return, but everything for Table2
is NULL. This is because there are no matched values for the column Letter in Table2
.
Pandas
We’ll do the same thing using Python pandas. In the below code we first create two pandas’ dataframes that we’ll later join. As we see, we match the data that we have in the SQL tables. This helps keep this demonstration easy to review; you can edit these dataframes to create a more complex join.
tableOne = {
"id": [1,2,3,4],
"letter": ["A","B","C","D"],
"val": ["uni2","uni4","uni6","uni8"]
}
tableOneDf = pd.DataFrame(tableOne)
tableTwo = {
"id": [3,4,5,6],
"letter": ["E","F","G","H"],
"val": ["uni1","uni3","uni5","uni7"]
}
tableTwoDf = pd.DataFrame(tableTwo)
When we left join a data set, we will get the full data set of the initial table we select from alongside any matches on the second table. As we’ll see when we run the example below this, we’ll get all of tableOneDf
as well as the id
values for 3 and 4 that are also in tableTwoDf
alongside with NaN
values because there are no matches for id
values of 1 and 2.
leftJoin = tableOneDf.merge(tableTwoDf, on="id", how="left")
print(leftJoin)
Notice how this result (with 2 matches and 2 NaN
results) differs from the below left join:
leftJoin = tableOneDf.merge(tableTwoDf, on="letter", how="left")
print(leftJoin)
We get everything back from tableOneDf
, but have nothing return for tableTwoDf
(this is the identical result to the T-SQL in the above code).
Pyspark
We’ll build LEFT JOINs with the same data using Python pyspark. In the below code we create two spark dataframes that we’ll later left join. As we see, we match the data that we have in the SQL tables. This helps keep this demonstration easy to review; you can edit these dataframes to create a more complex join. As a quick note here since we are using pyspark — we can create a pyspark dataframe from a pandas’ dataframe (in this example, we won’t do this).
tableOne = sc.parallelize([{"id": 1, "letter": "A", "val": "uni2"},{"id": 2, "letter": "B", "val": "uni4"},{"id": 3, "letter": "C", "val": "uni6"},{"id": 4, "letter": "D", "val": "uni8"}])
tableTwo = sc.parallelize([{"id": 3, "letter": "E", "val": "uni1"},{"id": 4, "letter": "F", "val": "uni3"},{"id": 5, "letter": "G", "val": "uni5"},{"id": 6, "letter": "H", "val": "uni7"}])
tableOneDf = spark.createDataFrame(tableOne)
tableTwoDf = spark.createDataFrame(tableTwo)
As we’ve seen in the above examples with T-SQL and pandas, we’ll LEFT JOIN
the two dataframes by the id column and get the same result that we would expect to get that also matches the other examples in this post:
leftJoin = tableOneDf.join(tableTwoDf, "id", "left")
display(leftJoin)
We get every record from tableOneDf
and as we know, only the id
values of 3 and 4 match between the two tables, so these values from tableTwoDf
return while the other values do not. In addition, the values from tableOneDf
that don’t have matches are NULL for the columns that should come from the other dataframe (but these don’t exist, so they’re NULLs). Let’s do the same join with the letter
column instead and see the result:
leftJoin = tableOneDf.join(tableTwoDf, tableOneDf.letter == tableTwoDf.letter, "left")
display(leftJoin)
Every record returns from tableOneDf
, but nothing returns from tableTwoDf
(ie: notice how all values are NULL) because the letters between the dataframes never match. A LEFT JOIN
will always return the initial dataframe just like we saw when we ran the T-SQL statement on the tables. What if we try to LEFT JOIN on both the letter
and id
columns?
leftJoin = tableOneDf.join(tableTwoDf, (tableOneDf["letter"] == tableTwoDf["letter"]) & (tableOneDf["id"] == tableTwoDf["id"]), "left")
display(leftJoin)
We see the same result that we saw in the previous query — no records match between the dataframes on both the id
and letter
columns.
For more SQL lessons, you can continue to watch the SQL Basics series.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.