How To Write A FULL (OUTER) JOIN

SQLInSix Minutes
6 min read · Jul 29, 2021

We’ve looked at the LEFT and RIGHT JOINs. These can be useful relative to the result set we want, though RIGHT JOINs are less common in development and may cause confusion (we can invert the tables and perform a LEFT JOIN for the same results). What if we want the result set of both of these combined, in other words, the result set for the LEFT JOIN and also the result set for the RIGHT JOIN? We can achieve what we’ve seen in past videos by combining these joins with the FULL OUTER JOIN. In the video, SQL Basics: How To Use A FULL OUTER JOIN and Why, we do exactly this with our test data set.

Some questions that are answered in the video:

  • Note the tables that we’re using, which values are identical based on the column names, and which values differ. We’ll be using these tables throughout these videos. As a note on aliasing tables: the first table always gets a “t1”, the second table a “t2”, and so on (the fifth table would get a “t5”). Finally, consider that some SQL dialects may not support the JOINs that we discuss.
  • In comparing a LEFT JOIN and a RIGHT JOIN, how will the result of the FULL OUTER JOIN differ?
  • When we create the FULL OUTER JOIN on Id, what’s the result?
  • What would be the result if we used the column Letter?

How could this be useful? Reviewing our test data set, the final result shows where Ids have values on one side with NULLs on the other, and where the values align. Look at the values 3 and 4: we have matches. However, we don’t see this for 1 and 2 (the left Ids return while the right side is NULL because there are no matches on the right). Likewise, the right table’s Ids 5 and 6 return while they are not present in the left table. This means that with a FULL OUTER JOIN we can get the rows that exist only on the left side, only on the right side, and where both sides align. This can be extremely useful in filtering or analysis where we want both the matching and non-matching values from both tables. While this certainly comes up less often than LEFT JOINs, it is worth knowing how to use when you need it.
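To make the combination concrete, here’s a minimal Python sketch (hypothetical, modeling only the id values from the test tables we use below) of which ids each join keeps:

```python
# Which ids survive each join, modeled with plain Python sets.
# The id values mirror the test tables used later in this post.
left_ids = {1, 2, 3, 4}
right_ids = {3, 4, 5, 6}

inner = left_ids & right_ids       # INNER JOIN: matches only -> {3, 4}
left_join = left_ids               # LEFT JOIN: all left ids; 1 and 2 pair with NULLs
right_join = right_ids             # RIGHT JOIN: all right ids; 5 and 6 pair with NULLs
full_outer = left_ids | right_ids  # FULL OUTER JOIN: the union of both sides

print(sorted(full_outer))  # [1, 2, 3, 4, 5, 6]
```

The FULL OUTER JOIN result is simply the union: the LEFT JOIN’s rows plus the RIGHT JOIN’s rows, with matches counted once.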

T-SQL

To practice this exercise with FULL OUTER JOINs, we’ll create the two tables that you see in the video along with the test data. You can use the queries below to create the tables and insert the data.

-- Schema inferred from the queries below; adjust data types as needed
CREATE TABLE Table1 (Id INT, Letter VARCHAR(1), Val VARCHAR(10))
CREATE TABLE Table2 (Id INT, Letter VARCHAR(1), Val VARCHAR(10))

INSERT INTO Table1 VALUES (1,'A','uni2')
INSERT INTO Table1 VALUES (2,'B','uni4')
INSERT INTO Table1 VALUES (3,'C','uni6')
INSERT INTO Table1 VALUES (4,'D','uni8')

INSERT INTO Table2 VALUES (3,'E','uni1')
INSERT INTO Table2 VALUES (4,'F','uni3')
INSERT INTO Table2 VALUES (5,'G','uni5')
INSERT INTO Table2 VALUES (6,'H','uni7')

Once the tables are populated, we’ll copy the query that we see in the video and perform our FULL OUTER JOIN on the id column:

SELECT *
FROM Table1 t1
FULL OUTER JOIN Table2 t2 ON t1.ID = t2.ID

As we see in our results, we get 6 records, with the id values 3 and 4 matching in both tables. As we saw in the article on INNER JOINs, How To Write An Inner Join and Why, these are the two rows that match between the tables. However, the id values that have no match in the other table (1, 2, 5, 6) return with NULL values on the side where there is no match. Let’s take the same query and change the FULL OUTER JOIN condition to use the column letter instead of id.

SELECT *
FROM Table1 t1
FULL OUTER JOIN Table2 t2 ON t1.Letter = t2.Letter

In this case, we get 8 records back and there are no matches between the tables, as we would expect. Remember that when we used an INNER JOIN with this condition on these tables, no records returned. A FULL OUTER JOIN returns all matches and non-matches between tables.

Pandas

We’ll do the same thing using Python pandas. In the below code we first create the two pandas dataframes that we’ll later join. As we see, the data matches what we have in the SQL tables. This keeps the demonstration easy to review; you can edit these dataframes to create a more complex join.

import pandas as pd

tableOne = {
    "id": [1, 2, 3, 4],
    "letter": ["A", "B", "C", "D"],
    "val": ["uni2", "uni4", "uni6", "uni8"]
}

tableOneDf = pd.DataFrame(tableOne)

tableTwo = {
    "id": [3, 4, 5, 6],
    "letter": ["E", "F", "G", "H"],
    "val": ["uni1", "uni3", "uni5", "uni7"]
}

tableTwoDf = pd.DataFrame(tableTwo)

When we do a FULL OUTER JOIN with pandas, we only need to specify outer in the how parameter of merge. Replicating what we’ve done in the T-SQL, let’s perform this OUTER JOIN with pandas using the id column between the dataframes:

outerJoin = tableOneDf.merge(tableTwoDf, on="id", how="outer")

Because the id column name is identical in both dataframes, we only need to name it once in the on parameter (the other overlapping columns return with _x and _y suffixes). When we print the results, we’ll see the same outcome as in T-SQL: 6 records return, with NaN values appearing where there are no matches between the dataframes. If we change our condition to use the letter column instead of the id column, we’ll get 8 results, with NaN values wherever one dataframe’s data does not exist in the other:

outerJoin = tableOneDf.merge(tableTwoDf, on="letter", how="outer")
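To double-check both merges, here’s a self-contained sketch (it rebuilds the same dataframes from above) that counts the rows and the NaN padding:

```python
import pandas as pd

# Same test data as above.
tableOneDf = pd.DataFrame({"id": [1, 2, 3, 4],
                           "letter": ["A", "B", "C", "D"],
                           "val": ["uni2", "uni4", "uni6", "uni8"]})
tableTwoDf = pd.DataFrame({"id": [3, 4, 5, 6],
                           "letter": ["E", "F", "G", "H"],
                           "val": ["uni1", "uni3", "uni5", "uni7"]})

# Outer merge on id: 6 rows; the overlapping letter/val columns get _x/_y suffixes.
onId = tableOneDf.merge(tableTwoDf, on="id", how="outer")
print(len(onId))                           # 6
print(int(onId["letter_y"].isna().sum()))  # 2 -> ids 1 and 2 have no right-side match

# Outer merge on letter: no letters match, so every row from both sides returns.
onLetter = tableOneDf.merge(tableTwoDf, on="letter", how="outer")
print(len(onLetter))                       # 8
```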

As we see with both versions of the OUTER JOIN, the result set grows: every row from both dataframes returns, matched or not, and duplicate keys on both sides multiply the matches. Keep this in mind when it comes to performance; you do not want to be caught off guard when scaling a data set that will need an OUTER JOIN.

Pyspark

We’ll do the same thing using Python pyspark. In the below code we create two spark dataframes from two data sets that we’ll later join. As we see, we’re using the same data that we used in the T-SQL and pandas demos above. You can take the structure of the data and create a more complex data set if you want to test a more complex join than what we test. As a quick note here since we are using pyspark: we can create a pyspark dataframe from a pandas dataframe using the syntax ourNewSparkDf = spark.createDataFrame(ourPandasDf), where ourPandasDf is the pandas dataframe. For this section, we will create a pyspark dataframe without pandas, but it is worth noting that if you have a data set in pandas, you can create a spark dataframe from it.

tableOne = sc.parallelize([
    {"id": 1, "letter": "A", "val": "uni2"},
    {"id": 2, "letter": "B", "val": "uni4"},
    {"id": 3, "letter": "C", "val": "uni6"},
    {"id": 4, "letter": "D", "val": "uni8"}
])
tableTwo = sc.parallelize([
    {"id": 3, "letter": "E", "val": "uni1"},
    {"id": 4, "letter": "F", "val": "uni3"},
    {"id": 5, "letter": "G", "val": "uni5"},
    {"id": 6, "letter": "H", "val": "uni7"}
])
tableOneDf = spark.createDataFrame(tableOne)
tableTwoDf = spark.createDataFrame(tableTwo)

We’ve already seen how to perform an OUTER JOIN in both T-SQL and pandas, so we’ll copy both possible commands, one that uses the id column and one that uses the letter column, so that we can display the results.

fullJoin = tableOneDf.join(tableTwoDf, "id", "outer")  # note that you'll get the same result if you use "full" or "fullouter"
fullJoin = tableOneDf.join(tableTwoDf, tableOneDf["letter"] == tableTwoDf["letter"], "outer")

A key observation here is that we can specify the OUTER JOIN by using outer, full or fullouter.

When might we use a FULL OUTER JOIN? One popular use case is when we have data sets that should match, yet a subset of the records does not. A FULL OUTER JOIN lets us see both the matching records and the non-matching records so that we can adjust the data until both sides align. This is not the only use case, but it is one I’ve seen repeated in several instances.
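In pandas, this kind of audit is straightforward with the indicator option on merge (a sketch using the id values from our test tables; indicator is a standard merge parameter):

```python
import pandas as pd

tableOneDf = pd.DataFrame({"id": [1, 2, 3, 4]})
tableTwoDf = pd.DataFrame({"id": [3, 4, 5, 6]})

# indicator=True adds a _merge column labeling each row as
# "both", "left_only", or "right_only".
checked = tableOneDf.merge(tableTwoDf, on="id", how="outer", indicator=True)
mismatches = checked[checked["_merge"] != "both"]
print(sorted(mismatches["id"].tolist()))  # [1, 2, 5, 6]
```

Filtering on the _merge column surfaces exactly the records that need adjustment before the two data sets can fully align.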

General Considerations

In general, I would avoid using Python for analyzing data in sets. While Python has its strengths, general-purpose languages tend to perform poorly relative to SQL engines, which are optimized for set-based operations on data.

For more SQL lessons, you can continue to watch the SQL Basics series.

Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.
