How To Write An Inner Join and Why

SQLInSix Minutes
7 min readJun 9, 2021
Let’s learn about INNER JOINs!

When storing data, one technique to storing data is linking information through identification of the data. This is the dominant way that SQL engines operate and within SQL engines comes join functionality — one of which is the inner join. In the video, SQL Basics: How To Use An INNER JOIN and Why, we look at how to write an inner join with an example of where we’d use it.

Some questions that are answered in the video:

  • Note the tables that we’re using and what values are identical based on the column names and what values differ based on the column names. We’ll be using these tables throughout these videos.
  • In first our example, what do we INNER JOIN on? How might that differ if we use another column? Why?
  • In our second example, what do we INNER JOIN on? What’s the result compared with the first result? Why did this occur?
  • What is a distinct value?

One quick note about inner joins is that if one table has multiples of the join condition, the result will be multiples because of how it functions — it returns matches. In our example, if Table2 had 2 3s for Id, then we would have 2 3s as a result because one of the tables had multiples.

A practical example of an INNER JOIN would be if we had two tables, one of which stored a list of homes available for sale in an area with a unique identifier for each of the homes and another table stored the pricing history for all the unique identifiers of the homes. If we wanted to get the pricing history for homes, we would join the two tables on the unique identifier. This example also illustrates where we would use foreign key relationships, as we can’t have a price of a home that doesn’t exist — thus a home would have to first exist before we could store its pricing history. In this example, there could be a one-to-many relationship, as one home could have multiple prices listed in its history (very common, unless a home is new and hasn’t been sold).

T-SQL

To practice this exercise, create the two tables that you see in the video with the test data. You can also use the query below this.

INSERT INTO Table1 VALUES (1,'A','uni2')
INSERT INTO Table1 VALUES (2,'B','uni4')
INSERT INTO Table1 VALUES (3,'C','uni6')
INSERT INTO Table1 VALUES (4,'D','uni2')

INSERT INTO Table2 VALUES (3,'E','uni1')
INSERT INTO Table2 VALUES (4,'F','uni3')
INSERT INTO Table2 VALUES (5,'G','uni5')
INSERT INTO Table2 VALUES (6,'H','uni7')

Once you’re done, let’s execute some queries using inner join and see what results we get.

SELECT t1.ID
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.ID = t2.ID

When we execute the above T-SQL, we see that we get 2 records — ID values of 3 and 4. When we review the data, we see that these are the only 2 records that match between the tables when we join on the condition of the ID column. For an example, if we joined the tables on the Letter column, we’d get no results:

SELECT t1.Letter
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.Letter = t2.Letter

Nothing returns because there’s no overlap in the Letter column between the tables. The same would apply if we joined on the column Val because like Letter there is no overlap.

What if we try to join with two conditions — ID and Letter columns?

SELECT t1.ID
FROM Table1 t1
INNER JOIN Table2 t2 ON t1.ID = t2.ID
AND t1.Letter = t2.Letter

Notice that we get nothing. That’s because there is no overlap between the tables with this join condition, even if one of the columns has a match, the other does not. In the case of this INNER JOIN, both must be true.

In most SQL languages, the syntax will be similar to the T-SQL that we see above this. The logic is the same when it comes to matching values and conditions.

Pandas

We’ll do the same thing using Python pandas. In the below code we first create two pandas’ dataframes that we’ll later join. As we see, we match the data that we have in the SQL tables. This helps keep this demonstration easy to review; you can edit these dataframes to create a more complex join.

tableOne = {
"id": [1,2,3,4],
"letter": ["A","B","C","D"],
"val": ["uni2","uni4","uni6","uni8"]
}

tableOneDf = pd.DataFrame(tableOne)

tableTwo = {
"id": [3,4,5,6],
"letter": ["E","F","G","H"],
"val": ["uni1","uni3","uni5","uni7"]
}

tableTwoDf = pd.DataFrame(tableTwo)

Like what we did with the T-SQL, we’ll INNER JOIN on the id column in both dataframes and print our result. We will see an index column when we print these values. Python pandas adds an untitled index column to dataframes to the left of the first titled column. This index column starts at 0, so in the result you will see 0 for the first row, then 1 for the second row.

innerJoin = tableOneDf.merge(tableTwoDf, on="id", how="inner")
print(innerJoin)

As we see, the id values of 3 and 4 match between the two dataframes. Because these columns are named identical, pandas adds and underscore to the column names that are identical, thus we see letter_x, val_x, letter_y and val_y. This is because we’re printing out the full dataframe and the id column doesn’t need to be repeated since it matches (if we used a different INNER JOIN condition, the same would apply where the condition column would only be shown once).

If we try to the same join using the column letter instead, we get an empty dataframe:

innerJoin = tableOneDf.merge(tableTwoDf, on="letter", how="inner")
print(innerJoin)

The same applies if we use multiple conditions with id and letter— the dataframe will be empty because that condition results in 0 matches.

Pyspark

We’ll do the same thing using Python pyspark. In the below code we create two spark dataframes from two data sets that we’ll later join. As we see, we’re using the same data that we used in the T-SQL and pandas demo in the above code. You can take the structure of the data and create a more complex data set if you want to test a more complex join than what we test. As a quick note here since we are using pyspark — we can create a pyspark dataframe from a pandas’ dataframe using the syntax of , ourNewSparkDf = spark.createDataFrame(ourPandasDf) where ourPandasDf is the pandas’ dataframe. For this section, we will create a pyspark dataframe without pandas, but it is worth noting that if you have a data set in pandas, you can create a spark dataframe from it.

tableOne = sc.parallelize([{"id": 1, "letter": "A", "val": "uni2"},{"id": 2, "letter": "B", "val": "uni4"},{"id": 3, "letter": "C", "val": "uni6"},{"id": 4, "letter": "D", "val": "uni8"}])
tableTwo = sc.parallelize([{"id": 3, "letter": "E", "val": "uni1"},{"id": 4, "letter": "F", "val": "uni3"},{"id": 5, "letter": "G", "val": "uni5"},{"id": 6, "letter": "H", "val": "uni7"}])
tableOneDf = spark.createDataFrame(tableOne)
tableTwoDf = spark.createDataFrame(tableTwo)

As we’ve seen in the above examples, we’ll INNER JOIN the two dataframes by the id column and get the same result that we would expect to get that also matches the other examples:

innerJoin = tableOneDf.join(tableTwoDf, "id", "inner")
display(innerJoin)

We also see that we get the same result if we INNER JOIN on the letter column — there are no matches with this column:

innerJoin = tableOneDf.join(tableTwoDf, tableOneDf.letter == tableTwoDf.letter, "inner")
display(innerJoin)

What if we try to INNER JOIN on both the letter and id columns? We get no matches because while 2 id values overlap (3 and 4), none of the letter columns overlap.

innerJoin = tableOneDf.join(tableTwoDf, (tableOneDf["letter"] == tableTwoDf["letter"]) & (tableOneDf["id"] == tableTwoDf["id"]), "inner")
display(innerJoin)

As a note here — and this would apply to SQL and pandas too, though I only demonstrate the logic here in pyspark — we can add an “or” condition instead of an “and” condition in our join and we will get the two values that we’d expect to get back — id values of 3 and 4.

innerJoin = tableOneDf.join(tableTwoDf, (tableOneDf["letter"] == tableTwoDf["letter"]) | (tableOneDf["id"] == tableTwoDf["id"]), "inner")
display(innerJoin)

The reason for this is that OR means only 1 of the conditions must be true. While not as common, there may be times that we want to join with an “or” condition instead of an “and” condition. I will only note here that when writing SQL (in any language), I prefer to write this in the where clause (ie: the join only has the match conditions while the where clause will filter in the appropriate data) because it will perform better and is more intuitive to read and debug. An example of this in practice can be found here by Eric Cobb that highlights the issue in Microsoft SQL Server and will vary in other SQL languages — the key is to be aware of how this performs ahead of time.

General Considerations

In general, I would avoid using Python for analyzing data in sets. While Python may have some strengths, like all object oriented programming languages, it performs extremely poorly relative to SQL languages when it comes to sets of data.

For more SQL lessons, you can continue to watch the SQL Basics series.

Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.

--

--