How To Write A Self Join and Why
As we’ve seen with SQL joins, we can link data to other data using relationships between data, such as a relationship between the history of a sales’ price for a particular house and the details of the particular house. In these cases, the data have relationships with other related data (a home, its details, its price, etc). We use the variety of joins to tie these related data together based on the conditions that we specify. Specific data points also relate to themselves, even if they don’t relate to other data. For instance, if we only had data on the history of a particular home sales’ price, we would have current and past sales’ values. This means that a particular datum from that set would be a value after a previous value and possibly with a next value (if the specific data point is the most recent). When we reflect over this example, we’ll see that it applies in a similar manner to every data set. In the video, SQL Basics: How To Use A Self Join and Why, we look at joining a table on to itself and the situations where we would consider using this technique for relationships of data among themselves.
Some questions that are answered in the video:
- For this video, we use a self join table since our join will only involve one table. Since we join the table onto itself, any column that we use in the join condition will match provided that we don’t change any part of the condition (which we do in our example and I show why).
- As a quick reminder, not all SQL languages may support the exact syntax shown.
- In the example, what problem are we solving with a self join? How is this useful with any data where we’re looking at the data set alone (without any external relationships)?
- Consider a simple example of where we can see this: in a workout where we do 5 sets of squats, the amount of repetitions may differ. With a self join, we can compare the percent change (if applicable) of how many repetitions we are able to complete relative to other sets. So, if we do 10 repetitions in our 1rst set and 12 repetitions in our 2nd set, we saw an increase in 20% from the 1rst to the 2nd set. The relationship between those data points is a previous-next relationship.
All data have relationships to even the data itself; even data as simple as weather can be looked at from the view of the weather today versus the weather yesterday.
As we see with the self join, the condition matters. We can vary the condition to compare to the previous, latter or other values. We do not need to use a self join if adding derived columns from an existing column — we can simply add those in our SQL statement (I’ve seen some examples where this is done, but it’s not needed).
T-SQL
In this exercise, we’ll be using a small data set and because we are joining the table to itself, we’ll only need to create the one table.
INSERT INTO SelfTable (ID,Val)
VALUES (1,10)
, (2,12)
, (3,11)
, (4,14)
, (5,18)
A basic SELF JOIN with the above table would simply be joining the table to itself using the ID
column:
SELECT *
FROM SelfTable t1
INNER JOIN SelfTable t2 ON t1.ID = t2.ID
We get duplicate columns with this query — the ID
and Val
columns are repeated. As a note here, this is never the context in which we would use a SELF JOIN
, but does show a basic one. If we needed duplicate columns for whatever reason we could simply run a command like SELECT ID, ID AS ID…
but even that is a use case that is never needed.
However, sometime we use a SELF JOIN
when we compare data. In the below query, the ID
and Val
columns from the initial table show next to the ID
and Val
columns of the next value from the joined table, which as we know is the same table. This means that the ID
value of 1 from the initial table is shown next to the ID
value of 2 from the joined table — the same table because it’s a SELF JOIN
:
SELECT *
FROM SelfTable t1
INNER JOIN SelfTable t2 ON t1.ID = (t2.ID - 1)
We compare data values frequently, so the above use case is something worth knowing.
Pandas
For joining a data set to itself, we’ll use the below test data set for this example with pandas.
selfTable = {
"id": [1,2,3,4,5],
"val":[10,12,14,16,18]
}
selfTableDf = pd.DataFrame(selfTable)
It’s worth remembering a pattern that we’ve seen when two or more dataframes are joined and have identically named columns, as this will apply to SELF JOIN
s. Remember that the join column condition will return, whereas the non-joined columns, if named the same, will return with an underscore _x and so on depending on how many similarly named columns exist. When we SELF JOIN
a data set, we are joining a data set that is identical: itself. We’ll see this result in our first join:
selfJoin = selfTableDf.merge(selfTableDf, on="id", how="inner")
print(selfJoin)
Notice how the result is the id column plus a val_x
and val_y
column. Now, we’ll perform the same query that we used in T-SQL where we joined the table to itself except with the joined table we subtracted 1 from the id
column. This means that 1 for the initial table would equal 2 for the joined table. We’ll perform the same functionality with dataframes where the initial id
column of 1 would equal the joined id
column of 2:
selfJoinDf = selfTableDf.merge(selfTableDf.assign(id=selfTableDf["id"] - 1), on="id", how="inner")
print(selfJoinDf)
First, with pandas we must update the dataframe so that the id
column in the joined dataframe holds id
values that are 1 less than the original. To do this, we use the assign function, which will create new columns or update existing columns in our dataframe (as we see, each id value has 1 subtracted from it). We’ll note that only the id
column is affected here; nothing else is. This means that this joined dataframe will have id
values starting at 0 (can’t be inner joined because it doesn’t exist in the original dataframe) to 4. In the output, we see the val_x
and val_y
columns with the values that we’d expect to see given that the joined dataframe’s id
columns are subtracted by 1.
In practical application, a SELF JOIN
is generally used to compare values. For instance, comparing the price of a value from time period to another time period could be done in a self join relative to the data architecture. Depending on which values we want to compare, we may subtract (or add) higher ranges of values, whereas in our example case we only subtract 1.
Pyspark
Using the same data set we’ve used in the T-SQL example, we’ll create a pyspark dataframe with the two columns of id
and val
. We also create a temporary view in the below code so that you can write Spark SQL, however, this is not shown as the Spark SQL will match the T-SQL code we have above this.
tableThree = sc.parallelize([{"id": 1, "val": 10},{"id": 2, "val": 12},{"id": 3, "val": 11},{"id": 4, "val": 14},{"id": 5, "val": 18}])
tableThreeDf = spark.createDataFrame(tableThree)
tableThreeDf.createOrReplaceTempView("tableThreeDf") ### For SQL
Now that our pyspark dataframe is created, let’s do a SELF JOIN
on the id column:
selfJoin = tableThreeDf.join(tableThreeDf, "id", "inner")
Similar to what we’ve shown above this, we get the same result where we have all five records with the differing val
columns side-by-side, though in the case of pyspark we have two val
columns (it does not rename both like pandas does).
Now, we want to write a SELF JOIN
where we get the id
value of 1 from the initial dataframe and the id
value of 2 from the joined dataframe. We’ll see that the syntax for pandas and pyspark differs quite a bit. First, we’ll alias our initial dataframe with t1
(like we do with the T-SQL) and when we perform the join, we’ll alias our joined dataframe with t2
. With these aliased, the SELF JOIN
becomes intuitive.
selfJoin = tableThreeDf.alias("t1").join(tableThreeDf.alias("t2"),col("t1.id") == col("t2.id")-1,"inner")
One point here is that we are making this code more readable. One challenge with SELF JOIN
s relative to how they’re written is that they can sometimes seem counterintuitive in code. You may see comments around them because of this. That is one good practice, but another good practice is to simply be clear in your code that you are doing a SELF JOIN
.
General Considerations
In general, I would avoid using Python for analyzing data in sets. While Python may have some strengths, like all object oriented programming languages, it performs extremely poorly relative to SQL languages when it comes to sets of data.
For more SQL lessons, you can continue to watch the SQL Basics series.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.