The Most Common Joins Solved In SQL, Pandas and Pyspark
The most common joins I’ve seen and created are inner joins, left joins, left anti joins, outer joins, cross joins and left semi joins. The posts linked below show three different ways to create each of these, using SQL, pandas or pyspark.
Links To SQL, Pandas and Pyspark Joins
Each post below has code you can execute, along with examples for each environment (SQL, pandas, pyspark), and each also has a video if you want to see the code run live.
- Inner join: Video | Post
- Left join: Video | Post
- Left anti join: Video | Post
- Outer join: Video | Post
- Left semi join: Video | Post
- Cross join: Video | Post
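The linked posts walk through each of these in detail; as a quick reference, below is a minimal pandas sketch of the same joins using made-up customers and orders tables (the table and column names are purely for illustration).

import pandas as pd

# Hypothetical example tables; the names and values are made up for illustration
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Ben", "Cal"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "order_total": [50, 25, 75]})

inner = customers.merge(orders, on="customer_id", how="inner")   # matching rows only
left = customers.merge(orders, on="customer_id", how="left")     # all customers, order columns where matched
outer = customers.merge(orders, on="customer_id", how="outer")   # all rows from both sides
cross = customers.merge(orders, how="cross")                     # every combination (pandas >= 1.2)

# Left anti join: customers with no orders, using the merge indicator
anti = customers.merge(orders, on="customer_id", how="left", indicator=True)
anti = anti[anti["_merge"] == "left_only"].drop(columns=["_merge", "order_total"])

# Left semi join: customers with at least one order, keeping only customer columns
semi = customers[customers["customer_id"].isin(orders["customer_id"])]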
Ease of Development
In terms of development speed and ease (not performance):
- Pandas is by far the easiest to install and use. With Python installed, you can add pandas in under five minutes and start joining and working with data sets. Pandas also integrates well with machine learning libraries.
- Pyspark requires a Spark environment and can easily integrate with pandas or use SQL (see the sketch after this list). Spark also works with a multitude of databases, so a Spark environment plus the related libraries lets you pull data from a wide variety of sources.
- The SQL syntax shown is supported in both T-SQL and Spark SQL, with rare exceptions. SQL dialects follow a standard, so where nuances exist the conversion from one dialect to another is usually easy. However, SQL requires database access, or in the case of Spark SQL, a Spark environment.
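As a rough sketch of that flexibility (the table and column names here are made up), the same inner join can be written with either the dataframe API or Spark SQL once a Spark session exists:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Hypothetical data; in practice this could come from a database, files, or a pandas dataframe
customers = spark.createDataFrame([(1, "Ada"), (2, "Ben")], ["customer_id", "name"])
orders = spark.createDataFrame([(1, 50), (1, 25)], ["customer_id", "order_total"])

# The join expressed with the dataframe API...
joined_df = customers.join(orders, on="customer_id", how="inner")

# ...or with Spark SQL against temp views
customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")
joined_sql = spark.sql("""
    SELECT c.customer_id, c.name, o.order_total
    FROM customers c
    INNER JOIN orders o ON c.customer_id = o.customer_id
""")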
Pandas To Pyspark Or Vice Versa
While each post shows both pandas and pyspark examples from data creation through the actual join, that creation step is not needed if you already have a dataframe in one of the two and you’ve defined your schema appropriately, either without NULL values or by handling NULL values explicitly.
- To create a pyspark dataframe from a pandas dataframe:
sparkDf = spark.createDataFrame(pandasDf)
- To create a pandas dataframe from a pyspark dataframe:
pandasDf = sparkDf.toPandas()
One great catch by X user @GregSWils is that NULL values will cause issues when you go from a pandas dataframe to a Spark dataframe. In general, it’s good practice to be specific in your definitions any time you move from one data platform to another, as they do not always read data types the same way.
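As a minimal sketch of being explicit when converting (the column names and types below are assumptions for illustration), you can pass a schema to createDataFrame rather than relying on inference:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical pandas dataframe with a missing string value
pandasDf = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["east", None, "west"]})

# Being explicit about the schema (and nullability) avoids surprises from type inference
schema = StructType([
    StructField("customer_id", LongType(), nullable=False),
    StructField("region", StringType(), nullable=True),
])
sparkDf = spark.createDataFrame(pandasDf, schema=schema)

# If numeric columns can contain NaN, either fill them first or declare a nullable floating type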
SQL Language Nuance
While most of the examples cover Spark SQL or T-SQL, there are some exceptions (at the time of this post, T-SQL does not support a left anti join). As long as you understand how to combine cartesian products for a specific result, the adjustments to the join should be minor. That said, there will always be some nuances between SQL dialects like T-SQL, PL/SQL, U-SQL, etc.
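For example, here is a rough Spark SQL sketch (with made-up tables) of a left anti join next to the equivalent left-join-and-filter pattern, which is the form that also runs in T-SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables registered as temp views for illustration
spark.createDataFrame([(1, "Ada"), (2, "Ben")], ["customer_id", "name"]).createOrReplaceTempView("customers")
spark.createDataFrame([(1, 50)], ["customer_id", "order_total"]).createOrReplaceTempView("orders")

# Spark SQL supports LEFT ANTI JOIN directly
anti_spark = spark.sql("""
    SELECT c.customer_id, c.name
    FROM customers c
    LEFT ANTI JOIN orders o ON c.customer_id = o.customer_id
""")

# Equivalent pattern that also works in T-SQL: left join, then keep only the unmatched rows
anti_portable = spark.sql("""
    SELECT c.customer_id, c.name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NULL
""")

Both queries return the customers with no matching orders; the second trades the dedicated keyword for a pattern that every major dialect supports.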
Performance
Joins will expand your cartesian product as you add more tables or dataframes. This holds especially true for joins such as LEFT JOINs or CROSS JOINs, where you can end up in a one-to-many situation. No language will serve you well if your data keeps growing and you never scale. I’ve witnessed 100,000 records from an initial data set destroy the performance of top-performing languages like C++ because a LEFT JOIN resulted in billions of records. The fundamental problem was scale: if you know that each record may produce 10,000+ combinations, you have to scale early. Languages like Python, Scala, etc. will not keep up with these data sets if you ignore scaling. A case in point: with Python pandas, a multi-level analysis on a data set of less than 10 megabytes can cause a 3–5 second pause on a fairly average laptop.
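One cheap habit that follows from this: estimate the join output size before running the join. Below is a minimal pandas sketch of that back-of-the-envelope check with made-up data, counting rows per join key on each side and multiplying.

import pandas as pd

# Hypothetical inputs: a small left table and a right table with heavily skewed keys
left = pd.DataFrame({"key": [1, 2, 3] * 1000})          # 3,000 rows
right = pd.DataFrame({"key": [1] * 5000 + [2] * 5000})  # 10,000 rows

left_counts = left["key"].value_counts()
right_counts = right["key"].value_counts().reindex(left_counts.index, fill_value=0)

# Unmatched left keys still produce one output row each in a LEFT JOIN
estimated_rows = (left_counts * right_counts.clip(lower=1)).sum()
print(estimated_rows)  # 10,001,000 rows from two small inputs

If the estimate comes back in the billions, that is the signal to scale out, or to aggregate and filter first, rather than hoping the join finishes.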
In the case of Databricks, Synapse or Snowflake, you can always scale vertically, and sometimes that will help, but you will have to scale horizontally at some point if your data set is growing. The one condition where this doesn’t apply is when the data set only looks at “recent” data without history; that is much less of a problem, provided the recent data set isn’t growing in volume (e.g., recent daily stock prices versus last year’s customer growth).
Other Transformations
I’ve solved some of the common transformations with pandas, pyspark and SQL here; they don’t involve joins, but they do cover the windowing and data-level transforms that tend to come up frequently (varchars to ints, datetime formats, etc.).
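As a small, hypothetical illustration of the kind of data-level transforms those posts cover (the column names below are made up):

import pandas as pd

# Hypothetical raw data arriving as strings
df = pd.DataFrame({"amount": ["10", "25", "7"], "order_date": ["2021-01-05", "2021-02-10", "2021-03-15"]})

df["amount"] = df["amount"].astype(int)                    # string -> int
df["order_date"] = pd.to_datetime(df["order_date"])        # string -> datetime
df["order_month"] = df["order_date"].dt.strftime("%Y-%m")  # reformat for reporting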
Note: all images in the post are either created through actual code runs or sourced from Pixabay. The written content of this post is copyrighted; all rights reserved. None of the written content in this post may be used in any artificial intelligence.