How To Write A CROSS JOIN and Why

SQLInSix Minutes
6 min read · Aug 5, 2021
How to write a CROSS JOIN and why

When we look at data, we sometimes want to compare every data point against every other data point. We discussed this when we covered CROSS APPLY, but CROSS APPLY pertains to SQL Server only, while CROSS JOINs are available in many SQL dialects. With this functionality, we can look at every record relative to every other record — for instance, if we have data for 12 months, we can compare each month against all months (including itself). In the video, SQL Basics: How To Use A CROSS JOIN and Why, we look at an example of using this functionality.

Some questions that are answered in the video:

  • Note the tables that we’re using, which values are identical based on the column names, and which values differ. We’ll be using these tables throughout these videos. As a note on aliasing tables: the first table always gets the alias “t1”, the second “t2”, and so on (the fifth table would get “t5”). Finally, consider that some SQL dialects may not support the JOINs that we discuss.
  • As a quick reminder, not all SQL dialects support the exact syntax shown.
  • We’ve discussed CROSS APPLY and while this may be similar to CROSS JOIN, there are differences (I highlight these toward the end of the video).
  • CROSS JOIN, unlike CROSS APPLY, is supported in more SQL dialects than just SQL Server. Like CROSS APPLY, it produces a multiplied result set — every row of one input paired with every row of the other — but it does not take parameters the way CROSS APPLY can, so it is more restricted in use.
  • Consider scenarios where we may want to use this functionality.

In the example, we CROSS JOIN our two tables, which have only 2 overlapping values between them. Much of the time we’ll CROSS JOIN a table to itself, but it’s worth seeing a CROSS JOIN across two tables because we’ll sometimes need that too. Also note that we don’t filter; if I had filtered to rows where the values were equal, the result set would be different. Keep this in mind when you CROSS JOIN two tables and need to filter strictly because of the values you’re comparing — for instance, if we want to compare one month’s data to every other month, we need to filter accordingly (e.g., select only one value from the first table and compare it to every value in the second).
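To illustrate that filtering point, here is a minimal sketch in pandas, using hypothetical monthly totals (the months and values are illustrative, not from the video):

```python
import pandas as pd

# Hypothetical monthly totals (illustrative values only)
months = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Total": [100, 120, 90],
})

# Cross join every month with every month: 3 x 3 = 9 pairs
pairs = months.merge(months, how="cross", suffixes=("_a", "_b"))

# Filter so one month (Jan) is compared against every month, including itself
jan_vs_all = pairs[pairs["Month_a"] == "Jan"]

print(len(pairs))       # 9
print(len(jan_vs_all))  # 3
```

Without the filter we get all 9 pairs; with it, we compare a single month against all months, which is the pattern described above.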

Overview

For the examples in this post, we’ll be using a different set of data. The purpose is to show a real-world application where a CROSS JOIN helps. Our real-world example involves three invoices, and we want to concatenate these invoices with extensions based on the number of times the invoice was adjusted before finalization. The goal is that our final table (or dataframe) will hold the full range of possibilities, which we could then join against. I have used derivatives of this example in practice, so it’s a use case where a CROSS JOIN is genuinely useful.

Our data will involve three invoices: A12345, B12345 and C12345. Every time an invoice is updated, an extension is appended as a dash followed by the update number (e.g., A12345-1), and invoices are allowed to be updated up to 9 times before they are finalized. This means each invoice has up to 9 extension possibilities. We want to return all the possibilities for every invoice, and one way to do this is through a CROSS JOIN: we will CROSS JOIN every invoice value with every extension from 1 to 9.
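Before writing any SQL, we can sketch the expected output in plain Python: pairing each invoice with each extension 1 through 9 should give 3 × 9 = 27 strings.

```python
from itertools import product

invoices = ["A12345", "B12345", "C12345"]
extensions = range(1, 10)  # 1 through 9; 10 is excluded

# Every invoice paired with every extension, joined with a dash
possibilities = [f"{inv}-{ext}" for inv, ext in product(invoices, extensions)]

print(len(possibilities))   # 27
print(possibilities[0])     # A12345-1
print(possibilities[-1])    # C12345-9
```

This is exactly the result set the CROSS JOIN approaches below should reproduce.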

CREATE TABLE tableInv (Invoice VARCHAR(10));
CREATE TABLE tableExt (Ext INT);

INSERT INTO tableInv (Invoice)
VALUES ('A12345')
, ('B12345')
, ('C12345');

INSERT INTO tableExt (Ext)
VALUES (1)
, (2)
, (3)
, (4)
, (5)
, (6)
, (7)
, (8)
, (9);

We will not need this setup for the object-oriented approaches with pandas or pyspark, but the above code will be used in the SQL example.

SQL

A CROSS JOIN lets us pair every record in one table with every record in another table, without a filter or condition. If we specified a condition, as we do with an INNER or LEFT JOIN, the results would be filtered to only the matching records from both tables. In our case, we will look at every record in tableInv next to every record in tableExt. When we run the code below, we get 27 results, because we have 3 records in tableInv and 9 records in tableExt, and without a condition the CROSS JOIN returns every record from the joined table for each record of the initial table.

SELECT *
FROM tableInv t1
CROSS JOIN tableExt t2

For our business use case, each invoice can have up to 9 changes before being finalized, and the format with these changes is the invoice plus a dash and the extension (up to 9). We want all the possibilities returned, and a CROSS JOIN is one way to get them. With the code below (written in Spark SQL, though it has close equivalents in most SQL dialects), we get that result: 27 total records of invoice, dash, and possible extension up to 9.

SELECT CONCAT(t1.Invoice,'-',CAST(t2.Ext AS VARCHAR(1)))
FROM tableInv t1
CROSS JOIN tableExt t2

Remember that we are not specifying any condition in our CROSS JOIN, thus all the results in both tables return.

Pandas

We want to accomplish the same functionality using pandas. In this case, we’ll create our dataframes from two data sets. The first dataframe is created following the format we’ve used with other joins. The second dataframe — tableExtDf — is created using the range function, which gives us the values 1 through 9; the end of the range is exclusive (in other words, our range does not include 10).

import pandas as pd

tableInv = {
"Invoice": ["A12345","B12345","C12345"]
}

tableInvDf = pd.DataFrame(tableInv)

tableExtDf = pd.DataFrame({'Ext': range(1, 10)})

Now that we have our dataframes, we’ll CROSS JOIN them in pandas using the merge function with how="cross" (available in pandas 1.2 and later). Similar to the SQL, we don’t need to specify a condition, so notice the absence of on here. There is one point to consider, though: the values in tableExtDf are integers, not strings. If we tried to concatenate these values directly, we would hit a type error because we cannot concatenate a string and an integer. We’ll apply astype(str) to this column so that it returns strings and we can concatenate the values.

crossJoin = tableInvDf.merge(tableExtDf, how="cross")
crossJoin = crossJoin["Invoice"] + "-" + crossJoin["Ext"].astype(str)
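Putting the pieces together, here is a self-contained run of the pandas approach (assuming pandas ≥ 1.2 for how="cross") that we can sanity-check against the expected 27 rows:

```python
import pandas as pd

tableInvDf = pd.DataFrame({"Invoice": ["A12345", "B12345", "C12345"]})
tableExtDf = pd.DataFrame({"Ext": range(1, 10)})

# Cross join: every invoice row paired with every extension row
crossJoin = tableInvDf.merge(tableExtDf, how="cross")

# Concatenate invoice, dash, and the extension cast to string
labels = crossJoin["Invoice"] + "-" + crossJoin["Ext"].astype(str)

print(len(labels))      # 27 = 3 invoices x 9 extensions
print(labels.iloc[0])   # A12345-1
print(labels.iloc[-1])  # C12345-9
```

The left table drives the row order, so all nine A12345 extensions come first, then B12345, then C12345.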

Pyspark

Before we solve this in pyspark, let’s create the sample data we’ll use. Remember that we can convert pandas dataframes to pyspark dataframes, so an alternative is to build the pyspark dataframes from the pandas dataframes above. In this case, though, we’ll create the pyspark dataframes explicitly, which can be useful when you need to contrive data sets for unit tests.

tableInv = sc.parallelize([{"Invoice": "A12345"},{"Invoice": "B12345"},{"Invoice": "C12345"}])
tableExt = sc.parallelize([{"Ext": 1},{"Ext": 2},{"Ext": 3},{"Ext": 4},{"Ext": 5},{"Ext": 6},{"Ext": 7},{"Ext": 8},{"Ext": 9}])
tableInvDf = spark.createDataFrame(tableInv)
tableExtDf = spark.createDataFrame(tableExt)

Like we’ve seen with the SQL, we want to CROSS JOIN these dataframes so that each invoice gets every extension from 1 to 9, preceded by a dash. Therefore, we select the Invoice column, then a dash, then the Ext column, all inside a concat() function, since we want these values concatenated together. In pyspark, when we want to specify a string constant (in this case the dash), we use the function lit() to declare a literal value.

from pyspark.sql.functions import concat, col, lit

crossJoin = tableInvDf.crossJoin(tableExtDf).select(concat(col("Invoice"),lit("-"),col("Ext")))

As we see, we get 27 values with the extensions ranging from 1 to 9.

General Considerations

In general, I would avoid using Python for analyzing data in sets. While Python has its strengths, row-by-row, object-oriented processing performs poorly relative to set-based SQL engines when working over sets of data; libraries like pandas and pyspark narrow the gap, but SQL is often still the better tool for this kind of work.

For more SQL lessons, you can continue to watch the SQL Basics series.

Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyrighted; all rights reserved. None of the written content in this post may be used in any artificial intelligence.
