How To Create A Date Dimension In Python Pandas
In this post, we look at creating a date dimension using python pandas. Pandas is one of the easiest libraries to use in python and can easily be convert to a pyspark dataframe using the spark.createDataFrame
method, provided that there are no data type clashes. This latter issue won’t show in our exercise for this post, as we’ll be ensuring that our dataframe has values, but be aware of data columns that can’t be inferred when converting from one or the other.
One of the benefits to using pandas here is that we can create this dimension for a data warehouse we may be running in Azure Synapse, Databricks or from our own file system. This means that once we create a useful date dimension, we can re-use this in a variety of places without constant redevelopment. If you’ve used date or time dimensions, then you’re aware that there are generally 5–10 popular combinations of the way that these values are reported. Even when you have an exceptional format, it’s easier to convert them when you have a base to start.
For the sake of this example, we’ll be looking at a few columns that we can add to our date. I’ve seen date dimensions that had 30–40 columns, so we’ll be covering the basics here with the note that it’s fairly simple to extend what I’ve done if you want. Also, I show two routes, but will stick with the iterative route only because I’ve found that’s more intuitive for people supporting it. I actually prefer the 2nd route, but either will work and you should always use what makes the most sense.
Let’s start by using pandas to enter a valid date:
import pandas as pd
year = 2001
month = 1
day = 3
ourDf = pd.to_datetime(f'{year}-{month:02d}-{day:02d}')
print(ourDf)
We’ll get the result 2001–01–03 00:00:00 as this returns the format that we’ve specified. What if we enter an invalid value? For instance, we know that February never has 30 days, so let’s enter 30 days for February and see what we get:
year = 2001
month = 2
day = 30
ourDf = pd.to_datetime(f'{year}-{month:02d}-{day:02d}')
print(ourDf)
Error: Exception has occurred: DateParseError; day is out of range for month: 2001–02–30, at position 0; ValueError: day is out of range for month.
We get a ValueError because February doesn’t have a 30th day. What this means is that if we want to generate all days for 2022 and 2023, we could iterate over those years, and iterate over all months within those years, and iterate over all days within those months within those years. While doing this, we could pass in the date to pandas to validate our date. If it fails, we can catch it with a ValueError (because there will be some), but tell it to continue anyway without these failed values. In a sense, what we’re doing is:
for year in range(2022,2024):
for month in range(1,13):
for day in range(1,32):
try:
dateResult = pd.to_datetime(f'{year}-{month:02d}-{day:02d}')
except ValueError:
continue
Of course, we need to save our dateResult to review it, so let’s append it to an empty list and then add that list to a pandas’ dataframe:
dateResultList = []
for year in range(2022,2024):
for month in range(1,13):
for day in range(1,32):
try:
dateResult = pd.to_datetime(f'{year}-{month:02d}-{day:02d}')
dateResultList.append(dateResult)
except ValueError:
continue
pandasResult = pd.DataFrame(dateResultList,columns=["OurDate"])
print(pandasResult)
We now have 2 years worth of dates in our resulting dataframe — even though this nested loop generated invalid dates, because we handled these invalid dates with a continue, it still generated a dataframe of dates.
I demonstrate this method because sometimes a date dimension will require more numerical values in the dimension, not derived values from dates. This may be one way to achieve a date dimension. My preferred approach with pandas is the easier version of:
pandasResult = pd.DataFrame({
"OurDate": pd.date_range(start="2022-01-01", end="2023-12-31", freq="D")
})
print(pandasResult)
Cleaner and faster, plus we don’t need the try-catch. However, as I add values to the date dimension, I will have to derive these values in a dataframe with a new column. For an example, using this approach we could further define other aspects to our date dimension — in the below example, I check whether the day of the date is less than or equal to 15 and if it is then label it as “1rst” and if not “2nd” for the pay period of the month.
import pandas as pd
import numpy as np
pandasResult = pd.DataFrame({
"OurDate": pd.date_range(start="2022-01-01", end="2023-12-31", freq="D")
})
pandasResult["PayPeriod"] = np.where((pandasResult["OurDate"].dt.day <= 15), "1st", "2nd")
print(pandasResult)
(As a quick note here, we require the numpy
library when we start to apply filters for our pandas’ values.)
In my view, this flow feels more intuitive provided that you logically organize and group the columns. For instance, you cannot call a day value if you don’t have a way of getting that day value earlier in the dataframe. You’ll the columns to appear in the correct order and be grouped appropriately so that columns can be added or cut faster. This is one reason that the nested loop can seem easier, as all days, months and years are generated before you can use anything. If this route is better for you, use it. I’ll demonstrate the nested loop route, but this would also suffice.
In some cases, this can cause confusion. Either way, I show you this route because it’s another alternative that make it easy to generate what we did in the nested loop. For the sake of simplicity, I won’t use both approaches as I break down our date dimension. I am going to use the first approach because I want to include some if logic. One point to make here is that we can apply if logic with the second approach. The below will output 7 columns: a standard date format, a formatted string for the date in the MM/DD/YYYY format, the full month name, an upper case month abbreviation, a month that is formatted with a leading 0 if there’s only 1 digit, a shortened year in the format of YY, and finally an analysis of the pay period depending on the time during the month.
dateResultList = []
for year in range(2022,2024):
for month in range(1,13):
for day in range(1,32):
try:
dateResult = pd.to_datetime(f'{year}-{month:02d}-{day:02d}')
date = dateResult.strftime('%m/%d/%Y')
monthname = dateResult.strftime('%B')
monthabbrev = dateResult.strftime('%b').upper()
formatmonth = str(month).zfill(2)
shortenedyear = str(year)[-2:]
if day <= 15:
pay = "1st"
else:
pay = "2nd"
dateResultList.append([dateResult,date,monthname,monthabbrev,formatmonth,shortenedyear,pay])
except ValueError:
continue
pandasResult = pd.DataFrame(dateResultList,columns=["OurDate","Date_MMDDYYYY","MonthName","MonthAbbrev","Month","Year_YY","PayPeriod"])
print(pandasResult)
Two important lines that shouldn’t be missed is the column order within the final dataframe and append list order. We must order these two lines exactly in terms of how the values first get appended to the list, then how the columns are ordered in the panda’s dataframe:
### During the loop we're appending these values
dateResultList.append([dateResult,date,monthname,monthabbrev,formatmonth,shortenedyear,pay])
… and …
### Column order of the pandas' dataframe
pandasResult = pd.DataFrame(dateResultList,columns=["OurDate","Date_MMDDYYYY","MonthName","MonthAbbrev","Month","Year_YY","PayPeriod"])
By contrast, we do not have to worry about order if we use the alternate approach I earlier mentioned, as we label the pandas’ columns as we add derived values. Be aware of this if you make changes — adding or subtracting values.
Let’s look at some of this functionality in more details.
print("1".zfill(2))
print("11".zfill(2))
print("11".zfill(3))
print("11".zfill(4))
In the case of zfill
(which is applied to a string here), this will lead our string with a 0 relative to the number of characters and relative to the size. Notice that if we specific zfill
with 2 and we already have 2 characters in our string, this does nothing. This is because we already have 2 characters. Thus this becomes useful when we need a string to represent a month digit that leads with 0; this can’t happen with an integer (you cannot have 04
as an integer). In most cases with date dimensions, we’ll need an integer month, such as 1, 2, 3, 4, etc. I use this because I have run across situations where a client wants a month with a leading zero and this shows how this can be done in these rarer cases.
When we need to get characters from a string easily when we either need the starting or ending characters, we can splice the string. In the below python, notice the output of splicing our strings at various starting points or using inverted starting points:
print("1234"[1:]) ##Starts post first value
print("1234"[-1:]) ##Inverse start first value
print("1234"[2:]) ##Starts post first two values
print("1234"[-2:]) ##Inverse start first two values
print("1234"[3:]) ##Starts post first three values
print("1234"[-3:]) ##Inverse start first three values
print("1234"[4:]) ##Start post first four values (nothing as there isn't anything after the 4th value)
print("1234"[-4:]) ##Inverse start first four values
When I note “inverse” in the above comments, I am indicating that the starting point is at the end of the string. If you’re familiar with the LEFT
and RIGHT
functions in SQL, an “inverse” is the equivalent to RIGHT
, as it starts at the end. In this case, splicing a string is one way that we can easily get the last two characters of a YYYY string — note that in the date dimension code we are first converting the YYYY integer to a string, then splicing it.
The upper()
in the code upper cases the output (lower()
would do the inverse). Finally, when you have a date and you want to format it a certain way, using .strftime('YOURFORMAT')
provided it’s on a date field can be useful depending on the format you want.
Overall, once you make a date dimension in pandas with as many possibilities as you need, you won’t have to do much development with it from that point. You can simply subtract columns that aren’t needed and possibly add a few new derived columns. The flexibility to add this to pyspark is an added bonus.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright; all rights reserved. None of the written content in this post may be used in any artificial intelligence.