Data Notes From A Recent Solution

Jul 16, 2024

Some thoughts from a recent data solution:

If you can’t test data accuracy during development (i.e., the baseline data source is not available), development takes longer because the solution stays theoretical until it can be checked against real data. Test-driven development saves time.
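Know when to use ZORDER BY in Databricks. It colocates related values in the same set of files, so queries that filter on the Z-ordered columns can skip more data, which can be extremely useful on large tables.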
Label data and infrastructure appropriately as early as possible. Avoid labels that are overly complex or that may carry other meanings. For example, if you have five notebooks for the same data set, referring to them as if they’re the same will confuse everyone. Give each one a label that says what it does: Ingestion notebook, Transformation notebook, Historical notebook, Testing notebook, Reporting notebook.
It only takes one minor detail for an AI solution to come with a huge cost. Related to this: people don’t like to admit they used AI. Unfortunately, we found this out late in the solution; had we known earlier, it might have helped. This is a new business challenge that I expect to see.
The more general an error is, the more troubleshooting is involved. Make sure errors are detailed: if you’re tying out an identity in a data warehouse and a missing value is causing a foreign key issue, throw that exact error. Specificity is why we can solve this kind of problem quickly in a database.
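
To make the last note concrete, here is a minimal sketch of a check that fails with a specific message instead of a generic one. It is not the code from the solution described above: the table names (customers, orders), the MissingParentError class, and the check_foreign_keys helper are all hypothetical, and it uses Python’s built-in sqlite3 as a stand-in for a warehouse. It also assumes "missing value" means a key with no matching row in the parent table.

```python
import sqlite3


class MissingParentError(Exception):
    """Raised when child rows reference keys that do not exist in the parent table."""


def check_foreign_keys(conn, child, fk_column, parent, pk_column):
    """Raise a specific error naming the tables, columns, and orphaned key values."""
    # Table and column names are trusted inputs in this sketch (not user-supplied).
    query = (
        f"SELECT DISTINCT c.{fk_column} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL "
        f"ORDER BY c.{fk_column}"
    )
    orphans = [row[0] for row in conn.execute(query)]
    if orphans:
        raise MissingParentError(
            f"{child}.{fk_column} has {len(orphans)} value(s) with no match in "
            f"{parent}.{pk_column}: {orphans}"
        )


def demo():
    """Build a tiny hypothetical data set with one orphaned key and run the check."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 7);
        """
    )
    try:
        check_foreign_keys(conn, "orders", "customer_id", "customers", "customer_id")
    except MissingParentError as exc:
        return str(exc)
    finally:
        conn.close()
    return "ok"


if __name__ == "__main__":
    print(demo())
```

The message names the child column, the parent column, and the offending values, so whoever sees it knows where to look without re-running the load. A generic "load failed" would have forced them to rediscover all three.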

Written by SqlInSix Tech Blog