Data Notes From A Recent Solution
Jul 16, 2024
Some thoughts from a recent data solution:
- If you can’t test data accuracy during development (i.e., the baseline data source is not available), development time will lengthen because the solution stays theoretical until it can be checked against real data. Test-driven development saves time.
- Know when to use ZORDER BY in Databricks, as it can be extremely useful.
- Label data and infrastructure appropriately as early as possible. Avoid using overly complex terms or terms that may carry other meanings in labels. Example: if you have five notebooks for the same data set, referring to them as if they’re the same will confuse everyone. Create appropriate labels, for example: Ingestion notebook, Transformation notebook, Historical notebook, Testing notebook, Reporting notebook.
- It only takes one minor detail for an AI solution to come with a huge cost. Related to this: people don’t like to admit they used AI. Unfortunately, we found this out late in the solution; had we known earlier, it may have helped. This is a new business challenge that I expect to see more of.
- The more general an error is, the more troubleshooting is involved. Make sure errors are detailed; if you’re tying out an identity in a data warehouse, but there’s a missing value causing a foreign key issue, throw that specific error. The reason we can solve this kind of problem quickly in a database is the specificity of the error.
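
To make the last point concrete, here is a minimal Python sketch of a check that fails with a specific message instead of a general one. It is not code from the solution itself; the table names, column name, sample rows, and the helper check_foreign_keys are all hypothetical and only there for illustration.

```python
class MissingForeignKeyError(ValueError):
    """Raised when child rows reference keys that are absent from the parent table."""


def check_foreign_keys(child_rows, parent_keys, fk_column, child_table, parent_table):
    """Raise a detailed error naming the table, column, and offending values."""
    known = set(parent_keys)
    # Collect the distinct foreign key values that have no matching parent key.
    missing = sorted(
        {row[fk_column] for row in child_rows if row[fk_column] not in known},
        key=str,
    )
    if missing:
        raise MissingForeignKeyError(
            f"{child_table}.{fk_column} has {len(missing)} value(s) with no match "
            f"in {parent_table}: {missing}"
        )


# Hypothetical sample data: order 2 points at a customer that does not exist.
customers = [101, 102]
orders = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 999},
]

try:
    check_foreign_keys(orders, customers, "customer_id", "orders", "customers")
except MissingForeignKeyError as exc:
    message = str(exc)
    print(message)
```

The message names the table, the column, and the exact values involved, so whoever picks up the failure knows where to look without rerunning anything.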
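
On the ZORDER BY note above: in Databricks it is used as part of an OPTIMIZE statement on a Delta table, and it tends to help when queries frequently filter on the listed columns. The sketch below only builds the statement as a string; the table name, column names, and the helper build_zorder_statement are hypothetical, and whether Z-ordering pays off depends on your own data and query patterns.

```python
def build_zorder_statement(table, columns):
    """Return a Databricks SQL OPTIMIZE ... ZORDER BY statement as a string."""
    if not columns:
        raise ValueError("ZORDER BY needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical table and columns, chosen only for illustration.
statement = build_zorder_statement("sales.events", ["customer_id", "event_date"])
print(statement)

# In a Databricks notebook, where a `spark` session is already available,
# the statement could then be run with:
# spark.sql(statement)
```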