The Rapidfuzz Library As A Golden Record Solution
I recently used the Python library rapidfuzz to golden some records. After running a few tests, I have some thoughts on data curation that you may find useful with any library where you need to golden records.
Define
A golden record could be a fully unique record or it could be a unique entity.
Think of an address: 123 My Street and 123 My St are the same entity in a context where “St” is an abbreviation of “Street”. But they are both unique records because each is a different way of writing the same address.
Thus the entity overall is golden while the address formats are unique. This would be 2 rows with the same golden id.
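Here is a minimal sketch of the "2 rows with the same golden id" idea. It uses only the standard library, and the abbreviation map, the `normalize` helper and the `assign_golden_ids` function are hypothetical names made up for illustration, not part of rapidfuzz.

```python
import re

# Hypothetical abbreviation map; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(address: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def assign_golden_ids(addresses):
    """Return (golden_id, address) rows; the same entity shares one id."""
    golden_ids = {}
    rows = []
    for address in addresses:
        key = normalize(address)
        # Reuse the id if this entity was seen before, else issue the next one.
        golden_id = golden_ids.setdefault(key, len(golden_ids) + 1)
        rows.append((golden_id, address))
    return rows


rows = assign_golden_ids(["123 My Street", "123 My St", "500 Other Ave"])
for golden_id, address in rows:
    print(golden_id, address)
```

Both ways of writing the first address keep their own row, so the unique records survive, while the shared id marks them as one entity.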
You can imagine the scenario if we instead define every different way of writing an address as golden.
We’d have many golden records, because that one address can be written in at least two ways, and the same applies to every other address. This is important because this exercise often involves handling typos or mistyped entries.
Threshold
Let’s think about typos or erroneous entries here. We’ll use our example address of 123 My Street. Is 1234 My Street a typo or an entirely different address?
Without other identifying information, we may not know. This is key!
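To make the threshold question concrete, here is a rough sketch that scores a few candidates against our example address. It assumes rapidfuzz's `fuzz.ratio` scorer, which returns a score from 0 to 100, and falls back to the standard library's `difflib` as a stand-in if rapidfuzz isn't installed. The cutoff of 90 is a made-up value for illustration, not a recommendation.

```python
try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.ratio(a, b)  # already on a 0-100 scale
except ImportError:
    # Stand-in so the sketch still runs without rapidfuzz installed.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio() * 100


THRESHOLD = 90  # hypothetical cutoff; tune it against your own data

base = "123 My Street"
for candidate in ["123 My St", "1234 My Street", "77 Elm Road"]:
    score = similarity(base, candidate)
    verdict = "candidate match" if score >= THRESHOLD else "no match"
    print(f"{candidate!r}: {score:.1f} -> {verdict}")
```

The one-character difference in 1234 My Street clears the cutoff even though it may be a different address, and a plain character-level ratio can score the abbreviation 123 My St lower than that variant. The score alone can't tell us which case we're in.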
Can we add more information to what we have when we test matching and create a golden record? For instance, can we take the street along with the jurisdiction information, such as 123 My Street, Sim City, Simland 00000?
It’s still possible that there are two addresses here, 123 My Street and 1234 My Street, but with more information we get closer to judging how close a match is.
Keep in mind that human review may be required in the end. If we determine that a human review is needed, we can add it as a step after the automation.
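One way to sketch the extra information plus the human review step is a small triage function. It takes a street score and a jurisdiction score, both assumed to be on a 0-100 scale from whatever scorer you use, and sorts the pair into match, review or no match. The two cutoffs and the `triage` name are hypothetical.

```python
AUTO_MATCH = 97    # hypothetical: at or above this, link automatically
NEEDS_REVIEW = 85  # hypothetical: between the two cutoffs, ask a human


def triage(street_score: float, jurisdiction_score: float) -> str:
    """Sort a candidate pair into 'match', 'review' or 'no match'."""
    # A different city or postal code rules the pair out early.
    if jurisdiction_score < NEEDS_REVIEW:
        return "no match"
    if street_score >= AUTO_MATCH:
        return "match"
    if street_score >= NEEDS_REVIEW:
        return "review"
    return "no match"


print(triage(100.0, 100.0))  # identical street and jurisdiction
print(triage(96.3, 100.0))   # close street, same jurisdiction
print(triage(96.3, 40.0))    # close street, different jurisdiction
```

The middle band is where the queue for a person goes: close enough that automation shouldn't reject it, not close enough to link without a second look.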
Privacy Matters
I’ve already written on Linkability. I suggest you read and reflect on that post.
Let’s think about a situation of John James Doe and John James Doe Co. John James Doe is a person. John James Doe Co is a company. Relative to our threshold, these may match. But John James Doe the person does not treat his individual entity as his business entity.
If a person intends something, should we violate this intent? You may use a different contact method for your auto and renter’s insurance. Is it any of my business as to why you do this and should I create a golden entity that ties these together if you intend them to be separate?
You can make your own decision. If I’m making the business decision, then the answer is no. If a person intends for different entities to be unlinked, I proceed with that approach. If you link entities in violation of what a person intends and a hacker gets access to your data, then you may be directly responsible for any “goldening” that lets the hacker tie those records together.
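If you take the same position, the rule can sit in front of the match score as a guard. This is only a sketch: the `Entity` class, its `allow_linking` flag and the `may_link` function are hypothetical, and how you actually capture a person's intent is up to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str            # e.g. "person" or "company"
    allow_linking: bool = True  # hypothetical flag for the owner's intent


def may_link(a: Entity, b: Entity, score: float, threshold: float = 90) -> bool:
    """Link only when intent allows it, regardless of how high the score is."""
    if a.entity_type != b.entity_type:
        return False  # a person is not his business entity
    if not (a.allow_linking and b.allow_linking):
        return False  # someone asked for these to stay separate
    return score >= threshold


person = Entity("John James Doe", "person")
company = Entity("John James Doe Co", "company")
# The score here is an illustrative value, not a measured one.
print(may_link(person, company, score=95.0))  # False
```

The guard runs before the threshold check on purpose, so a high score can never override what the person intends.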
Be careful on this point. You may feel like it’s unimportant, but it could carry huge legal costs. As I keep advising, you shouldn’t be storing people’s private information anyway.
Note: all images in the post are either created through actual code runs or from Pixabay. The written content of this post is copyright SqlInSix; all rights reserved. None of the written content in this post may be used in any artificial intelligence.