So you work in analytics. You spend vast amounts of time analysing data for meaningful patterns. Over the last 3 weeks you have attended a few boring meetings, answered a dozen emails, drunk 37 cups of coffee and spent the remainder of your time creating a rather amazing propensity model for the Marketing department. Your model (rather brilliantly) predicts the likelihood of your existing customers opening a new product. You have pinpointed some of the key factors – age, gender, life stage, existing product holdings, spending behaviour; the list goes on. The model has consumed your time, but you were meticulous and you are confident it will bring in some decent results.
You instruct marketing to choose customers in the top decile of the model. They do this correctly and extract the mail file. The mail file is sent to the mail house. The proofs are signed-off. The letter is sent to your customers.
What could possibly go wrong?
Five days later…
Sandra (your customer) opens the marketing offer and is confused to be offered a product she already has. She calls customer service to complain. A marketing representative comes over to your desk and asks whether it was your fault. After some investigation it is discovered that the problem stems from Sandra having two customer records:
Miss Sandra Smith, Unit 2, 12 Baltimore St., Wisthorp, NH1 8DM, 12/06/1982
Mrs Sandra Channing-Smith, 2/12 Baltimore Street, Wisthorp, NH1 8DM, 12/06/1982
Well, OK, nothing you could do about that. Marketing apologises to the customer. A front-line staff member merges Sandra’s customer records and life goes on.
You are about to ignore the whole sordid incident when, after a little reflection, a nasty thought hits you. What if there are more duplicates? What if there are lots of duplicates? How valid is my model? If it proves a roaring success, can we still go to the next decile in the following month? Is all of my previous reporting correct? How do I measure response rates across multiple records for the same customer? Is the information I give to senior managers correct? Are our profit forecasts correct? Customer satisfaction surveys? Oh gosh… I wish I hadn't had that thought… I've opened a can of worms!
Unfortunately, I have found that data quality problems exist in all organisations. Most large organisations have been around for many years, and their legacy systems didn't have the checks and balances you would expect from an expensive modern CRM system. As a result, many customer records will have been entered with only rudimentary checking. Many systems didn't even let the data entry clerk check for existing customers, and most organisations haven't invested the time to understand and correct this older data.
Do a Quick Check!
SELECT firstname, surname, DOB, COUNT(*)
FROM tblCustomers
GROUP BY firstname, surname, DOB
HAVING COUNT(*) > 1;
Are they the same people or is it merely a coincidence? You obviously need to include a few more fields to check this thoroughly.
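If your extract lives outside the database, the same grouping check can be sketched in a few lines of Python. The records and field layout below are hypothetical, mirroring the query above:

```python
from collections import Counter

# Hypothetical extract of customer records: (firstname, surname, DOB).
customers = [
    ("Sandra", "Smith", "1982-06-12"),
    ("Sandra", "Channing-Smith", "1982-06-12"),
    ("James", "Ramsay", "1970-01-01"),
    ("Sandra", "Smith", "1982-06-12"),  # exact duplicate key
]

# Group by (firstname, surname, DOB) and keep keys seen more than once,
# mirroring the GROUP BY ... HAVING COUNT(*) > 1 query above.
counts = Counter(customers)
duplicates = {key: n for key, n in counts.items() if n > 1}

for (first, last, dob), n in duplicates.items():
    print(f"{first} {last} ({dob}): {n} records")
```

Notice that Sandra's Channing-Smith record slips straight through this check, because the surname differs – exact grouping only catches exact duplicates.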
Now, consider what you would do if:
- Date of birth was missing from one record
- The customer moved further along the same street
- The customer used her maiden name in one record and her married name in the other
- The postcode differed
- The street name was spelt differently
- The customer's name was Siobhan in one record but Shivawn in another
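None of these variants survives an exact-equality join, yet most of them score highly on a simple string similarity measure. As a quick, illustrative taste (using Python's standard-library difflib rather than any specialised matching algorithm):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Illustrative pairs drawn from the problem cases above.
pairs = [
    ("Siobhan", "Shivawn"),                 # phonetic spelling variants
    ("Baltimore St.", "Baltimore Street"),  # abbreviated street name
    ("Smith", "Channing-Smith"),            # maiden vs married name
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: exact match={a == b}, ratio={similarity(a, b):.2f}")
```

An exact comparison returns False for every pair, while the ratios are high enough to flag the records for review – which is the whole idea behind fuzzy matching.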
Fuzzy Matching Algorithms
This is where fuzzy matching comes into its own. Some clever people (like you) have already invested the time to solve these problems. Many of you will probably have studied their work during your university years.
Levenshtein distance: The minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
Damerau-Levenshtein: As above, but a transposition of two adjacent characters also counts as a single edit.
Soundex: A widely used phonetic algorithm for indexing English names by sound.
Metaphone, Double Metaphone and now the third version, Metaphone 3: Approximate phonetic encoding algorithms, purported to surpass Soundex.
Jaro distance: A string similarity measure based on matching characters and transpositions, developed for linking duplicate records.
Jaro-Winkler distance: A variant of Jaro that gives extra weight to strings sharing a common prefix, which suits personal names.
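To make the first two definitions concrete, here is a compact dynamic-programming sketch in Python. It is illustrative only – production matching would use a tuned, compiled library – and the Damerau variant shown is the common "optimal string alignment" form:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def damerau_levenshtein(a: str, b: str) -> int:
    """As above, but an adjacent transposition also counts as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + cost)        # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Swapping two adjacent letters ("ab" to "ba") costs two Levenshtein edits but only one Damerau-Levenshtein edit – exactly the kind of typo a data entry clerk makes.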
The formulas look even more impressive than the definitions sound! Now you could have a go at doing this yourself using some or all of the algorithms above.
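If you do have a go, the phonetic side is surprisingly compact. Here is a minimal American Soundex sketch – a simplification that assumes ASCII alphabetic names; Metaphone and its successors are considerably more involved:

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus up to three digit codes.

    Simplified sketch: assumes a non-empty, ASCII alphabetic name.
    """
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0])
    for ch in name[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = codes.get(ch)
        if code is None:
            prev = None           # vowels (and y) do separate equal codes
            continue
        if code != prev:
            result += code
        prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

Run it against the earlier problem pair and both Siobhan and Shivawn encode to S150 – the phonetic algorithm catches what spelling comparison misses.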
But you would also need to invest the time to:
- Understand and profile the data
- Clean and compartmentalise the address information
- Load and compile the libraries to suit your environment
- Test the code and determine which cut-offs make the most sense for each model
- Ask the technology team to increase your play area on the shared server, because you are always running out of space!
- Sneak into the office at night to run your job when it doesn’t impact the million other things being run during the day (Yes, I know that you can schedule jobs – but I’m trying to add drama!)
Or, you could utilise the talents of someone that has already:
- Created a VBA script to test every character in every file for silly things – extra delimiters, non-ASCII chars, truncated fields, etc.
- Invested 9+ months building bespoke code to separate address details, using hundreds of lines of regular expressions
- Loaded all of the algorithms above (and many more) using SQL Server CLR integration
- Set up 4 dedicated servers with a SQL Server environment and extensive code libraries
- Run and tested the code against millions of real customer records
That person is me! My name is James Ramsay. I have worked in Database Marketing for the last 14 years. I have managed marketing campaign analysts and senior analytics staff across several large organisations (UK and Australia). I know that my processes are truly ground-breaking and I want to prove it to you.
For further details have a look around the site, and if you want to chat you can just Phone Us.