What is file matching?
File matching compares the contents of one customer data file against another to see which records contain the same or similar information.
Why match customer data files?
There are many reasons; some examples are listed below:
- As a preliminary process to identify and merge duplicate customer records
- Identify deceased customers
- Identify telephony and email opt-outs
- Identify gone-aways (i.e. customer is no longer at this address)
- Remove customers with adverse credit history
- Merge/Exclude customers from disparate systems
- Identify duplicate customers in the same file (the file is matched to itself)
- Cross-sell to customers held on another system and ensure that they don’t already hold the product
- Enable two organisations to share data without being given access to each other's information (e.g. for cross-selling products)
- Congratulate customers that have received awards, honours or decorations
- Identify which customers have subsequently taken up a specific marketing offer
- Determine which people who saw you at a trade fair have subsequently become customers
Our file matching broadly falls into two categories:
1. Complex rules: string-matching routines that use only exact 1:1 matches, either on whole fields or on parts of fields. Here we look for exactly the same string in each file. In most instances this type of match spans multiple fields, or elements of multiple fields, for example:
- title+first name+surname+street number+street+suburb+postcode
- first name+surname+date of birth+city+state+other (e.g. mobile phone)
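As an illustration, an exact 1:1 match across multiple fields can be implemented as a composite key lookup. The sketch below is a minimal Python example; the field names are assumptions for illustration, not our actual schema:

```python
def exact_match_key(record):
    """Build a composite key from multiple fields for an exact 1:1 match.
    The field names are illustrative assumptions, not a fixed schema."""
    fields = ("first_name", "surname", "dob", "city", "state", "mobile")
    # Normalise case and whitespace so 'Smith ' and 'SMITH' compare equal.
    return tuple(str(record.get(f, "")).strip().upper() for f in fields)

def exact_matches(file_a, file_b):
    """Return every pair of records whose composite keys are identical."""
    index = {}
    for rec in file_a:
        index.setdefault(exact_match_key(rec), []).append(rec)
    return [(hit, rec)
            for rec in file_b
            for hit in index.get(exact_match_key(rec), [])]
```

Indexing one file and probing it with the other keeps the comparison linear rather than comparing every record against every other record.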
2. Approximate string matching: also referred to as fuzzy matching, this category includes a number of different techniques that try to find records that are similar but not identical.
Most algorithms in this category either check for phonetic differences (i.e. typing what you hear but misspelling the result) or measure the generalised edit distance between strings, where, for example, a letter might be missing or reversed in one record versus another (i.e. 'typos' or typographic errors: duplication, omission, transposition, or substitution of a small number of characters). We have a large library containing more than 20 different string-matching algorithms and we will only pick those that suit your data and circumstances.

Each time we find a record based solely on fuzzy matching we also apply other qualifying checks (e.g. gender and date of birth). Even our qualifying checks have to undergo their own qualifying checks! If we use date of birth we have to check that it is populated and isn't a default date of birth (e.g. '01/01/2001'). If we use gender we make sure that the field doesn't contain gibberish or re-purposed codes (e.g. we require gender in ('m','f')). Even email fields can be populated with defaults (e.g. firstname.lastname@example.org). Some example rules:
- fuzzy first name+surname+street number+street+suburb+postcode+(dob,mobile or email+gender)
- first name+fuzzy surname+street number+street+postcode+(dob,mobile or email+gender)
- first name+surname+street number+fuzzy street+postcode+(dob,mobile or email+gender)
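A minimal sketch of this idea, using Python's built-in difflib as a stand-in for the edit-distance algorithms described above; the field names, default-date list and similarity threshold are all illustrative assumptions:

```python
from difflib import SequenceMatcher

DEFAULT_DOBS = {"01/01/2001", "01/01/1900", ""}  # illustrative defaults

def similar(a, b, threshold=0.85):
    """Generalised similarity via difflib, standing in for a proper
    edit-distance algorithm."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= threshold

def dob_is_usable(dob):
    """Qualifying check on the qualifier: the date of birth must be
    populated and must not be a known default value."""
    return bool(dob) and dob not in DEFAULT_DOBS

def fuzzy_name_match(rec_a, rec_b):
    """Accept a fuzzy surname match only when a qualifying field agrees."""
    if not similar(rec_a["surname"], rec_b["surname"]):
        return False
    if dob_is_usable(rec_a["dob"]) and dob_is_usable(rec_b["dob"]):
        return rec_a["dob"] == rec_b["dob"]
    return False  # no usable qualifier, so do not auto-match
```

Note how a record with a default date of birth is rejected outright: the qualifier itself failed its own qualifying check.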
How do we match files?
We prefer to have a series of tight rules that slowly whittle down the files and give us customers that we would naturally assume were the same people. As each rule is run we take a sample and check it by eye before we arrive at the final rule set. Every set of customer data is different, and without this meticulous approach it is easy to make mistakes. If it wouldn't pass visually, it shouldn't pass when our rules are created and executed.
Our file matching service goes through the following stages:
- Field standardisation (e.g. casing, symbols, letters, abbreviations, etc)
- The separation and categorisation of key address elements (e.g. Building, Block, Level, Apartment, Unit, Lot, etc)
- Data quality and formatting checks
- Identification of duplicate records
- Fuzzy matching (e.g. typos, misspellings, etc)
- Multiple match levels using different key identifying data (e.g. name, address, phone, email, account, etc).
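As a rough sketch, these stages can be thought of as a pipeline of functions, each taking and returning the full record list. The stage shown is a deliberately simplified stand-in for the real field standardisation:

```python
def run_matching_pipeline(records, stages):
    """Run each stage of the matching service in order; every stage
    takes the full record list and returns a transformed list."""
    for stage in stages:
        records = stage(records)
    return records

def standardise_fields(records):
    """Simplified stand-in for the first stage above: trim whitespace
    and upper-case every field."""
    return [{k: str(v).strip().upper() for k, v in r.items()}
            for r in records]
```

The ordering matters: standardisation and address separation must happen before duplicate identification and fuzzy matching, or the later stages compare unlike with unlike.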
Below are some examples of how our processes and algorithms deal with certain situations.
In particular we look at common words used for business names and standardise them.
- PROPRIETARY will become PTY
- Ampersand (&) will become AND
- P/L will become PTY LTD
For customer address details, we have a lookup table that covers hundreds of different variations of a street suffix, for example:
- CRESCENT will become CRES
- ROAD will become RD
- STREET will become ST
We even break down streets by direction, e.g. North East.
Our field standardisation covers hundreds of different variations, for different countries and types of businesses.
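A simplified illustration of how such lookup tables can be applied, token by token. The tables below are tiny illustrative subsets drawn from the examples above, not our full lists:

```python
# Illustrative subset of a street-suffix lookup table.
STREET_SUFFIXES = {
    "CRESCENT": "CRES",
    "ROAD": "RD",
    "STREET": "ST",
}

# Illustrative subset of business-name standardisations.
BUSINESS_WORDS = {
    "PROPRIETARY": "PTY",
    "&": "AND",
    "P/L": "PTY LTD",
}

def standardise(text, lookup):
    """Replace each known token with its standard form."""
    return " ".join(lookup.get(tok, tok) for tok in text.upper().split())
```

Because both files pass through the same table, 'FAIRBAIRN STREET' and 'FAIRBAIRN ST' end up byte-identical before any matching rule runs.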
Compartmentalising address details
We have created our own magnificent script for identifying key information in addresses and separating it out into logical components. We do this using hundreds of lines of regular expressions embedded in SQL queries. By splitting address lines into logical components we have been able to find all of the following address variations for the same customer:
- UNIT 20, 5-7 AMY ROAD
- 20/5-7 AMY RD
- APARTMENT 20/5-7 AMY ROAD
Similarly, an address split across two lines and reversed can still be matched:
- 20/10 CLIFTON PLACE
- 10 CLIFTON PL, UNIT 20
We can also match addresses where one contains a lot more information than the other:
- UNIT 55 SUNSHINE VIEW APARTMENTS, 30 THOMAS BRADLEY STREET
- 55/30 THOMAS BRADLEY ST
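To give a flavour of the approach, here is a heavily simplified, hypothetical regex sketch in Python. Our production version uses far larger tables of unit-type words (BLOCK, LEVEL, LOT, and so on) embedded in SQL, and also handles reversed and multi-line layouts:

```python
import re

# Simplified subset of unit-type words; the real table is much larger.
UNIT_WORDS = r"(?:UNIT|APARTMENT|APT|FLAT)"

def split_address(address):
    """Return (unit, street_number, street) or None if unparseable.
    Street suffixes are standardised in a separate, earlier step."""
    addr = address.upper().strip()
    # 'UNIT 20, 5-7 AMY ROAD' or 'APARTMENT 20/5-7 AMY ROAD'
    m = re.match(rf"{UNIT_WORDS}\s+(\w+)[,/]?\s*([\d-]+)\s+(.+)", addr)
    if not m:
        # '20/5-7 AMY RD'
        m = re.match(r"(\w+)/([\d-]+)\s+(.+)", addr)
    return m.groups() if m else None
```

Once split, the three variations above all yield unit '20' at street number '5-7', so a later exact comparison on components succeeds where a raw string comparison would fail.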
These examples are just the tip of the iceberg; the script that separates addresses was months in the making, but the results are truly impressive. If you would like to see how it works for your data then get in touch.
Data Quality and formatting checks
When it comes to email addresses and phone numbers, we completely clean and standardise them first. This enables us to match more effectively. Consider a mobile number with a country code prefix matched against one without. Some numbers will have leading zeroes; others will have spaces or hyphens between groups of digits. We clean and format these fields for all records before we endeavour to match them. For example, the mobile number 0429 232 444 will match against the same number with completely different formatting: +61 429-232-444.
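A minimal sketch of this cleaning step for the Australian mobile example above; the country-code handling is a simplifying assumption, not our full rule set:

```python
import re

def normalise_au_mobile(number, country_code="61"):
    """Strip formatting and reduce an AU mobile to a canonical local form.
    Simplified: assumes any number starting with the country code is
    internationally formatted."""
    digits = re.sub(r"\D", "", number)  # drop spaces, hyphens, '+', etc.
    if digits.startswith(country_code):
        # '+61 429...' becomes '0429...'
        digits = "0" + digits[len(country_code):]
    return digits
```

After this step, both formats of the number reduce to the same eleven-character-free digit string, so a plain equality check finds the match.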
Identify duplicate records
This is a full service function. We go through all stages above but then match the file to itself to find all records where the customer looks the same.
In this example, various algorithms found the similarity in first name, but because we used fuzzy matching we also validated the customer by her date of birth and full address:
- MISS KRISTY BYRON FIDDLER
- MS CHRISTY B. FIDDLER
Similarly, with street address we can identify typos, spelling mistakes and/or letter substitution:
- TRISH PARSON, 42 RAIRBAIRN STREET
- TRISH JANE PARSON, 42 FAIRBAIRN ST
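Typos like these are exactly what edit-distance measures catch. Below is a textbook Damerau–Levenshtein (optimal string alignment) implementation, shown for illustration rather than as our production code:

```python
def damerau_levenshtein(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    adjacent transpositions (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

'RAIRBAIRN' and 'FAIRBAIRN' are a single substitution apart, so a rule that tolerates distance 1 on the street name, backed by an exact street-number and name match, flags the pair.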
Raise Questionable Records for further investigation
We create a lot of different rules for every situation. In most cases multiple rules will find exactly the same customers, and we take a note of each rule for every match.
(e.g. ID1: 1000001 ID2: 2222222 RULES: R2,R5,R10,R12), which shows that four different rules all matched the same pair of records.
Occasionally, only one of our algorithms will find a questionable match, but when you review it personally there are too many similarities to discount it. Consider the example below:
- MIKE PAUL GARDINER, 425 MOOKE STREET, POSTCODE: 3072, DOB: 22/01/1965
- MIKE PAUL GARDNER, 125 MOAKY ST, POSTCODE: 3072, DOB: 22/01/1965
- MIKE GARDNER, 25 MOCKI ST, POSTCODE: 3072, DOB: 22/01/1965
At first glance the addresses look too dissimilar, but when you add recurring numbers, similar-sounding street names and exact matches on postcode and date of birth, you would probably admit that these records warrant further investigation. Perhaps records like these should be delivered to front-line staff to investigate further? After all, many of them will be speaking to your customers on a regular basis. You could consider using a simple distribution and tracking tool to help deliver this information to front-line staff.
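For illustration, collecting the rules that fired for each matched pair, and flagging pairs found by only one rule for manual review, might look like this; the IDs and rule names follow the example above, and the review threshold is an assumption:

```python
from collections import defaultdict

def aggregate_rules(matches):
    """Collect every rule that matched each (id1, id2) pair, mirroring
    the 'ID1 ... ID2 ... RULES: R2,R5,R10,R12' report line."""
    rules_by_pair = defaultdict(list)
    for id1, id2, rule in matches:
        rules_by_pair[(id1, id2)].append(rule)
    return dict(rules_by_pair)

def questionable(rules_by_pair, min_rules=2):
    """Pairs found by fewer than min_rules rules go for manual review."""
    return [pair for pair, rules in rules_by_pair.items()
            if len(rules) < min_rules]
```

A pair confirmed by four independent rules is accepted automatically; a pair found by a single algorithm is routed to a person instead.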
Which Rules to use?
Every set of customer data is different. Each organisation has its own unique systems for capturing and storing data. Some will have data validation on entry; others will not. Some data-entry staff will input exactly the same customer differently from others. Thankfully, we have a large collection of fuzzy matching algorithms and will always be able to find something that fits your circumstances. Some of our more reliable algorithms include Damerau–Levenshtein distance, Double Metaphone (an improvement on Soundex) and Jaro–Winkler. No matter what rules get written to analyse your data, we will always sample the data at every stage to make sure that our results make sense. We are so confident in our matches that we will provide you with examples of each rule in the final report. This report will help your organisation find ways to improve data governance for the future. It may be as simple as updating a few key fields in your front-end systems.
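For reference, here is the classic Soundex algorithm that Double Metaphone improves upon, as a minimal sketch rather than our production code:

```python
def soundex(name):
    """Classic Soundex: encode a name as its first letter plus three
    digits, so similar-sounding names share a code."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    name = name.upper()
    result = name[0]
    last = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != last:
            result += digit
        if ch not in "HW":  # H and W do not break up a run of letters
            last = digit
    return (result + "000")[:4]
```

Soundex's weaknesses, such as anchoring on the first letter (so 'KRISTY' and 'CHRISTY' receive different codes), are exactly what Double Metaphone and similar successors were designed to fix.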
If you would like us to match files for you then please use the Contact us form.