Data is considered to be of a high quality if it is fit for it’s intended purpose.
Elements to consider when measuring Data Quality include:
- Completeness: Is there enough information for decision making or use (e.g. Does the address contain a postcode? Have fields been truncated?)
- Validity: Are there working validation rules for key fields (e.g. Are phone number fields correct for the purpose of calling? Would they work in an automated dialler?)
- Consistency: Has the data been entered the same way (e.g. Drop-down list for a ‘country’ field versus free-typing by a data entry clerk)
- Timeliness: How recent is the information? (e.g. Marital status is often a field that is captured when first joining as a new customer and may never be updated)
- Accuracy: Does the date of birth reflect the customers age? You will often find DOB having a larger distribution for the 1st day of each month and the 1st day of each year.
So how do we measure and pinpoint inconsistencies in your data?
A key to our approach is profiling everything first (finding the unique values in every field).
Every field that contains some form of description or non-unique value will be profiled (age, country, marital status, gender, product type, etc).
Then we will compare the differences and suggest changes where there is an obvious replacement.
Null or Blank fields
A check of every field to check how consistently it is populated. Is it entirely populated with Nulls or blanks? This helps to determine if the field can either be dropped or re-purposed. This step is an essential part of validating the data dictionary and making sure that end-users are aware of the pitfalls of trying to use certain fields in their queries.
A Null value means that we don’t know what the value is. A blank value or zero-length string needs to be handled differently.
In one instance you would use the code <field> is null, in the other <field> = ”
In a lot of databases the same field can contain nulls, blanks, and zeros and your analysts need to understand this for every field that they use or their queries will be incorrect. We can help you document these inconsistencies in data values.
Are there fields that have exceeded the maximum field width and key data is obviously truncated (shorter) from what it should be? We run automated checks to ensure that data that has been migrated and collated from different systems hasn’t led to field data being cut off. Any fields that reach their maximum record will be visually scanned to check for the existence of incomplete data.
Every organisation should have a data dictionary. A Data Dictionary is a readily available repository of information about the data contained in your database. It should include the following:
- Definition of the field
- Relationships to other data
With automation we can check that the data dictionary is consistent with the values that have been categorised and described. If you don’t have a data dictionary we can create the bare bones so that you can begin with the task of defining it.
Validation rules for data
We can check Email address formats. Which email addresses won’t send? Can any be cleaned? We can check Phone numbers – mobile, international, local. Will they work in the automated dialler of your telephony centre? Is there a high percentage that can be fixed without losing the integrity of the existing data. We can go a step further and help you pinpoint the systems that don’t have enough validation when the data is entered?
Identification of Invalid Postcodes
As well as validating postcodes/zip codes from Australia, the United Kingdom and the United States we can also validate postcodes from other countries. If you would like this done as part of your Data Quality initiative then just ask. We use regular expressions extensively for splitting up customer addresses so it doesn’t take much to extend this capability into other areas.
As part of the data analysis we will pinpoint fields that can have standardised values (e.g. gender field containing “male”,”female”,”m”,”f” could be standardised to just “male” and “female”).These will be provided as a separate file for review.
Non-Printable Characters, Symbols, Delimiters in fields
We run your entire file through a special script that we developed. It will step through every single character in your database/file to identify letters, symbols and characters that may pose problems to different operating systems that have to deal with your data. We will be able to point out every record that contains non-ASCII characters, as well as characters typically used as delimiters (commas, semi-colons, etc). These problem records might cause different elements of your business to have severe problems when interrogating the data.
Identification of Duplicate Customers
As you would suspect a key problem with master data management in most organisations is the potential for duplicate customers to appear. We discuss this in greater detail here. This is obviously part of the Data Quality Service and we have some amazing code and fuzzy matching algorithms to find every variation you could imagine. You can choose to review questionable matches too, i.e. customers that might be the same but further investigation would be required.
As well as getting a full report of the findings of the analysis you will be provided with recommendations about best-practice approaches to managing your data ongoing. You will also be given the opportunity to re-run exactly the same analysis at future dates to determine how you are tracking over time.
Data Quality Reporting Tool
As part of a full Data Quality Initiative you may want to consider licensing a copy of the Data Quality Reporting Tool. This tool allows you to address the inconsistencies found in your data by delivering the information to front-line staff, so that they can act on it and update where necessary. This tool is completely adaptable to your organisational needs, you choose what you want to display and how it is to be delivered. It also allows you to keep track of progress and provide reports of the amount of data remediated by each user. Read here for more information.