Data Quality Part II

In Part I of this series, I showed how to create rules around data in a table. This post expands on the idea of data quality and the need for data quality integration in your systems and applications. Ideally, a good data quality solution will reduce the amount of ETL (and its associated overhead) required to move data sets between applications, and will help produce data that is both reliable and valuable to your organization.

The first step in a good data quality solution is data profiling. Properly profiling a data set helps you assess the risk involved in integrating it, find potential errors, and collect information about the data itself. When profiling a data set, you want to identify any problem that could corrupt your application data or cause the integration to fail.

What are some profiling checks that should halt an integration? Below are the ones I use most often (a code sketch of these checks follows the list):

  • New file record count = 0
  • Compare new file to previous file(s).
    • Exact same (duplicate transmissions)
    • File is significantly larger or smaller than typical (threshold processing)
  • Upper-bound and lower-bound dates
    • Dates in the data fall within a valid, expected range
  • No null key values
  • No duplicate keys
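
As an illustration, here is a minimal sketch of what these checks might look like in Python with pandas. The file names, key column, date column, and 50% size threshold are assumptions made up for the example, not values from this post; adapt them to your own feeds.

```python
import hashlib
from datetime import date

import pandas as pd


def profile_checks(new_path, prev_path, key_cols, date_col,
                   size_threshold=0.5,
                   min_date=date(2000, 1, 1), max_date=date.today()):
    """Run basic profiling checks; return a list of failure messages.

    An empty list means the new file passed every check and is safe to load.
    """
    failures = []

    new_df = pd.read_csv(new_path)
    prev_df = pd.read_csv(prev_path)

    # 1. New file record count = 0
    if len(new_df) == 0:
        failures.append("new file contains no records")

    # 2a. Exact duplicate of the previous transmission (compare file hashes)
    def file_hash(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    if file_hash(new_path) == file_hash(prev_path):
        failures.append("new file is an exact duplicate of the previous file")

    # 2b. File significantly larger or smaller than typical (threshold processing)
    if len(prev_df) > 0:
        change = abs(len(new_df) - len(prev_df)) / len(prev_df)
        if change > size_threshold:
            failures.append(
                f"record count changed by {change:.0%} "
                f"(threshold {size_threshold:.0%})"
            )

    # 3. Upper- and lower-bound dates: values must parse and fall in range
    dates = pd.to_datetime(new_df[date_col], errors="coerce")
    if dates.isna().any():
        failures.append(f"{date_col} contains missing or unparseable dates")
    out_of_range = (dates < pd.Timestamp(min_date)) | (dates > pd.Timestamp(max_date))
    if out_of_range.any():
        failures.append(f"{date_col} contains dates outside {min_date}..{max_date}")

    # 4. No null key values
    if new_df[key_cols].isna().any().any():
        failures.append("null values found in key columns")

    # 5. No duplicate keys
    if new_df.duplicated(subset=key_cols).any():
        failures.append("duplicate key values found")

    return failures


# Usage: halt the integration when any check fails.
problems = profile_checks("customers_new.csv", "customers_prev.csv",
                          key_cols=["customer_id"], date_col="signup_date")
if problems:
    raise RuntimeError("integration halted: " + "; ".join(problems))
```

The point of returning a list of failure messages rather than raising on the first problem is that you can log every issue found in the file before halting, which makes it much easier to fix a bad transmission in one pass.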