Thoughts on Data Quality

Data quality is an elusive goal. The more we try to keep accurate records in database systems, the more it seems that at least some of that data is just plain wrong. Why is this so?

Speed-to-market pressures tend to hinder data quality improvement in today’s world of web-enabled, rapidly deployed systems. Instead of operating within a well-defined data architecture that mandates data modeling and logical database design, we fall prey to the opportunistic techniques of the day.

There are many barriers and inhibitors to data quality. Yes, the rapid-fire nature of project planning and application development is one of these inhibitors, but it is not the only one. The data explosion exacerbates the problem: as we add more and more data without adding the people, processes, and policies needed to ensure its accuracy, data quality will continue to degrade.

The propensity to copy data all over the place is another big problem. Instead of using database management systems to share data among multiple applications, we copy it. Think about it; you’ve probably done it yourself. You’ve got that customer database up on the mainframe, perhaps in DB2 or even IMS. Then a new application gets funded for development and it needs customer data. But the users don’t want to access the customer data where it resides; they want their own copy, maybe even on the same DBMS. Then another project comes up and we create a copy in Oracle on Unix, and those SQL Server folks need a copy, too. And it gets downloaded to workstations in spreadsheets, another copy is made for the data warehouse, and oh, yeah, don’t forget to copy it out to those data marts, too.

With copies of the (supposedly) same data swimming all over the place, who can guarantee its accuracy? When it changes in one system, does it get changed everywhere? And what about the user who has a spreadsheet open in one window and an application open in another: who can ensure that outdated information in the spreadsheet does not wreck the application data?
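
One common defense against exactly this kind of staleness is an optimistic update, sketched below in generic SQL. The customer table, column names, and timestamp value are all hypothetical; the point is simply that the change is applied only if the row still looks the way it did when the copy was taken.

```sql
-- Optimistic update: apply the change only if the row has not been
-- modified since this copy of the data was read.
UPDATE customer
   SET credit_limit = 5000.00,
       last_updated = CURRENT_TIMESTAMP
 WHERE cust_id      = 12345
   AND last_updated = TIMESTAMP '2012-06-01 10:30:00';  -- value captured at read time

-- If the UPDATE reports zero rows affected, the local copy was stale
-- and must be re-read before the change is retried.
```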

Maintaining duplicate data implementations is like having multiple watches that do not tell the same time. Each one is just a little bit different from the next, off by a few seconds, or minutes, or even hours… And synchronizing the data can be a Herculean task.

Automation can help. ETL tools that automatically propagate data changes from one database to others can help keep multiple copies of the same data synchronized. But the results will only be as good as the automation you put in place, and it is doubtful that you will automate the modification of every copy of the data throughout your organization. A much better approach is to stop the wild copying of data and instead share it using a DBMS. That is what a DBMS is designed for, you know?
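
To make that concrete, here is a minimal sketch in generic SQL of sharing instead of copying; the crm.customer table, the app2 schema, and the user name are hypothetical. The new application is granted read access through a view, so there is no second copy to synchronize:

```sql
-- Hypothetical system of record: one authoritative customer table.
-- A new application reads it through a view instead of keeping a copy.
CREATE VIEW app2.v_customer AS
    SELECT cust_id,
           cust_name,
           cust_email
      FROM crm.customer;   -- the single shared copy

-- Read-only access for the new application; nothing to re-synchronize later.
GRANT SELECT ON app2.v_customer TO app2_user;
```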

Assuring data quality can be a struggle even in a centralized and homogeneous environment. Data stewards are needed to implement programs for data quality improvement, and such programs need to include both technology and business tasks. Without executive-level business folks intimately involved in the data quality effort, it will surely fail. Implementing data quality programs is costly, and without upper-level management buy-in, human nature will kick in and we’ll start copying data hither and yon all over again.

What is required for data quality? At a high level, there are six components: accuracy, integrity, security, accessibility, nonredundancy, and flexibility. Accuracy involves data modeling and implementing physical databases that correlate to those models. Integrity has a two-fold meaning: building rules into the DBMS that define the data’s allowable values, and ensuring that the database can be backed up and recovered in the event of hardware or software failures. Security involves protecting the data from unauthorized access. Accessibility means that the data is available in an optimal manner to those who are authorized to use it. Nonredundancy means that data is not stored inconsistently in multiple places. And flexibility refers to the ability to modify database structures as changes are required.
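
As a brief illustration of the first sense of integrity, here is a hedged sketch in generic SQL (the table and column names are hypothetical) showing how allowable values can be declared to the DBMS itself rather than policed in application code:

```sql
-- Illustrative only: declarative integrity rules defined to the DBMS.
CREATE TABLE customer (
    cust_id      INTEGER       NOT NULL PRIMARY KEY,
    cust_name    VARCHAR(100)  NOT NULL,
    cust_status  CHAR(1)       NOT NULL
        CHECK (cust_status IN ('A', 'I', 'P')),  -- active, inactive, pending
    credit_limit DECIMAL(9,2)
        CHECK (credit_limit >= 0),               -- no negative credit limits
    region_id    INTEGER       NOT NULL
        REFERENCES region (region_id)            -- must match an existing region
);
```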

Both database administration and data administration need to be involved as well. DA sets the policies and rules for data quality, and DBA helps to implement them using existing database technology. Data management tools for database change management, backup and recovery, and performance assurance can help to achieve these goals from a physical database perspective. From a logical perspective, there are tools to automate and facilitate data modeling, metadata management, and data profiling. Of course, tools aren’t the only answer, but they sometimes can make an impossible task at least possible.
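
Data profiling, for instance, often starts with nothing more exotic than queries like this sketch (generic SQL; the customer table and its columns are hypothetical), which surfaces duplicate keys and missing values before a quality program formalizes such checks:

```sql
-- Quick profiling: how complete and how unique is the customer data?
SELECT COUNT(*)                AS total_rows,
       COUNT(DISTINCT cust_id) AS distinct_ids,    -- should equal total_rows
       SUM(CASE WHEN cust_email IS NULL
                THEN 1 ELSE 0 END) AS missing_emails
  FROM customer;
```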

2 Responses to Thoughts on Data Quality

  1. Antoine Proult says:

    Dealing with data quality is, in a sense, dealing with data as a critical asset for the enterprise: assets that have differentiated lifecycles and importance.
    Once everyone has a shared understanding of this idea, you can engage in an enterprise data management strategy and become very successful. I’ve been through such an experience, dealing with all the ideas and concepts identified in your article. Until then, data quality remains elusive, as you state.

  2. Paras Doshi says:

    Thanks for sharing your thoughts on Data Quality! And I liked the way you concluded your post: “tools aren’t the only answer, but they sometimes can make an impossible task at least possible.” – Thanks
