Duplication of Data
Today’s blog entry on bad database standards is a little different from the previous three we’ve discussed here. In this case, the bad standard is that there is no formal standard at all. The first three entries in our ongoing “bad standards” series dealt with common, existing standards that are either outdated or not well thought out. This time, though, we tackle a standard that is “bad” because it most likely does not exist: very few shops have written a standard like the one I’m about to outline.
It is generally agreed in the industry that data growth is accelerating; some would say it is spiralling out of control. Businesses today are gathering and storing more data than ever before. And with this explosion in the amount of data being stored, organizations are relying more than ever on database management systems to get a handle on corporate data and extract useful business information from that raw data. But rarely is there a uniform, guiding data infrastructure in place that dictates when, why, how, and where data is to be stored.
But I’m getting ahead of myself a bit here. The missing standard that I am proposing is one that limits copies of the same data. One of the biggest contributors to data growth is that we copy and store the same data over and over and over again. It may reside in the production system in a DB2 database on the mainframe (and, oh yes, it was copied from an IMS database that still exists because there are business-critical transactions that have yet to be converted, and may not benefit from being converted). And then it is copied to the data warehouse (perhaps running Oracle on a Unix server), an operational data store, several data marts, and maybe even to an ad hoc SQL Server database in the business unit… and don’t forget those users who have the same data in an Excel spreadsheet (or even an Access database) on their desktop. This wanton copying has got to stop!
A DBMS is a viable, useful piece of software because it enables multiple users and applications to share data while ensuring data integrity and control. But human nature being what it is, everyone wants their own copy of the data to “play with” and/or manipulate. But at what cost? Data storage requirements are but one small piece of the cost. The bigger cost is the data integrity problems that are created. If you have customer data (for example) spread across 5 platforms, 4 database systems, and 3 different locations, what do you think the chances are that all of that data will be accurate? My guess would be that there is a zero percent chance!
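To make that integrity cost concrete, here is a minimal sketch of the problem. The systems, field names, and record values below are all invented for illustration: the same customer record has been copied into three hypothetical stores and then updated inconsistently, and a simple comparison shows how many fields no longer agree.

```python
# Hypothetical sketch: the same customer record, copied into three systems
# and updated independently. All source names and values are invented.
from typing import Dict, List


def find_discrepancies(copies: Dict[str, Dict[str, str]]) -> List[str]:
    """Report every field whose value differs between any two copies."""
    problems = []
    fields = {f for record in copies.values() for f in record}
    for field in sorted(fields):
        values = {src: rec.get(field) for src, rec in copies.items()}
        if len(set(values.values())) > 1:  # copies disagree on this field
            problems.append(f"{field}: {values}")
    return problems


customer_copies = {
    "mainframe_db2": {"name": "ACME Corp", "phone": "555-0100", "city": "Austin"},
    "warehouse":     {"name": "ACME Corp", "phone": "555-0199", "city": "Austin"},
    "desktop_excel": {"name": "Acme Corporation", "phone": "555-0100", "city": "Dallas"},
}

for issue in find_discrepancies(customer_copies):
    print(issue)
```

In this toy example every single field has drifted between at least two of the copies, and no system can tell you which value is correct. That reconciliation burden, multiplied across every duplicated table, is the real cost of uncontrolled copying.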
So we need to create standards that control, prohibit, and limit the mass duplication of data that is rampant within today’s companies. Of course, to do so requires a data management discipline to be enacted such that data is available and accurate to all potential consumers. If the data can be accessed efficiently from a single location, or at least fewer locations, we can reduce the amount of data we need to manage and improve data quality.
Doesn’t that sound like a win/win scenario?