Bad Database Standards, Part 4

Duplication of Data

Today’s blog entry on bad database standards is a little different than the previous three we’ve discussed here. In this case, the bad standard is that there is no formal standard. The first three entries in our on-going “bad standards” series dealt with common, existing standards that are either outdated or not well thought-out. This time, though, we tackle a standard that is “bad” because it most likely does not exist. Very few shops have written a standard like the one I’m about to outline.

It is generally agreed in the industry that data growth is happening — some may say that it is spiralling out of control. Businesses today are gathering and storing more data than ever before. And with this explosion in the amount of data being stored, organizations are relying more than ever on database management systems to get a handle on corporate data and extract useful business information from that raw data. But rarely is there a uniform, guiding data infrastructure in place that dictates when, why, how, and where data is to be stored.

But I’m getting ahead of myself a bit here. The missing standard that I am proposing is one that limits copies of the same data. One of the biggest contributors to data growth is that we copy and store the same data over and over and over again. It may reside in the production system in a DB2 database on the mainframe (and, oh yes, it was copied from an IMS database that still exists because there are a business critical transactions that have yet to be converted, and may not benefit from being converted). And then it is copied to the data warehouse (perhaps running Oracle on a Unix server), an operational data store, several data marts, and maybe even to an ad hoc SQL Server database in the business unit… and don’t forget those users who have the same data in an Excel spreadsheet (or even an Access database) on their desktop. This wanton copying has got to stop!

A DBMS is a viable, useful piece of software because it enables multiple users and applications to share data while ensuring data integrity and control. But human nature being what it is, everyone wants their own copy of the data to “play with” and/or manipulate. But at what cost? Data storage requirements are but one, small piece of the cost. The bigger cost is the data integrity problems that are created. If you have customer data (for example) spread across 5 platforms, 4 database systems, and 3 different locations what do you think the chances are that all of that data will be accurate? My guess would be that there is a zero percent chance!

So we need to create standards that control, prohibit, and limit the mass duplication of data that is rampant within today’s companies. Of course, to do so requires a data management discipline to be enacted such that data is available and accurate to all potential consumers. If the data can be accessed efficiently from a single location, or at least fewer locations, we can reduce the amount of data we need to manage and improve data quality.

Doesn’t that sound like a win/win scenario?

Advertisements

About craig@craigsmullins.com

I'm a strategist, researcher, and consultant with nearly three decades of experience in all facets of database systems development.
This entry was posted in data integrity, standards. Bookmark the permalink.

4 Responses to Bad Database Standards, Part 4

  1. Garnie Bolling says:

    hmmm… the age old question…

    Governance = politics

    access = my own domain / my data

    coming to some agreement on controlling duplication… we need the Jedi mind trick…

  2. Keep up the excellent work.

    I’ve had one client that really structured with primary keys. There were times when they mixed natural keys along with surrogate keys on the same row, cases where they throw in an artificial timestamp that had no business value only to make the row unique, and many cases in which they refused to use sequences instead went with select count(*) FROM where MATCH_BY_PK(s)

    I’d be interested in hearing your thoughts on this subject matter.

    Cheers,
    Richard L. Burton III

    • Sounds like a mess to me, Richard! I am a proponent of using natural keys when possible – – especially so when the key values need to be entered by end users. Surrogate keys are difficult to use because they have no inherent business value. The “value” of surrogate keys may be experienced if you’ve got a very large number of columns in a multi-column “natural” key and the code therefore becomes unwieldy.

      So, as with most things, the “almost never say always or never” maxim applies, I generally tend to favor natural keys whenever possible.

  3. Trinity says:

    wow, awesome blog post. Really looking forward to read more. Cool.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s