Using Access Auditing to Classify Database Data

According to a study conducted by the University of California at Berkeley, each year approximately 5 exabytes (10^18 bytes) of new information is produced. And ninety-two percent of that information is stored on magnetic media, mostly hard disks. Indeed, businesses today are gathering and storing more data than ever before. Multi-terabyte databases supporting data warehouses (and even OLTP systems) are common. Some organizations are even approaching a petabyte.

Table 1. Storage Abbreviations

Abbrev.   Term        Amount
KB        Kilobyte    1,024 bytes
MB        Megabyte    1,024 KB
GB        Gigabyte    1,024 MB
TB        Terabyte    1,024 GB
PB        Petabyte    1,024 TB
EB        Exabyte     1,024 PB
ZB        Zettabyte   1,024 EB
YB        Yottabyte   1,024 ZB

There are multiple factors contributing to this explosion of data. On the one hand, technology drives the demand for storage. Database technology has advanced to better store and manage unstructured data. Whereas structured data is the data most database professionals are familiar with (numbers, characters, and dates), unstructured data is basically anything else: everything from text files to images, audio files, and videos. When unstructured data is added to the mix, data management becomes more complicated because unstructured data is usually large and unwieldy compared to numeric and alphanumeric data. Additionally, different mechanisms are required to access and modify this data, complicating program development and database administration.

And technology does not stand still, so we continue to see advances in the amount and type of data that we can store and manage. One such example is managing data streams. Basically, it involves the ingestion of data – structured or unstructured – from arbitrary sources and the processing of it without necessarily persisting it. Any digitized data is fair game for stream computing. As the data streams, it is analyzed and processed in a problem-specific manner. The “sweet spot” applications for stream computing are situations in which devices produce large amounts of instrumentation data on a regular basis. Such data is difficult for humans to interpret easily and is likely too voluminous to be stored in a database. Examples of types of data that are well-suited for stream computing include healthcare, weather, telephony, stock trades, and so on.

By analyzing large streams of data and looking for trends, patterns, and “interesting” data, stream computing can solve problems that were not practical to address using traditional computing methods. Another useful way of thinking about this is as RTAP – real-time analytical processing (as opposed to OLAP – online analytical processing). The data management problem posed by streaming data is ensuring that the stream is vigilantly monitored for specific situations – and storing only what is needed to turn the data into intelligence.
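
To make that filter-and-persist pattern concrete, here is a minimal sketch of stream-style processing in Python: events are analyzed as they arrive and only the “interesting” ones are persisted. The event source, the threshold, and the persist function are hypothetical placeholders, not part of any particular streaming product.

```python
# Minimal sketch of stream-style processing: analyze readings as they arrive
# and persist only the "interesting" ones. The source, threshold, and sink
# below are hypothetical placeholders, not tied to any specific product.
import random
import time

def sensor_stream():
    """Simulate an endless stream of instrumentation readings."""
    while True:
        yield {"device": "pump-7", "reading": random.gauss(50, 10), "ts": time.time()}

def is_interesting(event, threshold=75.0):
    """Problem-specific analysis: flag readings that exceed a threshold."""
    return event["reading"] > threshold

def persist(event):
    """Stand-in for storing the event in a database or alerting system."""
    print("stored:", event)

def process(stream, max_events=1_000):
    for i, event in enumerate(stream):
        if i >= max_events:          # bound the demo run
            break
        if is_interesting(event):    # most events are analyzed and discarded
            persist(event)           # only actionable data is kept

process(sensor_stream())
```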

Another newer technology adding to the volume of data we must manage is RFID, or Radio Frequency Identification, a means of communicating without requiring direct contact or line of sight. Commonly implemented on chips applied to materials that need to be tracked, RFID technology is currently being adopted by many large companies to track inventory. The data management challenge posed by RFID technology is its ability to store highly granular information. Estimates of the data generation potential of RFID range from 10 to 100 times the data volume of today’s conventional UPC bar code technology. Clearly, as RFID becomes a source of data that needs to be integrated into enterprise data warehouses, storage growth will follow.

Yet technological change is not the only force contributing to data growth – government regulations are complicit, too. Regulatory compliance places stringent rules on data retention and management. Legislation such as Sarbanes-Oxley, which enforces financial accountability, Gramm-Leach-Bliley, which governs how financial institutions must protect personal information, and HIPAA, which protects the privacy of health-related information, are the most prominent of the recent regulations. Complying with these laws usually requires retaining additional data and possibly capturing more details about how, when, and by whom the data is being used. The amount of database access information required for regulatory compliance can be so large that this additional data creates its own storage management burden.

So the amount of data under management is expanding, and as it does, cost expands too. The data has to be stored somewhere, and that storage carries both the initial cost of the device and the ongoing cost of managing it. All of these factors cause organizations to look for new methods of minimizing cost.

Some organizations undertake the difficult process of determining what data can be eliminated. But this can be a dangerous route to follow. Enterprise data growth is going to happen; you can’t stop it, and you probably can’t, or don’t want to, slow it down, because as organizations become more analytically mature they discover new kinds of data they would like to capture and analyze.

Another, more realistic, tactic is to categorize data based on its usage and importance, and then deploy it on the most cost-effective storage medium for its type. Such an approach is referred to as Information Lifecycle Management, or ILM. But there are several issues that need to be resolved to make such an approach effective:

  • A useful categorization technique needs to be defined
  • A method of measuring data such that it can be categorized is required
  • All of the available storage options need to be aligned with the data categories
  • A technique needs to be created for moving the data to the appropriate storage devices

Categorizing Data – Multi-Temperature Data

The first step in ILM is to develop a data classification scheme. Teradata has defined one approach called multi-temperature data. This technique defines four categories: Hot, Warm, Cool, and Dormant. The temperature of data is a function of the access rate for queries, updates, and data maintenance. In other words, hot data is accessed more frequently than warm data, which is accessed more frequently than cool data. Finally, dormant data is rarely updated or queried and is part of a static data model.
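
As a rough illustration, a classification along these lines can be expressed as a simple function of observed access rates. The thresholds below are purely illustrative assumptions, not part of Teradata’s definition; in practice they would be tuned to your own workload.

```python
# A minimal sketch of the multi-temperature idea: derive a temperature from
# observed access rates. The numeric thresholds are illustrative assumptions.
def classify_temperature(queries_per_day, updates_per_day, schema_is_static):
    accesses = queries_per_day + updates_per_day
    if accesses < 1 and schema_is_static:
        return "Dormant"   # rarely touched and part of a static data model
    if accesses >= 1000:
        return "Hot"
    if accesses >= 50:
        return "Warm"
    return "Cool"

print(classify_temperature(queries_per_day=5000, updates_per_day=200, schema_is_static=False))  # Hot
print(classify_temperature(queries_per_day=0, updates_per_day=0, schema_is_static=True))        # Dormant
```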

Data Access Auditing

Certain characteristics of your data can help to determine its appropriate temperature. For example, data tends to cool over time. And often the temperature of the data correlates well with data volume: the larger the volume, the cooler the data. But these are not hard and fast rules. What is needed is a technique to audit the frequency and type of data access for every piece of data in the database.

Generating an accurate view of all data access has traditionally been a difficult exercise. The difficulty stems from a combination of factors: databases store huge volumes of data, they are highly complex, and the mapping of data access must be done in a non-intrusive manner. Downtime is not an option, and any solution that degrades the performance or reliability of applications is a non-starter.

Data access auditing solutions are available that can provide this type of information. The best of these solutions perform non-intrusive, enterprise-wide data access classification by monitoring the database server and inspecting all SQL requests and responses. Using this non-intrusive database monitoring technology, one can quickly get a view of all data access at a very granular level. By monitoring queries using network inspection, it is possible to create a detailed classification of which data is being used, how it is being used, and when it is being used – with minimal impact on the database or the application.
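
Once the SQL traffic has been captured, reducing it to per-object access statistics is straightforward. The sketch below assumes a hypothetical list of (timestamp, SQL text) pairs as input; it is not the capture format of any particular auditing product.

```python
# A sketch of how captured SQL traffic might be reduced to per-table access
# statistics. The input is a hypothetical list of (timestamp, sql) pairs;
# commercial auditing tools gather this kind of data via network inspection.
import re
from collections import Counter

TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN|UPDATE|INTO)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def tally_table_access(captured_statements):
    """Count how often each table is referenced in the captured SQL."""
    counts = Counter()
    for _ts, sql in captured_statements:
        for table in TABLE_PATTERN.findall(sql):
            counts[table.lower()] += 1
    return counts

sample = [
    ("2024-01-01T10:00:00", "SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id"),
    ("2024-01-01T10:00:05", "UPDATE orders SET status = 'shipped' WHERE id = 42"),
    ("2024-01-01T10:00:09", "SELECT name FROM archive_2001 WHERE id = 7"),
]
print(tally_table_access(sample))
# Counter({'orders': 2, 'customers': 1, 'archive_2001': 1})
```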

This information can be used to classify the data using multiple dimensions – such as time, usage, and frequency – and then to translate that into the temperature of the data so it can be placed on the appropriate storage mechanism.

Examples of vendors offering data access auditing solutions include IBM’s Guardium, Imperva’s SecureSphere, Lumigent’s Audit DB, and Application Security’s DbProtect.

Storage Options

The next step is to categorize your storage devices for use with each data temperature. Hot data that is I/O intensive and requires high availability should be placed on storage devices offering high performance, reliability, advanced features, and large capacity. This typically means high-performance disk drives such as Fibre Channel, SAS, or SCSI devices; examples include EMC Symmetrix and CLARiiON. Some hot data can even be stored in main memory for optimal speed.

Warm data is accessed less frequently than hot data and often is read more than it is modified. Such data can be placed on less expensive disk devices that offer good performance and reliability but are not top-of-the-line: lower cost than the devices used for hot data, yet with better performance and accessibility than those used for cool or dormant data. SATA- and FATA-configured storage from EMC and others can work well for warm data.

Cool data is not dormant, but it is not accessed often. Such data usually still needs to reside on direct access storage devices. For cool data, consider using the lowest-cost disk devices available, perhaps NAS (network attached storage) or object-based storage such as EMC Centera or StorageTek LFCM 100.

Dormant data, that is, data which has not been accessed for a long time (years) and whose data model is stable, can be moved to offline storage systems such as intelligent tape or optical disk. Alternatively, dormant data can be archived to a separately managed archive data store so it no longer impacts operational systems but can still be safely accessed when needed.
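
Pulling the tiers together, the temperature-to-storage mapping can be captured very simply. The sketch below uses the device classes mentioned above purely as examples; the right choices depend on the storage actually available in your shop.

```python
# An illustrative mapping from data temperature to a storage tier, using the
# device classes discussed above as examples only.
STORAGE_TIERS = {
    "Hot":     {"tier": "high-performance disk (Fibre Channel/SAS/SCSI) or main memory",
                "examples": ["EMC Symmetrix", "EMC CLARiiON"]},
    "Warm":    {"tier": "lower-cost disk (SATA/FATA)",
                "examples": ["SATA/FATA-configured arrays"]},
    "Cool":    {"tier": "lowest-cost online disk (NAS or object-based storage)",
                "examples": ["EMC Centera", "StorageTek LFCM 100"]},
    "Dormant": {"tier": "offline or archive storage (intelligent tape, optical, archive store)",
                "examples": ["tape library", "archive data store"]},
}

def placement_for(temperature):
    return STORAGE_TIERS[temperature]["tier"]

print(placement_for("Warm"))  # lower-cost disk (SATA/FATA)
```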

ILM: Information Lifecycle Management

With both a data classification scheme and a storage classification scheme in place, an organization can optimize its storage usage while controlling cost. This process is known as Information Lifecycle Management, or ILM. An intelligent ILM scheme requires monitoring your data for temperature changes, continually grading new storage options, and managing your storage classification over time. Further, you will need to adjust the placement of data as its temperature and the storage classifications change. In so doing, your organization will be better prepared to intelligently manage the ongoing data onslaught that promises only to expand over time.
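
Tying the earlier sketches together, an ILM control loop might periodically re-derive each table’s temperature from fresh access statistics and flag anything whose classification has changed. This is only a sketch: classify_temperature and STORAGE_TIERS refer to the illustrative code above, and move_table stands in for whatever data movement or archiving facility your environment actually provides.

```python
# A sketch of a periodic ILM re-tiering pass, building on the earlier sketches.
# classify_temperature and STORAGE_TIERS are the illustrative definitions above;
# move_table is a placeholder for an actual data movement or archiving job.
def move_table(table, new_tier):
    print(f"migrate {table} to: {new_tier}")

def retier(access_stats, current_temps):
    """access_stats maps table -> (queries_per_day, updates_per_day, schema_is_static);
    current_temps maps table -> the temperature assigned on the previous pass."""
    for table, (queries, updates, static) in access_stats.items():
        new_temp = classify_temperature(queries, updates, static)
        if new_temp != current_temps.get(table):
            move_table(table, STORAGE_TIERS[new_temp]["tier"])   # placement changed
            current_temps[table] = new_temp
    return current_temps

# Example pass over statistics gathered by the access auditing step
stats = {"orders": (5000, 200, False), "archive_2001": (0, 0, True)}
print(retier(stats, current_temps={}))
```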
