In addition to managing data, corporations must be able to manage and control the definition of the data elements used in databases. Without an understanding of the structure, limitations, definition, and description of data, it is likely that data will be misinterpreted or misused; further, data that is not well-defined can cause database integrity problems. This is a metadata issue.
But What Is Metadata?
Have you ever watched the Antiques Roadshow program on television? In this show people bring items to professional antique dealers to have them examined and evaluated. The participants hope to learn that their items are long-lost treasures of immense value. The antique dealers always spend a lot of time talking to the owners about their items. They always ask questions like “Where did you get this item?” and “What can you tell me about its history?” Now, the item is sitting right there in front of them, yet they ask these questions. Why? Because these details provide knowledge about the authenticity and nature of the item. The dealer also carefully examines the item looking for markings and dates that provide clues to the item’s origin.
Users of data must know what the data is before it becomes useful as information. Information about data is referred to as metadata. The simplest definition of metadata is “data about data.” But, to be a bit more precise, metadata describes data, providing information like data type, length, textual description, and other characteristics of the data. So, for example, metadata allows the user to know that the customer number is a five-digit numeric field, whereas the data itself might be 56789.
So, using our Antiques Roadshow example, the item being evaluated is the “data.” The answers to the antique dealer’s questions and the markings on the item are the “metadata.” Value is assigned to an item only after the metadata about that item is discovered and evaluated.
Metadata characterizes data. It is used to provide documentation such that data can be understood and more readily consumed by your organization. Metadata answers the who, what, when, where, why, and how questions for users of the data.
From Data to Knowledge and Beyond
The basic building block of knowledge is data. Data is a fact represented as an item or event out of context and with no relation to other things. Examples of data are 27, 010110, and JAN. Without additional details we know nothing about any of these three pieces of data. Consider:
- Is 27 a number in base ten, or is it in octal (which would translate to 23 in base ten)?
- If 27 is a number in base ten, what does it represent? Is it an age, a dollar amount, an IQ, a shoe size, or something else entirely?
- What about 010110? Is it a binary number? Or is it a representation of a date, perhaps January 1, 1910? January 1, 2010? Or something else entirely?
- Finally, what does JAN represent? Is it a woman’s name (or a man’s name)? Or does it represent the first month of the year? Or perhaps it is something else entirely?
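To make the ambiguity concrete, here is a minimal Python sketch that reads the same raw strings under different assumed metadata. The helper as_mmddyy and the choice of century are hypothetical, introduced only for illustration; nothing in the raw value tells us which reading is the right one.

```python
from datetime import date

# The same raw strings, read under different (assumed) metadata.
raw_number = "27"
as_base_ten = int(raw_number, 10)   # treat the digits as a base ten number
as_octal = int(raw_number, 8)       # treat the same digits as an octal number

raw_bits = "010110"
as_binary = int(raw_bits, 2)        # treat the digits as a binary number


def as_mmddyy(text: str, century: int) -> date:
    """Hypothetical helper: read six characters as MMDDYY in the given century."""
    month, day, year = int(text[0:2]), int(text[2:4]), int(text[4:6])
    return date(century + year, month, day)


print(as_base_ten, as_octal)          # 27 23
print(as_binary)                      # 22
print(as_mmddyy(raw_bits, 1900))      # 1910-01-01
print(as_mmddyy(raw_bits, 2000))      # 2010-01-01
```

Every one of these results is a valid reading of the stored characters. Only the metadata (number base, date format, century) can say which one was intended.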
All of these are examples of data because of the lack of context. Information, on the other hand, adds context through relationships between data, and possibly other information. Data in context with metadata makes information. The relationships may represent information, yet the relations do not actually constitute information until they are understood. Also, the relationships that represent data have a tendency to be limited in context, mostly about the past or present, with little if any implication for the future.
Webster’s New Collegiate Dictionary defines knowledge as “the fact or condition of knowing something with familiarity gained through experience or association.” Knowledge adds understanding and retention to information. It is the next natural progression after information. To have “knowledge” requires information in conjunction with patterns between data, information, and other knowledge. So knowledge couples data with understanding and cognition.
The final step would be to move from knowledge to wisdom. Wisdom can be thought of as knowledge applied. You may have the knowledge that fatty foods are bad for you, but if you eat them anyway, you are not wise.
So, in order for data to be anything more than simply data, metadata is required. Without metadata, data has no identifiable meaning; it is merely a collection of digits, characters, or bits. Metadata gives data its form and makes it usable by information professionals. Furthermore, metadata management is a prerequisite for truly treating data as a corporate asset.
Types of Metadata
Even though all metadata describes data, there are many different types and sources of metadata. On one level, though, all metadata boils down to one of two types: technology metadata or business metadata. Technology metadata describes the technical aspects of the data as it relates to storing and managing the data in computerized systems. Business metadata, on the other hand, describes aspects of how the data is used by the business, and is needed for the data to have value to the organization. So, knowing that the LICNO column is a positive integer between 1 and 9,999,999 is technology metadata. Of course, this information is useful to, and required by, the business user, too. Knowing that the LICNO column is the practitioner license number for certified course instructors, that it must be unique, and that every instructor can have one and only one license number is business metadata (though these details are also useful to the DBA in order to create the database appropriately and effectively).
For DBAs, the DBMS itself is a good source of metadata. The system catalog (or perhaps, data dictionary, depending on your particular DBMS of choice) is used to store information about database objects and is a vital store of metadata, mostly technology metadata. DBAs and developers make regular use of the metadata in the DBMS system catalog to help them better understand database objects and the data contained therein. Depending on the DBMS, the user can write queries against the system catalog tables or views, or can execute system-provided stored procedures to return metadata from the system catalog tables. Just about any type of descriptive information about the composition of the data may be found in the system catalog. For example, most DBMSs store all of the following metadata in the system catalog:
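One way to picture the two types side by side is a small record that carries both. The sketch below is only an illustration: the class ColumnMetadata and its field names are my own assumptions, not part of any DBMS or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """Hypothetical record holding both kinds of metadata for one column."""
    # Technology metadata: how the value is stored and constrained.
    name: str
    data_type: str
    min_value: int
    max_value: int
    # Business metadata: what the value means to the organization.
    description: str
    unique: bool

    def accepts(self, value: int) -> bool:
        """Check a value against the technical range only."""
        return self.min_value <= value <= self.max_value


licno = ColumnMetadata(
    name="LICNO",
    data_type="integer",
    min_value=1,
    max_value=9_999_999,
    description="Practitioner license number for a certified course instructor",
    unique=True,
)

print(licno.accepts(56789))       # True
print(licno.accepts(10_000_000))  # False
```

Note that the range check uses only the technology metadata. It can say that 56789 is storable, but it takes the business metadata to say what that number identifies.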
- The names of every database, table, column, index, view, relationship, stored procedure, trigger, and so on.
- The primary key for each table and any foreign keys that refer back to that primary key.
- Which tables are in which views.
- The data type, length, and constraints for each column of every table.
- The names of the physical files used to store database data, as well as information about file storage, extents, and disk volumes.
- Authorization and security information detailing which users have what type of authority on which database objects.
- The date and time of the last database definition change, as well as the ID of the user who implemented the DDL for the change.
- Database organization information.
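As a rough illustration of querying a catalog, the sketch below uses SQLite through Python's built-in sqlite3 module as a stand-in DBMS. The instructor table is a made-up example, and other DBMSs expose their catalogs through different tables, views, or procedures, so treat the specific names here as SQLite-only.

```python
import sqlite3

# An in-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor ("
    " licno INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL)"
)

# sqlite_master is SQLite's catalog table: one row per table, index, view, or trigger.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) for each column.
columns = {
    row[1]: {"type": row[2], "not_null": bool(row[3]), "primary_key": bool(row[5])}
    for row in conn.execute("PRAGMA table_info(instructor)")
}

print(tables)
for column_name, details in columns.items():
    print(column_name, details)
```

The point is not the SQLite syntax but the pattern: names, data types, and key information come back from the DBMS itself rather than from separate documentation.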
The DBMS system catalog is a particularly effective source of metadata because it is active, integrated, and nonsubvertible. The system catalog is active because the metadata is automatically built and maintained as database objects are created and modified. As the DBA creates databases, the DBMS automatically collects and populates metadata in the system catalog. The integration of the system catalog and the DBMS, coupled with the active nature of the system catalog, keeps the technology metadata in the system catalog accurate and up-to-date. Additionally, the DBMS system catalog is nonsubvertible, meaning that normal DBMS operations are the only mechanism for populating the system catalog. Of course, the subvertibility of the system catalog will differ from DBMS to DBMS. Some DBMSs provide options to enable direct updates to the system catalog – but such an option is to be used only in emergency situations and generally under the direction of the DBMS vendor’s technical support personnel.
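The active and nonsubvertible properties can be seen in miniature with SQLite, again used here only as a convenient stand-in via Python's sqlite3 module. The course table and index are hypothetical, and the behavior shown is SQLite's default; other DBMSs enforce this in their own ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def catalog_names(connection):
    """Return the object names currently recorded in SQLite's catalog."""
    rows = connection.execute("SELECT name FROM sqlite_master ORDER BY name")
    return [row[0] for row in rows]


before = catalog_names(conn)  # empty database, so the catalog is empty

# Active: running DDL is all it takes for the catalog to be populated.
conn.execute("CREATE TABLE course (course_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX course_title_ix ON course (title)")
after = catalog_names(conn)

# Nonsubvertible: by default, a direct write to the catalog is rejected.
try:
    conn.execute("INSERT INTO sqlite_master (type, name) VALUES ('table', 'fake')")
    direct_write_allowed = True
except sqlite3.Error:
    direct_write_allowed = False

print(before)                # []
print(after)                 # ['course', 'course_title_ix']
print(direct_write_allowed)  # False
```

SQLite does offer a writable_schema pragma that lifts this protection, which is an example of the kind of emergency-only option described above.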
Although a wealth of metadata can be found in the system catalog, this DBMS metadata usually is insufficient to fully describe data. For example, descriptions of database objects are not commonly found in the DBMS system catalog. Some DBMSs provide system catalog description columns that can be populated at the DBA’s discretion. But many DBAs avoid doing so for fear of disorganizing the system catalog or perhaps just because descriptions for the database objects were not available when the objects were created. Additional metadata that is useful, but not found in the system catalog, includes:
- Metadata for non-database files (flat or sequential files).
- Modification information regarding when data in the database was last changed, and by whom.
- Copybook information for the database table (or non-database file), as well as which programs use that information.
- Information on batch jobs and transactions that access the data.
- Operational metadata on IT infrastructure components.
- Data model metadata describing the logical database design and how it maps to the physical database implementation.
- Data warehousing and ETL metadata defining data source(s), system of record, a date and timestamp when the data was last updated, and other analytical information.
- Data ownership and stewardship metadata.
Of course, this is an incomplete list. A myriad of different metadata types and purposes exists that can be cataloged and managed. Capturing and maintaining metadata better documents databases and systems, thereby making them easier to use. The more metadata available to business users, the more value they will be able to extract from their information systems.
Where Is Metadata Stored?
A repository can be used to store information about an organization’s data assets. In other words, repositories are used to store metadata. Repository technology can be quite useful when implemented properly. A correctly implemented repository stores all pertinent metadata for the corporation. It can act as a single, centralized mechanism to assist in the migration of data from multiple sources to a data warehouse.
But the days of the monolithic, capital-R Repository for all metadata are mostly over. Sure, there are some organizations that still rely on Repository offerings such as those from CA or ASG, but these are less common than in the past. Today, metadata is housed in multiple repositories, each used by different technologies and applications.
Data dictionaries were the precursors to repository technology. Data dictionaries were popular in the 1980s, and some organizations still use data dictionaries (often of the homegrown variety). The purpose of a data dictionary is to manage data definitions. In general, data dictionaries offered little automation; the user had to manually key in the definitions. In some cases the data dictionary was integrated into the DBMS, and databases could be defined using the metadata in the data dictionary, but this was pre-relational, before DBMSs had system catalogs.
The Bottom Line
So the struggle today is trying to organize the cadre of repositories and attempting to rationalize all of the disparate locations and sources where metadata resides. Some shops are better than others at doing this. But I firmly believe that only those shops that understand the importance of their metadata can truly thrive as leaders… I mean, when you think about it, if you don’t know your metadata, you don’t know your data… and if you don’t know your data, how can you conduct business effectively or efficiently?