Optimizing Database Performance, part 2: Denormalization and Clustering

In Part 1 of our series introducing database performance topics and considerations, we tackled the issues of partitioning data and creating indexes. Today, in Part 2, we will introduce the topics of denormalization and clustering.

Denormalization

Another way to optimize the performance of database access is to denormalize the tables. Before we address denormalization, though, let’s first briefly review normalization.

Normalization is the process of putting each fact in the most appropriate place. A normalized database implementation minimizes integrity problems and optimizes updating (perhaps at the expense of retrieval). When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This can slow the retrieval process. Updating is quicker, however, because the fact you’re updating exists in only one place.
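
To make this concrete, here is a minimal sketch of a normalized design. The customer and cust_order tables are hypothetical examples invented for illustration (written in standard SQL; exact syntax varies by DBMS), and they are reused in the sketches that follow. Each fact, such as a customer’s name, is stored in exactly one place:

    -- Hypothetical normalized design: each fact is stored exactly once.
    CREATE TABLE customer (
        cust_id    INTEGER      NOT NULL PRIMARY KEY,
        cust_name  VARCHAR(60)  NOT NULL,
        cust_city  VARCHAR(40)  NOT NULL
    );

    CREATE TABLE cust_order (
        order_id   INTEGER      NOT NULL PRIMARY KEY,
        cust_id    INTEGER      NOT NULL REFERENCES customer (cust_id),
        order_date DATE         NOT NULL,
        amount     DECIMAL(9,2) NOT NULL
    );

    -- Retrieving an order together with its customer's name requires a join:
    SELECT o.order_id, o.order_date, c.cust_name
      FROM cust_order o
      JOIN customer c ON c.cust_id = o.cust_id;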

Most applications require very rapid data retrieval. Some applications require specific tinkering to optimize performance at all costs. To accomplish this, sometimes the decision is made to denormalize the physical database implementation. Just as normalization is the process of assembling data in an organized manner to eliminate redundancies, denormalization is the process of deliberately introducing redundancy to your data. In other words, denormalization can be thought of as the process of putting one fact in numerous places. This can have the effect of speeding up the data retrieval process, usually at the expense of data modification.

Briefly defined, then, denormalization is the opposite of normalization. Denormalizing tables can be a good decision when a completely normalized design does not perform optimally. Keep in mind that the only reason ever to denormalize a relational database design is to enhance performance.

So the basic rule of thumb is to never denormalize data unless a performance need arises or your knowledge of the way your DBMS operates overrides the benefits of a normalized implementation. Individual DBMS products may have specific deficiencies and inefficiencies that may necessitate denormalizing for performance reasons. You should always consider the following issues before denormalizing:

  • Can the system achieve acceptable performance without denormalizing?
  • Will the performance of the system after denormalizing still be unacceptable?
  • Will the system be less reliable due to denormalization?

If the answer to any of these questions is yes, you should avoid denormalization because the benefits typically will not exceed the cost. If, after considering these issues, you decide to denormalize, be sure to adhere to the general guidelines that follow.

If enough disk space is available, consider creating two sets of tables: one set fully normalized and another denormalized. Populate the denormalized versions by querying the data in the normalized tables and loading or inserting it into the denormalized tables. Your application can access the denormalized tables in a read-only fashion and achieve performance gains, while at the same time modifying the normalized version and avoiding integrity problems in the base data. However, it is important to set up a controlled and scheduled population function to keep the denormalized tables synchronized with their normalized counterparts.
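
A minimal sketch of this approach, using the hypothetical tables from earlier: the denormalized copy is prejoined and treated as read-only, and a scheduled job rebuilds it from the normalized base tables. (On DBMSs that support them, materialized views can automate much of this.)

    -- Denormalized, read-only copy built from the normalized base tables.
    CREATE TABLE cust_order_denorm (
        order_id   INTEGER      NOT NULL PRIMARY KEY,
        order_date DATE         NOT NULL,
        amount     DECIMAL(9,2) NOT NULL,
        cust_id    INTEGER      NOT NULL,
        cust_name  VARCHAR(60)  NOT NULL,  -- redundant copy of customer.cust_name
        cust_city  VARCHAR(40)  NOT NULL   -- redundant copy of customer.cust_city
    );

    -- Controlled, scheduled population: rebuild the denormalized copy
    -- from the normalized tables (run from whatever scheduler you use).
    DELETE FROM cust_order_denorm;
    INSERT INTO cust_order_denorm
    SELECT o.order_id, o.order_date, o.amount,
           c.cust_id, c.cust_name, c.cust_city
      FROM cust_order o
      JOIN customer c ON c.cust_id = o.cust_id;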

If sufficient disk space is not available for two complete sets of tables, implement only the denormalized tables and maintain them programmatically. Be sure to update each denormalized table representing the same entity at the same time, or use database triggers to keep the redundant data synchronized.

When a column is replicated in many different tables, always update it everywhere simultaneously—or as close to simultaneously as possible—given the physical constraints of your environment. Triggers can be helpful to accomplish this. If the denormalized tables are ever out of sync with the normalized tables, be sure to inform end users that batch reports and online queries may not contain completely up-to-date data; avoid such out-of-sync periods if at all possible. And be sure to design the application so that it can easily be converted from using denormalized tables to using normalized tables.
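
One way to keep a replicated column synchronized is an update trigger on the base table. Trigger syntax varies widely by DBMS; the sketch below uses PostgreSQL-style syntax (version 11 or later) against the hypothetical tables from the earlier examples:

    -- When cust_name changes in the base table, propagate the change
    -- to the redundant copy so the two never drift apart.
    CREATE FUNCTION sync_cust_name() RETURNS trigger AS $$
    BEGIN
        UPDATE cust_order_denorm
           SET cust_name = NEW.cust_name
         WHERE cust_id = NEW.cust_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_name_sync
    AFTER UPDATE OF cust_name ON customer
    FOR EACH ROW
    EXECUTE FUNCTION sync_cust_name();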

Every denormalization decision should be documented, including the reason behind the decision and the exact changes made from the normalized logical data model. Such a record will help to ensure that future database changes are made with appropriate knowledge. Documentation will also make it clear that you didn’t simply make a design or implementation mistake.

Types of Denormalization

If you must denormalize, there are several options that you can consider:

  • Prejoined tables — when the cost of joining is prohibitive.
  • Report table — when specialized critical reports are too costly to generate.
  • Mirror table — when tables are required concurrently by two types of environments.
  • Split tables — when distinct groups use different parts of a table. The split could be row-wise or column-wise depending upon the needs of the accessing applications.
  • Combined tables — to consolidate one-to-one or one-to-many relationships into a single table.
  • Speed table — to support hierarchies like bill-of-materials or reporting structures.
  • Physical denormalization — to take advantage of specific DBMS characteristics.

You might also consider

  • Storing redundant data in tables to reduce the number of table joins required.
  • Storing repeating groups in a row to reduce I/O and possibly disk space.
  • Storing derivable data to eliminate calculations and costly algorithms (a sketch follows this list).
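
As a sketch of that last option, storing derivable data (again using the hypothetical customer and cust_order tables from earlier): keeping a running order total on customer means reports no longer recompute the sum on every query, at the cost of maintaining the redundant value whenever orders change.

    -- Add a derived column and populate it from the base data.
    ALTER TABLE customer ADD COLUMN total_order_amt DECIMAL(11,2);

    UPDATE customer c
       SET total_order_amt = (SELECT COALESCE(SUM(o.amount), 0)
                                FROM cust_order o
                               WHERE o.cust_id = c.cust_id);

    -- From here on, every INSERT/UPDATE/DELETE against cust_order must
    -- also adjust total_order_amt (programmatically or via triggers).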

Clustering

There are multiple meanings for the term clustering when it comes to database management systems. The clustering context that we will discuss here pertains to how data is stored on disk.

A clustered table will store its rows physically on disk in order by a specified column or columns. Clustering usually is enforced by the DBMS with a clustering index. The clustering index forces table rows to be stored in ascending order by the indexed columns. The left-to-right order of the columns, as defined in the index, determines the collating sequence for the clustered index. There can be only one clustering sequence per table (because physically the data can be stored in only one sequence).
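
How you define clustering is DBMS-specific. Two representative forms, assuming the hypothetical cust_order table from the denormalization examples:

    -- Db2-style: the CLUSTER keyword asks the DBMS to keep table rows
    -- in sequence by the indexed column.
    CREATE INDEX ix_order_date ON cust_order (order_date) CLUSTER;

    -- SQL Server-style: a clustered index defines the physical row order.
    -- (Assumes the table's primary key was defined NONCLUSTERED, since
    -- only one clustered index is allowed per table.)
    CREATE CLUSTERED INDEX ix_order_date ON cust_order (order_date);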

Figure 1. Clustered versus Unclustered Data

The figure above (Figure 1) demonstrates the difference between clustered and unclustered data and indexes; the clustered index is on top, the unclustered index is on the bottom. As you can see, the entries on the leaf pages of the top index are in sequential order — in other words, they are clustered. Clustering enhances the performance of queries that access data sequentially because fewer I/Os need to be issued to retrieve the same data.

Depending on the DBMS, the data may not always be physically maintained in exact clustering sequence. When a clustering sequence has been defined for a table, the DBMS will act in one of two ways to enforce clustering:

  1. When new rows are inserted, the DBMS will physically maneuver data rows and pages to fit the new rows into the defined clustering sequence; or
  2. When new rows are inserted, the DBMS will try to place the data into the defined clustering sequence, but if space is not available on the required page the data may be placed elsewhere.

The DBA must learn how the DBMS maintains clustering. If the DBMS operates as in the second scenario, data may become unclustered over time and require reorganization. A detailed discussion of database reorganization will appear later in this series. For now, though, back to our discussion of clustering.

Clustering tables that are accessed sequentially is good practice. In other words, clustered indexes are good for supporting range access, whereas unclustered indexes are better for supporting random access. Be sure to choose the clustering columns wisely. Use clustered indexes for the following situations:

  • Join columns, to optimize SQL joins where multiple rows match for one or both tables participating in the join
  • Foreign key columns because they are frequently involved in joins and the DBMS accesses foreign key values during declarative referential integrity checking
  • Columns referenced in WHERE-clause predicates
  • Range columns
  • Columns that do not change often (reducing the need to physically recluster)
  • Columns that are frequently grouped or sorted in SQL statements

In general, the clustering sequence that aids the performance of the most commonly accessed predicates should be used for clustering. When a table has multiple candidates for clustering, weigh the cost of sorting against the performance gained by clustering for each candidate key. As a rule of thumb, though, if the DBMS supports clustering, it is usually a good practice to define a clustering index for each table that is created (unless the table is very small).

Clustering is generally not recommended for primary key columns because the primary key is, by definition, unique. However, if ranges of rows frequently are selected and ordered by primary key value, a clustering index may be beneficial.

Page Splitting

When the DBMS has to accommodate inserts, and no space exists, it must create a new page within the database to store the new data. The process of creating new pages to store inserted data is called page splitting. A DBMS can perform two types of page splitting: normal page splits and monotonic page splits. Some DBMSs support both types of page splitting, while others support only one type. The DBA needs to know how the DBMS implements page splitting in order to optimize the database.

Figure 2. How a typical page split works

Figure 2 (above) depicts a normal page split. To accomplish this, the DBMS performs the following tasks in sequence:

  1. Creates a new empty page in between the full page and the next page
  2. Takes half of the entries from the full page and moves them to the empty page
  3. Adjusts any internal pointers to both pages and inserts the row accordingly

A monotonic page split is a much simpler process, requiring only two steps. The DBMS:

  • Creates a new page in between the full page and the next page
  • Inserts the new values into the fresh page

Monotonic page splits are useful when rows are being inserted in strictly ascending sequence. Typically, a DBMS that supports monotonic page splits will invoke one when a new row is added to the end of a page and the last addition was also to the end of the page.
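
The classic case is a table keyed on an ever-ascending value, such as this hypothetical log table, where every new row lands at the end of the last page:

    CREATE TABLE event_log (
        event_id  INTEGER      NOT NULL PRIMARY KEY,  -- ever-ascending key
        logged_at TIMESTAMP    NOT NULL,
        message   VARCHAR(200)
    );

    INSERT INTO event_log VALUES (1001, CURRENT_TIMESTAMP, 'job started');
    INSERT INTO event_log VALUES (1002, CURRENT_TIMESTAMP, 'step 1 complete');
    INSERT INTO event_log VALUES (1003, CURRENT_TIMESTAMP, 'job ended');
    -- When the last page fills, a monotonic split simply starts a fresh
    -- page for new rows instead of moving half of the old page's entries.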

When ascending rows are inserted and normal page splitting is used, a lot of space can be wasted because the DBMS will be creating half-full pages that never fill up. If the wrong type of page split is performed during database processing, wasted space will ensue, requiring the database object to be reorganized for performance.

Summary

Understanding normalization, denormalization techniques, and clustering will help you, as a DBA or performance analyst, deliver much better database performance for your clients and applications.

This series of blog posts is excerpted from Craig Mullins’ classic book Database Administration: The Complete Guide to DBA Practices and Procedures. Consult the book for more in-depth information on any of these topics – and much, much more.


