Optimizing Database Performance, Part 2: Denormalization and Clustering

In Part 1 of our series introducing database performance topics and considerations, we tackled the issues of partitioning data and creating indexes. Today, in Part 2, we will introduce the topics of denormalization and clustering.

Denormalization

Another way to optimize the performance of database access is to denormalize the tables. So before we address denormalization, let’s first briefly talk about normalization.

Normalization is the process of putting each fact in the most appropriate place. A normalized database implementation minimizes integrity problems and optimizes updating (perhaps at the expense of retrieval). When a fact is stored in only one place, retrieving many different but related facts usually requires going to many different places. This can slow the retrieval process. Updating is quicker, however, because the fact you’re updating exists in only one place. A brief introduction to normalization and the normal forms can be found in A Brief Introduction to Data Normalization, elsewhere on this blog.

Most applications require very rapid data retrieval. Some applications require specific tinkering to optimize performance at all costs. To accomplish this, sometimes the decision is made to denormalize the physical database implementation. Just as normalization is the process of assembling data in an organized manner to eliminate redundancies, denormalization is the process of deliberately introducing redundancy to your data. In other words, denormalization can be thought of as the process of putting one fact in numerous places. This can have the effect of speeding up the data retrieval process, usually at the expense of data modification.

So briefly defined, denormalization, the opposite of normalization, is the process of putting one fact in many places. Denormalizing tables can be a good decision when a completely normalized design does not perform optimally. The only reason to ever denormalize a relational database design is to enhance performance. 

So the basic rule of thumb is to never denormalize data unless a performance need arises or your knowledge of the way your DBMS operates overrides the benefits of a normalized implementation. Individual DBMS products may have specific deficiencies and inefficiencies that may necessitate denormalizing for performance reasons. You should always consider the following issues before denormalizing:

  • Can the system achieve acceptable performance without denormalizing?
  • Will the performance of the system after denormalizing still be unacceptable?
  • Will the system be less reliable due to denormalization?

If the answer to any of these questions is yes, you should avoid denormalization because the benefits typically will not exceed the cost. If, after considering these issues, you decide to denormalize, be sure to adhere to the general guidelines that follow.

If enough disk space is available, consider creating two sets of tables: one set fully normalized and another denormalized. Populate the denormalized versions by querying the data in the normalized tables and loading or inserting it into the denormalized tables. Your application can access the denormalized tables in a read-only fashion and achieve performance gains, while at the same time modifying the normalized version and avoiding integrity problems in the base data. However, it is important to set up a controlled and scheduled population function to synchronize the normalized table with the denormalized.
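
As a rough illustration of this approach, the scheduled population step might look something like the following sketch. The customer, orders, and cust_order_denorm tables (and their columns) are hypothetical placeholders to be adapted to your own schema.

-- Refresh a read-only, denormalized reporting table from the normalized
-- base tables; run this on a controlled schedule.
DELETE FROM cust_order_denorm;

INSERT INTO cust_order_denorm
       (cust_no, cust_name, order_no, order_date, order_total)
SELECT c.cust_no, c.cust_name, o.order_no, o.order_date, o.order_total
FROM   customer c
JOIN   orders o ON o.cust_no = c.cust_no;

COMMIT;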

If sufficient disk space is not available for two complete sets of tables, implement only the denormalized tables and maintain them programmatically. Be sure to update each denormalized table representing the same entity at the same time, or use database triggers to keep the redundant data synchronized.

When a column is replicated in many different tables, always update it everywhere simultaneously—or as close to simultaneously as possible—given the physical constraints of your environment. Triggers can be helpful to accomplish this. If the denormalized tables are ever out of sync with the normalized tables, be sure to inform end users that batch reports and online queries may not contain completely up-to-date data; if at all possible, this should be avoided. And be sure to design the application so that it can easily be converted from using denormalized tables to using normalized tables.
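
For example, a trigger along these lines could propagate a change to a replicated column from the normalized table to its denormalized copy. This is only a sketch: trigger syntax varies by DBMS, and the table and column names are hypothetical.

-- Keep the replicated cust_name column in sync with the base customer table.
CREATE TRIGGER sync_cust_name
  AFTER UPDATE OF cust_name ON customer
  REFERENCING NEW AS n
  FOR EACH ROW
  UPDATE cust_order_denorm
  SET    cust_name = n.cust_name
  WHERE  cust_no   = n.cust_no;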

Every denormalization decision should be documented, including the reason behind the decision and the exact changes made from the normalized logical data model. Such a record will help to ensure that future database changes are made with appropriate knowledge. Documentation will also make it clear that you didn’t simply make a design or implementation mistake.

Types of Denormalization

If you must denormalize, there are several options that you can consider:

  • Prejoined tables — when the cost of joining is prohibitive.
  • Report table — when specialized critical reports are too costly to generate.
  • Mirror table — when tables are required concurrently by two types of environments.
  • Split tables — when distinct groups use different parts of a table. The split could be row-wise or column-wise depending upon the needs of the accessing applications.
  • Combined tables — to consolidate one-to-one or one-to-many relationships into a single table.
  • Speed table — to support hierarchies like bill-of-materials or reporting structures.
  • Physical denormalization — to take advantage of specific DBMS characteristics.

You might also consider

  • Storing redundant data in tables to reduce the number of table joins required.
  • Storing repeating groups in a row to reduce I/O and possibly disk space.
  • Storing derivable data to eliminate calculations and costly algorithms.
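
As one small example of the last item (storing derivable data), a redundant order_total column might be added so that queries no longer have to re-aggregate line items. The sketch below uses hypothetical orders and order_line tables; adapt the names and the maintenance strategy to your own environment.

-- Store the derivable order total on the orders table itself.
ALTER TABLE orders ADD COLUMN order_total DECIMAL(11,2);

UPDATE orders o
SET    order_total = (SELECT SUM(l.qty * l.unit_price)
                      FROM   order_line l
                      WHERE  l.order_no = o.order_no);

Of course, once stored, the derived value must be maintained (by triggers or program logic) whenever the underlying line items change.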

Clustering

There are multiple meanings for the term clustering when it comes to database management systems. The clustering context that we will discuss here pertains to how data is stored on disk.

A clustered table will store its rows physically on disk in order by a specified column or columns. Clustering usually is enforced by the DBMS with a clustering index. The clustering index forces table rows to be stored in ascending order by the indexed columns. The left-to-right order of the columns, as defined in the index, defines the collating sequence for the clustered index. There can be only one clustering sequence per table (because physically the data can be stored in only one sequence).


Figure 1. Clustered versus Unclustered Data

The figure above (Figure 1) demonstrates the difference between clustered and unclustered data and indexes; the clustered index is on top, the unclustered index is on the bottom. As you can see, the entries on the leaf pages of the top index are in sequential order — in other words, they are clustered. Clustering enhances the performance of queries that access data sequentially because fewer I/Os need to be issued to retrieve the same data.
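
For example, defining a clustering index might look like the following sketch. The DB2-style CLUSTER keyword is shown purely for illustration; other DBMSs use different syntax (SQL Server, for instance, uses CREATE CLUSTERED INDEX), and the table and column names are hypothetical.

-- Ask the DBMS to keep orders rows stored on disk in order_date sequence.
CREATE INDEX ix_order_date
    ON orders (order_date)
    CLUSTER;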

Depending on the DBMS, the data may not always be physically maintained in exact clustering sequence. When a clustering sequence has been defined for a table, the DBMS will act in one of two ways to enforce clustering:

  1. When new rows are inserted, the DBMS will physically maneuver data rows and pages to fit the new rows into the defined clustering sequence; or
  2. When new rows are inserted, the DBMS will try to place the data into the defined clustering sequence, but if space is not available on the required page the data may be placed elsewhere.

The DBA must learn how the DBMS maintains clustering. If the DBMS operates as in the second scenario, data may become unclustered over time and require reorganization. A detailed discussion of database reorganization will appear in a later installment of this series. For now, though, back to our discussion of clustering.

Clustering tables that are accessed sequentially is good practice. In other words, clustered indexes are good for supporting range access, whereas unclustered indexes are better for supporting random access. Be sure to choose the clustering columns wisely. Use clustered indexes for the following situations:

  • Join columns, to optimize SQL joins where multiple rows match for one or both tables participating in the join
  • Foreign key columns because they are frequently involved in joins and the DBMS accesses foreign key values during declarative referential integrity checking
  • Predicates in a WHERE clause
  • Range columns
  • Columns that do not change often (reduces the need to physically recluster the data)
  • Columns that are frequently grouped or sorted in SQL statements

In general, the clustering sequence that aids the performance of the most commonly accessed predicates should be used for clustering. When a table has multiple candidates for clustering, weigh the cost of sorting against the performance gained by clustering for each candidate key. As a rule of thumb, though, if the DBMS supports clustering, it is usually a good practice to define a clustering index for each table that is created (unless the table is very small).

Clustering is generally not recommended for primary key columns because the primary key is, by definition, unique. However, if ranges of rows frequently are selected and ordered by primary key value, a clustering index may be beneficial.

Page Splitting

When the DBMS has to accommodate inserts, and no space exists, it must create a new page within the database to store the new data. The process of creating new pages to store inserted data is called page splitting. A DBMS can perform two types of page splitting: normal page splits and monotonic page splits. Some DBMSs support both types of page splitting, while others support only one type. The DBA needs to know how the DBMS implements page splitting in order to optimize the database.


Figure 2. How a typical page split works

Figure 2 (above) depicts a normal page split. To accomplish this, the DBMS performs the following tasks in sequence:

  1. Creates a new empty page in between the full page and the next page
  2. Takes half of the entries from the full page and moves them to the empty page
  3. Adjusts any internal pointers to both pages and inserts the row accordingly

A monotonic page split is a much simpler process, requiring only two steps. The DBMS

  • Creates a new page in between the full page and the next page
  • Inserts the new values into the fresh page

Monotonic page splits are useful when rows are being inserted in strictly ascending sequence. Typically, a DBMS that supports monotonic page splits will invoke it when a new row is added to the end of a page and the last addition was also to the end of the page.

When ascending rows are inserted and normal page splitting is used, a lot of space can be wasted because the DBMS will be creating half-full pages that never fill up. If the wrong type of page split is performed during database processing, wasted space will ensue, requiring the database object to be reorganized for performance.

Summary

Understanding normalization, denormalization techniques, and clustering will help you, as a DBA or performance analyst, deliver much better database performance for your clients and applications.

This series of blog posts is excerpted from Craig Mullins’ classic book Database Administration: The Complete Guide to DBA Practices and Procedures. Consult the book for more in-depth information on any of these topics – and much, much more.


Optimizing Database Performance, Part 1: Partitioning and Indexing

Today’s post is the start of a three-part series offering up an overview of the most important components of database performance that must be monitored and managed for any database implementation.

Database performance focuses on tuning and optimizing the design, parameters, and physical construction of database objects, specifically tables and indexes, and the files in which their data is stored. The actual composition and structure of database objects must be monitored continually and changed accordingly if the database becomes inefficient. No amount of SQL tweaking or system tuning can optimize the performance of queries run against a poorly designed or disorganized database.

Techniques for Optimizing Databases

The DBA must be cognizant of the features of the DBMS in order to apply the proper techniques for optimizing the performance of database structures. Most of the major DBMSs support the following techniques although perhaps by different names. Each of the following techniques can be used to tune database performance and will be discussed in subsequent sections.

  • Partitioning — breaking a single database table into sections stored in multiple files.
  • Raw partitions versus file systems — choosing whether to store database data in an OS-controlled file or not.
  • Indexing — choosing the proper indexes and options to enable efficient queries.
  • Denormalization — varying from the logical design to achieve better query performance.
  • Clustering — enforcing the physical sequence of data on disk.
  • Interleaving data — combining data from multiple tables into a single, sequenced file.
  • Free space — leaving room for data growth.
  • Compression — algorithmically reducing storage requirements.
  • File placement and allocation — putting the right files in the right place.
  • Page size — using the proper page size for efficient data storage and I/O.
  • Reorganization — removing inefficiencies from the database by realigning and restructuring database objects.

Partitioning

A database table is a logical manifestation of a set of data that physically resides on computerized storage. One of the decisions that the DBA must make for every table is how to store that data. Each DBMS provides different mechanisms that accomplish the same thing — mapping physical files to database tables. The DBA must decide from among the following mapping options for each table:

  • Single table to a single file. This is, by far, the most common choice. The data in the file is formatted such that the DBMS understands the table structure and every row inserted into that table is stored in the same file. However, this setup is not necessarily the most efficient.
  • Single table to multiple files. This option is used most often for very large tables or tables requiring data to be physically separated at the storage level. Mapping to multiple files is accomplished by using partitioned tablespaces or by implementing segmented disk devices. (A sketch of a range-partitioned table definition follows this list.)
  • Multiple tables to a single file. This type of mapping is used for small tables such as lookup tables and code tables, and can be more efficient from a disk utilization perspective.
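
As promised above, here is a hedged sketch of a range-partitioned table illustrating the single-table-to-multiple-files option. Oracle-style PARTITION BY RANGE syntax is shown only as an example; DB2, SQL Server, PostgreSQL, and others use different clauses, and the table itself is hypothetical.

-- Each partition can be placed in its own tablespace or file by the DBA.
CREATE TABLE orders
      (order_no    INTEGER       NOT NULL,
       order_date  DATE          NOT NULL,
       order_total DECIMAL(11,2))
PARTITION BY RANGE (order_date)
      (PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01'),
       PARTITION p2024 VALUES LESS THAN (DATE '2025-01-01'));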

Partitioning helps to encourage parallelism. Parallelism is the process of using multiple tasks to access the database in parallel. A parallel request can be invoked to use multiple, simultaneous read engines for a single SQL statement. Parallelism is desirable because it can substantially reduce the elapsed time for database queries.

Multiple types of parallelism are based on the resources that can be invoked in parallel. For example, a single query can be broken down into multiple requests each utilizing a different CPU engine in parallel. In addition, parallelism can be improved by spreading the work across multiple database instances. Each DBMS offers different levels of support for parallel database queries. To optimize database performance, the DBA should be cognizant of the support offered in each DBMS being managed and exploit the parallel query capabilities.

Raw Partition vs. File System

For a UNIX-based DBMS environment, the DBA must choose between a raw partition and using the UNIX file system to store the data in the database. A raw partition is the preferred type of physical device for database storage because, when a file system is utilized, writes are cached by the operating system. When writes are buffered by the operating system, the DBMS does not know whether the data has been physically copied to disk or not. When the DBMS cache manager attempts to write the data to disk, the operating system may delay the write until later because the data may still be in the file system cache. If a failure occurs, data in a database using the file system for storage may not be 100% recoverable. This is to be avoided.

If a raw partition is used instead, the data is written directly from the database cache to disk with no intermediate file system or operating system caching. When the DBMS cache manager writes the data to disk, it will physically be written to disk with no intervention. Additionally, when using a raw partition, the DBMS will ensure that enough space is available and write the allocation pages. When using a file system, the operating system will not preallocate space for database usage.

From a performance perspective, there really is no advantage to having a secondary layer of caching at the file system or operating system level; the DBMS cache is sufficient. Actually, the additional work required to cache the data a second time consumes resources, thereby negatively impacting the overall performance of database operations.

Indexing

Creating the correct indexes on the tables in a database is perhaps the single greatest performance tuning technique that a DBA can apply. Indexes are used to enhance performance. Indexes are particularly useful for:

  • Locating rows by value(s) in column(s)
  • Making joins more efficient (when the index is defined on the join columns)
  • Correlating data across tables
  • Aggregating data
  • Sorting data to satisfy a query

Without indexes, all access to data in the database would have to be performed by scanning all available rows. Scans are very inefficient for very large tables.
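
A minimal sketch of the first two uses listed above, with hypothetical table and column names:

-- Supports locating employee rows by a value supplied in a WHERE clause.
CREATE INDEX ix_emp_lastname ON employee (last_name);

-- Supports joins between employee and department on the dept_no column.
CREATE INDEX ix_emp_deptno ON employee (dept_no);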

Designing and creating indexes for database tables actually crosses the line between database performance tuning and application performance tuning. Indexes are database objects created by the DBA with database DDL. However, an index is built to make SQL statements in application programs run faster. Indexing as a tuning effort is applied to the database to make applications more efficient when the data access patterns of the application vary from what was anticipated when the database was designed.

Before tuning the database by creating new indexes, be sure to understand the impact of adding an index. The DBA should have an understanding of the access patterns of the table on which the index will be built. Useful information includes the percentage of queries that access rather than update the table, the performance thresholds set within any service level agreements for queries on the table, and the impact of adding a new index to running database utilities such as loads, reorganizations, and recovery.

One of the big unanswered questions of database design is: “How many indexes should be created for a single table?” There is no set answer to this question. The DBA will need to use his expertise to determine the proper number of indexes for each table such that database queries are optimized and the performance of database inserts, updates, and deletes does not degrade. Determining the proper number of indexes for each table requires in-depth analysis of the database and the applications that access the database.

The general goal of index analysis is to use less I/O to the database to satisfy the queries made against the table. Of course, an index can help some queries and hinder others. Therefore, the DBA must assess the impact of adding an index to all applications and not just tune single queries in a vacuum. This can be an arduous but rewarding task.

An index affects performance positively when fewer I/Os are used to return results to a query. Conversely, an index negatively impacts performance when data is updated and the indexes have to be changed as well. An effective indexing strategy seeks to provide the greatest reduction in I/O with an acceptable level of effort to keep the indexes updated.

Some applications have troublesome queries that require significant tuning to achieve satisfactory performance. Creating an index to support a single query is acceptable if that query is important enough in terms of ROI to the business (or if it is run by your boss or the CEO). If the query is run infrequently, consider creating the index before the process begins and dropping the index when the process is complete.
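
In that situation the pattern is simply create, run, drop, as in the hedged sketch below (the index and column names are hypothetical, and DROP INDEX syntax differs slightly from DBMS to DBMS):

-- Build the index just before the infrequent process runs...
CREATE INDEX ix_rpt_deptdate ON employee (dept_no, hire_date);

-- ...run the report or batch process...

-- ...then remove the index so it adds no ongoing update overhead.
DROP INDEX ix_rpt_deptdate;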

Whenever you create new indexes, be sure to thoroughly test the performance of the queries they support. Additionally, be sure to test database modification statements to gauge the additional overhead of updating the new indexes. Review the CPU time, elapsed time, and I/O requirements to ensure that the indexes help. Keep in mind that tuning is an iterative process, and it may take time and several index tweaks to determine the impact of a change. There are no hard and fast rules for index creation. Experiment with different index combinations and measure the results.

When to Avoid Indexing

There are a few scenarios where indexing may not be a good idea. When tables are very small, say less than ten pages, consider avoiding indexes. Indexed access to a small table can be less efficient than simply scanning all of the rows because reading the index adds I/O requests.

Index I/O notwithstanding, even a small table can sometimes benefit from being indexed — for example, to enforce uniqueness or if most data access retrieves a single row using the primary key.

You may want to avoid indexing variable-length columns if the DBMS in question expands the variable column to the maximum length within the index. Such expansion can cause indexes to consume an inordinate amount of disk space and might be inefficient. However, if variable-length columns are used in SQL WHERE clauses, the cost of disk storage must be compared to the cost of scanning. Buying some extra disk storage is usually cheaper than wasting CPU resources to scan rows. Furthermore, the SQL query might contain alternate predicates that could be indexed instead of the variable-length columns.

Additionally, avoid indexing any table that is always accessed using a scan, that is, the SQL issued against the table never supplies a WHERE clause.

Index Overloading

Query performance can be enhanced in certain situations by overloading an index with additional columns. Indexes are typically based on the WHERE clauses of SQL SELECT statements. For example, consider the following SQL statement.

select emp_no, last_name, salary
from employee
where salary > 15000.00
;

Creating an index on the salary column can enhance the performance of this query. However, the DBA can further enhance the performance of the query by overloading the index with the emp_no and last_name columns, as well. With an overloaded index, the DBMS can satisfy the query by using only the index. The DBMS need not incur the additional I/O of accessing the table data, since every piece of data that is required by the query exists in the overloaded index.
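
A sketch of such an overloaded index for the query above follows. (Some DBMSs also provide an INCLUDE clause that carries non-key columns in the index without making them part of the index key.)

-- Every column the query references is carried in the index,
-- enabling index-only access.
CREATE INDEX ix_emp_salary
    ON employee (salary, emp_no, last_name);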

DBAs should consider overloading indexes to encourage index-only access when multiple queries can benefit from the index or when individual queries are very important.

Be sure to stay tuned for the next installment of this series on database performance, which will cover denormalization and clustering.


You Can’t Force Data Quality

How many times have you surfed the Web only to encounter a form that requests a slew of personal information before you are permitted to go any further? You know what I’m talking about. A company markets a white paper or poll results or something else that intrigues you, so you click on the link, and bang, there you are. You don’t have the information you wanted yet, but if you just fill out this form then you will be redirected to the information.

Makes you want to scream, doesn’t it? Some folks just close their browser or move on to something else. Some folks enter partially accurate information to see how little they need to provide without getting rejected. And some folks just provide bogus information.

Now sometimes completely bogus information won’t work. Maybe the form requires an email address to which the information will be sent. But hey, that’s what Gmail and Yahoo Mail were made for, right? Just create a new address, fill in the form using it, collect the information, then shut down, or ignore that email account for the rest of your life.

Of course, if you’re trying to sign up for a webinar this might not work because many companies remove generic email addresses. Who can blame them? They’re conducting a webinar to drum up business and gather leads. If you provide a generic email address the organizer will assume that you aren’t a good lead or maybe even a competitor trying to gather intelligence.

Then there’s the phone number. I almost never supply an accurate phone number. If the form allows, I type in “do not call me” as my phone number. The better web sites will not permit characters in a phone number though. So then I am forced to use the information number, 555-1212 (with my area code) or another bogus number. I get more than enough cold calls for things I don’t need already, thank you.

The point I’m trying to make is that these marketing tactics are responsible for the creation of a lot of bad quality data. But at least some of the data must be useful or the marketers wouldn’t use these tactics. And who can fault marketers for actually trying to target prospective customers? After all, that’s their job. And the information was evidently interesting enough to get you to click to it, right?

So what’s my point? Well, I have a couple of them. The first is that these Web forms need to be more stringently developed. For example, you should never be able to type characters into a phone number field. I’m talking about basic edit checks that every programmer should have been taught to do in Coding 101.

You also can check for and reject commonly submitted bogus items. For example, Mickey Mouse will never be your customer. And an address of 1313 Mockingbird Lane may be good for the Munsters, but not your customers. And while you’re at it, any phone number with a 555 prefix can be summarily rejected, too.
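
Since this is a database blog, here is a rough SQL sketch of that kind of check run against already-captured data. The customer table and its columns are hypothetical, and the phone test assumes a ten-digit, digits-only column.

-- Flag commonly submitted bogus entries for review.
SELECT cust_no, cust_name, phone, address
FROM   customer
WHERE  UPPER(cust_name) LIKE '%MICKEY MOUSE%'
   OR  UPPER(address)   LIKE '%1313 MOCKINGBIRD%'
   OR  SUBSTR(phone, 4, 3) = '555';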

If you’re really interested in accurate data, take the time to do some more robust edit checking. Do the area code and zip code entered actually exist? Do they match the city and state that was entered? For example, if someone enters the 512 area code (Austin, TX) but enters Pittsburgh, PA for the city and state, you know the data is bogus. Or at least suspect … after all, people do move and take their mobile phone number with them. I have a friend who moved from Chicago to Florida to New York to Texas and he still has a mobile phone with Chicago’s 630 area code.

And if you want to go even further, you can match up company names to known addresses for that company to verify that an actual, accurate company name is being provided. Of course, there are exceptions here, too. Maybe you work from a home office and you’ve provided a legitimate address.

The bottom line is that organizations can do better at verifying data in their customer-facing Web applications. But even then, you just can’t force data quality. There will still be people “out there” (like me) who find ways to enter good enough data to prevent someone emailing them or calling them, trying to sell them something all the time.

And the data quality fight continues …


A Few Database/DBMS Definitions

Just a quick post today to inform my readers of several technical definitions that I have written for TechTarget’s WhatIs? site. If you are not familiar with that site, click on the link and investigate it. It is a great way to hunt around and educate yourself on terms that may not be familiar to you.

Anyway, lately I’ve been working with the site to help develop some of their definitions in the data and database management realm. Here are a few of the ones I’ve participated in developing:

Of course, there are many other data and database-related definitions up on WhatIs?, as well as tons of other IT definitions. Be sure to check it out if you haven’t already done so!


The Surprising Things You Don’t Know About Big Data

 

I found this infographic to be informative and entertaining, so I’m sharing it here for the readers of my blog. Let me know what you think… should I post more things like this in the future?

Infographic: The Surprising Things You Don’t Know About Big Data. You can also find more infographics at Visualistan.


Programmer Makes Excuses, Too!

In our last post (DBA Excuses… and advice to programmers for overcoming them!) we examined some of the bigger excuses used by DBAs to avoid problems and work. But poor excuses are not the exclusive domain of the DBA; far from it! Application developers and programmers rely on their fair share of excuses, too. And if you’re a DBA you’ve probably heard most of them. Let’s break down the top few programmer excuses and see what can be done to avoid them in the future.

The number one programmer excuse is to blame the DBMS.  If you’re a programmer, chances are high that you’ve either said something like the following (or at least thought it): “There’s something wrong with DB2 (or Oracle, or insert your favorite DBMS here)!” The basic mentality is that the DBMS is guilty until proven innocent; the programmer will never run up to the DBA and say “there’s a problem with this horrible code I wrote, can you help me fix it?”

Blaming the DBMS is never a helpful strategy. Oh, yes, in some rare instances there will be a problem or bug in the DBMS itself, but those instances are very rare. Most “database problems” can be tracked back to programming problems. By keeping this in mind at all times everyone will be better off – the problem will get fixed sooner and you will not alienate your DBAs by constantly blaming the DBMS.

Another common excuse is known as the Copied Code Syndrome. As most programmers know, copying working code from one program to another is an efficient way of quickly developing programs. But with database development you have to be careful to make sure that what you copy is really what you need. Here’s how this excuse works: when the programmer is confronted with a problem in his program, he simply says "That can’t be a problem because I copied it from another program and that program works." Well, that may be true, but many things can go wrong with copied SQL. Maybe you copied something that is 95% of what you need, but you didn’t modify the code for your purposes. Or maybe something is different about the rest of the code in your program that makes the copied code ineffective. Or maybe you aren’t totally sure what each and every statement and parameter you copied actually does.

A corollary to the Copied Code Syndrome is the It worked yesterday excuse. But today is another day, and if it ain’t working today, it ain’t working. Many things can change from day-to-day that can cause working code to become problematic – not the least of which is that code itself. Programmers make so many changes to code as a requirement of their job that sometimes you can just forget that something did indeed change. The bottom line is to work on solutions to problems instead of excuses to deflect blame. Blame is counter-productive to resolving problems.

Yet another excuse bandied about by developers is the Better Mousetrap excuse. The best approach to developing programs using a relational database is to put as much work as possible into the SQL and let the DBMS optimize the access. But there is always that Wile E. Coyote developer who says “But I can do that better in C or Java (or insert your favorite programming language here).” Doing it in SQL puts the work on the DBMS – and there is a much better chance for the DBMS to be bug-free than whatever code you cobble together.

The final programmer excuse I’ll mention today is the Time Is Running Out excuse. This can best be summarized as “There is always time to do it over later, but never enough time to do it right the first time.” Usually this excuse comes to light when you hear that magic phrase “It’s too late in the project to re-write that.”  But the problem caused by the code continues to exist – the programmer just wants some magic to occur to fix it that does not require coding changes. Won’t happen! There is no magic button out there! Sometimes the code has to change to solve the problem.

In the end, the biggest thing you can do as an application programmer is to research and understand any issue before you go running to the DBA.  If your program fails, find the SQLCODE (or SQLSTATE) and any associated reason code and try to fix it yourself first.  If you don’t understand something, read the manual before going to the DBA. There is no reason why everyone shouldn’t have their own set of manuals; most of them can be downloaded for free from the DBMS vendor’s web site.

To conclude, if I had a nickel for every time someone tried to use one of these excuses on me, I’d be a wealthy man. But life does not work that way. So maybe we can all climb back into the trenches and vow to avoid using all of these excuses — both DBAs and programmers… it’ll make working with your databases a lot easier!


DBA Excuses… and advice to programmers for overcoming them!

Let’s face it; sometimes it is easier to give an excuse than it is to take the time to answer a technical question. Especially if it’s the third time you’ve heard the same question in the last hour. And dealing with databases is complex and time-consuming, so questions are always coming up. And programmers are always coming to DBAs for help with technical problems. If you’re reading this column right now you’ve probably been on the receiving or giving end of one of these excuses at one time in your life. Let’s see how many of them you recognize.

The number one all-time DBA excuse can be broken down into two words – "it depends." DBAs are notorious for answering every question with those same two words, "it depends." And they are right, it does "depend," but how does that answer help the situation? Oh, it might cause the programmer who deigned to ask the question to slink away, and maybe that is what the DBA was shooting for anyway. But if you are that programmer, do not just go away. Whenever the DBA says "it depends," do not take that for a final answer… instead, ask the follow-up question "on what?"

Any DBA worth their salary will be able to run down some of the issues involved in the matter and point you to the proper manual or code to help you figure out your dilemma. Which brings me to the second biggest DBA excuse, which is RTFM. As everyone knows, this stands for Read The Friendly Manual.

I know, fellow DBAs, it gets old very fast when people keep asking you questions that are covered in the software manuals. But DBAs know the DBMS manuals much better than the application programmers, end users, and systems analysts who are asking the questions. It really shouldn’t be too hard to answer a simple question instead of barking out “RTFM” all the time. At least tell them in what chapter of TFM they can R it! And when you are a developer asking a question of the DBA, have you ever thought about phrasing it like this – “Where can I get more information on <Topic>?” instead of asking to be spoonfed answers… it might make a world of difference in the DBA’s attitude!

Another classic DBA excuse is to blame the DBMS vendor. It is always easier to claim that some piece of sage advice being offered came from the vendor because it lends legitimacy to the advice while at the same time relieving the DBA of any burden of proof. You know the situation I’m talking about. You’re a DB2 programmer (say) and you encounter a problem causing you to go to the DBA for guidance. And then the DBA says something like “Well, IBM says to …” Don’t accept this type of excuse either. IBM is a company and it doesn’t “say” anything. Same goes for Oracle and Microsoft and SAP and <insert your favorite DBMS provider here>.

Instead, find out who at IBM (or Oracle or…) said it and in what context. Or, perhaps no one “said” it but the DBA read it in an article or a manual or a book. That is great, but many times written advice is mangled and twisted when it is repeated orally. When this type of “excuse” is attempted you should obtain enough details from the DBA to seek out the source materials (or person) to make sure it truly applies in your situation.

Another popular excuse is the phrase “It is working as designed.” Well, that might be the case, but if the way it is working isn’t what you need it to do, then you still have a problem. If the DBA tells you that it is working as designed, he is basically telling you that you do not understand the way the DBMS works. And that may be true, but then he should really try to help you find an alternate solution — that does what you want — while also working in the database “as designed.”

The final DBA excuse I’ll talk about here is over-reliance on company standards. Some DBA groups grunt out mounds of steaming standards and guidelines that they then attempt to enforce for every project. Standards can be useful, but they shouldn’t become a crutch. When a specific standard ceases to make development and administration easier, or performance better, then it is time to make an exception to that standard… or even time to remove or re-write that standard.

If you come up with an elegant solution that works better or performs better than the “standard,” do not just accept it when the DBA says, “that doesn’t fit our standards.” Make sure that you both understand each other’s position. Perhaps there is a valid reason why the standard should apply – but make sure the DBA explains it to you if your solution really is easier. And as a programmer, make sure that you really have to deviate from a standard before making a big deal about it. That way, when you have a valid case, the DBA will be more apt to listen and compromise.

If I had a nickel for every time a DBA used the excuses in this article I’d be retired and living on an island somewhere. But you’d still be hearing the excuses!

Maybe if we all pay a little more attention to working together for the benefit of our organization, instead of hiding behind excuses, the workplace might be a more enjoyable and productive place to be!


On DBA Tools, Utilities, Suites and Solutions

“Vendor speak” can be a difficult thing to decipher. And that is particularly true within the realm of DBA tools vendors. It would be easier if every vendor followed the same terminology… but, of course, they do not. And there is no way to force them to do so. But we can adopt a rigorous lexicon to describe the software offerings and use it ourselves… and analyze all software that we review using this simple descriptive lexicon.

Here is what I propose.

First, let’s be clear on what is a utility and what is a tool.

  • A database utility is generally a single purpose program for moving and/or verifying database pages; examples include load, unload, import, export, reorg, check, dbcc, copy and recover. There may be others I am missing, but these are functions that are frequently bundled with a DBMS, but also may be sold as an independent product.
  • A database tool is a multi-functioned program designed to simplify database monitoring, management, and/or administrative tasks. So performance monitor, change manager, data modeling tool, recovery/performance/SQL analyzers, etc. are all examples of DBA tools. Again, the list is not intended to be an exhaustive one (for a more in-depth list check out this blog post).

OK, it would be simple enough if these were the only two types of products we had to concern ourselves with, but they are not. Vendors also talk about solutions and suites. Sometimes, these two terms are used interchangeably, but I contend that they should not be. What I propose is this:

  • A solution is a synergistic group of tools and utilities designed to work together to address a customer’s business issue.
  • A suite is a group of tools that are sold together, but are not necessarily integrated to work with each other in any way.

So, a solution is a suite, but a suite is not necessarily a solution. Solutions are designed to simplify things for the customer in terms of usage and efficiency. Suites are designed to help the vendor and its salespeople sell more.

Now don’t get me wrong… I am not saying that you should not buy suites of DBA tools and utilities. If the price point is good and you really want or need all (or most) of the bundled software, then it can make sense. But know what you are buying! Understand all of the components of the suite and the terms and conditions of the agreement.

Solutions, on the other hand, should work together seamlessly and you may not even know that there are multiple underlying products. If that is not the case, it isn’t really a “solution” is it?

Of course, these are just my definitions. But I think these are useful definitions that make it easier to review, analyze and discuss DBA products and programs.

What do you think?


A Brief Introduction to Data Normalization

Normalization is a series of steps followed to obtain a database design that allows for efficient access and storage of data in a relational database. These steps reduce data redundancy and the chances of data becoming inconsistent. A table is said to be normalized if it satisfies certain constraints. Codd’s original work defined three such forms but there are now five generally accepted steps of normalization. The output of the first step is called First Normal Form (1NF), the output of the second step is Second Normal Form (2NF), etc.

A row is in first normal form if and only if all underlying domains contain atomic values only. 1NF eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship. A row is in second normal form if and only if it is in first normal form and every non-key attribute is fully dependent on the key. 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. A row is in third normal form if and only if it is in second normal form and every non-key attribute is non-transitively dependent on the primary key. 3NF eliminates functional dependencies on non-key fields by putting them in a separate table. At this stage, all non-key fields are dependent on the key, the whole key and nothing but the key.
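
A minimal before-and-after sketch may help; it uses a hypothetical order-entry design, so all of the names are illustrative.

-- Unnormalized: repeating item columns in a single row violate 1NF.
--   ORDER (order_no, cust_no, cust_name, item_no_1, qty_1, item_no_2, qty_2, ...)

-- Normalized through 3NF: the repeating group moves to its own table (1NF),
-- every attribute depends on the whole key (2NF), and cust_name, which
-- depends on cust_no rather than on order_no, moves to the customer table (3NF).
CREATE TABLE customer
      (cust_no   INTEGER     NOT NULL PRIMARY KEY,
       cust_name VARCHAR(40));

CREATE TABLE orders
      (order_no  INTEGER     NOT NULL PRIMARY KEY,
       cust_no   INTEGER     NOT NULL REFERENCES customer);

CREATE TABLE order_line
      (order_no  INTEGER     NOT NULL REFERENCES orders,
       item_no   INTEGER     NOT NULL,
       qty       INTEGER,
       PRIMARY KEY (order_no, item_no));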

But normalization does not stop with 3NF. Additional normal forms have been identified and documented. However, normalization past 3NF does not occur often in normal practice because most tables in 3NF are usually also in 5NF. The additional normal forms are:

  • Boyce-Codd Normal Form (BCNF) is a further refinement of 3NF. Indeed, in his later writings Codd refers to BCNF as 3NF. A row is in Boyce-Codd normal form if and only if every determinant is a candidate key. Most entities in 3NF are already in BCNF.
  • An entity is in Fourth Normal Form (4NF) if and only if it is in 3NF and has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship within an entity if the one-to-many attributes are independent of each other.
  • Fifth Normal Form (5NF) specifies that every join dependency for the entity must be a consequence of its candidate keys.

For more information on normalization consult the following links:

A comparison of BCNF and 3NF is given here:
http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter7/node12.html

A complete definition of 4NF is given here:
http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter7/node16.html


Everyone in IT Should Serve A Mainframe Tour of Duty

Today’s post is an update on a post I first wrote 10 years ago on my DB2portal blog. The general idea is that everybody in IT would be well-served by learning about mainframes and their robust management environment.

Mainframe developers are well aware of the security, scalability, and reliability of mainframe computer systems and applications. Unfortunately, though, the bulk of new programmers and IT personnel are not mainframe-literate. This should change. But maybe not for the reasons you are thinking.

Yes, I am a mainframe bigot. I readily admit that. In my humble opinion there is no finer platform for mission critical software development than the good ol’ mainframe. And that is why every new programmer should have to work a tour of duty on mainframe systems and applications as soon as they graduate from college.

You may note that I use the word mainframe, instead of the z Systems or z Server terms that IBM is using these days. Nothing wrong with the z thing, but I think there is nothing wrong with the term mainframe!

Why would I recommend a  mainframe tour of duty for everybody?

Well, due to the robust system management processes and procedures which are in place and working at every mainframe shop in the world. This is simply not the case for Windows, Unix, and other platforms. Of course, I don’t want to overly disparage non-mainframe systems. Indeed, much of the credit for the mainframe’s superior management lies in its long legacy. Decades of experience helped mainframers build up the systems management capabilities of the mainframe.

But by working on mainframe systems, newbies will finally begin to learn the correct IT discipline for managing mission critical software. The freedom that is allowed on non-mainframe systems helps folks to learn – but it is not conducive to the creation of hardened, manageable systems.

No longer is it okay to just insert a CD or download something from the web and install new software willy-nilly onto a production machine. Mainframe systems have well-documented and enforced change management procedures that need to be followed before any software is installed into a production environment.

No longer is it okay to just flip the switch and reboot the server. Mainframe systems have safeguards against such practices. Months, sometimes years, can go by without having to power down and re-IPL the mainframe.

And don’t even think about trying to get around security protocols. In mainframe shops there is an entire group of people in the operations department responsible for protecting and securing mainframe systems, applications, and data.

Ever wonder why there are no mainframe viruses? A properly secured operating system and environment make such a scenario extremely unlikely.

Project planning, configuration management, capacity planning, job scheduling and automation, storage management, database administration, operations management, and so on – all are managed and required in every mainframe site I’ve ever been involved with. When no mainframe is involved many of these things are afterthoughts, if they’re even thought of at all. Sure, things are getting better in the distributed world – at least better than they were 10 years ago – but it is still far from perfect!

Growing up in a PC world is a big part of the problem. Although there may be many things to snark about with regard to personal computers, one of the biggest is that they were never designed to be used the way that mainframes are used. Yet we call a sufficiently “pumped-up” PC a server – and then try to treat it like we treat mainframes. Oh, we may turn it on its side and tape a piece of paper on it bearing a phrase like “Do Not Shut Off – This is the Production Server”… but that is a far cry from the glass house that we’ve built to nourish and feed the mainframe environment.


And it is probably unfair to criticize PCs for not being mainframes because the PC was not designed to be a mainframe… but over the years people have tried to use them for enterprise production workloads… sometimes successfully. Sometimes not.

The bottom line is that today’s applications and systems do not always deliver the stability, availability, security, or performance of mainframe systems. A forced tour of duty supporting or developing applications for a mainframe would do every IT professional a whole world of good!
