Things break. When a new database is created or a new application goes online, everything is fresh and new—and running as designed. But the environment changes over time. New hardware and software is added, along with more users, more data, more requirements—more, more, more. Systems wear down as they are taxed on a daily basis to deliver service.
The DBA must be prepared for situations where a failure impacts the availability, integrity, or usability of a database. Reacting to failures and service disruptions is a key component of the DBA’s job. The ability of the DBA to react accordingly depends directly on his having a well-planned approach to database backup and recovery.
I know that many DBAs believe that ensuring optimal database and application performance is the most important task in their list of job responsibilities, but it is not true. These DBAs are confusing frequency with importance.
Most DBAs probably are conducting performance-related tasks more often than building backup plans – and they better be managing performance more frequently than they are actually recovering their databases or their company has big problems! But recoverability should be at (or near) the very top of the DBA task list, definitely before performance. Why? Well, if you cannot recover your databases after a problem then it won’t matter how fast you can access them, will it? Anybody can deliver fast access to the wrong information. It is the job of the DBA to keep the information in our company’s databases accurate, secure, and accessible.
So what do DBAs need to do to assure the availability and accuracy of our database data? This chapter will outline the basics of database backup and recovery and provide guidance on building a robust backup and recovery plan for your databases.
Preparing for Problems
Numerous daily hazards can cause system failures. As you plan your database backup and recovery strategy, be sure to consider all of these various threats to database integrity and availability. Of course, it is wise to take precautionary measures to prevent failures. Techniques such as UPS systems, mirrored disks, and failover technology can minimize the need to recover, but no amount of planning and regulation can prevent unexpected failures.
Database failures that may require recovery can be divided into three categories:
- Instance failures are the result of an internal exception within the DBMS, an operating system failure, or other software-related database failure. In some cases, an instance failure can result in corruption of data that requires a recovery, but usually such failures do not damage data, so the DBMS simply needs to be restarted to reestablish normal operations.
- Application (or transaction) failures occur when programs or scripts are run at the wrong time, using the wrong input, or in the wrong order. An application failure usually results in corrupt data that requires a database restore or recovery. The sooner an application failure is identified and corrected, the smaller the amount of damage to the database will be.
- Media failure is likely to damage data, too. Media failure includes damage to disk storage devices, file system failures, tape degradation or damage, and deleted data files. Although less common in practice, damaged memory chips also can cause data corruption. After a media failure, the database will likely be in a state where valid data is unreadable, invalid data is readable, or referential integrity is violated. Outages due to media failures can often be avoided by implementing modern disk technologies such as RAID.
Businesses today are demanding higher throughput and around-the-clock availability even as they increase the amount of stored and processed data. Not too long ago, we talked about data in terms of gigabytes. Now it is common for organizations to manage a terabyte or more of data on a single database server. Therefore, more data needs to be constantly available and has to be processed faster than ever before. Organizations rely on data to conduct business, so it is imperative that you are prepared with a plan to counteract failures. A sound backup and recovery plan can be thought of as an insurance policy for your data.
Image Copy Backups
A fundamental component of a database backup and recovery plan is creating backup copies of data. When an error occurs that damages the integrity of the database, a backup copy of the data can be used as the basis to recover or restore the database. However, the full story on backing up a database is not quite that simple.
Backing up databases involves making consistent copies of your data, usually in the form of image copies, which are the output of a COPY utility. The name of the copy utility will vary from DBMS to DBMS. Common names for the backup utility include BACKUP, COPY, DUMP, and EXPORT. Some DBMSs rely on the native operating system’s file system commands for backing up data. However, even if the DBMS supplies an internal backup option, the DBA may choose to use facilities that operate outside the realm of the DBMS.
Full vs. Incremental Backups
Two types of image copy backups can be taken: full and incremental. As a DBA, you will need to learn the difference between the two and implement the proper image copy backup strategy based on application needs and database activity.
A full image copy backup is a complete copy of all the data in the database object at the time the image copy was run. An incremental image copy backup, sometimes referred to as a differential backup, contains only the data that has changed since the last full or incremental image copy was made. The advantage of taking an incremental backup rather than a full backup is that it can sometimes be made more quickly, and it occupies less space on disk or tape. The disadvantage is that recovery based on incremental copies can take longer because, in some cases, the same row is updated several times before the last changes are restored.
If the DBMS supports incremental image copy backups, it may also support incremental copy merging. A merge utility, sometimes referred to as MERGE or MERGECOPY, can be used to combine multiple incremental image copy backups into a single incremental copy backup, or to combine a full image copy backup with one or more incremental image copy backups to create a new full backup.
Some DBMSs support making backup copies of indexes. Indeed, some DBMSs require indexes to be backed up, whereas index backup is optional for others. Index backup can be optional because the DBMS can rebuild an index from the table data. Therefore, a typical recovery scenario would involve recovering tables or tablespaces and then using the table data to rebuild the indexes. If the DBMS supports index backups, you can choose to copy indexes and then recover using the image copy backups of the indexes.
As a DBA, though, you will need to examine the trade-offs of copying indexes if your DBMS supports index backup. The question DBAs must answer for each index is “Rebuild or recover?” The more data that must be indexed, the longer an index rebuild will require in a recovery situation. For larger tables, backing up the index can result in a much quicker recovery—although at the expense of the increased time required for backup. When multiple indexes exist on the large table, backing them up, again, leads to faster recovery. However, keep in mind that index backups will require additional time to execute during your regular backup process. As a DBA, you will need to weigh the cost of recovery versus the cost of backup in making your decision.
Be sure your backup plan creates a consistent recovery point for the database object. In order to ensure backup consistency, you need to be aware of all relationships between the database objects being backed up and other database objects. This includes application-enforced relationships, referential constraints, and triggers. If you use an image copy backup to recover a database object to a previous point in time, you will need to recover any related database objects to the same point in time. Failure to do so will most likely result in inconsistent data.
If your DBMS provides a QUIESCE utility, use it to establish a point of consistency for all related database objects prior to backing them up. The QUIESCE utility will halt modification requests to the database objects to ensure consistency and record the point of consistency on the database log. Use the QUIESCE utility even when some database objects do not need to be copied, so that you can provide a consistent point of recovery for all referentially tied tablespaces.
If the DBMS does not provide a QUIESCE option, you will need to take other steps to ensure a consistent point for recovery. For example, you can place the database objects into a read-only mode, take the database objects offline, or halt application processes—at least those application processes that update the related database objects.
Document Your Backup Strategy
Once your backup strategy has been established and implemented, the backup system can run for a long time without any DBA intervention required. Such automation is a mixed blessing, though. Over time, things can be forgotten and the DBA staff can change, both of which can cause confusion during a hectic database recovery. For this reason it is imperative that the backup and recovery strategy, implementation, and procedures be thoroughly tested and documented by the DBA.
The most important aspect of any backup plan is to match it against your recovery requirements and options. More on that in the next section.
When problems impact the database, the DBA can use the image copy backups and the database log to recover the database. Whatever the cause of the problem, the DBA must be able to recover data quickly so that the business can continue to operate. When data is unavailable, your company may be losing thousands or even millions of dollars. Recognizing the need for a database recovery is quite different from actually performing a recovery in a speedy and proper fashion. Database recovery can be a very complex task that is prone to errors and difficult to manage.
Recovery involves much more than simply restoring an image of the data as it appeared at some earlier point in time. A database recovery will involve bringing the data back to its state at (or before) the time of the problem. Often a recovery involves restoring databases and then reapplying the correct changes that occurred to that database, in the correct sequence.
Simply stated, a successful recovery is one where you get the application data to the state you want it—whether that state is how it was last week, yesterday, or just a moment ago. If you planned your backup strategy appropriately, you should be able to recover from just about any type of failure you encounter.
Determining Recovery Options
When a failure occurs, the DBA will need to ascertain whether recovery is required. If recovery is required, you will need to determine what resources (backup copies) are available and how best to perform that recovery.
Of the different types of recovery performed, the first one that usually comes to mind is a recover to current, to handle some sort of disaster. This disaster could be anything from a media failure to a natural disaster destroying a data center. Applications are completely unavailable until the recovery is complete. To successfully recover to current, the recovery process must be able to reset the contents of the database to the way it looked just at (or right before) the point of failure. To recover to current, the recovery process must find a valid, full image copy backup and restore that image copy. Then the recovery will roll forward through the database log, applying all of the database changes.
If the last full image copy is lost or destroyed, it may still be possible to recover if a previous image copy exists. The recovery process could start with the older backup copy, apply any incremental copies, and then roll forward through the archived and active logs. Of course, more database logs will be required in such a case, so the recovery process will take longer.
If no image copy is available as a starting point, it may be possible to recover the database object using just the database log. If the data was loaded and the load process was logged, recovery may be able to proceed simply by applying log records.
Another traditional type of recovery is point-in-time (PIT) recovery, which is usually done to deal with an application-level problem. PIT recovery is sometimes referred to as partial recovery because only part of the existing data will remain after recovery. Recovery to a point in time removes the effects of all transactions that have occurred since that specified point in time.
To perform a PIT recovery, an image copy backup is restored and then changes are applied by rolling forward through the database log (or log backups). However, only the log records up to the specified time are processed. Sometimes the recovery point is specified as an actual date and time; sometimes it is specified using a relative byte address on the database log.
To successfully recover to a point in time, the recovery must be able to reset the contents of the database to the way it looked at a prior consistent point. The key is to retain all of the good changes to the data, while removing all of the “bad.” PIT recovery can be accomplished in one of two ways, depending on the features of the DBMS and the amount of data to be recovered. It could:
- Restore the image copy by rolling forward through the logs and applying the database changes up to the recovery point, or;
- Not restore the image copy, instead rolling backward through the logs and removing the database changes that occurred after the recovery point.
If the DBMS supports both types of recovery, the DBA should choose to deploy the one that creates the least downtime. If a significant number of changes need to be removed, then restoring and rolling forward usually results in the least downtime. If the number of changes that must be removed are minimal, then rolling backward through the logs should result in less downtime. If your DBMSs does not support backward log rolling, third-party products may be available that can make it a viable technique.
Transaction recovery is a third type of recovery; it addresses the shortcomings of the traditional types of recovery: downtime and loss of good data. Thus, transaction recovery is an application recovery whereby the effects of specific transactions during a specified timeframe are removed from the database. Third-party software is required to perform a transaction recovery.
Traditional types of recovery, both recovery to current and PIT, recover at the database object level. In direct contrast to this level of granularity, transaction recovery allows a user to recover a specific portion of the database based on user-defined criteria. This can be at a transaction or application program level. In this context, a transaction is defined by the user’s view of the process. This might be the set of panels that comprise a new-hire operation, or the set of jobs that post to the general ledger. The important point is that there may or may not be a correlation between the transactions you are trying to fix and transactions (or units of recovery) in the DBMS.
Once you have identified the transaction to recover, you have three recovery options:
- PIT recovery. You can try to identify all of the database objects impacted by the application and perform traditional point-in-time recovery to remove the effects of the transactions. You would then manually rerun or reenter work that was valid.
- UNDO recovery. Remove only the effects of the bad transactions.
- REDO recovery. Remove all the transactions after a given point in time, and then redo the good transactions only.
PIT recovery was discussed earlier, so we will not cover it again here. The other two forms of transaction recovery are SQL-based application recovery techniques. The basic idea is to read the log and generate the proper SQL statements that can be run to achieve the desired recovery results.
Let’s first examine an UNDO recovery. UNDO recovery is the simplest version of SQL-based transaction recovery because it involves only SQL. To accomplish an UNDO recovery, the database logs must be scanned for the identified transaction and anti-SQL is produced. Anti-SQL reverses the affect of SQL by
- Converting inserts into deletes
- Converting deletes into inserts
- Reversing the values of updates (e.g., UPDATE “A” to “X” becomes UPDATE “X” to “A”)
Once the anti-SQL is generated, it is run using an interactive SQL script to perform an UNDO recovery.
A REDO recovery is a combination of PIT recovery and UNDO recovery, with a twist. Instead of generating SQL for the bad transaction that we want to eliminate, we generate the SQL for the transactions we want to save. Then we do a standard point-in-time recovery to eliminate all the transactions since the recovery point. Finally, we reapply the good transactions captured in the first step.
Unlike the UNDO process, which creates SQL statements designed to back out all of the problem transactions, the REDO process creates SQL statements that are designed to reapply only the valid transactions from a consistent point in time. Since the REDO process does not generate SQL for the problem transactions, performing a recovery and then executing the REDO SQL can restore the database object to a current state that does not include the problem transactions.
So, what is the best recovery strategy? It depends… on things like the amount of data impacted, subsequent modifications, log availability, and so on.
It is important that the DBA thoroughly test each different type of recovery in a test environment. Be sure that you can recover from a media failure, an instance failure, and several types of application failures. Document the type of backup taken for each database object, along with a schedule of when each is backed up. Be sure that all of your databases can be recovered and that all DBAs on-site have firsthand experience at database recovery. The DBA group should schedule periodic evaluations of the backup and recovery plans for every production database.
Every DBA should prepare a recovery plan for each database object and test it frequently. A recovery plan describes the procedures you will use to recover in the event of hardware failure or a local site disaster, such as fire.
To develop your recovery plan:
- Write all aspects of the recovery plan out in detail, documenting each step.
- Include all the scripts required to back up and recover each database object.
- Review the plan with everyone who may be called on to implement it.
- Include a contact list with names and phone numbers of everyone who may be involved in the recovery.
- Keep the recovery plan up-to-date by modifying the plan to include every new database object that is created.
Testing your recovery procedures on a regular basis includes running regular recovery tests for individual database objects, databases, and the entire database system. By testing recovery procedures, the DBA assures that the backup and recovery scripts work and that every database object is indeed recoverable. Furthermore, regular recovery testing is on-the-job training for the inevitable, stress-filled production database recovery. A DBA that tests his recovery plan will be more familiar with the tools and scripts needed to perform recovery.
That’s enough for one blog post. Happy backing up and recovery testing… but hopefully no actual recovery needed!