This chapter first covers two of the most important tasks related to system administration: backup and recovery. Backup refers to the process of making copies of the database(s) and/or transaction logs to separate media that can later be used for recovery, if necessary. Recovery is the process of using the backup media to replace uncommitted, inconsistent, or lost data.
System availability refers to keeping the downtime of the database system as low as possible. In this chapter we will describe in detail the following options available for system availability: failover clustering, database mirroring, log shipping and high availability and disaster recovery (HADR). Also, the benefits and disadvantages of each option will be discussed. At the end of the chapter, Maintenance Plan Wizard is discussed.
The wizard provides you with the set of basic tasks needed to maintain a database. Therefore, it can be used, among other things, to backup and restore user databases.
Performing backups is a precautionary measure that you have to take to prevent data loss. The reasons for data loss can be divided into the following groups:
- Program errors
- Administrator (human) errors
- Computer failures (system crash)
- Disk failures
- Catastrophes (fire, flood, earthquake) or theft
During execution of a program, conditions may arise that abnormally terminate the program. Such program errors affect only the database application and usually have no impact on the entire database system. Because these errors are based on faulty program logic, the database system cannot recover in such situations. The recovery should therefore be done by the programmer, who has to handle such exceptions using the COMMIT and ROLLBACK statements (see Chapter “Concurrency Control”).
Another source of data loss is human error. Users with sufficient permissions (DBA, for instance) may accidentally lose or corrupt data (people have been known to drop the wrong table, update or delete data incorrectly, and so on). Of course, ideally, this would never happen, and you can establish practices that make it unlikely that production data will be compromised in this way, but you have to recognize that people make mistakes, and data can be affected. The best that you can do is to try to avoid it, and be prepared to recover when it happens.
A computer failure may occur as a result of various different hardware or software errors. A hardware crash is an example of a system failure. In this case, the contents of the computer’s main memory may be lost. A disk failure occurs either when a read/ write head of the disk crashes or when the I/O system discovers corrupted disk blocks during I/O operations.
In the case of catastrophes or theft, the system must have enough information available to recover from the failure. This is normally done by means of media that offer the needed recovery information on a piece of hardware that is stored separately and thus has not been damaged or lost by the catastrophe or theft.
For most of the errors just described, backups, discussed next, can provide a recovery solution.