Planning for Disaster Recovery
You need to create a disaster recovery plan in order to ensure that all your systems and data can be quickly restored to normal operation in the event of a natural disaster (for example, a fire) or a technical disaster (for example, a two-disk failure in a RAID-5 array). When you create a disaster recovery plan, you prepare all the actions that must occur in response to a catastrophic event. It is recommended that you verify your disaster recovery plan through the simulation of a catastrophic event.
Consider disaster recovery planning in light of your own environment and business needs. For example, suppose a fire occurs and wipes out your 24-hour data center. Are you certain you can recover? How long will it take you to recover and have your system available? How much data loss can your users tolerate?
Ideally, your disaster recovery plan states how long recovery will take and the final database state the users can expect. For example, you might determine that after the acquisition of specified hardware, recovery will be completed in 48 hours, and data will be guaranteed only up to the end of the previous week.
A disaster recovery plan can be structured in many different ways and can contain many types of information, including:
- A plan to acquire hardware.
- A communication plan.
- A list of people to be contacted in the event of a disaster.
- Instructions for contacting the people involved in the response to the disaster.
- Information on who owns the administration of the plan.
Running a Base Functionality Script
Usually, you include a base functionality script as part of your disaster recovery plan in order to confirm that everything is working as intended. The base functionality script provides a dependable tool for the system administrator or database administrator to be able to see that the database is back in a viable state, without depending on end users for verification. Most commonly, this is an .sql file with batched SQL statements run into the server from osql. For other applications, a .bat file is more appropriate because it can contain bcp and osql commands. This base functionality script is very application specific, and it can take many different forms. For example, on a decision support/reporting system, the script may merely be a copy of several of your key reporting queries. For an online transaction processing (OLTP) application, the script may execute a batch of stored procedures that execute INSERT, UPDATE, and DELETE statements.
Preparing for a Disaster
To prepare for disaster, it is recommended that you periodically perform the following steps:
- Perform regular database and transaction log backups to minimize the amount of lost data. It is recommended that both system and user databases be backed up.
- Maintain system logs in a secure fashion. Keep records of all service packs installed on Microsoft® Windows NT® 4.0 or Windows® 2000 and Microsoft SQL Server™. Keep records of network libraries used, the security mode, and the sa password.
- Maintain a base functionality script for quickly assessing minimal capability.
- Assess the steps you need to take to recover from a disaster ahead of time on another server, and amend the steps as necessary to suit your environment.
Recovering from a Disaster
To recover from a disaster, perform the following steps after acquiring suitable replacement hardware:
- Install Windows NT 4.0 or Windows 2000, and apply the appropriate service pack. Verify that appropriate domain functionality exists.
- Install SQL Server, and apply the appropriate service pack. Restore the master and msdb database backups. Restart the server after restoring the master database.
- Reconfigure the server for the appropriate network libraries and security mode.
- Confirm that SQL Server is running properly by checking SQL Server Service Manager and the Windows application log. If the Windows NT 4.0 or Windows 2000 name was changed, use sp_dropserver and sp_addserver to match it with the SQL Server computer name.
- Restore and recover each database according to its recovery plan.
- Verify the availability of the system. Run a base functionality script to ensure correct operation.
- Allow users to resume normal usage.