Problem
SQL Server 2005 has given us a lot of options on implementing high availability and disaster recovery. More than the technologies themselves, it is important to come up with the proper procedures as we manage different disaster recovery scenarios. How do we come up with procedures for various SQL Server disaster recovery scenarios?
Solution
This series of articles will look at different disaster recovery scenarios and the procedures involved in your recovery plan. When considering disaster recovery options for your SQL Server 2005 database, you should include as many technologies as you can so you'll have a pool of options to choose from if a disaster arises. While having these technologies prove to be important, it is the process that goes with it that makes it effective. For this article, let's take a simple scenario where a user accidentally dropped or truncated a table about 5 hours after a database backup was taken. Restoring from a database backup would mean losing 5 hours worth of data. And for most companies, they would rather opt for loss of time than loss of data. Plus, if this was a very large database, it would take quite some time to recover and bring it online. We'll take this scenario to build a procedural approach to recover the database as quickly as possible while minimizing data loss. We will use the Northwind database to demonstrate the process. Remember to change Northwind's database recovery model to FULL before working through the steps below.
1) Make sure you have a very good backup
For the scenario above, let's say you have a daily backup running at 6:00 AM and there are no database snapshots created on a regular basis. Your database is configured to use a single MDF and LDF file, which is not quite right for disaster recovery. Let's generate a full database backup for our Northwind database which will be the starting point for database recovery. Here is that code:
USE master GO BACKUP DATABASE Northwind TO DISK = N'D:\DBBackup\NorthwindBackup.bak' WITH NAME = N'Full Database Backup', DESCRIPTION = 'Starting point for recovery', INIT, STATS = 10 GO |
Looking at the Northwind database schema, you won't easily be able to drop theProducts, Orders and Customers table due to foreign key constraints defined by other tables like the Order Details table. But I bet you can easily drop the Order Detailstable. Let's simulate that disaster by dropping the table at around 11:00AM.
DROP TABLE [Order Details] |
2) Contain the problem
Since the database only has a single MDF and LDF file, we couldn't do much. All we can do is take the database offline by setting it to restricted user mode
USE master |
This will effectively take the database offline, terminating all active connections. This is where the clock starts ticking and we need to take action. Keep in mind your RPO and RTO while proceeding with the next steps.
3) Backup the transaction log
A good DBA would know that the first thing to do when disaster strikes is to backup the transaction log - assuming that your database is set to FULL recovery model. This is to make sure you still have all the active transactions in the log since your last backup. In our scenario, since the last backup - FULL backup, in this case - occurred at 6:00AM.
BACKUP LOG Northwind |
4) Restore the database to a known good point-in-time
Now, any user who accidentally dropped a table or caused any damage on a database will not tell you immediately. Sometimes, you may have to dig it up for yourself but that would take quite a lot of time. Since we wanted to bring the database back online as fast as we can, let's just assume a "known good" point-in-time and leave the digging for a later time. In the script below, I have chosen to use 10:42AM in my STOPAT parameter as my "known good" point-in-time for demonstration purposes.
RESTORE DATABASE Northwind |
Although we have restored the database to a "known good" point-in-time, we don't really know how much data we've lost. We need to find out the exact time when the very lastINSERT statement occurred prior to the DROP TABLE statement so that we can recover as much data as we possibly can. But we don't want to do this directly on the server as we need to get the database back online as quickly as we can. This is where the succeeding steps would prove to be very valuable
You can validate whether the dropped table has been restored by running a query against it.
SELECT * |
5) Create a snapshot of the restored point
We will create a database snapshot of the restored database for further processing. This database snapshot will be our reference in recovering data to the exact time prior to theDROP TABLE statement.
USE master |
Depending on the table schema, we can opt to leave it as it is, as what we will do for theOrder Details table, or do a few more tasks. If the table has an existing IDENTITYcolumn, we need to create a gap between the maximum value for our IDENTITY column and assumed number of rows which we need to recover. This, of course, will depend on the number of transactions that occur on the server. To identify the maximum value of the IDENTITY column, you can run the DBCC CHECKIDENT command as shown below:
DBCC CHECKIDENT ('tableName') |
This will return the maximum value of the IDENTITY column. Let's say that the number of transactions on this table per day is around 4000 records, we can create a gap between the maximum value and the next value. If the maximum value for the IDENTITY column is25000, we need to add 4000 to it and run the DBCC CHECKIDENT command again with the RESEED parameter (we are simply assuming that you can recover the lost data within a day, thus, the value 400):
DBCC CHECKIDENT ('tableName', RESEED, 29000 ) |
6) Bring the database back online
Once you have managed to do that, change the database option to bring it online and allow users to connect back and run their applications.
USE master |
Now the database is back online and the dropped table has been restored. Although everyone is happy by now, our job as a DBA does not stop here. Remember that we still need to recover the lost data from the "known good" point-in-time to the time before theDROP TABLE command was executed. That is the only way we can recover as much data as we can. Though there are a few third-party tools we can use to read the transaction log and recover the data by replaying those transactions back, most of us do not have the luxury of playing around with those tools. So our next best bet would be to use the RESTORE with STOPAT option. It's a bit tedious and sometimes very stressful as one mistake would plunge you back into repeating the entire process. All we need here is to find out the times we did our backups until the end of the tail (transaction log) backup. In our scenario, the last backup occurred at 6:00AM and your "known good" point-in-time is at 10:42.44AM. Therefore, you can start doing a RESTORE with the STOPAT option from 10:42.44AM and change the STOPAT time value to maybe a second. If you are not quite sure when the last backup occurred, you can always query theMSDB database.
SELECT * |
Note the backup_finish_date and the type columns as this will give you an idea on the time you need to consider for your STOPAT value in your RESTORE command.
7) Restore another copy of the damaged database with a different name for investigation
Restoring another copy of the damaged database with a different name will allow us to work on the restoration of the data without worrying about the availability as we've already managed to bring up the production database. Just make sure you select a different name and database file location for the restored database or you'll end up damaging the already brought up database. We will just repeat what we did in step #4but with a twist - different name and database file location.
RESTORE DATABASE Northwind_recover RESTORE LOG Northwind_recover |
Document the value of your STOPAT parameter as this will be the most critical parameter you'll ever work with during this process. Since we just repeated the process in step#4, we know for a fact that the DROP TABLE command has not been executed at this time.
8) Restore the transaction log by increasing the value of the STOPAT parameter
We run another RESTORE LOG command, increasing the STOPAT parameter value by a minute - from 10:42:44.00 to 10:43:44.00
RESTORE LOG Northwind_recover |
This is the part where it becomes iterative. Don't be frustrated at this point as it will be really tedious. You can increase the value by a minute, 5 minutes, 10 minutes and document the time. Remember to run a test query on the dropped object after running the RESTORE LOG command. I would recommend creating a table for this activity that would look something like this.
TIME | OBJECT EXISTED? |
10:43:44.00 | YES |
10:48:44.00 | YES |
10:58:44.00 | YES |
11:03:44.00 | NO |
With this information, you know for a fact that the table was dropped between10:58:44.00 to 11:03:44.00. You can repeat step #8 and increase the value of theSTOPAT parameter by a minute or even a second if you may since you already have a smaller interval to work with. If you find yourself overshooting the time value of theSTOPAT parameter, go back to step #7 armed with the tabular information you've documented in step #8, making the restore process a bit faster. Just remember to use the WITH RECOVERY option at the last RESTORE LOG statement like this
RESTORE LOG Northwind_recover |
Once you've managed to restore the database to the time before the DROP TABLEcommand was executed, you can now do a comparison between what was restored on the production database and what was recovered. You can do this in a number of different ways. Since we already have a database snapshot created earlier, we will use that together with the TableDiff utility. Although the tool was designed for replication, we can use it for disaster recovery as well. A previous tip on SQL Server 2005 tablediff command line utility can give you an overview on how to use this tool but just to highlight that your source database will be the one that you recovered and the destination database will be your database snapshot. This is where your database snapshot would prove to be very important especially if you are dealing with more than one object which is normally the case. If you are not comfortable with command-line utilities, a GUI version was created by the guys from SQLTeam.com. You might want to check that out as well and include it in your DBA toolbox
You can also do an INSERT/SELECT where you insert records on the production database based on a query on the recovered database. Since our Order Details table does not have an IDENTITY column, we can create our own by inserting the records in a temporary table and using the ROW_NUMBER() function:
--This inserts records in a temporary table and assigns a dummy identity value for reference |
Notice that we used our database snapshot to identify the difference between what we managed to restore and what we have recovered.
Next Steps
- It is important to have a disaster recovery plan in place and the procedures necessary for recovery. While it is impossible to come up with procedures for almost every type of disaster, it would help if you start listing the possible disasters that may happen and prepare a disaster recovery plan with procedures and document accordingly.
- Simulate this particular process by going thru the steps outlined above.
- You can download the Northwind database used in the sample here.
1 comment:
Is't this the same article as http://www.mssqltips.com/tip.asp?tip=1595
?
Post a Comment