Robert's Blog


Wednesday, January 16, 2008

Aggressive DB2 Disaster Recovery

Back in the mid-1990s, when I was on the DB2 for z/OS National Technical support team at IBM's Dallas Systems Center, I did some consulting work for a major insurance company. For most of one day during an on-site working session, we reviewed the company's documented disaster recovery (DR) procedure for their very large, very high-volume, DB2-based core application system. This organization had put together one of the most comprehensive and sophisticated DR action plans I'd ever seen, characterized by extensive use of automation (a given step in the DR procedure would call for the submission of job X, which would submit for execution jobs A, B, and C - each of which would complete several tasks related to an aspect of the overall DR process). By following their procedure, the systems staff of the insurance company were pretty confident that, were the primary data center to be incapacitated by a disaster-scope event, they could get the mainframe DB2 database and associated core application system restored and ready for work at the backup site within two to three days.

Two to three days! And that was really good if you used what was then about the only DR method available to mainframe DB2 users, called the "pickup truck" method by my Dallas Systems Center colleague Judy Ruby-Brown: once or twice a day, you had your DB2 image copy and archive log tapes loaded onto a truck and transported to your DR site. To recover the database at the backup location, you had to restore all those image copies to disk and then recover all of the tablespaces to as current a state as possible using the archive log tapes available to you. And after that you had to rebuild all the indexes. And before all that you had to do a conditional restart of the DB2 subsystem. And before that you had to get the z/OS system (called OS/390 back then) restored so that you could start DB2 (and that could take several hours). Consider all this in light of the relatively slow (by today's standards) disk storage systems and mainframe engines in use a dozen years ago and you can see why DR was a very time-consuming endeavor (the aforementioned 2-3 days looks great when you consider that some companies estimated that it could take a week to get their big mainframe DB2 databases and applications ready for work after a disaster event).
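
For readers who never had to go through it, here's a rough sketch of the per-table-space utility work that method implied once the image copies were back on disk. The object name PRODDB01.TSORDERS is purely illustrative, and the syntax shown is later-vintage (in the mid-1990s the index step was done with the RECOVER INDEX utility, since renamed REBUILD INDEX):

   RECOVER TABLESPACE PRODDB01.TSORDERS
   REBUILD INDEX(ALL) TABLESPACE PRODDB01.TSORDERS

The RECOVER statement restores the most recent image copy and applies whatever archive log records made it onto the truck; the REBUILD statement then reconstructs every index on the table space from the recovered data. Multiply that by hundreds or thousands of table spaces and the 2-3 day figure starts to make sense.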

Times have changed, technology has changed, users' expectations have changed, and now many organizations would be in a world of hurt if their core applications were to be out of commission for several days. So much business is conducted online - between individual consumers and companies, and between companies and other companies (e.g., between an automobile manufacturer and its suppliers). Fortunately, modern computer hardware and software offerings have enabled organizations to slash the time required to recover from a disaster-scope event impacting a data center (as well as greatly reducing the amount of data lost as a result of such calamities). Of great importance in this area was the emergence a few years ago of remote mirroring features in enterprise-class disk storage subsystems. Given various names by different vendors (from IBM, Metro Mirror and Global Mirror; from EMC, SRDF and SRDF-A; and from Hitachi Data Systems, TrueCopy), these features let a company maintain, at a backup data center site, a copy of disk-stored data (all data - not just a database such as DB2) that is either exactly in synch (if the backup site is within, say, 25 or so fiber-link miles of the primary site) or very nearly in synch (if the backup site is far from the primary site) with respect to the disk-stored data that supports the production application system. Because DB2 for z/OS active log data sets can be mirrored in this way, along with the DB2 catalog and directory and all of the application DB2 database objects, remote-site recovery of a knocked-out mainframe DB2 system is not much different from an in-place recovery/restart following a temporary loss of power at the main site: you basically get the z/OS system back up and then issue the -START DB2 command.
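
To make that concrete, here's a minimal sketch of the database portion of such a DR runbook once the mirrored volumes are online at the backup site and z/OS is up. DB2P is just a placeholder subsystem name, and the exact sequence would of course vary by installation:

   -START DB2
   -DIS THREAD(*) TYPE(INDOUBT)
   -DIS DATABASE(*) SPACENAM(*) RESTRICT

The first command restarts the subsystem from the mirrored BSDS, active logs, catalog, and directory; the second checks for indoubt units of recovery left over from the outage; the third flags any objects that came up in a restricted status. No image copy restores, no mass RECOVER jobs, no index rebuilds.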

When an organization implements a remote disk mirroring solution for DR purposes, how quickly can it get a mainframe DB2-based application system up and ready for work at a DR site following a primary-site disaster? This is, to me, a very interesting question. I don't know what the record is (and I'm talking about terabyte-plus-sized databases here, with peak-time workload volumes of at least a few hundred transactions per second), but I'll put this on the table: I think that if a company's systems people are good - really good - they can get a big and busy DB2 for z/OS database and application system online and work-ready at a DR site within 20-30 minutes of a primary-site disaster. What do you think? Feel free to post a comment to this effect.

The DB2 for LUW people may have a DR speed advantage because of the High Availability Disaster Recovery (HADR) feature introduced with DB2 V8.2. HADR is, conceptually, like log shipping taken to the extreme, with data change-related log records sent in real time to a remote DB2 for LUW system and "played" there, so that the buffer pool of the secondary DB2 database is kept "warm" with respect to that of the primary DB2, insofar as data and index pages affected by updates (and deletes and inserts) are concerned. This being the case, the time required by the secondary DB2 system to take over for the primary DB2 is really small: generally speaking, less than 30 seconds (the "warm" buffer pool on the secondary DB2 makes roll-forward recovery unnecessary, and slashes the elapsed time of in-flight UR backout processing). Now, having the database ready to go in a half-minute or less isn't the whole story (there are application and Web servers to deal with, and network stuff such as rerouting transactions that would normally flow to the primary site), but I'm willing to put another declaration on the table: given a DB2 HADR configuration, I believe that a crack IT crew could get a large and busy DB2 for LUW database and application up and work-ready at a DR site within 10-15 minutes of a primary-site disaster. Do any of you feel likewise? Even better, have any of you done this, either for real or in a full-scale "drill" exercise? I'd welcome your comments.
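
If it helps to picture it, here's a minimal sketch of the DB2 for LUW commands involved, with SALESDB as a placeholder database name, and assuming the standby was originally seeded from a restored backup of the primary and that the HADR_* configuration parameters (local and remote host, service names, synchronization mode, and so on) have already been set on both databases:

   db2 start hadr on database SALESDB as standby
   db2pd -db SALESDB -hadr
   db2 takeover hadr on database SALESDB by force

The first command (run at the DR site, well ahead of time) starts the standby replaying the log records shipped from the primary; db2pd lets you monitor the state of the pairing; and the TAKEOVER command, issued after the primary site is lost, turns the warm standby into the new primary - the BY FORCE option being what you use when the old primary can no longer be reached.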

In a few years, will DB2 people shake their heads at the recollection that companies used to feel that a 30-minute DR recovery time objective was world-class? How far will technology take us? I think it'll be a fun ride.

1 Comment:

Blogger guna said...

Brief but sound DR plan.

February 4, 2009 at 10:14 AM  
