Robert's Blog


Tuesday, December 16, 2008

DB2 DR: Tape as a Bulwark Against Human Error

In an entry posted to this blog almost a year ago, titled "Aggressive DB2 Disaster Recovery," I wrote about the tremendous - and very positive - impact of remote disk mirroring technology on organizations' DB2 disaster recovery capabilities. Companies that had developed DB2 DR procedures with dozens of steps, and which aimed to have systems restored and ready to take application traffic within 2-3 days (or more) of a disaster event, with a loss of the most recent 24 hours (or more) of database updates, could, upon implementing a remote disk mirroring solution, realistically expect to have DB2-based application systems up and running at a DR site within 1-2 hours of a disaster event. The associated loss of previously-committed database changes would be maybe a very few seconds' worth (for asynchronous remote mirroring) or even zero (in the case of a synchronous remote disk mirroring configuration, which has a practical distance limitation of around 20 straight-line miles or so between the primary and DR sites).

The quantum leap forward in DB2 DR capabilities delivered by remote disk mirroring motivated a number of companies - particularly financial services firms - to deploy the technology very soon after it became available back in the mid-1990s. The cost of these solutions, including the bandwidth needed to make them work, has declined over the years, while the costs of downtime and data loss have increased, so adoption has become more and more widespread. Many people are quite enamored of remote disk mirroring technology (I'm very big on it myself), and it is understandable that enthusiasts might come to have a "Forget about it!" attitude towards tape-based DR.

Such a dismissal of tape-based DR solutions might be ill-advised - a point that was brought home to me by a recent question I saw from a DB2 professional in Europe. The question had to do with a process for getting the latest DB2 for z/OS active log data archived and sent off-site for DR purposes in a parallel sysplex/data sharing environment (more on the particulars of the question momentarily). In providing an answer, I asked the questioner why he was interested in sending DB2 archive log files to a DR site, given that his company had implemented a remote disk mirroring solution (which included the all-important mirroring of the DB2 active log files). It turns out that this person's organization was required to have in place a DR procedure that could be used in case all data on disk - at both the primary and the DR sites - were to be lost. He went on to say that this requirement was a direct result of an incident at another company, of which his organization had become aware: a human error (an accidental configuration change) caused that company - which had a remote disk mirroring system in place - to lose access to all data on disk at both the originating and replicate-to sites.

That brought to my mind a somewhat similar situation I read about some 15 years ago, in which a systems person made a mistake when entering a console command and all labels (containing control information) on all disk volumes were wiped out. The data was still there, but the system couldn't find it. A long outage ensued - lasting a day or two, as I recall. Ironically, this happened in a state-of-the-art data center that had been constructed so as to survive all manner of external disaster-type events. How do you protect mission-critical systems against internal threats of the human-error variety? Yes, you can (and should) put in place safeguards to help ensure that such errors will not occur (the WITH RESTRICT ON DROP option of the DB2 statement CREATE TABLE is an example of such a safeguard), but can you be certain that these measures are fail-safe? Can you anticipate every potentially devastating mistake that any person with access to your system might make (including your hardware vendors' maintenance and repair technicians)? Wonderful as remote disk mirroring is, you might sleep better at night knowing that a tape-based DR procedure (and "tape" is used loosely here - files could go to disk and be electronically transmitted to the DR site, perhaps to be transferred there to offline media) is documented, tested, and in operation (with respect to the regular sending of DB2 table backup and archive log files to the DR site). Hope that you won't ever have to use it (as previously mentioned, it does elongate disaster recovery time, and it risks loss of data changes made since the most recent log-archive operation), but know that elongated recovery with some data-update loss is way better than going out of business should your front-line DR solution fail you.

Now, about the particulars of that DB2 for z/OS DR question to which I referred: in a DB2 data sharing environment, tape-based recovery typically involves the periodic issuing of an -ARCHIVE LOG SCOPE(GROUP) command. This causes all active DB2 members in the data sharing group to truncate and archive their current active log data sets. Output of command execution includes an ENDLRSN value (referring to a timestamp-based point in the log) for each DB2 subsystem, indicating the end-point of the just-archived log files. If you had to use these files to recover the data sharing group at the DR site, you'd use the smallest of these ENDLRSN values in a conditional restart to truncate all members' logs to the same point in time (important for data integrity). Suppose that you do the -ARCHIVE LOG SCOPE(GROUP) every hour to minimize potential data loss (assuming that you either don't use remote disk mirroring, or you're establishing a safety net - as herein advocated - in case of a failure of the disk mirroring system). What if a DB2 member has to be down for several hours for some reason, so that the ENDLRSN of the last archive log from that member sent to the DR site is hours older than the end-points of the other DB2 members' log files at the DR site? Do you have to toss all of those more-current archive log files and use the oldest ENDLRSN value (the one for the most recent archive log of the member that's been down for hours) for your conditional restart? In fact, you don't have to, if you take the proper steps in shutting the member down.
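To make that a bit more concrete, here's a rough sketch of the group-wide archive and the conditional restart set-up (the member command prefix, data set names, and LRSN value are all invented for illustration - the exact truncation value to specify should come from your documented DR procedure). The periodic archive is a single command, issued from any active member:

    -DB1A ARCHIVE LOG SCOPE(GROUP)

At the DR site, the lowest ENDLRSN among the members' last-received archive logs would then be used to build a conditional restart control record in each member's BSDS, via the DSNJU003 (change log inventory) utility - something along these lines:

    //CRCR     EXEC PGM=DSNJU003
    //STEPLIB  DD  DISP=SHR,DSN=DSN910.SDSNLOAD
    //SYSUT1   DD  DISP=OLD,DSN=DB1A.BSDS01
    //SYSUT2   DD  DISP=OLD,DSN=DB1A.BSDS02
    //SYSPRINT DD  SYSOUT=*
    //SYSIN    DD  *
      CRESTART CREATE,ENDLRSN=00C3A1B2C3D4
    /*

Using the same (lowest) ENDLRSN for every member ensures that no member's log is applied past a point for which the other members' log data isn't available at the DR site - that's the data-integrity consideration mentioned above.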
Here's what you'd do to shut down member A (for example): first, quiesce the subsystem, so that there are no in-flight, in-doubt, in-abort, or postponed-abort units of recovery. Then, just before shutting the subsystem down, do an -ARCHIVE LOG (and subsequently send that archived log file to the DR site). What you will then end up with on the new current active log data set for member A is just a shutdown checkpoint record, and that's not needed for recovery at the DR site (you could run DSN1LOGP against the end of that last member A archive log - the one generated via the just-before-shutdown -ARCHIVE LOG operation - to verify that there were no incomplete URs on member A at the time of the shutdown). With member A shut down in this way, you would NOT have to throw out the more-current log data from the other DB2 data sharing group members when recovering at the DR site.
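A sketch of that shutdown sequence for member A (again with an invented command prefix; the mechanics of draining application work - stopping DDF, the CICS and IMS attachments, and so on - will vary from site to site):

    -DB1A DISPLAY THREAD(*) TYPE(ACTIVE)    (confirm that application work has drained)
    -DB1A DISPLAY THREAD(*) TYPE(INDOUBT)   (confirm that there are no in-doubt URs)
    -DB1A ARCHIVE LOG                       (truncate and offload the current active log,
                                             then send that archive to the DR site)
    -DB1A STOP DB2 MODE(QUIESCE)            (shut the member down)

And the DSN1LOGP check of that last member A archive log might look something like this (data set names are, once more, invented; SUMMARY(ONLY) produces a report of the completed and incomplete units of recovery found in the log data that's read):

    //LOGCHK   EXEC PGM=DSN1LOGP
    //STEPLIB  DD  DISP=SHR,DSN=DSN910.SDSNLOAD
    //SYSPRINT DD  SYSOUT=*
    //SYSSUMRY DD  SYSOUT=*
    //ARCHIVE  DD  DISP=SHR,DSN=DB1A.ARCHLOG1.A0001234
    //SYSIN    DD  *
      SUMMARY(ONLY)
    /*

If the summary report shows no in-flight, in-doubt, in-abort, or postponed-abort URs for member A, you know that the member's new active log (containing just that shutdown checkpoint) isn't needed for recovery at the DR site.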
