Robert's Blog


Tuesday, April 22, 2008

For Ultra-Fast DB2 Recovery, Don't Recover

Over the past decade or so, DB2 disaster recovery (DR) has come a long way, especially where the mainframe platform is concerned. At least the DB2 for Linux/UNIX/Windows (LUW) folks had log shipping, which enabled organizations to get the backup data server at the DR site up and ready to handle application requests in less than an hour following a primary-site disaster event (and without too much data loss, to boot - maybe 30 minutes or less, depending on the frequency of inactive log file backup operations).
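In case it helps to picture what that looked like in practice, here is a minimal sketch of the log shipping approach using DB2 CLP commands (the database name and backup path are made up, and the scripting that actually moves archived log files from the primary site to the standby site is left out):

# One-time seeding of the standby from a backup of the primary database
# (backup at the primary site, restore at the standby site - the restored
# database is left in rollforward-pending state):
db2 backup database SALESDB to /backups
db2 restore database SALESDB from /backups

# Repeated at the standby site as archived log files arrive from the primary:
db2 rollforward database SALESDB to end of logs

# At disaster time only, to open the standby database for application work:
db2 rollforward database SALESDB to end of logs and stop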

For the mainframe DB2 crowd, it was a different story. Up until around the mid-1990s, people would take their most recent image copy and archive log files and send them, maybe once or twice each day, to the DR site. [It was common for these files to be transported to the DR site in TAPE form in a TRUCK - keep in mind that the volume of data to be transported could be quite large, and the capacity of a "high bandwidth" T1 line was almost two orders of magnitude less than what you get with an OC-3 circuit today, so electronic transmission was not always feasible.] Cue the primary-site disaster, and what happened next? The DB2 subsystem had to be brought up on the DR site mainframe via a conditional restart process (lots of fun), then all the tablespace image copies had to be restored, then all the table updates made after those image copies were taken had to be applied from the logs, and then all the indexes had to be rebuilt (this was before one could image copy an index). I mean, you were often looking at several DAYS of work to get the database ready for the resumption of application service, and on top of that you might be without the last 24 or more hours of database updates made prior to the disaster event.

What happened in the mid-1990s? Disk array-based replication came on the scene. This made mainframe DB2 people very happy, because it VASTLY simplified and accelerated the DB2 DR process, and it slashed the amount of data loss with which one had to deal in the event of a primary-site disaster situation. [When disk array-based replication is operating in synchronous mode (feasible if the primary and backup data center sites are within about 20 straight-line miles of each other), there will be ZERO loss of committed database changes if a disaster incapacitates the primary-site data center. If asynchronous mode is in effect (the right choice for long-distance data change replication), data loss will likely be measured in seconds.] This goodness is due to the fact that disk array-based replication (also known as remote disk mirroring) mirrors ALL data sets on associated disk volumes - not just those corresponding to DB2 tablespaces and indexes. That means the DB2 active log data sets are mirrored, and THAT means that the DB2 DR process amounts to little more than entering -START DB2 on a console at the DR site. The subsystem restarts as it would if you restarted DB2 in-place following a power failure at the primary site, and there you are, ready to get application work going again.

In recent years, array-based replication has become an option for DB2 for LUW users (the disk subsystems equipped with remote mirroring capability - from various vendors including IBM, EMC, and HDS - initially supported only mainframe attachment). Another big step forward in DB2 for LUW DR was the introduction of the HADR (High Availability Disaster Recovery) feature with DB2 for LUW Version 8.2. HADR is conceptually like log shipping taken to the extreme, with log records pertaining to data change operations being transmitted to and applied on a secondary DB2 instance in real time as they are generated on the primary instance. Like remote disk mirroring, HADR can make DB2 for LUW DR a quick and simple process, with zero or just a few seconds of data loss (depending on whether HADR is running in one of its two synchronous modes or in asynchronous mode).
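If you're curious about the plumbing, here is a bare-bones sketch of an HADR setup using DB2 CLP commands (the host names, port numbers, instance name, and database name are all made up, and in real life you would first seed the standby by restoring a backup of the primary database):

# Point each copy of the database at its partner (shown for the primary;
# the standby gets the mirror-image settings). NEARSYNC is one of the two
# synchronous modes; ASYNC is the long-distance option.
db2 update db cfg for SALESDB using \
    HADR_LOCAL_HOST dbhost1   HADR_LOCAL_SVC 55001 \
    HADR_REMOTE_HOST dbhost2  HADR_REMOTE_SVC 55002 \
    HADR_REMOTE_INST db2inst1 HADR_SYNCMODE NEARSYNC

# Start HADR on the standby first, then on the primary:
db2 start hadr on database SALESDB as standby
db2 start hadr on database SALESDB as primary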

So, whether the DR-accelerating technology of choice is remote disk mirroring (DB2 on any platform) or HADR (DB2 for LUW), DB2 users can get their DR time down to a pretty impressive figure. What kind of DR time am I talking about here? Well, as I pointed out in a previous post to my blog, I believe that a really good group of systems people could get a DB2 for z/OS database and application system online and work-ready at a DR site within 20-30 minutes of a primary-site disaster. If the OS platform is Linux, UNIX, or Windows, and if DB2 is running in an HADR cluster, I believe that the DR time could be taken down to maybe 10-15 minutes (less than the best-in-class DB2 for z/OS time because HADR eliminates the need for the roll-forward phase of DB2 for LUW restart and dramatically speeds the rollback phase).
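To make that concrete, the database piece of an HADR site switch boils down to one command issued against the standby copy of the database (the database name here is made up):

# Planned role switch, with both copies up and communicating:
db2 takeover hadr on database SALESDB

# Disaster scenario - the old primary is gone, so force the standby to
# become the new primary without waiting to hear from its partner:
db2 takeover hadr on database SALESDB by force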

There is only one problem with these impressive DR times: for some organizations, they are not fast enough. What if you need a DR time so short that, from an application user's perspective, the system is NEVER down, even if, behind the scenes, one of your data centers is incapacitated by a disaster event? Pulling that off would probably require getting your DR time down to a minute or less (and keep in mind that I'm not just talking about getting the database ready for work - I mean getting the application and all the network-related stuff taken care of, too). I seriously doubt that this could be done if you were to go about DR in the traditional way. The solution, then, is to go the non-traditional route, and by that I mean, simply: don't recover.

How do you do DR with no DB2 recovery in the traditional sense? It starts with you getting the notion of primary and backup data centers out of your mind. Instead, you run with two or more "peer" data centers, each running a COMPLETE instance of the database and the application system and each handling a portion of the overall application workload (e.g., transactions initiated by users in one geography could be routed to data center A, while other transactions are routed to data center B). Bi-directional database change propagation implemented through replication software tools (available from vendors such as Informatica, GoldenGate, and DataMirror - the latter now a part of IBM) keeps the different copies of the database located at the different data centers in near-synch with each other. If site A is incapacitated because of a disaster event, you don't try to recover the site A system at site B; rather, you move the work that had been routed to site A over to site B, and you just keep on trucking. At a later time, when you can get site A back online, you re-establish the copy of the database at that site, re-synch it with the copy at site B, and go back to the workload split that you had going before the site A disaster event.
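One design detail worth mentioning here: with both sites inserting rows into their own copy of the same tables, you have to keep the two systems from generating conflicting key values. As a purely illustrative sketch (the table and column names are made up, and your replication tool may offer other ways to handle this), each site could be given its own non-overlapping range of system-generated keys:

-- At site A, the table generates odd key values (1, 3, 5, ...):
CREATE TABLE ORDERS
  (ORDER_ID   BIGINT NOT NULL
     GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 2),
   ORDER_INFO VARCHAR(100),
   PRIMARY KEY (ORDER_ID));

-- At site B, the same table generates even key values (2, 4, 6, ...),
-- so rows replicated between the sites never collide on the key:
CREATE TABLE ORDERS
  (ORDER_ID   BIGINT NOT NULL
     GENERATED BY DEFAULT AS IDENTITY (START WITH 2 INCREMENT BY 2),
   ORDER_INFO VARCHAR(100),
   PRIMARY KEY (ORDER_ID));

The BY DEFAULT option (as opposed to ALWAYS) is what lets the replication apply process insert the key values that were generated at the other site.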

I've just described an ultra-high availability strategy that is relatively simple in concept but quite challenging to design and implement. I'll be explaining things more fully in a presentation - "Ultra-Availability - How Far Can You Go?" - that I will be delivering at the International DB2 Users Group (IDUG) North American Conference in May, and again at the IDUG European Conference in October. I hope to see you in Dallas or Warsaw (or both). In the meantime, don't listen to people who tell you that the kind of ultra-high availability I'm talking about can't be achieved. It can. You just have to think outside of the box.

2 Comments:

Anonymous said...

Hi Rob
Interesting article.
One question: using two identical data repositories would not only address the DR aspect, it would also let the two systems share the workload during the other 99.99% of the time, giving better load balancing.
If one of the systems becomes busier than the other, however, is there any way the calling application can check the 'health' of each data server before sending the request in? Almost like a WLM environment which sits on the application server and 'talks' to the WLM environments on each of the data servers.
Thanks for any advice
Simon

June 8, 2008 at 9:38 AM  
Robert Catterall said...

That's an excellent question, and it speaks to the complexity of making the "ultra-availability" system work. I believe that the functionality about which you're inquiring could be built into the system. Presumably, the network routers that get requests to either site A or site B could do so based on an algorithm provided by the user. A factor in this routing equation could perhaps be updated by way of a message sent to the routers, and it's possible that the factor adjustment could be automated through programs running on app servers that would poll back-end data servers to check on their load status (I have to believe that there's a workload management API on the mainframe, and there may be such an API for LUW platforms, as well). If the DB2 data server's "degree of busyness" gets above some threshold, the app server sends a message to the router, and the request-routing is altered accordingly.

Sorry to be a little short on details here. I'm thinking that a good bit of user programming would be required, but it may be that some router and/or app server vendors have done at least some of this for you.
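Just to make the polling idea a bit more concrete, on a DB2 for LUW data server it could start out as simple as a script along these lines (the node and database names are made up, and the remote instance is assumed to be cataloged on the app server):

# Hypothetical health-check sequence run from the app server; a script
# would parse the output for indicators of load (connections, lock waits,
# and so on) and notify the router when a threshold is crossed.
db2 attach to DB2NODEA
db2 get snapshot for database on SALESDB
db2 list applications for database SALESDB show detail
db2 detach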

June 9, 2008 at 3:25 PM  
