Robert's Blog


Wednesday, September 19, 2007

Time to Say Goodbye to Batch?

How's that for an opening? Actually, this post is not intended to be as controversial as the title might indicate. I don't mean to suggest that all database processing should be online. Instead, I'll argue that the term "batch," as a name for a type of computer workload, is outdated. On top of that, I feel that it's time to think of what we've long called "batch processing" in new ways.

OK, first the word. I'm fine with "batch" when used by, say, my mother-in-law, as in, "Robert, I made a batch of cookies" (Ginny makes really awesome cookies). I don't like "batch" in an IT context because I feel that it can constrain your thinking. To get a sense of what I mean by that, think about some brain-teaser you might have tried to solve at some time. I seem to recall one that instructed you to draw four straight lines so as to end up with two polygons, without lifting your pencil and without having any lines cross other lines (or something like that). You can do that by drawing a triangle with the first three lines, and then drawing the fourth line from the point at which you finished the triangle to the opposite side of the triangle. Some people would say of the solution, "That's cheating! You can't have the two polygons share a side!" Oh, really? Where does it say that you can't share a side? It doesn't say that anywhere, but we have a tendency to assume restrictions that in fact do not exist.

Thus it can be in the world of what we generally refer to as batch processing. You may work for an organization that has an application that is characterized by a large batch workload. Client companies send big files of data to your company on a daily basis. You process these at your site, and the output is in the form of more files that your company in turn sends somewhere else (or perhaps back to the originator). You do it this way, and your clients do it this way, because, well, it's always been done that way. Oh, sure, technology has intervened in some ways, most notably in that the old tape files have been replaced by electronic transmissions, and the application programs interact with a relational database management system, but the basic processing remains pretty much as it's been for years (maybe decades). With online transactions going 24x7 these days, and with ever-increasing volumes of data coming in and going out, the old batch processing system is starting to crack at the seams, but what can you do? You've already reduced elapsed time by splitting incoming files into smaller files and processing them in parallel. What's left?
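
Just to make that last point concrete, here's a rough sketch (in Java, purely for illustration) of the kind of file-splitting I'm talking about: one big daily input file gets carved into several smaller files, each of which can then be fed to its own copy of the batch program. The file names, the number of streams, and the round-robin split are my own assumptions, not anybody's production setup.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Split one big input file into N smaller files (round-robin, one record per
// line) so that N copies of the batch program can process them in parallel.
public class FileSplitter {
    public static void main(String[] args) throws IOException {
        int streams = 4;                               // number of parallel streams (assumption)
        Path input = Paths.get("daily_input.dat");     // hypothetical input file name

        List<BufferedWriter> writers = new ArrayList<>();
        for (int i = 0; i < streams; i++) {
            writers.add(Files.newBufferedWriter(Paths.get("chunk_" + i + ".dat")));
        }
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String record;
            long count = 0;
            while ((record = reader.readLine()) != null) {
                BufferedWriter w = writers.get((int) (count++ % streams));
                w.write(record);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }
}
```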

OK, time to think about breaking some of the rules that you've always imagined are there, but which in fact don't exist. Suppose your organization has developed an application that is great at transaction processing. Why not feed data received in files into that transactional system? Are you thinking that you can't do that? Who says? Does the idea of transactionalizing your batch workload violate some real constraint, or just an assumed restriction? You might think that you can't run batch stuff in a transactional fashion because it comes in files and transaction programs operate on an individual piece of input. Well, isn't a batch file just a big collection of individual input records? Couldn't you take a file and pre-process it with a program that puts the individual data records on a queue (perhaps a WebSphere MQ queue)? Having done that, couldn't queue "listener" programs take the messages (records) off the queue and feed them through a transactional system? At the back end of the system, the application output records could be placed on another queue, and a program could take them from there and build the output file expected by the organization (again, perhaps the originating client) that receives it today.
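
If you'd like a feel for what the front end of such a pipeline could look like, here's a minimal sketch using the standard JMS interfaces (which WebSphere MQ supports). The queue name, input file name, and the processAsTransaction() hook are placeholders of my own; how you obtain the ConnectionFactory (a JNDI lookup, a vendor-supplied class, whatever) depends on your environment, so I've left it out.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

// Front end of the pipeline: read an incoming batch file record by record and
// put each record on a queue as its own message; then a "listener" takes the
// records off the queue and runs each one through the existing transactional
// logic. Queue name, file name, and record layout are placeholders.
public class FileToQueueFeeder {

    // Pre-process a received file: one message per record.
    public static void feed(ConnectionFactory factory, String queueName, String inputFile)
            throws Exception {
        Connection conn = factory.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue(queueName));
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(inputFile))) {
                String record;
                while ((record = reader.readLine()) != null) {
                    producer.send(session.createTextMessage(record));
                }
            }
        } finally {
            conn.close();
        }
    }

    // The listener: each message is one record, processed as one transaction.
    public static class RecordListener implements MessageListener {
        public void onMessage(Message message) {
            try {
                String record = ((TextMessage) message).getText();
                processAsTransaction(record); // hypothetical hook into the online application
            } catch (JMSException e) {
                throw new RuntimeException(e);
            }
        }

        private void processAsTransaction(String record) {
            // Whatever the transaction program already does with one piece of
            // input, it does here with one record pulled from the queue.
        }
    }

    // Wire the listener to the queue (connection details again left out).
    public static void listen(ConnectionFactory factory, String queueName) throws Exception {
        Connection conn = factory.createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
        consumer.setMessageListener(new RecordListener());
        conn.start(); // messages now flow to onMessage() as they arrive
    }
}
```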

At this point, you might be thinking that a transformation of this nature sounds like a lot of work without a lot of benefit, since you just end up with the same output file you're generating today. Think further, however, and you might see advantages both tactical and strategic in nature. Tactical benefits include the following:
  • Better use of server resources - Parallel batch processing, in the traditional sense, is nice, but it tends to be rather inflexible. With individual records fed via a queue into a transactional system, you have the potential for really dynamic multi-threading. You can vary the number of concurrently executing programs that pull input messages off of a queue (and it can take quite a few such programs to pull records off of a queue as fast as an upstream program can put them there) so as to throttle up the processing during periods of low system demand, and you can just as easily throttle things down during peak-use hours (there's a sketch of this just after the list).
  • Better coexistence with your traditional transactional workload - For years, people have fretted about batch workloads interfering with online workloads. Well, if everything is processed as transactions, where's the conflict?
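
Here's the sketch promised in the first bullet: a consumer pool whose size can be dialed up or down on the fly. A java.util.concurrent BlockingQueue stands in for the real message queue, and the record-processing step is a stub; the point is simply that "run exactly N de-queuing programs right now" can be an operational knob.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// A pool of queue consumers whose size can be changed while the system runs.
public class ThrottledConsumers {
    private final BlockingQueue<String> queue;          // stand-in for the MQ queue
    private final List<Worker> workers = new ArrayList<>();

    public ThrottledConsumers(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    // "Run exactly n consumers": start new ones or signal excess ones to stop.
    public synchronized void setConcurrency(int n) {
        while (workers.size() < n) {
            Worker w = new Worker();
            workers.add(w);
            new Thread(w).start();
        }
        while (workers.size() > n) {
            workers.remove(workers.size() - 1).stop.set(true);
        }
    }

    private class Worker implements Runnable {
        final AtomicBoolean stop = new AtomicBoolean(false);

        public void run() {
            try {
                while (!stop.get()) {
                    String record = queue.poll(1, TimeUnit.SECONDS); // wait briefly for work
                    if (record != null) {
                        processAsTransaction(record);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void processAsTransaction(String record) {
        // Hypothetical hook into the existing transactional application.
    }
}
```

An operator (or an automated scheduler) might call setConcurrency(20) during a quiet overnight window and setConcurrency(2) as the online peak ramps up - same programs, same queue, very different resource profile.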
On the strategic side, the potential benefits are very cool. With a queue between incoming data files and the transaction processing system, you're ready to approach your clients with an interesting proposition: "Why don't you send those input records to us as they're generated on your system, instead of batching them up and sending them in files?" You could make that a particularly compelling idea if you externalize your input-receiving queue, perhaps in the form of a Web service. And how about on the output end - would they like to get the information from you as your transactional system generates it, instead of waiting for your batch file to arrive? If so, you're ready to do that (remember, you have a queue on the tail end of your transactional system) as soon as they are ready to receive the record-at-a-time output from you (suggestion: they can externalize an output-record-receipt process as a Web service). Now, the whole end-to-end system becomes a dynamic thing of beauty (hey, beauty is in the eye of the beholder, and I'll admit to getting pretty geeked up when I think about this stuff). Think of the possibilities for your client's clients (much faster turnaround times for their application requests), and the compelling pitch that your organization can make to potential customers.
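
And here's one way the "externalize the input-receiving queue as a Web service" idea could look in miniature: clients send one record at a time to an HTTP endpoint, and each record goes straight into the pipeline. The path, port, and the in-memory queue stand-in are my assumptions; in a real shop the handler would put the record on the MQ queue that feeds the transactional system (for example, via the feeder shown earlier).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A bare-bones record-receipt service: each HTTP POST carries one input record,
// which is dropped onto the queue that feeds the transactional system.
public class RecordReceiptService {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> inputQueue = new LinkedBlockingQueue<>(); // stand-in for the real queue

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/records", exchange -> {
            try (InputStream body = exchange.getRequestBody()) {
                String record = new String(body.readAllBytes(), StandardCharsets.UTF_8);
                inputQueue.put(record);                 // record is now in the pipeline
                exchange.sendResponseHeaders(202, -1);  // 202 Accepted, no response body
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                exchange.sendResponseHeaders(500, -1);
            } finally {
                exchange.close();
            }
        });
        server.start();
        System.out.println("Receiving records one at a time on /records");
    }
}
```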

This is not the stuff of dreams. There are organizations that have already transactionalized processes that were once handled in traditional batch mode. Your organization can do it, too, and you can be a part of it.

Oh, and about that word "batch." Leave it for cookies and such, and use a different term for your newly transactionalized workload. Personally, I like "offline processing." Sounds better than "transbatchional." You can come up with your own term. Use your imagination. Creativity is a good thing.

1 Comment:

Anonymous said...

That's pretty much what our shop has been doing with newer applications, but the success is very limited as far as I'm concerned. It is a very costly method: little or no use of prefetch during mass processing, an enormous amount of synchronous I/O instead (and it's really random), much more complex concurrency issues because it needs "scaling" to be able to approach a certain throughput, more than 50-60% overhead for transaction handling, several MQ issues (queue depth and such), abend-storms, and little or no oversight of the progress of mass (and other) deliveries when there is a limited window (mainly for legal reasons) available for responding to a mass delivery, etc. The theory sounds nice, but I think it's wise to be careful: only a limited number of the processes that are now serviced through batch can be considered for this. There's no magic ...

November 19, 2007 at 12:47 PM  
