Time to Say Goodbye to Batch?
How's that for an opening? Actually, this post is not intended to be as controversial as the title might indicate. I don't mean to suggest that all database processing should be online. Instead, I'll argue that the term "batch," as a name for a type of computer workload, is outdated. On top of that, I feel that it's time to think of what we've long called "batch processing" in new ways.
OK, first the word. I'm fine with "batch" when used by, say, my mother-in-law, as in, "Robert, I made a batch of cookies" (Ginny makes really awesome cookies). I don't like "batch" in an IT context because I feel that it can constrain your thinking. To get a sense of what I mean by that, think about some brain-teaser you might have tried to solve at some time. I seem to recall one that instructed you to draw four straight lines so as to end up with two polygons, without lifting your pencil and without having any lines cross other lines (or something like that). You can do that by drawing a triangle with the first three lines, and then drawing the fourth line from the point at which you finished the triangle to the opposite side of the triangle. Some people would say of the solution, "That's cheating! You can't have the two polygons share a side!" Oh, really? Where does it say that you can't share a side? It doesn't say that anywhere, but we have a tendency to assume restrictions that in fact do not exist.
Thus it can be in the world of what we generally refer to as batch processing. You may work for an organization that has an application that is characterized by a large batch workload. Client companies send big files of data to your company on a daily basis. You process these at your site, and the output is in the form of more files that your company in turn sends somewhere else (or perhaps back to the originator). You do it this way, and your clients do it this way, because, well, it's always been done that way. Oh, sure, technology has intervened in some ways, most notably in that the old tape files have been replaced by electronic transmissions, and the application programs interact with a relational database management system, but the basic processing remains pretty much as it's been for years (maybe decades). With online transactions going 24x7 these days, and with ever-increasing volumes of data coming in and going out, the old batch processing system is starting to come apart at the seams, but what can you do? You've already reduced elapsed time by splitting incoming files into smaller files and processing them in parallel. What's left?
OK, time to think about breaking some of the rules that you've always imagined are there, but which in fact don't exist. Suppose your organization has developed an application that is great at transaction processing. Why not feed data received in files into that transactional system? Are you thinking that you can't do that? Who says? Does the idea of transactionalizing your batch workload violate some real constraint, or just an assumed restriction? You might think that you can't run batch stuff in a transactional fashion because it comes in files, and transaction programs operate on an individual piece of input. Well, isn't a batch file just a big collection of individual input records? Couldn't you take a file and pre-process it with a program that puts the individual data records on a queue (perhaps a WebSphere MQ queue)? Having done that, couldn't queue "listener" programs take the messages (records) off the queue and feed them through a transactional system? At the back end of the system, the application output records could be placed on another queue, and a program could take them from there and build the output file expected by the organization (again, perhaps the originating client) that receives that file today.
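To make that flow a bit more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular shop's implementation: the standard library's in-memory queue.Queue stands in for a real message queue such as WebSphere MQ, threads stand in for listener programs, and the names load_file, listener, process_record, and run_pipeline, along with the "account,amount" record layout, are all assumptions made up for the example.

```python
import queue
import threading

SENTINEL = None  # placed on the input queue to tell a listener to stop


def process_record(record: str) -> str:
    """Hypothetical stand-in for the transactional program: one record in, one record out."""
    account, amount = record.split(",")
    return f"{account},{int(amount) * 2}"


def load_file(lines, in_q, n_listeners):
    """Pre-processor: put each record of the incoming 'file' on the input queue."""
    for seq, line in enumerate(lines):
        in_q.put((seq, line.strip()))
    for _ in range(n_listeners):  # one stop marker per listener
        in_q.put(SENTINEL)


def listener(in_q, out_q):
    """Listener: take one message at a time, process it, put the result on the output queue."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        seq, record = item
        out_q.put((seq, process_record(record)))


def run_pipeline(lines, n_listeners=4):
    """Feed a file through the listeners and rebuild the output records in input order."""
    in_q, out_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=listener, args=(in_q, out_q))
        for _ in range(n_listeners)
    ]
    for w in workers:
        w.start()
    load_file(lines, in_q, n_listeners)
    for w in workers:
        w.join()
    results = []
    while not out_q.empty():  # safe here: all listeners have finished
        results.append(out_q.get())
    results.sort()  # sequence numbers restore the original record order
    return [record for _, record in results]
```

Carrying a sequence number with each record is one simple way to let the back-end program rebuild the output file in the order the client expects, even though the listeners finish their work in no particular order.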
At this point, you might be thinking that a transformation of this nature sounds like a lot of work without a lot of benefit, since you just end up with the same output file you're generating today. Think further, however, and you might see advantages both tactical and strategic in nature. Tactical benefits include the following:
- Better use of server resources - Parallel batch processing, in the traditional sense, is nice, but it tends to be rather inflexible. With individual records fed via a queue into a transactional system, you have the potential for really dynamic multi-threading. You can vary the number of concurrently executing programs that pull input messages off of a queue (and it can take quite a few such programs to pull records off of a queue as fast as an upstream program can put them there) so as to throttle up the processing during periods of low system demand, and you can just as easily throttle things down during peak-use hours.
- Better coexistence with your traditional transactional workload - For years, people have fretted about batch workloads interfering with online workloads. Well, if everything is processed as transactions, where's the conflict?
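The throttling idea in the first bullet can be sketched along the same lines. Again, this is a rough illustration under stated assumptions rather than a recipe: queue.Queue stands in for the real message queue, and ListenerPool, resize, and listeners_for_hour (including its peak hours and its low and high listener counts) are hypothetical names and values chosen for the example.

```python
import queue
import threading


class ListenerPool:
    """Pool of queue listeners whose size can be changed while work is in flight."""

    def __init__(self, in_q, handler):
        self.in_q = in_q
        self.handler = handler
        self._workers = []  # list of (thread, stop_event) pairs

    def _run(self, stop):
        # Each listener handles one message at a time until told to stop.
        while not stop.is_set():
            try:
                msg = self.in_q.get(timeout=0.05)
            except queue.Empty:
                continue
            self.handler(msg)
            self.in_q.task_done()

    def resize(self, n):
        """Start or stop listeners until exactly n are running."""
        while len(self._workers) < n:
            stop = threading.Event()
            t = threading.Thread(target=self._run, args=(stop,), daemon=True)
            t.start()
            self._workers.append((t, stop))
        while len(self._workers) > n:
            t, stop = self._workers.pop()
            stop.set()
            t.join()  # the listener finishes its current message first

    @property
    def size(self):
        return len(self._workers)


def listeners_for_hour(hour, peak_hours=range(8, 18), low=2, high=8):
    """Hypothetical policy: few listeners during peak online hours, more off-peak."""
    return low if hour in peak_hours else high
```

A scheduler could call pool.resize(listeners_for_hour(current_hour)) every so often; because the listeners only ever take one message at a time, shrinking the pool during peak hours simply slows the drain of the queue rather than interrupting any unit of work.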
This is not the stuff of dreams. There are organizations that have already transactionalized processes that were once handled in traditional batch mode. Your organization can do it, too, and you can be a part of it.
Oh, and about that word "batch." Leave it for cookies and such, and use a different term for your newly transactionalized workload. Personally, I like "offline processing." Sounds better than "transbatchional." You can come up with your own term. Use your imagination. Creativity is a good thing.
1 Comment:
That's pretty much what our shop has been doing with newer applications, but the success is very limited as far as I'm concerned. It is a very costly method: little or no use of prefetch during mass processing, with an enormous amount of synchronous I/O instead (and it's really random); much more complex concurrency issues, because it needs "scaling" to be able to approach a certain throughput; more than 50-60% overhead for transaction handling; several MQ issues (queue depth and such); abend storms; and little or no oversight of the progress of mass (and other) deliveries when there is a limited window (mainly for legal reasons) available for responding to a mass delivery. The theory sounds nice, but I think it's wise to be careful: only a limited number of the processes that are now being serviced through batch can be considered for this. There's no magic ...