Robert's Blog

Thursday, September 27, 2007

The Most Important DB2 for z/OS Performance Metric

I don't mean to imply that effective DB2 performance monitoring and tuning comes down to tracking one number - there are a dozen or so DB2 monitor fields on which I tend to focus in my work. I do feel that one metric stands alone as being the most useful and most illuminating with respect to the analysis of a DB2 subsystem's performance: GETPAGEs.

First, some review:
  • A GETPAGE is a request by DB2 to examine the contents of a database page. A GETPAGE is NOT equivalent to an I/O request. If DB2 issues a request to look at a particular page, and that page is already in one of the subsystem's buffer pools, the GETPAGE counter is incremented by one and the I/O counter is not incremented.
  • If DB2 issues a request to look at ANY kind of database page - a data page, an index page (leaf or non-leaf), a space map page - that counts as a GETPAGE.
You can find GETPAGE information in both the statistics and accounting reports that most DB2 monitor products can generate. You can also find the numbers in online monitor displays, where "statistics" might be referred to as "subsystem activity," and "accounting data" may be called "thread detail data" (it may sound old-fashioned, but I prefer hard copy reports over online displays for DB2 performance analysis - I like to underline things and write down comments in the margins).

The GETPAGE numbers that are of the most interest to me are found in the buffer pool section of a DB2 accounting report (preferably
a report called "accounting detail" or perhaps "accounting report - long," depending on the vendor of the monitor product). I'm often working to analyze the performance of a particular DB2 application or some component thereof, and accounting reports show me activity pertaining to an application (statistics reports also contain GETPAGE information in the buffer pool section, but that data reflects activity for the DB2 subsystem as a whole).

GETPAGE numbers are great because they inform you as to the CPU efficiency of an application's DB2-related processing. The more GETPAGE requests DB2 has to issue to perform a given piece of work, the more CPU time that piece of work is going to consume. I suggest getting baseline GETPAGE numbers for an application (or a piece of an application) that is of interest to you, performance-wise. This is not hard to do, as a DB2 monitor will typically enable you to group accounting information at various levels of granularity, including plan name (some installations have a plan per application), connection name (i.e., the name of a CICS region or a WLM application environment used for DB2 stored procedures), correlation ID (which could be a CICS transaction ID or the name of a batch job), and package name (DB2 for z/OS Version 8 introduced significant enhancements with respect to the level of detail provided with package-level accounting data).

Over time, track the GETPAGE numbers for the application at regular intervals (e.g., weekly). What trends are evident? Is the number of GETPAGEs going up? If so, that might not be a problem - it could be that the increase in GETPAGE activity is the result of an increase in the application workload (to check this, see if the number of GETPAGEs per COMMIT increased, or if the ratio of GETPAGEs to SQL statements executed went up - if not, the GETPAGE increase is probably due to workload growth). If an observed increase in GETPAGE activity can't be explained in terms of increased application activity, what else could be happening? Here are a few possibilities (there are others):
  • It could be that data pages are increasingly filled with "cold" data rows. By "cold," I mean rows that are very rarely accessed ("hot" rows are frequently accessed). Oftentimes, data rows become increasingly chilly (using this analogy whereby temperature is associated with the frequency with which rows are requested by application programs) as they become older. If data rows are NOT clustered in date sequence, and if they are never deleted from a table (you'd be surprised at how often this is the case out there in database land), over time the average data page will contain fewer and fewer "hot" rows, and more and more "cold" rows. Because programs are typically requesting sets of "hot" rows, and because each page is holding fewer "hot" rows as time goes by, it becomes more and more likely that the set of rows requested by an application program is now spread over, say, ten or twelve pages instead of two or three. Voila: GETPAGEs go up, and so does the application's consumption of CPU on the data-serving mainframe. A possible near-term solution would involve the creation of a history table to which "cold" rows can be moved. Implementation of a comprehensive information lifecycle management (ILM) solution would be a more strategically attractive response.
  • It could be that DB2 data access paths have changed for one or more programs. Perhaps a program that once retrieved data from a table by way of an index now gets that data via a tablespace scan. Maybe data access that formerly utilized an index that provided optimal retrieval efficiency (i.e., minimized GETPAGEs) now uses a different index that forces more page accesses to return the same data rows. In any case, the likely culprit is incomplete and/or incorrect statistics in the DB2 catalog (coupled with a rebind of the program - most likely an auto-rebind - if the SQL is static). Yes, it's possible that a good (fewer GETPAGEs) access path was changed to a bad one (more GETPAGEs) because of a bug in the DB2 optimizer, but that is rarely the case. Much more often, the optimizer is working just fine, but it has misleading information about things such as the number of rows in a table, or the cardinality of columns in the table, or the correlation between values of different columns in a table, or any of around 90 other statistical values stored in the DB2 catalog and used as input to the access path selection process. It's important to run the RUNSTATS utility regularly to keep those catalog statistics up-to-date.
  • DB2 might be having to look at lots of data pages every time a row is inserted into a table. This can happen when a tablespace is partitioned and the length of rows in the tables is variable (and this does not require the presence of VARCHAR columns, as even rows you think of as being fixed-length are in fact variable-length rows if the tablespace is compressed). This page-by-page examination at insert time is most likely to occur when data in the partitioned tablespace is not clustered by a continuously-ascending key. The reason for all the GETPAGEs in this case has to do with the relatively imprecise information in the space map pages of partitioned tablespaces with respect to the size of "holes" in data pages (these being chunks of free space that might accommodate a new row). This imprecise data means that DB2 has to actually examine a data page before it can be certain that a new row will fit into that page, and it's possible that quite a few pages will have to be examined before one with a big enough chunk of free space is found. People have dealt with this issue in various ways over the years (one response is to minimize free space in a tablespace and just force all inserts to the end of the table, sacrificing good clusteredness for improved insert performance), but there's really good news now: DB2 for z/OS Version 9 offers a new type of tablespace - called a universal tablespace - that is both partitioned and segmented. Using a universal tablespace, you get the data management advantages of partitioned tablespaces (particularly attractive for very large tables) with the insert efficiency (among other things) provided by segmented tablespaces (thanks to the much more precise information about data page "holes" maintained in the space map pages of segmented tablespaces). So, there's one more great reason to get to DB2 for z/OS V9 sooner rather than later.
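The workload-normalization check described earlier (comparing GETPAGEs per COMMIT against a baseline before blaming anything else) is easy to automate once you've transcribed the numbers from your accounting reports. Here's a minimal sketch; the weekly figures and the 10% tolerance are hypothetical values, not anything a DB2 monitor produces directly:

```python
# Normalize GETPAGE counts by commits to separate workload growth from
# efficiency regressions. The numbers below are made-up figures of the
# kind you'd transcribe from weekly accounting detail reports.

def getpages_per_commit(getpages, commits):
    """GETPAGEs per unit of work; guards against a zero commit count."""
    return getpages / commits if commits else 0.0

def efficiency_regressed(baseline, current, tolerance=0.10):
    """True if the per-commit GETPAGE rate grew more than `tolerance`
    over the baseline - i.e., the increase is NOT explained by
    workload growth alone and deserves investigation."""
    base_rate = getpages_per_commit(*baseline)
    curr_rate = getpages_per_commit(*current)
    return curr_rate > base_rate * (1 + tolerance)

# Week 1: 12,000,000 GETPAGEs over 40,000 commits -> 300 per commit
baseline = (12_000_000, 40_000)
# Week 8: 19,800,000 GETPAGEs over 60,000 commits -> 330 per commit
current = (19_800_000, 60_000)

print(getpages_per_commit(*baseline))            # 300.0
print(getpages_per_commit(*current))             # 330.0
# Within the 10% tolerance: growth, not a regression.
print(efficiency_regressed(baseline, current))   # False
```

Raw GETPAGE counts went up 65% in this example, but per-commit activity grew only 10% - mostly a bigger workload, not a less efficient one.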
Keep an eye on that GETPAGE data, and you'll have a good feel for the operating efficiency of your mainframe DB2 subsystem. Work to drive GETPAGEs down, and you'll improve the performance of your DB2-accessing application programs. One metric won't tell you the whole story of DB2 application performance, but for busy DBAs, GETPAGEs make for a great starting point.

Wednesday, September 19, 2007

Time to Say Goodbye to Batch?

How's that for an opening? Actually, this post is not intended to be as controversial as the title might indicate. I don't mean to suggest that all database processing should be online. Instead, I'll argue that the term "batch," as a name for a type of computer workload, is outdated. On top of that, I feel that it's time to think of what we've long called "batch processing" in new ways.

OK, first the word. I'm fine with "batch" when used by, say, my mother-in-law, as in, "Robert, I made a batch of cookies" (Ginny makes really awesome cookies). I don't like "batch" in an IT context because I feel that it can constrain your thinking. To get a sense of what I mean by that, think about some brain-teaser you might have tried to solve at some time. I seem to recall one that instructed you to draw four straight lines so as to end up with two polygons, without lifting your pencil and without having any lines cross other lines (or something like that). You can do that by drawing a triangle with the first three lines, and then drawing the fourth line from the point at which you finished the triangle to the opposite side of the triangle. Some people would say of the solution, "That's cheating! You can't have the two polygons share a side!" Oh, really? Where does it say that you can't share a side? It doesn't say that anywhere, but we have a tendency to assume restrictions that in fact do not exist.

Thus it can be in the world of what we generally refer to as batch processing. You may work for an organization that has an application that is characterized by a large batch workload. Client companies send big files of data to your company on a daily basis. You process these at your site, and the output is in the form of more files that your company in turn sends somewhere else (or perhaps back to the originator). You do it this way, and your clients do it this way, because, well, it's always been done that way. Oh, sure, technology has intervened in some ways, most notably in that the old tape files have been replaced by electronic transmissions, and the application programs interact with a relational database management system, but the basic processing remains pretty much as it's been for years (maybe decades). With online transactions going 24X7 these days, and with ever-increasing volumes of data coming in and going out, the old batch processing system is starting to crack at the seams, but what can you do? You've already reduced elapsed time by splitting incoming files into smaller files and processing them in parallel. What's left?

OK, time to think about breaking some of the rules that you've always imagined are there, but which in fact don't exist. Suppose your organization has developed an application that is great at transaction processing. Why not feed data received in files into that transactional system? Are you thinking that you can't do that? Who says? Does the idea of transactionalizing your batch workload violate some real constraint, or just an assumed restriction? You might think that you can't run batch stuff in a transactional fashion because it comes in files and transaction programs operate on an individual piece of input. Well, isn't a batch file just a big collection of individual input records? Couldn't you take a file and pre-process it with a program that puts the individual data records on a queue (perhaps a WebSphere MQ queue)? Having done that, couldn't queue "listener" programs take the messages (records) off the queue and feed them through a transactional system? At the back end of the system, the application output records could be placed on another queue, and a program could take them from there and build the output file expected by the organization (again, perhaps the originating client) that receives same today.
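The file-to-queue-to-transactions idea can be sketched in a few lines. In this illustration, Python's standard queue module stands in for WebSphere MQ, and uppercasing a record stands in for the real transactional processing - the record layout, worker count, and names are all placeholders:

```python
# Sketch of the pattern: a pre-processor splits a batch file into
# individual records on a queue, and "listener" workers consume them
# as transactions, placing output on a back-end queue.
import queue
import threading

work_q = queue.Queue()   # stand-in for the input MQ queue
out_q = queue.Queue()    # stand-in for the output MQ queue

def preprocess(file_lines):
    """Put each input record on the work queue - a 'batch file' is
    just a big collection of individual transaction inputs."""
    for record in file_lines:
        work_q.put(record)

def listener():
    """Pull records off the queue and run each through the
    'transaction', placing the result on the output queue."""
    while True:
        try:
            record = work_q.get(timeout=0.1)
        except queue.Empty:
            return               # queue drained - worker exits
        out_q.put(record.upper())  # stand-in for real processing
        work_q.task_done()

preprocess(["order 1", "order 2", "order 3"])

# The worker count is the throttle: raise it during off-peak hours,
# lower it when online demand is high.
workers = [threading.Thread(target=listener) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

results = sorted(out_q.queue)
print(results)   # ['ORDER 1', 'ORDER 2', 'ORDER 3']
```

A downstream program would drain the output queue to build the file the receiving organization expects today - or, eventually, skip the file altogether.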

At this point, you might be thinking that a transformation of this nature sounds like a lot of work without a lot of benefit, since you just end up with the same output file you're generating today. Think further, however, and you might see advantages both tactical and strategic in nature. Tactical benefits include the following:
  • Better use of server resources - Parallel batch processing, in the traditional sense, is nice, but it tends to be rather inflexible. With individual records fed via a queue into a transactional system, you have the potential for really dynamic multi-threading. You can vary the number of concurrently executing programs that pull input messages off of a queue (and it can take quite a few such programs to pull records off of a queue as fast as an upstream program can put them there) so as to throttle up the processing during periods of low system demand, and you can just as easily throttle things down during peak-use hours.
  • Better coexistence with your traditional transactional workload - For years, people have fretted about batch workloads interfering with online workloads. Well, if everything is processed as transactions, where's the conflict?
On the strategic side, the potential benefits are very cool. With a queue between incoming data files and the transaction processing system, you're ready to approach your clients with an interesting proposition: "Why don't you send those input records to us as they're generated on your system, instead of batching them up and sending them in files?" You could make that a particularly compelling idea if you externalize your input-receiving queue, perhaps in the form of a Web service. And how about on the output end - would they like to get the information from you as your transactional system generates it, instead of waiting for your batch file to arrive? If so, you're ready to do that (remember, you have a queue on the tail end of your transactional system) as soon as they are ready to receive the record-at-a-time output from you (suggestion: they can externalize an output-record-receipt process as a Web service). Now, the whole end-to-end system becomes a dynamic thing of beauty (hey, beauty is in the eye of the beholder, and I'll admit to getting pretty geeked up when I think about this stuff). Think of the possibilities for your client's clients (much faster turnaround times for their application requests), and the compelling pitch that your organization can make to potential customers.

This is not the stuff of dreams. There are organizations that have already transactionalized processes that were once handled in traditional batch mode. Your organization can do it, too, and you can be a part of it.

Oh, and about that word "batch." Leave it for cookies and such, and use a different term for your newly transactionalized workload. Personally, I like "offline processing." Sounds better than "transbatchional." You can come up with your own term. Use your imagination. Creativity is a good thing.

Thursday, September 13, 2007

Thoughts on DB2 Data Clustering

Although this post is written specifically in the context of a DB2 for z/OS environment, most of what I have to say on the topic applies to DB2 for Linux, UNIX, and Windows, as well.

First, a quick review and level-set:
  • A clustering index informs DB2 of the physical order (according to the index key value) in which rows of the table should be arranged. The use of "should" here versus "must" is important. Suppose that a new row is to be inserted into a table, and the clustering index indicates that the row should be placed in page 123 so as to maintain the table's clustering sequence. If page 123 is X-locked by some other data-changing process that has not yet reached a commit point, DB2 will not wait for that X-lock to be released. Instead, it will attempt to place the new row in a page that is close to page 123 (within a few pages before or after page 123). The same is true if page 123 is full and can't hold the new row (index page splits can occur in a DB2 database, but there's no such thing as a DB2 data page split). This aspect of DB2's operation improves insert performance but can lead to tablespace disorganization. [Note: row-level locking is an alternative to the default page-level locking used by DB2 for z/OS, but be aware that row-level locking can increase CPU overhead in a data sharing group on a mainframe parallel sysplex.]
  • Every DB2 table with an index has a clustering index. The clustering index will be the one created with the CLUSTER keyword (CLUSTER can also be added via ALTER INDEX), or - in the absence of an index so designated - the first index defined on the table.
  • You can tell DB2 to ignore the clustering index when adding data to a table. For data added via the LOAD utility, you can do this through use of the RESUME option. For rows added via INSERT statements, clustering order can be ignored - starting with DB2 for z/OS V9.1 - via the APPEND option of CREATE and ALTER TABLE. [The MEMBER CLUSTER option of CREATE TABLESPACE also provides a means of ignoring clustering sequence for rows added via INSERT - the rows are placed in the tablespace based on available space.]
Now, a question that I find to be interesting: is clustering important? That is to say, if a tablespace becomes decidedly unclustered (i.e., if many rows are located far from where they should be with respect to clustering sequence), should you REORG the tablespace in order to get the data rows back into clustering sequence? Your response to this question might be, "Of course!" My response is, "Not necessarily." It depends a lot on the nature of data retrieval for the table in question. If read operations targeting the table generally provide a unique - and indexed - key, with just the one qualifying row being returned to the requester, who cares about the physical ordering of the data? This is particularly true if successive single-row read operations do not follow any kind of consistent pattern. You'll have a clustering index on the table (remember, if no index has the CLUSTER identifier, the first index created on the table is by default the clustering index), so the tablespace can get way disorganized (unless DELETE operations open up "holes" in pages and enable clustering to be maintained), but again, that's not necessarily cause for concern.

Think, too, about the relative importance of INSERT versus SELECT performance for the table (predicate-containing UPDATE and DELETE statements are lumped in with SELECTs with respect to this matter). Suppose that a table contains history information pertaining to users' visits to a Web site, and that this history information is used almost exclusively for customer care purposes. If very few site visitors call in with customer care issues, the number of table INSERT operations will far exceed the number of read requests targeting the table. That being the case, you'd probably want to optimize insert performance. This could mean defining only one index on the table (the more indexes defined on a table, the more CPU-costly INSERTs will be). If the lead column of that index key is some sort of customer-identifier value (i.e., not a continuously-ascending key such as a sequence number or a timestamp), the table will soon become way unclustered. Let it be so, as long as data retrieval performance is acceptable. What matters in this scenario is CPU-efficient INSERT operations.

Of course, there are many situations in which maintenance of data clustering is very important. If you have an application program that retrieves (for example) six months of purchase activity for a given customer, performance for that program will be optimized if the purchase-activity rows for a given customer are physically located on the same table page (or in adjacent table pages), versus being spread all over the table. You'll want to have a clustering index with a key that begins with the customer number, and you'll want to maintain good clustering for the table. Clustering maintenance will require periodic REORGs, both to restore the preferred ordering of rows and to reestablish freespace in data pages (adequate freespace helps to maintain good clustering between REORGs - think about a PCTFREE value for the tablespace that will allow for the insertion of at least a few new rows on any given page). Keep in mind that frequent updating of DB2 catalog statistics via RUNSTATS is important if you want to keep tabs on the clusteredness of tablespaces.
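The payoff of clustering for this kind of range retrieval is easy to see with a toy calculation: count the distinct pages that must be touched (roughly, GETPAGEs) to fetch one customer's rows when they sit together versus scattered across the table. The page and row counts below are arbitrary, chosen only for illustration:

```python
# Toy model: rows live in numbered slots; a "page" holds 4 slots.
ROWS_PER_PAGE = 4
NUM_CUSTOMERS = 25
ROWS_PER_CUSTOMER = 8

def pages_touched(row_positions):
    """Distinct data pages holding the given row slots - a rough
    proxy for the GETPAGEs needed to retrieve those rows."""
    return len({pos // ROWS_PER_PAGE for pos in row_positions})

# Clustered by customer: customer 7's rows occupy consecutive slots.
clustered = [7 * ROWS_PER_CUSTOMER + i for i in range(ROWS_PER_CUSTOMER)]

# Unclustered: the same 8 rows are spread evenly across the table.
scattered = [7 + i * NUM_CUSTOMERS for i in range(ROWS_PER_CUSTOMER)]

print(pages_touched(clustered))   # 2 pages (8 rows / 4 per page)
print(pages_touched(scattered))   # 8 pages - one page per row
```

Same eight rows, four times the page accesses when clustering is lost - and that ratio only gets worse with more rows per request.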

When you do run the REORG utility, you'll probably want to go with online REORG, as it really doesn't cost much more than an offline REORG (except for the disk space needed for shadow data sets), and if you need to cancel the REORG job before it completes, go ahead - there's no clean-up necessary, because the original tablespace and indexes are not changed by an online REORG. [Note that online REORG for DB2 for Linux/UNIX/Windows is different, in that it's a REORG-in-place operation versus the shadow-and-switch-data-sets technique used by a DB2 for z/OS online REORG.]

My bottom-line point is this: do not reflexively reorganize tablespaces that have become unclustered - to do so might waste some system resources. Instead, do a little bit of application and database object analysis, and see if you have some tablespaces that you can allow to become disorganized. The least expensive database maintenance task is one that you don't perform at all.

Thursday, September 6, 2007

Making SOA Perform

SOA, or Service-Oriented Architecture, is a hot concept these days, and it's more than a concept - a number of organizations have SOA-based applications in production, and more have such applications in various stages of development. You may know what SOA is all about:
  • Agility - The code that comprises an SOA-based application is modularized, i.e., it is contained within blocks that can be reused to build other applications. Programming productivity is further enhanced through abstraction, which serves to make invisible and immaterial to the developer of a service-consuming program all the technical particulars behind a service-providing program (e.g., programming language, server platform and operating system, application server type, DBMS, database schema, etc.). SOA enables an organization's IT group to respond quickly to changing market conditions.
  • Quality - Diagrams of complex applications can look like hairballs, what with all the point-to-point connections between programs and between programs and database objects (I can't take credit for the hairball analogy - I first heard it used by Paul Benati, an IT executive at CheckFree Corporation). An SOA greatly reduces the number of connection points within an application, and makes extensive use of standards with respect to describing application-provided services (WSDL, or Web Services Description Language, is a form of XML that is used for this purpose) and invoking same (accomplished by way of SOAP, or Simple Object Access Protocol - another form of XML). The simplified structure characteristic of an SOA makes an application less brittle, i.e., less prone to breaking as a result of code modifications.
  • Operational flexibility - Use of abstraction and standard interfaces in an SOA means that IT systems people can use the platforms and operating systems they want for different parts of the application infrastructure. Programming languages and DBMSs can also be mixed and matched, as can application servers.
What's not to like about SOA? Well, performance might be a concern, given that abstraction is achieved through extra code, which means more in the way of machine instruction pathlength. The reduction in application component interfaces (out with the point-to-point hairball, in with more of a hub-and-spoke service access infrastructure) also tends to increase pathlength. Thus the question: can an SOA-based application perform well? My answer to this question is "Yes," and I believe that two keys to success are the use of asynchronous processing for data-changing operations, and caching for data retrieval operations.

First, asynchronous processing. In the context of this post, I'm talking about unhooking the client end of an application from back-end database changes. Suppose, for example, that a user submits an order through a retailer's Web-based purchasing application. This action will cause some changes to be made in the database on which the application is built. If the application is SOA-based, it may be that the elapsed time required to effect these data changes will be somewhat greater than would be the case for an old-style monolithic application. Does this mean that the user will have to wait longer after clicking on "Submit" before the hourglass on his screen goes away? Maybe not. Suppose the order entry application were to indicate end-of-transaction once the end user's order information is placed on a message queue (managed, perhaps, by IBM's WebSphere MQ - formerly known as MQSeries and often referred to simply as MQ)? That could be pretty fast. The queue can be configured as a recoverable resource, so the order-input data can be retained even if the whole application system crashes. The user can go on his merry way, while a process gets the order information message off the queue and the appropriate database changes are made.

Is there a catch? Yes, but I think it's a little one. Because the update of the order entry database is asynchronous with respect to the end user's clicking on "Submit," if the user were to immediately click on a "View order history" link on the order acknowledgment page, it is conceivable that the user would not see in the resulting display the information pertaining to his just-submitted order - this because the database (from which "order history" information is retrieved) has not yet been updated with the just-submitted order information on the MQ queue. I believe that the risk of this happening is quite small, because it's likely that a few seconds will pass before the user a) thinks about viewing his order history with this retailer, b) locates - and moves his cursor to - the "View order history" link, and c) clicks on the link. A few seconds should be more than enough time for the back end of the order entry application to get the order information message off the MQ queue and to make the corresponding database changes.

Aside from the performance benefit (from the user's perspective) that can accrue through the use of asynchronous processing for database update operations, consider the accompanying boost that MQ can deliver in the area of application availability (again, from the user's perspective). If for some reason the back-end database system is not available for update processing (perhaps due to a server or OS failure, or a program logic error - or because of a database maintenance operation), the front-end queue can still accept messages and the application can still indicate "Order received" to a user who clicks the "Submit" button. Messages stack up on the queue, and when the database is again available the message backlog is processed.
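The decoupling described over the last few paragraphs can be sketched in a few lines. Here a Python queue stands in for a recoverable MQ queue and a dict stands in for the back-end database of record; the order numbers and function names are purely illustrative:

```python
# Asynchronous update pattern: the front end acknowledges the order as
# soon as it is on the queue; a back-end process drains the queue into
# the database later (or after an outage, when the database returns).
import queue

order_q = queue.Queue()   # stand-in for a persistent MQ queue
order_db = {}             # stand-in for the back-end database of record

def submit_order(order_id, payload):
    """Front end: enqueue and acknowledge immediately - the user never
    waits on the database update."""
    order_q.put((order_id, payload))
    return "Order received"

def drain_queue():
    """Back end: apply queued orders to the database when it is
    available; if it's down, messages simply stack up on the queue."""
    while not order_q.empty():
        order_id, payload = order_q.get()
        order_db[order_id] = payload

ack = submit_order(1001, {"item": "widget", "qty": 3})
print(ack)                # "Order received" before the DB is touched
print(1001 in order_db)   # False - the update is asynchronous
drain_queue()
print(order_db[1001]["qty"])   # 3 - now in the database of record
```

The small window between the acknowledgment and the database update is exactly the "view order history" catch discussed above - usually a matter of seconds or less.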

OK, on to caching for data-retrieval operations. Again, the idea is to use technology and application architecture to have your SOA cake (agility/quality/flexibility) and eat it, too (user-satisfying performance). The concept of data caching is relatively simple: you store a copy of oft-retrieved data in cache servers closer to the edge of an application's infrastructure, so that data retrieval requests do not have to flow all the way to the back-end database (the database of record) to be served. The not-so-simple part of data caching is propagating back-end database updates to the cache server(s) in a timely and accurate manner. This can be done, and different organizations take different approaches with respect to data cache implementation. Some go the "roll your own" route, using in-house-developed code to keep mid-tier data stores in near-sync status relative to the back-end database of record; other organizations utilize data caching solutions available from various vendors (examples include TigerLogic FastSOA from Raining Data and DataXtend CE from Progress Software). Whether implementation has been a matter of build or buy, organizations have seen dramatic improvements in data retrieval performance through the use of data caching. This speed benefit is particularly advantageous in an SOA environment, as it can more than offset the drag on transaction response times caused by the distinct functional layering, abstraction, and standards-based interfaces to application subsystems and to associated data that are hallmarks of an SOA.
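A minimal read-through cache sketch shows the retrieval side of the idea. A simple time-to-live staleness rule stands in here for the much harder problem of propagating back-end updates in a timely way; the customer key, TTL value, and data are all made up for illustration:

```python
# Read-through cache: serve repeat requests from an edge cache and
# fall back to the back-end database of record on a miss or on expiry.
import time

backend_db = {"cust42": "Jane Doe"}   # stand-in database of record
cache = {}                            # edge cache: key -> (value, expiry)
TTL_SECONDS = 30.0
fetches_from_backend = 0              # counts trips to the back end

def get_customer(key, now=None):
    global fetches_from_backend
    now = time.monotonic() if now is None else now
    hit = cache.get(key)
    if hit is not None and hit[1] > now:
        return hit[0]                 # served from the edge cache
    value = backend_db[key]           # miss or stale: go to the back end
    fetches_from_backend += 1
    cache[key] = (value, now + TTL_SECONDS)
    return value

# Simulated clock values make the behavior easy to follow:
print(get_customer("cust42", now=0.0))    # back-end fetch (cold cache)
print(get_customer("cust42", now=10.0))   # cache hit - no back-end trip
print(get_customer("cust42", now=45.0))   # entry expired: back end again
print(fetches_from_backend)               # 2
```

Two back-end trips served three requests; at real-world request rates the savings are far more dramatic, which is what lets caching absorb the SOA layering overhead.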

So, embrace SOA for all the right reasons, and take advantage of message queuing and data caching technologies to give users what they want: speed.

Happy architecting!