The Most Important DB2 for z/OS Performance Metric
I don't mean to imply that effective DB2 performance monitoring and tuning comes down to tracking one number - there are about a dozen or so DB2 monitor fields on which I tend to focus in my work. I do feel that one metric stands alone as being the most useful and most illuminating with respect to the analysis of a DB2 subsystem's performance: GETPAGEs.
First, some review:
The GETPAGE numbers that are of the most interest to me are found in the buffer pool section of a DB2 accounting report (preferably a report called "accounting detail" or perhaps "accounting report - long," depending on the vendor of the monitor product). I'm often working to analyze the performance of a particular DB2 application or some component thereof, and accounting reports show me activity pertaining to an application (statistics reports also contain GETPAGE information in the buffer pool section, but that data reflects activity for the DB2 subsystem as a whole).
GETPAGE numbers are great because they inform you as to CPU efficiency of an application's DB2-related processing. The more GETPAGE requests DB2 has to issue to perform a given piece of work, the more CPU time that piece of work is going to consume. I suggest getting baseline GETPAGE numbers for an application (or a piece of an application) that is of interest to you, performance-wise. This is not hard to do, as a DB2 monitor will typically enable you to group accounting information at various levels of granularity, including plan name (some installations have a plan per application), connection name (i.e., the name of a CICS region or a WLM application environment used for DB2 stored procedures), correlation ID (which could be a CICS transaction ID or the name of a batch job), and package name (DB2 for z/OS Version 8 introduced significant enhancements with respect to the level of detail provided with package-level accounting data).
Over time, track the GETPAGE numbers for the application at regular intervals (e.g., weekly). What trends are evident? Is the number of GETPAGEs going up? If so, that might not be a problem - it could be that the increase in GETPAGE activity is the result of an increase in the application workload (to check this, see if the number of GETPAGEs per COMMIT increased, or if the ratio of GETPAGEs to SQL statements executed went up - if not, the GETPAGE increase is probably due to workload growth). If an observed increase in GETPAGE activity can't be explained in terms of increased application activity, what else could be happening? Here are a few possibilities (there are others):
First, some review:
- A GETPAGE is a request by DB2 to examine the contents of a database page. A GETPAGE is NOT equivalent to an I/O request. If DB2 issues a request to look at a particular page, and that page is already in one of the subsystem's buffer pools, the GETPAGE counter is incremented by one and the I/O counter is not incremented.
- If DB2 issues a request to look at ANY kind of database page - a data page, an index page (leaf or non-leaf), a space map page - that counts as a GETPAGE.
The GETPAGE numbers that are of the most interest to me are found in the buffer pool section of a DB2 accounting report (preferably a report called "accounting detail" or perhaps "accounting report - long," depending on the vendor of the monitor product). I'm often working to analyze the performance of a particular DB2 application or some component thereof, and accounting reports show me activity pertaining to an application (statistics reports also contain GETPAGE information in the buffer pool section, but that data reflects activity for the DB2 subsystem as a whole).
GETPAGE numbers are great because they inform you as to CPU efficiency of an application's DB2-related processing. The more GETPAGE requests DB2 has to issue to perform a given piece of work, the more CPU time that piece of work is going to consume. I suggest getting baseline GETPAGE numbers for an application (or a piece of an application) that is of interest to you, performance-wise. This is not hard to do, as a DB2 monitor will typically enable you to group accounting information at various levels of granularity, including plan name (some installations have a plan per application), connection name (i.e., the name of a CICS region or a WLM application environment used for DB2 stored procedures), correlation ID (which could be a CICS transaction ID or the name of a batch job), and package name (DB2 for z/OS Version 8 introduced significant enhancements with respect to the level of detail provided with package-level accounting data).
Over time, track the GETPAGE numbers for the application at regular intervals (e.g., weekly). What trends are evident? Is the number of GETPAGEs going up? If so, that might not be a problem - it could be that the increase in GETPAGE activity is the result of an increase in the application workload (to check this, see if the number of GETPAGEs per COMMIT increased, or if the ratio of GETPAGEs to SQL statements executed went up - if not, the GETPAGE increase is probably due to workload growth). If an observed increase in GETPAGE activity can't be explained in terms of increased application activity, what else could be happening? Here are a few possibilities (there are others):
- It could be that data pages are increasingly filled with "cold" data rows. By "cold," I mean rows that are very rarely accessed ("hot" rows are frequently accessed). Oftentimes, data rows become increasingly chilly (using this analogy whereby temperature is associated with the frequency with which rows are requested by application programs) as they become older. If data rows are NOT clustered in date sequence, and if they are never deleted from a table (you'd be surprised at how often this is the case out there in database land), over time the average data page will contain fewer and fewer "hot" rows, and more and more "cold" rows. Because programs are typically requesting sets of "hot" rows, and because each page is holding fewer "hot" rows as time goes by, it becomes more and more likely that the set of rows requested by an application program is now spread over, say, ten or twelve pages instead of two or three. Voila: GETPAGEs go up, and so does the application's consumption of CPU on the data-serving mainframe. A possible near-term solution would involve the creation of a history table to which "cold" rows can be moved. Implementation of a comprehensive information lifecycle management (ILM) solution would be a more strategically attractive response.
- It could be that DB2 data access paths have changed for one or more programs. Perhaps a program that once retrieved data from a table by way of an index now gets that data via a tablespace scan. Maybe data access that formerly utilized an index that provided optimal retrieval efficiency (i.e., minimized GETPAGEs) now uses a different index that forces more page accesses to return the same data rows. In any case, the likely culprit is incomplete and/or incorrect statistics in the DB2 catalog (coupled with a rebind of the program - most likely an auto-rebind - if the SQL is static). Yes, it's possible that a good (fewer GETPAGEs) access path was changed to a bad one (more GETPAGEs) because of a bug in the DB2 optimizer, but that is rarely the case. Much more often, the optimizer is working just fine, but it has misleading information about things such as the number of rows in a table, or the cardinality of columns in the table, or the correlation between values of different columns in a table, or any of around 90 other statistical values stored in the DB2 catalog and used as input to the access path selection process. It's important to run the RUNSTATs utility regularly to keep that catalog stats up-to-date.
- DB2 might be having to look at lots of data pages every time a row is inserted into a table. This can happen when a tablespace is partitioned and the length of rows in the tables is variable (and this does not require the presence of VARCHAR columns, as even rows you think of as being fixed-length are in fact variable-length rows if the tablespace is compressed). The looking at lots of pages (by DB2) when inserting a row is most likely to occur when data in the partitioned tablespace is not clustered by a continuously-ascending key. The reason for all the GETPAGEs in this case has to do with the relatively imprecise information in the space map pages of partitioned tablespaces with respect to the size of "holes" in data pages (these being chunks of free space that might accommodate a new row). This imprecise data means that DB2 has to actually examine a data page before it can be certain that a new row will fit into that page, and it's possible that quite a few pages will have to be examined before one with a big enough chunk of free space is found. People have dealt with this issue in various ways over the years (one response is to minimize free space in a tablespace and just force all inserts to the end of the table, sacrificing good clusteredness for improved insert performance), but there's really good news now: DB2 for z/OS Version 9 offers a new type of tablespace - called a universal tablespace - that is both partitioned and segmented. Using a universal tablespace, you get the data management advantages of partitioned tablespaces (particularly attractive for very large tables) with the insert efficiency (among other things) provided by segmented tablespaces (thanks to the much more precise information about data page "holes" maintained in the space map pages
of segmented tablespaces). So, there's one more great reason to get to DB2 for z/OS V9 sooner rather than later.