The Most Important DB2 for z/OS Performance Metric
I don't mean to imply that effective DB2 performance monitoring and tuning comes down to tracking one number - there are a dozen or so DB2 monitor fields on which I tend to focus in my work. Still, I do feel that one metric stands alone as the most useful and most illuminating with respect to the analysis of a DB2 subsystem's performance: GETPAGEs.
First, some review:
- A GETPAGE is a request by DB2 to examine the contents of a database page. A GETPAGE is NOT equivalent to an I/O request. If DB2 issues a request to look at a particular page, and that page is already in one of the subsystem's buffer pools, the GETPAGE counter is incremented by one and the I/O counter is not incremented.
- If DB2 issues a request to look at ANY kind of database page - a data page, an index page (leaf or non-leaf), a space map page - that counts as a GETPAGE.
The GETPAGE numbers that are of the most interest to me are found in the buffer pool section of a DB2 accounting report (preferably a report called "accounting detail" or perhaps "accounting report - long," depending on the vendor of the monitor product). I'm often working to analyze the performance of a particular DB2 application or some component thereof, and accounting reports show me activity pertaining to an application (statistics reports also contain GETPAGE information in the buffer pool section, but that data reflects activity for the DB2 subsystem as a whole).
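If the DB2 accounting trace isn't already active in your environment, the data behind these reports can be collected with a command along the following lines (a sketch, not a prescription for your site: classes 2 and 3 add the in-DB2 elapsed, CPU, and suspension time breakdowns, and classes 7 and 8 provide package-level detail):

```
-START TRACE(ACCTG) CLASS(1,2,3,7,8) DEST(SMF)
```

Your monitor product then formats the resulting SMF records into the accounting reports described here.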
GETPAGE numbers are great because they inform you as to the CPU efficiency of an application's DB2-related processing. The more GETPAGE requests DB2 has to issue to perform a given piece of work, the more CPU time that piece of work is going to consume. I suggest getting baseline GETPAGE numbers for an application (or a piece of an application) that is of interest to you, performance-wise. This is not hard to do, as a DB2 monitor will typically enable you to group accounting information at various levels of granularity, including plan name (some installations have a plan per application), connection name (e.g., the name of a CICS region or a WLM application environment used for DB2 stored procedures), correlation ID (which could be a CICS transaction ID or the name of a batch job), and package name (DB2 for z/OS Version 8 introduced significant enhancements with respect to the level of detail provided with package-level accounting data).
Over time, track the GETPAGE numbers for the application at regular intervals (e.g., weekly). What trends are evident? Is the number of GETPAGEs going up? If so, that might not be a problem - it could be that the increase in GETPAGE activity is the result of an increase in the application workload (to check this, see if the number of GETPAGEs per COMMIT increased, or if the ratio of GETPAGEs to SQL statements executed went up - if not, the GETPAGE increase is probably due to workload growth).
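As a concrete illustration of that normalization check, here is a minimal SQL sketch, assuming the weekly figures are captured in a do-it-yourself tracking table (the table and column names are hypothetical):

```
-- Hypothetical table populated weekly from accounting report figures
CREATE TABLE GETPAGE_TREND
  (CAPTURE_DATE DATE    NOT NULL,
   PLAN_NAME    CHAR(8) NOT NULL,
   GETPAGES     BIGINT  NOT NULL,
   COMMITS      BIGINT  NOT NULL,
   SQL_STMTS    BIGINT  NOT NULL);

-- GETPAGEs per commit and per SQL statement: if these ratios hold
-- steady while raw GETPAGEs climb, suspect workload growth rather
-- than an efficiency problem (NULLIF guards against division by zero)
SELECT CAPTURE_DATE, PLAN_NAME, GETPAGES,
       DECIMAL(GETPAGES) / NULLIF(COMMITS, 0)   AS GETPAGES_PER_COMMIT,
       DECIMAL(GETPAGES) / NULLIF(SQL_STMTS, 0) AS GETPAGES_PER_STMT
  FROM GETPAGE_TREND
 ORDER BY PLAN_NAME, CAPTURE_DATE;
```

If an observed increase in GETPAGE activity can't be explained in terms of increased application activity, what else could be happening? Here are a few possibilities (there are others):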
- It could be that data pages are increasingly filled with "cold" data rows. By "cold," I mean rows that are very rarely accessed ("hot" rows are frequently accessed). Oftentimes, data rows become increasingly chilly (using this analogy whereby temperature is associated with the frequency with which rows are requested by application programs) as they become older. If data rows are NOT clustered in date sequence, and if they are never deleted from a table (you'd be surprised at how often this is the case out there in database land), over time the average data page will contain fewer and fewer "hot" rows, and more and more "cold" rows. Because programs are typically requesting sets of "hot" rows, and because each page is holding fewer "hot" rows as time goes by, it becomes more and more likely that the set of rows requested by an application program is now spread over, say, ten or twelve pages instead of two or three. Voila: GETPAGEs go up, and so does the application's consumption of CPU on the data-serving mainframe. A possible near-term solution would involve the creation of a history table to which "cold" rows can be moved. Implementation of a comprehensive information lifecycle management (ILM) solution would be a more strategically attractive response.
- It could be that DB2 data access paths have changed for one or more programs. Perhaps a program that once retrieved data from a table by way of an index now gets that data via a tablespace scan. Maybe data access that formerly utilized an index that provided optimal retrieval efficiency (i.e., minimized GETPAGEs) now uses a different index that forces more page accesses to return the same data rows. In any case, the likely culprit is incomplete and/or incorrect statistics in the DB2 catalog (coupled with a rebind of the program - most likely an auto-rebind - if the SQL is static). Yes, it's possible that a good (fewer GETPAGEs) access path was changed to a bad one (more GETPAGEs) because of a bug in the DB2 optimizer, but that is rarely the case. Much more often, the optimizer is working just fine, but it has misleading information about things such as the number of rows in a table, or the cardinality of columns in the table, or the correlation between values of different columns in a table, or any of around 90 other statistical values stored in the DB2 catalog and used as input to the access path selection process. It's important to run the RUNSTATS utility regularly to keep those catalog statistics up to date (a sample RUNSTATS statement appears after this list).
- DB2 might have to look at lots of data pages every time a row is inserted into a table. This can happen when a tablespace is partitioned and the length of rows in the tables is variable (and this does not require the presence of VARCHAR columns, as even rows you think of as being fixed-length are in fact variable-length rows if the tablespace is compressed). This page-by-page searching at insert time is most likely to occur when data in the partitioned tablespace is not clustered by a continuously-ascending key. The reason for all the GETPAGEs in this case has to do with the relatively imprecise information in the space map pages of partitioned tablespaces with respect to the size of "holes" in data pages (these being chunks of free space that might accommodate a new row). This imprecise data means that DB2 has to actually examine a data page before it can be certain that a new row will fit into that page, and it's possible that quite a few pages will have to be examined before one with a big enough chunk of free space is found. People have dealt with this issue in various ways over the years (one response is to minimize free space in a tablespace and just force all inserts to the end of the table, sacrificing good clusteredness for improved insert performance), but there's really good news now: DB2 for z/OS Version 9 offers a new type of tablespace - called a universal tablespace - that is both partitioned and segmented (a sample definition appears below). Using a universal tablespace, you get the data management advantages of partitioned tablespaces (particularly attractive for very large tables) with the insert efficiency (among other things) provided by segmented tablespaces (thanks to the much more precise information about data page "holes" maintained in the space map pages of segmented tablespaces). So, there's one more great reason to get to DB2 for z/OS V9 sooner rather than later.
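To make that last point concrete: in DB2 9 new-function mode, specifying both NUMPARTS and SEGSIZE on CREATE TABLESPACE yields a range-partitioned universal tablespace. A minimal sketch (the database, tablespace, and sizing values are hypothetical):

```
CREATE TABLESPACE UTSDEMO IN MYDB
  NUMPARTS 8
  SEGSIZE 32
  LOCKSIZE ANY;
```

And going back to the catalog statistics item above, a minimal RUNSTATS control statement might look like this (again, the database and tablespace names are placeholders for your own objects):

```
RUNSTATS TABLESPACE MYDB.MYTS
  TABLE(ALL) INDEX(ALL)
```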
4 Comments:
Thanks. Great summary.
In my opinion there is one more important reason for a GETPAGE increase: disorganization of data. However, disorganization is an issue only for SQL workloads whose access paths involve prefetch.
You have succinctly brought up two important and interrelated points: 1) disorganization of data can drive up GETPAGE activity (and, therefore, CPU consumption) for queries, and 2) in some cases data organization is not so important.
With regard to point 1, I vividly recall a situation in which DBAs at a DB2 for z/OS-using organization watched the in-DB2 CPU cost of a frequently-executed transaction increase steadily from month to month. The DBAs didn't know what to make of this, as access paths for SQL statements issued via the transaction hadn't changed, nor had the size of result sets increased. It turned out that increasing GETPAGE activity was the cause of the CPU increase, and GETPAGE activity was going up because a) the table's clustering key was not continuously-ascending, so newer rows were spread all over the table (that's often OK), and b) they never deleted older rows from the table (that's often NOT OK). As a result, over time a greater and greater percentage of rows on each page, on average, were "old," and (in the case of this transaction) rarely retrieved. The newer rows (which were more frequently retrieved than the older rows) were further and further apart from each other in the table, separated by a growing number of "old" rows. Thus the rise in GETPAGE activity to retrieve multi-row result sets of newer rows. The performance-improving solution here would be to remove older rows from the table periodically, either eliminating them entirely (into "the bit bucket") or, if needed, archiving them in a low-cost repository (in or outside of the DB2 database). Archive/purge processes are often thought of in terms of regulatory compliance and legal safeguards (i.e., don't keep records longer than you have to), but they can definitely impact application performance in a positive way by getting "old and cold" data out of high-activity tables.
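As a sketch of that kind of purge/archive process (the table names and the two-year retention predicate are purely illustrative, and in practice the work would be broken into smaller units with intermediate commits):

```
-- Copy aged rows to a history table, then remove them from the
-- high-activity table (illustrative names and retention period)
INSERT INTO ORDERS_HIST
  SELECT * FROM ORDERS
  WHERE ORDER_DATE < CURRENT DATE - 2 YEARS;

DELETE FROM ORDERS
  WHERE ORDER_DATE < CURRENT DATE - 2 YEARS;

COMMIT;
```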
As for point 2, organizations often reflexively run REORG for table spaces that have become somewhat disorganized, without considering whether or not that actually buys anything back, performance-wise. When data in a table is accessed exclusively by queries that have single-row result sets and "equals" predicates that match unique index keys, who cares about the clusteredness of the data in the table? Why burn CPU in running REORGs that don't improve performance of the single-row queries that dominate access to the table's data? Organizations that want to get smart about running utilities such as REORG are using information provided by DB2 real-time statistics, such as the REORGCLUSTERSENS column of SYSIBM.SYSTABLESPACESTATS. If a table is not being accessed by queries that are sensitive to clustering sequence, data organization is generally not a big deal.
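A real-time statistics query along these lines (the predicate is arbitrary, and column availability depends on your DB2 version) could help identify the table spaces where clustering actually matters to the workload, so REORG effort can be focused where it pays off:

```
-- Table spaces with clustering-sensitive access since the last REORG
SELECT DBNAME, NAME, REORGCLUSTERSENS, REORGUNCLUSTINS
  FROM SYSIBM.SYSTABLESPACESTATS
 WHERE REORGCLUSTERSENS > 0
 ORDER BY REORGCLUSTERSENS DESC;
```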
In my shop, the number of GETPAGEs increased drastically for one of the programs after an HA implementation. It used to be under 50, but it reached into the millions for the same QUERYNO. I checked the access path and it is the same as it was before HA. One thing I noticed with respect to the underlying buffer pool is that VPSEQT is set to 30, and because of that the number of VPSEQT hits is too high. The number of VPSEQT hits per prefetch I/O (sequential, random, and list) is around 1.6 - that means for every 1.6 prefetch requests we are reaching VPSEQT.
Sorry about the delayed response.
I don't know what you are referring to by an "HA implementation."
One possible cause of a GETPAGE spike is hitting the data manager threshold for a buffer pool. You might want to verify that the data manager threshold (DMTH) is not being hit for any of your buffer pools. If DMTH is hit for a buffer pool, it is usually because the pool is way too small, the pool's deferred write thresholds are way too high, or both.
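The DETAIL form of the DISPLAY BUFFERPOOL command shows, among other counters, the number of times DMTH has been reached since the statistics were last reset, and ALTER BUFFERPOOL can be used to enlarge a pool or lower its deferred write thresholds (the pool name and values below are illustrative, not recommendations):

```
-DISPLAY BUFFERPOOL(BP2) DETAIL

-ALTER BUFFERPOOL(BP2) VPSIZE(40000) DWQT(30) VDWQT(5)
```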
Robert