### DB2 9 for z/OS - Good News on the Stats Front

About 30 some-odd years ago, a team of IBM researchers (working on what would become DB2) invented cost-based SQL statement optimization (thank you, Patricia Selinger et al). This development ended up being a very big part of DB2's appeal in the marketplace, because it meant that the DBMS, given several alternative routes by which data qualified by a query could be retrieved, was able to select the access path that would cost the least in terms of system resources.

The part of DB2 code that performs access path selection is called the optimizer, and it bases its decisions largely on the database statistics that are generated through the RUNSTATS utility and stored in the DB2 catalog. In the early years, the optimizer was handicapped in that it did not have information regarding the distribution of values within a column of a table, and so had to assume an even distribution (e.g., if a table had 1 million rows and an associated column had 1000 values, DB2 would assume 1000 duplicates of each of the 1000 distinct values). This assumption could lead to sub-optimal access path selection decisions because in many cases a non-unique column has a very non-uniform distribution of data values (consider the previous example: 900,000 of the 1 million rows could contain one value of the column, with the other 999 distinct values spread across the remaining 100,000 rows). This "blind spot" regarding column data value distribution could lead to the calculation of inaccurate "filter factors" for query predicates, and that could lead to the selection of something other than the actual least-cost access path for a query.

As I recall, it was in the mid-1990s that DB2 and RUNSTATS (for the mainframe and for the Linux, UNIX, and Windows - aka LUW - platforms) were enhanced to provide the optimizer with information about the distribution of data values within a column. For both mainframe DB2 and DB2 for LUW, one could have RUNSTATS generate statistics pertaining to the percentage of rows in a column containing each of the "n" most frequently occurring of the column's values. Subsequent enhancements for the mainframe DB2 platform enabled one to also generate, through RUNSTATS, values for the percentage of a table's rows containing each of the "n" least-most frequently occurring values in a column, or stats for both the "n" most- and "n" least-frequently occurring of a column's values.

These frequency statistics were a great help to the optimizer, enabling DB2 to select better access paths for many queries referencing columns with non-uniform value distributions. [A number of mainframe DB2-using organizations are realizing the benefit of frequency-distribution statistics without even knowing it. When the DB2 for z/OS RUNSTATS utility is executed with the INDEX keyword on the control card, frequency-distribution statistics will automatically be generated for the top 10 most frequently-occurring values of the first column of the key of the targeted index. It's not uncommon for people to code a RUNSTATS TABLESPACE control card that specifies TABLE(ALL) INDEX(ALL), and in that case the utility will generate frequency distribution stats for the top 10 most frequently-occurring values of the first column of the key of every index on every table in the target tablespace.]

Still, there was room for improvement. Having statistics on the frequency of occurrence of the top 10, or 15, or 20 values in a column is great, but if a column has 100,000 values that can still lead, in some cases, to less-than-desired accuracy with respect to predicate filter-factor calculations and, by extension, sub-optimal access path selection. Collecting frequency-distribution statistics for the top, say, 10,000 of those 100,000 distinct values would effectively address that problem, but at a too-high cost in terms of RUNSTATS CPU consumption. A better solution was delivered in DB2 for z/OS V9 via the new HISTOGRAM option of RUNSTATS (DB2 for LUW users got histogram stats, generated via the WITH DISTRIBUTION option of the RUNSTATS command, some years ago).

When the RUNSTATS utility is executed with the HISTOGRAM option, DB2 will collect value frequency distribution statistics for a column (or group of columns) based on a certain number of quantiles (either user-specified or determined by DB2). Each quantile is associated with a range of column values, and each range includes about the same number of table rows. For a given quantile, DB2 knows the upper and lower values that bound the quantile, the number of distinct values within that range, and the number of rows in the value range. [This data is stored in catalog tables such as SYSCOLDIST and SYSCOLDISTSTATS (the latter used for partition-level statistics), and in new tables like SYSKEYTGTDIST and SYSKEYTGTDISTSTATS, which are used for so-called extended indexes (new with DB2 for z/OS V9, and something I'll probably cover in a future post to this blog).] Suppose, for example, that a table has 100,000 rows, and that histogram data for a certain column is gathered for 10 quantiles. Each quantile will represent about 10,000 rows. Let's say that quantile 1 is bounded by values 'AAA' and 'CCC' (assume CHAR for the data type) and contains 100 distinct values of the column. Quantile 2 is bounded by values 'DDD' and 'HHH' and contains 10 distinct values, and so on. In that case, DB2 would know that in the first (based on column value range) 10,000 rows of the table there are 100 distinct values of the column between 'AAA' and 'CCC' (an average of 100 duplicates per value), while in the second set of 10,000 rows there are 10 distinct values between 'DDD' and 'HHH' (an average of 1000 duplicates per value).

A big appeal of histogram statistics is that they are based on ALL values of a column (or a column group), whereas frequency statistics are based on a subset of values (usually a relatively small subset - this by practical necessity). This translates into richer catalog statistics, and with richer (i.e., more complete in terms of descriptiveness) stats to work with, DB2 is better able to optimize access path selection for queries.

So, what would I do in a DB2 for z/OS V9 environment? I'd probably start collecting histogram stats via RUNSTATS. If you currently code TABLE(ALL) INDEX(ALL) on your RUNSTATS TABLESPACE control card, you could just add HISTOGRAM to that. This would cause RUNSTATS to gather histogram statistics on the first column of the key of each index on every table in the tablespace, with DB2 determining the optimal number of quantiles on a table-by-table basis. Later, you could extend the gathering of histogram stats to other columns (or groups of columns) as you work to tune the performance of individual queries. Generating histogram statistics is likely to cause RUNSTATS to consume somewhat more CPU time than would be the case if you were requesting "top 10 values" frequency distribution stats, but that cost could be more than offset by the savings resulting from improvements in query access path selection.

The part of DB2 code that performs access path selection is called the optimizer, and it bases its decisions largely on the database statistics that are generated through the RUNSTATS utility and stored in the DB2 catalog. In the early years, the optimizer was handicapped in that it did not have information regarding the distribution of values within a column of a table, and so had to assume an even distribution (e.g., if a table had 1 million rows and an associated column had 1000 values, DB2 would assume 1000 duplicates of each of the 1000 distinct values). This assumption could lead to sub-optimal access path selection decisions because in many cases a non-unique column has a very non-uniform distribution of data values (consider the previous example: 900,000 of the 1 million rows could contain one value of the column, with the other 999 distinct values spread across the remaining 100,000 rows). This "blind spot" regarding column data value distribution could lead to the calculation of inaccurate "filter factors" for query predicates, and that could lead to the selection of something other than the actual least-cost access path for a query.

As I recall, it was in the mid-1990s that DB2 and RUNSTATS (for the mainframe and for the Linux, UNIX, and Windows - aka LUW - platforms) were enhanced to provide the optimizer with information about the distribution of data values within a column. For both mainframe DB2 and DB2 for LUW, one could have RUNSTATS generate statistics pertaining to the percentage of rows in a column containing each of the "n" most frequently occurring of the column's values. Subsequent enhancements for the mainframe DB2 platform enabled one to also generate, through RUNSTATS, values for the percentage of a table's rows containing each of the "n" least-most frequently occurring values in a column, or stats for both the "n" most- and "n" least-frequently occurring of a column's values.

These frequency statistics were a great help to the optimizer, enabling DB2 to select better access paths for many queries referencing columns with non-uniform value distributions. [A number of mainframe DB2-using organizations are realizing the benefit of frequency-distribution statistics without even knowing it. When the DB2 for z/OS RUNSTATS utility is executed with the INDEX keyword on the control card, frequency-distribution statistics will automatically be generated for the top 10 most frequently-occurring values of the first column of the key of the targeted index. It's not uncommon for people to code a RUNSTATS TABLESPACE control card that specifies TABLE(ALL) INDEX(ALL), and in that case the utility will generate frequency distribution stats for the top 10 most frequently-occurring values of the first column of the key of every index on every table in the target tablespace.]

Still, there was room for improvement. Having statistics on the frequency of occurrence of the top 10, or 15, or 20 values in a column is great, but if a column has 100,000 values that can still lead, in some cases, to less-than-desired accuracy with respect to predicate filter-factor calculations and, by extension, sub-optimal access path selection. Collecting frequency-distribution statistics for the top, say, 10,000 of those 100,000 distinct values would effectively address that problem, but at a too-high cost in terms of RUNSTATS CPU consumption. A better solution was delivered in DB2 for z/OS V9 via the new HISTOGRAM option of RUNSTATS (DB2 for LUW users got histogram stats, generated via the WITH DISTRIBUTION option of the RUNSTATS command, some years ago).

When the RUNSTATS utility is executed with the HISTOGRAM option, DB2 will collect value frequency distribution statistics for a column (or group of columns) based on a certain number of quantiles (either user-specified or determined by DB2). Each quantile is associated with a range of column values, and each range includes about the same number of table rows. For a given quantile, DB2 knows the upper and lower values that bound the quantile, the number of distinct values within that range, and the number of rows in the value range. [This data is stored in catalog tables such as SYSCOLDIST and SYSCOLDISTSTATS (the latter used for partition-level statistics), and in new tables like SYSKEYTGTDIST and SYSKEYTGTDISTSTATS, which are used for so-called extended indexes (new with DB2 for z/OS V9, and something I'll probably cover in a future post to this blog).] Suppose, for example, that a table has 100,000 rows, and that histogram data for a certain column is gathered for 10 quantiles. Each quantile will represent about 10,000 rows. Let's say that quantile 1 is bounded by values 'AAA' and 'CCC' (assume CHAR for the data type) and contains 100 distinct values of the column. Quantile 2 is bounded by values 'DDD' and 'HHH' and contains 10 distinct values, and so on. In that case, DB2 would know that in the first (based on column value range) 10,000 rows of the table there are 100 distinct values of the column between 'AAA' and 'CCC' (an average of 100 duplicates per value), while in the second set of 10,000 rows there are 10 distinct values between 'DDD' and 'HHH' (an average of 1000 duplicates per value).

A big appeal of histogram statistics is that they are based on ALL values of a column (or a column group), whereas frequency statistics are based on a subset of values (usually a relatively small subset - this by practical necessity). This translates into richer catalog statistics, and with richer (i.e., more complete in terms of descriptiveness) stats to work with, DB2 is better able to optimize access path selection for queries.

So, what would I do in a DB2 for z/OS V9 environment? I'd probably start collecting histogram stats via RUNSTATS. If you currently code TABLE(ALL) INDEX(ALL) on your RUNSTATS TABLESPACE control card, you could just add HISTOGRAM to that. This would cause RUNSTATS to gather histogram statistics on the first column of the key of each index on every table in the tablespace, with DB2 determining the optimal number of quantiles on a table-by-table basis. Later, you could extend the gathering of histogram stats to other columns (or groups of columns) as you work to tune the performance of individual queries. Generating histogram statistics is likely to cause RUNSTATS to consume somewhat more CPU time than would be the case if you were requesting "top 10 values" frequency distribution stats, but that cost could be more than offset by the savings resulting from improvements in query access path selection.

## 2 Comments:

HI Robert,

In your 2010 blog you explain to use KEYCARD to get the top 10 frequency values of a column

TABLE(ALL) INDEX(ALL) KEYCARD

Note that this will get you, by default, "top 10" FREQVAL stats for the first key column of each index on every table in the tablespace.

but in this 2009 blog you omit it saying it is the INDEX keyword that does it:

When the DB2 for z/OS RUNSTATS utility is executed with the INDEX keyword on the control card, frequency-distribution statistics will automatically be generated for the top 10 most frequently-occurring values of the first column of the key of the targeted index.

First, starting with DB2 10 for z/OS, KEYCARD will be in effect whether you specify it or not in a RUNSTATS utility control statement (or for inline statistics processing as part of, for example, a REORG job). In other words, the KEYCARD "switch" is now always "on."

Second, your statement is incorrect. KEYCARD and FREQVAL accomplish two different things. KEYCARD will cause the utility to gather statistics on the cardinality of intermediate combinations of a multi-column key's columns, as well determining the cardinality of the full key (so, if the index is on columns A, B, and C, KEYCARD will result in DB2 determining the cardinality of column A, of the combination of columns A and B, and of the full key (A, B, and C). FREQVAL, by contrast, will cause DB2 to determine (by default) the top 10 most frequently-occurring values of (by default) the first column of an index's key, along with the percentage of the table's rows that contain each of these top 10 values of the column.

In short, KEYCARD is about column (and column-combination) cardinality. FREQVAL is about the frequency of occurrence of the most commonly occurring (by default) values of a column (or combination of columns). Cardinality and frequency of value occurrence are not the same thing.

Robert

## Post a Comment

Subscribe to Post Comments [Atom]

<< Home