Robert's Blog


Monday, July 12, 2010

EXPLAINing DB2 for z/OS: Don't Overlook MATCHCOLS

Do you use EXPLAIN output when you're analyzing the performance of a query in a DB2 for z/OS environment? I hope so. If you do, you might be one of those who likes to retrieve access plan information by querying the PLAN_TABLE into which DB2 inserts rows when a SQL statement is EXPLAINed (either dynamically via the EXPLAIN statement or, for embedded static SQL, as part of the program bind process when EXPLAIN(YES) is included in the BIND command); or, you might prefer to view EXPLAIN data in the form of an access plan graph generated by a tool such as IBM's Optimization Service Center for DB2 for z/OS (free), Data Studio (also free), or Optimization Expert (have to pay for that one, but you get additional functionality). In any case, there are probably a few things that you really focus in on. These might include tablespace scans (indicated by an "R" in the ACCESSYPE column of PLAN_TABLE -- something that you generally don't want to see), index-only access (that would be a "Y" in the INDEXONLY column of PLAN_TABLE -- often desirable, but you can go overboard in your pursuit of it), and sort activity (a "Y" in any of the SORTC or SORTN columns of PLAN_TABLE -- you might try to get DB2 to use an index in lieu of doing a sort).

What about MATCHCOLS, the column of PLAN_TABLE that indicates the number of predicates that match on columns of the key of an index used to access data for a query (in a visual EXPLAIN access plan graph, this value would be referred to as "matching columns" for an index scan)? Do you scope that out? If so, do you just look for a non-zero value and move on? Here's a tip: pay attention to MATCHCOLS. Making that value larger is one of the very best things that you can do to reduce the run time and CPU cost of a SELECT statement. The performance difference between MATCHCOLS = 1 and MATCHCOLS = 2 or 3 can be huge. The reason? It's all about cardinality. You could have an index key based on three of a table's columns, and if the table has 10 million rows and the cardinality (i.e., the number of distinct values) of the first key column is 5, MATCHCOLS = 1 is an indication that you're going to get very little in the way of result set filtering at the index level (expect lots of GETPAGEs, a relatively high CPU cost, and extended run time). If the cardinality of the first two columns of the index key is 500,000, MATCHCOLS = 2 should mean much quicker execution, and if the full-key cardinality of the index is 5 million, a SELECT statement with MATCHCOLS = 3 should really fly. A larger MATCHCOLS value generally means more index-level filtering, and that is typically very good for performance.

Step one in boosting the value of MATCHCOLS for a query is to determine how high the value could be. Obviously, if a SELECT statement has only one predicate then any effort to make MATCHCOLS greater than 1 will be a waste of time. So, the upper bound on the value of MATCHCOLS in a given situation is the number of predicates present in a query (or, more precisely, in a subselect, if the statement contains multiple SELECTs -- think UNIONs, nested table expressions, subquery predicates, etc.); but, that's an oversimplification, because not all predicates can match on a column of an index key -- COL <> 2 is an example of a non-indexable predicate (my favorite source of information on predicate indexability is the "Summary of predicate processing" section of the DB2 9 for z/OS Performance Monitoring and Tuning Guide). Even that's not the whole story, however, as the position of columns in an index key -- and not just the number of indexable predicates in a query -- has an impact on the MATCHCOLS value.

This last point bears some explaining. Suppose you have a query with these predicates:

COL_X = 'ABC'
COL_Y = 5
COL_Z > 10

All three of the predicates are indexable. If the three columns referenced in the predicates are part of an index key, will the MATCHCOLS value for the query be 3? Maybe, maybe not. If the index key is COL_X | COL_Y | COL_Z, MATCHCOLS will indeed be 3 (assuming the index provides the low-cost path to the data retrieved by the query). If, on the other hand, the index key is COL_Z | COL_Y | COL_X, MATCHCOLS will be 1. Why? Because once a key column is matched for a range predicate, other columns that follow in the index key won't be matches for other predicates in the query (the basic rule, which I'll amend momentarily, is that index column matching, which proceeds from the highest- to the lowest-order column of an index key, stops following the first match for a predicate that is not of the "equals" variety). If the index is on COL_X | COL_Y | COL_H | COL_Z, MATCHCOLS will be 2, because index key column matching stops if you have to "skip over" a non-predicate-referenced column (in this case, COL_H). So, one of the things that you can do to get a higher versus a lower MATCHCOLS value for a query is to arrange columns in an index key so as to maximize matching (that's fine if you're defining a new index on a table, but be careful about replacing an index on a set of columns in one order with an index on the columns in a different order -- you could improve the performance of one query while causing others to run more slowly).

Now, about the aforementioned amendment to the basic rule that states that index key column predicate matching stops after the first match for something other than an "equals" predicate: there is at least one special case of a predicate that is not strictly "equals" in nature but which is treated that way from an index key column matching perspective. I'm talking about an "in-list" predicate, such as COL_D IN ('A01', 'B02', 'C03'). Thus, if a query contained that predicate and two others, COL_E = 'Y' AND COL_F > 8 (assuming that the three columns belong to one table), and if the target table had an index defined on COL_D | COL_E | COL_F, the expected MATCHCOLS value for the query would be 3.

Something else about indexes and MATCHCOLS: I mentioned earlier that cardinality is a big part of the story, but for cardinality to mean anything to DB2, DB2 has to know about it. DB2 gets index key cardinality information from the catalog, so it pays to ensure that catalog statistics are accurate through regular execution of the RUNSTATS utility; but, there's more: to provide DB2 with really complete information on index key cardinality, specify the KEYCARD option on the RUNSTATS utility control statement. When the utility is run sans KEYCARD, cardinality statistics will be generated for the first column of an index key and for the full index key, but not for anything (if anything) in-between. In other words, with an index on COL_A | COL_B | COL_C | COL_D, an execution of RUNSTATS for the index without the KEYCARD option will cause the catalog to be updated with cardinality statistics for COL_A (the first of the key columns) and for the entire four-column key, but not for COL_A | COL_B and not for COL_A | COL_B | COL_C. With KEYCARD specified, the utility will get cardinality statistics for these intermediate partial-key combinations.

A final note on this topic: it can make sense to include a predicate-referenced column in an index key, even if the predicate in question is non-indexable (meaning that the presence of the column in the key cannot help to make the MATCHCOLS number larger). This is so because the column could still be used by DB2 for something called index screening. So, if you have a non-indexable predicate such as one of the COL_J <> 'M' form, it might be a good idea to include COL_J in an index key after some predicate-matching columns, especially if COL_J is relatively short and the underlying table is large. DB2 can use the COL_J information in the index to filter results, just not as efficiently as it could if COL_J were referenced by an indexable predicate.

Bottom line: MATCHCOLS matters. Tune on!