Robert's Blog


Thursday, August 5, 2010

Back at IBM

It's been longer than usual since my last post to this blog. I've been busier than usual, largely due to a career change. On August 2, I rejoined IBM (I previously worked for the Company from 1982 to 2000). I'm still doing technical DB2 work, and I'll continue to blog about DB2, but new entries will be posted to my new blog, called, simply, "Robert's DB2 blog." It can be viewed at http://www.robertsdb2blog.blogspot.com/.

Though I will no longer be posting to this blog, I'll continue to respond to comments left by readers of entries in this blog.

To those who've been regular readers, I thank you for your time. I hope that you'll visit my new blog.

Monday, July 12, 2010

EXPLAINing DB2 for z/OS: Don't Overlook MATCHCOLS

Do you use EXPLAIN output when you're analyzing the performance of a query in a DB2 for z/OS environment? I hope so. If you do, you might be one of those who likes to retrieve access plan information by querying the PLAN_TABLE into which DB2 inserts rows when a SQL statement is EXPLAINed (either dynamically via the EXPLAIN statement or, for embedded static SQL, as part of the program bind process when EXPLAIN(YES) is included in the BIND command); or, you might prefer to view EXPLAIN data in the form of an access plan graph generated by a tool such as IBM's Optimization Service Center for DB2 for z/OS (free), Data Studio (also free), or Optimization Expert (have to pay for that one, but you get additional functionality). In any case, there are probably a few things that you really focus in on. These might include tablespace scans (indicated by an "R" in the ACCESSTYPE column of PLAN_TABLE -- something that you generally don't want to see), index-only access (that would be a "Y" in the INDEXONLY column of PLAN_TABLE -- often desirable, but you can go overboard in your pursuit of it), and sort activity (a "Y" in any of the SORTC or SORTN columns of PLAN_TABLE -- you might try to get DB2 to use an index in lieu of doing a sort).
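
If you go the query route, a query along these lines pulls the columns just mentioned, plus MATCHCOLS, for an EXPLAINed statement. This is a minimal sketch -- the QUERYNO value of 100 and the use of a PLAN_TABLE under your own authorization ID are assumptions:

SELECT QUERYNO, QBLOCKNO, PLANNO, METHOD,
       ACCESSTYPE, ACCESSNAME, MATCHCOLS, INDEXONLY,
       SORTC_ORDERBY, SORTN_JOIN
FROM PLAN_TABLE
WHERE QUERYNO = 100   -- the QUERYNO assigned to the EXPLAINed statement
ORDER BY QUERYNO, QBLOCKNO, PLANNO
WITH UR;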

What about MATCHCOLS, the column of PLAN_TABLE that indicates the number of predicates that match on columns of the key of an index used to access data for a query (in a visual EXPLAIN access plan graph, this value would be referred to as "matching columns" for an index scan)? Do you scope that out? If so, do you just look for a non-zero value and move on? Here's a tip: pay attention to MATCHCOLS. Making that value larger is one of the very best things that you can do to reduce the run time and CPU cost of a SELECT statement. The performance difference between MATCHCOLS = 1 and MATCHCOLS = 2 or 3 can be huge. The reason? It's all about cardinality. You could have an index key based on three of a table's columns, and if the table has 10 million rows and the cardinality (i.e., the number of distinct values) of the first key column is 5, MATCHCOLS = 1 is an indication that you're going to get very little in the way of result set filtering at the index level (expect lots of GETPAGEs, a relatively high CPU cost, and extended run time). If the cardinality of the first two columns of the index key is 500,000, MATCHCOLS = 2 should mean much quicker execution, and if the full-key cardinality of the index is 5 million, a SELECT statement with MATCHCOLS = 3 should really fly. A larger MATCHCOLS value generally means more index-level filtering, and that is typically very good for performance.

Step one in boosting the value of MATCHCOLS for a query is to determine how high the value could be. Obviously, if a SELECT statement has only one predicate then any effort to make MATCHCOLS greater than 1 will be a waste of time. So, the upper bound on the value of MATCHCOLS in a given situation is the number of predicates present in a query (or, more precisely, in a subselect, if the statement contains multiple SELECTs -- think UNIONs, nested table expressions, subquery predicates, etc.); but, that's an oversimplification, because not all predicates can match on a column of an index key -- COL <> 2 is an example of a non-indexable predicate (my favorite source of information on predicate indexability is the "Summary of predicate processing" section of the DB2 9 for z/OS Performance Monitoring and Tuning Guide). Even that's not the whole story, however, as the position of columns in an index key -- and not just the number of indexable predicates in a query -- has an impact on the MATCHCOLS value.

This last point bears some explaining. Suppose you have a query with these predicates:

COL_X = 'ABC'
COL_Y = 5
COL_Z > 10

All three of the predicates are indexable. If the three columns referenced in the predicates are part of an index key, will the MATCHCOLS value for the query be 3? Maybe, maybe not. If the index key is COL_X | COL_Y | COL_Z, MATCHCOLS will indeed be 3 (assuming the index provides the low-cost path to the data retrieved by the query). If, on the other hand, the index key is COL_Z | COL_Y | COL_X, MATCHCOLS will be 1. Why? Because once a key column is matched for a range predicate, other columns that follow in the index key won't be matches for other predicates in the query (the basic rule, which I'll amend momentarily, is that index column matching, which proceeds from the highest- to the lowest-order column of an index key, stops following the first match for a predicate that is not of the "equals" variety). If the index is on COL_X | COL_Y | COL_H | COL_Z, MATCHCOLS will be 2, because index key column matching stops if you have to "skip over" a non-predicate-referenced column (in this case, COL_H). So, one of the things that you can do to get a higher versus a lower MATCHCOLS value for a query is to arrange columns in an index key so as to maximize matching (that's fine if you're defining a new index on a table, but be careful about replacing an index on a set of columns in one order with an index on the columns in a different order -- you could improve the performance of one query while causing others to run more slowly).
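
To make that concrete, here's a sketch of the two index choices (the table and index names are made up). Given the three predicates above, the first index should yield MATCHCOLS = 3, while the second stops matching after the range-predicate column and yields MATCHCOLS = 1:

-- "Equals" columns first, range column last: expect MATCHCOLS = 3
CREATE INDEX IX_XYZ ON MY_TABLE (COL_X, COL_Y, COL_Z);

-- Range column first: matching stops after COL_Z, so expect MATCHCOLS = 1
CREATE INDEX IX_ZYX ON MY_TABLE (COL_Z, COL_Y, COL_X);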

Now, about the aforementioned amendment to the basic rule that states that index key column predicate matching stops after the first match for something other than an "equals" predicate: there is at least one special case of a predicate that is not strictly "equals" in nature but which is treated that way from an index key column matching perspective. I'm talking about an "in-list" predicate, such as COL_D IN ('A01', 'B02', 'C03'). Thus, if a query contained that predicate and two others, COL_E = 'Y' AND COL_F > 8 (assuming that the three columns belong to one table), and if the target table had an index defined on COL_D | COL_E | COL_F, the expected MATCHCOLS value for the query would be 3.

Something else about indexes and MATCHCOLS: I mentioned earlier that cardinality is a big part of the story, but for cardinality to mean anything to DB2, DB2 has to know about it. DB2 gets index key cardinality information from the catalog, so it pays to ensure that catalog statistics are accurate through regular execution of the RUNSTATS utility; but, there's more: to provide DB2 with really complete information on index key cardinality, specify the KEYCARD option on the RUNSTATS utility control statement. When the utility is run sans KEYCARD, cardinality statistics will be generated for the first column of an index key and for the full index key, but not for anything (if anything) in-between. In other words, with an index on COL_A | COL_B | COL_C | COL_D, an execution of RUNSTATS for the index without the KEYCARD option will cause the catalog to be updated with cardinality statistics for COL_A (the first of the key columns) and for the entire four-column key, but not for COL_A | COL_B and not for COL_A | COL_B | COL_C. With KEYCARD specified, the utility will get cardinality statistics for these intermediate partial-key combinations.
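
Here's a sketch of a RUNSTATS control statement with the KEYCARD option (the database, tablespace, and index names are made up). This would get the catalog updated with cardinality figures for the intermediate key-column combinations as well as for the first key column and the full key:

RUNSTATS TABLESPACE MYDB.MYTS TABLE(ALL)
  INDEX(MYID.IX_ABCD KEYCARD)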

A final note on this topic: it can make sense to include a predicate-referenced column in an index key, even if the predicate in question is non-indexable (meaning that the presence of the column in the key cannot help to make the MATCHCOLS number larger). This is so because the column could still be used by DB2 for something called index screening. So, if you have a non-indexable predicate such as one of the form COL_J <> 'M', it might be a good idea to include COL_J in an index key after some predicate-matching columns, especially if COL_J is relatively short and the underlying table is large. DB2 can use the COL_J information in the index to filter results, just not as efficiently as it could if COL_J were referenced by an indexable predicate.
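
A sketch of the idea, with hypothetical names (COL_A and COL_B stand in for columns referenced by matching "equals" predicates): COL_J is carried at the end of the key so that DB2 can apply the <> predicate as a screening predicate at the index level, before touching the table:

CREATE INDEX IX_SCREEN ON MY_TABLE (COL_A, COL_B, COL_J);

-- Expect matching on COL_A and COL_B (MATCHCOLS = 2), with COL_J <> 'M'
-- applied via index screening
SELECT COL_C
FROM MY_TABLE
WHERE COL_A = 'X'
  AND COL_B = 7
  AND COL_J <> 'M';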

Bottom line: MATCHCOLS matters. Tune on!

Tuesday, June 29, 2010

Using DB2 for z/OS Real-Time Statistics for Smarter Database Management

Unless you are somehow beyond the reach of advertising, you're probably familiar with IBM's "smarter planet" campaign. It's all about leveraging analytics -- the purposeful analysis of timely, relevant information -- to improve decision-making outcomes. If you administer a mainframe DB2 database, you can work smarter by taking advantage of a DB2 feature that, while familiar to many DBAs in an "I've heard of that" sense, is under-exploited to a surprising degree. I'm talking about real-time statistics (aka RTS). Understand real-time stats -- what they are, where you find them, and how you can use them -- and you're on your way to enhancing the efficiency and effectiveness of your DB2 database administration efforts.

I'm sure that most everyone who reads this blog entry is familiar with the database statistics found in the DB2 catalog. These statistics are updated by way of the RUNSTATS utility, and they can be useful for things like identifying tablespaces and indexes in need of reorganization. Still, from a "work smarter" perspective, they are less than ideal. For one thing, they are only updated when you run the RUNSTATS utility (or when you gather and update statistics as part of a REORG or a LOAD utility operation -- more on that in a moment). How often do you do that? Maybe not too frequently, if you have a whole lot of tablespaces in your database. Suppose you run RUNSTATS, on average, once a month for a given tablespace. Could that tablespace end up getting pretty disorganized in the middle of one of those one-month periods between RUNSTATS jobs? Yes, and in that case you wouldn't be aware of the disorganization situation for a couple of weeks after the fact -- not so good.

As for updating catalog stats via REORG and/or LOAD, that's all well and good, but consider this: when you do that, the stats gathered will reflect perfectly organized objects (assuming, for LOAD, that the rows in the input file are in clustering-key sequence). They won't show you how the organization of a tablespace and its indexes may be deteriorating over time.

Then there's the matter of dynamic cache invalidation. ANY time you run the RUNSTATS utility -- no matter what options are specified -- you invalidate SQL statements in the dynamic statement cache. For a while thereafter, you can expect some extra CPU consumption as the statement cache gets repopulated through the full-prepare of dynamic queries that otherwise might have resulted in cache hits.

So, there's goodness in getting frequently updated catalog statistics to help you determine when objects need to be reorganized, but running RUNSTATS frequently will cost you CPU time, both directly (the cost of RUNSTATS execution) and indirectly (the CPU cost of repopulating the dynamic statement cache following a RUNSTATS job). You could avoid these CPU costs by not using catalog stats to guide your REORG actions, relying instead on a time-based strategy (e.g., REORG every tablespace and associated indexes at least once every four weeks), but that might lead to REORG operations that are needlessly frequent for some tablespaces that remain well-organized for long stretches of time, and too-infrequent REORGs for objects that relatively quickly lose clusteredness. And I haven't even talked about tablespace backups. Getting a full image copy of every tablespace at least once a week, with daily incremental copies in-between, is a solid approach to recovery preparedness, but what if you're running daily incremental image copy jobs for objects that haven't changed since the last copy? How could you get smarter about that? And what about RUNSTATS itself? How can you make better decisions about when catalog statistics actually need to be refreshed?

Enter real-time statistics. This is the name of an item of functionality that was introduced with DB2 for OS/390 Version 7. That was almost 10 years ago, and while the feature has been effectively leveraged by some DBAs for years, it remains on the edge of many other DBAs' radar screens, largely for two reasons:
  1. You (used to) have to create the real-time statistics objects yourself. I'm referring to the real-time statistics database (DSNRTSDB), the real-time stats tablespace (DSNRTSTS), two tables (SYSIBM.TABLESPACESTATS and SYSIBM.INDEXSPACESTATS), and a unique index on each of the tables. Instructions for creating these objects were provided in the DB2 Administration Guide, but some folks just didn't have the time or the inclination to bother with this. Happily, with DB2 9 for z/OS the real-time statistics objects became part of the DB2 catalog -- they are there for you like all the other catalog tables (if your DB2 subsystem is at the Version 8 level and the real-time statistics objects have already been created, when you migrate to DB2 9 any records in the user-created RTS tables will be automatically copied to the RTS tables in the catalog).
  2. People had this idea that real-time statistics drive up CPU overhead in a DB2 environment. They really don't. You see, DB2 is always updating the real-time statistics counters anyway, whether or not you make any use of them. What we know as real-time statistics involves the periodic externalization of these counters, and that's a pretty low-cost operation (the default RTS externalization interval is 30 minutes, and you can adjust that by way of the STATSINT parameter of ZPARM).
So, if you are already on DB2 9, take a few minutes and check out the data in the SYSIBM.SYSTABLESPACESTATS and SYSIBM.SYSINDEXSPACESTATS catalog tables (in a pre-9 DB2 environment, the names of the user-defined RTS tables are -- as previously mentioned -- SYSIBM.TABLESPACESTATS and SYSIBM.INDEXSPACESTATS). You'll see that the column names are pretty intuitive (Hmmm, wonder what you'll find in the EXTENTS column of SYSTABLESPACESTATS? Or how about TOTALENTRIES in SYSINDEXSPACESTATS?). The theme is "news you can use," and a primary aim is to help you get to a needs-based strategy with regard to the execution of utilities such as REORG, RUNSTATS, and COPY, versus running these using only time-based criteria. To this end, RTS provides valuable information such as the total number of rows added to a tablespace since it was last reorganized (REORGINSERTS), the number of rows inserted out of clustering sequence since the last REORG (REORGUNCLUSTINS), the number of updates since the last RUNSTATS execution for a tablespace (STATSUPDATES), the number of data-change operations since a tablespace was last image-copied (COPYCHANGES), and the number of index leaf pages that are far from where they should be due to page splits that have occurred since the last time the index was reorganized or rebuilt (REORGLEAFFAR). Note, too, that in addition to the utility-related numbers, RTS provides, in a DB2 9 system, a column, called LASTUSED (in SYSINDEXSPACESTATS), that can help you identify indexes that are just taking up space (i.e., that aren't being used to speed up queries or searched updates or deletes, or to enforce referential integrity constraints).
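
As an example of putting LASTUSED to work, here's a sketch of a query that lists indexes that haven't been used in roughly six months (the 180-day threshold is arbitrary, and any "unused index" finding should of course be verified before an index is dropped):

SELECT CREATOR, NAME, DBNAME, INDEXSPACE, LASTUSED
FROM SYSIBM.SYSINDEXSPACESTATS
WHERE LASTUSED < CURRENT DATE - 180 DAYS
ORDER BY LASTUSED
WITH UR;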

How will you leverage RTS? You have several options. You can process them using a DB2-supplied stored procedure (DSNACCOR for DB2 Version 8, and the enhanced DSNACCOX delivered with DB2 9). You might find that DB2 tools installed on your system -- from IBM and from other companies -- can take advantage of real-time statistics data (check with your tools vendors). DBAs who know a thing or two about the REXX programming language have found that they can write their own utility-automation routines thanks to RTS. And of course you can write queries that access the RTS tables and return actionable information. I encourage you to be creative here, but to get the juices flowing, here's an RTS query that I've used to find highly disorganized nonpartitioned tablespaces (this particular query was run in a DB2 Version 8 system -- it should work fine in a DB2 9 subsystem if you change TABLESPACESTATS to SYSTABLESPACESTATS):

SELECT A.NAME,
       A.DBNAME,
       CAST(REORGLASTTIME AS DATE) AS REORGDATE,
       CAST(FLOOR(TOTALROWS) AS INTEGER) AS TOTALROWS,
       REORGINSERTS,
       CAST((DEC(REORGUNCLUSTINS,11,2) / DEC(REORGINSERTS,11,2)) * 100
            AS INTEGER) AS PCT_UNCL_INS,
       REORGDELETES,
       B.PCTFREE,
       B.FREEPAGE
FROM SYSIBM.TABLESPACESTATS A, SYSIBM.SYSTABLEPART B
WHERE A.NAME = B.TSNAME
  AND A.DBNAME = B.DBNAME
  AND TOTALROWS > 10000
  AND REORGUNCLUSTINS > 1000
  AND (DEC(REORGUNCLUSTINS,11,2) / DEC(REORGINSERTS,11,2)) * 100 > 50
  AND A.PARTITION = 0
ORDER BY 6 DESC
WITH UR;

Real-time stats are going mainstream, folks. Be a part of that. Work smart.

Wednesday, June 9, 2010

Nuggets from DB2 by the Bay, Part 4

The last of my posts with items of information from the 2010 International DB2 Users Group North American Conference, held last month in Tampa (as in Tampa Bay), Florida.

Good DB2 9 for z/OS migration information from David Simpson. David, a senior DB2 instructor with Themis Training, described some things of which people migrating to DB2 9 should be aware. Among these are the following:
  • The pureXML functionality delivered in DB2 9 is quite comprehensive and opens up a lot of possibilities. One of David's colleagues at Themis figured out how to create views that make data stored in XML columns of DB2 tables look like standard relational data.
  • Do you have a handle on your simple tablespace situation? David reminded session attendees that simple tablespaces cannot be created in a DB2 9 environment. This being the case, it would be a good idea to move data from the simple tablespaces that you have to other tablespace types (segmented tablespaces, most likely). Sure, you can still read from, and update, a simple tablespace in a DB2 9 system, but the inability to create such a tablespace could leave you in a tough spot if you were to try to recover a simple tablespace that had been accidentally dropped (David suggested that people create a few empty simple tablespaces before migrating to DB2 9, so you'll have some available just in case you need a new one). You might think that you don't have any simple tablespaces in your DB2 Version 8 system, but you could be wrong there -- David pointed out that simple tablespaces are the default up through DB2 Version 8 (so, an implicitly-created tablespace in a pre-DB2 9 environment will be a simple tablespace).
  • New RRF developments. That's RRF as in reordered row format, a change (introduced with DB2 9) in the way that columns are physically ordered in a table. In the old set-up (now referred to as BRF, or basic row format), varying-length columns in a table (such as those with a VARCHAR data type) are physically stored, relative to other columns, in the order in which they appear in the CREATE TABLE statement. With RRF in effect, varying-length columns are grouped at the end of a table's rows, and that group of varying-length columns is preceded by a set of offset indicators -- one for each varying-length column -- that enable DB2 to very efficiently go right to the start of a given varying-length column. David told attendees that RRF does NOT affect what programs "see", as the logical order of a table's columns does not change with a change to RRF. RRF is a good thing with respect to varying-length-data access performance, but it may cause some issues when tablespaces are compressed (RRF rows sometimes don't compress quite as much as equivalent BRF rows), when data changes related to tables in compressed tablespaces are propagated via "log-scraping" replication tools (you just need to make sure that your replication tool can deal with the new compression dictionary that is created when a tablespace goes from BRF to RRF), and when tablespaces are operated on by the DB2 DSN1COPY utility (this utility doesn't use the SQL interface, so it is sensitive to changes in the physical order of columns even when this has no effect on the columns' logical order in a table).

    Early on with DB2 9, the change from BRF to RRF was automatic with the first REORG in a DB2 9 environment of a tablespace created in a pre-9 DB2 system. Various DB2 users asked for more control over the row-format change, and IBM responded with APARs like PK85881 and PK87348. You definitely want to get to RRF at some point. With the fixes provided by these APARs, you can decide if you want BRF-to-RRF conversion to occur automatically with some utility operations (REORG and LOAD REPLACE), or if you want to explicitly request format conversion on a tablespace-by-tablespace basis. You can also determine whether or not you want tablespaces created in a DB2 9 environment to have BRF or RRF rows initially.

  • Time to move on from Visual Explain. David mentioned that VE is not supported in the DB2 9 environment -- it doesn't work with new DB2 9 data types (such as XML), and it can produce "indeterminate results" if a DB2 9 access plan is not possible in a DB2 Version 8 system. If you want a visual depiction of the access plan for a query accessing a DB2 9 database, you can use the free and downloadable IBM Optimization Service Center for DB2 for z/OS, or IBM Data Studio.
  • Bye-bye, BUILD2. David explained that partition-level online REORG in a DB2 9 system does not have a BUILD2 phase (in prior releases of DB2, this is the REORG utility phase during which row IDs in non-partitioned indexes are updated to reflect the new position of rows in a reorganized table partition). That's good, because data in a partition is essentially unavailable during the BUILD2 phase, and BUILD2 can run for quite some time if the partition holds a large number of rows. There's a catch, though: BUILD2 is eliminated because DB2 9 partition-level online REORG reorganizes non-partitioned indexes in their entirety, using shadow data sets. That means more disk space and more CPU time for partition-level REORG in a DB2 9 system. It also means that you can't run multiple online REORG jobs for different partitions of the same partitioned tablespace in parallel. You can get parallelism within one partition-level online REORG job if you're reorganizing a range of partitions (e.g., partitions 5 through 10). Note that in a DB2 10 environment, you can get this kind of intra-job parallelism for an online REORG even if the multiple partitions being reorganized are not contiguous (e.g., partitions 3, 7, 10, and 15).
DB2 for z/OS and application programming. Dave Churn, a database architect at DST Systems in Kansas City, delivered a session on application development in a DB2 context. Dave commented on a number of application-oriented DB2 features and functions, including these:
  • Fetching and inserting chunks of rows. DST has made some use of the multi-row FETCH and INSERT capabilities introduced with DB2 for z/OS Version 8. Dave said that performance benefits had been seen for programs that FETCH rows in blocks of 5-10 rows each, and for programs that INSERT rows in blocks of 20 rows each. The other side of that coin is increased programming complexity (Dave noted that with multi-row FETCH, you're "buffering in your program"). In DST's case, multi-row FETCH is not being used to a great extent, because the increased time required for programmers to write code to deal with multi-row FETCH (versus using traditional single-row FETCH functionality) is generally seen as outweighing the potential performance gain (and that gain will often not be very significant in an overall sense -- as Dave said, "How often is FETCH processing your primary performance pain point?").

    Use of multi-row INSERT, on the other hand, has been found to be more advantageous in the DST environment, particularly with respect to the Company's very high-volume, time-critical, and INSERT-heavy overnight batch workload. As with multi-row FETCH, there is an increase in programming complexity associated with the use of multi-row INSERT (among other things, to-be-inserted values have to be placed in host variable arrays declared by the inserting program), but the performance and throughput benefits often made the additional coding effort worthwhile. Interestingly, others in the audience indicated that they'd seen the same pattern in their shops: multi-row INSERT was found to be of greater use than multi-row FETCH. Dave mentioned that at DST, programs using multi-row INSERT were generally doing so with the NOT ATOMIC CONTINUE ON SQLEXCEPTION option, which causes DB2 to NOT undo successful inserts of rows in a block if an error is encountered in attempting to insert one or more rows in the same block. The programs use the GET DIAGNOSTICS statement to identify any rows in a block that were not successfully inserted. These rows are written to a file for later analysis and action. (A sketch of this multi-row INSERT and GET DIAGNOSTICS pattern appears after this list.)

  • The new BINARY data type in DB2 9 can be great for some client-server applications. When DB2 for z/OS is used for the storage of data that is inserted by, and retrieved by, programs running on Linux/UNIX/Windows application servers, the BINARY data type can be a good choice: if the data will not be accessed by programs running on the mainframe, why do character conversion? Use of the BINARY data type ensures that character conversion will not even be attempted when the data is sent to or read from the DB2 database.
  • DB2 9 FETCH WITH CONTINUE is useful for really big LOBs. In some cases, a LOB value might be larger than what a COBOL program can handle (which is about 128 MB). The FETCH WITH CONTINUE functionality introduced with DB2 9 enables a COBOL program to retrieve a very large LOB in parts.
  • MERGE musings. The MERGE statement, new with DB2 9 and sometimes referred to as "UPSERT", is very handy when a set of input records is to be either inserted into a DB2 table or used to update rows already in the table, depending on whether or not an input record matches an existing table row. Dave mentioned that the matching condition (specified in the MERGE statement) will ideally be based on a unique key, so as to limit the scope of the UPDATE that occurs when an input record matches an existing row (an INSERT is performed when there is no match). DST likes MERGE because it improves application efficiency (it reduces the number of program calls to DB2 versus the previously-required INSERT-ELSE-UPDATE construct) and programmer productivity (same reason -- fewer SQL statements to code). Dave said that DST has used MERGE with both the ATOMIC and NOT ATOMIC CONTINUE ON SQLEXCEPTION options (when the latter is used for MERGE with a multi-record input block, GET DIAGNOSTICS is used to determine what, if any, input records were not successfully processed -- just as is done for multi-row INSERT). A sketch of the MERGE pattern appears after this list.
  • SELECT FROM UPDATE/INSERT/DELETE/MERGE is great for efficiently obtaining DB2-generated or DB2-modified values. DST has used the SELECT FROM data-changing-statement syntax (introduced for INSERT with DB2 Version 8, and expanded to other data-changing statements with DB2 9) to obtain values generated by BEFORE triggers on DB2 tables (as an aside, Dave mentioned that DST has used triggers to, among other things, dynamically change a program's commit frequency). DST has also found it useful to execute SELECT FROM MERGE statements with the INCLUDE option (enabling return of values not stored in a target table) to determine whether rows in a MERGE input block were inserted or used to update the target table.
  • When will you use the new DB2 9 XML data type? You'll use it, Dave said, when "your clients want to exchange information with you in the form of XML documents." In other words, you're likely to use it when your company's clients make XML data exchange a requirement for doing business. DST is using DB2 9 pureXML now. You might want to get ready to use it, just in case you'll need to. Being prepared could make exploitation of the technology an easier process (and it is pretty amazing what DB2 can do with XML documents, in terms of indexability, schema validation, and search and retrieval using XQuery expressions embedded in SQL statements).
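
As promised above, here's a sketch of the multi-row INSERT pattern Dave described. The table, column, and host-variable names are made up; the :HVA... variables stand for host-variable arrays declared in the program:

-- Insert a block of rows from host-variable arrays; don't back out the
-- successful inserts if one row in the block fails
INSERT INTO ORDER_ITEM (ORDER_NO, ITEM_NO, QTY)
  VALUES (:HVA-ORDER-NO, :HVA-ITEM-NO, :HVA-QTY)
  FOR :NUM-ROWS ROWS
  NOT ATOMIC CONTINUE ON SQLEXCEPTION;

-- Determine how many error conditions were raised, then (in a loop over
-- the condition numbers) identify each row that was not inserted
GET DIAGNOSTICS :NUM-CONDITIONS = NUMBER;
GET DIAGNOSTICS CONDITION :COND-NO
    :ERR-SQLCODE = DB2_RETURNED_SQLCODE,
    :ERR-ROW-NUM = DB2_ROW_NUMBER;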
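
And here's a sketch of the MERGE pattern. Again, the table, column, and host-variable names are made up; a host-variable-array input block is assumed, and GET DIAGNOSTICS would be used afterward just as with multi-row INSERT:

MERGE INTO CUSTOMER AS C
  USING (VALUES (:HVA-CUST-ID, :HVA-BALANCE)
         FOR :NUM-ROWS ROWS) AS S (CUST_ID, BALANCE)
  ON C.CUST_ID = S.CUST_ID
  WHEN MATCHED THEN
    UPDATE SET BALANCE = S.BALANCE
  WHEN NOT MATCHED THEN
    INSERT (CUST_ID, BALANCE) VALUES (S.CUST_ID, S.BALANCE)
  NOT ATOMIC CONTINUE ON SQLEXCEPTION;
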
That's a wrap for this multi-part post. I hope that part 4 has provided you with some useful information, and I invite you to check out parts 1, 2, and 3, if you haven't already done so. The IDUG 2011 North American Conference will be held in Anaheim, California next May. I'm planning on being there, and I hope that many of you will be there as well. It's always a great source of DB2 "news you can use."

Tuesday, June 1, 2010

Nuggets from DB2 by the Bay, Part 3

Still more items of information from the 2010 International DB2 Users Group North American Conference held last month in Tampa, Florida.

A new chapter in the history of the DB2 optimizer. Terry Purcell, uber-optimizer-guy on IBM's DB2 for z/OS development team, delivered an excellent session on new query optimizer features in the DB2 10 environment. These goodies include:
  • A new option allows you to get dynamic statement cache matching for SQL statements that have different literal values but are otherwise identical. Prior to DB2 10, matching was possible only for statements that were identical on a byte-for-byte basis: such statements would either contain parameter markers or identical literal values. The CPU efficiency attained through statement matching with different literals won't be quite what you get when statements containing parameter markers are matched, but it should be pretty close.
  • How about PERSISTENT caching of prepared dynamic SQL statements? Dynamic statement caching is great, but when a prepared statement gets flushed out of the cache, it's gone, right? Not any more, folks. DB2 10 (in New Function Mode) will provide an access path repository in the catalog that will enable you to stabilize -- in a long-term way -- access paths for dynamic SQL statements (a "game-changer," Terry called it). When this option is in effect, DB2 will look first to match an incoming dynamic statement with a statement in the repository, then (if not found in the repository) in the dynamic statement cache. If neither of these matching attempts is successful, DB2 will dynamically prepare the statement. Want to change a path in the repository? You'll be able to do so by rebinding at the query level. By the way, the repository will also enable a more-robust implementation of DB2's access path hint functionality: it will be possible to put a hint into the repository, so you'll no longer have to provide a query number value in order to use a hint.
  • Mass rebind? No problem. A new REBIND option, APREUSE(YES), will instruct DB2 to generate a new control structure for a package (to take advantage of a service fix, for example) while retaining the existing access path, if possible. If the package's old access path can't be reused for some reason, a new one will be generated. And, speaking of new and different access paths, another DB2 10-delivered REBIND option, APCOMPARE(ERROR), can be used to tell DB2 to issue an error message if a rebind operation changes an access path (you can optionally have DB2 issue a warning instead of an error). Going forward, when you want to do a mass rebind of packages as part of a version-to-version DB2 migration, you may well want to do your rebinds with APREUSE(YES) and APCOMPARE(ERROR) (a sketch of such a rebind appears after this list).
  • More user-friendly access plan stability. Lots of people like the access plan stability capability that was delivered with DB2 9 for z/OS via the new PLANMGMT option of the REBIND command. Nice as that is, it could be a hassle trying to get information about a version of a package's access plan other than the one currently in use. DB2 10 will address that problem with a new catalog table, SYSPACKCOPY, that will provide metadata for previous and original copies of access plans.
  • Playing it safe when choosing access paths. DB2 has always gone for lowest cost when choosing an access path for a query. Sometimes, that can be a problem for a statement with one or more host variables in its predicates, as the path identified as lowest-cost might result in really bad performance for certain variable values. The DB2 10 optimizer, older and wiser, will consider risk (i.e., the chance of getting poor performance for certain statement variable values) as well as cost in determining the optimal path for a SQL statement.
  • Staying with the RID option. Some folks cringe when DB2 starts down the path of using a RID list in executing a query (perhaps for multi-index access), and then switches to a tablespace scan because a RID limit was reached. DB2 10 can overflow a big RID list to a workfile and keep on trucking. Will you need more workfile space as a result? Perhaps, but note that the spillover effect is mitigated by a new larger default size for the RID pool in a DB2 10 environment.
  • In-list advances. DB2 10 extends its predicate transitive closure capability (the ability to determine a relationship between A and C based on A-B and B-C relationships) to in-list predicates. DB2 10 can also use matching index access for multiple in-list predicates in a query (prior to DB2 10, if a query had several in-list predicates, only one of these could be used for a matching index scan). And one more thing: DB2 10 can take several OR-connected predicates that match on one index and convert them to a single in-list predicate to generate a result set (that's more efficient than using "index ORing" for the predicates, as is done in a pre-10 DB2 system).
  • Query parallelism enhancements. With DB2 10, you can get parallelism for multi-row FETCH operations (though not for an ambiguous cursor). DB2 10 also enables parallel query tasks to share workfiles. And, in DB2 10, something called "dynamic record range partitioning" can be used to cause data in table A to be split into partitions that "line up" with the partitions of table B, the result being improved parallel table-join processing. This does introduce a data sort, but the expectation is that the technique will be used when table A is on the small side, so the sort shouldn't be a big deal.
  • A RUNSTATS efficiency boost. It used to be that sampling provided some help in reducing the CPU cost of a RUNSTATS utility job. With DB2 10, sampling provides a LOT of help in this department, because the sampling percentage now applies to the percentage of pages examined (it used to refer to the percentage of data rows examined in gathering statistics for non-indexed columns -- you could sample 25% of the rows, but end up accessing ALL of the tablespace's pages). What's more, there's an optional specification that you can use to tell DB2 to figure out the sampling percentage to use for an execution of RUNSTATS.
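
As referenced above, here's a sketch of what such a rebind might look like, based on the options Terry described (the collection and package names are made up):

REBIND PACKAGE(MYCOLL.MYPKG) APREUSE(YES) APCOMPARE(ERROR)
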
Where to go for more DB2 10 information. IBM's Roger Miller has provided information about DB2 10 presentations available on IBM's Web site. They are in a folder accessible via this url: ftp://public.dhe.ibm.com/software/data/db2/zos/presentations/v10-new-function/

In this folder, you'll find these presentations that were delivered at the IDUG conference in Tampa:
  • Session A01: DBA improvements, by Roger Miller
  • Session B02: What's new from the optimizer, by Terry Purcell
  • Session A03: DB2 10 Performance Preview, by Akiko Hoshikawa
  • Session A06: DB2 and System z Synergy, by Chris Crone
  • Session A08: DB2 10 Availability Enhancements, by Haakon Roberts
You'll also find these presentations from the recent IBM Information on Demand European Conference:
  • Session 2908: DB2 10 for z/OS security features help satisfy your auditors, by Jim Pickel
  • Session 2894: What’s coming from the optimizer in DB2 10 for z/OS, by Terry Purcell
  • Session 3010: New pureXML Features in DB2 for z/OS: Breaking Relational Limits, by Guogen (Gene) Zhang
I'll be back here in a few days with still more from my notes taken at the IDUG Conference in Tampa. Ciao for now.

Monday, May 24, 2010

Nuggets from DB2 by the Bay, Part 2

More items of information from the 2010 International DB2 Users Group North American Conference, held earlier this month in Tampa (in fact by the bay -- the convention center is on the waterfront).

IBM's Roger Miller delivered a session on DB2 for z/OS Version 10 (in beta release for the past couple of months) with typical enthusiasm (opening line: "This is a BIG version"). Some of his points:

Much attention was paid to making life easier for DBAs. Among the labor-saving features of DB2 10:
  • Automated collection of catalog stats, so you don't have to mess with the RUNSTATS utility if you don't want to.
  • Worry-free scale-up, with DB2 thread-related virtual storage now above the 2 GB "bar" in the DB2 database services address space. The number of concurrently active threads can go WAY up in a DB2 10 environment.
  • Access path stability, a favorite of many DB2 9 users, is enhanced. That makes for worry-free rebinding of packages (and you'll want to rebind to get the aforementioned below-the-bar virtual storage constraint relief, and to get the benefit of optimizer improvements).
  • Reduced catalog contention will allow for more concurrency with regard to CREATE/ALTER/DROP activity.
  • The ability to build a tablespace compression dictionary on the fly removes what had been a REORG utility execution requirement.
Resiliency, efficiency, and growth:
  • DB2 9 for z/OS gave us some nice utility performance enhancements (referring particularly to reduced CPU consumption). DB2 10 delivers significant CPU efficiency gains for user application workloads (batch and OLTP).
  • More-granular DB2 authorities enable people to do their jobs while improving the safeguarding of data assets. An example is SECADM, which does not include data access privileges.
  • "Release 2" of DB2's pureXML support improves performance for applications that access XML data stored in a DB2 database.
  • An "ALTER-then-REORG" path makes it easier to convert existing tablespaces to the universal tablespace type introduced with DB2 9.
Counting down. Roger's 10 favorite DB2 10 features:
  • 10. (tie) Hash access to data records (do with one GETPAGE what formerly might have required five or so GETPAGEs), and index "include" columns (define a unique index, then include one or more additional columns to improve the performance of some queries while retaining the original uniqueness-enforcement characteristic of the index).
  • 9. Improved XML data management performance and usability.
  • 8. Improved SQL portability.
  • 7. Support for temporal (i.e., "versioned") data (something that previously had to be implemented by way of application code).
  • 6. The new, more-granular security roles.
  • 5. More online schema changes.
  • 4. Better catalog concurrency.
  • 3. 5X-10X more concurrent users due to the removal of memory constraints.
  • 2. CPU cost reductions for DB2-accessing application programs.
  • 1. Productivity improvements.
You get a lot of benefits in DB2 10 conversion mode:
  • More CPU-efficient SQL.
  • 64-bit addressing support for more of the EDM pool and for DB2 runtime structures (thread-related virtual storage usage). You'll need to rebind to get this benefit.
  • Improved efficiency of single-row retrieval operations outside of singleton SELECTs, thanks to OPEN/FETCH/CLOSE chaining.
  • Distributed thread reuse for high-performance database access threads (aka DBATs).
  • Improved elapsed times for insert operations, thanks to parallelized index updates (for tables on which multiple indexes have been defined).
  • Support for 1 MB pages.
  • Access path enhancements, including the ability to get index matching for multiple in-list predicates in a query.
  • More query parallelism (good for zIIP engine utilization).
  • More avoidance of view materialization (good for efficiency).
More stuff:
  • Dynamic statement cache hits for statements that are identical except for the values of literals (this requires the use of a new attribute setting).
  • CPU efficiency gains of up to 20% for native SQL procedures (you regenerate the runtime structure via drop and recreate).
  • Hash access to data rows.
  • Index include columns.
  • In-line LOBs (storing of smaller LOB values in base table rows). Roger called these smaller LOBs "SLOBs." LOBs stored in-line in a compressed tablespace will be compressed. In-line storage of LOBs will require a universal tablespace that's in reordered row format (RRF). Said Roger: "RRF is the future."
  • Universal tablespaces can be defined with the MEMBER CLUSTER attribute (good for certain high-intensity insert operations, especially in data sharing environments).
  • "ALTER-then-REORG" to get to a universal tablespace, to change page size, to change DSSIZE (size of a tablespace partition), and to change SEGSIZE. With respect to "ALTER-then-REORG," you'll have the ability to reverse an "oops" ALTER (if you haven't effected the physical change via REORG) with an ALTER TABLESPACE DROP PENDING CHANGES.
  • Online REORG for all catalog and directory tablespaces.
Scalability improvements:
  • Reduced latch contention.
  • A new option that lets data readers avoid having to wait on data inserters.
  • Much more utility concurrency.
  • 64-bit common storage to avoid ECSA constraints.
  • Package binds, data definition language processes, and dynamic SQL can run concurrently.
  • The skeleton package table in the directory will use LOBs. With CLOBs and BLOBs in the DB2 directory, the DSN1CHKR utility won't be needed because there won't be any more links to maintain.
  • SMF records produced by DB2 traces can be compressed: major space savings (maybe 4:1) with a low cost in terms of overhead.
  • "Monster" buffer pools can be used with less overhead.
  • You'll be able to dynamically add active log data sets.
  • You'll be able to grant DBADM authority to an ID for all databases, versus having to do this on a database-by-database basis.
More catalog and directory changes:
  • The catalog and directory will utilize partition-by-growth universal tablespaces (64 GB DSSIZE).
  • There will be more tablespaces (about 60 more).
  • Row-level-locking will be used.
  • The objects will be DB2-managed and SMS-controlled.
It really is a BIG version -- and there's still more to it (I've just provided what I captured in my note-taking during Roger's session). More nuggets to come in my part 3 post about DB2-by-the-bay. Stay tuned.

Monday, May 17, 2010

Nuggets from DB2 by the Bay, Part 1

I had to smile when I saw the thread that Ed Long of Pegasystems started the other day on the DB2-L discussion list. The subject line? "IDUG radio silence." Here it was, IDUG NA week, with the North American Conference of the International DB2 Users Group in full swing in Tampa, Florida, and the usual blogging and tweeting of conference attendees was strangely absent. What's with that? Well, I'll break the silence (thanks for the inspiration, Ed), and I'll start by offering my theory as to why the level of conference-related electronic communication was low: we were BUSY.

Busy. That's the word that comes first to mind when I think of this year's event, and I mean it in a good way. At last year's conference in Denver, the mood was kind of on the down side. Attendance was off, due largely to severe cutbacks in organizations' training and travel budgets - a widespread response to one bear of an economic downturn. Those of us who were able to make it to last May's get-together swapped stories with common themes: How tough is it on you? How many people has your company cut? How down is your business? A lot of us were in batten-down-the-hatches mode, and it was hard to get the ol' positive attitude going.

What a difference a year makes. The vibe at the Tampa Convention Center was a total turnaround from 2009. Attendance appeared to be up significantly, people were smiling, conversation was upbeat and animated, and there was this overall sense of folks being on the move: heading to this session or that one, flagging someone down to get a question answered, lining up future business, juggling conference activities with work-related priorities -- stuff that happens, I guess, at every conference, but it seemed to me that the energy level was up sharply versus last May. To the usual "How's it going" question asked of acquaintances not seen since last year, an oft-heard response was: "Busy!" To be sure, some folks (and I can relate) are crazy busy, trying to work in some eating and sleeping when the opportunity arises, but no one seemed to be complaining. It felt OK to be burning the candle at both ends after the long dry spell endured last year. Optimism is not in short supply, and I hope these positive trends will be sustained in the months and years to come.

In this and some other entries to come (not sure how many -- probably another couple or so) I'll share with you some nuggets of information from the conference that I hope you'll find to be interesting and useful. I'll start with the Tuesday morning keynote session.

The data tsunami: big challenges, but big opportunities, too. The keynote speaker was Martin Wildberger, IBM's VP of Data Management Software Development. He started out talking about the enormous growth in the amount of data that organizations have to manage -- this on top of an already-enormous base. He showed a video with comments by some of the leading technologists in his group, and one of those comments really stuck with me (words to this effect): "You might think that the blizzard of data coming into an organization would blind you, but in fact, the more data you have, the clearer you see." Sure, gaining insight from all that data doesn't just happen -- you need the right technology and processes to make it happen -- but the idea that an organization can use its voluminous data assets to see things that were heretofore hidden -- things that could drive more revenue or reduce costs -- is compelling. As DB2 people, we work at the foundational level of the information software "stack." There's lots of cool analytics stuff going on at higher levels of that stack, but the cool query and reporting and cubing and mining tools just sit there if the database is unavailable. And, data has to get to decision-makers fast. And, non-traditional data (images, documents, XML) has to be effectively managed right along with the traditional numbers and character strings. Much will be demanded of us, and that's good (it'll keep us busy).

Martin mentioned that IBM's overall information management investment priorities are aimed at helping organizations to:
  • Lower costs
  • Improve performance
  • Reuse skills
  • Reduce risk
  • Reduce time-to-value
  • Innovate
He talked up IBM's partnership with Intel, IBM's drive to make it easier for companies to switch to DB2 from other database management systems (especially Oracle), and the "game-changing" impact of DB2 pureScale technology, which takes high availability in the distributed systems world to a whole new level. Martin also highlighted the Smart Analytics Systems, including the 9600 series, a relatively new offering on the System z platform (this is a completely integrated hardware/software/services package for analytics and BI -- basically, the "appliance" approach -- that has been available previously only on the IBM Power and System x server lines). There was also good news on the cloud front: DB2 is getting a whole lot of use in Amazon's cloud.

DB2 10 for z/OS: a lot to like. John Campbell, an IBM Distinguished Engineer with the DB2 for z/OS development organization, took the stage for a while to provide some DB2 10 for z/OS highlights (this version of DB2 on the mainframe platform is now in Beta release):

  • CPU efficiency gains. For programs written in SQL procedure language, or SQLPL (used to develop "native" SQL procedures and -- new with DB2 10 -- SQL user-defined functions), CPU consumption could be reduced by up to 20% versus DB2 9. Programs with embedded SQL could see reduced in-DB2 CPU cost (CPU cost of SQL statement execution) of up to 10% versus DB2 9, just by being rebound in a DB2 10 system. High-volume, concurrent insert processes could see in-DB2 CPU cost reductions of up to 40% in a DB2 10 system versus DB2 9.
  • 64-bit addressing for DB2 runtime structures. John's "favorite DB2 10 feature." With DB2 thread storage going above the 2 GB virtual storage "bar" in a DB2 10 system (after a rebind in DB2 10 Conversion Mode), people will have options that they didn't before (greater use of the RELEASE(DEALLOCATE) bind option, for one thing). DB2 subsystem failures are rare, but when they do happen it's often because of a virtual storage constraint problem. DB2 10 squarely addresses that issue.
  • Temporal data. This refers to the ability to associate "business time" and "system time" values with data records. John pointed out that the concept isn't new. What's new is that the temporal data capabilities are in the DB2 engine, versus having to be implemented in application code.
  • Getting to universal. John pointed out that DB2 10 would provide an "ALTER-then-REORG" path to get from segmented and partitioned tablespaces to universal tablespaces.
  • Access plan stability. This is a capability in DB2 10 that can be used to "lock down" access paths for static AND dynamic SQL.
  • Enhanced dynamic statement caching. In a DB2 10 environment, a dynamic query with literals in the predicates can get a match in the prepared statement cache with a statement that is identical except for the literal values (getting a match previously required the literal values to match, too).
DB2 for LUW performance. John was followed on stage by Berni Schiefer of the DB2 for Linux/UNIX/Windows (LUW) development team. Berni shared some of the latest from the DB2 for LUW performance front:
  • Performance PER CORE is not an official TPC-C metric, but it matters, because the core is the licensing unit for LUW software. It's on a per-core basis that DB2 for LUW performance really shines versus Sun/Oracle and HP systems.
  • SAP benchmarks show better performance versus competing platforms, with FEWER cores.
  • TPC-C benchmark numbers show that DB2 on the IBM POWER7 platform excels in terms of both performance (total processing power) AND price/performance.
  • DB2 is number one in terms of Windows system performance, but the performance story is even better on the POWER platform.
  • Berni pointed out that DB2 is the ONLY DBMS that provides native support for the DECFLOAT data type (based on the IEEE 754r standard for decimal floating point numbers). The POWER platform provides a hardware boost for DECFLOAT operations.
  • DB2 for LUW does an excellent job of exploiting flash drives.
Back to Martin for a keynote wrap-up. Martin Wildberger came back on stage to deliver a few closing comments, returning to the topic of DB2 pureScale. pureScale is a distributed systems (AIX/POWER platform) implementation of the shared data architecture used for DB2 for z/OS data sharing on a parallel sysplex mainframe cluster. That's a technology that has delivered the availability and scalability goods for 15 years. So now DB2 for AIX delivers top-of-class scale-up AND scale-out capabilities.

Martin closed by drawing attention to the IBM/IDUG "Do You DB2?" contest. Write about your experience in using DB2, and you could win a big flat-screen TV. If you're based in North America, check it out (this initial contest does have that geographic restriction).

More nuggets from IDUG in Tampa to come in other posts. Gotta go now. I'm BUSY.