Catterall Consulting: Thoughts on DB2 Data Clustering

Although this post is written specifically in the context of a DB2 for z/OS environment, most of what I have to say on the topic applies to DB2 for Linux, UNIX, and Windows, as well.

First, a quick review and level-set:

A clustering index informs DB2 of the physical order (according to the index key value) in which rows of the table should be arranged. The use of "should" here versus "must" is important. Suppose that a new row is to be inserted into a table, and the clustering index indicates that the row should be placed in page 123 so as to maintain the table's clustering sequence. If page 123 is X-locked by some other data-changing process that has not yet reached a commit point, DB2 will not wait for that X-lock to be released. Instead, it will attempt to place the new row in a page that is close to page 123 (within a few pages before or after page 123). The same is true if page 123 is full and can't hold the new row (index page splits can occur in a DB2 database, but there's no such thing as a DB2 data page split). This aspect of DB2's operation improves insert performance but can lead to tablespace disorganization. [Note: row-level locking is an alternative to the default page-level locking used by DB2 for z/OS, but be aware that row-level locking can increase CPU overhead in a data sharing group on a mainframe parallel sysplex.]
Every DB2 table with an index has a clustering index. The clustering index will be the one created with the CLUSTER keyword (CLUSTER can also be added via ALTER INDEX), or - in the absence of an index so designated - the first index defined on the table.
You can tell DB2 to ignore the clustering index when adding data to a table. For data added via the LOAD utility, you can do this through use of the RESUME option. For rows added via INSERT statements, clustering order can be ignored - starting with DB2 for z/OS V9.1 - via the APPEND option of CREATE and ALTER TABLE. [The MEMBER CLUSTER option of CREATE TABLESPACE also provides a means of ignoring clustering sequence for rows added via INSERT - the rows are placed in the tablespace based on available space.]

Now, a question that I find to be interesting: is clustering important? That is to say, if a tablespace becomes decidedly unclustered (i.e., if many rows are located far from where they should be with respect to clustering sequence), should you REORG the tablespace in order to get the data rows back into clustering sequence? Your response to this question might be, "Of course!" My response is, "Not necessarily." It depends a lot on the nature of data retrieval for the table in question. If read operations targeting the table generally provide a unique - and indexed - key, with just the one qualifying row being returned to the requester, who cares about the physical ordering of the data? This is particularly true if successive single-row read operations do not follow any kind of consistent pattern. You'll have a clustering index on the table (remember, if no index has the CLUSTER identifier, the first index created on the table is by default the clustering index), so the tablespace can get way disorganized (unless DELETE operations open up "holes" in pages and enable clustering to be maintained), but again, that's not necessarily cause for concern.

Think, too, about the relative importance of INSERT versus SELECT performance for the table (predicate-containing UPDATE and DELETE statements are lumped in with SELECTs with respect to this matter). Suppose that a table contains history information pertaining to users' visits to a Web site, and that this history information is used almost exclusively for customer care purposes. If very few site visitors call in with customer care issues, the number of table INSERT operations will far exceed the number of read requests targeting the table. That being the case, you'd probably want to optimize insert performance. This could mean defining only one index on the table (the more indexes defined on a table, the more CPU-costly INSERTs will be). If the lead column of that index key is some sort of customer-identifier value (i.e., not a continuously-ascending key such a sequence number or a timestamp), the table will soon become way unclustered. Let it be so, as long as data retrieval performance is acceptable. What matters in this scenario is CPU-efficient INSERT operations.

Of course, there are many situations in which maintenance of data clustering is very important. If you have an application program that retrieves (for example) six months of purchase activity for a given customer, performance for that program will be optimized if the purchase-activity rows for a given customer are physically located on the same table page (or in adjacent table pages), versus being spread all over the table. You'll want to have a clustering index with a key that begins with the customer number, and you'll want to maintain good clustering for the table. Clustering maintenance will require periodic REORGs to both restore the preferred ordering of rows and to reestablish freespace in data pages (adequate freespace helps to maintain good clustering between REORGs - think about a PCTFREE value for the tablespace that will allow for the insertion of at least a few new rows on any given page). Keep in mind that frequent updating of DB2 catalog statistics via RUNSTATs is important if you want to keep tabs on the clusteredness of tablespaces.

When you do run the REORG utility, you'll probably want to go with online REORG, as it really doesn't cost much more than an offline REORG (except for the disk space needed for shadow data sets), and if you need to cancel the REORG job before it completes, go ahead - there's no clean-up necessary, because the original tablespace and indexes are not changed by an online REORG. [Note that online REORG for DB2 for Linux/UNIX/Windows is different, in that it's a REORG-in-place operation versus the shadow-and-switch-data-sets technique used by a DB2 for z/OS online REORG.]

My bottom-line point is this: do not reflexively reorganize tablespaces that have become unclustered - to do so might waste some system resources. Instead, do a little bit of application and database object analysis, and see if you have some tablespaces that you can allow to become disorganized. The least expensive database maintenance task is one that you don't perform at all.

Previous Posts

Thursday, September 13, 2007

Thoughts on DB2 Data Clustering

0 Comments:

Post a Comment