DB2 Partition-By-Growth Tables: Very Nice, But Not For Everything
Not long ago, I was doing some work at a site that had DB2 for z/OS V8 running in production (as is true, I think, of most mainframe DB2-using enterprises at present - with a tip of the hat to those organizations that have installed or migrated to V9). A DBA came by with an idea for table partitioning that he wanted to bounce off me. What he wanted to achieve with this scheme was to minimize long-term administrative hassle for a large table. His proposal: define the table with an identity column (with the default start-at-1 and increment-by-1 options in effect), and specify that identity column as the partitioning key (this would utilize the table-controlled - as opposed to index-controlled - partitioning capability introduced with DB2 V8). He would decide ahead of time how many rows he wanted in a partition and specify partition limit key values accordingly. He'd set up an initial number of partitions for the table (enough for several weeks' worth of inserts, or maybe a few months), and he would periodically add more partitions (with progressively greater limit key values), via ALTER TABLE ADD PARTITION, to stay well ahead of the advancing insert activity. He even talked of automating this periodic partition-add process. This proposed set-up would allow the table to become very large (he could rest easy, knowing that the table could have up to 4096 partitions, and that the identity column/partitioning key value could go up to one less than 10 to the 31st power), with populated partitions of about equal size (good), and with the benefit of partition-level utility operations - all in all, about as close to a set-it-and-forget-it approach as you could have for a big table in a DB2 V8 environment.
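To make that concrete, here is a rough DDL sketch of the scheme (object names, column definitions, and limit key values are hypothetical - the point is the identity column serving as the table-controlled partitioning key, plus the periodic ALTER TABLE ADD PARTITION):

    -- Hypothetical sketch: partitioned table space, and a table partitioned
    -- on an ascending identity column (DB2 V8 table-controlled partitioning)
    CREATE TABLESPACE BIGTS IN BIGDB
      NUMPARTS 3
      LOCKSIZE ANY;

    CREATE TABLE BIGTAB
      (ROW_ID    DECIMAL(31,0) NOT NULL
                   GENERATED ALWAYS AS IDENTITY
                   (START WITH 1, INCREMENT BY 1),
       ORDER_NO  CHAR(10)      NOT NULL,
       ORDER_AMT DECIMAL(11,2) NOT NULL)
      IN BIGDB.BIGTS
      PARTITION BY (ROW_ID)
       (PARTITION 1 ENDING AT (10000000),
        PARTITION 2 ENDING AT (20000000),
        PARTITION 3 ENDING AT (30000000));

    -- Periodically (perhaps via an automated job), stay ahead of the advancing
    -- insert activity by adding partitions with progressively higher limit keys:
    ALTER TABLE BIGTAB
      ADD PARTITION ENDING AT (40000000);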
My initial response to the idea was, "Too bad you guys aren't yet on DB2 Version 9. Partition-by-growth would take care of all that for you." The DBA's eyes widened a little. "Partition by what?" he asked. I went on to explain that he was not alone in wanting a really easy way (administratively speaking) to handle really big tables, and that IBM had responded to that recognized need with a new type of table organization that reduces DBA oversight requirements even more than would his partition-by-identity plan: you just tell DB2 how large you want a table's partitions to be (via the DSSIZE specification for the associated tablespace), and that's that. There's one partition to begin with, and when that one gets full DB2 defines another one and starts using it, and this continues as needed to accommodate INSERT and/or LOAD activity (the MAXPARTITIONS value for the tablespace sets the limit on the number of partitions that can be defined for the table). No muss, no fuss - you get an object that's as easy to manage as a segmented tablespace (with the favorable space management characteristics of segmented, since a partition-by-growth table uses the new universal tablespace technology delivered with DB2 V9), with the advantage of very large size (a segmented tablespace can't grow beyond 64 GB) and partition-level utility operations (except for LOAD, which has to run at the tablespace level for a partition-by-growth table).
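In DDL terms, a PBG set-up can be about as simple as the following sketch (names and values are hypothetical; it's the MAXPARTITIONS specification on CREATE TABLESPACE that makes the table space partition-by-growth):

    -- Hypothetical sketch of a DB2 9 partition-by-growth table space
    CREATE TABLESPACE PBGTS IN BIGDB
      MAXPARTITIONS 256   -- upper limit on the partitions DB2 can define
      DSSIZE 4G           -- max size of each partition; when one fills, DB2 adds another
      SEGSIZE 32
      LOCKSIZE ANY;

    -- The table itself carries no partitioning specification at all
    CREATE TABLE ORDER_HIST
      (ORDER_NO  CHAR(10)      NOT NULL,
       ORDER_AMT DECIMAL(11,2) NOT NULL)
      IN BIGDB.PBGTS;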
So, plenty of DB2 for z/OS DBAs are keen (or will be) to use partition-by-growth (PBG) tables. I don't want to rain on that parade, but you should consider that PBG is not the right solution for every large-table situation. In particular, in a data warehouse environment, or otherwise when a large table's rows will often be retrieved in large bunches (versus singly or in small sets, as is often true for online transaction processing systems), a goal of performance optimization will likely lead toward the use of partition-by-range (the other flavor of DB2 V9 universal tablespace) for large tables. Here's why:
- Partition-by-range (PBR) tables enable multi-dimensional clustering. Yes, I know - multidimensional clustering is, in the strictest sense of the term, a feature of DB2 for Linux, UNIX, and Windows (LUW); however, a similar effect can be achieved with PBR tables (and indeed, with DB2 for z/OS V8 table-controlled partitioning) by partitioning on one key (i.e., one data dimension) and clustering on another. So, for example, a table could be partitioned by date and clustered within partitions by customer ID. This type of arrangement can lead to both query parallelism (a user might want data for a given customer ID over the past 6 months, and if each partition holds data for a particular week, the target rows will be spread over 20-plus partitions) and locality of reference (within partitions, the rows for the specified customer ID will be located near each other). That can be a powerful performance combination, and it's not possible with PBG tables because for such tables there is no notion of a partitioning key.
- PBR tables allow for data rotation. Suppose you want to keep a year's worth of data in a table, and you want to partition that data by week, but you don't want more than 53 partitions (we tend to think of a year as being 52 weeks, but of course fifty-two 7-day periods doesn't quite get you there). With a PBR table (and again, with DB2 V8 table-controlled partitioning), you can use actual date values as partition key limits and keep a rolling 53 weeks of data by taking advantage of ALTER TABLE ROTATE PARTITION functionality. Again, can't do that with PBG.
- PBR tables enable page-range screening. When a query predicate references a PBR table's partitioning key column(s), DB2 can avoid accessing partitions that could not contain qualifying rows - a potentially huge performance benefit for SELECTs that target large tables (page-range screening, also known as limited partition scan, has been around for a while, but DB2 for z/OS V9 further leverages this optimization technique by extending its use to table join predicates and to non-matching page-range predicates). A DDL sketch that pulls these three capabilities together follows this list.
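Here's a hypothetical sketch (object names, partition limits, and dates are made up, and a real 53-week table would simply have more ENDING AT clauses):

    -- Range-partitioned (PBR) universal table space: NUMPARTS plus SEGSIZE
    CREATE TABLESPACE SALESTS IN SALESDB
      NUMPARTS 4
      SEGSIZE 32
      LOCKSIZE ANY;

    -- Partitioned on the date dimension...
    CREATE TABLE SALES
      (SALE_DATE DATE          NOT NULL,
       CUST_ID   INTEGER       NOT NULL,
       SALE_AMT  DECIMAL(11,2) NOT NULL)
      IN SALESDB.SALESTS
      PARTITION BY RANGE (SALE_DATE)
       (PARTITION 1 ENDING AT ('2009-01-07'),
        PARTITION 2 ENDING AT ('2009-01-14'),
        PARTITION 3 ENDING AT ('2009-01-21'),
        PARTITION 4 ENDING AT ('2009-01-28'));

    -- ...and clustered within partitions on the customer dimension
    CREATE INDEX SALES_IX1 ON SALES (CUST_ID) CLUSTER;

    -- Data rotation: empty the oldest partition and reuse it for the newest week
    ALTER TABLE SALES
      ROTATE PARTITION FIRST TO LAST
      ENDING AT ('2009-02-04') RESET;

    -- Page-range screening: the predicate on the partitioning column lets DB2
    -- skip partitions that cannot contain qualifying rows (here, partitions 1 and 4)
    SELECT CUST_ID, SUM(SALE_AMT)
      FROM SALES
      WHERE SALE_DATE BETWEEN '2009-01-08' AND '2009-01-21'
      GROUP BY CUST_ID;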
9 Comments:
PBG tablespaces have to be reorg'd at the tablespace level, because data rows get added to the last partition when there is no space in the correct partition. PBR tablespaces can be reorg'd by partition. MAJOR difference for large tables.
In fact, a PBG table space CAN be reorganized at the partition level; true, if you only reorganize a partition of a PBG table space, rows won't be moved to where they "should" be (per the table's clustering index), but that may or may not be a big deal -- it depends on the situation.
Something else to keep in mind: at many sites, the large majority of PBG table spaces will never occupy more than one partition. The minimum DSSIZE is 1G, and lots of organizations will have tables in PBG table spaces that will never get to 1G in size. Obviously, REORG'ing at the partition level is nothing special if a PBG table space only has one partition.
When a PBG table space does go into multiple partitions, there are advantages to doing REORGs at the table space level; however, this is not the only option.
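In utility-statement terms, the two options might look like this (database and table space names are hypothetical):

    -- Reorganize the whole PBG table space (rows are re-clustered across partitions)
    REORG TABLESPACE BIGDB.PBGTS SHRLEVEL CHANGE

    -- Or reorganize just one partition (rows stay within that partition)
    REORG TABLESPACE BIGDB.PBGTS PART 3 SHRLEVEL CHANGE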
Robert
Hi Robert, I am trying to define a PBR similar to what you describe for 52 weeks, but for a rolling 10 years' worth of data to be kept. My question: does the partitioning key have to be a DATE data type? The key that I am attempting to use is CHAR. There are no DATE data types in this table at all. What are the negatives of using a 'date' that is represented by a CHAR data type?
This will be my first PBR that I am defining.
Thanks!
Apologies for the very late response, Bob. I somehow overlooked the fact that you'd posted a comment on this post.
Anyway, the answer to your question is no - a DB2 for z/OS table partitioned on a time-period basis does NOT have to be partitioned on a column defined as DATE. If you have a CHAR column that contains values that you can use to effectively place data associated with different months into different partitions of a table, go ahead and partition the table on that column - you don't have to alter the table to add a DATE column for this purpose.
Regarding the use of DATE versus CHAR for a column that in fact holds date values, the main reason I'd prefer to use a DATE column in that case is that this lets DB2 know that the values are date values. That, in turn, lets me leverage DB2 features such as date/time arithmetic, and the wide range of date-related built-in scalar functions such as DAYOFWEEK, DAYS, and MONTHS_BETWEEN. DATE would also offer a bit of a disk space advantage over a character string representation of a date value, as values in a DATE column occupy only 4 bytes of space apiece.
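For instance, something along these lines would work (a purely hypothetical sketch - the table, columns, and limit values are made up, and the table space CLAIMDB.CLAIMTS is assumed to already exist with a matching number of partitions). The key point is that 'YYYY-MM-DD' character strings collate in chronological order, so range partitioning on them behaves the way you'd want:

    -- Hypothetical: rolling multi-year history, partitioned on a CHAR date column
    CREATE TABLE CLAIM_HIST
      (CLAIM_DT  CHAR(10)      NOT NULL,  -- holds dates as 'YYYY-MM-DD' strings
       CLAIM_ID  INTEGER       NOT NULL,
       CLAIM_AMT DECIMAL(11,2) NOT NULL)
      IN CLAIMDB.CLAIMTS
      PARTITION BY RANGE (CLAIM_DT)
       (PARTITION 1 ENDING AT ('2001-12-31'),
        PARTITION 2 ENDING AT ('2002-12-31'),
        PARTITION 3 ENDING AT ('2003-12-31'));  -- and so on, out to year 10

    -- Rotation works the same way with CHAR limit keys:
    ALTER TABLE CLAIM_HIST
      ROTATE PARTITION FIRST TO LAST
      ENDING AT ('2004-12-31') RESET;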
Robert
I've had trouble understanding the ramifications of converting to PBG with regard to indexes on a traditional segmented table space. I just converted a 57 GB, 28-data-set traditional table space to PBG using DSSIZE 8G. The PK index still has PIECESIZE 2G. Is the index going to run out of space after filling the 32nd data set?
Could you provide a bit of additional information? Specifically: table space page size and index page size?
Robert
4K for both.
Sorry about the delay in responding.
When you change the underlying table space to universal PBG, another related change is the number of pieces allowed for an index on the table in the table space: instead of being limited to 32 pieces of 2G each, the new max pieces limit for the index becomes MIN(4096, 2^32 / (x / y)), with x being the piece size and y being the index page size (both in bytes). Plugging in 2G for x and 4K for y, that formula (if my math is correct) indicates that the index can go to 4,096 pieces; so, no, it won't hit "full" at 64G.
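Spelling out the arithmetic with the values from your case:

    x (piece size)      = 2G = 2,147,483,648 bytes
    y (index page size) = 4K =         4,096 bytes
    x / y               = 524,288 pages per piece
    2^32 / 524,288      = 8,192
    MIN(4096, 8192)     = 4,096 pieces

At 2G per piece, that works out to a ceiling of 8 TB for the index - far beyond the 64 GB (32 x 2G) limit that applied before the conversion.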
Robert
Thank you VERY much for your time sir!