Robert's Blog

Monday, September 8, 2008

DB2 for z/OS Data Warehousing: Query Performance

In my last installment, I riffed a little on DB2 for z/OS data warehousing. I did a little compare and contrast on the topic of performance management in OLTP versus business intelligence application environments, and concluded with some words about system-level performance optimization for DB2-based BI workloads. Herewith, a few thoughts on the subject of analyzing and tuning the performance of data warehouse queries.

Don't be intimidated by the sheer size of a BI-oriented SELECT statement. Complex queries of the type regularly found within a data warehouse workload can be downright scary-looking when initially examined (I was looking through one last week that featured a 27-table join operation). The biggest query can be broken down into bite-sized chunks. Keep in mind, too, that a "big" query, in terms of the number of tables referenced, is not necessarily a bad (i.e., poorly-written) query. Particularly in a star-schema database design, in which a large, central fact table is surrounded (logically) by a multitude of dimension tables (some of which might have their own logically-associated sub-tables, making a diagram of the table arrangement look like a snowflake), it is not uncommon to see a lot of tables joined in a query. Performance can still be fine, especially if the dimension tables are on the small side and only a small percentage of the fact table rows are accessed.

Get real comfortable with EXPLAIN. EXPLAIN output (the explanation of the access path chosen by DB2 for execution of a query) can be rather voluminous for a really big query, but, as just mentioned, you can get through it if you'll just be methodical and take it step-by-step. At each step along the way, ask yourself: does DB2's decision (regarding, for example, selection of an index, or use of a join method, or the order in which tables are joined) make sense to you? If not, figure out why you and DB2 disagree.

What do you know that DB2 doesn't? Suppose a query is running too slowly, and you determine that the cause is the sub-optimal access path chosen by the DB2 optimizer. If the choice of a different available path (perhaps using a certain index) seems obvious to you, and you're wondering why it isn't obvious to the optimizer, consider that you may have knowledge of the data that the optimizer doesn't have. The optimizer knows what's shown by the statistics in the DB2 catalog. How up-to-date and accurate are those statistics? This isn't just a matter of running RUNSTATS more frequently (though that's a good thing to do) -- it's also about how you run RUNSTATS. For example, have you set up RUNSTATS to gather correlation statistics for a set of columns in a table via the COLGROUP keyword (with, perhaps, some distribution statistics thrown in through a specification of FREQVAL)? Want some help in figuring out the right way to run RUNSTATS to boost the performance of your data warehouse queries? Check out IBM's new (and free) DB2 Optimization Service Center tool.

You can generally index a table more liberally in a data warehouse versus an OLTP environment. When the workload is characterized by a high volume of quick-running transactions, with many (perhaps most) updating data in the DB2 for z/OS database, I like to be pretty conservative in terms of defining indexes on tables -- this because the aim is throughput and every index on a table makes every insert and every delete operation (and updates, if index-key column values are changed) more expensive. In data warehouse systems, data updates are often accomplished en masse during nightly ETL (extract/transform/load) runs, and while these update operations have to complete within a certain time window, that can often be accomplished even with a pretty good number of indexes, on average, defined on tables in the database (in some cases, people will drop most indexes on data warehouse tables after online query hours, then update the table data, then rebuild the indexes before the start of the new query day). While I get uncomfortable if the number of indexes defined on a table in an OLTP system exceeds 4 or 5, I'm generally OK with more indexes per table -- maybe up to the vicinity of 8-10 -- when the database serves a BI purpose. More indexes are useful for BI applications because there is more uncertainty as to which columns will be referenced in the predicates of queries.

Look for opportunities to use MQTs. Materialized query tables, relatively new on the mainframe DB2 scene, have for some time enabled DB2 for Linux/UNIX/Windows users to score big performance gains in data warehouse systems. The idea's pretty simple: a query may end up materializing a result set that is subsequently used in the generation of the final result set that's returned to the user. Such materializations are par for the course when aggregate functions (e.g., AVERAGE and COUNT) are used. It can take a while to build this intermediate result set on the fly, and if it contains a lot of rows (it might not, even if it's based on an evaluation of a lot of rows), query run time may be further elongated because the intermediate result set isn't indexed in any way. Enter the MQT -- essentially, a pre-built intermediate result set that is not only already-there, but indexable, as well. The really good part? DB2 can use an MQT in executing a query, even if the query doesn't explicitly reference the MQT -- DB2 is smart enough to figure out whether or not an MQT can be used to reduce a query's run time.

Partition those big tables. In an OLTP environment, partitioning decisions are often driven by availability considerations (e.g., reduced time to recover a partition versus an entire tablespace). For DB2-based BI applications, partitioning is a performance thing. Among other things, partitioning drives query parallelization. Something important to keep in mind: with DB2 V8 and beyond, a table can be partitioned on one key and clustered on another (meaning that data within a partition can be clustered by something other than the partitioning key). What's a big table? I'd say, most anything with a million or more rows.

Make the right table clustering decisions. Speaking of clustering, this is a really big deal in a data warehouse environment, where bunches of rows are typically retrieved in the execution of a single query (versus the smaller result sets -- often just one row -- that are more the norm for OLTP applications). When you're going after a lot of rows, locality of reference (having the desired rows physically close to each other) can make quite a difference in query run-times. Also, when tables A and B are frequently joined, it's generally a VERY good idea to have the tables clustered in the same way.


Post a Comment

Subscribe to Post Comments [Atom]

<< Home