Robert's Blog


Wednesday, March 17, 2010

My Favorite DB2 9 for z/OS BI Feature

A lot of the consulting work that I do relates to the use of DB2 for z/OS as a data server for business intelligence (BI) applications. DB2 9 for z/OS delivered a number of new features and functions that are particularly attractive in a data warehousing context, including index compression, index-on-expression, instead-of triggers, the INTERSECT and EXCEPT set operators, and the OLAP functions RANK, DENSE_RANK, and ROW_NUMBER. That's all great stuff, but my favorite BI-friendly DB2 9 feature -- and the subject of this blog entry -- is global query optimization.

I'll admit here that I didn't fully appreciate the significance of global query optimization when I first heard about it. There was something about virtual tables, and correlating and de-correlating subqueries, and moving parts of a query's result set generation process around relative to the location of query blocks within an overall query -- interesting, but kind of abstract as far as I was concerned. Time and experience have brought me around to where I am today: a big-time global query optimization advocate, ready to tell mainframe DB2 users everywhere that this is one outstanding piece of database technology. I'll give you some reasons for my enthusiasm momentarily, but before doing that I'd like to explain how global query optimization works.

Adarsh Pannu, a member of IBM's DB2 for z/OS development team, effectively summed up the essence of global query optimization when he said in a recent presentation that it's about "improved subquery processing." That really is it, in a nutshell, and here's the first of the BI angles pertaining to this story: queries that contain multiple subquery predicates are common in data warehouse environments. These subquery predicates might be in the form of non-correlated in-list subqueries, like this one:

WHERE T1.C1 IN (SELECT C2 FROM T2 WHERE…)

Alternatively, a subquery predicate might be a match-checking correlated subquery like this one:

WHERE EXISTS (SELECT 1 FROM T2 WHERE T2.C2 = T1.C1…)

Prior to Version 9, in evaluating a query containing one or more subqueries DB2 would optimize each SELECT in isolation, without regard to the effect that the access path chosen for a subquery might have on the performance of the query overall. Furthermore, DB2 would handle subquery processing based on the position of the subquery within the overall query -- in other words, if a subquery were a good ways "down" in an overall query, it wouldn't be evaluated "up front" at query execution time. Using this approach, DB2 might well choose the optimal access path for each individual SELECT in a query, but the access path for the query overall could end up being quite sub-optimal in terms of performance.

DB2 9 takes a different approach when it comes to optimizing a query containing one or more subquery predicates. For one thing, it will evaluate a subquery predicate in the context of the overall query in which the subquery appears. Additionally, DB2 9 can do some very interesting transformative work in optimizing the overall query. It might, for example, change a correlated subquery predicate to a non-correlated subquery, materialize that result set in a work file, and -- treating this materialized result set as a virtual table -- "move" it to a different part of the overall query and join it to another table referenced in the query. Needless to say, such transformations -- correlated subquery to non-correlated subquery (or vice versa), subquery to join -- are always accomplished in such a way as to preserve the result set of the query as initially coded.
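To make that a bit more concrete, here is a rough sketch of the kind of transformation involved, using the generic T1 and T2 tables from the predicate examples above (the column C3 and its local predicate are made up for illustration, and the optimizer's internal rewrite won't necessarily take this exact textual form). First, the query as coded, with a match-checking correlated subquery predicate:

SELECT T1.*
FROM T1
WHERE EXISTS
  (SELECT 1 FROM T2
   WHERE T2.C2 = T1.C1
   AND T2.C3 = 1234);

And a conceptually equivalent join form, in which the subquery has been de-correlated (the correlation predicate comes out), duplicates have been removed so that the result set is unchanged, and the materialized result is joined to T1:

SELECT T1.*
FROM T1,
     (SELECT DISTINCT C2
      FROM T2
      WHERE C3 = 1234) AS VT
WHERE T1.C1 = VT.C2;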

An example can be very useful in showing how all this comes together in the processing of a subquery-containing query in a DB2 9 environment. Suppose you have a query like the one below, which contains a match-checking correlated subquery predicate (the EXISTS clause):

SELECT * FROM TABLE_1
WHERE EXISTS
  (SELECT 1 FROM TABLE_2
   WHERE TABLE_1.COL1 = TABLE_2.COL1
   AND TABLE_2.COL2 = 1234
   AND…)

The EXPLAIN output for the query, showing the working of global query optimization, might look like this (and here we see just a few of the relevant columns from a PLAN_TABLE):
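(Exact rows and values will vary with your tables, indexes, and statistics; the excerpt below, with some column names abbreviated, is just an illustration of the pattern.)

QBLOCKNO  PLANNO  M  TNAME        QBTYPE  SCU  SCO
       1       1  0  DSNWFQB(02)  SELECT  N    N
       1       2  1  TABLE_1      SELECT  Y    Y
       2       1  0  TABLE_2      NCOSUB  N    N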



Here's what the EXPLAIN output is telling you: the optimizer changed the match-checking correlated subquery in the example to a non-correlated subquery (evidenced by "NCOSUB" in the QBTYPE column of the PLAN_TABLE). The result set of that de-correlated subquery -- the rows you'd get if you removed the predicate containing the correlation reference to a column in TABLE_1 -- is sorted to remove duplicates (this helps ensure that the overall query's result set will not be changed by the transformation) and to get the rows ordered so as to boost the performance of a subsequent join step (see the "Y" values under the SORTC_UNIQ (shortened to SCU) and SORTC_ORDERBY (SCO) columns of the PLAN_TABLE). Then, the materialized and sorted subquery result set, identified as "virtual table" DSNWFQB(02) -- the "02" referring to the query block from which the result set was generated (the de-correlated subquery) -- is moved to the "top" of the query and nested-loop joined to TABLE_1 (see the "1" under the METHOD (shortened to M) column of the PLAN_TABLE). It all boils down to this: the correlated subquery predicate is used to match rows in TABLE_1 with rows in TABLE_2, and in this case the optimizer has determined that the required row-matching can be accomplished more efficiently with a join than with the correlated subquery. It's a big-picture -- as opposed to a piece-part -- approach to query optimization.
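If you want to see this kind of output for yourself, you can EXPLAIN the query and then pull the relevant columns from your PLAN_TABLE (QBTYPE, by the way, is shorthand for the QBLOCK_TYPE column). Something along the following lines should do it; the QUERYNO value is arbitrary, and I've used just the predicates shown in the example above:

EXPLAIN PLAN SET QUERYNO = 100 FOR
SELECT * FROM TABLE_1
WHERE EXISTS
  (SELECT 1 FROM TABLE_2
   WHERE TABLE_1.COL1 = TABLE_2.COL1
   AND TABLE_2.COL2 = 1234);

SELECT QBLOCKNO, PLANNO, METHOD, TNAME, QBLOCK_TYPE,
       SORTC_UNIQ, SORTC_ORDERBY
FROM PLAN_TABLE
WHERE QUERYNO = 100
ORDER BY QBLOCKNO, PLANNO;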

So, why am I big on global query optimization? Reason 1 is the potentially huge improvement in performance that can be realized for a subquery-containing query optimized by DB2 Version 9 versus DB2 Version 8 (I'm talking about an order of magnitude or more improvement in elapsed and CPU time in some cases). Reason 2 is the fact that global query optimization can greatly improve query performance without requiring that a query be rewritten. That's important because BI queries are often generated by tools, and that generally means that query rewrite is not an option. Some of the complex queries in a data warehouse environment might instead be generated by application code written by an organization's developers (or by people contracted to write such code), and in that situation query rewrite -- via modification of the query-generating code -- is technically possible. Even then, though, rewriting a query could be very difficult. In my experience, custom-coded "query-building engines" may construct a query by successively applying various business rules, and this often takes the form of adding successive subquery predicates to the overall query (these might be nested in-list non-correlated subqueries). To "just rewrite" a query built in this way is no small thing, because coding the query with more in the way of joins and less in the way of subqueries would require more of an "all at once" application of business rules versus the more straightforward successive-application approach. Thankfully, global query optimization will often make query rewrite a non-issue.
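To give you an idea of the pattern, a query generated in that successive-rule fashion might end up looking something like this (the table and column names are made up for the example):

SELECT C.CUST_ID, C.CUST_NAME
FROM CUSTOMER C
WHERE C.CUST_ID IN                        -- subquery added by one business rule
  (SELECT A.CUST_ID
   FROM ACCOUNT A
   WHERE A.STATUS = 'ACTIVE'
   AND A.ACCT_ID IN                       -- and another nested inside it by the next rule
     (SELECT T.ACCT_ID
      FROM ACCOUNT_TXN T
      WHERE T.TXN_DATE > CURRENT DATE - 90 DAYS));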

I'll conclude with this thought: if you have a data warehouse built on DB2 for z/OS Version 8, global query optimization is one of the BEST reasons to migrate your system to DB2 9. If you're already using DB2 9, but not in a BI-related way, global query optimization is a great reason to seriously consider using DB2 9 for data warehousing.
