Robert's Blog


Wednesday, July 1, 2009

Outer Join: Get the Predicates Right

A few days ago, I was working with a team of people from a large company, trying to improve the performance of some queries executing in a DB2 for z/OS-based data warehouse environment. One query in particular was running much longer than desired, and consuming a lot of CPU time, to boot. One of the team members noticed that the problem query, which involved several table-join operations, had a rather odd characteristic: no WHERE-clause predicates. All the predicates were in the ON clauses of the joins. In fact, there was even an inner join of table TAB_A (I won't use the real table names) with SYSIBM.SYSDUMMY1 (which of course contains nothing), with two ON predicates referencing columns in TAB_A, like this:

SELECT...
FROM TAB_A
INNER JOIN SYSIBM.SYSDUMMY1
ON TAB_A.COL1 = 12
AND TAB_A.COL2 = 'X'
...

One of the application developers (the queries we were analyzing are report-generating SELECT statements built and issued by application programs) removed this inner join to SYSDUMMY1 and changed the two ON-clause predicates to WHERE-clause predicates, and the query's elapsed and CPU times went way down.

We were left thinking that this Cartesian join (i.e., a join with no join columns specified) to SYSDUMMY1 might reflect someone's thinking that WHERE-clause predicates are not needed in table-join SELECT statements. In fact, the use of WHERE-clause predicates versus ON-clause predicates in table-join statements can have a very significant impact on query performance. We looked at another long-running query in the aforementioned data warehouse application, and this one also involved a join operation and also had no WHERE-clause predicates. Importantly, the join was a left outer join, and the ON clause included multiple predicates that referenced columns of the left-side table
(the table from which we want rows for the result set, regardless of whether or not there are matching right-side table rows). A DBA took one of these left-side-table-referencing ON-clause predicates and made it a WHERE predicate, too. In other words, an ON-clause predicate like TAB_L.COL2 = 5 (with TAB_L being the left-side table in the left outer join operation) was added to the query in the form of a WHERE-clause predicate. The result? Response time for the query went from 10 minutes to less than 1 second.

Why did the query's performance improve so dramatically, when all the DBA did was make an ON-clause predicate a WHERE-clause predicate? Simple: for a left outer join, a WHERE-clause predicate that references a column of the left-side table will filter rows from that table. That same predicate, if coded in the ON-clause of the SELECT statement, will NOT filter left-side table rows. Instead, that predicate will just affect the matching of left-side table rows with right-side table rows. To illustrate this point, consider the predicate mentioned in the preceding paragraph:

TAB_L.COL2 = 5

Suppose this predicate is included in the query in the following way:

SELECT TAB_L.COL1, TAB_L.COL2, TAB_L.COL3, TAB_R.COL4
FROM TAB_L
LEFT OUTER JOIN TAB_R
ON
TAB_L.COL3 = TAB_R.COL3
AND TAB_L.COL2 = 5

With no WHERE-clause predicates in this query, all rows from TAB_L will qualify - none will be filtered out. What the TAB_L.COL2 = 5 predicate will do is affect row matching with TAB_R: if a row in TAB_R has a COL3 value that matches the value of COL3 in TAB_L, and if the value of COL2 in that row is 5, the TAB_R row (specifically, COL4 of that row, as specified in the query's SELECT-list) will be joined to the TAB_L row in the query result set; otherwise, DB2 will determine that the TAB_R row is not a match for any TAB_L rows (and the TAB_L rows without TAB_R matches will appear in the result set with the null value in the TAB_R.COL4 column).

Now, suppose that the same predicate is specified in a WHERE clause of the query, as follows:

SELECT TAB_L.COL1, TAB_L.COL2, TAB_L.COL3, TAB_R.COL4
FROM TAB_L
LEFT OUTER JOIN TAB_R
ON
TAB_L.COL3 = TAB_R.COL3
WHERE TAB_L.COL2 = 5

In this case, the predicate will be applied to TAB_L before the join operation, potentially filtering out a high percentage of TAB_L rows (as was the case for the query cited earlier that went from a 10-minute to a sub-second run time).

So, a person codes a predicate that references a column of the left-side table of a left outer join operation, and places that predicate in an ON clause of the query, versus a WHERE clause
(and I'm not talking about a join predicate of the form TAB_L.COLn = TAB_R.COLn, which you expect to see in an ON clause). Is it possible that the query-writer actually wants the result described above, to wit: no filtering of left-side table rows, and a further condition as to what constitutes a right-side table match? Yes, that's possible, but there's a very good chance that this person mistakenly placed the predicate in an ON clause because he (or she) thought that this would have the same effect as coding the predicate in a WHERE clause. Mistakes of this type are fairly common because misunderstanding with respect to the effect of predicates in outer join queries is quite widespread. Patrick Bossman, a good friend who is a query optimization expert with IBM's DB2 for z/OS development organization, pointed out as much to me in a recent e-mail exchange. Patrick also sent me the links to two outstanding articles written by Terry Purcell, a leader on the IBM DB2 for z/OS optimizer team. These articles (actually, a part-one and part-two description of outer join predicates and their effects on query result sets) were written a few years ago, while Terry was with DB2 consultancy Yevich, Lawson, and Associates, but the content is still very much valid today (Patrick considers the articles to be "a must-read for folks writing outer joins"). Check 'em out.

Outer join is a powerful SQL capability that is widely used in DB2 environments. If you code outer join queries (or if you review such queries written by others), make sure that you use ON-clause and WHERE-clause predicates appropriately, so as to get the right result (job one) and the best performance (job two, right behind job one).

6 Comments:

Blogger Agoglife said...

very good point, thanks, please keep them coming

July 8, 2009 at 7:13 PM  
Blogger Robert Catterall said...

Thanks for the positive feedback, Arun. I'll definitely keep 'em coming.

July 8, 2009 at 8:18 PM  
Anonymous Anonymous said...

I think the sentence:
"In this case, the predicate will be applied to TAB_R before the join operation, potentially filtering out a high percentage of TAB_R rows (as was the case for the query cited earlier that went from a 10-minute to a sub-second run time)."
may change to TAB_R to TAB_L, please recommend.

Somsak (somsakji@gmail.com)

August 17, 2009 at 10:39 AM  
Blogger Robert Catterall said...

Right you are, Somsak - good catch! I just now corrected that error.

August 23, 2009 at 7:00 PM  
Anonymous Anonymous said...

Robert,
I know this is a late comment on this blog entry. I read when it first came out and was looking back over it due to AGAIN pointing someone in its direction. On second reading I just wanted to mention that it should be stressed that in many cases the predicate in the ON clause may be there for filtering when to perform the Left Outer Join. It is something I have been in the habit of doing for many years and, in fact, Susan Lawson mentions doing so in some of her literature. For instance, I have a parent table and three child tables, there will be a row in one of the three child tables based on a field in the parent table and I want a report that shows me all of the corresponding data for a particular item I would code the query like:
select parent.cols
,child1.cols
,child2.cols
,child3.cols
from parent
left outer join child1
on parent.key = child1.key
and parent.type = 'child1'
left outer join child2
on parent.key = child2.key
and parent.type = 'child2'
left outer join child3
on parent.key = child3.key
and parent.type = 'child3'
...with or without a where clause on the parent table
Also, the mistake I see most often (and is, usually, more expensive) here is turning the LEFT outer join to an INNER join. Such as:
select cols
from table(s) a
left outer join my_table b
on a.key = b.key
....
where ....
and b.field = ?
Dave Nance
dav1mo@yahoo.com

June 3, 2010 at 7:43 AM  
Blogger Robert Catterall said...

Good stuff, Dave -- thanks for passing it along.

Robert

June 3, 2010 at 7:11 PM  

Post a Comment

Subscribe to Post Comments [Atom]

<< Home