Robert's Blog

Thursday, January 24, 2008

Referential Integrity: Application-Based or DB2-Based?

This is a debate that has interested me for years. I've heard very good arguments on both sides, but before I get into that I'll state that what I don't like to see is the use of both DB2-based and application-based methods used to define and enforce the same referential integrity (RI) constraints. In other words, for a particular parent-child table relationship, use either a DB2-enforced constraint or an application-enforced constraint, but not both (to use both would, in my opinion, be a waste of system resources).

OK, I'll start with a look at application-enforced RI (i.e., the situation in which user-written code ensures that a row in a dependent table will not be inserted if a key-matching row does not exist in the associated parent table). The way I see it, there is at least one case in which application-based RI is your only choice, and that's when you want to referentially link two objects that are in different DB2 catalogs (in other words, in different DB2 databases if you're talking about DB2 for Linux, UNIX, and Windows; and in different subsystems in a DB2 for z/OS environment). I will point out that the need to referentially relate two objects in different databases/subsystems may not be so great when you're using DB2 (as opposed to some other relational database management systems), because DB2 can make it less necessary to place different tables in different databases/subsystems. With DB2, you can use unqualified table names in your programs' SQL statements (in fact, it's a good idea to do so) and still access different schemas (you might have different table-name qualifiers in development and production, or for sets of structurally identical tables holding, for example, data for a company's different sales regions) with no problem thanks to packages (for which the appropriate table-name qualifier can be supplied at bind time), collections, and the SET CURRENT PACKAGESET special register. With some other relational DBMSs, the use of fully-qualified table names in application SQL statements is considered a best practice, and separating different sets of related tables into different schemas can require the use of different databases.

So, how about other situations, for which the use of application-based RI is a choice and not a requirement? Some folks go the application-based RI route for performance reasons. Typically, this is done when one of the more challenging performance objectives faced by an organization is getting a massive amount of batch processing done in a limited time frame. In such cases, using application-based RI instead of DB2-based RI can reduce CPU and elapsed time for application processes. Here's a classic example: suppose that records from a file are to be inserted into a table in a DB2 database. Assume that the target table is a dependent (RI-wise) of the CLIENT table, and assume further that 1000 of the records in the input file have the same value in the Client ID field. If the RI constraint referencing the CLIENT table is DB2-defined then DB2 will check for the existence of the parent table row for each of the 1000 insert operations it performs to get the file rows into the dependent table. If the application handles RI enforcement, it can check for the existence of the parent table row when inserting the first of the 1000 rows with identical client IDs and then continue inserting the other 999 rows without performing a parent-table check.

The other classic example of situation in which application-based RI can deliver performance benefits has to do with large-scale DB2 delete operations. Suppose that you need to delete 1000 rows from a parent table, along with all the related rows in a dependent table. Assume that your program will do these deletes in a "bottom-up" fashion, so that all dependent-table rows will be deleted before the parent table rows are deleted. If DB2-based RI is used, for each delete of a parent table row DB2 will check for the existence of key-matching rows in the dependent table, even though those rows have already been deleted (DB2 does not know that your program is doing a "bottom-up" delete of rows in the tables). If the RI constraint is application-enforced, parent table deletes can proceed with no dependent-table checks because the application knows that all of the related rows in the dependent table have already been deleted.

The advantages of DB2-based RI are closely related to the disadvantages of application-based RI. When you go with application-based RI, maintenance of referential integrity becomes a programmer responsibility. Suppose the coding that enforces a fairly complex RI relationship was done by someone who had a lot of domain knowledge and who recently left the company. How well did he document his work? There's never a problem when you need information about DB2-defined RI constraints - just look in the catalog tables that contain RI information.

Another disadvantage of application-based RI is that it requires a lot of discipline, process, and structure with respect to data-change operations. You can't have someone just write a data-insert program if it lacks the requisite RI logic. What do you do in such cases? Do you have one great big routine that handles all data-change operations for a set of referentially-related tables (the bigger they are, the more difficult application maintenance becomes)? Do you instead go for smaller and more specific routines, and replicate associated RI logic (keeping that logic, spread across several programs, consistent over time could be a challenge)? If you define the RI constraints via DB2, programmers can code leaner data-change logic, and that's generally a good thing.

You can also realize performance benefits through the use of DB2-based RI. Just as application-based RI can improve CPU and elapsed time for certain large-scale, batch-oriented insert and delete processes, DB2-based RI can deliver improved performance in a transactional environment. Why? Because when DB2-based RI is in effect, a program that (for example) inserts one row into a dependent table can perform the row insert and the requisite RI check with one trip back and forth across the application-DB2 interface. In an application-based RI environment the RI check and the insert require two application-DB2 round trips. Those trips back and forth across the interface require the execution of CPU instructions, so the fewer of those you have to do, the better for a transaction's CPU efficiency (and elapsed time may be a little better, as well).

Oh, and something else: if you use DB2-based RI and you want to verify that the data in a table is in fact OK from an RI perspective, just use the DB2 for z/OS CHECK DATA utility or the integrity-checking capability available through the DB2 for LUW Control Center. Want to do the same thing when RI constraints are application-based? If so, be prepared to write your own RI-checking code.

What's my own preference? I'd have to say it's DB2-based RI. For one thing, performance concerns over the past 10 years or so have been shifting from batch run-times to transaction throughput, and that makes DB2-based RI a good choice. Of greater importance to me is the fact that DB2-based RI takes a big issue off of programmers' plates, and today's need for quick-response application coding and flexible systems makes that a big deal. So, let DB2 do the RI work for you.


Blogger Rick Butler said...

Hi Robert, will you also consider mentioning informational referential integrity?
V8 SQL ref says:
Referential constraints between base tables are also an important factor in determining whether a materialized query table can be used for a query. For instance, in a data warehouse environment, data is usually extracted from other sources, transformed, and loaded into data warehouse tables. In such an environment, the referential integrity constraints can be maintained and enforced by other means than the database manager to avoid the overhead of enforcing them by DB2. However, referential constraints between base tables in materialized query table definitions are important in a query rewrite to determine whether or not a materialized query table can be used in answering a query. In such cases, you can use informational referential constraints to declare a referential constraint to be true to allow DB2 to take advantage of the referential constraints in the query rewrite. DB2 allows the user application to enforce informational referential constraints, while it ignores the informational referential constraints for inserting, updating, deleting, and using the LOAD and CHECK DATA utilities. Thus, the overhead of DB2 enforcement of referential integrity is avoided and more queries can qualify for automatic query rewrite using materialized query tables.

So an RI constraint can be defined as NOT ENFORCED and can also be ENABLED for QUERY OPTIMIZATION.

Also, in a a warehouse
environment, the theory is that rows are not changed once they have been added.

February 5, 2008 at 9:32 AM  
Blogger Robert said...

Excellent point, Rick. It did not occur to me to mention informational RI, but it should have. In addition to the MQT benefit that you've mentioned, I can see informational RI constraints in a data warehouse environment as a fairly easy means of documenting (in the catalog) important information about table relationships in the database.

February 8, 2008 at 2:33 PM  
Anonymous Anonymous said...

DB2 based, always. The query optimizer also takes the RI's into account when it's deciding the optimum access path (at least on DB2-LUW, for all contraints).

March 11, 2008 at 3:34 AM  

Post a Comment

Subscribe to Post Comments [Atom]

<< Home