Robert's Blog


Thursday, October 25, 2007

Getting DB2 Data From A to B

It's a fairly common challenge: you have an operational DB2 database (meaning, one that supports one or more of an organization's core business applications), and a query/reporting database, and you want to make sure that the data in the latter stays fairly current with respect to update actions (insert/update/delete) affecting data in the former. How should you address this challenge? The answer, of course, depends on a number of factors. In this post, I'll describe some data-change replication solutions that might be a good fit, given certain circumstances.

Often, development of an effective strategy for data-change replication begins with an understanding of the organization's priorities in this area. For one company, the priorities could be minimization of dollar cost and implementation time. In that case, one might opt to go with a trigger-driven approach. Database triggers provide some important benefits when the goal is a relatively low-cost replication solution that is up and running in a relatively short period of time:
  • You don't have to buy them - Trigger functionality is built into DB2.
  • You don't have to alter existing source-data-change programs - The triggers are fired by what these programs are already doing.
  • They provide a lot of flexibility - A trigger can initiate an action before or after the triggering data change operation has taken effect. The triggered action can be anything doable by way of an SQL statement or statements, and since a stored procedure call is an SQL statement, you can accomplish just about anything that's programmatically possible by way of a trigger.
A very important thing to keep in mind when contemplating a trigger-based replication solution is this: the action taken via a trigger is synchronous with respect to the triggering data-change operation. In other words, an insert statement (for example) aimed at a table on which an insert trigger has been defined will not complete successfully unless and until the action taken by the trigger has completed successfully. This being the case, one thing that you should NOT do through a trigger is apply a captured data change to the target query/reporting database. This would take too long, and would slow down source-data change operations too much. Instead, you should use the trigger to place the captured data change information into a holding tank from which an apply process running on the target system can retrieve it. The holding tank is sometimes in the form of a DB2 table (or tables) on the source system. An alternative for temporary storage of the captured, to-be-applied data changes is a message queue, such as that implementable through IBM's WebSphere MQ product (hereinafter referred to as MQ).
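
To make the holding-tank idea a bit more concrete, here is a minimal sketch of an after-insert trigger that copies captured change data into a staging table. The table and column names (CUSTOMER, CUSTOMER_CHG_STG, and so on) are hypothetical, you'd want companion update and delete triggers in practice, and if you create a trigger like this through SPUFI or another DDL tool you'll need a statement terminator other than the semicolon used inside the trigger body - but the general shape is the same:

  CREATE TABLE CUSTOMER_CHG_STG
    (CHG_TYPE    CHAR(1)    NOT NULL,   -- 'I' (insert), 'U' (update), 'D' (delete)
     CHG_TSTAMP  TIMESTAMP  NOT NULL,
     CUST_ID     INTEGER    NOT NULL,
     CUST_NAME   VARCHAR(60));

  CREATE TRIGGER CUSTINST
    AFTER INSERT ON CUSTOMER
    REFERENCING NEW AS N
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      -- capture the change into the holding-tank table
      INSERT INTO CUSTOMER_CHG_STG (CHG_TYPE, CHG_TSTAMP, CUST_ID, CUST_NAME)
        VALUES ('I', CURRENT TIMESTAMP, N.CUST_ID, N.CUST_NAME);
    END

An apply process on the target side would then read rows from the staging table (in CHG_TSTAMP order) and delete them once they've been applied.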

Here are some of the things I like about the use of an MQ queue for captured data changes that will be applied on the target database system:
  • MQ provides assured delivery of messages (such as those containing captured source-system data changes) from one location to another (or to several target locations through publish and subscribe functionality, also known as pub/sub).
  • An MQ queue can be configured as a recoverable resource, so that even if a server on which a queue is located crashes, no messages will be lost.
  • If a burst of source-system data changes causes messages to be placed on a queue (via triggers, for example) faster than they can be taken off the queue by target-system processes, that's not a problem - the queue depth just increases for a while, until the PUT (i.e., the placing of messages on the queue) and GET activity (referring to the retrieval of messages from the queue) returns to a state of equilibrium.
  • MQ runs on just about any server platform, under virtually all of the major operating systems.
  • An MQ GET can be executed as a "destructive read" (i.e., a message can be simultaneously retrieved from and removed from a queue), providing a very CPU-efficient message delete capability.
  • MQ messages can be processed on a FIFO basis (first in, first out) - important if the order of data change application on the target system is important.
  • Built-in DB2 SQL functions such as MQSEND and MQRECEIVE make it much easier to code triggers that interact with MQ (there's a brief sketch of this right after this list).
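
By way of a hedged illustration of that last point, here is what the insert trigger from the earlier sketch might look like if the captured change were PUT on an MQ queue via the DB2-supplied MQSEND function instead of inserted into a staging table. The send service name ('REPL.SENDER') and policy name are placeholders for whatever your MQ and DB2 administrators have defined, and the message format (a simple comma-delimited string) is just for illustration:

  CREATE TRIGGER CUSTINSQ
    AFTER INSERT ON CUSTOMER
    REFERENCING NEW AS N
    FOR EACH ROW MODE DB2SQL
    BEGIN ATOMIC
      -- MQSEND places the message on the queue associated with the named
      -- send service; the third argument is the message data
      VALUES DB2MQ.MQSEND('REPL.SENDER', 'REPL.POLICY',
                          'I,' || CHAR(N.CUST_ID) || ',' || N.CUST_NAME);
    END

On the target side, an apply program would GET these messages in FIFO order and turn them back into SQL against the query/reporting database (MQRECEIVE being one way to do that if the apply process is itself DB2-based).
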
Good stuff, yeah, but it must be said that at some companies, the volume of source-system data changes is too high for triggers to be a top-choice solution (there are systems that process 1000 or more database changes per second during peak times). These organizations are more likely to opt for a data replication solution that will interface with the source system's database transaction log manager. DB2 for z/OS users can write such routines themselves, but most choose to purchase a replication tool that will do the job for them. Real-time data-change replication tools that can capture from a DB2 for z/OS system and apply on target systems running DB2 or other DBMSs are available from several vendors.
The important thing is to know that you have options when it comes to implementing a DB2 data replication solution. Think about the characteristics of your environment, such as the rate of data change activity for source tables of interest, the allowable lag time between a source data change and the corresponding target data change (could be an hour, could be seconds), the tolerance for a replication solution's impact on CPU consumption and throughput on the source side (which depends largely on current source-system CPU utilization, transaction response times, and batch run times), and budget constraints (vendor tools might or might not be a possibility). With your situation well understood, determine the approach that makes sense for you.

Thursday, October 18, 2007

From Data to Information

Greetings from Las Vegas. It is popularly said of this glitzy city that "what happens here, stays here." This post is an exception to that rule. IBM is communicating an important idea to about 6,500 people (myself included) attending the Company's Information On Demand 2007 Conference, and I want to communicate it further through my blog. The message is deceptively simple: in the world of data, what really matters is information. Data-oriented IT professionals - be they DBAs, systems programmers, or application developers - focus a lot of attention on database management systems such as IBM's DB2. These database "engines" are obviously important, but they are pieces of a larger puzzle, and the big picture is all about information.

So, what's the distinction between data and information? As I see it, information is data in a useful form. I'll illustrate the data-information difference with an example culled from my baby-boomer memory bank: quite a few years ago, when the "Saturday Night Live" TV show was just getting started, Chevy Chase was hilarious as the anchor of the fictitious "Weekend Update" news program. Once, in that role, Chase said that he had some football scores to report: "12 to 7, 28 to 14, 35 to 10, and 17 to 3." 12 to 7 is data. Dallas 12, Philadelphia 7 is information.

Now, you can't stop (at least, I can't) with the notion that information is data in a useful form. Information confers value on data, but the value of information itself depends on a number of factors, and people who work with database technology - in whatever capacity - would do well to keep these very much in mind:
  • Timeliness - Take a look at the "front page" of the CNN.com Web site. You'll see next to some of the headline items an indication (emphasized in red) of how much time has elapsed since the associated stories were posted or most recently updated (e.g., 38 min, 5 min). The folks at CNN are no dummies. They know that information can quickly go stale, and they want their site visitors to see that the goods on display are fresh. The same is often true in a business setting: getting information quickly to people who need it can have a significant impact on an organization's performance. I like the way Arvind Krishna, IBM's Vice President for Data Servers and Information Management Development, puts it when he speaks of the "increasing velocity of information" - the need to get information ever more quickly from point of origin to point(s) of consumption.
  • Accuracy - This is much more than simply guarding against erroneous database updates and inserts (although that is important). Informational accuracy is also about data rationalization. If you're a clothing company and you don't know that the Pat Richards in your customer database is the same person as the Patrick Richards who's also in the database, you won't be building goodwill when you send Pat a catalog of ladies' sportswear.
  • Reliability - A company loses credibility with securities analysts if its earnings forecasts consistently turn out to be well off the mark. Is there a problem with the data on which the company bases its forecasts, or with the tools and processes that generate the estimates?
  • Consistency - You've heard the words in countless commercials: "Has this ever happened to you?" People from an organization call five individuals employed by one of the company's suppliers, and get five different answers to the same question because the five people on the supplier end are using five different (and uncoordinated) informational systems. This kind of situation underscores the importance of having a "single version of the truth."
  • Accessibility - Of course you want end users to have ready access to the information they need to do their jobs, but what about application developers? Can they access data in your system in a way that boosts their programming productivity? If they want to use JDBC or ADO.NET or Ruby on Rails, are they told, "COBOL or nothing?" If they want to use dynamic SQL to speed the delivery of a new application function, are they stonewalled? Is that right?
  • Relevancy - Information about last month's electric bill for a store can be useful, but not to the person responsible for keeping the right product mix on the store's shelves. The right information has to get to the right people.
Sometimes IT infrastructure people draw the line at cost-efficiency, throughput, scalability, and availability, and figure that other imperatives (such as those listed above) are someone else's problem. The goals towards which these people devote their efforts are indeed important, but they are only part of the story when it comes to providing the informational capabilities an organization needs for breakthrough success. Techies ought to realize that they can make valuable contributions within the broader domain of information management, and that they are on the same team as their colleagues who work further "downstream" along the course by which data becomes what the business really needs: information.

Wednesday, October 10, 2007

What Happens to Your Data When it Gets Old?

At way too many sites, the answer to this question is, "Nothing." Application developers and DBAs spend lots of time figuring out how data in a database will be inserted, updated, and retrieved, but all too often the matter of removing data from the database ends up getting deferred until sometime after an application has gone into production, if not indefinitely. As Arnold Schwarzenegger once said in a movie: "Big mistake."

What's the problem with leaving old data (and by this I mean data that is either never referenced or referenced very infrequently) in a database? A no-delete policy (and this policy may be de facto, as opposed to being intentional) can cause pain in several ways:
  • As the size of the database grows, data pages can become increasingly filled with rows that are rarely referenced. This can occur when 1) access frequency declines as rows age (quite common) and 2) data is not clustered by timestamp or some other continuously-ascending key (DB2 tables are often clustered so as to keep rows pertaining to a particular customer ID or product number in close proximity to each other, as opposed to being clustered in an all-new-rows-to-the-back fashion). As the old rows become more and more numerous, a query that returns some number of newer rows has to access more and more pages to assemble the result set. That, in turn, causes CPU consumption to increase (for more information on this topic, see my post of September 27, 2007, titled "The Most Important DB2 for z/OS Performance Metric").
  • Another contributor to greater page access activity (and with it, more CPU consumption): more rows in a table lead, of course, to more leaf pages in the indexes defined on the table, and more leaf pages can eventually cause a 3-level index (root page, non-leaf pages, and leaf pages) to become a 4-level index (with a second "layer" of non-leaf pages). Just like that, the page accesses needed to traverse that index tree go up by 33% (four per probe instead of three).
  • As tables grow, the CPU cost and run times of utility operations tend to increase (examples include the DB2 REORG, COPY, and RUNSTATS utilities). This is true regardless of how the data in a table is clustered.
  • More rows in the database will lead to increased demand for disk space. Yeah, disk capacity keeps getting cheaper on a per-megabyte basis, but vendors are not (as far as I know) ready to give disk subsystems away. And guess what? The cost of additional disk devices is not your only concern here. These days, a lot of executives with data center management responsibilities worry about floor space and power distribution limitations (some organizations are barely able to plug in one more "box" in their primary data centers).
  • As the database grows, the cost of disaster recovery (DR) preparedness grows. A large number of organizations have implemented near-real-time data replication solutions that keep a DR-site copy of a production database very close in time with respect to the state of the source database. These solutions - some utilizing DBMS-provided functionality such as the HADR feature of DB2 for Linux/UNIX/Windows, others using disk array-based replication - are great for taking a recovery point objective (otherwise known as RPO - the target for maximum data loss in the event of a disaster situation at the primary data center) down below one minute, but the larger the primary database becomes, the more expensive this DR set-up becomes.
So, getting old data out of your production database is good. But is something so easily said, easily done? Not necessarily. Even deciding on delete criteria can be tricky. Yes, it is possible that something as simple as the age of a record (meaning the time since the row was inserted into the database) could, by itself, be the record's "time to delete" indicator. On the other hand, it could be that a record, even though its age has passed a delete threshold, cannot be deleted because some related event - the delivery of an ordered product, for example - has not yet occurred. Delete logic for an application's data records can actually be quite complex, and that is why user programming may be needed and why record deletion needs to be part and parcel of an application's design, and not a post-implementation activity.
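
To illustrate the kind of thing I mean (with purely hypothetical table and column names), a purge job's delete criteria might look less like "everything older than 180 days" and more like this:

  -- purge order rows past the age threshold, but only if the related
  -- delivery event has occurred
  DELETE FROM ORDERS O
    WHERE O.ORDER_DATE < CURRENT DATE - 180 DAYS
      AND EXISTS (SELECT 1 FROM SHIPMENTS S
                    WHERE S.ORDER_ID = O.ORDER_ID
                      AND S.DELIVERED_DATE IS NOT NULL);

Add periodic commits (or cursor-driven deletes) for a large table, and you can see why purge logic deserves design attention up front.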

Note that getting an old row out of a database may be a two-stage process. Some organizations will pair certain relatively large tables with associated history tables. Consider, for example, an ORDER table and a corresponding ORDER_HISTORY table. Stage 1: thirty days after a row has been inserted into the ORDER table, it is moved to the ORDER_HISTORY table. Stage 2: 150 days after being moved to the ORDER_HISTORY table, the row is removed from the database. An organization utilizing this approach might do so because it is known that information about an order is most likely to be accessed within 30 days of the order being placed. After that, the information will be accessed relatively infrequently for the next 150 days (the organization provides customers with online access to 180 days of order history information - the 30 days from the ORDER table plus the 150 days in the ORDER_HISTORY table). With the order rows split across two tables, performance is enhanced because the smaller, more active ORDER table provides better locality of reference for the most frequently accessed rows, thereby reducing page accesses and more effectively leveraging buffer pool resources.
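
In batch-job terms, the two stages described above might boil down to something like the following sketch (assuming an ORDER_DATE column of type DATE, and using ORDERS and ORDER_HISTORY as stand-ins for the actual table names, since ORDER by itself is an SQL reserved word):

  -- stage 1: move rows more than 30 days old to the history table
  INSERT INTO ORDER_HISTORY
    SELECT * FROM ORDERS
    WHERE ORDER_DATE < CURRENT DATE - 30 DAYS;

  DELETE FROM ORDERS
    WHERE ORDER_DATE < CURRENT DATE - 30 DAYS;

  -- stage 2: remove history rows once they are more than 180 days old
  -- (30 days in ORDERS plus 150 days in ORDER_HISTORY)
  DELETE FROM ORDER_HISTORY
    WHERE ORDER_DATE < CURRENT DATE - 180 DAYS;

A real job would commit periodically and take care that rows can't slip in between the INSERT and the DELETE of stage 1 (running in a quiet window, or qualifying both statements with the same cutoff timestamp captured at the start of the job, are two common approaches).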

Sticking with the above example, what happens to an order record after 180 days in the database? Well, it will be purged (tossed into the proverbial bit bucket) if the information is no longer needed by the organization; otherwise, it will be archived. The idea behind archiving data is to keep it available for possible retrieval (for legal reasons and/or customer-care reasons and/or...) for a considerable period of time (usually measured in years), in a way that minimizes storage costs (think higher-density, lower-performance disk storage in a multi-tiered storage infrastructure) and provides for automated location and retrieval as needed. Doing this in a roll-your-own fashion is a tall order, especially if the database is large and archive/retrieval rules are complex (an insurance company, for example, may need to know for some years what a policyholder's coverage was on a particular day in the past). Thus the growth of the Information Lifecycle Management (ILM) segment of the software industry. IBM is a major player in this area, particularly in light of Big Blue's recent acquisition of Princeton Softech. ILM solutions are also available from several other vendors, including Symantec and EMC. You might want to consider checking these out.

Here's the bottom line: when you're working out the ways in which database records will be inserted, retrieved, and updated in support of a new application, don't neglect to consider the important matter of removing data from the database when it is no longer needed or when access frequency drops to a very low level. Data purge and archive can be a fairly complicated process, so give yourself enough runway to design or purchase an effective solution before the application is in production. For applications already in production, see if you can phase in an ILM solution - preferably one that can work across platforms and DBMSs. Take it from me: getting data that doesn't belong in your database out of your database should be a priority for your organization. Bigger is NOT always better. Get the excess bulk out of your database, OK?

Wednesday, October 3, 2007

Pros and Cons of Dynamic SQL

Static versus dynamic SQL is a debate that's gone on for years in DB2 circles. I'm weighing in on the matter with this post because the nature of the debate has changed recently in some interesting ways. I'll be writing primarily from the DB2 for z/OS perspective, but many of the issues are the same, or nearly so, in DB2 for Linux/UNIX/Windows (LUW) environments.

When dynamic SQL is the topic of a discussion, you typically have application developers on the one side ("Let us use dynamic SQL!") and DB2 systems programmers and DBAs on the other side ("Sure - as soon as pigs learn to fly"). The appdev people are usually seeking a programming productivity boost, while the systems folks see dynamic SQL as being a risk factor with respect to their twin goals of providing a highly-available and CPU-efficient DB2 subsystem. Both sides have valid points to make, and chances of reaching agreement are enhanced if each group makes an effort to understand the motivation behind the other group's arguments for and against the use of dynamic SQL.

The sysprogs and DBAs often argue that dynamic SQL will consume too many CPU cycles on the DB2 data-serving platform, compared to the same statements executed as static SQL. Let's look at this issue more closely. If the concern is over the cost of preparing a dynamic SQL statement (a cost incurred at package bind time for static SQL), it's worth considering the nature of the dynamic SQL statements that would be issued by an application program. If the statements are expected to be very short-running, the CPU cost of statement preparation could indeed be several times the cost of statement execution (unless DB2 dynamic statement caching is utilized - more on this momentarily). It could be, for example, that a dynamic SQL statement consumes half of a CPU millisecond in execution, and two CPU milliseconds in preparation. If, on the other hand, the dynamic SQL statements are expected to be relatively long-running, statement preparation cost could pale next to execution cost, making the CPU consumption of statement preparation pretty much a moot point (imagine a dynamic SQL statement that consumes a whopping 10 CPU milliseconds in preparation and then uses 5000 CPU milliseconds - 5 CPU seconds - in execution).

Sometimes, the DBAs and sysprogs are concerned that DB2 CPU efficiency could be negatively impacted by a "runaway" dynamic SQL statement. Here, it's important to make the distinction between dynamic SQL that is ad-hoc (i.e., statements will be very much unpredictable with respect to form and structure) and dynamic SQL that is not a mystery at all because it is written by programmers (and quite possibly reviewed by DBAs prior to program implementation). This not-ad-hoc stuff is what I call structured dynamic SQL. It might be that an application enables an end user to supply search arguments that will be used in a query's predicates by typing values into some or all of a group of fields on a screen. To handle the various possible predicate combos, the programmers could code a bunch of DECLARE CURSOR statements, with one being issued based on a user's supplied values, or they could write code that would dynamically build the statement string based on the user's input and then pass that statement string to DB2 for preparation and execution. That's predictable dynamic SQL, and the statements can be tested before program implementation because their structure is known ahead of time: they share a common overall form, and the possible predicate combinations come from a definable set determined by the fields a user can fill in (a sketch of the string-building approach follows this paragraph).
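
Here is a hedged sketch of that string-building approach in the form of an SQL procedure (names like FIND_CUSTOMERS, CUSTOMER, LAST_NAME, and CITY are made up for the example, and the exact syntax details differ a bit between DB2 for LUW and DB2 for z/OS SQL procedures). The point to notice: even though the statement is prepared dynamically, its overall form is fixed in the code, and every variation can be tested before implementation.

  CREATE PROCEDURE FIND_CUSTOMERS
    (IN P_LAST_NAME VARCHAR(30),
     IN P_CITY      VARCHAR(30))
    DYNAMIC RESULT SETS 1
    LANGUAGE SQL
  BEGIN
    DECLARE V_STMT VARCHAR(500);
    DECLARE C1 CURSOR WITH RETURN FOR S1;

    -- build the statement string from whichever search arguments were
    -- supplied, using parameter markers (not literals) for predicate values
    SET V_STMT = 'SELECT CUST_ID, CUST_NAME FROM CUSTOMER WHERE 1=1';
    IF P_LAST_NAME IS NOT NULL THEN
      SET V_STMT = V_STMT || ' AND LAST_NAME = ?';
    END IF;
    IF P_CITY IS NOT NULL THEN
      SET V_STMT = V_STMT || ' AND CITY = ?';
    END IF;

    PREPARE S1 FROM V_STMT;

    -- open the cursor, supplying values for however many markers were built
    IF P_LAST_NAME IS NOT NULL AND P_CITY IS NOT NULL THEN
      OPEN C1 USING P_LAST_NAME, P_CITY;
    ELSEIF P_LAST_NAME IS NOT NULL THEN
      OPEN C1 USING P_LAST_NAME;
    ELSEIF P_CITY IS NOT NULL THEN
      OPEN C1 USING P_CITY;
    ELSE
      OPEN C1;
    END IF;
  END

The "WHERE 1=1" bit is just a convenience so that every appended predicate can begin with AND; the parameter markers are what really matter, for reasons covered a bit further down in the discussion of dynamic statement caching.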

Another example of structured dynamic SQL involves the use of a database-access API, such as JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity), that causes DB2 to process the resultant SQL statements dynamically. In such cases, the structure and form of the SQL statements are certainly known beforehand, and unpredictability is virtually a non-issue.

"Whoa," say some sysprogs and DBAs, "Even when we know what an incoming dynamic SQL statement will look like, we could still have unpredictability because a statement might get a different access path from one execution to another." Technically, that's true, given that the statements are prepared at each execution (again, deferring discussion of dynamic statement caching), and the catalog statistics on which the DB2 optimizer makes its access path decisions could change from one issuance of a statement to another. Here, as in many cases, it's important to dig deeper before letting access path change concerns be a dynamic SQL deal-breaker. First of all, if an access path changes because of updated catalog stats, the change should be for the better (DB2 makes better decisions when catalog stats are more up-to-date and accurate). Second, some statements are very unlikely to get a different access path, even after catalog stats changes. If, for example, a structured dynamic SQL statement uses an index to retrieve a row based on the match of a unique index key, that path is highly to differ from one execution of the statement to another.

Resource contention caused by dynamic SQL is another concern frequently cited by DB2 DBAs and sysprogs. A key worry is often the S-lock taken on a DBD (database descriptor) as a result of a dynamic SQL statement targeting one or more tables within a DB2 database (and here I'm using "database" in the narrow DB2 for z/OS sense: a user-specified set of one or more tables and related objects that comprise a subset of all the tables defined within a DB2 subsystem). This S-lock will block DDL statements (DROP, CREATE, ALTER) issued, while the lock is held, against one or more of the objects in that same database. OK, again, some digging is advisable. How often, and when, are DDL statements issued in the production environment? How long do the dynamic SQL statements typically run? Can objects accessed by the dynamic SQL statements be isolated within a certain database or databases within the DB2 subsystem (and likewise, can objects frequently targeted by DDL statements be isolated in smaller databases) so as to minimize DBD contention?

What about contention with utility processes, such as online REORGs of indexes and tablespaces? This should not be an issue, since IBM DB2 utilities utilize drain locking to gain restricted access to objects, and application processes that issue dynamic SQL statements must release claims (whether read or write) on objects at commit points - just as static SQL-issuing processes must do.

How about security? That's certainly an important consideration. When dynamic SQL is used, the DB2 authorization ID of a process must have the table access privileges (SELECT, INSERT, UPDATE, DELETE) needed for successful statement execution, whereas successful static SQL execution requires only that the DB2 auth ID of the application process have the EXECUTE privilege on the package containing the SQL statements. One oft-used solution here is to package the "table-touching" SQL statements in DB2 stored procedures which can be invoked using JDBC or ODBC or whatever database access API the application programmers want to use. When that's not an option (as when a program will dynamically build an SQL statement string based on user-supplied predicate values), some organizations will have programs provide an application-specific DB2 authorization ID that is not equal to any individual's ID - in that case, data security becomes mostly an application-side responsibility.
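
In GRANT terms (with made-up names), the difference the preceding paragraph describes looks something like this:

  -- with dynamic SQL, the application's auth ID needs table-level privileges
  GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE CUSTOMER TO APPID1;

  -- with the table-touching statements packaged statically (in a stored
  -- procedure or package), EXECUTE authority is all the auth ID needs
  GRANT EXECUTE ON PROCEDURE FIND_CUSTOMERS TO APPID1;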

There are times, of course, when the DB2 DBAs and sysprogs have to go with dynamic SQL. The classic example here is a purchased application such as SAP or PeopleSoft that uses what DB2 sees as dynamic SQL because the application can work with different DBMSs and static SQL is pretty much a DB2-exclusive concept. When that is the case, the usual response is to heavily leverage DB2 dynamic statement caching so as to slash statement preparation costs. The amount of virtual storage an EDM pool can consume when dynamic statement caching is used with a big application (could be several hundred megabytes) used to worry some DB2 systems programmers, but that worry faded when DB2 for z/OS (starting with Version 6) allowed the part of the EDM pool used for caching prepared dynamic SQL statements to be kept in a z/OS data space, and just about disappeared when DB2 for z/OS Version 8 delivered 64-bit virtual storage addressing.

An important point about dynamic statement caching: it's of very little value if the dynamic SQL statements in question are not parameterized. That is to say, if predicate values are literals plugged directly into statement strings, dynamic statement caching won't do much to help reduce statement preparation costs unless the plugged-in predicate values hardly ever change. It's a much better coding practice to use parameter markers where predicate values would go, and then to supply the substitution values when the statement is issued by the program. Off-the-shelf software providers know this and code accordingly. When structured dynamic SQL is to be used in your environment, parameterized statements should be a requirement of the application programmers.
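
A quick illustration of the point (hypothetical table and column names again): to DB2's dynamic statement cache, the first two statements below are two different statements, each requiring its own full prepare, while the third can be prepared once and then found in the cache on subsequent executions, with the search value supplied at run time.

  -- literals: effectively a cache miss every time the search value changes
  SELECT CUST_ID, CUST_NAME FROM CUSTOMER WHERE LAST_NAME = 'SMITH';
  SELECT CUST_ID, CUST_NAME FROM CUSTOMER WHERE LAST_NAME = 'JONES';

  -- parameter marker: one cached statement, reusable for any search value
  SELECT CUST_ID, CUST_NAME FROM CUSTOMER WHERE LAST_NAME = ?;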

Here's a new reason for considering structured dynamic SQL: the emergence of application enabling tools that map the object-centric data view of a modern application programmer to the schema-centric data view of a DBA. Through a tool of this nature, an application developer can request an object ("object" in this context meaning an informational instantiation of something - one example would be information about a particular customer), probably by calling a Web service. The request is transformed by a server-side piece of code into an SQL statement, and the desired information is returned to the requesting client program. The client program might update the object and then invoke a Web service to save the modified object. The aforementioned tool transforms the save request into the SQL statement or statements that will "harden" the data changes associated with the object modifications to the back-end database. The SQL statements sent to the data server are dynamic, and that might give some DBAs and sysprogs pause; however, by enabling business-logic tier application programmers to access and manipulate information in the form of objects, an object-schema mapping tool could significantly improve developers' ability to quickly respond to calls for new functionality spurred by market changes or new opportunities. This would enhance the organization's agility, and that might be worth the expenditure of some extra CPU cycles to process dynamic SQL statements.

The bottom line is this: programmers who request the use of dynamic SQL to boost their ability to quickly develop application functionality generally know what they're talking about, and the same is true of sysprogs and DBAs who want to handle dynamic SQL in such a way as to protect and make good use of an organization's DB2 data-serving resources. When people in these two groups make the effort to understand each others' objectives and concerns, a way can often be found that will satisfy all stakeholders. Don't stand behind outdated edicts. Dig into the issues and put your knowledge and creativity to good use.