Robert's Blog

Wednesday, October 10, 2007

What Happens to Your Data When it Gets Old?

At way too many sites, the answer to this question is, "Nothing." Application developers and DBAs spend lots of time figuring out how data in a database will be inserted, updated, and retrieved, but all too often the matter of removing data from the database ends up getting deferred until sometime after an application has gone into production, if not indefinitely. As Arnold Schwarzenegger once said in a movie: "Big mistake."

What's the problem with leaving old data (and by this I mean data that is either never referenced or referenced very infrequently) in a database? A no-delete policy (which may be de facto rather than intentional) can cause pain in several ways:
  • As the size of the database grows, data pages can become increasingly filled with rows that are rarely referenced. This can occur when 1) access frequency declines as rows age (quite common) and 2) data is not clustered by timestamp or some other continuously-ascending key (DB2 tables are often clustered so as to keep rows pertaining to a particular customer ID or product number in close proximity to each other, as opposed to being clustered in an all-new-rows-to-the-back fashion). As the old rows become more and more numerous, a query that returns some number of newer rows has to access more and more pages to assemble the result set. That, in turn, causes CPU consumption to increase (for more information on this topic, see my post of September 27, 2007, titled "The Most Important DB2 for z/OS Performance Metric").
  • Another contributor to greater page access activity (and with it, more CPU consumption): more rows in a table of course mean more leaf pages in the indexes defined on the table, and more leaf pages can eventually cause a 3-level index (root page, non-leaf pages, and leaf pages) to become a 4-level index (with a second "layer" of non-leaf pages). Just like that, the page accesses needed to traverse that index tree go up by 33%.
  • As tables grow, the CPU cost and run times of utility operations tend to increase (examples include the DB2 REORG, COPY, and RUNSTATS utilities). This is true regardless of how the data in a table is clustered.
  • More rows in the database will lead to increased demand for disk space. Yeah, disk capacity keeps getting cheaper on a per-megabyte basis, but vendors are not (as far as I know) ready to give disk subsystems away. And guess what? The cost of additional disk devices is not your only concern here. These days, a lot of executives with data center management responsibilities worry about floor space and power distribution limitations (some organizations are barely able to plug in one more "box" in their primary data centers).
  • As the database grows, the cost of disaster recovery (DR) preparedness grows. A large number of organizations have implemented near-real-time data replication solutions that keep a DR-site copy of a production database nearly current with the source database. These solutions - some utilizing DBMS-provided functionality such as the HADR feature of DB2 for Linux/UNIX/Windows, others using disk array-based replication - are great for taking the recovery point objective (otherwise known as RPO - the target for maximum data loss in the event of a disaster situation at the primary data center) down below one minute, but the larger the primary database becomes, the more expensive this DR set-up becomes.
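To make the index-growth point above concrete, here is a small sketch that estimates B-tree index levels from row count and page fan-out. The fan-out figure of 500 entries per page is purely an illustrative assumption - real fan-out in DB2 depends on key size, page size, and free space - but it shows how a growing table eventually adds a level, and with it one extra page access per index probe:

```python
import math

def index_levels(num_rows, entries_per_leaf, entries_per_nonleaf):
    """Estimate the number of levels in a B-tree index.
    Fan-out values are illustrative assumptions, not DB2 internals."""
    pages = math.ceil(num_rows / entries_per_leaf)  # leaf level
    levels = 1
    while pages > 1:  # add non-leaf levels until a single root page remains
        pages = math.ceil(pages / entries_per_nonleaf)
        levels += 1
    return levels

# With a fan-out of 500, 10 million rows fit in a 3-level index...
print(index_levels(10_000_000, 500, 500))   # 3
# ...but 200 million rows push it to 4 levels: every index probe now
# touches one more page (3 pages -> 4 pages, a 33% increase).
print(index_levels(200_000_000, 500, 500))  # 4
```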
So, getting old data out of your production database is good. But is something so easily said also easily done? Not necessarily. Even deciding on delete criteria can be tricky. Yes, it is possible that something as simple as the age of a record (meaning the time since the row was inserted into the database) could, by itself, be the record's "time to delete" indicator. On the other hand, it could be that a record, even though its age has passed a delete threshold, cannot be deleted because some related event - the delivery of an ordered product, for example - has not yet occurred. Delete logic for an application's data records can actually be quite complex, and that is why user programming may be needed and why record deletion needs to be part and parcel of an application's design, and not a post-implementation activity.
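The kind of compound delete rule described above - age threshold plus a related-event condition - can be sketched as a simple predicate. The field names below are hypothetical, not from any real schema:

```python
from datetime import datetime, timedelta

def eligible_for_delete(order, now, age_threshold=timedelta(days=180)):
    """Hypothetical delete rule: a record must be old enough AND its
    related fulfillment event must have occurred. 'inserted_at' and
    'delivered_at' are illustrative field names."""
    old_enough = now - order["inserted_at"] >= age_threshold
    fulfilled = order["delivered_at"] is not None
    return old_enough and fulfilled

now = datetime(2007, 10, 10)
# Old enough, but the product has not shipped yet - keep it.
pending = {"inserted_at": datetime(2007, 1, 1), "delivered_at": None}
print(eligible_for_delete(pending, now))  # False
```

In practice a rule like this might involve joins to several related tables, which is exactly why purge logic deserves design-time attention rather than an afterthought.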

Note that getting an old row out of a database may be a two-stage process. Some organizations will pair certain relatively large tables with associated history tables. Consider, for example, an ORDER table and a corresponding ORDER_HISTORY table. Stage 1: thirty days after a row has been inserted into the ORDER table, it is moved to the ORDER_HISTORY table. Stage 2: 150 days after being moved to the ORDER_HISTORY table, the row is removed from the database. An organization utilizing this approach might do so because it is known that information about an order is most likely to be accessed within 30 days of the order being placed. After that, the information will be accessed relatively infrequently for the next 150 days (the organization provides customers with online access to 180 days of order history information - the 30 days from the ORDER table plus the 150 days in the ORDER_HISTORY table). With the order rows split across two tables, performance is enhanced because the smaller, more active ORDER table provides better locality of reference for the most frequently accessed rows, thereby reducing page accesses and more effectively leveraging buffer pool resources.
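The two-stage flow above can be sketched with a pair of INSERT-then-DELETE moves. This is only an illustration of the pattern - it uses SQLite in-memory tables, invented column names, and a fixed "today"; a real DB2 implementation would differ in DDL syntax and would batch the moves into manageable units of work with periodic commits:

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2007, 10, 10)  # fixed "current date" for the illustration

def days_ago(n):
    return (TODAY - timedelta(days=n)).isoformat()

# Illustrative schema; column names are assumptions, not a real design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS        (order_id INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE ORDER_HISTORY (order_id INTEGER PRIMARY KEY, order_date TEXT);
""")
conn.executemany("INSERT INTO ORDERS VALUES (?, ?)",
                 [(1, days_ago(10)), (2, days_ago(60)), (3, days_ago(200))])

# Stage 1: orders more than 30 days old move to the history table.
cutoff = days_ago(30)
conn.execute("INSERT INTO ORDER_HISTORY SELECT * FROM ORDERS WHERE order_date < ?",
             (cutoff,))
conn.execute("DELETE FROM ORDERS WHERE order_date < ?", (cutoff,))

# Stage 2: history rows past 180 days are purged (or handed to an archive process).
cutoff = days_ago(180)
conn.execute("DELETE FROM ORDER_HISTORY WHERE order_date < ?", (cutoff,))
conn.commit()

print([r[0] for r in conn.execute("SELECT order_id FROM ORDERS")])         # [1]
print([r[0] for r in conn.execute("SELECT order_id FROM ORDER_HISTORY")])  # [2]
```

Note that the cutoff in stage 2 is measured here from the original order date (30 + 150 days), which matches the 180-day online-access window in the example.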

Sticking with the above example, what happens to an order record after 180 days in the database? Well, it will be purged (tossed into the proverbial bit bucket) if the information is no longer needed by the organization; otherwise, it will be archived. The idea behind archiving data is to keep it available for possible retrieval (for legal reasons and/or customer-care reasons and/or...) for a considerable period of time (usually measured in years), in a way that minimizes storage costs (think higher-density, lower-performance disk storage in a multi-tiered storage infrastructure) and provides for automated location and retrieval as needed. Doing this in a roll-your-own fashion is a tall order, especially if the database is large and archive/retrieval rules are complex (an insurance company, for example, may need to know for some years what a policyholder's coverage was on a particular day in the past). Thus the growth of the Information Lifecycle Management (ILM) segment of the software industry. IBM is a major player in this area, particularly in light of Big Blue's recent acquisition of Princeton Softech. ILM solutions are also available from several other vendors, including Symantec and EMC. You might want to consider checking these out.

Here's the bottom line: when you're working out the ways in which database records will be inserted, retrieved, and updated in support of a new application, don't neglect to consider the important matter of removing data from the database when it is no longer needed or when access frequency drops to a very low level. Data purge and archive can be a fairly complicated process, so give yourself enough runway to design or purchase an effective solution before the application is in production. For applications already in production, see if you can phase in an ILM solution - preferably one that can work across platforms and DBMSs. Take it from me: getting data that doesn't belong in your database out of your database should be a priority for your organization. Bigger is NOT always better. Get the excess bulk out of your database, OK?

