Sunday, October 26, 2008

Side effect

A couple of months ago I had an incident on a live production system. The obvious error was to execute a DDL statement at 9 am that should have been performed during off-business hours. On the other hand I guess you cannot avoid any ad hoc change on a live system forever. That will require you to be aware of any problem and issue that may strike your system. Say you have an issue right now, users are complaining, or to put it bluntly, someone is losing money. You think you have the solution, but before you actually apply it you want to know if it will actually work or make matters worse. Most people will run to their test system, at least to check that the syntax is right. I had verified it long time ago and knew it would work. However, there is one problem with test systems. Though we strive to make them as similar to the original production system as possible, they never are. Even if you can afford the seemingly waste of hardware and space, something is likely to be missing in the test system. In my case it was the users... With Oracle 11g and Real Application Testing (which I have not tried yet) the load can be better mirrored in the test system, but I can't imagine it will behave as real users.

The mistake I did was to change a rather unimportant index that was supposed to speed up a query from one specific app. What I forgot in the rush was that we had other pl/sql packages that depended on the underlying table; at the moment the DDLs were executed several connections had exactly one failed transaction. The package state became invalid and the transactions failed with ORA-4061. Slightly embarrassing, and it reinforced three principles:
  • Pick the right moment for any change.
  • When someone is pressing on for an immediate change, take a break.
  • There is no substitute for a peer review.
We have redundancy on every level, from mirrored disks to a standby database in another city. But no second DBA.