Sunday, October 26, 2008

Side effect

A couple of months ago I had an incident on a live production system. The obvious error was to execute a DDL statement at 9 am that should have been performed during off-business hours. On the other hand, I guess you cannot avoid ad hoc changes on a live system forever. That requires you to be aware of any problem or issue that may strike your system. Say you have an issue right now: users are complaining, or to put it bluntly, someone is losing money. You think you have the solution, but before you actually apply it you want to know whether it will work or make matters worse. Most people will run to their test system, at least to check that the syntax is right. I had verified it a long time ago and knew it would work. However, there is one problem with test systems: though we strive to make them as similar to the production system as possible, they never are. Even if you can afford the seeming waste of hardware and space, something is likely to be missing in the test system. In my case it was the users... With Oracle 11g and Real Application Testing (which I have not tried yet) the load can be better mirrored in the test system, but I can't imagine it will behave like real users.

The mistake I made was to change a rather unimportant index that was supposed to speed up a query from one specific app. What I forgot in the rush was that we had PL/SQL packages that depended on the underlying table; at the moment the DDL was executed, several connections each had exactly one failed transaction. The package state became invalid and the transactions failed with ORA-04061. Slightly embarrassing, and it reinforced three principles:
  • Pick the right moment for any change.
  • When someone is pressing on for an immediate change, take a break.
  • There is no substitute for a peer review.
We have redundancy on every level, from mirrored disks to a standby database in another city. But no second DBA.
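The failure mode can be reproduced with a small sketch (all names here are hypothetical): a session that has initialized package state gets ORA-04061 on its next call after a DDL statement invalidates the package.

create table t (x number);

create or replace package pkg as
  procedure ins;
end;
/
create or replace package body pkg as
  g_count number := 0;          -- package variable = session state
  procedure ins is
  begin
    g_count := g_count + 1;
    insert into t (x) values (g_count);
  end;
end;
/

-- session 1: establishes package state
exec pkg.ins

-- session 2: DDL on the dependent table invalidates the package body
alter table t add (y number);

-- session 1: the next call fails with
-- ORA-04061: existing state of package body ... has been invalidated
exec pkg.ins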

Wednesday, August 27, 2008

Connection Manager and ORA-12529

A simple configuration with Oracle Connection Manager (CMAN) did not work; I constantly received ORA-12529 when trying to connect with sqlplus to a database through CMAN. After checking cman.ora over and over and browsing all the trace files, I finally dived into one and found a problem with host name resolution. It turned out that the host name defined in /etc/hosts on the database server was not known outside that server. The database, I assume, did a lookup on the IP address in the local /etc/hosts and sent the host name to CMAN when registering. The fact that I had used IP addresses in the local listener.ora and tnsnames.ora did not change that behavior. A proper DNS setup would have saved me from this; the solution was to define the database server's name in the local hosts file on the server where CMAN runs.

This means that receiving ORA-12529 does not necessarily mean that something is wrong with your rules. 12529 seems to be a kind of 'catch all' error when rule filtering fails. Look for the nprffilter: entry in the trace file and start reading from there. I found this:

snlinGetAddrInfo: entry
snlinGetAddrInfo: Invalid IP address string FOO
snlinFreeAddrInfo: entry
snlinFreeAddrInfo: exit
snlinGetAddrInfo: exit
snlinGetAddrInfo: entry
snlinGetAddrInfo: Name resolution failed for FOO
snlinFreeAddrInfo: entry
snlinFreeAddrInfo: exit
snlinGetAddrInfo: exit
nprffilter: Unable to resolve to IP
nprffilter: exit
nsglbfok: no rule match or an error from nprffilter
nsglbfok: exit

(Real host name replaced and timestamps removed.) Searching for 12529, or whatever code you receive, in the trace file and reading backwards from there is also an option.
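The fix can be sketched like this (host name and address are hypothetical): an entry in the hosts file on the CMAN server that matches the name the database registers with.

# /etc/hosts on the server running CMAN
192.0.2.10    dbhost01    # the name the database sends when registering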

Tuesday, July 15, 2008

Multi-column function based index

Today I was looking at an SQL statement with slow response. The statement does a lookup on a person using 'upper(lastname) and upper(firstname)'. There is a function-based index (FBI) on the lastname column, but a check of the statistics showed it had rather low selectivity, and the CBO went for a full table scan. I don't know why it has not occurred to me before, but today it dawned upon me that it may be possible to create an index like this:

create index foo_idx on foo(bar(a), bar(b)) ;

There is an example in Tom Kyte's latest book, but I guess the tidbit got lost on its way to, or inside, my hippocampus. I have always thought of the FBI as a way to index the result of a function (which could take many arguments, but only return one value). By looking at the FBI as an index on a shadow column it becomes obvious that more than one column can be indexed this way, just like any other multi-column index.

The point is of course that each of lastname and firstname alone has rather low selectivity (at least in this part of the world), but the combination of them is not too bad.
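Applied to the case above, the idea looks like this (table and column names are hypothetical):

create index person_name_fbi
  on person (upper(lastname), upper(firstname));

-- a query written against the same expressions can use the index:
select *
from person
where upper(lastname)  = 'SMITH'
  and upper(firstname) = 'JOHN';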

Or maybe it was something else that went right; one should always supply proof these days. I reapplied the old index, with the same slow response as a result. I also gathered statistics again every time an index was created, and the results were consistent.

Monday, June 30, 2008

Table partitioning to improve stats

In an earlier post about legacy databases I mentioned that dirty data are more likely to enter the database at an earlier stage. One day while I was looking at the selectivity of a few columns, I found the values a bit strange. After some digging I noticed that most of the time the data was OK, even though a proper foreign key constraint was missing on the columns. But at some point many years ago somebody had inserted something completely different, which had clearly affected the selectivity (the columns are indexed).

Table partitioning is meant to assist with management of the data more than with tuning, and I imagined that this would be another example. By range partitioning on transaction time, dirty data would be kept in a partition separate from more recent history and thus improve the column stats. Though I haven't set up a test case (for lack of a real problem), I believe that a correct selectivity will help the CBO and avoid these border cases where the CBO is tricked into a wrong plan based on wrong statistics. That would be another example where the CBO is blamed for something that is not its fault.
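The idea can be sketched like this (table, column, and date boundary are hypothetical): range partition on the transaction date so that the old, dirty history lands in its own partition with its own statistics.

create table trans_hist (
  trans_date  date not null,
  status_code varchar2(10)
  -- other columns ...
)
partition by range (trans_date) (
  partition p_legacy  values less than (date '2001-01-01'),
  partition p_current values less than (maxvalue)
);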

This is why I don't like hints; it feels like cheating and being lazy. The CBO may have bugs, but I think you have to be as good as the author of this book to prove it; usually there is an underlying problem, like the one I encountered.

Anyway, I did a quick test by copying the table in question into a partitioned version and running dbms_stats.gather_table_stats on it. In fact the selectivity did not change much between partitions, probably not enough to make an impact on the CBO's choice of plan. But since I had used other parameters on the new table (most importantly, I think, estimate_percent=>100), the columns in the two tables had very different values for selectivity. Which reiterated a lesson I already knew: checking the stats, and the way they are collected, is important, and sometimes a simple solution will do.
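For reference, the gathering call might look like this (schema and table names are hypothetical; estimate_percent=>100 is the parameter that made the difference here):

begin
  dbms_stats.gather_table_stats(
    ownname          => 'APP',
    tabname          => 'TRANS_HIST',
    estimate_percent => 100,   -- compute, don't sample
    cascade          => true); -- gather for indexes too
end;
/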

Sunday, May 25, 2008

Another great tool: SQL Developer

I like great tools. An efficient tool helps you do a piece of work faster than you could on your own, and if the tool is made by someone close to the material, chances are it is efficient. The tool I use most is SQL Developer. It is easy to use and I promote it to every developer where I work, especially those who only need quick access to the data dictionary or want to run some simple (or complex) SQL statements. It is not invasive; unlike other tools it does not provide all the DBA features that may be useful in some cases, but that can also be rather intrusive and put a severe load on the Oracle instance by querying the internal v$ views.

Installation of SQL Developer can hardly be easier: unpack a zip file and execute the program. It asks for java.exe if you downloaded the version without Java, and asks whether you would like to migrate from a previous version. Connection can be made through JDBC, meaning no Oracle client is necessary, but if you have the client software installed, it lets you pick a connection from a list of known TNS connections.

The latest version, 1.5, adds support for CVS and Subversion, and lets you generate documentation in HTML format. Very useful for offline browsing or for putting on an intranet server.

Another tool I use is Schemester. Simple and efficient for capturing a data model from the database and visualizing the foreign keys. The site where Schemester used to be hosted seems to have been abandoned, and I have not seen much activity around it for some time now. As you upgrade or change the data model in Schemester, it can generate a "delta DB DDL script". These features from Schemester are at the top of my wish list for SQL Developer. If the status quo sets the standard, they are likely to be implemented without some of the bugs that Schemester has.

Sue Harper posts updates about SQL Developer on her blog; you won't have too many posts to catch up on, but the tidbits are absolutely useful (like when I can look forward to a new version).

Tuesday, May 13, 2008

Praise for Oracle BI Publisher

I have been working on a small-scale data warehouse project for a few months, along with many other tasks. Following Kimball's ideas on dimensional modeling and keeping it simple, I had more or less an idea of how to proceed with the construction of the database and the data model, and of how to create the reports (but only in SQL). One guy in management actually wanted to learn SQL and went ahead with SQL Developer (fantastic product, so easy to use that anybody familiar with the layout of the keyboard can use it). But I did not expect the same attitude from the rest. I have to say that I've seen many products for making glorified reports, but all of them gave me the impression of being cumbersome to install and not exactly bug free. I did not know any product very well, nor did I have the time to start testing them. Earlier I had played with Application Express and thought some reports could be created in ApEx, but I had a nagging feeling this would not cover all our needs.

After a presentation delivered by a guy from Bicon at OUGN's (Oracle User Group Norway) yearly conference, I decided to try out Oracle BI Publisher (formerly called XML Publisher). BIP is clearly an option for us. Installation is plain sailing, and afterwards you can define a few simple reports for immediate publishing. BIP comes with its own web server and can be installed on any Windows box lying around (which you then baptize The Application Server). Anybody with a web browser can indulge themselves in the reports. The security model is intuitive, with roles and users; I can easily write a report and decide who will have access to it. The BIP Desktop add-on for Windows does a good job of defining a layout. The ability to export reports in different formats guarantees that they can be published in different ways and even processed further elsewhere (e.g. in Excel). BIP has a scheduler (it requires access to an Oracle database for storing a repository), so that long-running reports (i.e. the so-far-unconverted batch-job reports that used to run on the OLTP database) can be run during the night and be ready for the early morning bird at 07:00.

If you are a DBA with limited experience outside the world of sqlplus and the like, I recommend checking out BIP. Later on, when demand for more analysis tools increases, I will dig into some other BI tools, but for a while I will spend my time creating more reports in BIP.

Monday, May 5, 2008

ORA-16047

Just a short post on this error code. It happened the last time I recreated a physical standby. I could not find more than the reference note on this error on Metalink (a note that is just a copy of the output from the command oerr ora 16047):

Cause: The DB_UNIQUE_NAME specified for the destination does not match the DB_UNIQUE_NAME at the destination.
Action: Make sure the DB_UNIQUE_NAME specified in the LOG_ARCHIVE_DEST_n parameter defined for the destination matches the DB_UNIQUE_NAME parameter defined at the destination.

The error message was misleading in my case; my mistake was that log_archive_config on the primary and the standby did not agree (actually it was not set at all on the standby). This error may show up in the alert log (it didn't every time for me), and will also show up in v$archive_dest:

select dest_id,status,error
from v$archive_dest;

If this is your case, check log_archive_config on all nodes.
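A matching configuration might look like this (the DB_UNIQUE_NAMEs prim_db and stby_db are hypothetical); the point is that the same dg_config list is set on every node:

-- on both the primary and the standby:
alter system set log_archive_config =
  'dg_config=(prim_db,stby_db)' scope=both;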

Tuesday, April 29, 2008

Create a standby database with RMAN

After the incident with missing forced logging in the primary database, the logical standby had to be created again. Compared to what is normal these days the database is not huge, but the standby database resides on a server in a different city and the link between them isn't exactly a digital super highway. I also had to restrict the bandwidth consumed during transfer with secure copy (scp -l 2500 restricts it to 2500 Kbit/s), or else connections to the primary database might suffer.

RMAN provides the command duplicate target database for standby, but you don't have to use it in order to create a physical standby. Actually, some prefer not to use the duplicate command: the target database that RMAN needs to be connected to while the duplicate process executes is the source database, which may be your main production database. If it takes some time to finish and you forget that your target is not the new database (the clone) but the primary, and you then do a shutdown immediate... well, you probably make that mistake only once.

Anyway, you need a backup of the database and of the controlfile in standby format. In RMAN:

run {
allocate channel c1 device type disk
format '/s01/backup/foo/%U' maxpiecesize 10G;
backup current controlfile for standby
format '/s01/backup/foo/standby.ctl';
backup database;
}

I use the MAXPIECESIZE option in order to limit the size of each file; it makes everything more manageable. Another useful option is RATE, which limits how much is read per second from the database files. I did not use it, since RMAN did not put a severe load on our SAN while the backup was running. The next three days were spent copying the files to the other server.

The instance for the physical standby was started in nomount, and with RMAN connected to the soon-to-be physical standby as target, the restore was done like this:

restore standby controlfile
from '/s01/backup/bar/standby.ctl';
alter database mount;
catalog start with '/s01/backup/bar/ld';
crosscheck backup;
restore database;

The path in the catalog command is just a common prefix for the backup pieces; the files started with ld. The restore actually finished several days after the backup was started on the primary, so the physical standby would have to apply many archived logs to catch up with the primary. Here is where another useful feature of RMAN comes to the rescue: incremental backup from a specific SCN. With the standby database at least mounted, run the following query to find the SCN of the physical standby:

select current_scn from v$database;

Then on the primary start an incremental backup in RMAN:

backup incremental from scn 42 database
format '/s01/backup/foo/incr%U';

42 might not be your SCN (it wasn't ours either); I used the SCN from the query minus one (a small overlap, I didn't bother to check the manual). The incremental backup was of course much smaller and took less time than a full backup to complete and transfer. After the backup was transferred to the other server, I executed an incomplete recovery:

catalog start with '/s01/backup/bar/incr';
recover database noredo;

This does not take much time to complete, and the physical standby is then not far behind the primary. Normal apply can now be started with:

alter database recover managed standby database disconnect;
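Once apply is started, one way to check that media recovery is actually progressing is to query v$managed_standby on the standby (the filter on the MRP process is just one possible approach):

select process, status, sequence#
from v$managed_standby
where process like 'MRP%';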

Sunday, April 20, 2008

Understanding a legacy database

Any database that has gone into production is a legacy database, according to Tom Kyte. I agree, but some databases give me associations to legacy more often than others. In a post on the OraStory blog, Dom complains about companies that hire contractors for a maximum of one year; given that it takes half a year to understand the business and the data, you hardly get to show what you can deliver.

Understanding the business is one thing, but how it is projected into the database is another matter. With a crash introduction to SQL, any developer can create a few tables and start dumping data. Everyone working with the code knows the connections, but lack of documentation, no constraints, and apparent inconsistency in the data make it difficult for the newly hired DBA to understand what is going on, or what is supposed to be going on. And the agile movement doesn't make it easier; documentation is not exactly given high priority ("...the primary goal of software development is to create software, not documents" - Ambler; Agile Database Techniques, 2003, p. 8). In places where more attention has been given to the applications than to the database, the database can easily become a hard-to-manage legacy system. The characteristics are tables with few or no constraints (besides the primary key on an id column with values from a sequence, of course), tables with similar column names but different content and meaning for the same column name, the same data stored twice or more (and not in sync), the same column used again in another table but with a different type, missing data (not null constraint missing), and so on.

If you have a tuning problem, adding constraints can help, but how can you add a constraint if you don't know the data model or the connections in the data? Adding a unique index can certainly speed things up, but what if a column has 99% unique data: is that an error in the data or just coincidence? I have found the following techniques and strategies useful for understanding the data model:
  • Ask questions. Some like to talk about the code they wrote, but there is a limit to how many interruptions are welcome.
  • Talk to staff likely to be in need of reports and find out if they use reports from the database; then trace the reports back to the database, i.e. to the actual SQL used. Sometimes there is an intermediate data dump that is later imported into Excel for further processing. By looking at each step and asking for an interpretation of the report (if it is not obvious), you may see some new connections.
  • Views in the database are useful for storing complex queries, hiding access to columns, creating a standard interface to a set of tables, and more. But I have seen views defined as select col1 newname1, col2 newname2, ... from foo; i.e. the only change from the underlying table is new column names that relate to reports or to a term used outside the database.
  • Stored procedures and packages are more likely to have comments than tables and views. If some strange logic is applied in the source, it motivates a question and may reveal something about the data model.
  • Anticipate dirty data (if constraints are few). If the quality of the applications is reasonably good, most of the data may be fine, but there will be some orphan rows, illegal values, etc. Doing some digging with simple select statements, counting and looking for seemingly missing parent values (as per an imagined business rule, since the constraint is not defined), may reveal a set of dirty data that should be looked into. I have used this as a pretext to ask about the tables ("Is it not so that ... is supposed to have ... 'cause I have seen some ... that does not have a corresponding ..."). This may be an opportunity to suggest cleaning up the data and adding constraints (because management understands the value of the data and the danger of not being able to trust it). Invalid data are more likely to have entered early in the history of the database ("well, some testing was done on the production database"), or because the data model evolved with new columns added or business rules changed at an early stage.
  • Besides finding corrupt data, researching the statistical distribution is useful for tuning purposes and for identifying other constraints. When a column with a discrete set of distinct values has relatively few nulls, I suspect a candidate for a not null constraint.
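The digging for orphan rows can be sketched like this (table and column names are hypothetical; the foreign key is imagined, not declared):

-- child rows whose parent_id has no matching parent row
select count(*)
from child c
where c.parent_id is not null
  and not exists (
    select 1
    from parent p
    where p.id = c.parent_id
  );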

I wish the Oracle blogosphere would discuss database refactoring more often. The biggest challenge is to identify what can break elsewhere when you try to fix wrongdoings (impact analysis is what they call it). It is not trivial. Do people undertake this, or is it sometimes a relief when the contract expires after twelve months so you can move on?

Monday, April 14, 2008

Data Guard and forced logging

When you start with Oracle Data Guard you have three sources of useful information: the Data Guard Concepts and Administration manual, which is quite good; other people's experience; and, after some time, your own. The manual does not cover all the mistakes one can make, nor what many really don't use (like the Data Guard broker; not everybody would leave switchover, or worse, failover, to a robot).

After a planned switchover to the physical standby the following ora-600 error message was received on our logical standby:

ORA-00600: internal error code, arguments: [krvxbpx20], [1],
[34735], [60], [16], [], [],[]

SQL Apply terminated after this error, and neither restarting SQL Apply nor bouncing the instance helped. On Metalink the only relevant information I found was Bug 6022014, which has status "Closed, not a bug". An ORA-600 error that is not a bug... well, if it's my fault, a decent error message would help. An SR was opened with Oracle Support; they suspected, quite rightly, that supplemental logging was not enabled on the primary, and asked me to run the following query on the new primary:

select force_logging,
       supplemental_log_data_min,
       supplemental_log_data_pk,
       supplemental_log_data_ui,
       supplemental_log_data_fk,
       supplemental_log_data_all
from v$database;

Which returned NO in every column.

Suddenly our logical standby was not my biggest concern. How come the primary does not run with forced logging? According to the manual this is one of the first steps when preparing for Data Guard; it is mentioned in section 3.1.1 (10g), and from table 3-1 it looks like a requirement. The manual doesn't tell you to execute it again for the new physical standby, neither before nor after the first switchover. It surprised me that it is possible to successfully execute a switchover without the new primary running with forced logging. Though I have not tested it, I imagine that the clone for the first physical standby was made before forced logging was enabled in the production database, and that if I had enabled it in the order given in the manual, the setting would have followed the copy of the controlfile (in standby format) used to make the physical standby. At least that is the only explanation I have.

The solution was simple, just turn on forced logging with:

alter database force logging;

In an OLTP database with many ongoing transactions this statement can take some time to complete, but it will finish quickly when activity is low.

Just in case you don't know: forced logging makes sure nobody can create a table, index, or other object without logging everything to the redo log, i.e. the NOLOGGING option in create statements is ignored. Checking that every node runs with forced logging should be part of a regular review of your Data Guard configuration.
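A simple check that could be part of such a review, run on every node:

select force_logging from v$database;
-- should return YES on every node in the configuration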

Our logical standby must be recreated since many transactions have been lost, but that's OK now that a potentially bigger problem has been avoided. Why do we have both a physical and a logical standby? The former provides the fastest and safest switchover, or failover if necessary. The latter is used for off-site recovery options and as a source for our data warehouse. Remarkably stable and easy to set up once you have Data Guard with a physical standby in place.

Sunday, April 13, 2008

First post...

A Sunday morning is a perfect moment for starting a blog. Not sure where this is going, or if it's going anywhere. The other day I found a solution to a problem I thought others may have encountered, and I figured my own blog would be a place to write it down. At least I can find the solution again if I make the same mistake twice.

This blog will mostly be about stuff related to Oracle (the database, that is, not the company). Posts for fellow DBAs and maybe developers (I'm a hybrid now, working as a semi-DBA among developers...)

Until more, take care.