Wednesday, July 2, 2014

Real Application Testing - some notes

This post is a mixture of experiences from a recent project and an overview of how to use RAT in a project.

Real Application Testing (RAT) is a licensed feature for the Enterprise Edition (EE) of the Oracle database. It was released in 11g, but can be used to test migration of databases from 9i and 10g to 11g (assuming the source databases are on a required version/patch level). The term Real is, I reckon, used to hint at a more authentic testing process: the tests are executed based on a capture from a real production system.

RAT can be used to:

  • Evaluate new hardware, including migration from one platform to something completely different.
  • Verify that SQL from today's production executes well on a newer version of the database.
  • Analyze the effect of changes in configuration.
  • Scale up the load to see how the current system copes with more of it; this works best for queries/reporting and not so well for data manipulation (DML).

A project involving RAT can be divided into these activities:

  1. Capture workload from a production system
  2. Create a clone of production system with flashback database enabled and a restore point created.
  3. Prepare for replay on the test database, with agents on application servers or other client machines.
  4. Execute replay
  5. Generate reports (RAT and AWR)
  6. Flashback database
  7. If you haven’t reached a conclusion yet, repeat from 4.

In two projects I was asked to do steps 2 to 6 and figured I should write down what I learned for posterity. (The fact that I have a friend in Belgium who reminded me about blogging more is a mere coincidence.) I will not go into much detail on steps 1 and 2 here.

Some think a long capture is a good idea, but keep the replay part in mind and the time it takes to actually execute the replay. Though the flashback of the database is impressively fast, a longer replay (with many updates) will lead to more time spent on setting the database back with flashback. If you can limit the capture to, say, two hours, that is a time saver later; but if you need to confirm scalability for one night's ETL jobs, you may need to capture several hours. Before you start the capture, make sure you have all the backups you need to duplicate the database with point-in-time recovery as close to the starting point as possible. This should not be an issue, since all production systems (IMHO) should run in archive log mode and be backed up. This is an opportunity to verify your backups.
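For reference, the capture itself is driven by DBMS_WORKLOAD_CAPTURE. A minimal sketch, assuming a directory object called CAPDIR has been created on the production database (the names and the two-hour duration are just examples):

exec DBMS_WORKLOAD_CAPTURE.START_CAPTURE(name => 'cap_2h', dir => 'CAPDIR', duration => 7200)

-- duration is in seconds; to stop the capture earlier:
exec DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE()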

For large databases it may take quite some time to transport the backup (and capture files) to the new platform and have it restored to a clone. Flashback database is a wonderful feature, since flashing back takes little time compared to a full restore of a large database. Flashback is turned on with the command:
ALTER DATABASE FLASHBACK ON;

A restore point is created with:

CREATE RESTORE POINT B4_REPLAY GUARANTEE FLASHBACK DATABASE;

The name of the restore point is your decision (within limits). You will need some space for flashback logs, depending on the amount of changes to the database during replay.
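If you want to keep an eye on the space the flashback logs consume, a query like this does the trick (sizes are in bytes):

select flashback_size, estimated_flashback_size, retention_target
from v$flashback_database_log;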

As an aside: yes, there are as usual two ways to do this. Enterprise Manager for the young and aspiring DBAs, or you can use the PL/SQL packages. Once I have routines ready for the API I tend to favour them, so I did most of the work with the packages DBMS_WORKLOAD_CAPTURE and DBMS_WORKLOAD_REPLAY, together with the package for creating AWR reports, DBMS_WORKLOAD_REPOSITORY. Sure, I had some looks at the nice graphs and reports in EM too.

Going into more detail on step 3: the result of the capture is a collection of many files that need to be transported to the new platform. On the test database, create a database DIRECTORY object that points to the file system directory where they are stored.
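For example (assuming the files were copied to /tmp/replay, as in the wrc examples later; the directory name REPLAY is also just a choice):

CREATE OR REPLACE DIRECTORY replay AS '/tmp/replay';

The first step is then to process the capture with: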

exec DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE('REPLAY')

where REPLAY is the name of the directory. The result is a new subdirectory with a name that starts with pp followed by a version number. Copy it all to the machine(s) that will host the replay agents (which actually replay the workload and send it to the database). If you are doing this on a RAC database, copy the same files to the other nodes using the same path for the directory object.

Each installation of the Oracle database and client includes the workload replay client (wrc) under $ORACLE_HOME/bin. Its task is to send SQL from the captured workload to the database. Before you start the actual replay, run a calibration. With ORACLE_HOME and PATH set, execute:


wrc mode=calibrate replaydir=/tmp/replay

This will report the recommended minimum number of clients. Too few means the replay clients cannot replay the workload fast enough. The same happens if the servers running the wrc binaries are overloaded, or if the network between them and the database has issues.

So if you have to run 10 clients, start 10 of them from a script. Each wrc process blocks until the replay is finished, so they must be put in the background:

#!/bin/bash
# start 10 replay clients in the background; each wrc blocks
# until the replay is over
for i in $(seq 1 10); do
  wrc system/welcome1@clonedb mode=replay replaydir=/tmp/replay &
  sleep 1
done

Don’t run it yet.

Now we're on to step 4. You repeat the following sequence each time you decide to repeat the replay, but the preparation up to this point is done only once. Initialize the replay with the following command, using a distinct name for the replay each time:

exec DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name => 'run_1', replay_dir => 'REPLAY')

In the middle of this we have to do some remapping. The connections to the database used during the production capture are not the same as those used during replay. After the previous command has completed, you can query the view DBA_WORKLOAD_CONNECTION_MAP to see what was used during capture. This view is empty before the INITIALIZE_REPLAY procedure is executed; afterwards it has values in the column CAPTURE_CONN, while REPLAY_CONN is empty. The first time, you have to look at the output, see how many different connections to the database you have, and figure out how to remap them. Often only one connect identifier is used (the key in TNSNAMES.ORA), and the mapping simply replaces it with the one you are now using to reach the new clone database. In that case you can use the following PL/SQL block to remap:

begin
  for i in (select m.conn_id, m.capture_conn
    from dba_workload_connection_map m, dba_workload_replays r
    where m.replay_id = r.id
    and r.name = 'run_1')
  loop
    dbms_workload_replay.remap_connection(connection_id => i.conn_id, replay_connection => 'cloneserver:1521/clonedb');
  end loop;
  commit;
end;
/
The name => 'run_1' refers to the name you gave the replay. When setting the new connection you can use either an entry from tnsnames.ora or a connect string as shown above. If you have load balancing or application partitioning where different connections to the database are used, they have to be remapped accordingly. Though you have to execute this for every replay, you only spend time on it the first time; keep the PL/SQL ready for the next replay.
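If several capture connections must map to different targets, you can branch on CAPTURE_CONN in the same loop. A sketch, with made-up service names:

begin
  for i in (select m.conn_id, m.capture_conn
    from dba_workload_connection_map m, dba_workload_replays r
    where m.replay_id = r.id
    and r.name = 'run_1')
  loop
    dbms_workload_replay.remap_connection(
      connection_id => i.conn_id,
      replay_connection => case
        when i.capture_conn like '%oltp%' then 'cloneserver:1521/cloneoltp'
        else 'cloneserver:1521/clonedb'
      end);
  end loop;
  commit;
end;
/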

After initialize, the next step is prepare (!). In this step you set various parameters that control how the replay is executed. In its simplest form:

exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => TRUE, think_time_scale=>100, scale_up_multiplier=>1)

synchronization set to TRUE is equivalent to 'SCN'; another, more advanced according to the manual, is 'OBJECT_ID'. Synchronization sometimes causes problems for the replay, and testing several times with synchronization set to 'OFF', 'SCN', or 'OBJECT_ID' may be necessary.

Later on, when a normal replay has been executed, it is always interesting to see what happens when you reduce think time or scale up the load. Note that DML, when scaled up, won't actually update the same way, so this needs some testing from you. For reporting usage, scaling up should work; note, however, that if you double the number of users, in reality they are not likely to send the same SQL as today twice as many times, that is, users usually run slightly different queries. Whether this is a problem you have to find out by analysing the reports; if a few statements are popular in an OLTP system, you can assume they will remain so after you scale up.
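For instance, a prepare that turns off synchronization and halves the think time could look like this (a sketch; the values are examples):

exec DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(synchronization => 'OFF', think_time_scale => 50, think_time_auto_correct => TRUE)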

You can also add filters for what will actually be replayed; I leave them out here, but a sketch follows after this paragraph. After you have executed the PREPARE_REPLAY procedure above, you can kick off the script on the clients, that is, the script with the wrc commands. They will start and write out a message that they are waiting for the replay to start.
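For reference, filters are grouped into a filter set and activated for the replay roughly like this (a sketch; the filter set, filter name, and user are made up, this belongs between initialize and start, and you should check the documentation for how default_action interacts with the filters):

exec DBMS_WORKLOAD_REPLAY.ADD_FILTER(fname => 'only_appuser', fattribute => 'USER', fvalue => 'APPUSER')
exec DBMS_WORKLOAD_REPLAY.CREATE_FILTER_SET(replay_dir => 'REPLAY', filter_set_name => 'my_filters', default_action => 'INCLUDE')
exec DBMS_WORKLOAD_REPLAY.USE_FILTER_SET(filter_set_name => 'my_filters')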

You can start the replay with:

exec DBMS_WORKLOAD_REPLAY.START_REPLAY()

When this is executed the replay is running, and the agents (wrc) write out a message that the replay has started.

EM has a nice page for monitoring a replay. A few times we have seen the replay lag behind the capture but catch up later, or sometimes take longer than the capture did. That can mean an error in RAT itself and not necessarily a problem with the new platform.
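If you prefer SQL over EM, the status can also be followed from the dictionary:

select id, name, status, start_time, end_time
from dba_workload_replays
order by id;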

In step 5, after the replay is finished, you generate a RAT report and also AWR reports. In DBA_WORKLOAD_REPLAYS you'll find the replay_id of the replay, and the RAT report can be generated with:

select dbms_workload_replay.report(42, 'HTML')
from dual;
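The function returns a CLOB, so in SQL*Plus I spool it to a file, roughly like this (42 is the replay_id, as above):

set long 1000000 longchunksize 1000000 pagesize 0 linesize 32767 trimspool on heading off feedback off
spool replay_42.html
select dbms_workload_replay.report(42, 'HTML') from dual;
spool off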

Generate the AWR reports as usual. If you want to compare AWR reports from the capture with the replay, you will have to import the AWR data (it is automatically exported after the capture and included in the same directory). First verify that the capture is visible in DBA_WORKLOAD_CAPTURES. If not, you can load the capture metadata from the directory with:

declare
  cap_id number;
begin
  cap_id := dbms_workload_capture.get_capture_info(dir => 'REPLAY');
  dbms_output.put_line('capture_id: ' || cap_id);
end;
/

where 'REPLAY' is the name of the DIRECTORY object where the capture is stored (GET_CAPTURE_INFO loads the capture metadata and returns the capture id). The capture(s) should now be visible in DBA_WORKLOAD_CAPTURES. The AWR data belonging to a capture can be imported with:

select dbms_workload_capture.import_awr(42, 'LUSER')
from dual;

Take note of the number returned. This is a generated DBID that you will use later to refer to the database of the capture (meaning it is a made-up DBID, different from the DBID of the database where you executed the capture). 42 above is the capture_id from DBA_WORKLOAD_CAPTURES, and LUSER is the name of a schema used for staging. Just create a temporary one if you don't have one ready.

The import can take some time. When it is finished, you can see in DBA_HIST_SNAPSHOT which snapshots are available; check for the DBID that was returned by the last command.
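Something like this, with 123456789 standing in for the generated DBID:

select snap_id, begin_interval_time, end_interval_time
from dba_hist_snapshot
where dbid = 123456789
order by snap_id;

You can now create comparison reports with: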

select * 
from table(dbms_workload_repository.awr_diff_report_html(123456789, 1, 120, 130, 734501234, 1, 10, 20));

The first four parameters are DBID, instance number, begin snap id, and end snap id, all from the first database (source/capture); the last four are the same for the second database where the replay took place. Creating these reports works pretty well in SQL Developer: use Ctrl-A and Ctrl-C to copy the HTML code, paste it into an editor, and save it. You can actually create more reports than I have shown here; see the documentation for the mentioned packages to get an idea. AWR data is automatically exported to the same directory after a replay, which means you can later import it into another incarnation of the database to compare different replays.

Step 6, flashback database, is pretty easy. Shut down the database, start it in mount mode, and execute:

flashback database to restore point b4_replay;

Then open it with:

alter database open resetlogs;

If you are on RAC, you have to shut down the other nodes and start them again after the previous command has completed.
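Before the shutdown I like to verify that the guaranteed restore point is still there and see how much space it holds:

select name, time, storage_size
from v$restore_point;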

There used to be bugs that caused the database to crash spectacularly with an ORA-600 during flashback, leaving the database useless and forcing you to do a new restore/duplicate from backup. Since I'm stressed during a project like this and don't have time to waste, I tend to avoid structural changes (like moving tablespaces around after creating the restore point) that can challenge the flashback. I am not sure if the bugs are still there, though.

Step 7: are you done yet? I usually have to run the replay several times, trying out different parameters for the procedure DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(...). This takes a lot of time; don't underestimate the project. Also remember that people will question your results.

In conclusion, try to figure out before you start what you want to prove. If you are comparing two systems, you need to agree on a common set of parameters; or are you trying to prove that a system is good enough (without comparing it to the old one)? Try to lower your ambitions or scope; with the time it takes to restore, replay, generate reports, look them over, possibly analyze further with extended trace, and start over, it may feel like a walk in the desert sometimes.