Our PeopleSoft system suddenly started showing spikes in network usage. Users were complaining of slow performance. Enterprise manager shows big spikes in network waits. Strangely these don’t seem to be reflected in the SQL. The table on the left doesn’t have as much beige/grey as the one on the right.
Lets click on network and see what the waits really are. SQL*Net message from dblink.
Ok brilliant. It is confusing why we don’t see it reflected in the SQL, but we know it is a dblink that is slowing things down.
So lets have a look in the v$active_session_hist table.
SQL>setnull<Null>SQL>select sql_id from V$ACTIVE_SESSION_HISTORY
where wait_class ='Network'and event ='SQL*Net message from dblink'and rownum <10;
SQL_ID
-------------
<Null><Null><Null><Null><Null><Null><Null><Null><Null>9rows selected.
This is really confusing. We are waiting on a database link and there is no SQL running?
Trace to the rescue
PeopleSoft uses a pool of application servers. This is what the PSAPPSRV processes are which we can see on the right in the first graph. So to see why they are slow, maybe I can trace one and it will tell me what it is waiting for.
Thanks to Tim Hall at Oracle Base we can see a large number of ways to trace a session. I picked an application server process pretty much at random and found the serial number using the command below, then started tracing.
Alter session? That shouldn’t be slow. And why was it quicker this time? And where is it going to anyway?
lsof to the rescue
Oracle has a view which tells you which dblinks a session has open: V$DBLINK. The problem is that you can only run it in your own session. So that’s useless for me. Instead I have to resort to Unix tools. I found the server process that was running the session I was interested in. Fortunately in this situation I am not using multi threaded (shared) servers.
To identify the process on the server, we need to check the database. V$SESSION has process and machine, but that for the client, in this case the application server. I want the process on the database server, for which I need to consult V$PROCESS:
SQL>select spid, traceid from v$process
where addr = (select paddr
from v$session where sid =1234);
SPID
------------------------
TRACEFILE
----------------------------------------------
5678/oracle/diag/rdbms/db/DB/trace/DB_ora_5678.trc
It also tells you where the trace file is. Handy.
After a bit of trial and error I found I could do the following to see which databases a process had open:
I can see that the process has a connection to server db1 and server db2 in addition to the apps server which I already knew about. So these are the databases it is waiting on. So why is this?
I had a look on the remote databases, and also checked with the administrator of another one of them. We couldn’t see any performance problems.
So the problem is somewhere in between.
tcpdump to the resuce!
I am a database administrator, so I am really out of my depth here. I know very little about networks. After some searching I found the following is a good way to capture a network dump for later analysis. I think this records everything that goes through the network, so there is the potential for files to get very big very quickly. Also there is potential to make performance even worse.
This shows all the communication between my database server and server db1. I get a load of lines I don’t really understand. Not terribly informative. However, I then found out that I can select statistics->TCP Stream Graphs->Round trip time
This was more interesting - a nice graph of the round trip time. It shows me what I already knew, that the communication was very slow to start with but then sped up dramatically towards the end of the trace. However, I could change the query line to db2, and it showed the same shape. Finally I changed it to app, and the graph was straight, the response was the same for the entire duration.
I am coming towards the end of what I can do, but I can infer a lot from this information.
The remote databases aren’t the problem (They are separate, so they don’t both slow down at the same time)
The local database server isn’t the problem (Because if it was the communication with the apps server would also slow down)
The switch isn’t the problem because the application server communication is OK.
In fact, thinking about it, things that go through the firewall go slow, and things that don’t are OK.
Maybe it is the firewall.
The firewall administrator checked and discovered CPU was rather high. He made a change to reduce it and all was well.
The waits disappeared and users confirmed that the performance was much better.