Diagnosing Performance Issues
(Part 1 of 8192)
Picture a Rubik’s cube in your mind: at the core is the holy grail of system performance, and every colored tile is a potential bottleneck preventing you from attaining that perfection. On one side you find database issues, on another the network, on a third the physical servers, and mixed throughout is plain old bad code. Today we’ll focus on a few big red flags related to your physical system resources: CPU, memory and disk.
My CPUs are on FIRE!
When it comes to CPUs, you need to know one thing and check three:
- Is your server running on bare metal or is it virtualized?
- Are a handful of processes gobbling up all your CPU?
- If no processes stand out, is system CPU usage higher than 20-30%?
- Is top/glance/perfmon/nmon reporting reasonable CPU usage, but everything is still slow?
A few guilty gobblers
These are the easy ones. If you are consistently seeing a handful of OpenEdge client processes (_progres, _proapsv) consuming massive amounts of CPU, the culprit is almost certainly bad code. Use the client statement cache or run $DLC/bin/proGetStack to find out what code is being executed.
Lots and lots of processes
When system CPU usage is high, it means that the machine is spending an inordinate amount of time doing internal busywork. The most likely culprit is that there are simply too many processes running on too few CPUs. Use vmstat to look at the run queue. If it’s greater than four times the number of CPUs then you probably need more CPUs.
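The run queue rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, it assumes you have captured vmstat output as text, and it assumes the run queue is the first column (r), as it is on Linux and AIX.

```python
import os


def run_queue_overloaded(vmstat_text, cpu_count=None, factor=4):
    """Return (overloaded, peak_run_queue) for captured vmstat output.

    Assumption: the run queue is the first column ("r") of each data row.
    The factor of 4 is the rule of thumb from the text, not a hard limit.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    run_queue = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Data rows start with an integer; header rows do not.
        if fields and fields[0].isdigit():
            run_queue.append(int(fields[0]))
    peak = max(run_queue, default=0)
    return peak > factor * cpu_count, peak
```

Feed it the text of something like "vmstat 5 12" captured during the slow period. A single spike is less interesting than a run queue that stays above the threshold sample after sample.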
Mystery Bad Performance
In a virtualized environment, your server might seem to be behaving well, but performance is still horrible. The first place to look is the hypervisor layer, and the easiest way to check is the bogoMIPS graph in ProTop (available for free at http://protop.wss.com). If you see huge swings in bogoMIPS, chances are the physical box is not giving your server all the promised CPU cycles.
In the example to the right, the client was complaining about horrible performance following a system reboot. A quick glance at the ProTop bogoMIPS graph clearly showed a drop in CPU performance, and a discussion with the VMware administrator uncovered that the virtual machine had been migrated from one physical host to another.
All Your Memory Are Belong to Us
Stop. Don’t panic. It’s ok if your server is consuming 100% of memory. All modern operating systems will assign unused RAM to the file system cache and give it back as soon as something more important needs it. What you really need to know is how much RAM is being used by applications. Scratch that, you only need to know one thing: Are you paging to paging space?
Run vmstat and make sure the si and so columns are always zero. Windows has an equivalent screen in PerfMon.
   procs                    memory     swap        io      system         cpu
 r  b  w   swpd  free  buff  cache  si   so   bi   bo    in    cs  us  sy  id
 . . .
 1  0  0  13344  1444  1308  19692   0  168  129   42  1505   713  20  11  69
 1  0  0  13856  1640  1308  18524  64  516  379  129  4341   646  24  34  42
 3  0  0  13856  1084  1308  18316  56   64   14    0   320  1022  84   9   8
If not, you have a BIG problem.
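If you would rather not eyeball the columns, here is a minimal Python sketch of the same check. The function name is made up for illustration, and it assumes Linux-style vmstat output in which a header row names the si and so columns.

```python
def swap_activity(vmstat_text):
    """Return (si, so) pairs from vmstat data rows where either is non-zero.

    Assumption: a header row containing "si" and "so" appears before the
    data rows, as in Linux vmstat output.
    """
    si_idx = so_idx = None
    offenders = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "si" in fields and "so" in fields:
            # Column header row: remember where si and so sit.
            si_idx, so_idx = fields.index("si"), fields.index("so")
            continue
        if si_idx is None or not fields:
            continue
        if not all(f.isdigit() for f in fields):
            continue  # skip group headers and separator lines
        if len(fields) <= max(si_idx, so_idx):
            continue  # truncated row
        si, so = int(fields[si_idx]), int(fields[so_idx])
        if si or so:
            offenders.append((si, so))
    return offenders
```

An empty list means no paging was seen in the sample. Run against the three data rows shown above, it returns all three rows, which is exactly the BIG problem case.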
Fortunately, the most likely culprits are database broker processes and/or Java, so concentrate your investigative efforts there first. On Linux, use the pmap command to calculate how much memory each process is using. On AIX, use svmon. Other operating systems have similar tools. Do not rely on the memory utilization reported by ps or Task Manager, as none of the figures these tools report corresponds directly to the memory used by a process. You will likely end up counting the same memory twice and counting disk-backed memory usage (file system cache) that should not be counted.
Rust Rust Spinning Rust
The first thing that I’m going to say is that SAN/NAS/iSCSI disk arrays are accountant tools. As my partner is fond of saying, there is no such thing as a high-performance SAN. Companies deploy shared disk arrays to optimize usage and reduce costs, not to increase performance. As for the different RAID levels, any solution that includes some variation of a parity bit is going to be suboptimal for writes.
Diagnosing disk performance issues is not particularly difficult, but there are a hundred different things that could be going wrong. Start with this very simple write performance test:
proutil sports -C truncate bi -bi 16384
time proutil sports -C bigrow 2 -zextendSyncIO
Note: use -zextendSyncIO in version 11 only.
If the bigrow takes more than 10 seconds, your disk subsystem is an accounting wonder and a performance nightmare. A local SSD drive will take less than 1 second to complete this command.
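If you want to repeat the test regularly, a small Python wrapper can time the command and apply the thresholds above. This is a sketch only: the function names and the label for the middle band (between 1 and 10 seconds) are my own assumptions; the article only defines the two extremes.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def rate_bigrow(seconds):
    """Classify a bigrow timing using the thresholds from the text."""
    if seconds < 1:
        return "local SSD class"
    if seconds <= 10:
        return "middle of the road"  # assumed label, not from the article
    return "performance nightmare"
```

For example, rate_bigrow(time_command(["proutil", "sports", "-C", "bigrow", "2", "-zextendSyncIO"])) times the second command above. Remember to truncate the bi file first, exactly as in the manual test, or the bigrow has less work to do and the number means little.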
On Linux, tools like iotop can show you disk I/O by process. If it’s a database process doing copious reads, sample database reads, physical reads and database buffer hits during the period of excessive I/O. The problem could be as simple as an inadequate -B parameter or there could be some suboptimal queries being executed. ProTop can show you which process is responsible for the massive reads, pinpointing the table and index being abused and the abusing line of code.
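When sampling reads, what matters is the change between two samples rather than the totals since startup. A minimal Python sketch, assuming you have noted the cumulative logical (database) reads and physical (OS) reads at the start and end of the slow period; the function and parameter names are hypothetical, and the formula is the usual hit ratio, not something specific to any one tool.

```python
def buffer_hit_percent(before, after):
    """Buffer hit percentage over a sampling interval.

    before and after are (logical_reads, physical_reads) tuples of
    cumulative counters taken at the start and end of the interval.
    Returns None when there were no logical reads in the interval.
    """
    logical = after[0] - before[0]
    physical = after[1] - before[1]
    if logical <= 0:
        return None
    return 100.0 * (logical - physical) / logical
```

A result that looks healthy in percentage terms can still hide a huge absolute number of physical reads, so look at both figures before deciding whether -B or the queries are to blame.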