A couple of weeks ago we received a ProTop alert about a long running transaction that was threatening to bring down our customer’s ERP system. It is not a pleasant SMS to get on a Saturday night, but I knew I needed to respond quickly to prevent a crash.
All of the information I needed was in the SMS message itself. The BI file’s size had reached a critical threshold, risking a file system full event and a crash recovery nightmare that could last for hours. The alert included a list of likely suspects sorted by transaction age. In ProTop these alert enhancers are visible by hovering over the alert in the alert feed:
I had the process ID and the database user ID and was even able to determine the offending program and line of code:
This was enough to inform my client of what was happening and to obtain permission to terminate the offending process.
In a perfect, yet lawless, world the DBA can make that decision and kill a threatening process. However, doing so would be unwise. We need to understand the business impact of zapping that process in order to avoid causing more harm than good.
My colleague Paul also received the alert and noticed something odd about that process. There was a tremendous amount of reads on one of the tables which was unusual for this environment:
I opened the ProTop Realtime Monitor to investigate what this process was doing in real time, and lo and behold, it has been reading a single record, more than 13,000+ times per second and has been since it was launched earlier that day:
This was an obvious sign that the process was stuck in a loop. At this point data collection was complete so we killed the process by sending a simple SIGINT <Ctrl-C>. The BI usage almost immediately dropped back down to zero:
Disaster was averted, just like that!