Corruption and Crash Protection, Courtesy of After-Imaging

No one wants to think about a database crash, and even less about losing data, and yet every month or two I get a call from an IT Director in exactly that situation. Their OpenEdge system has been down for hours and stays down for hours more while we do our best to perform emergency surgery. In the end the customer is left with a stitched-up database and a post-mortem revealing that all of this could have been avoided if they had simply followed the advice in this blog and protected their OpenEdge database with after-imaging.

Protecting against lost data

Data protection consists of two equally important components: backups and after-image (AI) archives. Almost everyone understands backups: if there’s a problem, they get you 95% of your data back, up until the time of the last backup. After-image archives take you the last mile: they contain the detailed changes that were applied to your database after that backup. Think of them as a recording that you can play back (we call this “rolling forward”) on top of your restored DB. All the recorded changes are applied to the restored database in the same way they were applied the first time around.

Why are they equally important then? Because the 95% is historical data: stuff you already built, shipped and invoiced to your customers. Losing historical data is bad, but losing work-in-progress (WIP) data is arguably worse. If you lose WIP data, chances are you’ll have to deal with some very angry customers!

So how do you protect against WIP data loss? You set the granularity of your AI archives, meaning how often a filled or partially filled AI file is archived, to an interval that makes sense for your business. We like to use 15 minutes, meaning that the business can tolerate losing up to 15 minutes of database changes if there is a catastrophic loss of the database disks.

What about multi-database systems?

If your OpenEdge environment consists of multiple databases, realize that the 15-minute granularity could translate to 29 minutes of data loss. You have to find the latest point in time in the archived AI files to which you can roll forward all your databases. Typically you start all your databases at the same time, meaning that if the archive interval is set to 15 minutes, all the AI file archivers will act within a few seconds of each other. But if for some reason you restarted one of the databases or one of the archivers, your archives might be out of step with each other.

Luckily the roll forward utility has a “roll forward until” option. Once you identify the latest point in time common to all the AI archives, simply roll all the restored databases forward to that common point in time.
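To make the “latest common point” idea concrete, here is a minimal Python sketch. It assumes you have already worked out, by hand or with your own tooling, the end time of the newest AI archive available for each database; the database names and timestamps are made up for illustration. The colon-separated output format is what I understand the roll forward utility expects for its “until” time, so check the documentation for your OpenEdge release before relying on it.

```python
from datetime import datetime

def latest_common_point(last_archived):
    """Return the latest time every database can be rolled forward to.

    last_archived maps a database name to the end time of the newest
    AI archive available for it (hypothetical input, not read from OpenEdge).
    """
    if not last_archived:
        raise ValueError("no databases given")
    # The database whose newest archive is the oldest limits all the others.
    return min(last_archived.values())

def as_endtime(ts):
    # Assumed format for the roll forward "until" time: yyyy:mm:dd:hh:mm:ss
    return ts.strftime("%Y:%m:%d:%H:%M:%S")

# Illustrative values only.
archives = {
    "sales": datetime(2024, 3, 1, 10, 45, 2),
    "inventory": datetime(2024, 3, 1, 10, 45, 5),
    "finance": datetime(2024, 3, 1, 10, 38, 40),  # archiver restarted, out of step
}
common = latest_common_point(archives)
print(as_endtime(common))
```

In this example the finance database holds everyone back: its newest archive ends several minutes before the others, so all three databases are rolled forward only to that time.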

How to make AI archives [almost] worthless

Oh, this really burns me… If you are set up this way, please don’t tell me. Fix it quietly and pretend it never happened.

Some of you out there are backing up your databases and archiving AI files to the same disk subsystem as your production database. Even worse, some of you are only archiving AI files once per day, just before the backup. Please understand that this is almost useless. If only the DB file system gets corrupted, then yes, you can use these files. But if you lose the whole disk subsystem to a SAN crash or a ransomware attack, then you’re toast. You have nothing.

As soon as the backup completes, or the AI archiver archives an AI file, get it off the box. Get it out of the data center. If you can, get it out of the city. This is your lifeline in case of a problem with your database.
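As a rough illustration of “get it off the box”, here is a minimal Python sketch of a separate copy step that could run from a scheduler every few minutes. The function name and both directory arguments are hypothetical, and it assumes the files in the local archive directory are complete (the archiver has finished writing them) and that the destination is reachable as an ordinary path. In practice you might well use rsync or scp instead; the point is only that shipping is a step separate from archiving.

```python
import shutil
from pathlib import Path

def ship_archives(local_dir, remote_dir):
    """Copy files not yet present in remote_dir; return the names copied."""
    local, remote = Path(local_dir), Path(remote_dir)
    remote.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(p for p in local.iterdir() if p.is_file()):
        dest = remote / src.name
        if dest.exists() and dest.stat().st_size == src.stat().st_size:
            continue  # same name and size already at the destination
        tmp = remote / (src.name + ".part")
        shutil.copy2(src, tmp)  # copy under a temporary name first
        tmp.replace(dest)       # rename so a partial copy never has the final name
        copied.append(src.name)
    return copied
```

Copying to a temporary name and renaming afterwards means anything reading the remote directory only ever sees complete files under their final names.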

About that AI Archiver

I have two pieces of advice with respect to the AI archiver:

  1. Do not archive directly to a network shared drive. It sounds like a great idea: archive directly to your DR server in your secondary data center. But in reality, I have seen this configuration freeze a database, forcing us to crash it and restart it. Archive to a local file system, then use a separate scripted process to copy the files to the remote location.
  2. Monitor the AI files and the AI Archiver. At the very least, you want to verify that the AI archiver process is a) running; and b) archiving, and that most of the AI extents are empty. With ProTop, you can monitor and alert on empty, full and locked (OE Replication) AI extents.
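If you have no monitoring in place at all, even a crude check is better than nothing. The Python sketch below is an illustration only: it looks at the local archive directory and flags the archiver as possibly stalled when the newest file is older than a couple of archive intervals. The function names, the 900-second default and the slack factor are all assumptions, and it says nothing about empty, full or locked extents, which need a database-aware tool.

```python
import time
from pathlib import Path

def newest_archive_age(archive_dir, now=None):
    """Seconds since the newest file in archive_dir was modified, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(archive_dir).iterdir() if p.is_file()]
    return (now - max(mtimes)) if mtimes else None

def archiver_looks_stalled(archive_dir, interval_seconds=900, slack=2.0, now=None):
    """True if no archive has appeared within slack * interval_seconds."""
    age = newest_archive_age(archive_dir, now)
    # An empty directory is treated as a problem, not as "nothing to report".
    return age is None or age > interval_seconds * slack
```

Wire the result into whatever alerting you already have (email, pager, chat) so that a silent archiver gets noticed within half an hour rather than during a recovery.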

Safety first

No matter the situation, your first responsibility is to protect the data in your database. Make sure that a) you’re using AI; b) you’re using it correctly; and c) you’re shipping these crucially important backup files and AI archives to a distant location, safely tucked away on a server that won’t be affected by problems at your primary data center.

If you have any questions, or would like one of our experts to audit your OpenEdge environment, please don’t hesitate to reach out to us at WSS.

 
