From the Vault: High Availability and Disaster Recovery with OnBase

The following is a text transcription of the above webinar. Use it to follow along and keep your place as you watch, or if you’d rather read than listen!

0:00 My name is Amy Halperin and I am the trainer and one of the solution engineers here at KeyMark. And I’ve been around a while, and I’m quite well versed in OnBase. And I am a massive, massive nerd. I’m going to be talking to you today about high availability and disaster recovery, specifically as it applies to OnBase. So let’s get going. 

0:32 Today, we’re going to be talking about terminology, planning, and strategies, both for the platform you’re using and for OnBase specifically. So with terminology, what exactly constitutes a disaster? That’s anything sudden that brings down your system, basically. I mean, there’s a technical definition on your screen, but a disaster can be a lot of different things. It can be anything from a tornado to a hurricane; those of us who live here in the southeast, we know from hurricanes. It can be a flood. A blackout is a disaster. I mean, if you were around in 2003, you remember the blackout in the northeast, the cascading blackouts. I lived in Cleveland at the time, and it actually started there. My understanding is that an overload at one Cleveland station hit another substation that couldn’t handle it. And because the infrastructure didn’t have sufficient backups, it just cascaded and took down the entire eastern seaboard. So seven hours after the blackout, from space, you can see how little light there was left in the eastern seaboard.

2:19 I mean, this is a very funny slide, I think: Windows did not shut down successfully. If that happens in the middle of something important, that’s a disaster. For the people who were trying to get something accomplished, that constitutes a disaster. Something as simple as corroded wires that bring down your main server, that’s a disaster. A flood that takes down your server, just a local flood, not something that floods the whole area: if the pipes in your basement leak and flood out your servers, that’s a disaster. High winds, tornadoes, those are disasters. If your one router goes down, and your app server can’t talk to any of your clients, you’ve just experienced a disaster. And of course, there are very personal disasters.

3:21 High availability, in its simplest sense, is your system’s ability to stay up even if something goes wrong. Generally speaking, that’s accomplished by having fault tolerance and/or load balancing, and we’re talking about both of those this morning. Disaster recovery is different from high availability. Disaster recovery is about how long it takes you to get back up if you go down. So high availability is preventing you from going down when something bad happens in the first place. Disaster recovery is getting you back up after you go down, when high availability isn’t enough to keep you up. It doesn’t matter how good your high availability system is. If you get hit by a disaster of the scope of Katrina, your system is going to go down for a few days. Even if you have battery backup that is good for 24 to 36 hours, if the entire power grid goes down for three weeks, you’re going down. Disaster recovery is the process of getting back up to speed after that. So that’s the difference between them. High availability is staying up as long as possible, and preferably not going down at all. Disaster recovery is getting your data and your operations back up to speed as quickly as possible after you go down. And there are a couple of different ways you can do both of these.

5:10 High availability will use a redundant array of inexpensive disks (RAID), server clusters, and load balancing. Disaster recovery relies mostly on backup and restore, and sometimes on a cold or warm standby site. Disaster recovery is generally easier and less expensive to implement, but it doesn’t guarantee or assist with keeping a high availability structure going. Having good backups is not the same as having high availability; backups give you disaster recovery, not high availability. Conversely, high availability is more expensive and harder to implement, but it can help with disaster recovery, depending on how it’s implemented. So while disaster recovery is not really going to give you the high availability you want, high availability may really help you with disaster recovery.

6:17 Redundancy is when you include extra components. You don’t really need them in your day to day operation, but you have them in case your day to day operation fails. So if your SQL Server box goes down, and you have another redundant server you can fall back to, that’s an example of redundancy. Monitoring is just keeping track of what’s going on in your system. It’s any process that keeps track of it, whether you have emails that go out when something fails to happen, or when you reach a certain level: if your disks get to a certain level of saturation, you send a warning out. Or if you reach a certain level of traffic, you send a warning out. That’s monitoring. Availability is a characteristic of a system; it’s what we call the length of time a system is operational.
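
The disk saturation warning described above can be sketched in a few lines of Python. This is only an illustration: the function names and the 85% threshold are assumptions, not anything from OnBase or the webinar, and a real setup would send the message through email or an alerting tool rather than printing it.

```python
import shutil
from typing import Optional


def disk_saturation(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_warning(path: str, warn_at: float = 0.85) -> Optional[str]:
    """Return a warning message when usage reaches `warn_at`, otherwise None."""
    saturation = disk_saturation(path)
    if saturation >= warn_at:
        return f"WARNING: {path} is {saturation:.0%} full (threshold {warn_at:.0%})"
    return None


if __name__ == "__main__":
    # Hypothetical check of the current drive; 0.85 is an assumed threshold.
    print(disk_warning(".") or "Disk usage is below the warning threshold.")
```

Run on a schedule, a check like this is the simplest form of the monitoring being described: nothing happens until a level is reached, and then a warning goes out.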

7:22 We’re going to get into some math stuff here, don’t freak out on me. Availability is generally written Ao, and the way it’s done is total time minus the downtime, divided by the total time. That’s how you figure out what your available time is. So you take the total time available: if you are up 24 hours in a day, you’ve got 24 hours in a day, we’ll just talk about a single day, and you’re down for two minutes. So your total available time, then, is 24 hours minus two minutes. So that’s 24 hours, I’m actually sitting here with my calculator in front of me: 1,440 minutes minus two … 1,438 minutes, divided by 1,440. So your Ao, your availability, is .9986 in that case. You have to figure out what your tolerable downtime is, how much downtime you can live with. And then you can figure out what your availability needs to be.
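
The arithmetic above, worked as a short Python sketch. The function name is made up for illustration; the numbers are the ones from the example (one day, two minutes down).

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (total time - downtime) / total time."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes


minutes_per_day = 24 * 60               # 1,440 minutes in a day
a_o = availability(minutes_per_day, 2)  # down for two minutes
print(f"{a_o:.4f}")                     # prints 0.9986
```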

9:04 So we’re talking about the nines here. And here’s a good chart for it. At 99% uptime, you’re down 3.65 days of the year. In a week, you’re down one hour and 41 minutes. At 99.9% up, you’re down less than nine hours in a year, and in a week, you’re down 10 minutes. If you take that out to two decimal places, 99.99%, you’re down less than an hour a year, about a minute in a week. Take it out to three decimal places, 99.999%, and you’re down less than six minutes, just over five minutes, in an entire year. Six seconds in a week. What does that really look like for you? Well, what it comes down to is, how expensive is that to maintain? And this is what we’re talking about: the more nines you have, the harder and more expensive it is to implement. And that’s what you have to keep in mind. If you really need that kind of five nines uptime, you’re talking some serious cash. And that may be what you require. I mean, you may be a hospital, you may have lives on the line. And you may need that kind of redundancy, but you’re going to have to invest.
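
The chart can be reproduced with a small calculation. The sketch below assumes a 365-day year and uses illustrative names; it simply multiplies the unavailable fraction by the length of the period.

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_WEEK = 7 * 24


def downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at a given availability percentage."""
    return (1 - availability_percent / 100) * period_hours


for nines in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(nines, HOURS_PER_YEAR)
    per_week_minutes = downtime_hours(nines, HOURS_PER_WEEK) * 60
    print(f"{nines}%: {per_year:.2f} hours/year, {per_week_minutes:.2f} minutes/week")
```

At 99% this works out to 87.6 hours a year, which is the 3.65 days quoted above. Each extra nine cuts the allowance by a factor of ten, which is why each one costs so much more to achieve.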

10:45 So there are a couple of other things we need to talk about. And we’re still in terminology here. Hot standby is when you have another system that is immediately available. You don’t really lose any data or work. As soon as your system goes down, you’ve got another system that is completely redundant; you may lose like 5-10 minutes of work at the most. But you can immediately switch over to this other system and just keep working as if nothing happened. A warm standby is another system that’s prepared to take over, but you may lose just a little bit of work, maybe a couple of hours, at the most a day. A cold standby is another system that’s ready to be started, but there’s some restoration of backups required. And depending on your last backup, you could lose a day, you could lose more than a day of work. Failover is switching to a redundant or standby system. That system is already there and the switch happens pretty much automatically. There’s nothing you really need to do. This is an automatic process that your system will do for you once something abnormal occurs.

12:13 So again, we’re going to talk about SQL Server here because it’s a good example. If your main SQL Server fails for some reason, if your system can’t talk to your main SQL Server, it automatically fails over to the other one in your cluster. Another good example, a great example actually, of failover is the disk groups inherent to OnBase. If you have disk group copies, copy two is your failover copy. If it can’t reach copy one, that’s what copy two is for. It goes, oh, I can’t find copy one, I’m going to go look at copy two. And it automatically fails over to copy two, and you never know the difference.
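
The "can’t find copy one, go look at copy two" behavior can be sketched generically. This is not OnBase’s implementation, just an illustration of the failover idea; the function name and the folder-per-copy layout are assumptions.

```python
from pathlib import Path
from typing import Sequence


def read_with_failover(relative_name: str, copies: Sequence[Path]) -> bytes:
    """Read a file from the first copy that responds, trying copies in order."""
    failures = []
    for root in copies:
        try:
            return (root / relative_name).read_bytes()
        except OSError as exc:
            # This copy is unreachable or missing the file; try the next one.
            failures.append(f"{root}: {exc}")
    raise FileNotFoundError(f"No copy could serve {relative_name!r}: {failures}")
```

A caller would pass the copies in priority order, for example `read_with_failover("doc.tif", [copy_one, copy_two])`, and never needs to know which copy actually answered.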

12:54 Recovery is when business operation has stopped, for whatever reason, usually the disasters we’ve been talking about, maybe not completely. Recovery brings you back to being able to run your business again. You may not be up to full capacity, but you can get some work done, you can get your most critical operations working again. That’s what we’re talking about when we’re talking about recovery. And a runbook is one of the most important pieces of your entire process when we’re talking recovery. It’s the procedures and the operations you need to do to recover. It’s what you have planned out: the steps you need to follow, in the exact order you need to follow them, to recover your system. And there’s a recovery time objective: how long it takes you to get back to that level where you can actually start working again. And you have a recovery point objective. That’s the maximum targeted period of data loss, the maximum amount of data you can lose before your business is irrevocably lost. That’s your recovery point objective: how much data can you lose before you’re not going to be able to recover your systems at all?
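
One way to make the two objectives concrete is to check an incident against them. The sketch below is illustrative only: the function names, dates, and the 24-hour and 36-hour targets are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def meets_rpo(last_good_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the work lost since the last good backup fits within the RPO."""
    return failure_time - last_good_backup <= rpo


def meets_rto(failure_time: datetime, back_in_service: datetime, rto: timedelta) -> bool:
    """True if service resumed within the RTO."""
    return back_in_service - failure_time <= rto


failure = datetime(2024, 1, 10, 14, 0)

# Last good backup was 12 hours before the failure; the RPO allows 24 hours.
print(meets_rpo(datetime(2024, 1, 10, 2, 0), failure, rpo=timedelta(hours=24)))   # True

# Service came back 48 hours after the failure; the RTO allows 36 hours.
print(meets_rto(failure, datetime(2024, 1, 12, 14, 0), rto=timedelta(hours=36)))  # False
```

The recovery point objective is driven by how often you back up; the recovery time objective is driven by how fast you can restore.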

14:53 So we need to talk about planning a little bit. Planning is one of the most important things you can do before you have a disaster, and a lot of people just figure, we’ll figure it out as we go. But that’s not gonna work. Companies that faced a disaster without a plan in place are no longer in business today. You have to plan the cost for each business system: figure out the hard costs associated with being down for a short period of time, or an extended period of time. Figure out the risks of being down. Include soft costs.

15:45 Now let’s talk about the difference between a hard cost and a soft cost. Hard costs: you’re losing money on rent, you can’t bill your customers. You can’t capitalize. You can’t fulfill your contracts, and you can lose contracts for that. Those are hard costs. But there are soft costs involved as well. You can’t bring in new business, you’re losing potential business; that’s a soft cost. You lose your reputation. That’s a soft cost. And there are other risks. The loss of reputation is a risk. The inability to keep your employees: if you’re down for an extended length of time, and unable to pay your employees, people are going to jump ship. That’s also a soft cost, because now you’ve lost some of your best people and you’ve lost intellectual capital because of it. Determine your resumption order: what order do you need to bring your systems up in to minimize your downtime? You’ve got to bring things up in the right order. You can’t bring them up indiscriminately. It doesn’t do you any good to bring up your accounting system before you bring up the server that runs it.
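
Resumption order is a dependency-ordering problem, and a topological sort is one way to work it out. The sketch below uses Python’s standard `graphlib` module (Python 3.9 or later); the system names and their dependencies are made up for illustration and are not taken from any real deployment.

```python
from graphlib import TopologicalSorter

# Each system maps to the set of systems that must be running before it.
depends_on = {
    "database server": set(),
    "file server": set(),
    "application server": {"database server", "file server"},
    "web server": {"application server"},
    "accounting system": {"application server"},
}

# static_order() yields every system after all of the systems it depends on.
resumption_order = list(TopologicalSorter(depends_on).static_order())
print(resumption_order)
```

Whatever order comes out, the accounting system is always listed after the servers it runs on, which is exactly the point being made above. That ordered list belongs in the runbook.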

17:10 You need to establish your acceptable recovery time objective and recovery point objective. Figure out what you need to stay liquid, to stay viable. And then reconcile it with your available budget. Can you actually afford to be up in a day and a half? Is that feasible for you? I mean, I know it’s good. I know it’s what you want. But can you actually afford to do it? Can you afford that hot site that’s going to be required? And then identify the risks, the types of risks you face. And by that I’m talking about, how likely is it that you’re going to get hit by a hurricane? If you’re in Kansas, the odds aren’t that great. If you’re in South Carolina, yeah, probably. If you’re in Florida, yeah, you’re going to get hit by one. How likely is it you’re going to get hit by an earthquake? If you’re in California, plan on it. Guess what, I’m from Ohio, and I’ve been watching my friends there: they got hit by one yesterday. It wasn’t bad enough to do any damage, but they all felt it, and they need to start thinking about it. On the other hand, if you are in California, you probably don’t need to worry so much about tornadoes. If you’re in Ohio, you need to be thinking about that a lot.

18:42 Think about the types of risks you have. And don’t just think about the physical dangers. Hacking is a risk. Viruses are a risk. As for the biggest risks you face in your system right now, there are two that you are facing every single day and you need to start planning for them right now. Risk number one: data theft. Your employees. A disgruntled employee messing up your system. It’s a horrible thing to think about, but it happens. And the number two biggest risk to any system out there right now is those nasty guys who are sending out those viruses and holding your system hostage. I can’t think of the name of it. Give me a hand here, guys. You know what I’m talking about. Not the scams, the, it’s a virus, they lock out your system and you have to pay them … ransomware, that’s it. Thank you very much, whoever just typed that in: ransomware. That’s the biggest risk you face right now. How do you handle that? Have a plan for it. That is a huge risk for every business right now; it is rampant. So figure out the types of risks you face. Score them by their likelihood and how big an impact they will have. And then figure out how you’re going to deal with them and prioritize them accordingly.
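
Scoring risks by likelihood and impact can be as simple as multiplying two ratings and sorting. The sketch below assumes a 1 to 5 scale for each; the scale, the class name, and the example ratings are all illustrative, not figures from the webinar.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain), an assumed scale
    impact: int      # 1 (minor) to 5 (business-ending), an assumed scale

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritize(risks: Iterable[Risk]) -> List[Risk]:
    """Return risks ordered from highest score to lowest."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Example ratings only; every organization has to score its own situation.
risks = [
    Risk("hurricane", likelihood=2, impact=5),
    Risk("ransomware", likelihood=4, impact=5),
    Risk("disk failure", likelihood=4, impact=2),
]
for risk in prioritize(risks):
    print(f"{risk.name}: {risk.score}")
```

The ranked list is what you plan against first: the highest scores get a mitigation and a runbook entry before the lower ones.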

20:40 Here’s something that can help you. You’re going to get this presentation, by the way; look at this map. Earthquakes, tornadoes, hurricanes, volcanoes, tsunamis. There are very, very few places in this country where you don’t have to worry about any of this stuff. I was lucky. I grew up in northwest Pennsylvania, and we didn’t really have to worry about anything but the occasional tornado. And I grew up right on Lake Erie, so if there was going to be a tornado, most of the time it turned into a waterspout anyway. Now I live in western North Carolina, and I kind of have to worry about everything. You’ve got to worry about natural disasters and electrical problems. Do you have battery backup? Do you have an uninterruptible power supply? Even if you don’t have an uninterruptible power supply that’s enough to keep your entire business running, do you have enough of a UPS that you can gracefully shut your servers down, so that you don’t experience data loss from a sudden shutdown? What about network problems? Do you have a backup ISP? What about a fire issue? “I’ve got a sprinkler system.” Well, that’s fantastic. What’s your sprinkler system going to do to the $7,000 or $8,000 worth of computer equipment sitting on your floor? Or the $50,000 worth of computer equipment? Do you have an appropriate fire suppression system for your infrastructure? Crime, hacking, sabotage: we just talked about that. Hardware failures. What happens if your main server disk drive crashes? It happens. Even a solid state system can die.

22:48 Process failures. Think about it. We’re all changing our processes. We’re becoming much more agile now. We’re all changing the way we do work because the world is changing around us. Are we testing those systems well enough to make sure that when we implement a new process, we’re not messing ourselves up? If your process fails, can you recover from that? Just plain incompetence. Think about it. How closely are you managing your users and your systems? And what does being available mean to you? Is the system available if clinicians can access medical records, but your AP team can’t process invoices? Is that available in your hospital? Maybe. Different departments have different requirements. Which software is critical to each area? And do all of your departments need to be up for your system to be considered available to you?

24:11 If tech support is down, but sales is not, is your system available? And how do you measure it? Do maintenance windows count? If you go down for a planned maintenance window, is that considered downtime? In some cases it is; some organizations don’t consider that downtime because it’s planned, and only count unplanned downtime as downtime. And do you track peak usage or just low usage? Then you need to document all of this. That’s the runbook I was talking about. All of your recovery steps need to be very carefully documented.

25:00 Then you need to run disaster drills, because just having a book is no good. Think about it: the worst has happened. You’ve been hit by, pick a disaster: tornado, hurricane, blackout, whatever. Lightning has struck, your building has been burned to the ground. Pick a disaster. You find your book. Do you really want to have to take the five hours to read through it and figure out who you need to call, who needs to be involved, what’s going on, before you can actually get started? No. Run the disaster drills with the team that’s involved. Do it regularly, at least once a year, preferably every six months if not every quarter. Prove your backup reliability.

26:04 Now this is something that, when I’m talking about backups in OnBase, anybody who’s heard me talk about database backups will hear me say: every time you back something up, restore it somewhere to make sure it works. That should just be part of your regular backup maintenance. Restore it, and check it for integrity every single time. Because an unproven backup isn’t a backup. It’s just a paperweight. And again, check them for corruption. Just because you have the backup, and you can restore it, doesn’t mean something didn’t get backed up wrong. Because nothing’s perfect. And establish off site backups. Now there used to be a rule, prior to Katrina, to keep an offsite backup two hours’ drive time away. And then Katrina hit. And people in the Crescent City found out two hours’ drive time away is not enough, because those backups got wiped out, too. And it turned out that even four hours’ drive time wasn’t enough. And Matthew proved that as well, just a couple years back. So the new rule is two to four hours’ flight time away.
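
For file-based backups, one simple way to prove a restore is to compare checksums of the original and the restored copy. The sketch below is a generic illustration with made-up function names; it is not an OnBase tool, and a database backup would additionally need the database engine’s own integrity check after the restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """True if the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

A mismatch means the backup or the restore corrupted something, which is exactly what you want to find out during routine maintenance rather than during a disaster.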

27:45 This is why the cloud is a fantastic way to do some backups. This is one of the strengths of the OnBase Cloud backup, and of having OnBase do some hosting for you, because they mirror all of their sites. I don’t know if you know this, those of you who are using OnBase’s global cloud services, but their mirrored sites are, well, I think the mirrored site for their Andover, Maryland site is in Kansas, I believe. The mirror site for England, I want to say, is in Amsterdam. They do a really good job of separating their sites. And it’s very important that you do the same. I mean, it doesn’t have to be that drastic, but establish off site backups. If you’re going to be doing your backup in the cloud, do the research. Talk to the company where you’re putting it, and find out where their backups are. They’ll tell you if you ask. Be sure to ask, what’s their backup procedure? Don’t just trust that they’re doing it right. Ask them. Backups don’t exist if they haven’t been verified. And you don’t have a disaster recovery plan until you’ve tested it. Those are the two things to remember.

29:22 Now, let’s talk some strategy here. Platform strategy. The hardware components of your high availability and disaster recovery strategy cost money, sometimes a lot. So you have to plan your strategies to make sure you’re getting what you need and not killing your budget. There are a couple of ways you can do it for your platform. You can cluster: multiple servers host a lot of different services. It’s more expensive for the operating system, and it requires more knowledgeable resources, but you can get to four or five of those nines this way. With the network, redundancy goes a long way, because links are going to go out. This is especially important if your data center and your users aren’t geographically together. If you’ve got scattered users, you really need to watch your network redundancy. Have a second ISP.

30:30 Load balancing is a very effective and reasonably priced way to get high availability. OnBase is load balancer agnostic; we don’t care. All load balancers are supported. But load balancer and OnBase session expiration need to be synchronized, and you need to keep that in mind. There are two basic architectures for load balancing: stovepipe and bow tie. With stovepipe, you basically do the load balancing on the front end, in the OnBase world to the web server, and then each web server connects to a specific app server. There’s a hybrid version as well, where there’s one web server that’s load balanced to multiple app servers. And then there’s the bow tie, where there are multiple load balancers: you have a load balancer to multiple web servers, and then those web servers are load balanced to multiple app servers. OnBase works great with any of that.
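
To show why load balancing helps availability, here is a minimal round-robin sketch that skips servers marked as down. It is a toy illustration only: real load balancers do this in dedicated hardware or software with health checks and session handling, and the class and server names here are assumptions.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self._next = 0

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        # Look at each server at most once per call, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["app1", "app2"])
print([balancer.pick() for _ in range(4)])  # ['app1', 'app2', 'app1', 'app2']
balancer.mark_down("app1")
print(balancer.pick())                      # app2
```

In a stovepipe layout one balancer like this sits in front of the web servers; in a bow tie a second one sits between the web servers and the app servers.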

31:45 Virtualization is also a great way to have high availability. VMware and Citrix are fully supported by OnBase. It’s very quick to restore, and with products like vMotion, you can move from one piece of hardware to another with very little downtime. It provides very effective high availability and disaster recovery both, because if you’ve got a golden image sitting there, boom, just put it up on a new piece of hardware and you’re done. But you need to do it carefully. Consider the impacts of sharing your physical resources on the host. I cannot tell you the number of times we’ve gotten calls in support saying, all of a sudden our OnBase system is slow, and it turns out they’ve overextended a virtual host. So design your host to handle the maximum load of all of those VMs. Consider dedicating hosts to specific high traffic, high volume VMs. Don’t overextend your hosts, because you will affect everybody, and you will affect them negatively. And if you’re doing virtualization, there’s a priority to it. You can absolutely virtualize your DDS or file server, and your processing stations and your web server. Then get to your application server and your database server. Now, I am not going to read this out loud. But I will say OnBase does not recommend virtualizing all of these. We’ve seen it done. We’ve seen it done many, many times very successfully. However, I am just throwing it out there: OnBase does not recommend it, and recommends that the app server and the database server be on physical servers, not virtual servers. If you’re virtualizing them, there’s nothing wrong with it. Like I said, we’ve got lots of customers doing it very successfully. But it is not Hyland’s recommendation. It’s supported, but it is not the recommendation.

34:15 Storage. There are three ways you can do storage. There’s online: that’s absolutely active right now. There’s nearline: backups that are very easy to get to and quickly restored to online status. And then there’s offline. That’s the remote stuff. That’s the stuff that’s stored four hours away and takes longer to restore. So those are your three kinds of proximities for storage.

34:45 Disaster recovery is often done on tape: nearline, and usually offline if it’s stored remotely. That’s your best bet to get back from a major structural problem. Optical storage is usually nearline. Disk is usually online, or nearline if it’s redundant. RAID disks are online. If you’ve got partial damage, depending on the RAID selection, if you’re using parity, it can still be used: if one disk goes bad in the RAID, you can still get your data back. Cloud, depending on your provider, can be nearline, online, or offline. For high availability: optical is nearline, disk is online, RAID is online, and cloud, again, can be online or nearline, depending on your contract and the provider. For storage, SAN will generally have your highest performance, with high availability benefits. Write once, read many, a WORM disk: great for backup, a little slow to write out.
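
The reason a parity-based RAID survives one bad disk comes down to XOR arithmetic. The sketch below is a simplified illustration with two data blocks and one parity block; real RAID controllers work on stripes across many disks, and the block contents here are made up.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for index, value in enumerate(block):
            result[index] ^= value
    return bytes(result)


disk_a = b"ONBASE"
disk_b = b"BACKUP"
parity = xor_blocks(disk_a, disk_b)     # stored on a third disk

# Disk A fails: rebuild it from the surviving disk and the parity block.
rebuilt_a = xor_blocks(disk_b, parity)
print(rebuilt_a == disk_a)              # True
```

Lose any one of the three and the other two can regenerate it. Lose two at once and the data is gone, which is why parity RAID is high availability and not a backup.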

36:21 Replication and storage virtualization can be a good way to go. For disk groups in OnBase, media really doesn’t matter except for performance, and the fact that your first copy must, must, must, must be mass storage. Whether that’s SAN or a hard disk, either one works. Disk groups provide a hot failover. Disk group copies are built-in failover; I mentioned this earlier. Then there’s storage replication and storage virtualization. Backups through disk groups are a cold backup. They can be restored, but it is a cold failover. Have more than one copy. And if you’re going to have more than one copy of your disk groups, don’t put them on the same physical box. If it goes down, you’ve lost everything. And use RAID for the first and most inexpensive level of high availability. OnBase recommends RAID 1 or RAID 10 and supports many other options, including striping with parity. You can validate your platters, make backups, and monitor and maintain offsite copies.

37:56 For OnBase, for a hot database, we do support failover clustering. This is cool: if your database goes down, the cluster fails over to the other one. OnBase also supports database mirroring. The modes are high safety with a witness, which gives you automatic failover; high safety without a witness, which is manual failover; and high performance, which is also manual failover. All of those are supported. With mirroring in high safety mode, a transaction is not considered committed until it’s hardened on both sides. In high performance mode, the primary commits without waiting for the mirror to harden.

38:57 Log shipping: SQL Server does provide the ability to use log shipping. It can be used with OnBase, and it’s a warm database backup. Mirroring without a witness and virtualization are also possible. And of course there’s a cold database backup, which is just a backup that gets restored. But check for corruption with every backup, and if corruption is ever found in the backup or in your database itself, contact KeyMark support immediately, before you take any action of any kind. Now, about corruption. OnBase database transactions are not multi-statement. Therefore, when something goes wrong, investigation will be required, and very specific OnBase knowledge is necessary. Do not try to figure this out by yourself, DBAs, because you are going to get it wrong, and you run the risk of A, messing up your database even worse, and B, invalidating your contract with Hyland and KeyMark. The net result of that is, if anything goes wrong, Hyland’s going to charge you to fix it.