What sort of uptime should 'mission critical' systems be getting?

Mark R

Diamond Member
Oct 9, 1999
8,513
14
81
I know that 99% sounds pretty reasonable, but taking a prolonged lunch break or a Starbucks run once a week is getting boring, especially as the work with urgent deadlines just piles up while you're sitting at a blank screen.

We've had the system for about 1 year, and it's really not a lot better than it was - if anything, it's getting worse as the load on it increases. Recently, however, senior management seem to have become aware that the system's performance is 'suboptimal' and have been hassling the prime contractor to get it fixed. So far, the contractors have disabled a whole heap of functionality in order to try and stop the app servers crashing a couple of times a week.

However, the desktop client software sucks - and we've had some really cracking non-explanations from the prime contractor as to why the software is broken. E.g. the voice dictation software used on the desktops is flaky as anything - it often ends up opening two recording windows simultaneously, which results in completely scrambled audio getting recorded and a bizarre error message like 'insufficient resources to save audio'. Not helpful when it's a report on an hour's work. The contractor's diagnosis for this bug: 'Insufficient memory in client machines. 2 GB is insufficient. Recommend urgent upgrade of client workstations to 4 GB.' Hmm. Let's see, Java is slow and memory hungry, but it's not that bad. Also, these are 32-bit XP workstations, which as I recall don't make particularly great use of 4 GB of RAM - especially not when we're talking about ECC FB-DIMMs.
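Rough numbers on the 4 GB point, assuming a typical non-PAE 32-bit client and somewhere around 0.75 GB of address space reserved for video/PCI devices (the real reservation varies by machine, so this is purely an illustration):

# Back-of-envelope: how much of 4 GB installed RAM a 32-bit, non-PAE Windows client can actually address.
# The MMIO reservation is an assumed figure for illustration; real values vary by chipset and hardware.
address_space_gb = 2 ** 32 / 2 ** 30      # 4.0 GiB total 32-bit physical address space
mmio_reserved_gb = 0.75                   # assumed space claimed by video/PCI devices

usable_gb = address_space_gb - mmio_reserved_gb
print(f"Usable RAM out of 4 GB installed: ~{usable_gb:.2f} GB")   # ~3.25 GB in this example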

Another great choice has been the use of smartcard authentication - you need a smartcard to log in to the app software. Fine, except the smartcards are authenticated against a server in a remote datacenter run by the prime contractor. It was nice last weekend when the link to the datacenter went down. A message pops up on the screen: 'Your user rights have changed. You will be logged off in 10 seconds.' Ten seconds later, the app closes down and unsaved work is lost. Try to log in again, nada for a couple of hours. Called IT, but there wasn't a lot they could do - a third-party problem. We had to wait until the contractor resolved the problem; nothing could be done locally.

Oh, and don't even get me started on the smartcard client software - the underlying service hangs when you pull your smartcard out. The only way to log in again is to power-cycle the workstation! The service actually prevents Windows from shutting down, so you have to hard power off. Unsurprisingly, the OS gets hosed on these workstations with alarming regularity.

Gah. It's frustrating. At least we now have a way to report problems - prior to our ultimatum to the contractor, they weren't even interested in hearing about them. I prepared a detailed report on about 5 bugs, and was simply told 'Why are you using those tools? That way sucks. These other tools are better.'

I'm just wondering how long it's going to take to sort this system out. Nothing much has happened in 9 months. In the month since the ultimatum, we've at least had some feedback, but the system is still unreliable as hell.
 

Fardringle

Diamond Member
Oct 23, 2000
9,188
753
126
Based on your post, it sounds like a lot less than 99% up-time. To me, up-time doesn't just mean that the server or application is turned on and running, but that it is working the way it is supposed to be working. 99% true up-time (i.e. roughly 1 hour of difficulty in every 100 hours of operation) is probably acceptable in most cases. 1 hour of complete outage along with 20 hours of partial functionality in that 100-hour span would tell me that you need a new system and/or service provider.
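To put a rough number on that second scenario - the 50% weighting for 'partial functionality' below is purely an assumption for illustration:

# Availability over a 100-hour window, counting degraded hours at a reduced weight.
# The 0.5 weighting for partial functionality is an arbitrary, illustrative assumption.
window_hours = 100
full_outage_hours = 1
degraded_hours = 20
degraded_weight = 0.5   # assume the system is half-useful during degraded hours

effective_up = window_hours - full_outage_hours - degraded_hours * (1 - degraded_weight)
print(f"Effective availability: {effective_up / window_hours:.1%}")   # 89.0%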
 

RebateMonger

Elite Member
Dec 24, 2005
11,588
0
0
99% uptime is pretty horrible. That'd be a system down for about three and a half days a year. My small business clients would kill me if they or their systems were down that long every year.
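For the record, the arithmetic behind that figure:

# Downtime allowed by 99% uptime over a year.
hours_per_year = 365 * 24                  # 8760
downtime_hours = hours_per_year * 0.01     # 87.6 hours
print(f"{downtime_hours:.1f} hours = {downtime_hours / 24:.2f} days per year")   # ~3.65 days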
 

spidey07

No Lifer
Aug 4, 2000
65,469
5
76
A few things - there is no such thing as "the link to the datacenter went down". It should take a humongous tornado to do that, basically eliminating the data center. If this is not the case, then that's a poor data center or poor design.

But I'd call the standard service level agreement for true mission-critical applications 99.999%, outside of normal maintenance. And even with maintenance, the application should not be unavailable unless major work on the app is being done.
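For comparison, what the usual 'nines' work out to per year (straight arithmetic, planned maintenance excluded):

# Annual unplanned downtime budget for common SLA tiers.
minutes_per_year = 365 * 24 * 60           # 525,600

for label, uptime in [("two nines", 0.99), ("three nines", 0.999),
                      ("four nines", 0.9999), ("five nines", 0.99999)]:
    budget = minutes_per_year * (1 - uptime)
    print(f"{label:>11} ({uptime:.3%}): {budget:8.1f} minutes/year")
# five nines leaves roughly 5.3 minutes of unplanned downtime per year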

All that being said, these kinds of things should be stipulated in whatever contract is in place. If your needs are not being met then check the contract and/or find somebody else.
 

Mark R

Diamond Member
Oct 9, 1999
8,513
14
81
Originally posted by: spidey07
A few things - there is no such thing as "the link to the datacenter went down". It should take a humongous tornado to do that, basically eliminating the data center. If this is not the case, then that's a poor data center or poor design.

The scenario is that the ASP's datacenter is a few hundred miles away. We have primary servers and storage on site, but the data is mirrored to the ASP's site (for disaster recovery, and eventual sharing). The ASP also hosts the primary authentication servers, although a local cache is available.

The telco that provides the connection was performing 'maintenance' when they unintentionally interrupted the link to the datacenter for a few hours. When the smartcards' credentials expired from the local cache, users were forcibly logged out and couldn't log back in again until the connection was restored.

Again, you'd think that they'd have a high-reliability link, or a sensible cache that won't purge data when it's isolated. But someone, somewhere decided that this was the best way to do it.
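Just to illustrate what I mean by a 'sensible cache' - this is purely a hypothetical sketch, nothing like what the vendor actually ships, and the names and TTL are made up:

# Hypothetical credential cache that degrades gracefully when the remote
# authentication server is unreachable, rather than hard-expiring entries
# and logging everyone out. Not the vendor's implementation.
import time

class CredentialCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}                     # card_id -> (credentials, fetched_at)

    def lookup(self, card_id, fetch_remote):
        """Return credentials, preferring a fresh remote check but falling back to stale data.

        fetch_remote(card_id) is assumed to raise ConnectionError when the
        link to the datacenter is down.
        """
        now = time.time()
        cached = self.entries.get(card_id)

        if cached and now - cached[1] < self.ttl:
            return cached[0]                  # still fresh, no remote round trip needed

        try:
            creds = fetch_remote(card_id)     # normal path: revalidate against the remote server
            self.entries[card_id] = (creds, now)
            return creds
        except ConnectionError:
            if cached:
                return cached[0]              # link is down: serve stale instead of logging the user out
            raise                             # unknown card and no way to check it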


But I'd call the standard service level agreement for true mission-critical applications 99.999%, outside of normal maintenance. And even with maintenance, the application should not be unavailable unless major work on the app is being done.
Well, it's nice to have a figure that should be achievable. Currently, the system has been going down (completely, or with several servers in the cluster unavailable) for around 1 hour a week. Then there is 'maintenance' to try and work out the bugs, which is taking about 1-2 hours every couple of weeks or so.
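Running the numbers on that (I'm averaging the 'maintenance' to about 45 minutes a week, i.e. 1.5 hours every two weeks):

# Rough effective uptime of the current system over a 168-hour week.
# The maintenance figure is an average of '1-2 hours every couple of weeks'.
hours_per_week = 7 * 24            # 168
unplanned_down = 1.0               # ~1 hour of outage per week
maintenance = 1.5 / 2              # ~0.75 hours per week on average

uptime = 1 - (unplanned_down + maintenance) / hours_per_week
print(f"Effective uptime: {uptime:.2%}")   # ~98.96% - not even two nines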

I'd like to think that it's truly 'mission critical', but the procurement department and contractor may think differently.

All that being said, these kinds of things should be stipulated in whatever contract is in place. If your needs are not being met then check the contract and/or find somebody else.

Unfortunately, a lot of the fine detail is kept confidential, and peons like myself are given the 'mushroom treatment'. However, I do know that the matter has been escalated to top-level management at the prime contractor. This has led to 'crack teams' of engineers being flown in from the various vendors, but little progress, other than the loss of some highly desirable functionality.

I suspect the legal department are beginning to sharpen their pens, as there was a recent circular suggesting that senior management were beginning to lose patience with the contractors. Not entirely surprising, as this is a very, very expensive system. The exact cost is confidential, but I believe it to be well in excess of $10m.


 

Goosemaster

Lifer
Apr 10, 2001
48,777
3
81
Can you set up a simple failover over some cheap DSL or something? For $10 million I would expect a team of workers to ferry the packets there for you by hand if that's all that's left.

Also, if you see that guy give him a swift punch in the arm from me. :evil:
 

Modelworks

Lifer
Feb 22, 2007
16,240
7
76
Most people I have dealt with offer 99.9% uptime,
meaning the system must not be down for more than a total of about 45 minutes per month (quick arithmetic below).
There are also places that offer a 100% SLA.
I recall one that boasted, 'If your system is down even one minute, you get that month free.'
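That 45-minute figure is just the 99.9% budget rounded up a little; assuming a 30-day month:

# Monthly downtime budget at 99.9% uptime, assuming a 30-day month.
minutes_per_month = 30 * 24 * 60            # 43,200
budget = minutes_per_month * 0.001
print(f"{budget:.1f} minutes of allowed downtime per month")   # 43.2 minutes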
 

Jeff7181

Lifer
Aug 21, 2002
18,368
11
81
I don't think anyone would consider 99% uptime acceptable for actual mission critical systems.