This week I learned about an outage at a provider of hosted Microsoft Office applications, which reminded me of the sad state of high availability across the industry.
More on the provider
The provider is a medium-sized infrastructure vendor that is successful in providing hosted Microsoft Office applications, basically running the servers for Outlook for their clients. They are not small, with 5 corporate locations on both sides of the Atlantic and 10 datacenters in the US and Europe. The provider is professional and has, e.g., achieved SOC2 and SOC3 compliance.
The vendor offers a 99.999% up-time guarantee - but that was definitely broken by being down for most of a workday, from 7 AM till 3:30 PM.
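To put the guarantee next to the outage, here is a small back-of-the-envelope sketch. It assumes a 365-day year and that this outage was the only downtime in that year; the function names are made up for illustration and the provider's actual SLA terms may measure availability differently (per month, for example).

```python
# Downtime budget implied by an availability guarantee, versus the reported outage.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year: 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that still satisfy the guarantee."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability_percent(outage_minutes: float) -> float:
    """Yearly availability if this outage were the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = downtime_budget_minutes(99.999)           # about 5.26 minutes per year
outage = 8.5 * 60                                  # 7 AM to 3:30 PM = 510 minutes
years_of_budget = outage / budget                  # about 97 years of allowance
best_case = achieved_availability_percent(outage)  # about 99.903%

print(f"budget: {budget:.2f} min/year")
print(f"outage: {outage:.0f} min, {years_of_budget:.0f} years of budget")
print(f"best-case availability this year: {best_case:.3f}%")
```

Under these assumptions, a single workday outage uses up roughly a century of five-nines allowance and leaves the provider at about three nines for the year.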
What happened
Clients noticed in the morning that they were not able to get their emails, send emails or work with their calendars. When they called the provider, calls went dead, and the provider's website and support applications were not available. The first provider-to-client communication then happened over... Twitter. And Twitter remained the lifeline between provider and customers till - you may have guessed it - the Twitter account went into Twitter jail for hitting the daily limit of 1000 tweets. And while that seems generous - it's not much if Twitter is your only way of communicating with several hundred customers.
What went right
The provider got the system back, tried everything they could to keep customers informed (who were obviously in the dark), offered the usual letter from the CEO within the next 24 hours and had that followed up by the COO. The vendor communicated pro-actively that they had broken the service levels, that they would waive the requirement to ask for reimbursement, and that they would reimburse customers directly based on their SLAs.
What went wrong
In the CEO letter the provider already named an issue with their routers as the root cause of the outage. And while it's fine to not have the ultimate reasons 24 hours after an outage - you need to do better than the following from the letter of the COO, 48 hours later:
"the routers connecting all our systems each received an invalid update"
As my colleague Frank Scavo pointed out - that is pretty passive language - no one did the update. Was it the provider, was it the router manufacturer, etc. - we do not know. No one is taking responsibility.
Moreover, there was no mention of why all the routers went down, why the update was not tested separately, why the routers were not switched in groups, why no backup routers were held back, etc.
And the provider explained that the phones (VoIP) and customer support systems were down - because the provider is using its own infrastructure. And while "drink your own champagne" is a good argument - it is an empty glass when you have a system outage. The provider failed to explain how this happened and why, e.g., the DR for their operational systems did not kick in.
The lack of an answer in both of these areas does not instill confidence in customers.
The sorry state of HA
We all know that data center components should only be switched in groups and with redundancy - but obviously this went wrong at the provider. Equally, running your own critical customer-facing systems on the same infrastructure as your customers is a disaster waiting to happen - and it happened to this provider.
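The "switch in groups, with redundancy" principle can be sketched in a few lines. This is only an illustration and not how the provider or any router vendor does it: `staged_rollout`, `apply_update` and `is_healthy` are hypothetical names, and the health check is a stand-in for whatever monitoring a real operation would use.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    routers: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    group_size: int = 2,
    reserve: int = 1,
) -> List[str]:
    """Update routers group by group and stop at the first unhealthy group.

    The last `reserve` routers are never touched, so untouched capacity
    remains even if an update turns out to be bad. Returns the routers that
    were updated (including those in the failing group).
    """
    if group_size < 1 or reserve < 0:
        raise ValueError("group_size must be >= 1 and reserve >= 0")
    candidates = list(routers[: max(len(routers) - reserve, 0)])
    updated: List[str] = []
    for start in range(0, len(candidates), group_size):
        group = candidates[start : start + group_size]
        for router in group:
            apply_update(router)
            updated.append(router)
        # Verify the whole group before moving on to the next one.
        if not all(is_healthy(router) for router in group):
            break  # halt: the remaining groups and the reserve stay untouched
    return updated


# Simulated run: a bad update breaks every router that receives it.
broken = set()
fleet = [f"router-{n}" for n in range(1, 8)]
touched = staged_rollout(fleet, apply_update=broken.add, is_healthy=lambda r: r not in broken)
print(touched)                    # only the first group was updated
print(len(fleet) - len(touched))  # routers still running the old configuration
```

In the simulated run the invalid update takes out two routers instead of all of them, and the rollout stops before it reaches the rest of the fleet or the held-back reserve.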
So why do well-known and proven HA principles get broken? The reasons are manifold. Human error, overconfidence (both are my bets in this case), cutting corners on cost, not thinking through the seemingly impossible, etc. are all likely. And human nature is good at discounting highly unlikely events - but when they happen, we too often forget that it was the decision makers who looked the other way back when these risks first came up.
MyPOV
Outages are always unfortunate and can't be planned, as the Dilbert cartoon requests. It comes back to how a provider reacts, investigates, communicates and then remedies the root causes. On reaction and investigation the provider was solid with a B rating - but on communication and remedies they only deserve an F.
And HA on Twitter is easy - get a 2nd and 3rd account and switch over when your main account goes to Twitter jail. Yours truly knows about the 200 tweets per hour limit well. So follow holgermu1, too. Just in case. Happened twice this year - so far. And I won't beat 400 tweets per hour - promised... now wait...