When it comes to crisis and incident management in the cloud/digital era, HOPE IS NOT A STRATEGY!
An incident is unplanned downtime or an interruption that partially or fully disrupts a service, delivering a lesser quality of service to users. If the incident is major, then it is a crisis!
When it starts to affect the quality of service delivered to customers, it becomes a serious issue, as most service providers have service-level agreements (SLAs) with their customers that often have penalties built in.
As I continue my research in these areas, and after talking to multiple clients, I have come to the realization that most enterprises are not set up to handle IT-related incidents or crises in real time. Classic legacy enterprises are set up to deal with crises in old-fashioned ways, without considering the cloud or the SaaS model, and social media venting adds another wrinkle. Newer digital-native companies, from what I have seen, do not put much emphasis on crisis management.
Especially with the need and demand for "always-on" services, there are more opportunities than ever for things to break, and incidents do not wait for a convenient time. Problems can, and often do, happen on weekends, holidays, or weeknights when no one is paying attention. When an incident happens, a properly prepared enterprise must be able to identify, assess, manage, and solve it, and to communicate effectively with customers.
Another key issue to note here is the difference between security and service incidents. A security incident is when data leakage or a data breach happens. Mitigation and crisis management there involve a different set of procedures, from disabling accounts to notifying stakeholders and account owners and escalating the issue to security and identity teams. A service incident is when a service disruption happens, either partially or fully; it needs to be escalated to DevOps, developers, and Ops teams. Since the two are related, some of the crisis management procedures might overlap. But if your support teams are not aware of the right escalation process, they might be sending critical alerts up the wrong channel when minutes matter. For the sake of this article, I am going to discuss only service interruptions, though a lot of parallels can be drawn to security incidents as well.
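To illustrate, here is a minimal sketch of how that routing might be encoded; the team names, the route table, and the `notify` helper are hypothetical placeholders for whatever paging or ticketing tool your teams actually use:

```python
# Minimal sketch: route an incident to the right escalation channel by type.
# Team names and notify() are hypothetical stand-ins for a real paging tool.

INCIDENT_ROUTES = {
    "security": ["security-team", "identity-team"],  # breach/leakage path
    "service": ["devops", "developers", "ops"],      # disruption path
}

def notify(team: str, summary: str) -> None:
    print(f"[page] {team}: {summary}")  # stand-in for a real pager integration

def route_incident(incident_type: str, summary: str) -> list[str]:
    teams = INCIDENT_ROUTES.get(incident_type)
    if teams is None:
        raise ValueError(f"Unknown incident type: {incident_type!r}")
    for team in teams:
        notify(team, summary)
    return teams

route_incident("service", "Checkout API returning 5xx for 20% of requests")
```

The point of keeping the mapping explicit is that support teams never have to guess the channel: the incident type decides it.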
Avoid incidents when possible
Avoidance is better than fixing issues in any situation. There are many things an enterprise can do to avoid incidents in the first place, such as vulnerability audits, early warning monitoring, code profile audits, release review committees, and anomaly detection. One should also invest in proper observability, monitoring, logging, and tracing solutions. I have written many articles on those areas as well; they are too complex to cover in detail here.
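As one concrete example of early warning via anomaly detection, here is a minimal sketch of a rolling z-score detector on a latency metric; the window size, warm-up count, and threshold are illustrative assumptions, not tuned recommendations:

```python
# Sketch: a rolling z-score detector as a stand-in for early warning monitoring.
from collections import deque
from statistics import mean, stdev

class LatencyAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # recent latency samples (ms)
        self.threshold = threshold           # z-score that triggers an alert

    def observe(self, latency_ms: float) -> bool:
        """Return True if this sample looks anomalous versus recent history."""
        anomalous = False
        if len(self.samples) >= 30:  # need enough history for a stable baseline
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and (latency_ms - mu) / sigma > self.threshold:
                anomalous = True
        self.samples.append(latency_ms)
        return anomalous

detector = LatencyAnomalyDetector()
for value in [100, 102, 98, 101, 99] * 10 + [450]:
    if detector.observe(value):
        print(f"Early warning: latency spike at {value} ms")
```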
Prepare for the unexpected
With most enterprises, there is no preparation or plan of action for when an incident happens. In the digital world, incidents do not wait around for days to be solved or managed. If you let social media take over, it will; sometimes it can even take on a mind of its own. When you are not telling the story, the social media pundits will tell your story for you.
Identify the incident before others do
I wrote a few articles on this topic. In my latest article, "In the digital economy, you should fail fast, but you also must recover fast," I discuss the need for speed to find issues faster than your customers or partners can. Software development has fully adopted DevOps and agile principles, but Ops teams have not fully embraced the DevOps methodologies. For example, the older monitoring systems, whether they are application performance monitoring (APM), infrastructure monitoring, or digital experience monitoring (DEM) systems, can detect a service interruption fairly quickly. However, identifying the microservice that is causing the problem, or the changes that went into effect and caused the issue, remains complex in the current landscape. I have written repeatedly about the need for observability and for finding issues at the speed of failure.
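As a small illustration of finding issues before your customers do, here is a sketch of a synthetic probe that raises an incident after a few consecutive failed health checks; the endpoint, timeout, and failure budget are hypothetical values:

```python
# Sketch: a synthetic probe that checks an endpoint and fails fast,
# so you learn about an outage before your customers do.
import time
import urllib.request

HEALTH_URL = "https://example.com/healthz"  # placeholder endpoint

def probe(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers connection errors, timeouts, and HTTP errors
        return False

failures = 0
while True:
    failures = 0 if probe(HEALTH_URL) else failures + 1
    if failures >= 3:  # three consecutive failures: open an incident
        print("Raise Sev 1: health checks failing")
        break
    time.sleep(10)  # probe every 10 seconds
```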
Act quickly and decisively
When major incidents happen, it should be an all-hands-on-deck situation. As soon as a critical incident (Sev 1) is identified, an incident commander should be assigned, a collaborative war room (virtual or physical) must be opened immediately, and the proper service owners must be invited. If possible, the issue must be escalated immediately to the right owner who can solve the problem, rather than going through the workflow process of L1 through L3 and so on. In collaborative war rooms, finger-pointing and blaming someone else are quite common, but that only delays the process further. In addition, if too many people are invited to these war rooms, there has to be a mechanism to establish mean time to innocence (MTTI), so anyone who is not directly related and cannot assist in solving the issue can leave and continue their productive work.
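One way to make MTTI actionable is simply to track when each responder joins the war room and when they are cleared. A minimal sketch, with all names and data shapes assumed for illustration:

```python
# Sketch: a tiny war-room roster that records "time to innocence"
# so cleared responders can get back to productive work.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Responder:
    name: str
    service: str
    joined_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class WarRoom:
    def __init__(self, commander: str):
        self.commander = commander
        self.roster: dict[str, Responder] = {}

    def page(self, name: str, service: str) -> None:
        self.roster[name] = Responder(name, service)

    def clear(self, name: str) -> float:
        """Mark a responder innocent; return their minutes in the room."""
        responder = self.roster.pop(name)
        elapsed = datetime.now(timezone.utc) - responder.joined_at
        return elapsed.total_seconds() / 60

room = WarRoom(commander="alice")
room.page("bob", "payments")
print(f"bob cleared after {room.clear('bob'):.1f} minutes")
```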
Own your story on your digital channels
When a severity 1 incident or a major service interruption happens, your users need to know, your service owners need to know, and your executives need to know. In other words, everyone who has skin in the game should know. Part of it is external communication. At a very minimum, there has to be a status page that displays the status and quality of service, so everyone is aware of the service status at all times. In addition, an initial explanation of what went wrong, what you are doing to fix it, and a possible ETA should be posted as a status update and in regular posts on LinkedIn, Twitter, Facebook, and other social media platforms where your enterprise brand is present. Going dark on social media will only add fuel to the fire. Your users know your services are down. If they get no updates from you, speculators, or even competitors, will spread rumors that ruin your brand.
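A status update like that usually comes down to a single API call. Here is a sketch assuming a hypothetical status page endpoint, token, and payload shape; substitute your status page provider's real API:

```python
# Sketch: posting an initial incident notice to a status page.
# The endpoint, token, and payload fields are hypothetical placeholders.
import json
import urllib.request

STATUS_API = "https://status.example.com/api/incidents"  # placeholder
TOKEN = "..."  # your API token

payload = {
    "title": "Degraded checkout performance",
    "status": "investigating",
    "message": "We are aware of elevated error rates and are investigating. "
               "Next update within 30 minutes.",
    "eta": None,  # share one as soon as you honestly have it
}

req = urllib.request.Request(
    STATUS_API,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": f"Bearer {TOKEN}"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("Status page updated:", resp.status)
```

Even a sparse first update beats silence: it shows you found the issue and own the story.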
This is where most digital companies are weak because they are not prepared, and that can make or break an SMB enterprise. Real-time crisis and reputation management are crucial in those critical moments while engineers and support teams are trying to solve the problem. It is also a good idea to use sentiment analysis and reputation tools to figure out who is saying extremely negative things, and then either take them offline to deal with them directly or respond publicly to avoid further escalation.
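As a simple illustration of that kind of sentiment triage, here is a sketch using NLTK's VADER analyzer; the sample mentions and the -0.6 "extremely negative" cutoff are assumptions for illustration:

```python
# Sketch: flagging extremely negative mentions for direct outreach.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

mentions = [
    "This outage is a disaster, the service is terrible and support is useless.",
    "Thanks for the quick status updates, appreciate the transparency!",
]

for text in mentions:
    score = sia.polarity_scores(text)["compound"]  # -1 (negative) .. +1 (positive)
    if score < -0.6:  # extremely negative: route to direct outreach
        print(f"Escalate to customer success: {text!r} (score {score:.2f})")
```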
Do a blameless post-mortem
A common pattern I see across organizations is that after the crisis is resolved and the incident is fixed, everyone quickly moves on to the next issue. It could be because there are so many issues that the support, DevOps, and Ops teams are overwhelmed, or because they do not think it is necessary to analyze what happened and why. An especially important part of crisis/incident management is to figure out what went wrong, why it went wrong, and, more importantly, how you can fix it once and for all so it never happens again. After figuring out a solution, document it properly. You also need a repository to store these solutions, so in the unfortunate event that the incident happens again, you know how to solve it quickly and decisively.
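The repository does not have to be fancy; even a structured, searchable record goes a long way. A minimal sketch, with field names assumed for illustration:

```python
# Sketch: a searchable post-mortem record, so the next responder can
# find the documented fix quickly if the same failure recurs.
from dataclasses import dataclass, field

@dataclass
class PostMortem:
    incident_id: str
    what_went_wrong: str
    root_cause: str
    permanent_fix: str
    tags: list = field(default_factory=list)

REPOSITORY: list = []

def file_postmortem(pm: PostMortem) -> None:
    REPOSITORY.append(pm)

def find_similar(keyword: str) -> list:
    kw = keyword.lower()
    return [pm for pm in REPOSITORY
            if kw in pm.root_cause.lower() or kw in " ".join(pm.tags).lower()]

file_postmortem(PostMortem(
    incident_id="INC-1042",
    what_went_wrong="Checkout errors for 40 minutes",
    root_cause="Connection pool exhausted after config change",
    permanent_fix="Pool sized from load tests; config changes gated by review",
    tags=["database", "config"],
))
print(find_similar("config"))
```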
Follow-up
In addition, discuss the situation with your top customers who were affected by it: what you did to solve the issue, how you fixed it so it will not repeat, and, more importantly, how you were prepared for the incident before it happened. This instills huge confidence in your brand. Not only will you not lose customers, but you will gain more because of how you handled it.
In addition, the general advice from crisis management firms would be to cancel any extravagant events planned for the immediate future. If your critical services were down for days while your executives were having a huge conference in Vegas, the social media world would be at it for days. Monitor your social media platforms for tone (LinkedIn, Twitter, and Facebook at a minimum, plus whatever other platforms your company has a presence on, including negative comments on your own blog sites); you can even use AI-based sentiment analysis tools to identify still-unsatisfied customers, discuss their concerns, and figure out how you can address them. Until these concerns are addressed, your incident is not completely solved.
Another best practice is to avoid hype content and marketing buzz for a while after a major incident. I have seen companies go on with their campaigns as planned and get backlash from customers saying that they are all talk and nothing really works.
Conclusion
Let's face it: every enterprise is going to face this sooner or later. No one is invincible. The question is, are you ready to deal with it when it happens to you? The ones who handle it properly can win their customers' confidence by showing they are prepared to handle future incidents if they were to happen again.
Do you earn your customers' trust by doing this the right way, or do you lose it by botching the response and covering it up? That will define you going forward.
At Constellation Research, we advise companies on tool selection, best practices, trends, and proper IT incident/crisis management setup for the cloud era, so you can be ready when it happens to you. We also advise clients through the RFP, POC, and vendor contract negotiation processes as needed.