The following are common terms used in the Incident Management, AIOps, and Observability practice areas:

Alert Fatigue

Alert fatigue occurs when on-call personnel and incident responders receive an overwhelming number of alerts or notifications (in volume, frequency, or both), causing them to ignore, dismiss, or become desensitized to some of them, including the highly critical ones. At times this results in support personnel missing the right alert and failing to take the right action, or taking an inappropriate action that makes matters worse.

To mitigate this, many enterprises use AIOps or Incident Management solutions that reduce or group redundant, irrelevant, or non-critical alerts by:

  1. Prioritizing alerts so that only highly critical alerts reach the on-call support personnel.
  2. Grouping alerts so that all alerts or notifications related to a specific event are bundled together for analysis.
  3. Reducing noise, or dynamic filtering, by suppressing irrelevant alerts and focusing only on the critical alerts that need immediate attention.
  4. Suppressing false alarms.

These solutions can reduce burnout among on-call personnel and SREs and help them resolve critical incidents faster; a brief sketch of such an alert-reduction pipeline follows.
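
For illustration, here is a minimal sketch of how such a pipeline might prioritize, group, and filter raw alerts. The Alert shape, the severity scale, and the grouping key are illustrative assumptions, not the API of any particular AIOps product.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:                       # illustrative shape; real alert payloads vary by tool
    source: str                    # monitoring system or service that raised the alert
    event_id: str                  # identifier of the underlying event
    severity: int                  # 1 = critical ... 5 = informational (assumed scale)
    message: str

def reduce_alert_noise(alerts, severity_threshold=2):
    """Drop low-severity noise, collapse exact duplicates, and group what remains by event."""
    critical = [a for a in alerts if a.severity <= severity_threshold]   # noise reduction / dynamic filtering
    deduped = list(dict.fromkeys(critical))                              # drop exact duplicate alerts
    groups = defaultdict(list)
    for alert in deduped:
        groups[alert.event_id].append(alert)                             # group alerts for the same event
    # prioritize: page on-call with only the most critical alert of each group
    pages = {event: min(group, key=lambda a: a.severity) for event, group in groups.items()}
    return dict(groups), pages
```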

ChatOps

ChatOps refers to the collaborative communication tools and processes commonly used in incident management. Physical war rooms have evolved into virtual collaboration channels. Generally, as soon as an incident is identified and acknowledged, one of the first steps is to create a collaboration channel. This channel centralizes all communications and assets related to the incident, along with information about its progress, status, plans, and resolution (if any), so anyone involved can get the status and necessary information in real time. Anyone who needs the information, or who can assist in solving the incident, can be invited to the channel as needed. On-call systems, alert/notification tools, and chatbots are often included in this category as well.
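
As a sketch of the channel-creation step, the following assumes the slack_sdk Python package, a bot token exposed as SLACK_BOT_TOKEN with channel-management scopes, and illustrative incident fields; other chat platforms follow a similar pattern.

```python
import os
from slack_sdk import WebClient   # assumes the slack_sdk package is installed

def open_incident_channel(incident_id: str, summary: str, responder_ids: list[str]) -> str:
    """Create a dedicated channel for an incident, invite responders, and post the summary."""
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    response = client.conversations_create(name=f"inc-{incident_id.lower()}")   # e.g. "inc-2023-0042"
    channel_id = response["channel"]["id"]
    client.conversations_invite(channel=channel_id, users=",".join(responder_ids))
    client.chat_postMessage(
        channel=channel_id,
        text=f"Incident {incident_id} acknowledged. Summary: {summary}",
    )
    return channel_id
```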

Incident

An incident is an unplanned downtime or interruption that partially or fully disrupts a service, degrading the quality of service delivered to users. A severe incident is treated as a major incident or crisis. Because most service providers have service-level agreements (SLAs) with their consumers, often with penalties built in, any degradation of the quality of service delivered to customers quickly becomes an issue. The longer an incident remains unresolved, the more it costs the organization.

Mature organizations expect and prepare for major digital incidents and handle them well when they happen. They use a mix of open-source, commercial, and homegrown tools that work well together. Most of these organizations also successfully implement the following processes.

Incident Acknowledgement

Once an incident alert/notification is generated, it needs to be acknowledged by someone from the support or SRE team or by the service owner. Acknowledging an incident is not a guarantee that it will be fixed soon, but it is an early indication of how alert the incident teams are and how quickly they can get to incidents. While acknowledgment means someone has taken responsibility for the incident, it does not mean the incident has been escalated to the right person yet. This acknowledgment mechanism is common in most on-call alerting/notification tools: if there is no acknowledgment, the on-call tool continues to escalate, or to look for the right person, until the incident is acknowledged.
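
A minimal sketch of that escalation loop is shown below. The notify and is_acknowledged hooks are hypothetical placeholders for a real on-call tool's API, and the timeout and polling interval are illustrative.

```python
import time
from typing import Callable

def escalate_until_acknowledged(
    incident_id: str,
    escalation_chain: list[str],               # e.g. primary on-call -> secondary -> manager
    notify: Callable[[str, str], None],        # hypothetical hook: page one responder
    is_acknowledged: Callable[[str], bool],    # hypothetical hook: check whether an ack exists
    ack_timeout_seconds: int = 300,
) -> str:
    """Page each responder in turn, moving down the chain whenever the
    acknowledgment window expires without anyone taking ownership."""
    for responder in escalation_chain:
        notify(responder, incident_id)
        deadline = time.time() + ack_timeout_seconds
        while time.time() < deadline:
            if is_acknowledged(incident_id):
                return responder               # someone has taken responsibility
            time.sleep(10)                     # poll for the acknowledgment
    raise RuntimeError(f"incident {incident_id} was never acknowledged; escalate manually")
```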

Incident Commander

An Incident Commander (IC), or Incident Manager, is the member of the IT team responsible for managing a coordinated response to a critical incident, especially one considered an emergency or crisis. The IC has ultimate control and the final say on all incident decisions. He or she is also responsible for bringing in the right personnel, escalating the incident to other teams as needed, and, ultimately, for the efficient and quick resolution of the incident.

Incident Lifecycle

The incident lifecycle is the duration of an incident from the moment it occurs to the time it is resolved. The post-mortem analysis and the work of fixing the underlying issue so the incident does not recur are not part of the incident lifecycle, but they are important adjacent steps that must be performed to avoid repeat events. Likewise, if a specific incident occurs regularly and a solution is known, the fix should be automated; that automation is not part of the incident lifecycle either, but it will shorten the lifecycle of future incidents.
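
As a sketch of what automating a known fix might look like, the following keeps a small registry mapping a recurring incident signature to a remediation routine; the signature names and the remediation itself are purely illustrative.

```python
from typing import Callable

# Hypothetical runbook registry mapping a recurring incident signature to an automated fix.
RUNBOOKS: dict[str, Callable[[], None]] = {}

def runbook(signature: str):
    """Register an automated remediation for a known, recurring incident signature."""
    def register(fix: Callable[[], None]) -> Callable[[], None]:
        RUNBOOKS[signature] = fix
        return fix
    return register

@runbook("disk-usage-high")                        # illustrative signature
def clear_temp_files() -> None:
    print("pruning /tmp and rotating old logs")    # placeholder for the real remediation

def try_auto_remediate(signature: str) -> bool:
    """Run the known fix if one exists; otherwise fall back to the normal incident flow."""
    fix = RUNBOOKS.get(signature)
    if fix is None:
        return False
    fix()
    return True
```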

MTTA (Mean Time To Acknowledge)

Mean time to acknowledge is a measure of how long it takes to acknowledge an incident. It reflects the efficiency and responsiveness of the responders and gives customers confidence that the enterprise knows its services are down and is working on them. As soon as the first acknowledgment is made, the relevant status channels, such as status pages, email alerts, and pager notifications, should be updated as well.

At a high level, MTTA is calculated by dividing the total time taken to acknowledge all incidents by the total number of incidents over the sampling period.
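
A minimal sketch of that calculation, assuming each incident record carries created_at and acknowledged_at timestamps (the field names are illustrative):

```python
from datetime import datetime, timedelta

def mean_time_to_acknowledge(incidents: list[dict]) -> timedelta:
    """MTTA = total time from alert to acknowledgment, divided by the number of incidents."""
    ack_times = [i["acknowledged_at"] - i["created_at"] for i in incidents]
    return sum(ack_times, timedelta()) / len(ack_times)

# Example: two incidents acknowledged after 4 and 10 minutes give an MTTA of 7 minutes.
incidents = [
    {"created_at": datetime(2023, 5, 1, 9, 0),  "acknowledged_at": datetime(2023, 5, 1, 9, 4)},
    {"created_at": datetime(2023, 5, 1, 14, 0), "acknowledged_at": datetime(2023, 5, 1, 14, 10)},
]
print(mean_time_to_acknowledge(incidents))   # 0:07:00
```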

MTTI (Mean Time To Innocence)

Mean time to innocence is a metric used to show that a team or person is not responsible for, or associated with, an incident. When an incident happens, it has become common practice to invite everyone deemed even remotely associated with it to the incident collaboration channel. This wastes a lot of time and resources, and it often makes it harder to identify the root cause or resolve the incident efficiently because there are "too many cooks in the kitchen," each offering advice, knowledge, and wisdom that is neither useful nor relevant.

Many organizations measure mean time to innocence (MTTI) so that teams or personnel who are not directly responsible, and cannot help solve the incident, can leave the incident collaboration channel. This lets the innocent parties stay productive in their regular jobs rather than waste time on an unplanned outage that is unrelated to them and about which they have no knowledge.

However, care should be taken when measuring this metric and asking teams to participate in the practice. The teams involved may start blaming each other to prove their innocence, and the responsible party may become defensive or deny responsibility altogether. While the blame game plays out, the collaborative mindset and the focus on customers and on resolving the unplanned outage can get lost.

MTTR (Mean Time To Resolution)

Mean time to resolve is the average time it takes to resolve an incident. Resolution covers identifying the incident, identifying the root cause, and fixing the incident; in other words, it is the time it takes to bring the service back to its normal mode of operation. Resolving the current incident does not guarantee that similar events will not happen in the future.
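
At a high level, MTTR is calculated the same way as MTTA: the total time to resolve all incidents divided by the number of incidents over the sampling period. A minimal sketch, again assuming illustrative created_at and resolved_at fields:

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents: list[dict]) -> timedelta:
    """MTTR = total time from detection to restored service, divided by the number of incidents."""
    durations = [i["resolved_at"] - i["created_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

# Example: incidents resolved after 30, 60, and 90 minutes give an MTTR of 60 minutes.
incidents = [
    {"created_at": datetime(2023, 6, 1, 8, 0), "resolved_at": datetime(2023, 6, 1, 8, 30)},
    {"created_at": datetime(2023, 6, 2, 8, 0), "resolved_at": datetime(2023, 6, 2, 9, 0)},
    {"created_at": datetime(2023, 6, 3, 8, 0), "resolved_at": datetime(2023, 6, 3, 9, 30)},
]
print(mean_time_to_resolve(incidents))   # 1:00:00
```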