Incident Management: Processes, Best Practices & Tools (2024)

Incident management is the process used by development and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.

At Atlassian, we define an incident as an event that disrupts a service or reduces its quality and that requires an emergency response. Teams who follow ITIL or ITSM practices may use the term major incident for this instead.

Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too: it’s running slowly and interfering with productivity and, worse yet, it poses the even greater risk of complete failure. Incidents can vary widely in severity, ranging from an entire global web service crashing to a small number of users having intermittent errors.

An incident is resolved when the affected service resumes functioning in its intended state. Resolution covers only the tasks required to mitigate impact and restore functionality.

The importance of incident management

Atlassian’s incident management values

Incident management is one of the most critical processes an organization needs to get right. Service outages can be costly to the business and teams need an efficient way to respond to and resolve these issues quickly. Teams need a reliable method to prioritize incidents, get to resolution faster, and offer better service for users.

When teams are facing an incident, they need a plan that helps them:

  • Respond effectively so they can recover fast.
  • Communicate clearly to customers, stakeholders, service owners, and others in the organization.
  • Collaborate effectively to solve the issue faster as a team and remove barriers that prevent them from resolving the issue.
  • Continuously improve to learn from these outages and apply lessons to improve a service and refine their process for the future.

Want to see how Atlassian handles major incidents? We’ve published our internal incident management handbook. Anyone is welcome to learn from it, adapt it, and use it however they see fit.

Types of incident management processes

Different types of companies tend to gravitate toward different types of incident management processes. No single process is best for all companies, so you’re likely to see various approaches across different companies.

Many teams rely on a more traditional IT-style incident management process, such as those outlined in ITIL. Other teams lean toward a Site Reliability Engineering (SRE) or DevOps-style incident management process.

IT incident management process

An incident management process helps IT teams investigate, record, and resolve service interruptions or outages. The ITIL incident management workflow aims to reduce downtime and minimize impact on employee productivity from incidents. Using templates designed to manage incidents, you can create a repeatable incident management workflow, which ensures teams log, diagnose, and resolve incidents—and have a record of their activities.

The ITIL framework is chiefly used by IT teams running services inside businesses. Typically, teams take what they need from ITIL, which covers almost every type of incident, issue, and process IT teams might face, and leave the rest. ITIL is great when teams need to focus on cultivating a culture of active troubleshooting. The prescribed processes help teams track incidents and actions in a consistent manner, which improves reporting and analysis, and can lead to a healthier service and a more successful team.

Steps in the IT incident management process

Identify an incident and log it

An incident can come from anywhere: an employee, a customer, a vendor, monitoring systems. No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it. These incident logs (i.e., tickets) typically include:

  • The name of the person reporting the incident
  • The date and time the incident is reported
  • A description of the incident (what is down or not working properly)
  • A unique identification number assigned to the incident, for tracking
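The four fields listed above map naturally onto a small record type. The following is a minimal sketch, not a prescribed schema: the `IncidentLog` class, the `log_incident` helper, and the `INC-00001` identifier format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory counter; a real ticketing system would assign IDs itself.
_next_number = count(1)


@dataclass
class IncidentLog:
    """One incident ticket holding the fields listed above."""
    reporter: str       # name of the person (or system) reporting the incident
    description: str    # what is down or not working properly
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                   # date and time the incident is reported
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_next_number):05d}"
    )                   # unique identifier used for tracking


def log_incident(reporter: str, description: str) -> IncidentLog:
    """Create a ticket, rejecting empty reporter or description values."""
    if not reporter.strip() or not description.strip():
        raise ValueError("reporter and description are both required")
    return IncidentLog(reporter=reporter, description=description)
```

Calling `log_incident("Dana", "Checkout page returns HTTP 500")` would return a ticket with a timestamp and a fresh identifier filled in automatically.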

Categorize

Assign a logical, intuitive category (and subcategory, as needed) to every incident. This helps you analyze your data for trends and patterns, which is a critical part of effective problem management and preventing future incidents.
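As a rough illustration of how categories support trend analysis, the sketch below counts incidents per category and subcategory. The dictionary keys (`category`, `subcategory`) and the function name `category_trends` are assumptions made for this example, not part of any particular tool.

```python
from collections import Counter


def category_trends(incidents):
    """Count incidents per (category, subcategory) pair, most frequent first."""
    counts = Counter(
        (incident["category"], incident.get("subcategory"))
        for incident in incidents
    )
    return counts.most_common()


# Hypothetical sample data for illustration only.
sample = [
    {"category": "network", "subcategory": "vpn"},
    {"category": "email", "subcategory": "delivery"},
    {"category": "network", "subcategory": "vpn"},
]
trends = category_trends(sample)
```

A pair that keeps rising to the top of this list is a candidate for problem management, since it may point to a shared underlying cause.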

Prioritize

Every incident must be prioritized. Start by assessing its impact on the business, the number of people who will be impacted, any applicable SLAs, as well as the potential financial, security, and compliance implications of the incident. Compare this incident to all other open incidents to determine its relative priority. As a best practice, define your severity and priority levels before an incident happens, making it simpler for incident managers to gauge priority quickly.
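One common way to define priority levels in advance is an impact-by-urgency lookup. The sketch below assumes that convention; the three-point `Level` scale, the `P1` to `P5` labels, and the additive formula are illustrative choices, and a real scheme would also weigh SLAs and financial, security, and compliance factors.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point scale used for both impact and urgency."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a label from P1 (highest) to P5."""
    return f"P{impact + urgency - 1}"


def rank_open_incidents(incidents):
    """Order open incidents by priority; ties keep their original order."""
    return sorted(
        incidents,
        key=lambda inc: priority(inc["impact"], inc["urgency"]),
    )
```

Because the table is fixed before anything breaks, an incident manager only has to judge impact and urgency during an incident, and relative priority against other open incidents falls out of the sort.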

Respond

  • Initial diagnosis: Ideally, your front-line support team can see an incident through from diagnosis to closure. If they can’t, the next step is to log all the pertinent information and escalate to the next-tier team.
  • Escalate: The next team takes the logged data and continues the diagnosis. If this team can’t diagnose the incident either, it escalates to the next team.
  • Communicate: The team regularly shares updates with impacted internal and external stakeholders.
  • Investigation and diagnosis: This continues until the nature of the incident is identified. Sometimes teams bring in outside resources or members of other departments to consult and assist with the resolution.
  • Resolution and recovery: In this step, the team arrives at a diagnosis and performs the necessary steps to resolve the incident. Recovery refers to the time it may take for operations to be fully restored, since some fixes (such as bug patches) may require testing and deployment even after the proper resolution has been identified.
  • Closure: If the incident was escalated, it is finally passed back to the service desk to be closed. To maintain quality and ensure a smooth process, only service desk employees are allowed to close incidents, and the incident owner should check with the person who reported the incident to confirm that the resolution is satisfactory and the incident can, in fact, be closed.
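The response steps above can be read as a small state machine with a guarded closure step. This is a sketch under stated assumptions: the state names, the `service_desk` role string, and the `Incident` class are hypothetical, and communication with stakeholders is left out for brevity.

```python
from enum import Enum


class State(Enum):
    LOGGED = "logged"
    IN_DIAGNOSIS = "in_diagnosis"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Allowed moves between states; anything else is rejected.
TRANSITIONS = {
    State.LOGGED: {State.IN_DIAGNOSIS},
    State.IN_DIAGNOSIS: {State.RESOLVED},
    State.RESOLVED: {State.CLOSED},
    State.CLOSED: set(),
}


class Incident:
    def __init__(self, incident_id: str) -> None:
        self.incident_id = incident_id
        self.state = State.LOGGED
        self.tier = 1  # support tier currently handling the incident

    def _move_to(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state

    def start_diagnosis(self) -> None:
        self._move_to(State.IN_DIAGNOSIS)

    def escalate(self) -> None:
        """Hand the incident to the next tier while diagnosis continues."""
        if self.state is not State.IN_DIAGNOSIS:
            raise ValueError("only incidents under diagnosis can be escalated")
        self.tier += 1

    def resolve(self) -> None:
        self._move_to(State.RESOLVED)

    def close(self, closed_by_role: str, reporter_confirmed: bool) -> None:
        """Close only via the service desk, after the reporter confirms."""
        if closed_by_role != "service_desk":
            raise PermissionError("only the service desk may close incidents")
        if not reporter_confirmed:
            raise ValueError("reporter has not confirmed the resolution")
        self._move_to(State.CLOSED)
```

Encoding the closure rule in `close` means an escalated incident cannot be quietly closed by the tier that fixed it; it has to come back through the service desk with the reporter's confirmation.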

DevOps and SRE incident management process

With a DevOps or SRE approach to incident management, the team that builds the service also runs it, and fixes it if it breaks. This approach has exploded in popularity alongside the growth of always-on cloud services, globally accessed web applications, microservices, and software as a service.

Increasingly, the software you rely on for life and work is not hosted on a server in the same physical location as you. It’s likely a web-accessed application deployed in a data center for thousands or millions of users around the globe. For teams tasked with running these services, agility and speed are paramount. Any downtime has the potential to affect thousands of organizations, not just one.

An advantage of the “you build it, you run it” approach is that it offers the flexibility agile teams need, but it can also obscure who is responsible for what and when. DevOps teams can be comfortable—and successful—with less structured development processes. But it’s best to standardize on a core set of processes for incident management so there is no question how to respond in the heat of an incident, and so you can track issues and report how they’re resolved.

Three beliefs of DevOps incident management teams

  • Take turns being on call: Rather than certain team members specializing in being on call, DevOps teams typically rotate through an on-call schedule where all members share the burden of possibly being woken at night to respond to an incident.
  • The engineer who built it is the best person to fix it: The central idea of the “you build it, you run it” ethos is that the people most familiar with the service (the builders) are the ones best equipped to fix an outage.
  • Build with speed, but practice accountability: When engineers know that they and their teammates are on the hook during outages, there’s added incentive to make sure you’re deploying quality code.

This approach assures fast response times and faster feedback to the teams who need to know how to build a reliable service.
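The "take turns being on call" idea can be sketched as a simple round-robin rotation. The function name `on_call_for`, the weekly default shift length, and the example roster are assumptions for illustration; real schedules also need overrides, time zones, and escalation rules.

```python
from datetime import date


def on_call_for(
    day: date,
    engineers: list[str],
    rotation_start: date,
    shift_days: int = 7,
) -> str:
    """Return who is on call on `day` in a fixed round-robin rotation."""
    if not engineers:
        raise ValueError("the rotation needs at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Number of whole shifts completed since the rotation began.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical roster for illustration only.
team = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
first_week = on_call_for(date(2024, 1, 3), team, start)    # "ana"
second_week = on_call_for(date(2024, 1, 8), team, start)   # "ben"
```

Because every engineer appears in the roster exactly once, the on-call burden is shared evenly over each full cycle.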

We outline a very DevOps-friendly approach to incident management in our Atlassian Incident Handbook.

Incident management tools

Incident management isn’t done with a tool alone; it takes the right blend of tools, practices, and people. Here are several of the most common tool categories for effective incident management:

  • Incident tracking: Every incident should be tracked and documented so you can identify trends and make comparisons over time.
  • Chat room: Real-time text communication is key for diagnosing and resolving the incident as a team. And it provides a rich set of data for response analysis later on.
  • Video chat: Video chat complements text chat. For many incidents, a team video call helps responders discuss findings and map out a response strategy.
  • Alerting system: A tool such as Jira Service Management integrates with your monitoring system and manages on-call rotations and escalations.
  • Documentation tool: A tool such as Confluence can capture incident state documents and postmortems.
  • Statuspage: Communicating status with both internal stakeholders and customers through Statuspage helps keep everyone in the loop.

Incident management topics

The Atlassian Incident Management Handbook

This handbook features the real incident management processes we've created as a global company with thousands of employees and over 200,000 customers.

Incident communication best practices

Incident communication is the process of alerting users that a service is experiencing some type of outage or degraded performance.

Incident response

Learn about the six key stages of incident response, incident types, and tools to streamline your processes for effective incident management.

On call

On call teams are rapidly evolving. Explore the pros and cons of different approaches to on call management.

Tools

Explore the key features of incident management software. Learn how to choose the right tools for effective incident response and seamless operations.

Postmortem

An incident postmortem, also known as a post-incident review, is the best way to work through what happened during an incident and capture lessons learned.

DevOps

For teams practicing DevOps, the Incident Management (IM) process focuses on transparency and continuous improvements to the incident lifecycle.

Featured tutorials

Tutorial

Incident communication

In this tutorial, we’ll show you how to use incident templates to communicate effectively during outages. Adaptable to many types of service interruption.

Tutorial

On call schedule

In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.

Article information

Author: Gregorio Kreiger
