The future of IT incident management, response, prevention (2024)

In the past, the team tasked with responding to technology incidents was almost always IT. Often a team sitting in a network operations center, or NOC, monitored systems and responded to outages. A vendor might have built the software, but deploying and operating it was the responsibility of the customer's IT Ops team. Today, with the proliferation of cloud services, the vendor builds the software and also deploys and operates it.

Yet incident management remains a core ITSM practice. And IT has a long history of developing guidelines, managing budgets, and carrying the full burden of diagnosing, fixing, documenting, and preventing major incidents.

Of course, as with anything in tech, the past is not necessarily a predictor of the future—and currently the practice of incident management is shifting. DevOps, SecOps, and architecture teams are getting more involved. New technologies and interconnected products have changed how we manage incidents. And mindsets, practices, and team structures are changing in order to keep up.

So, how is incident management shifting and what does that mean for the future of our roles, products, processes, and teams?

A move toward decentralization

Rewind five years and ask an IT team who was responsible for incident management. The answer you’d pretty much always get was “us.”

Ask the same question now and you’re likely to hear about not only IT, but also DevOps, SecOps, and architecture teams. Many organizations are slowly shifting toward the idea of “you built it, you run it.”

The clear benefits of this approach are that it takes pressure off the IT teams and speeds up response times by shifting responsibility to the people most familiar with the code. This minimizes downtime and maximizes team productivity. It also incentivizes good code. (If you’re the one waking up at 3 a.m. to resolve a bug, chances are you’ll be double- and triple-checking the code the next time it goes live to keep that 3 a.m. call from happening again.)

The challenge of this approach is that organizations still need some centralization. Leadership needs access to reports and documentation. Business stakeholders want updates. They want to see incident metrics like mean time to resolve and mean time to acknowledge. They expect clear incident updates, incident postmortem reports, and remediation work.
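As a rough illustration of the two metrics named above, the sketch below computes mean time to acknowledge and mean time to resolve from a list of incident records. The `Incident` class, its field names, and the sample timestamps are all hypothetical and not taken from any particular tool. Both metrics are measured from the moment the incident opened, which is one common convention but an assumption here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident opening and its acknowledgment."""
    seconds = mean((i.acknowledged - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident opening and its resolution."""
    seconds = mean((i.resolved - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: two incidents with different response times.
incidents = [
    Incident(
        opened=datetime(2024, 1, 1, 3, 0),
        acknowledged=datetime(2024, 1, 1, 3, 5),
        resolved=datetime(2024, 1, 1, 4, 0),
    ),
    Incident(
        opened=datetime(2024, 1, 2, 10, 0),
        acknowledged=datetime(2024, 1, 2, 10, 15),
        resolved=datetime(2024, 1, 2, 12, 0),
    ),
]

mtta = mean_time_to_acknowledge(incidents)
mttr = mean_time_to_resolve(incidents)
print(f"MTTA: {mtta}, MTTR: {mttr}")
```

Both functions raise `statistics.StatisticsError` on an empty list, so a real report would need to decide how to present a period with no incidents.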

For many companies that are decentralizing and doing it well, the answer to this challenge is technology that gives teams the autonomy to keep incident resolution nimble while still centralizing information to keep the business in the loop.

The slow road to decentralization

As with any big change that could disrupt workflows and surface unforeseen consequences, it makes sense that many organizations are approaching decentralization in baby steps.

They start by identifying a team that is a good cultural fit for a change like this and is managing a low-risk application or product. Then they move incident management for that team’s specific application or product to that team. They train them, implement an on-call schedule, and track their progress over time, asking questions like:

  • Have they improved recovery times?
  • What cultural barriers have they run up against?
  • What tools did the IT team need to put in place?
  • What processes did they need to communicate?
  • Are better system updates coming out of that team?
  • Has the number of incidents dropped?
  • If we decide to roll this decentralization out to other teams, what can we take away from this initial test run?
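The first and sixth questions above (recovery times and incident counts) lend themselves to a simple before-and-after comparison. The sketch below is one possible way to frame it, assuming recovery durations have been collected in minutes for comparable periods before and after the pilot. The function names and all numbers are made up for illustration.

```python
from statistics import mean


def summarize(recovery_minutes: list[float]) -> dict[str, float]:
    """Count incidents and average their recovery time for one period."""
    return {
        "incidents": len(recovery_minutes),
        # An empty period has no meaningful average; report 0.0 in that case.
        "mean_recovery_minutes": mean(recovery_minutes) if recovery_minutes else 0.0,
    }


def compare_periods(before: list[float], after: list[float]) -> dict[str, float]:
    """Return after-minus-before deltas; negative values mean improvement."""
    b, a = summarize(before), summarize(after)
    return {
        "incident_change": a["incidents"] - b["incidents"],
        "recovery_change_minutes": a["mean_recovery_minutes"] - b["mean_recovery_minutes"],
    }


# Made-up recovery times (minutes) for equal-length periods.
before_pilot = [120.0, 90.0, 150.0, 60.0]
after_pilot = [45.0, 75.0, 60.0]

result = compare_periods(before_pilot, after_pilot)
print(result)
```

Numbers like these only cover the quantitative questions. The cultural and process questions in the list still need conversations with the pilot team.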

These test cases provide a foundation for deciding whether to implement a “you built it, you support it” framework across the company and, if so, how to roll it out effectively across teams.

Decentralization means cross-team collaboration

This move toward decentralization also necessitates a move toward cross-team collaboration. If DevOps is involved in incident management, DevOps needs a seat at the table in IT incident management process meetings. If IT is still helping guide incident management practices, they need to be involved in postmortem reviews by other teams.

Each team brings their own strengths to the incident management table. IT teams are good at developing practices and documentation and following guidelines. DevOps teams are good at change and learning. SecOps can lend a security perspective.

To foster more collaboration across teams, companies doing this well are sharing information openly, fostering empathy across teams, getting rid of cross-team blame games, using chat to keep teams connected during incidents, and prioritizing incident reviews where everyone’s given a seat at the table.

The shift from reactive to proactive

ITIL guidelines typically treat incident management as a separate practice from incident prevention. Both are important pieces of the ITSM puzzle, but they don’t often happen in tandem.

The problem with this approach is that it keeps incident management in a reactive state. On-call employees are tasked with putting out fires, and as soon as the fire is out, they move on to the next one. The only goal in mind is recovery—getting the system back up and running.

But recovery isn’t the whole picture. And more IT teams are realizing and embracing this over time, folding prevention into the process of incident management and using metrics like mean time to resolve instead of mean time to recovery to judge their performance.

This approach is often called problem management, and its goal is to bring processes closer together: to make sure teams aren’t just responding to one fire and moving on to another, but that they respond, recover, and learn from the incident, applying those learnings to both the problem at hand and the larger product and service systems they’re managing.

Many enterprise IT organizations have a dedicated practice for problem management. They typically treat it as a separate process for a separate team. At Atlassian we advocate taking this one step further with a blended approach in which IT Ops and developer teams fold the problem management practice into their incident practices. This provides better visibility across the incident and ensures incident analysis doesn’t happen long after the incident itself.

Because, in the long term, there’s more value in preventing incidents than in responding to them quickly.

Staying the course with process and documentation

One of the challenges inherent in this shift to cross-team collaboration on incident management is that some teams are more relaxed than others about process and documentation.

This is one of the places where IT can provide oversight and significant value even as other teams take on management of their own products. Because nobody wants to take on a major incident bleary-eyed at 3 a.m. without a solid plan.

When folding teams into the incident management process, IT can help them answer the core questions that will determine that plan. For example:

  • What is your incident response?
  • What are the values you’ll follow?
  • How will you respond in case of an incident?
  • Where is the information you need for the critical systems you support? If it’s in multiple systems, how can you bring that information together and make it easily accessible to on-call experts?
  • Are your process and documentation collaborative and reviewable by the team?

Is your company culture ready for change?

This shift toward decentralization, collaboration, and a blending of incident and problem management requires more than simply re-distributing responsibilities and scheduling an IT pro to sit in on a DevOps postmortem. The key to success here isn’t in the technology or even the processes. It’s in creating an internal culture that supports those changes.

This is the part too many companies try to skip, yet it’s the foundation for a successful transition. So, what does a culture that supports decentralized, collaborative, future-focused incident management look like?

At Atlassian, we think the core components are:

Openness and information sharing

If teams don’t know and can’t access what other teams are doing, we lose opportunities for ah-ha moments that lead to better communication, processes, and products.

Customer-centered thinking

When we ask questions like “what’s really best for the customer?” sometimes the answers we come up with don’t jibe with our current practices. It takes an intentional customer focus to move us toward the kind of communication, process, and structural efficiencies that ultimately make our products better for customers.

Regular health checks

How is each team doing? How are individual team members feeling about things? What can the team improve on? What are they knocking out of the park? At Atlassian, we have a team playbook that helps us check the health of our teams and introduce them to new ways of working.

Empathy

If DevOps is pointing the finger at IT and IT is rolling its proverbial eyes at the more relaxed approach of DevOps, that’s not a recipe for collaboration. Fostering empathy and connections across teams is essential if we want them to communicate, innovate, and work together well.

Empowerment

Teams should be empowered to fix problems quickly and make decisions independently whenever possible. Individuals within those teams should feel empowered to speak up if they have a question, suggestion, or concern—no matter their position on the team or their years of experience.

When junior developers feel like they can raise a hand in meetings and flag an issue, even when someone more senior was responsible for that code, the result is innovative new ideas, improved processes, and bugs caught before they ship.

Author: Ray Christiansen
