Incident management for high-velocity teams
In the past, the team tasked with responding to technology incidents was almost always IT. Often a team sitting in a network operations center, or NOC, monitored systems and responded to outages. A vendor might have built the software, but deploying and operating was the responsibility of the customer's IT Ops team. Today, with the proliferation of cloud services, the vendor builds the software and does the deploying and operating.
Yet incident management remains a core ITSM practice. And IT has a long history of developing guidelines, managing budgets, and carrying the full burden of diagnosing, fixing, documenting, and preventing major incidents.
Of course, as with anything in tech, the past is not necessarily a predictor of the future—and currently the practice of incident management is shifting. DevOps, SecOps, and architecture teams are getting more involved. New technologies and interconnected products have changed how we manage incidents. And mindsets, practices, and team structures are changing in order to keep up.
So, how is incident management shifting and what does that mean for the future of our roles, products, processes, and teams?
A move toward decentralization
Rewind five years and ask an IT team who was responsible for incident management. The answer you’d pretty much always get was “us.”
Ask the same question now and you’re likely to hear about not only IT, but also DevOps, SecOps, and architecture teams. Many organizations are slowly shifting toward the idea of “you built it, you run it.”
The clear benefits of this approach are that it takes pressure off the IT teams and speeds up response times by shifting responsibility to the people most familiar with the code. This minimizes downtime and maximizes team productivity. It also incentivizes good code. (If you’re the one waking up at 3 a.m. to resolve a bug, chances are you’ll be double and triple checking the code the next time it goes live to keep that 3 a.m. call from happening again.)
The challenge of this approach is that organizations still need some centralization. Leadership needs access to reports and documentation. Business stakeholders want updates. They want to see incident metrics like mean time to resolve and mean time to acknowledge. They expect clear incident updates, incident postmortem reports, and remediation work.
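One way to picture that centralized view: each team keeps handling its own incidents, while a shared report rolls up the numbers stakeholders ask for. The sketch below is illustrative only. The `Incident` record, its field names, and the team names are assumptions made for the example, and it assumes mean time to acknowledge runs from open to acknowledgment and mean time to resolve runs from open to resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; the field names are assumptions for this sketch.
    team: str
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_minutes(durations):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def summarize(incidents):
    """Roll incidents up into one per-team report for stakeholders."""
    by_team = {}
    for inc in incidents:
        by_team.setdefault(inc.team, []).append(inc)
    return {
        team: {
            "count": len(items),
            "mtta_min": mean_minutes(i.acknowledged - i.opened for i in items),
            "mttr_min": mean_minutes(i.resolved - i.opened for i in items),
        }
        for team, items in by_team.items()
    }


# Made-up sample data: two teams, each owning its own incidents.
t0 = datetime(2024, 1, 1, 3, 0)
incidents = [
    Incident("payments", t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=45)),
    Incident("payments", t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=75)),
    Incident("search", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=30)),
]
report = summarize(incidents)
```

Here `report["payments"]` shows two incidents with a 10-minute mean time to acknowledge and a 60-minute mean time to resolve. The teams stay autonomous; only the reporting is shared.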
For many companies that are decentralizing well, the answer to this challenge is technology that does both: it gives teams the autonomy to keep incident resolution nimble, and it centralizes information to keep the business in the loop.
The slow road to decentralization
As with any big change that could disrupt workflows and surface unforeseen consequences, it makes sense that many organizations are approaching decentralization in baby steps.
They start by identifying a team that is a good cultural fit for a change like this and is managing a low-risk application or product. Then they move incident management for that specific application or product to that team. They train the team, implement an on-call schedule, and track progress over time, asking questions like:
- Have they improved recovery times?
- What cultural barriers have they run up against?
- What tools did the IT team need to put in place?
- What processes did they need to communicate?
- Are better system updates coming out of that team?
- Has the number of incidents dropped?
- If we decide to roll this decentralization out to other teams, what can we take away from this initial test run?
These pilots provide a foundation for deciding whether to adopt a “you built it, you run it” framework across the company and, if so, how to roll it out effectively across teams.
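For the on-call schedule mentioned above, a pilot team can start with something as simple as a weekly round-robin. This is a minimal sketch under that assumption; the `weekly_rotation` helper and the engineer names are hypothetical, and real schedules usually also need overrides, handoff times, and time zone handling.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, engineer) pairs, one engineer per week, round-robin."""
    for offset, engineer in enumerate(islice(cycle(engineers), weeks)):
        yield start + timedelta(weeks=offset), engineer


# Hypothetical three-person pilot team, scheduled for four weeks.
schedule = list(weekly_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 4))
for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

The fourth week wraps back to the first engineer, so everyone on the pilot team carries the pager equally often.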
Decentralization means cross-team collaboration
This move toward decentralization also necessitates a move toward cross-team collaboration. If DevOps is involved in incident management, DevOps needs a seat at the table in IT incident management process meetings. If IT is still helping guide incident management practices, they need to be involved in postmortem reviews by other teams.
Each team brings their own strengths to the incident management table. IT teams are good at developing practices and documentation and following guidelines. DevOps teams are good at change and learning. SecOps can lend a security perspective.
To encourage collaboration across teams, companies doing this well are sharing information openly, fostering empathy, getting rid of cross-team blame games, using chat to keep teams connected during incidents, and prioritizing incident reviews where everyone’s given a seat at the table.
The shift from reactive to proactive
ITIL guidelines typically treat incident management as a separate practice from incident prevention. Both are important pieces of the ITSM puzzle, but they don’t often happen in tandem.
The problem with this approach is that it keeps incident management in a reactive state. On-call employees are tasked with putting out fires, and as soon as the fire is out, they move on to the next one. The only goal in mind is recovery—getting the system back up and running.
But recovery isn’t the whole picture. And more IT teams are realizing and embracing this over time, folding prevention into the process of incident management and using metrics like mean time to resolve instead of mean time to recovery to judge their performance.
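To make the difference between the two metrics concrete, here is a small sketch. It assumes that "recovery" ends when service is restored and "resolve" ends when the follow-up prevention work is closed; the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incidents: (opened, service_restored, follow_up_closed).
opened = datetime(2024, 3, 1, 3, 0)
incidents = [
    (opened, opened + timedelta(minutes=30), opened + timedelta(days=2)),
    (opened, opened + timedelta(minutes=90), opened + timedelta(days=4)),
]


def mean_hours(pairs):
    """Average the (start, end) gaps and return the result in hours."""
    return mean((end - start).total_seconds() for start, end in pairs) / 3600


# Recovery stops the clock when the system is back up.
time_to_recovery_h = mean_hours((o, restored) for o, restored, _ in incidents)
# Resolve keeps the clock running until the prevention work is done.
time_to_resolve_h = mean_hours((o, closed) for o, _, closed in incidents)

print(time_to_recovery_h)  # 1.0
print(time_to_resolve_h)   # 72.0
```

A team measured only on the first number can look excellent while the same failure keeps coming back. The second number rewards finishing the learning and remediation work.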
This approach is often called problem management, and its goal is to bring processes closer together: to make sure teams aren’t just responding to one fire and moving on to another, but that they respond, recover, and learn from the incident, applying those learnings to both the problem at hand and the larger product and service systems they’re managing.
Many enterprise IT organizations have a dedicated practice for problem management. They typically treat it as a separate process for a separate team. At Atlassian, we advocate taking this one step further and using a blended approach in which IT Ops and developer teams fold problem management into their incident practices. This provides better visibility across the incident and ensures incident analysis doesn’t happen long after the incident itself.
Because, in the long term, there’s more value in preventing incidents than in responding to them quickly.
Staying the course with process and documentation
One of the challenges inherent in this shift to cross-team collaboration on incident management is that some teams are more relaxed than others about process and documentation.
This is one of the places where IT can provide oversight and significant value even as other teams take on management of their own products. Because nobody wants to take on a major incident bleary-eyed at 3 a.m. without a solid plan.
When folding teams into the incident management process, IT can help them answer the core questions that will determine that plan. For example:
- What is your incident response?
- What are the values you’ll follow?
- How will you respond in case of an incident?
- Where is the information you need for the critical systems you support? If it’s in multiple systems, how can you bring that information together and make it easily accessible to on-call experts?
- Are your process and documentation collaborative and reviewable by the team?
Is your company culture ready for change?
This shift toward decentralization, collaboration, and a blending of incident and problem management requires more than simply re-distributing responsibilities and scheduling an IT pro to sit in on a DevOps postmortem. The key to success here isn’t in the technology or even the processes. It’s in creating an internal culture that supports those changes.
This is the part too many companies try to skip, yet it’s the foundation for a successful transition. So, what does a culture that supports decentralized, collaborative, future-focused incident management look like?
At Atlassian, we think the core components are:
Openness and information sharing
If teams don’t know and can’t access what other teams are doing, we lose opportunities for ah-ha moments that lead to better communication, processes, and products.
Customer-centered thinking
When we ask questions like “what’s really best for the customer?” sometimes the answers we come up with don’t jibe with our current practices. It takes an intentional customer focus to move us toward the kind of communication, process, and structural efficiencies that ultimately make our products better for customers.
Regular health checks
How is each team doing? How are individual team members feeling about things? What can the team improve on? What are they knocking out of the park? At Atlassian, we have a team playbook that helps us check the health of our teams and introduce them to new ways of working.
Empathy
If DevOps is pointing the finger at IT and IT is rolling its proverbial eyes at the more relaxed approach of DevOps, that’s not a recipe for collaboration. Fostering empathy and connections across teams is essential if we want them to communicate, innovate, and work together well.
Empowerment
Teams should be empowered to fix problems quickly and make decisions independently whenever possible. Individuals within those teams should feel empowered to speak up if they have a question, suggestion, or concern—no matter their position on the team or their years of experience.
When junior developers feel like they can raise a hand in meetings and flag an issue, even when someone more senior was responsible for that code, the result is innovative new ideas, improved processes, and bugs caught before they reach production.