What makes a good incident commander

The challenge is being able to see the system as a whole and its individual parts at the same time. On top of that, the fact that these are complex systems only compounds the challenge of nailing the design and resulting delegation. Look for an IC who can accurately synthesize the nature of the system and the individual forces at play. Creative problem solving is key to incident collaboration. An effective IC can connect information or systems and creatively devise an action plan.

The path to resolution is often murky and hard to sort through. The capacity to visualize different scenarios can speed up resolution and make it more likely to succeed. To restate the obvious, incidents are stressful. They can last hours, sometimes days. They require mental and physical stamina and the ability to push through to results no matter how long you and the team have been at it.

A healthy dose of practicality is necessary. As the IC, you are faced with a multitude of information. You need to be able to separate the signal from the noise to make good calls. Common sense helps you accurately weigh the different possibilities to devise not only a successful solution but one that takes into account the resources, time, and complexity of the incident at hand. An outlandish solution, even one that would eventually lead to resolution, can come at massive cost: wasted resources and wasted engineering and customer time.

A great IC does not just make the call that eventually leads to mitigation but rather makes the call that minimizes risk and waste and maximizes impact, efficiency, and thoughtful velocity. This entails sorting important from unimportant, relevant from irrelevant. The bigger an incident, the more likely you are to have multiple teams working on a resolution. An IC oversees communication and makes sure everyone is on the same page. They should also keep conversations focused and brief to minimize time to resolution.

Incidents are high-stakes, high-stress events, and studies show that stressed-out people make worse decisions. The IC should be able and willing to pull highly stressed people off the incident team, talk the team down as needed, and consistently bring the focus back to the task at hand.

They should also, when possible, take any additional stress burden off their teams by heading off the steady stream of questions and panic coming from internal and external stakeholders. Once an incident has been resolved, the incident commander is responsible for the post-mortem process, including creating documents where teams can share their thoughts, planning post-mortem meetings, and making recommendations on how to prevent or lessen the impact of future incidents.

The core responsibilities of an incident commander are resource management, communication, and problem-solving. Anyone with these skills—from senior leadership all the way down to interns—can make a great incident commander.

Before you become an incident commander, most companies will have you shadow other ICs to learn the ropes. In these cases, the best practice is to quietly watch and learn and hold back any questions until the incident is resolved. Since incident commanders are responsible for guiding teams successfully through incidents, they should be well acquainted with best practices for both incident response and incident communication.

The better documented your process is before an incident, the easier it will be for the IC and teams to follow in the more intense, higher-stress environment an incident creates. Understanding team dynamics and the strengths and weaknesses of people on your teams leads to better delegation and faster incident resolution. Even during a major incident, team calls and Slack conversations can get off track.

The IC should be ready to stop tangents in their tracks and refocus the team on the task at hand. Sometimes all this takes is a quick verbal or written reminder. Sometimes it means pulling people off the team or bringing new people in.

The best ICs are even willing to remove the CEO or their boss from a call if that person is becoming a distraction. The best ICs are people who can stay cool and focused in a crisis. Once an incident is resolved, the IC should run a blameless postmortem to identify how the team can improve incident management and overall systems in the future. The best ICs not only guide incidents calmly toward resolution but also help the company learn from the incident and make improvements.

A good practice for larger organizations is to create an official Incident Commander rotation, with a group of individuals who are on call to support incidents that cross teams or hit customer-facing KPIs. Smaller organizations often appoint Incident Commanders more informally, but the main takeaway again is simply to ensure that there is a central point of coordination during incidents.
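For the larger-organization case, a rotation can be as simple as a weekly round-robin over a roster. The sketch below is illustrative only: the roster names, the start date, and the one-week shift length are all assumptions, not something the process above prescribes.

```python
from datetime import date

# Hypothetical roster and rotation start date (2024-01-01 is a Monday).
IC_ROSTER = ["alice", "bob", "chen", "dana"]
ROTATION_START = date(2024, 1, 1)


def on_call_ic(today: date, roster=IC_ROSTER, start=ROTATION_START) -> str:
    """Return the Incident Commander on call for the week containing `today`.

    Shifts are assumed to last exactly one week and to cycle through the
    roster in order.
    """
    weeks_elapsed = (today - start).days // 7
    return roster[weeks_elapsed % len(roster)]
```

A real rotation would also need to handle swaps, holidays, and escalation when the primary IC does not acknowledge a page. Those are left out here to keep the idea visible.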

There are many challenges and potential failure points, compounded by the fact that incidents are inherently extremely stressful situations. However, anyone can prepare by arming themselves with best practices to address key incident management challenges. Once an incident is detected, the first point of human intervention is typically when it pages out. When that happens, instinctively your first reaction may be to start working on the problem, log into systems, and pull up dashboards.

But the first thing you need to know is the customer impact: specifically, what the impact is to your customers, to the business, and to the bottom line. This is because the prime directive of incident response is to ensure that customers go through as little duress as possible. In other words, you want the incident to end as soon as possible. From there, you can work out the severity of the incident.

They have a lot in common, but the most important thing to keep coming back to is the customer impact. That is how you map the signals from your technical systems back to one of those severities, which helps you drive the appropriate organizational response.

The concept of the severity level is really about what response you want to get from your organization during the incident. Customer impact is also relative. For example, the Head of Sales might consider an incident a Sev0 because they're afraid of losing an account, even though the customer in question sees it as an annoyance rather than a major issue. Prioritization is essential; if every incident is a Sev0, no incident is a Sev0.

This also means that while ICs do make the call on severity with input from experts, severity levels are not worth debating, as they can always be changed after the fact. The time spent debating the right severity level is better spent on resolving the incident as quickly as possible to restore service and minimize customer impact.
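One way to make the mapping from customer impact to severity concrete is a small classification function. The sketch below is only an illustration: the `CustomerImpact` fields, the percentage thresholds, and the four-level scale are hypothetical, and each organization would define its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more urgent, matching the Sev0 naming in the text.
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class CustomerImpact:
    # All fields and the thresholds used below are hypothetical placeholders.
    percent_customers_affected: float  # 0.0 to 100.0
    core_workflow_blocked: bool        # customers cannot do their main task
    workaround_available: bool


def classify(impact: CustomerImpact) -> Severity:
    """Map observed customer impact to a starting severity."""
    if impact.core_workflow_blocked and impact.percent_customers_affected >= 50:
        return Severity.SEV0
    if impact.core_workflow_blocked:
        return Severity.SEV1
    if impact.percent_customers_affected >= 10 and not impact.workaround_available:
        return Severity.SEV2
    return Severity.SEV3


@dataclass
class Incident:
    title: str
    severity: Severity

    def reclassify(self, new_severity: Severity) -> None:
        # Severity can be changed after the fact, so the first call
        # only needs to be good enough to trigger the right response.
        self.severity = new_severity
```

The `reclassify` method reflects the point above: the initial severity is a starting position that the IC can revise as more is learned, not a decision to argue over while customers are affected.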

Another common challenge in incident management is identifying ownership. For example, if you discover that a service is running slowly and is starting to reject requests, how do you find out who owns that service, who is the right subject matter expert to page, and who to escalate to?

Sometimes you may have to dig into code repositories and track down who's done the most commits, then go back to the org chart and trace it down. This is one of those things you can start to solve ahead of the incident, but during an incident, it's critical to have just enough clues to be able to find the right, responsible person to help drive the incident forward. An important thing to note is that the owner in question is not only relative to the service identified as the problem, but also relative to other interdependent services that might be affected.

It's important, but not always possible, to have a chart or graph showing who's responsible for each service. In some organizations, given the complexity of microservice architectures, this can change frequently.
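Where such a chart does exist, it can be kept as data and queried during an incident. The following is a minimal sketch under stated assumptions: the service names, team names, and both dictionaries are invented for illustration, and a real registry would likely live in a service catalog rather than in code.

```python
from collections import deque

# Hypothetical registry: service -> owning team. None means no recorded owner.
OWNERS = {
    "checkout": "payments-team",
    "cart": "storefront-team",
    "inventory": None,
    "auth": "identity-team",
}

# Hypothetical dependency graph: service -> services that depend on it.
DEPENDENTS = {
    "auth": ["checkout", "cart"],
    "inventory": ["cart"],
    "cart": ["checkout"],
    "checkout": [],
}


def teams_to_page(failing_service: str) -> tuple[list[str], list[str]]:
    """Return (teams to page, services with no known owner).

    Walks breadth-first from the failing service through every service
    that depends on it, so the failing service's own team comes first.
    """
    teams: list[str] = []
    unowned: list[str] = []
    seen = {failing_service}
    queue = deque([failing_service])
    while queue:
        service = queue.popleft()
        owner = OWNERS.get(service)
        if owner is None:
            unowned.append(service)
        elif owner not in teams:
            teams.append(owner)
        for dependent in DEPENDENTS.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return teams, unowned
```

The second return value lists services with no recorded owner. Those are the cases where the IC falls back on the detective work described above, such as checking commit history and the org chart.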

Ownership can also be fluid, based on who wants to take ownership, who really cares about the service, or who is willing to take on responsibility for things like communicating with stakeholders. Members of our SRE team have recalled incidents where they resorted to Twitter to find cross-vendor folks or experts in the community who could help push the incident resolution forward.

Part of ownership is being resourceful in finding the people and resources online to get the problem solved. We define observability as how tooling and instrumentation come together to provide visibility into technical systems.

They work to identify and recruit the right people, with the right knowledge and skills, to formulate an effective team response. They ensure that all of the players have what they need to do their jobs; they minimize friction and promote clear communication. As a coordinator, the IC is the calm at the center of a storm—an antidote to panic and to reactive thinking.

If you remember one thing from this post, here it is: successful ICs focus on coordination. In practice, this means managing several flows. The flow of emotion. Incidents are breeding grounds for panic and reactive behavior. The sooner you recognize that the team is shifting into reactive mode, the sooner you can act to pull them back towards a calm, focused state of mind.

The flow of information. This is largely about understanding your participants: Who is in the room? What do they already know, and what do they not know that they care about? Do you need to page another team? Is there a domain expert who can solve a thorny problem? Does an engineer who just joined the incident response understand the current status and how they can help? Did you discover something new about the incident that may be important to communicate to customers?

Has it been a while since an engineer, who agreed to perform a critical task, gave a status report? When ICs view themselves as conduits—dedicated to getting the right information to the right people—solutions tend to appear more quickly. The flow of analysis. Now we know: bad things happen. True story! Context is very important when your main job involves coordination.


