Blameless postmortems of production issues



Generic Incident Management Framework


I have worked for quite a long time in my career as a production engineer, and people usually create unnecessary silos when they look at production support engineers.

Production support is often equated with customer service.
Well, to be very honest, the whole IT industry, and in fact every business in the world, is customer service; that is what makes it a business.

People pay you for one of three things:
1) You save their effort.
2) You save their time.
3) You save their money.

There have always been various myths around production support in general. One of them is that production support is an easy, timeboxed job.

An incident management tool has all the functionality needed to describe the issue that has been raised, but a crucial part of the process is to admit the failure.
Many times you face the customer directly, and you may have to face the rage they bring to the discussion because of the loss of their effort, time, or money. In that case, you want to look at the incident keenly and have a solution to resolve it in time.

There are certain rules you need to keep in mind when dealing with incidents in production:

1. Description of the incidents:
    
What exactly has been reported needs to be verified from a production perspective.
The reported description helps in grouping these incidents and identifying the affected functionality. If the incident touches any critical functionality in the system, it can easily be prioritized by the dev team. Considering the large number of tickets received in a day, it is always better to get all questions about an incident's description cleared up front.
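
To make the grouping idea concrete, here is a minimal sketch in Python of how incoming tickets could be bucketed by functionality, with critical functionalities surfaced first. The functionality names and the critical list are assumptions made up for illustration, not any specific tool's API.

from dataclasses import dataclass
from collections import defaultdict

# Functionalities whose incidents should reach the dev team first.
# The names here are assumptions made up for this example.
CRITICAL_FUNCTIONALITIES = {"payments", "login"}

@dataclass
class Incident:
    ticket_id: str
    description: str    # what exactly was reported, verified from production
    functionality: str  # which part of the system the report maps to

def group_and_prioritize(incidents):
    """Group incidents by functionality, listing critical groups first."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident.functionality].append(incident)
    # False sorts before True, so critical functionalities come first.
    return sorted(groups.items(),
                  key=lambda item: item[0] not in CRITICAL_FUNCTIONALITIES)

tickets = [
    Incident("INC-101", "Card payment fails with a timeout", "payments"),
    Incident("INC-102", "Profile photo does not refresh", "profile"),
]
for functionality, items in group_and_prioritize(tickets):
    print(functionality, [i.ticket_id for i in items])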

2. Description of the root cause:

Usually, enterprises align their application delivery teams into L1, L2, and L3 levels. L1 consists of client-facing professionals, L2 is brought in for a production fix or a hotfix, and L3 is usually the dev team. When an issue breaks the primary flow, the team is expected to go through the logs, have access to the code, and convey the root cause so that L2 can fix it on priority if it is something major.
Similarly, separate documentation of the issue along with its root cause provides valuable insight that can be used as a reference when dealing with a similar issue later. It guides new team members in tracing some of the primary causes that can break certain functionality, and it obviously saves time too.

Many organizations prefer to include the RCA (root cause analysis) in their email communication to give it more emphasis.
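
One lightweight way to keep that documentation consistent is to render the RCA note from a fixed template before it goes into the email. The sketch below is only an illustration of the idea; the fields and example values are assumptions, not any organization's actual format.

RCA_TEMPLATE = """\
Incident:   {ticket_id}
Summary:    {summary}
Root cause: {root_cause}
Owner:      {owner}
Note:       keep this with the ticket so similar issues can be traced later
"""

def build_rca_note(ticket_id, summary, root_cause, owner):
    """Render a root-cause note that can be pasted into the RCA email."""
    return RCA_TEMPLATE.format(ticket_id=ticket_id, summary=summary,
                               root_cause=root_cause, owner=owner)

print(build_rca_note("INC-101", "Card payments timing out",
                     "Connection pool exhausted after a config change",
                     "L2 on-call"))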

3. How was the incident stabilized or fixed?

You know the incident occurred and you have an idea of which piece of code probably failed, but that is not enough. Stakeholders do not have to be technically strong, and in most cases they are not. There is no real reason they should know about programming glitches; all they care about is how you are going to do damage control. If the fix needs to go into the sprint plan (if you are following agile), then what immediate workaround can you apply to contain the loss the incident has caused?
If it needs measured code changes, how are you going to prioritize them?
The goal is to give them assurance that the support team will take care of it, e.g. data backfilling, proper communication to customers, and changes in process until the fix goes to production.

4. Timeline of events and actions taken:

As a team, you could be very good at technology or triaging, but that does not make you a good auditor. As a consumer, what could be more frustrating than a delayed response and misleading answers to a query that has already exceeded its timeline? People love to be heard, especially when something has gone wrong at the service provider's end. As long as they are assured that the team is looking into it and will fix it as soon as possible, they are comfortable. So whenever we deal with incidents, it is helpful to record the timeline of events as well as the actions taken. This also prevents blunders when someone steps in at a later stage to resolve the issue.
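
A simple way to keep such a record is to append a timestamped entry for every event and action as it happens, so whoever steps in later can replay the history. Here is a minimal sketch using only the Python standard library; the ticket id and events are made up for illustration.

from datetime import datetime, timezone

class IncidentTimeline:
    """Append-only record of what happened and what was done, in order."""

    def __init__(self, ticket_id):
        self.ticket_id = ticket_id
        self.entries = []

    def record(self, event, action_taken=""):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "event": event,
            "action": action_taken,
        })

    def report(self):
        # Chronological summary for whoever picks up the ticket next.
        return "\n".join(f'{e["at"]}  {e["event"]}  ->  {e["action"] or "observed"}'
                         for e in self.entries)

timeline = IncidentTimeline("INC-101")
timeline.record("Payment timeouts reported by a customer")
timeline.record("Errors confirmed in production logs", "Escalated to L2")
timeline.record("Connection pool limit raised", "Monitoring recovery")
print(timeline.report())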

5. How were customers affected?

When a system flow breaks, it sends ripples through the business it represents. The banking and disputes domains, as well as SaaS businesses, could not agree more: the impact may be on customer connections, money, or some other form of loss. As the service provider, it is always best to know the impact, which also helps in understanding the seriousness of the incident. When it is conveyed clearly, people often come forward and try to get it fixed with collective effort.

6. Remediation and corrections:

When you involve the dev team in such an issue, it is important to convey the issue well and enable them to implement a solution that stops further damage to the environment.
The key is frequent communication with the stakeholders to make sure the issue is being taken care of.


Here are some of the rules for postmortem communication:

1. Admit failure:

You are confident about the system and the processes, and the QA team has done end-to-end testing. You may have the best developers in the world, but the system can still fail. It is best to accept the fact that systems can fail.
When you see something failing, just admit it; do not be overconfident.

2. Sound like a human:

Blunders happen; mistakes are part of building anything. Even though everything is automated, we have to digest the fact that the automation is set up by humans, and humans make mistakes.

3. Have a proper communication channel:

As discussed above, systems are bound to fail, so it is wise to be prepared for that in advance. Let your customers, stakeholders, and team know how you are going to stay in communication in case of a major incident.

Many organizations have separate channels for such issues; some raise a flag over email, where the flag colors are predecided according to the severity of the issue. For example, issues like broken primary functionality or a server being down can be assigned a red flag.
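
As a sketch of how that convention can be made explicit, the mapping below ties a flag color to its meaning and a notification list. The colors, meanings, and recipients are assumptions chosen for illustration, not a standard.

# Flag colors, what they mean, and who gets notified; all values are
# assumptions chosen for illustration.
SEVERITY_FLAGS = {
    "red":    {"meaning": "primary functionality broken or server down",
               "notify": ["stakeholders", "dev-team", "support-oncall"]},
    "orange": {"meaning": "degraded functionality, workaround available",
               "notify": ["dev-team", "support-oncall"]},
    "green":  {"meaning": "minor issue, no customer impact",
               "notify": ["support-oncall"]},
}

def announce(flag, summary):
    """Build the subject line and recipient list for the flag email."""
    details = SEVERITY_FLAGS[flag]
    subject = f"[{flag.upper()} FLAG] {summary} ({details['meaning']})"
    return subject, details["notify"]

subject, recipients = announce("red", "Card payments failing for all customers")
print(subject)
print("Notify:", ", ".join(recipients))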


4. Above all else, be authentic:

You do not have to hide anything when it comes to failure. Just be honest about what happened and how you are dealing with it; customers always value support that comes across clean and authentic when it comes to incident management.

These are some of the thoughts and learnings from my experience that have turned out to be very helpful guides throughout the process.


If you have similar systems for dealing with incidents and have come across an optimal process, I would love to hear from you.

Also, if you have any suggestions regarding my blog, drop a comment so I can do better next time.

till then...

Happy Reading..!

Thank you.
Aniket Kulkarni


