This is part 1 of a series.

As engineers we love building software. We love releasing software. We love seeing our software helping people. Diagnosing website issues is not always the first thing that comes to mind when we think about what we do day to day, but it is a real and critical part of our jobs.

Unfortunately, sometimes things go wrong. Sometimes wrong is “just a bit wrong”. Those cases can generally be filed under “bug - fix later”. Add it to the backlog and run it through a sprint. This post is not about “just a bit wrong”.

This post is about what could be referred to under the umbrella of outages. In short, we’ll define an outage as something that the business has judged as “needs to be fixed right now”. As such it should be prioritised over any other work (and maybe lunch too!). That could mean the whole site is completely inaccessible, or perhaps checkout functionality is erroring. It may mean incorrect information is being displayed. Heck, it may mean the seemingly-worthless-page-built-years-ago-by-the-CEO is down!

The point is - it’s time to get stuck in. Time to drop whatever you’re doing, probably sit on a daunting live call with a bunch of heavy-breathing directors who think they’re helping by being there, and figure out what the f* is going on.

Get the environment right

Before we even get to the technical investigation, it is obvious but important to ensure we put ourselves in the best position to find and fix the problem. It’s probably a whole post on its own so I won’t go into detail here, but in short the things to keep in mind, especially if everyone is on a call, are:

If you’re not being helpful - don’t try to help

It is a natural tendency to want to help, and this is especially true during the diagnosis or investigation phase. There may be representatives from many teams in the virtual war room. It is likely that one or maybe two teams are there as owners of the faulting systems. A general rule of thumb is that they probably know their system better than you do. In tech land, root causes range from missing commas to government attacks: the possibilities are endless! While it is great fun to have a think about what could be going on, doing so vocally can become distracting to the “core” team members.

Bosses: one for you too - other than once at the start, the teams don’t need updates about how critical this is - they know! Added pressure can cloud the mind when it needs to be focused. 

Accountability comes later

Yes, we get paid to do what we do. Wouldn’t be here otherwise. So yes, we should absolutely be held accountable for the correctness and performance of our systems. That goes for teams and individuals. That said, mid-incident it is not at all helpful to mention “oh Dave made a change to X” or “team Porcupine made a commit before our latest release”. The focus should be on what the data are telling us and on pinpointing a solution. Nothing more during the incident. Why? Well, for one it can lead you astray. Let’s say “Dave” has a reputation for creating problems in production. Hearing a mention of “Dave’s change” could send you down a rabbit hole trying to prove that his change caused the current problem. Data, data, data is what begs to be followed.

Once the dust has settled and users are happily on our site or app again, root cause analysis can (and should) begin. The changes that it may lead to are out of scope of this discussion.

Mitigation, where faster, comes before fixing

This still falls under environment because it is a mindset, and a simple one at that. 

Can we roll back? Can we divert traffic? Can a feature be temporarily switched off? Your users are the reason you get a paycheck. They don’t want to wait while you come up with an elegant code change to solve the problem. They just want to buy some dang tickets! Always keep the user front of mind and let their needs guide the immediate mitigation.
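
To make “temporarily switched off” a bit more concrete, here is a minimal sketch of a kill switch in Python. Everything in it is made up for illustration: the FLAGS store, the "recommendations" flag and the render_checkout and fetch_recommendations functions are assumptions, not part of any real system or library. The idea is simply that a non-essential feature sits behind a flag, so that flipping the flag gets users buying again while the real fix waits.

```python
# Hypothetical in-memory flag store. A real system would more likely read
# flags from a config service or database so they can be flipped without
# a deploy.
FLAGS = {"recommendations": True}


def is_enabled(flag: str) -> bool:
    # Unknown flags default to off, the safer choice mid-incident.
    return FLAGS.get(flag, False)


def fetch_recommendations(user_id: str) -> list:
    # Stand-in for the dependency that is currently failing in production.
    raise RuntimeError("recommendation service is down")


def render_checkout(user_id: str) -> dict:
    # Checkout is the critical path; recommendations are a nice-to-have.
    page = {"user": user_id, "can_buy": True, "recommendations": []}
    if is_enabled("recommendations"):
        page["recommendations"] = fetch_recommendations(user_id)
    return page
```

With the flag on, render_checkout fails along with the broken dependency. Setting FLAGS["recommendations"] to False lets checkout render without recommendations, which is the mitigation. Working out why the recommendation call fails can then happen without users waiting on it.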

Don’t panic

There is a skill involved in tackling production issues. A wonderfully satisfying skill. Take it all in. Learn, learn, learn. While an outage is not a great place to be, once we’re in one we may as well take advantage of it. Panic clouds the mind and therefore slows the mitigation process. Assume good intent of those around you (everyone just wants stuff to work) and dive right in. Let’s figure out how in the next part of this series.

Thanks for reading. Here’s a plug to our awesome BytesMatter front end monitoring solution!