This is part 2 in a series about diagnosing website issues. Hopefully it gives us a framework for working through website outages and issues constructively and efficiently, instead of in a state of stress or panic.

Thanks to Part 1, we know what it means to be in an outage. Everyone knows their place and we’re ready to figure out what on earth is going on. Now, obviously you’re not ACTUALLY reading this in the middle of an outage. If you are, go away and fix the problem! You’re reading this because you’ve been through a few and they have been stressful.  You’re looking for an approach that will make it slightly less painful next time (yes, there will be a next time. If there’s not then you’re probably not building anything).  What follows here is some advice about the kind of output you can extract from your system to help you out when there are problems.

As a general rule with regard to outages, we need to know a few things as quickly as possible.

  • WHEN something is going wrong
  • WHAT is going wrong
  • WHERE in our system it is going wrong

HOW and WHY are usually only found during, or perhaps after, the investigation. If we’re going to continue referencing the 5 W’s, then I don’t care about WHO - that is out of scope for this discussion.

Turning on the lights

In order to know what, when, and where things are going wrong, we first need to know IF or WHETHER things are going wrong.  This is where logs and operational metrics come into play.  These two, when done right, work together very powerfully to get us through issues.  As you continue reading, it’s good to think of operational metrics as the signposts pointing us towards the problem, and logs as the detailed information about the problem.

Now, you may be reading this thinking “Who needs metrics? I can just search the logs for patterns”.  The problem there is that you have to know what you’re searching for. You may manage it if you’re deeply familiar with the system, but logs produce a lot of noise and it still takes time to glean information from them. Graphed metrics can point you at the problem at a glance.

Keep in mind outages and issues can happen while you’re on holiday, and your team needs to be able to deal with them too. An absent savant is a useless savant. Also, the days of hero devs are over - that’s an archaic notion.

So now we’re convinced that having both logs and operational metrics is a good thing. Where to start? As far as levels of sophistication go, if you’ve got nothing in place already then start with logs. Keeping with the signpost analogy, we’ve still got a chance of finding our way when there are no signposts - it may just take a little longer.

Logs should cover things, well, worth logging. Log maintenance is one of those background tasks we do from time to time because, for whatever reason, the logs have become too noisy. That’s fine, and certainly don’t let it scare you away from adding new logs. The only advice I’d give is to be sensible about log levels. Ideally we’d track everything and simply filter by log level when we need to, but in practice we may be wasting a lot of money and network bandwidth by flooding the system with DEBUG messages. This is generally why we find production systems logging at ERROR level only.

So if we’re trying to cover things worth logging, start with points like catches on network failures and bad data. And log detail: if it’s worth logging then it’s worth logging completely. If something has gone wrong, a reasonable but somewhat undesirable outcome is having to do a release with more logging, so try to preempt that. It is better to go back and tame noisy logs than to not have the good ones in the first place.
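To make that concrete, here’s a minimal sketch of logging “completely” at a catch point. Everything here - the fetchOrders() function, the example URL, and the shape of the logError() helper - is an illustrative assumption rather than a recommendation for any particular library.

  // A minimal structured "logger" - a stand-in for whatever you actually use
  function logError(message: string, detail: Record<string, unknown>): void {
      console.error(JSON.stringify({ level: "ERROR", message, ...detail }));
  }

  async function fetchOrders(customerId: string): Promise<unknown> {
      const url = `https://api.example.com/customers/${customerId}/orders`;
      try {
          const response = await fetch(url);
          if (!response.ok) {
              throw new Error(`Request failed with status ${response.status}`);
          }
          return await response.json();
      } catch (error) {
          // Log everything we'd want mid-outage: what we were doing,
          // who we were calling, and what actually came back
          logError("Failed to fetch orders", {
              url,
              customerId,
              reason: error instanceof Error ? error.message : String(error),
          });
          throw error;
      }
  }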

Metrics are signposts

The operational value that metrics bring cannot be overstated: they tell us when something has fallen over, and they highlight when something else is about to. Metrics open the door to us being in control of our (potentially complex) systems. The alternative is to just hope things are working, and clearly that is not acceptable.

That said, we can be smart about metrics so that they help us out even more when in a bind.

Consider the following:

  try {
      makeNetworkRequest()
  } catch (error) {
      // Count the failure, whatever the cause
      addFailureMetric()
  }

The “addFailureMetric” above may point to:

  • A complete failure of the service we’re calling
  • Intermittent failures that the user doesn’t actually see
  • 400s, 500s, 401s, who knows!

Now don’t get me wrong, if the code is firing a metric (or even better, if we’re externally building a metric off a log stream) then that’s a hundred miles ahead of not having anything there. If you don’t have metrics - get some!  They’ll still help you get to the problem, albeit potentially slowly.  During outages within complex systems we’re in a stronger position if we pre-arm ourselves with meaningful information. That may help us find issues within our own code faster, or it may give us something meaningful to pass on to another team.

Whenever we look to implement or improve operational metrics and logs in our systems, we’re really just wanting something that gives us:

  • A broad indicator that something is wrong
  • A pointer to where to look in the code

The first point should always be actionable.  If you ever look at a metric and say “oh, that happens all the time, ignore it” then delete that metric - it is an operational norm and not worth knowing.  If your reaction to that is a nervous “but when it really shoots up I’ll pay attention”, then make sure the metric only fires in a situation where you would act on it. This is obvious stuff, but it is incredible how we keep repeating history with noisy metrics.

As for the second item, a “pointer to where to look”, this is what gets us out of a bind quickly.  For example, instead of our general failure metric above, we could have:

  try {
      makeNetworkRequest()
  } catch (error) {
      // Full detail goes to the log
      log(error)
      // A separate metric per failure class
      switch (error.statusCode) {
          case 400:
              addMetricSomeoneSentABooboo();
              break;
          case 500:
              addMetricLolItsThemNotUs();
              break;
          // etc
      }
  }

This way we can consolidate all the metrics downstream and still have an overall metric covering all failures, but now we also get a bit more information about WHY the failures happened.

Of course we could go even deeper and parse out failure reasons behind each code.  Just keep in mind it’s a balance between clarity and cardinality. So again, put the detail in the log!
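As a rough sketch of that balance, using the same illustrative helper names as the snippets above (none of them are a real metrics library’s API): tag the metric with something bounded, like the status code class, and keep the unbounded detail in the log.

  // Stand-ins for the helpers used in the sketches above
  declare function makeNetworkRequest(): void;
  declare function addFailureMetric(tags: { statusClass: string }): void;
  declare function log(error: unknown): void;

  try {
      makeNetworkRequest();
  } catch (error: any) {
      // Bounded tag: only a handful of possible values, safe for a metric
      addFailureMetric({ statusClass: `${Math.floor(error.statusCode / 100)}xx` });
      // Unbounded detail (URLs, messages, payloads) belongs in the log
      log(error);
  }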

Pro tip - if you’re already logging but don’t have metrics, you can derive metrics from log messages downstream. Some will argue it’s a better strategy to do this regardless.
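For example, a very rough sketch of deriving failure counts from a log stream might look like the following. It assumes structured JSON log lines with “level” and “statusCode” fields, and prints Prometheus-style counters at the end; the field names and output format are purely illustrative.

  // Count failures by status code class from JSON log lines on stdin
  import * as readline from "node:readline";

  const counts = new Map<string, number>();

  const rl = readline.createInterface({ input: process.stdin });

  rl.on("line", (line) => {
      try {
          const entry = JSON.parse(line);
          if (entry.level === "ERROR") {
              const bucket =
                  entry.statusCode ? `${Math.floor(entry.statusCode / 100)}xx` : "unknown";
              counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
          }
      } catch {
          // Not a structured line; skip it rather than break the pipeline
      }
  });

  rl.on("close", () => {
      // In a real setup these would be pushed to a metrics backend on a schedule
      for (const [bucket, count] of counts) {
          console.log(`request_failures{status_class="${bucket}"} ${count}`);
      }
  });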

Let’s diverge for a second. For a fun, subtle variation on the above, consider this change:

  try {
      makeNetworkRequest()
  } catch (error) {
      addFailureMetric()
      retry()
  }

The addition of the “retry()” has the potential to make the failure metric too noisy. In this case the failure metric is, on its own, NOT a reason to get out of bed. We’d need to be a bit smarter about it, perhaps separating it into something like:

  let retryCount = 0;
  try {
      makeNetworkRequest()
      retryCount = 0;
  } catch (error) {
      if (retryCount < 5) {
          // A retry is worth counting, but not worth waking anyone up for
          retryCount += 1;
          addRetryMetric();
          return retry();
      }

      // We've exhausted our retries - this is the real failure
      addFailureMetric();
      throw error;
  }

It can be argued here that “addRetryMetric()” is a nice-to-have, and “addFailureMetric()” is a must-have. The former may let us know that a dependency is starting to buckle before something really bad happens.

Visibility

It’s great that we now have logs and metrics, but if said logs and metrics are sitting in a file on the server then they’re pretty useless. For this post we won’t cover how to gather logs and metrics centrally; let’s just stress that it is absolutely critical to have them GATHERED and SURFACED in a central place. Logs need to be easily accessible when needed, and metrics need to be visible at all times in the form of a dashboard. (If you truly believe your alerting means you don’t need a dashboard then you’re missing out - alerts are tuned and changed, and sometimes that means we miss blips in the system. Dashboards are also wonderful conversation starters for everyone. You’re winning if your non-technical boss can come up to you and ask why something looks like it is spiking.)

Alerting

We are human. We miss things. We’re in meetings. Heck, some of us sleep! Let computers do what they do well for this - watch stuff and tell us if something changes. It’s not more complicated than that.

Alerting can be the difference between being told by users that something is wrong, and fixing it (or at worst starting to fix it) before it frustrates too many of them.

Start somewhere, anywhere. You absolutely will not get alerting right immediately, and you absolutely must have alerting in place.

You’ll soon get annoyed when your phone pings you every 2 minutes. Don’t let this discourage you, and please don’t just start ignoring these alerts - ignored alerts are absolutely pointless. Just like app push notifications, operational alerts need to be timely and actionable. Implementing good (i.e. timely and actionable) alerts takes time, and that needs to be acknowledged. The only advice I’d give here is to have some kind of sandbox mechanism for newly implemented alerts. That may mean a different alerting channel, or it may mean only alerting during working hours. But take the time to tune each alert so that when it goes off, you actually act.
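To give a flavour of what that sandboxing could look like, here’s a hedged sketch. The channel names, the notify() stub, and the 9-to-5 working-hours window are all assumptions for illustration, not any particular alerting tool’s API.

  interface Alert {
      name: string;
      message: string;
      tuned: boolean; // has this alert proven itself timely and actionable?
  }

  function notify(channel: string, message: string): void {
      // Stand-in for whatever actually pings people (chat, pager, email)
      console.log(`[${channel}] ${message}`);
  }

  function routeAlert(alert: Alert, now: Date = new Date()): void {
      const workingHours = now.getHours() >= 9 && now.getHours() < 17;

      if (alert.tuned) {
          // Trusted alerts go straight to the people who will act on them
          notify("#on-call", `${alert.name}: ${alert.message}`);
      } else if (workingHours) {
          // New alerts land in a sandbox channel, and only while we're awake
          notify("#alerts-sandbox", `${alert.name}: ${alert.message}`);
      }
      // Outside working hours an untuned alert stays quiet until it earns trust
  }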

Tags: #frontend #metrics