June 10, 2021

Waterfalls made of RUM


Debugging frontend problems should be simple, but in reality it is not always straightforward. It can be a stressful exercise, because diagnosis usually happens during an outage of some sort. Outages mean users aren’t happy, or we’re losing money or users, or both. So we’re stressed, and now we have to combine familiarity with the system with a logical, often mechanical, process of looking for clues and excluding suspects until a root cause is found. Of course, a bit of luck can help us out too.

Generally, though, we tend to look for certain patterns. Is the page loading at all? Does it become unresponsive after some action? Which browsers are affected? Are Ajax calls successful?

The list goes on and on (hopefully not ad infinitum). Successfully navigating site outages depends on how quickly and effectively we can surface the information needed to find these patterns. One feature we’re very proud of here is waterfall samples from real user data! Being able to pull up a real page load waterfall from a slow load very quickly is unbelievably powerful.

As an example, here we can see a really slow page load.

Slow page loads over time

That’s generally a very slow load time. This is interesting information, but we can still dig deeper to find out what is causing the poor performance.

Clicking on one of the datapoints takes us immediately to a real user sample of that load for the metric we’re looking at - a waterfall made of RUM:

Waterfall showing slow time to first byte

All the calls, all the timings. It’s like turning on a flashlight in a dark room.

We can also see the “session summary” box at the top of the page, which contains some metadata about the request:

Session summary showing device type is mobile

This shows, among other things, that the page was loaded on a mobile device from the United Kingdom.

Back to the investigation. In this case we can immediately see a bunch of waits - the time from the request for the resource being sent until the first byte is received. We can also see that the waits are limited to our own domain. External domains appear to be unaffected.

A lot of these waits are for static resources.
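To make the reading of the waterfall concrete, here is a minimal sketch of the same check done by hand. It assumes each waterfall row is available as a plain object whose field names mirror the browser's `PerformanceResourceTiming` entries (`name`, `requestStart`, `responseStart`). The helper names, the 500 ms threshold, the extension-based test for static resources, and the sample URLs are all hypothetical and chosen for illustration; this is not how our product computes the view.

```typescript
// Field names mirror the browser's PerformanceResourceTiming entries.
// The sample shape and every helper below are hypothetical.
interface ResourceSample {
  name: string;          // resource URL
  requestStart: number;  // ms: when the request was sent
  responseStart: number; // ms: when the first byte was received
}

interface WaitSummary {
  count: number;
  totalWaitMs: number;
  maxWaitMs: number;
}

// Wait = time from the request being sent until the first byte is received.
// Note: in browsers, cross-origin entries served without a Timing-Allow-Origin
// header report both values as 0, so their wait shows up as 0 here.
function waitMs(sample: ResourceSample): number {
  return Math.max(0, sample.responseStart - sample.requestStart);
}

// Pull the host out of an absolute URL with a regex, so the sketch does not
// depend on any runtime-specific globals.
function hostOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  const host = match ? match[1] : undefined;
  return host ? host.toLowerCase() : "";
}

// Assumption: a resource counts as "static" based on its file extension alone.
function isStatic(url: string): boolean {
  const path = url.split(/[?#]/)[0] || "";
  return /\.(js|css|png|jpe?g|gif|svg|webp|woff2?)$/i.test(path);
}

// Split waits into own-domain and external, and list slow static resources
// served from our own domain.
function summarizeWaits(
  samples: ResourceSample[],
  ownHost: string,
  slowThresholdMs: number = 500 // arbitrary cut-off for this sketch
): { own: WaitSummary; external: WaitSummary; slowStatic: string[] } {
  const own: WaitSummary = { count: 0, totalWaitMs: 0, maxWaitMs: 0 };
  const external: WaitSummary = { count: 0, totalWaitMs: 0, maxWaitMs: 0 };
  const slowStatic: string[] = [];

  for (const sample of samples) {
    const wait = waitMs(sample);
    const isOwn = hostOf(sample.name) === ownHost.toLowerCase();
    const bucket = isOwn ? own : external;
    bucket.count += 1;
    bucket.totalWaitMs += wait;
    bucket.maxWaitMs = Math.max(bucket.maxWaitMs, wait);
    if (isOwn && wait >= slowThresholdMs && isStatic(sample.name)) {
      slowStatic.push(sample.name);
    }
  }
  return { own, external, slowStatic };
}

// Made-up sample resembling the pattern described above.
const samples: ResourceSample[] = [
  { name: "https://www.example.com/", requestStart: 10, responseStart: 1810 },
  { name: "https://www.example.com/static/app.js", requestStart: 1850, responseStart: 3050 },
  { name: "https://www.example.com/static/site.css?v=3", requestStart: 1850, responseStart: 2750 },
  { name: "https://cdn.example.net/lib.js", requestStart: 1860, responseStart: 1900 },
];

const report = summarizeWaits(samples, "www.example.com");
console.log(report);
```

With this made-up sample, both static files on the own host are flagged as slow, while the single external request waits only 40 ms. That is the same shape of evidence the waterfall gives at a glance.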

Having looked at waterfalls from other slow loads in this time period, we noticed that the pattern was very similar, and it turned out our edge caching setup was incorrect.

This became low-hanging fruit for us to pick. The investigation didn’t take long at all, thanks to the simple surfacing of some pretty powerful information.

Thanks for reading. Our goal is to surface helpful information to make debugging and investigating issues a quick and rewarding experience. Give us a try - we have a generous free tier!