The Structure of your Success

When you’re talking about the performance of your site, the first advice you’ll get is to measure it. Measure the round trip time of the average user’s TCP connection. Measure the template rendering time. Measure the response time of the memcached cluster. Last time, we talked about the myriad ways you can visualize that data. Timing measurements aren’t everything, though. This time, let’s talk about the other side of the coin. Let’s talk about the structure of a request to your web app.

The Whole Sandwich: Performance by Layer

Structure is simply the call graph of your application. Complex structure is what makes distributed apps hard to debug and even harder to optimize. Each trace has its own structure, and the relationships within that graph can make or break the performance of a web app.

When measuring performance, people typically throw away structural information and keep only timing information.  But in order to answer sophisticated questions about web application performance, it helps to know about both the performance and structure of your app.  To make this concrete, let’s track down a problem in a PHP webapp using Tracelytics.

 stacked area chart

(click to enlarge)

We’ll start at the app page, looking at the stacked area chart. Remember that this visualization shows us the average request time, broken down by the time spent in each layer. The first thing to notice here is that the performance is wildly inconsistent. Secondly, the variation seems to be largely in PHP and php_mysql, where we have client-side instrumentation of MySQL queries. The largest spikes here put the response time between 200 and 500 ms, but more typically 50 to 100ms. Nothing too surprising here yet — the variance is a bit worrying, but with fewer than 9,000 traces per week we don’t expect to see the most consistent measurements.

Perhaps this variance is due to poorly performing queries–why don’t we take a look at an individual layer of our stack next?

Examining Layer Performance

Let’s take a look at the queries the app is running by drilling down on the php_mysql layer, where we might expect to find some slow queries.  You might do this with the slow query log, but we can get more info out of a heatmap:

heatmap

(click to enlarge)

What we find is something a little less obvious.  Recall that the heatmap shows us the distribution of calls, broken down by duration, over a period of time–like a bunch of histograms next to each other. From the stacked area chart, we saw a lot of time spent in php_mysql, but this graph shows that nearly every call (up to 98%, the default cutoff) completes faster than 12ms, with the majority returning faster than 2ms. Even more striking, the heatmap is showing a total count of more than 2 million calls — which might  sound like a lot of queries per trace given the 9,000 traces we saw in the stacked area chart.  Where are they all coming from?

Structured Data to the Rescue 

trace view

(click to enlarge)

By now, what’s going on may be obvious, but let’s examine the trace view, a visualization of the structure of an individual request, to confirm our intuition.  This particular request actually made over 1600 calls to MySQL, most of which were very similar. While each query examined individually is fast, well-indexed, and nicely behaved, the overall request is inconsistent and slow. This is a classic case of putting a query inside of a loop. If we instead retrieved all the needed data at once, we could iterate over it much faster, saving a lot of overhead in round trips and query execution.

In this case, the structure of the request unnecessarily introduced both an absolute delay and an increase in variability to the page load time. Each layer, examined individually, was performing well, but that’s not enough to guarantee good performance. The structure of the application and the quality of the code play just as much of a role as DB tuning and scaling metrics.

“The key to performance is elegance, not battalions of special cases.”
– Jon Bentley and Douglas McIlroy