Here at Tracelytics, we love it when a customer thanks us for helping them track down a hard-to-diagnose performance problem. Over the past year, as we’ve helped more customers, we’ve noticed a few common patterns pop up, so we’re blogging about a few of them and their solutions.
While every web stack is unique, once you start scaling there is often low-hanging fruit waiting to be plucked (and fixed), problems that nearly everyone eventually encounters. They may not be glamorous, but they are common, and easy to fix once diagnosed.
This week, we’ll break down two common classes of performance problems we see in typical web apps — one for ops engineers, one for developers — and general approaches to solving them.
The underprovisioned backend
You’re getting some publicity and seeing more traffic, but your app is getting slower, and you’re not sure why. You log into Tracelytics and see a graph like the one below. What does it mean? Why do requests seem to be spending so long in Apache or Nginx when you suspect it’s your app that’s slow? Shouldn’t they be spending all that time in your app server backend (e.g., PHP, Rails/Unicorn, WSGI, etc.)? Or maybe your app isn’t slow at all, and there’s something wrong with your web server?
That growing amount of “web server time” indicates that your requests are cooling their heels, queued up in front of your app server and waiting to be processed. Your backend can only handle a certain number of requests per second, and that rate is determined by two important factors: the average latency of individual requests, and the number of workers (threads, processes, server instances) provisioned for your backend.
Many teams spend a great deal of time on the former concern (writing clean, well-performing code, worrying about database indexes, and so on) but then get bitten by the latter when they start to scale.
Worse yet, real-world user traffic is highly variable, making it hard to tell whether the slow performance you saw during a traffic spike was the fault of your code, or of a misconfigured deployment or underprovisioned backend. One useful feature of Tracelytics in this scenario is our layer breakdown, which charts web server and app layer latencies over time to help you pin the blame accordingly.
Of course, it’s always a combination of both factors, but for many teams with scaling challenges, the simplest, lowest-hanging performance mistake is simply not provisioning workers fast enough to meet user demand. Do the math: if you have, for example, just 8 Rails Unicorn workers available, and you know that your dynamic pages average ~200ms per request, then your deployment can support at best 8 / 0.2s = 40 requests per second.
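That back-of-the-envelope math is worth writing down explicitly. A minimal sketch, using the example figures above (8 workers, ~200ms per request — the function name is ours, not any particular tool's API):

```python
def max_throughput(workers, avg_request_seconds):
    """Upper bound on requests/second a fixed worker pool can serve.

    Each worker handles one request at a time, so the pool's best case
    is (number of workers) / (average seconds per request).
    """
    return workers / avg_request_seconds

# 8 Unicorn workers, ~200ms average per dynamic page:
rps = max_throughput(workers=8, avg_request_seconds=0.200)
print(rps)  # 40.0 requests/second, at best
```

Real throughput will be lower still, since this ignores web server overhead, slow clients, and variance in request times.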
Without a well-provisioned backend, even if most of your pages would load right away under typical conditions, a sudden spike or even a slow-growing, unnoticed rise in demand could quickly eat up all available workers. Requests start to queue up in the web server or load balancer before they ever make it to the app layer, and users experience slow page load times or timeout errors.
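To see why queueing hurts so quickly, here's a toy simulation (the arrival rate is an invented figure; the 40 requests/second capacity is the 8-worker example from this post). Once demand exceeds capacity, the backlog — and with it, user-visible wait time — grows every second the spike lasts:

```python
capacity_rps = 40   # what the 8-worker example backend can serve per second
arrival_rps = 50    # a modest spike: 10 req/s beyond capacity

backlog = 0
for second in range(60):
    # Each second, 10 more requests arrive than the workers can drain.
    backlog += arrival_rps - capacity_rps

print(backlog)                 # 600 requests queued after one minute
print(backlog / capacity_rps)  # ~15 seconds of wait for the newest arrival
```

Even a 25% overload compounds into multi-second waits within a minute, which is why a slow-growing, unnoticed rise in demand can feel like a sudden outage.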
It’s tricky to pick the right number of processes & threads that maximize concurrency on each app server without exceeding resource limits (CPU, memory). There are different operational concerns for every kind of backend and server environment — which are outside the scope of this article — but it’s not uncommon to find that the default configuration settings distributed with your server software are hugely inefficient without tuning.
But in the face of a 10x traffic spike, all that tuning won’t necessarily help if you haven’t provisioned well in advance, or can’t quickly add capacity (more servers) when needed. Cloud services like Amazon’s EC2 Auto Scaling are a great way to schedule or trigger capacity increases and decreases to meet user demand. And don’t forget to exercise your stack with load testing, too; the screenshot above was actually taken from Dan’s post on load-testing the Reddit source.
The looped query
You wrote some code that looped over a set of related items — for example, a user’s comments or tags — and added a database query or service request inside the loop. Maybe you were working inside a loop in a template file, and just needed to grab a little more metadata inside that loop, so you called back to your app from the template.
You may have thought at the time: “But this query to grab the current tag’s last-updated date will return so quickly! It’s on a table with a great index! How could this be a performance problem later?” Or maybe you were following 37 Signals’ advice to “scale later” and just wanted to get it working quickly.
At scale, it’s not just about individual query performance: all those extra round-trips to the database really hurt once you hit hundreds or thousands of items. Even if your database can fire back a response to each query in, say, 1ms, running 1,000 per-item lookup queries still adds a full second of latency. Add in the time it takes your DB driver or ORM to format query arguments, send them over the network, and wait for and process each response, and you could be adding several more seconds of unnecessary latency and extra work.
The screenshot above is taken from a trace of a WordPress page using a “comment karma” plugin that issues a SQL query for each comment. While each query takes only a millisecond or less, running 708 of them in sequence adds a couple of seconds of MySQL-related time, and the back-and-forth legwork in PHP to process the request & response adds more time, all for a total of four seconds to load all those comments!
Move that query out of the loop! Grab all the values you need (e.g., all of a user’s tags, comments, or friends) with one big request, or in fixed-size batches. Then, once your app starts filling up with real data, you’ll be able to handle it gracefully. And double-check your template code for per-item requests hiding inside for loops.
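As a sketch of the fix, here's a self-contained Python example using an in-memory SQLite database (the `comment_karma` schema and the karma values are invented for illustration, loosely mirroring the WordPress plugin above). The first version issues one query per comment; the second fetches the whole batch with a single `IN` query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment_karma (comment_id INTEGER PRIMARY KEY, karma INTEGER)")
conn.executemany(
    "INSERT INTO comment_karma VALUES (?, ?)",
    [(i, i % 5) for i in range(1, 709)])  # 708 comments, like the trace above

comment_ids = list(range(1, 709))

# Anti-pattern: one round-trip per comment (708 separate queries).
karma_slow = {}
for cid in comment_ids:
    row = conn.execute(
        "SELECT karma FROM comment_karma WHERE comment_id = ?", (cid,)).fetchone()
    karma_slow[cid] = row[0]

# Better: a single batched query for every comment on the page.
placeholders = ",".join("?" * len(comment_ids))
rows = conn.execute(
    f"SELECT comment_id, karma FROM comment_karma "
    f"WHERE comment_id IN ({placeholders})", comment_ids).fetchall()
karma_fast = dict(rows)

assert karma_slow == karma_fast  # same data, one round-trip instead of 708
```

In a real app the batching would live in your ORM (e.g., an eager-loading or `WHERE ... IN` helper), and very large ID lists should be chunked to stay under your database's bind-parameter limits.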
Again, these issues aren’t glamorous, but we see them enough that they seemed like a good start to this series. In future posts we’ll look at more general classes of problems as well as specific pitfalls affecting certain web stacks. And as always we’re curious to hear your performance stories, too!