Past a certain point, web applications outgrow a single server deployment.
Companies either want to increase their availability, scalability, or both! To
do this, they deploy their application across multiple servers with a load
balancer in front to distribute incoming requests. Big companies may need
thousands of servers running their web application to handle the load.
In this post we’re going to focus on the ways that a single load balancer might
distribute HTTP requests to a set of servers. We’ll start from the bottom and
work our way up to modern load balancing algorithms.
Visualising the problem
Let’s start at the beginning: a single load
balancer sending requests to a single server. Requests are being
sent at a rate of 1 request per second (RPS), and each request reduces in size as the server processes it.
For a lot of websites, this setup works just fine. Modern servers are powerful and can handle a lot of requests. But what happens when they can’t keep up?
Here we see that a rate of 3 RPS causes some requests to get dropped. If
a request arrives at the server while another request
is being processed, the server will drop it. This will result in an error being shown to the
user and is something we want to avoid. We can add another server to our load
balancer to fix this.
No more dropped requests! The way our load
balancer is behaving here, sending a request to each server in turn, is called “round robin” load balancing.
It’s one of the simplest forms of load balancing, and works well when your servers are all equally powerful and your requests are all equally expensive.
When round robin doesn’t cut it
In the real world, it’s rare for servers to be
equally powerful and requests to be equally
expensive. Even if you use the exact same server
hardware, performance may differ. Applications may have to service many
different types of requests, and these will likely
have different performance characteristics.
Let’s see what happens when we vary request cost.
In the following simulation, requests aren’t
equally expensive. You’ll be able to see this by some requests taking longer to shrink than others.
While most requests get served successfully, we do
drop some. One of the ways we can mitigate this is
to have a “request queue.”
Request queues help us deal with uncertainty, but it’s a trade-off. We will
drop fewer requests,
but at the cost of some requests having a higher
latency. If you watch the above simulation long enough, you might notice the
requests subtly changing colour. The longer they go
without being served, the more their colour will change. You’ll also notice that
thanks to the request cost variance, servers start to exhibit an imbalance. Queues will get
backed up on servers that get unlucky and have to
serve multiple expensive requests in a row. If
a queue is full, we will drop the request.
Everything said above applies equally to servers
that vary in power. In the next simulation we also vary the power of each
server, which is represented visually with a darker
shade of grey.
The servers are given a random power value, but odds
are some are less powerful than others and quickly start to drop requests. At the same
time, the more powerful servers sit idle most of the
time. This scenario shows the key weakness of round robin: variance.
Despite its flaws, however, round robin is still the default HTTP load balancing
method for nginx.
Improving on round robin
It’s possible to tweak round robin to perform better with variance. There’s an
algorithm called called “weighted round robin” which involves getting humans
to tag each server with a weight that dictates how
many requests to send to it.
In this simulation, we use each server’s known power
value as its weight, and we give more powerful servers more requests as we
loop through them.
While this handles the variance of server power
better than vanilla round robin, we still have request variance to contend with. In practice, getting
humans to set the weight by hand falls apart quickly. Boiling server performance down to a single number is hard, and
would require careful load testing with real workloads. This is rarely done, so
another variant of weighted round robin calculates weights dynamically by using
a proxy metric: latency.
It stands to reason that if one server serves
requests 3 times faster than another server, it’s probably 3 times faster and should receive
3 times more requests than the other server.
I’ve added text to each server this time that shows
the average latency of the last 3 requests served.
We then decide whether to send 1, 2, or 3 requests
to each server based on the relative differences in
the latencies. The result is very similar to the initial weighted round robin
simulation, but there’s no need to specify the weight of each server up front. This algorithm will also be able to adapt
to changes in server performance over time. This is
called “dynamic weighted round robin.”
Let’s see how it handles a complex situation, with high variance in both server power and request
cost. The following simulation uses randomised values, so feel free to refresh
the page a few times to see it adapt to new variants.
Moving away from round robin
Dynamic weighted round robin seems to account well for variance in both server power and request
cost. But what if I told you we could do even better, and with a simpler
This is called “least connections” load balancing.
Because the load balancer sits between the
server and the user, it can accurately keep track
of how many outstanding requests each server has. Then when a new
request comes in and it’s time to determine where to send it, it knows
which servers have the least work to do and
This algorithm performs extremely well regardless how much variance exists.
It cuts through uncertainty by maintaining an accurate understanding of what
each server is doing. It also has the benefit of
being very simple to implement. For these reasons, you will find this algorithm
as the default HTTP load balancing method for AWS’s load balancers. It’s also
an option in nginx, and well worth experimenting with if you’ve never changed
Let’s see this in action in a similarly complex simulation, the same parameters
we gave the dynamic weighted round robin algorithm above. Again, these
parameters are randomised within given ranges, so refresh the page to see new
While this algorithm is a great balance between simplicity and performance, it’s
not immune to dropping requests. However, what you’ll notice is that the only
time this algorithm drops requests is when there is literally no more queue space
available. It will make sure all available resources are in use, and that makes
it a great default choice for most workloads.
Optimizing for latency
Up until now I’ve been avoiding a crucial part of the discussion: what we’re
optimising for. Implicitly, I’ve been considering dropped requests to be
really bad and seeking to avoid them. This is a nice goal, but it’s not the
metric we most want to optimise for in an HTTP load
What we’re often more concerned about is latency. This is measured in
milliseconds from the moment a request is created
to the moment it has been served. When we’re discussing latency in this context,
it is common to talk about different “percentiles.” For example, the 50th
percentile (also called the “median”) is defined as the millisecond value for
which 50% of requests are below, and 50% are above.
I ran 3 simulations with identical parameters for 60 seconds and took a variety
of measurements every second. Each simulation varied only by the load balancing
algorithm used. Let’s compare the medians for each of the 3 simulations:
You might not have expected it, but round robin has the best median latency. If
we weren’t looking at any other data points, we’d miss the full story. Let’s
take a look at the 95th and 99th percentiles.
Note: there’s no colour difference between the different percentiles for each
load balancing algorithm. Higher percentiles will always be higher on the graph.
We see that round robin doesn’t perform well in the higher percentiles. How can
it be that round robin has a great median, but bad 95th and 99th percentiles?
In round robin, the state of each server isn’t
considered, so you’ll get quite a lot of requests
going to servers that are idle. This is how we get
the low 50th percentile. On the flip side, we’ll also happily send requests to servers that are
overloaded, hence the bad 95th and 99th percentiles.
We can take a look at the full data in histogram form:
I chose the parameters for these simulations to avoid dropping any requests. This
guarantees we compare the same number of data points for all 3 algorithms.
Let’s run the simulations again but with an increased RPS value, designed to
push all of the algorithms past what they can handle. The following is a graph
of cumulative requests dropped over time.
Least connections handles overload much better, but the cost of doing that is
slightly higher 95th and 99th percentile latencies. Depending on your use-case,
this might be a worthwhile trade-off.
One last algorithm
If we really want to optimise for latency, we need an algorithm that takes
latency into account. Wouldn’t it be great if we could combine the dynamic
weighted round robin algorithm with the least connections algorithm? The latency
of weighted round robin and the resilience of least connections.
Turns out we’re not the first people to have this thought. Below is a simulation
using an algorithm called “peak exponentially weighted moving average” (or
PEWMA). It’s a long and complex name but hang in there, I’ll break down how it
works in a moment.
I’ve set specific parameters for this simulation that are guaranteed to exhibit
an expected behaviour. If you watch closely, you’ll notice that the algorithm
just stops sending requests to the leftmost server after a while. It does this because it figures out
that all of the other servers are faster, and
there’s no need to send requests to the slowest
one. That will just result in requests with a
So how does it do this? It combines techniques from dynamic weighted round robin
with techniques from least connections, and sprinkles a little bit of its own
magic on top.
For each server, the algorithm keeps track of the
latency from the last N requests. Instead of using
this to calculate an average, it sums the values but with an exponentially
decreasing scale factor. This results in a value where the older a latency is,
the less it contributes to the sum. Recent requests
influence the calculation more than old ones.
That value is then taken and multiplied by the number of open connections to the
server and the result is the value we use to choose
which server to send the next request to. Lower is better.
So how does it compare? First let’s take a look at the 50th, 95th, and 99th
percentiles when compared against the least connections data from earlier.
We see a marked improvement across the board! It’s far more pronounced at the
higher percentiles, but consistently present for the median as well. Here we
can see the same data in histogram form.
How about dropped requests?
It starts out performing better, but over time performs worse than least
connections. This makes sense. PEWMA is opportunistic in that it tries to get
the best latency, and this means it may sometimes leave a
server less than fully loaded.
I want to add here that PEWMA has a lot of parameters that can be tweaked. The
implementation I wrote for this post uses a configuration that seemed to work
well for the situations I tested it in, but further tweaking could get you
better results vs least connections. This is one of the downsides of PEWMA vs
least connections: extra complexity.
I spent a long time on this post. It was difficult to balance realism against
ease of understanding, but I feel good about where I landed. I’m hopeful that
being able to see how these complex systems behave in practice, in ideal and
less-than-ideal scenarios, helps you grow an intuitive understanding of when
they would best apply to your workloads.
Obligatory disclaimer: You must always benchmark your own workloads over
taking advice from the Internet as gospel. My simulations here ignore some real
life constraints (server slow start, network latency), and are set up to display
specific properties of each algorithm. They aren’t realistic benchmarks to be
taken at face value.
To round this out, I leave you with a version of the simulation that lets you
tweak most of the parameters in real time. Have fun!