# Table of Contents

I have spent the last few years building and running software systems that implement relevance features11 Search, navigation, recommendations, personalisation, etc. The business owners would tell you that what makes our solution unique is “ai”, but I would claim it’s rather about a rapid pace of innovation and experimentation, afforded by building infrastructure, tooling, and frameworks tohigh quality standards. of e-commerce platforms. In that field, Black Friday and all of Cyber Week are by far the most intense annual traffic peaks. Fortunately, they are sort of similar from year to year, so it is possible to roughly forecast the additional traffic volume for each online store, based on things like vertical, campaign scheduling, and so on.22 The shape of the traffic volume over cyber week is also somewhat predictable, so one can use information about traffic volumes early in the week to forecast how things will turn out later in the week.

A neat thing about the way these systems have been built is that it is almost exclusively cpu usage that is affected by increased traffic volume, and the problem is (to a first approximation, anyway) embarassingly parallel. So to cope with increased traffic, one simply allocates more cpu cores to it – either vertically by assigning the job to a more beefy machine, or horizontally by load balancing across a larger number of machines.

# First Question

If a medium-large online store has 40 cpu cores allocated to it, and we know they expect to receive twice the usual amount of traffic, how many cpu cores should they get to accommodate this increase without degraded performance, where performance is measured as latency or throughput?

As much as I’d like to give you the answer straight away, this is one of those cases where I believe most of the value is in the journey there, not in the conclusion. As you’ll find if you scroll down to the section “A Shortcut”, the conclusion is quite trivial. But we can learn a lot from the way we get there.

# Simplifications

I’m going to argue from queueing theory, and the discussion below assumes the usual caveats we hear when someone brings up queueing theory: the results are based on a single-server fifo system with Markovian arrivals and processing times. Real world systems typically have heavy-tailed workloads (not Markovian)33 Remarkably, real-world interactive systems typically still have Markovian arrivals – and this is indeed required by the analysis below., there are multiple workers (not a single-server system), and each server uses timeslicing concurrency (not fifo.)

These real-world properties are not accounted for by the simpler model. However!

• Timeslicing concurrency (surprisingly!) doesn’t change the key relationships used below, and in fact makes them apply to a wider range of workload distributions. This happens because timeslicing removes some of the bad effects of workload variability.
• When viewed from a distance, a system that load balances over c identitical workers looks a lot like a system with one server that is c times as powerful as the individual ones, as long as it’s kept at somewhat high utilisation.

These two points means that if we squint somewhat, we can apply the analysis below also to real-world systems. And in fact, I have and I liked the results. Otherwise I wouldn’t be writing this article! The theory won’t perfectly predict what happens in practise, but in my experience it has been close enough to save a lot of money.

# Wrong Answer

Back to the question: if a system on 40 cpu cores needs to meet twice the regular amount of traffic without lost performance, how many cpu cores would it need?

This is one of those trick questions from queueing theory. It is tempting to blurt out 80. It seems to make intuitive sense.

It’s also wrong. To see why, let’s first appeal to intuition, and then look at the equations. If we run a system on 80 cores next to the same system on 40 cores, and drive them as hard as we can, we will see that by the time the 80-core system has responded to 2000 requests, the 40-core system will only have had time for 1000. In other words, increasing the core count to 80 will double the performance of the system, measured as both throughput and response time – even when the traffic volume doubles!

We can derive this from theory, too. The response time $$W_{\textrm{baseline}}$$ of a system is

$W_{\textrm{baseline}} = \frac{1}{\mu - \lambda}$

where $$\mu$$ is the number of requests per second the server can do when it works as fast as it can, and $$\lambda$$ is the incoming request rate. If we double both $$\mu$$ (give the server twice as many resources) and $$\lambda$$ (double the request volume) then we get

$W_{\textrm{new}} = \frac{1}{2\mu - 2\lambda} = \frac{1}{2} \frac{1}{\mu - \lambda} = \frac{1}{2} W_{\textrm{baseline}}.$

In other words, the response time is halved for the system with both request rate and resources are doubled. Since throughput

$X = \frac{1}{W},$

it’s clear it will be doubled too.

# Right Answer

The reason I bring up the equations above is that they help us get to the correct solution. What we want to figure out is what value of $$k$$ improvement to the server resources gives us the same performance as the baseline system when we double the request volume. Mathematically speaking,

$W_{\textrm{new}} = \frac{1}{k\mu - 2\lambda} = \frac{1}{\mu - \lambda} = W_{\textrm{baseline}}.$

If we rearrange this and simplify, we find that the solution is at

$k = 1 + \frac{\lambda}{\mu}.$

The quantity $$\lambda$$ / $$\mu$$ is sometimes known as $$R$$, and it has an intuitive interpretation: it’s the utilisation of the system, i.e. the fraction of the the system is busy at the current load.44 For cpu based loads such as the one discussed here, can get the utilisation out of a computer by running vmstat 1, finding the column for cpu idle, and taking the inverse of that.

We find that how much we need to increase the cpu core count actually depends on how heavily utilised the system is in its baseline state!

This shouldn’t come as a surprise: if the system spends most of its time idle, it can probably handle the increased traffic just fine with no change at all. If it’s close to fully utilised, we may need to increase the resources a lot. It’s instructive to plot $$k$$ as a function of $$R$$:

This was specifically for handling double the traffic with no decrease in performance. If the system is currently almost completely idle, we don’t need to change anything – 40 cores will be sufficient also for double the traffic. Then it increases – if the system is currently actively processing stuff 50 % of the time, we need the new system to have 1.5 times the cores of the baseline system, so 20 more cores than before, and so on.

The more general solution comes out of the same equation, and looks like

$k = 1 + (x - 1)R,$

where $$x$$ represents how much we expect traffic to increase (so for example 1.14 if we expect a 14 % increase, or 2 if we expect it to double, and so on.)

This is not particularly interesting to look at because it follows the same pattern as before: at $$R=0$$ we have $$k=1$$ (no change to the system needed), and then it increases in a straight line until $$R=1$$ where $$k=x$$.

Now that we know this, we can move on to more interesting questions, but first, some notes on planning.

# Notes On Planning

I want to get two things out of the way before we move on. This might be stating the obvious, but bears saying, just in case.

Our projection of future traffic volume is not exact knowledge. It’s a probability distribution. Maybe we think the most likely increase is 2×, but there’s also a fair chance it could be 3×. Maybe there’s a small chance it is 5×. In a freak scenario, it could even be 10×. We shouldn’t rely on the most likely occurring. Instead, pick a risk level we’re willing to live with. Maybe if we scale the system for 4× the traffic, we think there’s only a 5 % risk of there being more traffic, and this risk is worth what we save from not making the system unnecessarily powerful.

The diagram above might give you the impression that it’s okay to run at 100 % utilisation. Note that the diagram above shows how much we should scale up to retain existing performance. What is the existing performance at 100 % utilisation? Let’s look at what response time is at various levels of utilisation. (This is the equation for $$W$$ from before, in graphical form.)

If the system is idle when a request comes in, it can be expected to be served in the fastest possible time, which for this system is 20 ms. As utilisation increases, the risk increases that the system is busy when a request comes in, and that the request, consequently, has to wait for longer to be processed. At 80 % utilisation, response times are already 5× worse than at idle. As we approach 100 % utilisation, things get worse and in fact, when the system is fully utilised, the response times will be infinite.

# Better Questions

With that out of the way, we can tackle another problem with our methodology so far. It’s been a bit crude. If the equation indicates that we need to increase the core count of the system by 5 % to meet increased demand, will we really do that, or will we just accept that it suffers a little in performance? Most customers accept that the system runs a little slower when it’s under high load.

Let’s say we can tolerate that $$W_{\textrm{new}}$$ is 40 % slower than $$W_{\textrm{baseline}}$$. Then the equation of interest is

$k = \frac{1}{1.4} + \left(x - \frac{1}{1.4}\right)R.$

This doesn’t change the shape of the line, it just offsets it slightly so that it no longer indicates a need for scaling up by small amounts.

Another way to look at it is to see how much the system would slow down from a traffic increase of $$x$$, assuming we don’t make any change to the system at all. We get this out of

$w = \frac{1-R}{1 - xR}.$

Visually, for some values of $$x$$:

The plot is a little tricky to read (and I wish I had time to come up with a better way to visualise this) but we’ll try anyway. The Y axis does not incidate response time, but rather how much the response time is slowed down due to increased traffic, compared to the current response time of the system.

We find that if we are facing only modest increases in traffic, we will see fairly little slowdown even at high utilisation levels: e.g. a 10 % increase does not lead to a significant slowdown even at 70 % utilisation. On the other hand, if the system is utilised at a modest 30 %, it cannot take a high traffic increase like 5×.

Based on this, you might start to sense what the fastest and easiest way to make these computations are. We’ll get to that.

Another question that may be of interest is a combination of the last two: “given utilisation $$R$$, and if we tolerate a slowdown of $$w$$, how much more traffic can we take without needing any improvement of the system?” The answer is the solution to

$x = \frac{Rw^2 + w - 1}{Rw}.$

Let’s say we tolerate at most a 40 % slowdown, i.e. $$w=1.4$$. Then, visually, we have as a function of utilisation:

We see that the traffic increase we can take without changing the system drops off quickly as utilisation increases beyond idle. This means if we are uncertain about how much traffic will increase, we get a lot of mileage out of reducing the current utilisation from e.g. 10 % to 5 % – but it doesn’t do much to decrease it from 50 % to 30 %.

# A Shortcut

This has, finally, lead us to the shortcut. Given the characteristics of the system we’re working with, we can forget most of everything above. We don’t need to know about slowdowns, or response times, or whatever. The one thing we need to target is utilisation.

Before a traffic increase, we need utilisation to be so low, that once it’s multiplied by the traffic increase, it’s still reasonably low (e.g. under 50 %.)

As long as we keep utilisation down, response times and throughput will be maintained at values somewhat similar to what they were before. This is very simple to do in practise, with division being the only maths we need:

1. Figure out what the largest increase in traffic you resonably expect is. We can use 6× as an example.
2. Compute the baseline utilisation that leads to a utilisation of 50 % after the traffic increase. In our example, this would be 0.5/6 = 0.08.
3. If the cpu utilisation is already below 8 %, you’re fine! If it’s not, keep adding cpu cores until it is. Done!

Try it and let me know if it works! I’m genuinely interested in more experiences with this. It seems too simple to work, but so far, it has, in my experience.