Garbage In, Garbage Out: Your Load Test Results Are Only as Reliable as Your Test Environment

Making sure your load test results are accurate and actionable starts long before you run your tests — so this post is all about how to set up your test environment.

This series of articles is based on my nearly 20 years of experience with performance testing in gaming, e-commerce, network infrastructure and finance. I maintain the open source load testing tool Locust, and recently started Locust Cloud so you don’t have to do the heavy lifting of setting up and maintaining load testing infrastructure.

Matching production performance

A test environment for functional testing might simply be a virtual environment with all systems running scaled down to a minimum, without a lot of resource utilization. But for performance testing, ideally you want an environment that closely replicates your production environment’s performance.

Have a permanent production-scale test environment

The first option is to have a test environment that matches your production environment’s performance. This sounds obvious, but is often skipped due to the cost of resources. If you have a big operation, it can be prohibitively expensive to run an exact copy of production. If that’s the case for you, there are a couple of other ways to approach this challenge:

Use dynamic scaling

If you have a modern hosting solution, you should have the flexibility to scale up your test environment on demand. So, when you want to run performance tests, it’s just a matter of telling AWS (for example) to provision 10x the usual number of servers (or whatever the ratio is for you) to match production’s performance. That way you can use a “regular” test environment but scale it up when you want to do load testing.

Create a scaled-down test environment

You can also create a test environment that’s a fraction of the size of production and test performance proportionally. So, with a test environment half the size of production, you could extrapolate that in production, with double the amount of resources, you will achieve double the throughput that you achieved in your performance test.

This approach isn’t as reliable as dynamic scaling, because whether the assumption of double throughput is reasonable depends on the architecture of your application or service. Still, it is a common workaround if you’re not cloud-ready or otherwise don’t have the flexibility to scale up and down on the fly.
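To make the extrapolation concrete, here is a minimal Python sketch. The function name and the scaling_efficiency parameter are my own illustration, not part of any tool: 1.0 corresponds to the linear assumption described above, and a lower value models added capacity that pays off less than proportionally. Picking that value is a judgment call about your architecture, not something the test result gives you.

```python
def extrapolate_throughput(
    measured_rps: float,
    test_units: int,
    prod_units: int,
    scaling_efficiency: float = 1.0,
) -> float:
    """Estimate production throughput from a scaled-down test run.

    measured_rps: requests per second achieved in the test environment.
    test_units / prod_units: comparable capacity units (e.g. server count).
    scaling_efficiency: hypothetical factor in (0, 1]; 1.0 means every extra
        unit adds as much throughput as the units in the test environment.
    """
    if test_units <= 0 or prod_units < test_units:
        raise ValueError("expected 0 < test_units <= prod_units")
    if not 0 < scaling_efficiency <= 1:
        raise ValueError("scaling_efficiency must be in (0, 1]")

    per_unit_rps = measured_rps / test_units
    extra_units = prod_units - test_units
    # Units already tested count in full; only the extrapolated part is discounted.
    return measured_rps + per_unit_rps * extra_units * scaling_efficiency
```

With 500 requests per second measured on 4 servers and 8 servers in production, the linear assumption gives 1000. Setting scaling_efficiency to 0.8 gives about 900, which shows how sensitive the estimate is to that one assumption.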

Matching background load

Beyond resources, your test environment also needs to match production in terms of load — not just your expected traffic but also background activity and processes such as:

Running reports (scheduled and manual)
Database cleanup and indexing jobs
Backup creation

Matching production data

As with background load, you also want to reproduce the volume of data your application is interacting with in production. Ideally you can generate synthetic data that matches your production data in volume and performance characteristics, but another way is to use a copy of your production data in your test environment. In many cases there are legal implications to using consumer data. Sometimes you can work around this by anonymizing the data so that it can no longer be tied to individuals, and there are even some specialized tools for this.
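As a rough illustration of the idea, the sketch below replaces sensitive fields with a keyed hash using only the Python standard library. All names here (pseudonymize, anonymize_record, the field names) are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, and whether it satisfies your legal requirements is something to check with whoever owns compliance, not something this code can guarantee.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map a value to a stable token that cannot be reversed without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep the full digest if collisions are a concern.
    return digest.hexdigest()[:16]


def anonymize_record(record: dict, sensitive_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with the sensitive fields replaced by tokens."""
    return {
        field: pseudonymize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


if __name__ == "__main__":
    key = b"keep-this-out-of-the-test-environment"  # hypothetical key
    row = {"id": 42, "email": "jane@example.com", "order_total": 99.5}
    print(anonymize_record(row, {"email"}, key))
```

The mapping is deterministic on purpose: the same input always produces the same token, so joins between tables and the number of distinct values (which affects index and cache behavior) stay the same as in production.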

Getting predictable performance

When performance testing, you need to know that any performance changes in your results are real, that is, caused by changes in the test scenario, the code in the system you’re testing, or its setup. To ensure performance is as predictable as possible, you want to minimize noise and uncontrolled changes between test runs.
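One simple way to get a feel for how noisy your environment is: run the same test several times without changing anything, then compare new results against that spread. The sketch below is an assumption-laden illustration, not a rigorous statistical test. The function names and the threshold of three standard deviations are my own choices.

```python
import statistics
from typing import Sequence


def noise_level(baseline_runs: Sequence[float]) -> float:
    """Coefficient of variation (stdev / mean) across identical test runs."""
    return statistics.stdev(baseline_runs) / statistics.mean(baseline_runs)


def is_real_change(
    baseline_runs: Sequence[float],
    new_value: float,
    noise_multiplier: float = 3.0,
) -> bool:
    """True if new_value differs from the baseline by more than the usual noise.

    baseline_runs: the same metric (e.g. median response time in ms) from
        several runs of an unchanged test against an unchanged system.
    noise_multiplier: hypothetical threshold, in standard deviations.
    """
    if len(baseline_runs) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    mean = statistics.mean(baseline_runs)
    spread = statistics.stdev(baseline_runs)
    return abs(new_value - mean) > noise_multiplier * spread
```

If noise_level comes out high for your environment, that is a sign to fix the environment first, because small regressions will be indistinguishable from random variation.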

There’s a time and a place for autoscaling

If you’re running a modern system on cloud infrastructure, you have probably set it up so it’s generally using no more resources than it actually needs, but more can be spun up in response to a surge in demand. Autoscaling is great for production, because it allows you to use resources in a flexible way.

But when you want to prioritize predictability, as in performance testing, autoscaling can confuse things. If you have autoscaling enabled in your test environment, something as simple as running a performance test could trigger it, so that a subsequent test shows you results for your system running with more resources than usual. Unless you disable autoscaling, or at least monitor it, you might get a false impression of how your system performs at its normal capacity.

(Note: Of course, there may be occasions when you specifically want to test autoscaling in your system, for example if you are spike testing and want to track how long it takes for more resources to become available under a higher load.)

What happens if you have to share your test environment?

Ideally, you want a separate test environment from those used for functional testing. If the application you’re testing is modern and cloud aware, you should be able to set up a fresh environment and scale it up to where you can test performance.

However, it’s common to end up having to share the test environment with someone. For you as a load tester, someone conducting functional tests isn’t going to throw off your load tests (as long as they are just behaving like an individual user, not doing heavy work like producing reports or database maintenance). But if you’re maxing out the system, your coworker doing functional testing will experience degraded performance and possibly errors, and might confuse this with a “real” functional issue.

The solution here isn’t primarily technical; it’s about communicating proactively with others. Let them know when you’re going to be load testing. You should do so even if you don’t expect interruptions, so they know to talk to you if they run into errors or things are slow. If you know other people are depending on the system, you need to monitor your tests extra closely, so you can back off quickly if you start seeing errors or unacceptable response times.
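It helps to decide on your back-off criteria before the test starts rather than while it is running. Here is a minimal sketch of such a check in Python. The Snapshot fields and the default thresholds are placeholders I made up for illustration, so substitute whatever your load testing tool reports and whatever your coworkers can tolerate.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for a recent time window, as reported by your load testing tool."""
    requests: int
    failures: int
    p95_ms: float  # 95th percentile response time in milliseconds


def should_back_off(
    snapshot: Snapshot,
    max_error_rate: float = 0.01,  # hypothetical: 1% failed requests
    max_p95_ms: float = 2000.0,    # hypothetical: 2 seconds
) -> bool:
    """True if the shared environment looks degraded and load should be reduced."""
    if snapshot.requests == 0:
        return False  # nothing measured yet, so no basis for a decision
    error_rate = snapshot.failures / snapshot.requests
    return error_rate > max_error_rate or snapshot.p95_ms > max_p95_ms
```

How you act on the result (stopping the test, or ramping down the number of simulated users) depends on your tool. The point is only that the thresholds are written down in advance.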

Finally: Set up infrastructure for load generation

Before you can actually run tests in your test environment at any kind of scale, you’ll need servers for load generation. Hopefully you can easily launch virtual machines in your VPC for this purpose, but even then you need to take care when doing so.

The load generators need to have not just good compute performance, but also proper networking and OS setup — and they need to be consistent, because, again, you really don’t want random variations.

Apart from the machines generating the load itself, you’ll need to launch some “master” process that instructs the workers what to do (a.k.a. “controller” in LoadRunner or JMeter). The master process is also typically responsible for aggregating and visualizing the results/data that the workers produce. The master process usually doesn’t have as high resource requirements as the load generators/“workers”, but it still adds complexity to your setup.
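To give a sense of what the aggregation step involves, here is a simplified Python sketch. It is not how Locust or any other specific tool implements it, and the WorkerStats fields are assumptions for the sake of the example. It illustrates one detail that is easy to get wrong: the overall average has to be computed from summed totals, not by averaging each worker's average.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WorkerStats:
    """What one load generator reports back for a time window (illustrative)."""
    requests: int
    failures: int
    total_response_ms: float  # sum of all response times in the window
    max_response_ms: float


def aggregate(reports: Iterable[WorkerStats]) -> dict:
    """Combine per-worker reports into one overall result."""
    reports = list(reports)
    requests = sum(r.requests for r in reports)
    failures = sum(r.failures for r in reports)
    total_ms = sum(r.total_response_ms for r in reports)
    return {
        "requests": requests,
        "failures": failures,
        # Dividing summed time by summed requests weights busy workers correctly.
        "avg_response_ms": total_ms / requests if requests else 0.0,
        "max_response_ms": max((r.max_response_ms for r in reports), default=0.0),
        "error_rate": failures / requests if requests else 0.0,
    }
```

Percentiles are harder still, because they cannot be derived from per-worker percentiles. The workers would need to send something like a histogram of response times for the master to merge.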

If you’re at Google scale, you probably have the resources (not just hardware, but time and skill) to set this up, but very often it makes more sense to outsource this using a plug-and-play SaaS solution (like the one we happen to be building). Another advantage of “load testing as a service” is that it allows you to spin up load generators only when you need them, which is good for the planet as well as for your wallet. I highly recommend this route as opposed to starting out with the DIY route and then considering a SaaS alternative, to avoid the upfront investment in time and resources. Not all SaaS solutions are equal though, so here are some things to be aware of when choosing one:

Does the vendor offer flexible pricing without having to commit for a long time?
Is the load testing tool used open source? Proprietary load test solutions (e.g. LoadRunner) can become costly and inflexible due to vendor lock-in and challenges with extensibility.
If the vendor uses some form of dual-licensing, make sure the open source version still has the features you need. Some tools are lacking important features in their open source version (e.g. Gatling, which doesn’t support distributed tests in its free version).

A load tester’s work is never done

Load testing is best set up sooner rather than later, but refining and maintaining your test environment will be an ongoing process. The next post in our series will dig into some of the details of how to write the test scenarios themselves and ensure your results are reliable, but a performance test is never more reliable than the environment you run it on. So don’t forget these basics!