Successfully Scaling Automated Tests

Testing is a risk-mitigating activity that takes time. Ideally, we want to cover as many behaviors as possible in the shortest amount of time to get fast feedback on quality. Automating tests that would otherwise require manual execution is a great way to shorten testing cycles, but even automated tests can still be slow.

For example, consider a suite with 300 end-to-end tests. If each test takes 1 minute to run, then the whole suite will take 5 hours to complete. That’s a long duration – and those are modest numbers. I’ve worked on suites with thousands of tests where the average individual test might take several minutes to complete.

The only way to speed up is to scale up. But how? Should we profile tests to see which ones run the longest? Or should we just pay for faster hardware to execute tests faster? What are the most impactful ways to scale up tests?

Here are the five main steps I take whenever I need to scale automated tests, in order of impact:

#1. Run tests in parallel

By far the most effective way to shorten total test execution time is to run tests in parallel. Consider that 5-hour, 300-test suite we mentioned earlier. If we ran it with 5 parallel workers, then its duration time would drop from 5 hours to 1 hour!

Almost all test frameworks these days can launch tests in parallel, usually via a command line option or some sort of plugin. The challenge, however, is not in the tooling: it is in the way tests are written. Every single test in the suite must be independent of the others, or else tests will inevitably start failing when running them in parallel. “Independence” means that one test does not affect another. There are two primary aspects of test case independence:

  1. Individuality: A test can run alone by itself without needing other tests to set things up first. By extension, any subset of tests in the suite can run in any order.
  2. Isolated data: Tests do not collide on any shared test data. One test does not modify or delete data that another test uses.

A good way to verify independence is to randomize the order of tests in the suite. If tests are not truly independent, then they will yield intermittent failures due to improper setup and collisions. Untangling interdependencies can be a daunting task, but it is absolutely necessary for stable parallel execution – and by extension for scaling tests.
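
To make the idea concrete, here is a minimal pytest sketch of independent tests. The in-memory order store is purely illustrative, and pytest-xdist and pytest-randomly are one possible plugin combination for parallel and randomized runs:

```python
import uuid

import pytest

# Purely illustrative in-memory "order store" standing in for a real system.
_orders = {}

def create_order(order_id):
    _orders[order_id] = {"total": 0, "cancelled": False}

def delete_order(order_id):
    _orders.pop(order_id, None)

@pytest.fixture
def isolated_order():
    # Each test creates its own record with a unique ID, so tests never
    # collide on shared data when run in parallel or in random order.
    order_id = f"order-{uuid.uuid4()}"
    create_order(order_id)
    yield order_id
    delete_order(order_id)

def test_order_can_be_cancelled(isolated_order):
    _orders[isolated_order]["cancelled"] = True
    assert _orders[isolated_order]["cancelled"]

def test_new_order_total_is_zero(isolated_order):
    assert _orders[isolated_order]["total"] == 0

# With independent tests like these, parallel execution is just a switch,
# e.g. with the pytest-xdist plugin:   pytest -n 5
# Installing pytest-randomly shuffles the order to flush out hidden
# interdependencies.
```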

#2. Thoroughly implement proper waiting

Improper waiting is one of the most pervasive problems in test automation. An automated test must wait for the system under test to be ready before interacting with it. Waiting is especially critical for UI testing, where, for example, a page must load before clicking a button.

The two antipatterns I see most frequently are not handling waiting at all and using hard waits. Not waiting at all will inevitably cause flaky failures. A hard wait (or “hard sleep”) forces the test to wait for the full amount of time given (say, 30 seconds), even if the system becomes ready sooner. Hard waits are an easy way to wait for the system to be ready, but they are still terrible: testers are incentivized to pad sleep times as a safety measure, which slows down tests tremendously, and a hard wait can still result in a flaky failure if the system is slow and needs even more time.

The proper pattern is to use smart waits wherever there are race conditions between the automated test and the system under test. A “smart” wait repeatedly checks if the system is ready, and it needs two things: a condition to check and a timeout. For example, suppose a smart wait is waiting for a button to appear on a page within a 30-second timeout. If the button loads in 2 seconds, then the smart wait will exit after 2 seconds, as soon as it discovers the button has appeared.

Smart waits are the proper, performant way to handle waiting. Modern web test frameworks like Playwright and Cypress handle smart waiting automatically. Selenium can also perform smart waiting, but it is not automatic – it must be programmed with implicit or explicit waits.
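
For illustration, here is a rough sketch of an explicit wait in Selenium with Python; the page URL and button locator are placeholders, not a real application:

```python
# Antipattern: a hard wait always burns the full 30 seconds.
#   import time
#   time.sleep(30)

# Better: an explicit ("smart") wait polls a condition and returns as soon
# as it passes, up to the timeout.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/checkout")  # placeholder URL

wait = WebDriverWait(driver, timeout=30)
button = wait.until(EC.element_to_be_clickable((By.ID, "submit")))  # placeholder locator
button.click()

driver.quit()
```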

#3. Use performant test execution infrastructure

The infrastructure used to run automation is just as important as the tests themselves, and it must be scaled commensurately. There are multiple aspects to consider.

First, make sure to use up-to-date packages, software development kits (SDKs), and runtimes. Older versions may not have all the latest performance enhancements. Years ago, I led a .NET test project built on C#, SpecFlow, and Selenium. When we ported the project from .NET Framework 4.7 (the older Windows-exclusive runtime) to .NET 5 (the newer multi-platform runtime), test execution speed nearly doubled! We didn’t change any of the tests or any of the other infrastructure. Simply updating the version of .NET cut the test suite duration time in half. It was quite a pleasant surprise.

Second, make sure the computing resources are strong. Use machines with fast processors, plenty of memory, and solid-state storage rather than disks. Sometimes, teams skimp on testing resources and provide underpowered machines unsuitable for high-scale testing. Scaling up these resources with cloud platforms like AWS or Azure is thankfully straightforward – it just costs more money.

Third, scale out as well as scale up. One machine will be limited in the amount of load it can handle. Spreading that load across multiple machines is the only way to reach “high” scale. Scaling out could be as simple as partitioning a test suite into subsets, running each partition on a different machine, and then aggregating results together once complete. For web testing, Selenium Grid can offload browser sessions by running them across remote machines.
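
As a sketch, pointing a test at a Selenium Grid is mostly a matter of using a remote driver instead of a local one; the hub URL below is a placeholder for your own grid:

```python
# Run the browser session on a remote Selenium Grid node rather than the
# local machine; the hub URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()

driver = webdriver.Remote(
    command_executor="http://my-grid.internal:4444/wd/hub",  # placeholder hub URL
    options=options,
)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```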

I can offer some additional advice specific to web testing: not all browsers run tests equally fast.

Here’s my anecdotal ranking of browsers in order of fastest to slowest execution speed:

  1. Google Chrome
  2. Microsoft Edge
  3. Apple Safari
  4. Mozilla Firefox
  5. Microsoft Internet Explorer

For Selenium tests, the optimal ratio of tests-to-processors is 3-to-4. So, if the test machine has 4 processor cores, run the tests in parallel with 3 workers. The same ratio applies to Selenium Grid: across all nodes, limit the number of permitted browser sessions to 75% of the available processors. The extra 25% capacity handles all the other processes happening on the machine, and it provides a safety factor if anything goes wrong. Do not exceed a ratio of 1-test-to-processor, however. More than 1 test per processor will force the machine to do more context switching and will lengthen total execution time. Also note that third-party vendors providing Selenium-based execution grids historically slow down tests by a factor of 2x to 4x.
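
If you want to compute the worker count from that heuristic, a tiny sketch might look like this (assuming a pytest-xdist style `-n` option on the runner):

```python
# Applying the 3-workers-per-4-processors heuristic described above.
import os

cores = os.cpu_count() or 1
workers = max(1, (cores * 3) // 4)  # 75% of cores, never more than 1 test per core
print(f"{cores} cores -> run with {workers} parallel workers (e.g. pytest -n {workers})")
```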

Furthermore, don’t use Internet Explorer. It is no longer supported and practically impossible to scale. Tests can run only one instance of IE at a time per machine, no matter how many processors the machine has.

So, why are infrastructure improvements the third step and not the first step? Test automation design principles must be addressed first. Tests that lack independence or proper waiting are fundamentally flawed, and they will wreak havoc on any well-intentioned test project – to the point of ruin. Paying for faster VMs would simply be flushing money down the drain.

#4. Write balanced atomic tests

An “atomic” test is one that covers exactly one main behavior. Rather than writing one long test that takes a “grand tour” of an application, testers should write short, atomic tests that cover separate behaviors individually. In addition to making tests more understandable and maintainable, atomic tests are easier to parallelize and shorten total suite duration time.

For example, let’s say there is one long test that takes 10 minutes to complete. Let’s also suppose that this one test actually covers 4 separate behaviors. We could break down this test into 4 separate, smaller tests. Suppose each of those 4 tests takes about 3 minutes to complete. If we ran the new tests serially, it would take 12 minutes – 2 minutes longer than the original grand tour, likely due to additional setup and cleanup steps for each test. However, if we ran the new tests with 2 parallel workers, the total test duration time would drop to 6 minutes. With 4 parallel workers, duration would drop to 3 minutes. That’s significant.
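
Here is a rough pytest sketch of the difference; the Store class is a purely illustrative stand-in for a real application:

```python
import pytest

class Store:
    """Illustrative in-memory stand-in for a real application."""
    def __init__(self):
        self.cart = []
    def add(self, item):
        self.cart.append(item)
    def checkout(self):
        return "confirmed" if self.cart else "empty"

# Grand tour (antipattern): one long test covering several behaviors at once.
def test_grand_tour_of_store():
    store = Store()
    store.add("widget")
    assert "widget" in store.cart            # behavior 1: add an item
    store.add("gadget")
    assert len(store.cart) == 2              # behavior 2: multiple items
    assert store.checkout() == "confirmed"   # behavior 3: checkout

# Atomic alternative: each test covers exactly one behavior and can run
# in parallel independently of the others.
@pytest.fixture
def store():
    return Store()

def test_add_item_puts_it_in_cart(store):
    store.add("widget")
    assert "widget" in store.cart

def test_checkout_confirms_nonempty_cart(store):
    store.add("widget")
    assert store.checkout() == "confirmed"
```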

Long tests are not well-suited to high-scale execution. Atomic tests are. When looking for “long pole” tests, I will take a typical test run and sort tests by their duration times. If any tests take an unusually long time, I’ll drill into them to find out why. Usually, I’ll discover that those long poles are trying to cover too many behaviors at one time. Splitting the grand tour into atomic tests will undo the skew it caused for total execution time. Sometimes, I’ve even removed “long poles” because they become too much hassle (e.g., cost) to keep as part of a high-scale suite.
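
One lightweight way to find long poles, assuming your runner can emit a JUnit-style XML report (for example, pytest's --junitxml option), is to sort the test cases by duration:

```python
# Finding "long pole" tests: sort test cases by duration from a JUnit-style
# XML report (for example, one produced with pytest --junitxml=results.xml).
import xml.etree.ElementTree as ET

tree = ET.parse("results.xml")
cases = [
    (f'{tc.get("classname", "")}::{tc.get("name", "")}', float(tc.get("time", 0)))
    for tc in tree.iter("testcase")
]
for name, seconds in sorted(cases, key=lambda c: c[1], reverse=True)[:10]:
    print(f"{seconds:8.1f}s  {name}")
```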

#5. Arrange tests efficiently

The steps a test takes can be inefficient, especially when setting up or “arranging” the test. Oftentimes, I discover that a test is setting up more data than it actually needs, or that a test is using slow mechanisms to set up the system. Testers should do anything they can to optimize setup steps because they are not part of the main behavior for “acting” and “asserting.”

For example, let’s say we are testing the checkout behavior for an online store. The first thing a user must do is log into the app. The slow, traditional way for login would be to navigate to the login page, enter username and password, click a button, and wait. Every test needs to repeat this same operation. A faster way to perform login could be to POST credentials via an API call, cache the authentication cookie, and inject it into the browser session during test setup. This one-time login mechanism could save seconds per test, which could significantly reduce total testing time.
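
Here is a rough sketch of that technique in Python with requests and Selenium; the login endpoint, credentials, and cookie name are placeholders, and details such as CSRF tokens or extra headers are omitted:

```python
# One-time login via API, then cookie injection into the browser session.
# The endpoint, credentials, and cookie name are placeholders.
import requests
from selenium import webdriver

# 1. Log in once over HTTP instead of driving the login page.
resp = requests.post(
    "https://example.com/api/login",                       # placeholder endpoint
    json={"username": "test_user", "password": "secret"},  # placeholder credentials
)
resp.raise_for_status()
session_token = resp.cookies.get("session_id")             # placeholder cookie name

# 2. Inject the cached cookie during test setup.
driver = webdriver.Chrome()
driver.get("https://example.com")  # must visit the domain before adding its cookie
driver.add_cookie({"name": "session_id", "value": session_token})
driver.get("https://example.com/account")  # loads as an authenticated user
```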

Next, the test must place items in the shopping cart before attempting checkout. The slow, traditional way would be to navigate through catalog pages and add items one at a time. A faster way could be to call an API or insert data directly into session storage to populate the cart. Again, this could save several seconds to several minutes per test.
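
A rough sketch of both options follows; the cart endpoint and the session-storage key are placeholders, and authentication details are omitted:

```python
# Faster cart setup: seed data through an API call or session storage rather
# than clicking through catalog pages. Endpoint and storage key are placeholders.
import json

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# Option A: add items through a backend API (authentication details omitted).
requests.post(
    "https://example.com/api/cart/items",       # placeholder endpoint
    json={"sku": "WIDGET-123", "quantity": 2},
)

# Option B: write the cart directly into the browser's session storage.
cart = [{"sku": "WIDGET-123", "quantity": 2}]
driver.execute_script(
    "window.sessionStorage.setItem('cart', arguments[0]);", json.dumps(cart)
)

driver.get("https://example.com/checkout")  # the checkout test starts here
```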

Caching authentication cookies and leveraging APIs for test data management are just two techniques for arranging tests more efficiently. Making these improvements can be risky, though. They require more advanced programming skills and system knowledge to implement. They could easily break if APIs change. They are also the least impactful of all five steps in the scaling playbook because setup improvements are tedious to identify and affect individual tests rather than the whole suite. Nevertheless, optimized setup steps can still be significant.

How much scaling is enough?

In theory, we could scale a test suite endlessly with refactoring and more powerful infrastructure. Most teams would probably say that they want more coverage in less time, but the business must ask: how much is enough? I think this question should be reframed:

How would your operations improve if you could complete X amount of coverage in Y minutes?

This question gets to the heart of cost/benefit analysis. The values of X and Y would vary by team, but the sentiment remains the same. Here are some examples of this framing:

  1. My team runs our tests only once a sprint because they take 20 hours in total to run. We often need to run subsets of the tests across multiple days, and inevitably we need to rerun a sizable portion of them due to failures. If we could run our full test suite in under 8 hours, then we could set it up to run nightly. That would free up our testers’ time to focus less on monitoring execution and more on finding defects and developing new tests. It would also catch issues during the sprint rather than at the end, preventing delays for fixes.
  2. Every time a developer commits a new change, the Continuous Integration server kicks off a pipeline that builds the app and runs a few tests. The pipeline includes a small smoke test suite that takes 15 minutes to run about 2% of our tests. We also run the full test suite overnight because it takes a few hours. Unfortunately, we have discovered that the smoke suite pretty much never fails, but the nightly suite fails every other night due to new bugs. If we could increase the smoke suite’s coverage to 20% without increasing the pipeline’s duration time, then the smoke suite would start catching failures. Developers would be informed right away when their changes caused failures, rather than relying on testers to triage failures each morning and guess whose changes caused what failures. Developers could even see failures in their pull requests before merging bugs into the main branch.

If the benefits of faster feedback justify the cost of scaling, then pursue it! It only makes sense. After reaching the “next level” of scale, reassess. Is it working? Is it delivering value? Do we need to scale further? You will eventually discover the right level of scale for your team’s needs.

Following the playbook

In short, if you want to successfully scale your automated tests, follow these five steps in order:

  1. Run tests in parallel
  2. Thoroughly implement proper waiting
  3. Use performant test execution infrastructure
  4. Write balanced atomic tests
  5. Arrange tests efficiently

These steps have served me well. Remember that .NET test project I mentioned earlier? My team and I ultimately scaled that project to reliably run over 1000 Selenium-based end-to-end tests in under 15 minutes with up to 100 parallel workers and an in-house Selenium Grid instance. We took our team from zero automated testing to Continuous Testing, shrinking release cycles from 1 week to 2 days.

Give this playbook a try, and see what scaling your automated tests can do for your team!

This post was written by:
Andrew Knight
Principal Architect
