Measure your test pipeline before trying to improve efficiency with batching

In projects with lots of contributors and long-running tests, it seems like common sense to batch features together in order to reduce the overhead of running those long tests. In this post I try to work out whether that is actually a valid thing to do.

The code I used for this simulation is available on GitLab.

The Scenario

Let’s say we have 100 developers all contributing to a large and complex code base. In order to release a new version of the system we have to carry out a load of tests, and these include some end-to-end tests that are pretty slow. Let’s say we get around 24 merge requests per day on average and a test run takes about an hour. At 24 one-hour test runs per day we are only just going to keep up; if the number of merge requests per day increases, we will end up with an ever-increasing backlog. To increase capacity we can batch a few merge requests together and test them as a group. What happens if a test fails, though? We can bisect the batch to find the merge request that caused the failure and remove it.

This was in fact a real scenario in a previous contract of mine. The time it took to merge a feature varied wildly, since merging a batch could take several test runs of bisection, or it could take just one test run. From previous jobs I had an intuition that, in addition to leveling out the lead time, testing each merge request in turn would actually be quicker on average. This turned out to be more or less correct; however, the amazing flakiness of the tests made it hard for people to believe, since the lead time was still very high. I wanted to show that this should always be the case, rather than relying on a small amount of anecdotal evidence.

Method

I’m sure that someone with the right mathematical expertise could come up with a proof, but since I am not an expert I decided to simulate the scenario instead; this should be pretty simple with some Haskell!

In order to avoid confusion, let’s define some terms that we are going to use in the rest of the article.

merge request

A developer writes some code and requests that this code be merged into the master branch. This is a merge request, and it could take the form of a pull request, a single commit or a patch, depending on your workflow. Most people can think “merge request” == “pull request”.

test run

Some tests are run against the code; a test run is likely made up of multiple tests. If one test fails then the test run fails.

merge request batch

A number of merge requests grouped together. As an example, a common workflow is to have a staging branch and a master branch. Developers create a pull request from their feature branch into staging. The staging branch is then tested, usually with a combination of end-to-end tests and manual tests. It might be common for 5 or 6 pull requests to be merged into staging and then tested; in this simulation, however, we will use batches of thousands in order to calculate the averages more accurately.

failure probability

We will talk quite a bit about the probability of failure. What this means is that when a developer writes some code, there is an n percent chance that it will fail a test run. You may not have thought about this before, but it can quite easily be measured.
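As a rough illustration of what I mean by measuring it (a hypothetical sketch, not code from the repository): take the first test run of each merge request over some period and compute the percentage that failed.

import Data.List (genericLength)

data RunResult = Passed | Failed deriving (Eq, Show)

-- Percentage of merge requests whose first test run failed
-- (assumes a non-empty list of results).
failureProbability :: [RunResult] -> Double
failureProbability firstRuns =
  100 * genericLength (filter (== Failed) firstRuns) / genericLength firstRuns

-- e.g. failureProbability [Passed, Failed, Passed, Passed] == 25.0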

Test cases

A merge request can be modeled as a boolean value: either it passes all tests or it doesn’t. We are assuming that a test will always pass if the code is good (tests are not flaky).

We can then model a batch of merge requests as a list of booleans. One variable we have is how many merge requests are good and how many are bad. We can set a mean failure rate, i.e. how likely a merge request is to fail. For simplicity I have used an integer for this: the percentage of merge requests that fail. We can then make a batch of merge requests by creating a list of 100 booleans where n are False and 100 - n are True, and then shuffling them into a random order.

A test run simply checks if any False values exist in the batch.
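As a rough sketch of those pieces in Haskell (my own reconstruction rather than the code from the repository; it assumes the random package for randomRIO):

import System.Random (randomRIO)

-- True means the merge request passes all tests, False means it fails.
type MergeRequest = Bool
type Batch = [MergeRequest]

-- Naive shuffle: repeatedly pick a random element and move it to the front.
shuffle :: [a] -> IO [a]
shuffle [] = pure []
shuffle xs = do
  i <- randomRIO (0, length xs - 1)
  let (before, x : after) = splitAt i xs
  (x :) <$> shuffle (before ++ after)

-- A batch of 100 merge requests, failPercent of which are bad, in random order.
mkBatch :: Int -> IO Batch
mkBatch failPercent =
  shuffle (replicate failPercent False ++ replicate (100 - failPercent) True)

-- A test run fails if any merge request in the batch is bad.
testRun :: Batch -> Bool
testRun = and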

Bisection

If there is a failing merge request in the batch then we can split the batch in two and check each sub-batch again. Unfortunately this isn’t a simple binary search, since we may have multiple failing merge requests; we must therefore run tests against a sub-batch even if the opposing sub-batch failed. If a batch of two merge requests fails and one of them passes by itself, we can assume that the other one fails (in a scenario with flaky tests we cannot make this assumption). Consider an example:

let a = goodCode
b = badCode
c = goodCode
d = goodCode

[a, b, c, d] -> Failed

[a, b] -> Failed

[c, d] -> Passed

[a] -> Passed

[b] -> Don't need to check

In this case we had to run the tests 4 times in order to merge all the good requests, the same number of test runs as testing each merge request individually.
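Counting the test runs needed by this kind of bisection could look something like the sketch below (again my own reconstruction, building on the Batch and testRun definitions above, and only valid while tests are not flaky):

-- Number of test runs needed to merge a batch by bisection.
bisectRuns :: Batch -> Int
bisectRuns [] = 0
bisectRuns batch
  | testRun batch     = 1   -- the whole batch passes in a single run
  | length batch == 1 = 1   -- a single merge request that fails
  | otherwise =
      let (left, right) = splitAt (length batch `div` 2) batch
          leftRuns  = bisectRuns left
          rightRuns =
            if testRun left && length right == 1
              -- the left half passed, so the single remaining merge
              -- request must be the bad one: no extra run needed
              then 0
              else bisectRuns right
      in 1 + leftRuns + rightRuns

-- For the example above: bisectRuns [True, False, True, True] == 4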

Batch Size

Intuition told me that bigger batch sizes would perform worse. We can model the batch size by using n = batchSize instead of 100; we then need to scale the number of True and False values accordingly. We end up with a list of batchSize boolean values, some of them True and some False.

As it turns out, my intuition was wrong: batch size has no effect other than to reduce the standard deviation, i.e. with a bigger batch size we get a more accurate value for the number of test runs required on average. This simplifies the code nicely.

Results

I used Chart to plot the results and increased the batch size to improve accuracy. This produces a graph of test failure probability against the average number of test runs required. I added a line to the plot at a constant number of test runs equal to the batch size; this represents the situation where we don’t batch at all.
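For reference, plotting something like this with Chart’s Easy interface might look as follows (a sketch only, assuming the Chart-cairo backend and placeholder data; the real plotting code is in the repository):

import Graphics.Rendering.Chart.Easy
import Graphics.Rendering.Chart.Backend.Cairo (toFile)

-- (failure probability, average test runs) pairs; placeholder values only,
-- the real numbers come from the simulation.
batched, unbatched :: [(Double, Double)]
batched   = [(p, 0)   | p <- [0 .. 100]]
unbatched = [(p, 100) | p <- [0 .. 100]]  -- no batching: one run per merge request

main :: IO ()
main = toFile def "test-runs.png" $ do
  layout_title .= "Average test runs vs failure probability"
  plot (line "batched"     [batched])
  plot (line "no batching" [unbatched])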

[Chart: average test runs vs failure probability]

We can see from this chart that if 24% or more merge requests fail then it is more efficient not to batch our merge requests.

What if our test runs are a bit flaky and occasionally fail when they should pass? One way to retry fairly efficiently is to run tests against the merge requests that we would otherwise assume are bad because they failed when batched with a merge request that passes by itself. We can simulate this in the following way:

let a = goodCode
b = goodCode

[a, b] -> Failed due to a flaky test

[a] -> Passed

-- we could assume that the right side will fail, but the original batch
-- could have failed due to a flaky test, so we run it anyway
[b] -> Passed
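In terms of the bisection sketch from earlier, this addition simply removes the skip: we never infer that a sub-batch is bad without actually testing it. A minimal sketch of that structural change (simulating the flaky test runs themselves would additionally need a random chance of a spurious failure):

-- Bisection when test runs may be flaky: both halves of a failing batch
-- are always tested; we never assume a sub-batch is bad without running it.
bisectRunsFlaky :: Batch -> Int
bisectRunsFlaky [] = 0
bisectRunsFlaky batch
  | testRun batch     = 1
  | length batch == 1 = 1
  | otherwise =
      let (left, right) = splitAt (length batch `div` 2) batch
      in 1 + bisectRunsFlaky left + bisectRunsFlaky right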

If we make that addition to the algorithm we get the following results:

[Chart: average test runs vs failure probability, with flaky tests]

We can see that now if 16% or more merge requests fail then it is more efficient not to batch our merge requests.

Conclusions

So how does all this help us? Well, if your merge requests tend to fail less than 24% of the time then you can gain some speed by batching. 24% might seem quite high: surely the code we write doesn’t fail that often? In reality such failure rates are not unusual, especially if you have loads of imperfect end-to-end tests. Be honest, how green is your pipeline? If your tests are a bit flaky but fewer than 16% of merge requests fail on average, then you can use the alternative bisection algorithm and still make some gains.

Of course, this is just a simulation and there are other factors that affect reality. It’s also worth noting that this simulation assumes no heuristics or intelligence are used to pick out which merge requests are likely to be causing the failures. Think of the scenario where you use a staging branch and carry out tests there (possibly manual) before merging to master. In this scenario it is common for developers to inspect the test results and intelligently decide which merge requests to remove, thus increasing the chances of staging passing next time without having to bisect. Experience has shown me that the ability to pick out bad merge requests tends to degrade as the team grows and as the number of merge requests in staging increases, since the interactions between merge requests rapidly become more complex.

We can say that whether batching is worth it depends only on the failure rate of merge requests and the algorithm (whether intelligent or not) used to find the failing ones; it doesn’t matter how long your tests take. In the case of a simple bisection algorithm we can also say that it doesn’t matter what the batch size is.

There are other reasons why you may wish to avoid batching, such as reducing the risk involved in releases by deploying smaller changes. On the other hand, improving code quality and code reviews lowers the failure rate and can therefore make efficiency techniques such as batching viable. Having multiple ‘tiers’ of testing could also make batching worth it, since merge requests that have already passed other, faster tests are less likely to fail in the batch.

The most important thing I will take away from this, though, is that if I am ever in a situation where batching is being considered (or where removing existing batching is being considered), I will make sure I spend some time measuring failure rates before spending any engineering effort on building the test pipeline.