Looking At Quality · Continuously Improving

Quality is more complicated than we often think and we would do well to dig deeper.

In the context of software engineering, what is quality? Why do we care about it? How does it help us? When we think about “quality” we usually go straight to QA, or “Quality Assurance”, and from there to testing. However, testing is just one small facet of quality. We might think that quality is about preventing bugs, but we can go even further back to basics: why do we care about bugs?

Bugs

I will define a bug as an unexpected behavior of your system that causes damage to the user. This in turn causes damage to the business (loss of clients, reputation, fines, etc.). A bug reaching production doesn’t always have a fixed cost, though; usually what’s important is how long that bug is in production. Of course, if you’re building an air traffic control system then a bug in production for just a moment could mean death, but this is an unusual scenario. For most software engineers, damage is somewhat proportional to the time a bug is in production.

If we accept this then 3 things become equally important:

Minimize the chance of a bug getting to production
Minimize the time it takes to detect bugs in production
Minimize the time it takes to remove bugs from production
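
As a rough illustration of why all three goals matter, here is a back-of-the-envelope model in Python. It is only a sketch: the linear formula, the function name `expected_damage` and every number in it are assumptions made up for the example, not measurements.

```python
def expected_damage(p_escape, hours_to_detect, hours_to_fix, cost_per_hour):
    """Expected damage of one change under a simple linear model.

    Assumption: damage grows in proportion to the hours a bug is live,
    which is only "somewhat" true for real systems.
    """
    hours_in_production = hours_to_detect + hours_to_fix
    return p_escape * hours_in_production * cost_per_hour


# Hypothetical baseline: 10% of changes ship a bug, it takes a day to
# notice, a working day to fix, and it costs 1000 per hour while live.
baseline = expected_damage(0.10, 24, 8, 1000)

# Improving any one of the three factors lowers the expected damage.
fewer_escapes = expected_damage(0.05, 24, 8, 1000)
faster_detection = expected_damage(0.10, 12, 8, 1000)
faster_fix = expected_damage(0.10, 24, 4, 1000)

for label, value in [
    ("baseline", baseline),
    ("fewer escapes", fewer_escapes),
    ("faster detection", faster_detection),
    ("faster fix", faster_fix),
]:
    print(f"{label}: {value:.0f}")
```

Each factor in the model maps onto one of the three goals, so ignoring any of them leaves damage on the table.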

So what can we do about these 3 problems?

Testing

This is of course a massive subject, but I will summarize my experience to show how it fits into a more holistic quality strategy.

First, I consider regression testing and exploratory testing as two very different things:

Exploratory Testing

A creative process whereby a human tries to discover behaviors that have not been thought about. These behaviors may or may not be bugs; the point is that nobody has previously thought about what the behavior should be.

Exploratory testing is useful before the first release of a project; after that, however, it does not need to occur before code is released. In fact, experience has taught me that it is essential to carry it out in production (or a mirror of production), otherwise it will morph into regression testing.

Effective exploratory testing minimizes the time it takes to detect bugs in production.

Regression Testing

A procedural process whereby test cases are written before testing is carried out. The intent of these test cases is to ensure that the specified behavior is maintained as the system is changed. The writing of these tests can be used to aid a creative activity, for example test-driven development; however, it is still a matter of describing the expected behavior in an executable form (this could be ‘executed’ by a human or a computer).

It is vital to understand some of the properties of regression tests:

they are proof that a previously defined behavior works under a specific system state
in complex systems it may be impossible to define the system state exactly and hence tests can be ‘flaky’
there is a cost in writing and maintaining them as the system changes
their usefulness in finding the cause of a bug (and therefore fixing it) varies greatly

The last point is quite important and is one of the reasons for the ‘test pyramid’ that is considered best practice: higher-level, more opaque tests give a developer less information about why the test failed. We therefore prefer to write unit tests over end-to-end tests. When a unit test fails we can usually fix things pretty quickly, whereas when an end-to-end test fails we might need a lengthy investigation to discover what is wrong and fix it. Of course, the trade-off is that a unit test only tests a small part of the system. Many complex interactions can occur across the entire system, and an end-to-end test might be the only way to test them.
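
The difference in diagnostic value can be seen in a small Python sketch. The functions `apply_discount` and `checkout_total` are hypothetical stand-ins invented for this example, and the "end-to-end" test is only a miniature of what a real one would cover.

```python
def apply_discount(price, percent):
    """Hypothetical unit: reduce a price by a percentage."""
    return round(price * (100 - percent) / 100, 2)


def checkout_total(prices, percent, shipping):
    """Hypothetical higher-level flow that combines several steps."""
    subtotal = sum(apply_discount(p, percent) for p in prices)
    return round(subtotal + shipping, 2)


def test_apply_discount():
    # If this fails, the fault is almost certainly in apply_discount.
    assert apply_discount(200.0, 25) == 150.0


def test_checkout_total():
    # If this fails, the fault could be in discounting, summing or
    # shipping, so more investigation is needed to locate it.
    assert checkout_total([200.0, 100.0], 25, 5.0) == 230.0


# Plain assert-based tests can be called directly or collected by a
# test runner.
test_apply_discount()
test_checkout_total()
```

The second test exercises an interaction the first cannot see, which is the trade-off described above.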

Effective regression testing minimizes the chance of a bug getting to production.

Code Quality

In my experience this is an area that people do not relate to the number of bugs in the system, despite every developer knowing the two are strongly related. When deadlines are looming the solution seems to be “let’s just do loads of manual testing, we can sort out the tech debt later”. This is somewhat short-sighted, as refactoring the code may well lead to the discovery of existing bugs in addition to avoiding bugs in the future.

Of course, code quality is again a huge topic with many conflicting views but as an example here are some things that can affect it:

code reviews
pair programming
choice of language
choice of tools
level of developer experience
size of code-base
time spent refactoring
coding standards

However, what is important is that you do something.

Code quality affects both the chance of a bug making it to production and the time it takes to remove a bug from production.

Monitoring

Bugs always get into production, but that’s not the end of the story. How do you find out that a bug exists as quickly as possible? If a user complains then it’s already too late; with good monitoring, however, many bugs can be detected before this happens. The first time I worked somewhere with good monitoring it was a revelation. It wasn’t just about alerts when things go wrong: engineers would try to come up with new graphs and new stats that would show unexpected things happening, much the same as exploratory testing. In addition to finding real issues, monitoring can show that something will be a problem in the future, a quality dimension that is not really discoverable any other way.

There are a huge number of things you can do in this space and you should not underestimate the value it can provide.
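
As one small example of what this can look like, the Python sketch below flags an error count that sits far above its recent history. It is a deliberately naive approach, assuming a per-minute error count is already being collected; the function name `is_anomalous`, the threshold and the sample data are all invented for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is more than `threshold` standard deviations
    above the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase is unusual.
        return latest > baseline
    return (latest - baseline) / spread > threshold


# Hypothetical errors-per-minute readings from the last few minutes.
errors_per_minute = [2, 3, 1, 2, 4, 3, 2, 3]

print(is_anomalous(errors_per_minute, 3))   # an ordinary reading
print(is_anomalous(errors_per_minute, 20))  # a sudden spike
```

A check like this can raise an alert before any user has had time to complain, although real monitoring tools offer far more robust detection.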

Effective monitoring minimizes the time it takes to detect bugs in production.

Code Integration Pipeline

So let’s take it to the next step. You’ve found a bug in production and no users have reported it yet, but it’s still there and you need to fix it ASAP. This involves the following steps:

find the cause
fix the issue locally
review the fix
build and test the new code
release the fix to production

The faster you can get through this pipeline, the less the potential damage to your business. Additionally, the faster your pipeline, the more frequent and therefore smaller your releases will be, and smaller releases reduce the likelihood of bugs.
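
To see where the time goes, it can help to record how long each of the steps listed above takes. The Python sketch below assumes such timings are available; the helper `summarize_pipeline` and all of the durations are made up for the example.

```python
def summarize_pipeline(stage_minutes):
    """Return the total lead time and the slowest stage for a mapping
    of stage name to duration in minutes."""
    total = sum(stage_minutes.values())
    slowest = max(stage_minutes, key=stage_minutes.get)
    return total, slowest


# Hypothetical durations, in minutes, for one bug fix.
example = {
    "find the cause": 60,
    "fix the issue locally": 30,
    "review the fix": 45,
    "build and test the new code": 20,
    "release the fix to production": 10,
}

total_minutes, bottleneck = summarize_pipeline(example)
print(f"time to remove the bug: {total_minutes} minutes")
print(f"slowest stage: {bottleneck}")
```

Under the assumption that damage grows with time in production, the slowest stage is the natural place to look for improvements first.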

A fast code integration pipeline minimizes both the time it takes to remove bugs from production and the chance of a bug making it to production.

Conclusions

I consider there to be 5 equally important facets of quality:

code quality
regression testing (this includes unit, integration, end to end and manual)
monitoring
exploratory testing
code delivery pipeline performance

I have a lot of opinions on all of the topics I’ve talked about here, and I may go into more detail about those in later articles. The takeaway from this article should be that we need to stop equating quality with tests. If you want to avoid the damage to your business caused by engineering errors then you need to look equally at all the aspects of quality I’ve mentioned above. Personally, I would even go so far as to say that automated testing is usually less value for money than some of the other aspects.