Per-test Code Coverage
Insights into software test execution from an elaborate form of code coverage, an otherwise much reviled software metric.
Code Coverage 🙈🙊🙉
If a test cannot execute code, it cannot verify its expected behavior. Code coverage is a metric that tells developers the extent to which a project’s test suite is executing, or covering, the project’s shippable code. In other words, for a software developer, code coverage assesses the efficacy of a test suite (a collection of tests).
Typically, this metric is computed and reported as a single number, specifically a percentage — the percentage of the entire project’s shippable code that the project’s test suite executes (or covers).
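The arithmetic behind that single number is simple. Here is a minimal sketch in Python; the function name and inputs are illustrative, since real coverage tools (e.g. JaCoCo, coverage.py) instrument the code to record which lines actually ran:

```python
def coverage_percent(covered_lines: set[int], executable_lines: set[int]) -> float:
    """Percentage of a project's executable lines that the test suite executed."""
    if not executable_lines:
        return 100.0  # nothing to cover
    return 100.0 * len(covered_lines & executable_lines) / len(executable_lines)

# e.g. a 10-line file where the tests executed 7 of the lines:
print(coverage_percent(set(range(1, 8)), set(range(1, 11))))  # → 70.0
```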
The pass rate of a project’s tests (the number of tests that pass divided by the total number of tests) has no bearing on code coverage. As long as a test suite covers every executable line in a project, the suite is considered to be “effective” — even if every test in the suite fails.
Indeed, a code coverage of 100% with a pass rate of 0% might just mean that the project’s test suite is good at executing all of its lines of code, and in the process has revealed the project’s own failures (or bugs) that need fixes. This would be an example of the ideal test suite — can’t say much for the project though.
In such a scheme it is easy to think of code coverage as a linear metric:
100% → perfect coverage → perfect bug discovering efficacy
50% → imperfect coverage → imperfect bug discovering efficacy
0% → no coverage → no way to discover bugs
But it is easy to game code coverage, such that you can get a test suite to achieve 100% coverage, while not uncovering any actual bugs in the code. And hence the air-quotes around “effective”.
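As a hypothetical illustration of gaming the metric, consider a buggy function and a “test” that executes every line of it without asserting anything. All the names here are invented: the point is that full coverage is achieved while the bug survives.

```python
def absolute_value(x: int) -> int:
    if x < 0:
        return x  # bug: should return -x
    return x

def test_absolute_value_covers_everything():
    # Executes both branches -- 100% line coverage of absolute_value --
    # but makes no assertions, so the bug goes undetected.
    absolute_value(-5)
    absolute_value(5)

test_absolute_value_covers_everything()  # "passes", and the bug stays hidden
```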
Let’s dig into this a little more. Consider a project whose test suite achieves 100% code coverage. How does such a test suite go about executing the project’s code?
Does a single test in the test suite execute every executable line of code?
Does every test execute precisely one method/function in the project?
Do all tests execute every method/function in the project?
Is there some distribution of work involved in how the tests collectively achieve 100% coverage?
Do tests execute every line of code exactly once, just barely eking out the 100% coverage?
Do tests execute the same lines of code repeatedly, in different configurations, or through code paths?
Do the tests execute a single code path through all executable lines of code?
Do tests execute every line of code through every conceivable code path imaginable?
Do the tests execute code that is run routinely by the project’s end-users?
Do the tests execute code that offer the highest (business?) value to its stakeholders?
I can pick at every single bullet point above and describe a way to generate a test suite that achieves 100% or near-100% code coverage while being very ineffective at exposing bugs in the software.
Nuance: it matters a great deal ⚛️🤝
Two men face execution. One says, “You fool! As if it matters how a man falls down.”
His companion replies: “When the fall’s all that’s left, it matters a great deal.”
It matters a great deal how a test suite and its individual tests execute a project’s code. A single number or percentage value cannot capture the nuance and complexities involved.
Framing code coverage only as a “bottom line” percentage value — as is done throughout its use — misses the test suite’s complexities, and births meaningless flame wars around any discussion of code coverage.
Code coverage is not /just/ a metric. Instead, code coverage presents a systematic formulation, a template, for thinking about test runs and how code is executed. Consider the elements involved:
A project consists of:
executable product code that is deployed to servers or shipped to end users as applications;
a test suite that executes the product code to test the code’s expected behavior.
The test suite itself contains tens, hundreds, and often thousands of individual tests or test cases;
each such test executes different parts of the product code.
Product code in turn can be decomposed into:
files and/or classes; which contain …
methods and/or functions; which contain …
individual lines of code.
Lines of code can execute sequentially or in loops, or may run only under certain conditions.
Lines of code within methods or functions call other methods and functions, creating dependencies between methods.
Dependent methods can often reside within a single file, or class; or may be scattered across multiple classes and files.
Clearly there is a lot going on, in any program. But the most important bit in the list above is that tests execute different parts of the product code, specifically its different methods. And so, we can think about tabulating those tests and methods in a classic spreadsheet or matrix:
tests as rows
methods as columns
and for each intersecting cell in the matrix, where a test executes a method, we can mark with a cross or a check mark, and leave the rest empty.
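The tabulation described above can be sketched in a few lines of Python. The per-test coverage data here is invented purely for illustration:

```python
# Test matrix sketch: tests as rows, methods as columns,
# an "x" where a test executes a method, "." where it does not.
executed = {            # test name -> methods it executes (invented data)
    "test_0": {"A", "E", "H"},
    "test_1": {"B", "C"},
    "test_2": {"D"},
}
methods = sorted(set().union(*executed.values()))

rows = [" " * 7 + " ".join(methods)]          # header row of method names
for test, covered in executed.items():
    marks = " ".join("x" if m in covered else "." for m in methods)
    rows.append(f"{test} {marks}")
print("\n".join(rows))
```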
From any standard course on software testing, we know such a matrix as a test execution matrix, or a test coverage matrix, or a test matrix.
Per-Test Coverage: Code Coverage in a Matrix 🀟
A test matrix provides a detailed overview of how each test executes a portion of the methods (or lines of code) within the product code. The figure below illustrates an example test matrix.
Count the black beans in the figure above, and we see a test suite composed of Tests 0–9 with 100% code coverage. We also see it is not so simple. For instance, Test 0 executes Methods A, E, and H. Similarly, Tests 0, 3, and 9 execute Method A. Interestingly, Test 0 executes Method A in one configuration (i.e., alongside Methods E & H), while Test 9 executes Method A differently: alongside Methods D & E.
The coverage of both tests (0 and 9) is the same — they both execute three methods. But Tests 0 and 9 clearly test different things, even as they both execute Method A. If we turned off Test 0 or Test 9, the code coverage from the remaining test suite would still be 100%. But we would miss testing specific code paths by turning off either one. We need them — both.
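A small, hedged sketch of that argument: Tests 0 and 9 are taken from the example above, while the rest of the suite is stood in by a single invented entry so that every method is covered. Dropping Test 0 leaves method coverage at 100%, yet no remaining test exercises its exact combination of methods.

```python
# "test_0" and "test_9" mirror the example; "rest" is an invented
# stand-in for the remaining suite, covering methods A..H.
suite = {
    "test_0": {"A", "E", "H"},
    "test_9": {"A", "D", "E"},
    "rest":   set("ABCDEFGH"),
}
all_methods = set("ABCDEFGH")

def method_coverage(tests) -> float:
    covered = set().union(*(suite[t] for t in tests)) if tests else set()
    return 100.0 * len(covered & all_methods) / len(all_methods)

print(method_coverage(suite))                      # 100.0
print(method_coverage(suite.keys() - {"test_0"}))  # still 100.0 ...
# ...but no remaining test exercises exactly {A, E, H} together.
```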
This kind of per-test code coverage offers insight into a project’s test execution in a powerful, visual way that percentages cannot. Take a look at the test matrices of real-world projects and the variety they have to offer.
Maven, Jsoup, Commons-IO
Software testing textbooks are littered with examples of test matrices. Last year, I collaborated with my graduate school advisor and labmate to extend those textbook depictions into fully interactive matrix visualizations of test suites and their executions. We eventually published our work at VISSOFT 2021. We visualized the test suites of different Java projects, and what we found with those early visualizations was fascinating. Each project showed a different characteristic.
Some matrices were sparse in the relations between tests and methods — suggesting the tests were very focused in which methods they executed. Most were full of test-to-method relations — suggesting that most tests were executing every method within that project. I expected this to a certain extent, but the variety was intriguing.
Consider how the test matrix for Maven looks:
Look at the long horizontal and vertical lines, suggesting near-system-level tests within Maven. At the same time, consider the more diagonal waves of purple and yellow dots forming at the bottom-right and top-left corners of Maven’s test matrix. Those diagonals show a more 1-to-1 relation between some tests and the methods they execute — almost as if they are unit tests. So it would seem that Maven has a good mix of unit- and system-styled tests.
Now consider a more radical example, Jsoup:
Things are a bit more … ‘congested’. The entire matrix is filled almost exclusively with vertical and horizontal lines — suggesting that most Jsoup tests seem to execute the same chunks of Jsoup’s product code, repeatedly. It is as if every test is like a system test.
Now, let’s look at the other extreme, Commons-IO:
Things are much more … ‘peaceful’. Notice the diagonal clusters, suggesting that tests take on a more focused approach: testing one thing at a time, almost like unit-tests.
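One crude way to quantify the unit- versus system-style difference these matrices show is each test’s fan-out, i.e. the number of methods it touches. The threshold and data below are assumptions for illustration, not anything from the VISSOFT paper:

```python
def classify(test_to_methods: dict[str, set[str]], total_methods: int,
             unit_frac: float = 0.1) -> dict[str, str]:
    """Label a test "unit-like" if it touches at most unit_frac of the
    project's methods, "system-like" otherwise. The 10% cutoff is an
    arbitrary assumption chosen for this sketch."""
    return {
        test: "unit-like" if len(methods) <= unit_frac * total_methods
        else "system-like"
        for test, methods in test_to_methods.items()
    }

matrix = {"t1": {"A"}, "t2": set("ABCDEFGHIJ")}  # invented data
print(classify(matrix, total_methods=20))
# {'t1': 'unit-like', 't2': 'system-like'}
```

A Commons-IO-style matrix would be dominated by unit-like tests under this heuristic, while a Jsoup-style matrix would be dominated by system-like ones.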
Maven, Jsoup, Commons-IO: three different projects, all with high levels of code coverage. But their test coverage matrices take three distinct forms. Each form suggests a different set of capabilities for the respective test suites, and how the developers of these projects think about testing their code.
Claiming that, just because these projects have similar coverage levels, their test suites must have the same diagnostic capabilities is unfounded.
What we do not know
We can see that the test matrices of different projects take different forms. But we do not yet know how a test suite’s diagnostic or bug-capturing capabilities change with such forms. A lot of this depends on many factors:
how the project is being used in the wild by its intended users (user feedback plays a major role in how resourcing is diverted in software engineering);
the actual bugs reported (consider that not all bugs in code get reported, and often tests are geared towards issues that do get reported or that developers anticipate);
the nature of the projects themselves (writing tests for a build orchestration tool like Maven is very different from writing tests for a third-party library like Commons-IO).
But even before correlating with such complex factors, the next immediate chapter of this line of investigation needs to explore more projects and their tests to see if other distinct patterns start to emerge.
It is an exciting time to be back into research again!