EXPERIENCE WITH THE COST OF DIFFERENT COVERAGE GOALS FOR TESTING
By: Brian Marick
In coverage-based testing, coverage conditions are generated from the program text. For example, a
branch generates two conditions: that it be taken in the false direction and in the true direction. The proportion
of coverage conditions a test suite exercises can be used as an estimate of its quality. Some coverage
conditions are impossible to exerise, and some are more cost-effectively eliminated by static analysis.
The remainder, the feasible coverage conditions, are the subject of this paper.
What percentage of branch, loop, multi-condition, and weak mutation coverage can be expected from
thorough unit testing? Seven units from application programs were tested using an extension of traditional
black box testing. Nearly 100% feasible coverage was consistently achieved. Except for weak mutation,
the additional cost of reaching 100% feasible coverage required only a few percent of the total time. The
high cost for weak mutation was due to the time spent identifying impossible coverage conditions.
Because the incremental cost of coverage is low, it is reasonable to set a unit testing goal of 100% for
branch, loop, multi-condition, and a subset of weak mutation coverage. However, reaching that goal after
measuring coverage is less important than nearly reaching it with the initial black box test suite. A low initial
coverage signals a problem in the testing process.
One strategy for testing large systems is to test the low-level components ("units") thoroughly before combining
them. The expected benefit is that failures will be found early, when they are cheaper to diagnose
and correct, and that the cost of later integration or system testing will be reduced. One way of defining
"thoroughly" is through coverage measures. This paper addresses these questions:
(1) What types of coverage should be measured?
(2) How much coverage should be expected from black box unit testing?
(3) What should be done if it is not achieved?
This strategy is expensive; in the last section, I discuss ways of reducing its cost. A different strategy is to
test units less thoroughly, or not at all, and put more effort into integration and system testing. The results
of this study have little relevance for that strategy, though they may provide some evidence in favor of the
Note on terminology: "unit" is often defined differently in different organizations. I define a unit to be a
single routine or a small group of closely related routines, such as a main routine and several helper routines.
Units are normally less than 100 lines of code.
Coverage is measured by instrumenting a program to determine how thoroughly a test suite exercises it.
There are two broad classes of coverage measures. Path-based coverage requires the execution of particular
components of the program, such as statements, branches, or complete paths. Fault-based coverage
requires that the test suite exercise the program in a way that would reveal likely faults.
1.1.1. Path-based Coverage
Branch coverage requires every branch to be executed in both directions. For this code
if (arg > 0)
counter = 0; /* Reinitialize */
branch coverage requires that the IF’s test evaluate to both TRUE and FALSE.
Many kinds of faults may not be detected by branch coverage. Consider a test-at-the-top loop that is
intended to sum up the elements of an array:
sum = 0;
while (i > 0)
i -= 1;
sum = pointer[i]; /* Should be += */
This program will give the wrong answer for any array longer than one. This is an important class of
faults: those that are only revealed when loops are iterated more than once. Branch coverage does not
force the detection of these faults, since it merely requires that the loop test be TRUE at least once and
FALSE at least once. It does not require that the loop ever be executed more than once. Further, it doesn’t
require that the loop ever be skipped (that is, that i initially be zero or less). Loop coverage [Howden78]
[Beizer83] requires these two things.
Multi-condition coverage [Myers79] is an extension to branch coverage. In an expression like
if (A && B)
multi-condition coverage requires that A be TRUE in some test, A be FALSE in some test, B be TRUE in
some test, and B be FALSE in some test. Multi-condition coverage is stronger than branch coverage; these
A == 1, B == 1
A == 0, B == 1
satisfy branch coverage, but do not satisfy multi-condition coverage.
There are other coverage measures, such as dataflow coverage [Rapps85] [Ntafos84]. They are not measured
in this experiment, so they are not discussed here.
1.1.2. Fault-based Coverage
In weak mutation coverage [Howden82] [Hamlet77], we suppose that a program contains a particular simple
kind of fault. One such fault might be using <= instead of the correct < in an expression like this:
if (A <= B)
Given this program, a weak mutation coverage system would produce a message like
"gcc.c", line 488: operator <= might be <
This message would be produced until the program was executed over a test case such that (A £B ) has a
different value than (A
(A £B )¹(A
or, equivalently, that
Notice the similarity of this requirement to the old testing advice: "always check boundary conditions".
Suppose we execute the program and satisfy this coverage condition. In this case, the incorrect program
(the one we’re executing) will take the wrong branch. Our hope is that this incorrect program state will
persist until the output, at which point we’ll see it and say "Bug!". Of course, the effect might not persist.
However, there’s evidence [Marick90] [Offutt91] that over 90% of such faults will be detected by weak
mutation coverage. Of course, not all faults are such simple deviations from the correct program; probably
a small minority are [Marick90]. However, the hope of weak mutation testing is that a test suite thorough
enough to detect such simple faults will also detect more complicated faults; this is called the coupling
effect [DeMillo78]. The coupling effect has been shown to hold in some special cases [Offutt89], but more
experiments are needed to confirm it in general.
There are two kinds of weak mutation coverage. Operator coverage is as already described -- we require
test cases that distinguish operators from other operators. Operand coverage is similar -- for any operand,
such as a variable, we assume that it ought to have been some other one. That is, in
we require test cases that distinguish A from B (coverage condition A ¹B ), A from C, A from D, and so on.
Since there may be a very large number of possible alternate variables, there are usually a very large
number of coverage conditions to satisfy. The hope is that a relatively few test cases will satisfy most of
1.2. Feasible Coverage Conditions
All coverage techniques, not just mutation coverage, generate coverage conditions to satisfy. (A branch,
for example, generates two conditions: one that the branch must be taken true, and one that it must be
taken false.) Ideally, one would like all coverage conditions to be satisfied: 100% coverage. However, a
condition may be impossible to satisfy. For example, consider this loop header:
for(i = 0; i < 4; i++)
Two loop conditions cannot be satisfied: that the loop be taken zero times, and that the loop be taken once.
Not all impossible conditions are this obvious. Programmer "sanity checks" also generate them:
phys = logical_to_physical(log);
if (NULL_PDEV == phys)
fatal_error("Program error: no mapping.");
The programmer’s assumption is that every logical device has, at this point, a physical device. If it is
correct, the branch can never be taken true. Showing the branch is impossible means independently checking
this assumption. There’s no infallible procedure for doing that.
In some cases, eliminating a coverage condition may not be worth the cost. Consider this code:
Suppose that halt_system is known to work and that the unlikely error condition would be tremendously
difficult to generate. In this case, a convincing argument that unlikely_error_condition indeed
corresponds to the actual error condition would probably be sufficient. Static analysis - a correctness argument
based solely on the program text - is enough. But suppose the code looked like this:
Error recovery is notoriously error-prone and often fails the simplest tests. Relying on static analysis
would usually not be justified, because such analysis is often incorrect -- it makes the same flawed implicit
assumptions that the designer did.
Coverage conditions that are possible and worth testing are called feasible. Distinguishing these from the
others is potentially time-consuming and annoyingly imprecise. A common shortcut is to set goals of
around 85% for branch coverage (see, for example, [Su91]) and consider the remaining 15% infeasible by
assumption. It is better to examine each branch separately, even if against less deterministic rules.
1.3. Coverage in Context
For the sake of efficiency, testing should be done so that many coverage conditions are satisfied without the
tester having to think about them. Since a single black box test can satisfy many coverage conditions, it is
most efficient to write and run black box tests first, then add new tests to achieve coverage.
Test design is actually a two stage process, though the two are usually not described separately. In the first
stage, test conditions are created. A test condition is a requirement that at least one test case must satisfy.
Next, test cases are designed. The goal is to satisfy as many conditions with as few test cases as possible
[Myers79]. This is partly a matter of cost, since fewer test cases will (usually) require less time, but more a
matter of effectiveness: complex test cases are more "challenging" to the program and are more likely to
find faults through chance.
After black box tests are designed, written, and run, a coverage tool reports unsatisfied coverage conditions.
Those which are feasible are the test conditions used to generate new test cases, test cases that
improve the test suite. Our hope is that there will be few of them.
1.4. The Experiment
Production use of a branch coverage tool [Zang91] gave convincing evidence of the usefulness of coverage.
Another tool, GCT, was built to investigate other coverage measures. It was developed at the same
time as a variant black box testing technique. To tune the tool and technique before putting them to routine
use, they were used on arbitrarily selected code from readily available programs. This early experience
was somewhat surprising: high coverage was routinely achieved. A more thorough study, with better
record-keeping and a stricter definition of feasibility, was started. Consistently high coverage was again
seen, together with a low incremental cost for achieving 100% feasible coverage, except for weak mutation
coverage. These results suggest that 100% feasible coverage is a reasonable testing goal for unit testing.
Further, the consistency suggests that coverage can be used as a "signal" in the industrial quality control
sense [DeVor91]: as a normally constant measure which, when it varies, indicates a potential problem in
the testing process.
2. Materials and Methods
2.1. The Programs Tested
Seven units were tested. One was an entire application program. The others were routines or pairs of routines
chosen from larger programs in widespread use. They include GNU make, GNU diff, and the RCS
revision control system. All are written in C. They are representative of UNIX application programs.
The code tested ranged in size from 30 to 272 lines of code, excluding comments, blank lines, and lines
containing only braces. The mean size was 83 lines. Size can also be measured by total number of coverage
conditions to be satisfied, which ranged from 176 to 923, mean 479. On average, the different types of
coverage made up these percentages of the total:
Branch 7.5%; Loop 3.0%; Multi 2.7% Operator 16.1% ; Operand 70.7%
In the case of the application program, the manual page was used as a specification. In the other cases,
there were no specifications, so I wrote them from the code.
The specifications were written in a rigorous format as a list of preconditions (that describe allowable
inputs to the program) and postconditions (each of which describes a particular result and the conditions
under which it occurs). The format is partially derived from [Perry89]; an example is given in Appendix
2.2. The Test Procedure
The programs were tested in five stages. Each stage produced a new test suite that addressed the shortcomings of its predecessors.
2.2.1. Applying a Variant Black Box Technique
The first two test suites were developed using a variant black box testing technique. It is less a new technique
than a codification of good practice. Its first stage follows these steps:
Black 1: Test conditions are methodically generated from the form of the specification. For example, a
precondition of the form "X must be true" generates two test conditions: "X is true" and "X is false",
regardless of what X actually is. These test conditions are further refined by processing connectives like
AND and OR (using rules similar to cause-effect testing [Myers79]). For example, "X AND Y" generates
three test cases:
X is true, Y is true.
X is false, Y is true.
X is true, Y is false.
Black 2: Next, test conditions are generated from the content of the specification. Specifications contain
cliches [Rich90]. A search of a circular list is a typical cliche. Certain data types are also used in a cliched
way. For example, the UNIX pathname as a slash-separated string is implicit in many specifications.
Cliches are identified by looking at the nouns and verbs in the specification: "search", "list", "pathname".
The implementations of these cliches often contain cliched faults.1 For example:
(1) If the specification includes a search for an element of a circular list, one test condition is that the list
does not include the element. The expectation is that the search might go into an infinite loop.
(2) If a function decomposes and reconstructs UNIX pathnames, one test condition is that it be given a
pathname of the form "X//Y", because programmers often fail to remember that two slashes are
equivalent to one.
Because experienced testers know cliched faults, they use them when generating tests. However, writing
the cliches and faults down in a catalog reduces dependency on experience and memory. Such a catalog
has been written; sample entries are given in Appendix B.
Black 3: These test conditions are combined into test cases. A test case is a precise description of particular
input values and expected results.
The next stage is called broken box testing2. It exposes information that the specification hides from the
user, but that the tester needs to know. For example, a user needn’t know that a routine uses a hash table,
but a tester would want to probe hash table collision handling. There are two steps:
Broken 1: The code is scanned, looking for important operations and types that aren’t visible in the
specification. Types are often recognized because of comments about the use of a variable. (A variable’s
declaration does not contain all the information needed; an integer may be a count, a range, a percentage,
or an index, each of which produces different test conditions.) Cliched operations are often distinct blocks
of code separated by blank lines or comments. Of course, the key way you recognize a cliche is by having
seen it often before. Once found, these cliches are then treated exactly as if they had been found in the
specification. No attempt is made to find and satisfy coverage conditions. (The name indicates this: the
box is broken open enough for us to see gross features, but we don’t look at detail.)
Broken 2: These new test conditions are combined into new test cases.
In production use, a tester presented with a specification and a finished program would omit step Black3.
Test conditions would be derived from both the specification and the code, then combined together. This
would minimize the size of the test suite. For this experiment, the two test suites were kept separate, in
order to see what the contribution of looking at the code would be. This also simulates the more desirable
case where the tests are designed before the code is written. (Doing so reduces elapsed time, since tests
can be written while the code is being written. Further, the act of writing concrete tests often discovers
errors in the specification, and it’s best to discover those early.)
In this experiment, the separation is artificial. I wrote the specifications for six of the programs, laid them
aside for a month, then wrote the test cases, hoping that willful forgetfulness would make the black box
testing less informed by the implementation.
2.2.2. Applying Coverage
After the black and broken box tests were run, three coverage test suites were written. At each stage,
cumulative coverage from the previous stages was measured, and a test suite that reached 100% feasible
coverage on the next coverage goal was written and run. The first suite reached 100% feasible branch and
loop coverage, the next 100% feasible multi-condition coverage, and the last 100% feasible weak mutation
coverage. The stages are in order of increasing difficulty. There is no point in considering weak mutation
coverage while there are still unexecuted branches, since eliminating the branches will eliminate many
weak mutation conditions without the tester having to think about them.
In each stage, each coverage condition was first classified. This meant:
(1) For impossible conditions, an argument was made that the condition was indeed impossible. This
argument was not written down (which reduces the time required).
(2) Only weak mutation conditions could be considered not worthwhile. The rule (necessarily imprecise)
is given in the next section. In addition to the argument for infeasibility, an argument was made
that the code was correct as written. (This was always trivial.) The arguments were not written
(3) For feasible conditions, a test condition was written down. It described the coverage condition in
terms of the unit’s input variables.
After all test conditions were collected, they were combined into test cases in the usual way.
220.127.116.11. Feasibility Rules
Weak mutation coverage conditions were never ruled out because of the difficulty of eliminating them, but
only when a convincing argument could be made that the required tests would have very little chance of
revealing faults. That is, the argument is that the coupling effect will not hold for a condition. Here are
three typical cases:
(1) Suppose array is an input array and array=0 is the first statement in the program. If array is
initially always 0, GCT will complain that the initial and final value of the array are never different.
However, a different initial value could never have any effect on the program.
(2) In one program, fopen() always returned the same file pointer. Since the program doesn’t manipulate
the file pointer, except to pass it to fread() and fclose(), a different file pointer would not detect a
(3) A constant 0 is in a line of code executed by only one test case. In that test case, an earlier loop
leaves an index variable with the value 0. GCT complains that the constant might be replaced by the
variable. That index variable is completely unused after the loop, it has been left with other values in
other tests, and the results of the loop do not affect the execution of the statement in question. Writing
a new test that also executes the statement, but with a different value of the index variable, is
18.104.22.168. Weak mutation coverage
Weak mutation coverage tools can vary considerably in what they measure. GCT began with the singletoken
transformations described in [Offutt88] and [Appelbe??], eliminating those that are not applicable to
C. New transformations were added to handle C operators, structures, and unions. Space does not permit a
full enumeration, but the extensions are straightforward. See [Agrawal89] for another way of applying
mutation to C.
Three extensions increased the cost of weak mutation coverage:
(1) In an expression like (variable < expression ), GCT requires more than that variable ¹alternate. It
requires that variable
(2) Weak sufficiency also applies to compound operands. For example, when considering the operator
*ptr, GCT requires *ptr ¹*other_ptr , not just ptr ¹other_ptr . (Note: [Howden82] handles compound
operands this way. It’s mentioned here because it’s not an obvious extension from the transformations
given in [Offutt88], especially since it complicates the implementation somewhat.)
(3) Variable operands are required to actually vary; they cannot remain constant.
See [Agrawal89] for another way of applying mutation to C.
2.2.3. What was Measured
For each stage, the following was measured:
(1) The time spent designing tests. This included time spent finding test conditions, ruling out infeasible
coverage conditions, deciding that code not worth covering (e.g., potential weak mutation faults) was
correct, and designing test cases from test conditions. The time spent actually writing the tests was
not measured, since it is dominated by extraneous factors. (Can the program be tested from the command
line, or does support code have to be written? Are the inputs easy to provide, like integers, or
do they have to be built, like linked lists?)
(2) The number of test conditions and test cases written down.
(3) The percent of coverage, of all types, achieved. This is the percent of total coverage conditions, not
just feasible ones.
(4) The number of feasible, impossible, and not worthwhile coverage conditions.
2.2.4. An Example
LC is a 272-line C program that counts lines of code and comments in C programs. It contains 923 coverage
The manpage was used as the starting specification; 101 test conditions were generated from it. These test
conditions could be satisfied by 36 test cases. Deriving the test conditions and designing the test cases took
2.25 hours. Four faults were found.3
These coverages were achieved in black box testing:
More to see http://www.testing.com/writings/experience.pdf
... to read more articles, visit http://sqa.fyicenter.com/art/