EXPERIENCE WITH THE COST OF DIFFERENT COVERAGE GOALS FOR TESTING

Software QA FYI - SQAFYI

By: Brian Marick

In coverage-based testing, coverage conditions are generated from the program text. For example, a branch generates two conditions: that it be taken in the false direction and in the true direction. The proportion of coverage conditions a test suite exercises can be used as an estimate of its quality. Some coverage conditions are impossible to exerise, and some are more cost-effectively eliminated by static analysis. The remainder, the feasible coverage conditions, are the subject of this paper.

What percentage of branch, loop, multi-condition, and weak mutation coverage can be expected from thorough unit testing? Seven units from application programs were tested using an extension of traditional black box testing. Nearly 100% feasible coverage was consistently achieved. Except for weak mutation, the additional cost of reaching 100% feasible coverage required only a few percent of the total time. The high cost for weak mutation was due to the time spent identifying impossible coverage conditions.

Because the incremental cost of coverage is low, it is reasonable to set a unit testing goal of 100% for branch, loop, multi-condition, and a subset of weak mutation coverage. However, reaching that goal after measuring coverage is less important than nearly reaching it with the initial black box test suite. A low initial coverage signals a problem in the testing process.

1. Introduction One strategy for testing large systems is to test the low-level components ("units") thoroughly before combining them. The expected benefit is that failures will be found early, when they are cheaper to diagnose and correct, and that the cost of later integration or system testing will be reduced. One way of defining "thoroughly" is through coverage measures. This paper addresses these questions:

(1) What types of coverage should be measured?

(2) How much coverage should be expected from black box unit testing?

(3) What should be done if it is not achieved?

This strategy is expensive; in the last section, I discuss ways of reducing its cost. A different strategy is to test units less thoroughly, or not at all, and put more effort into integration and system testing. The results of this study have little relevance for that strategy, though they may provide some evidence in favor of the first strategy.

Note on terminology: "unit" is often defined differently in different organizations. I define a unit to be a single routine or a small group of closely related routines, such as a main routine and several helper routines. Units are normally less than 100 lines of code.

1.1. Coverage

Coverage is measured by instrumenting a program to determine how thoroughly a test suite exercises it. There are two broad classes of coverage measures. Path-based coverage requires the execution of particular components of the program, such as statements, branches, or complete paths. Fault-based coverage requires that the test suite exercise the program in a way that would reveal likely faults.

1.1.1. Path-based Coverage

Branch coverage requires every branch to be executed in both directions. For this code

if (arg > 0) { counter = 0; /* Reinitialize */ }

branch coverage requires that the IF’s test evaluate to both TRUE and FALSE. Many kinds of faults may not be detected by branch coverage. Consider a test-at-the-top loop that is intended to sum up the elements of an array:

sum = 0; while (i > 0) { i -= 1; sum = pointer[i]; /* Should be += */ }

This program will give the wrong answer for any array longer than one. This is an important class of faults: those that are only revealed when loops are iterated more than once. Branch coverage does not force the detection of these faults, since it merely requires that the loop test be TRUE at least once and FALSE at least once. It does not require that the loop ever be executed more than once. Further, it doesn’t require that the loop ever be skipped (that is, that i initially be zero or less). Loop coverage [Howden78] [Beizer83] requires these two things.

Multi-condition coverage [Myers79] is an extension to branch coverage. In an expression like

if (A && B)

multi-condition coverage requires that A be TRUE in some test, A be FALSE in some test, B be TRUE in some test, and B be FALSE in some test. Multi-condition coverage is stronger than branch coverage; these two inputs

A == 1, B == 1

A == 0, B == 1

satisfy branch coverage, but do not satisfy multi-condition coverage. There are other coverage measures, such as dataflow coverage [Rapps85] [Ntafos84]. They are not measured in this experiment, so they are not discussed here.

1.1.2. Fault-based Coverage

In weak mutation coverage [Howden82] [Hamlet77], we suppose that a program contains a particular simple kind of fault. One such fault might be using <= instead of the correct < in an expression like this:

if (A <= B)

Given this program, a weak mutation coverage system would produce a message like "gcc.c", line 488: operator <= might be < This message would be produced until the program was executed over a test case such that (A £B ) has a different value than (A
(A £B )¹(A
or, equivalently, that

A =B

Notice the similarity of this requirement to the old testing advice: "always check boundary conditions".

Suppose we execute the program and satisfy this coverage condition. In this case, the incorrect program (the one we’re executing) will take the wrong branch. Our hope is that this incorrect program state will persist until the output, at which point we’ll see it and say "Bug!". Of course, the effect might not persist. However, there’s evidence [Marick90] [Offutt91] that over 90% of such faults will be detected by weak mutation coverage. Of course, not all faults are such simple deviations from the correct program; probably a small minority are [Marick90]. However, the hope of weak mutation testing is that a test suite thorough enough to detect such simple faults will also detect more complicated faults; this is called the coupling effect [DeMillo78]. The coupling effect has been shown to hold in some special cases [Offutt89], but more experiments are needed to confirm it in general.

There are two kinds of weak mutation coverage. Operator coverage is as already described -- we require test cases that distinguish operators from other operators. Operand coverage is similar -- for any operand, such as a variable, we assume that it ought to have been some other one. That is, in

my_function(A)

we require test cases that distinguish A from B (coverage condition A ¹B ), A from C, A from D, and so on. Since there may be a very large number of possible alternate variables, there are usually a very large number of coverage conditions to satisfy. The hope is that a relatively few test cases will satisfy most of them.

1.2. Feasible Coverage Conditions

All coverage techniques, not just mutation coverage, generate coverage conditions to satisfy. (A branch, for example, generates two conditions: one that the branch must be taken true, and one that it must be taken false.) Ideally, one would like all coverage conditions to be satisfied: 100% coverage. However, a condition may be impossible to satisfy. For example, consider this loop header:

for(i = 0; i < 4; i++)

Two loop conditions cannot be satisfied: that the loop be taken zero times, and that the loop be taken once. Not all impossible conditions are this obvious. Programmer "sanity checks" also generate them: phys = logical_to_physical(log);

if (NULL_PDEV == phys) fatal_error("Program error: no mapping.");

The programmer’s assumption is that every logical device has, at this point, a physical device. If it is correct, the branch can never be taken true. Showing the branch is impossible means independently checking this assumption. There’s no infallible procedure for doing that.

In some cases, eliminating a coverage condition may not be worth the cost. Consider this code:

if (unlikely_error_condition) halt_system();

Suppose that halt_system is known to work and that the unlikely error condition would be tremendously difficult to generate. In this case, a convincing argument that unlikely_error_condition indeed corresponds to the actual error condition would probably be sufficient. Static analysis - a correctness argument based solely on the program text - is enough. But suppose the code looked like this:

if (unlikely_error_condition) recover_and_continue();

Error recovery is notoriously error-prone and often fails the simplest tests. Relying on static analysis would usually not be justified, because such analysis is often incorrect -- it makes the same flawed implicit assumptions that the designer did.

Coverage conditions that are possible and worth testing are called feasible. Distinguishing these from the others is potentially time-consuming and annoyingly imprecise. A common shortcut is to set goals of around 85% for branch coverage (see, for example, [Su91]) and consider the remaining 15% infeasible by assumption. It is better to examine each branch separately, even if against less deterministic rules.

1.3. Coverage in Context

For the sake of efficiency, testing should be done so that many coverage conditions are satisfied without the tester having to think about them. Since a single black box test can satisfy many coverage conditions, it is most efficient to write and run black box tests first, then add new tests to achieve coverage.

Test design is actually a two stage process, though the two are usually not described separately. In the first stage, test conditions are created. A test condition is a requirement that at least one test case must satisfy. Next, test cases are designed. The goal is to satisfy as many conditions with as few test cases as possible [Myers79]. This is partly a matter of cost, since fewer test cases will (usually) require less time, but more a matter of effectiveness: complex test cases are more "challenging" to the program and are more likely to find faults through chance.

After black box tests are designed, written, and run, a coverage tool reports unsatisfied coverage conditions. Those which are feasible are the test conditions used to generate new test cases, test cases that improve the test suite. Our hope is that there will be few of them.

1.4. The Experiment

Production use of a branch coverage tool [Zang91] gave convincing evidence of the usefulness of coverage. Another tool, GCT, was built to investigate other coverage measures. It was developed at the same time as a variant black box testing technique. To tune the tool and technique before putting them to routine use, they were used on arbitrarily selected code from readily available programs. This early experience was somewhat surprising: high coverage was routinely achieved. A more thorough study, with better record-keeping and a stricter definition of feasibility, was started. Consistently high coverage was again seen, together with a low incremental cost for achieving 100% feasible coverage, except for weak mutation coverage. These results suggest that 100% feasible coverage is a reasonable testing goal for unit testing.

Further, the consistency suggests that coverage can be used as a "signal" in the industrial quality control sense [DeVor91]: as a normally constant measure which, when it varies, indicates a potential problem in the testing process.

2. Materials and Methods

2.1. The Programs Tested

Seven units were tested. One was an entire application program. The others were routines or pairs of routines chosen from larger programs in widespread use. They include GNU make, GNU diff, and the RCS revision control system. All are written in C. They are representative of UNIX application programs.

The code tested ranged in size from 30 to 272 lines of code, excluding comments, blank lines, and lines containing only braces. The mean size was 83 lines. Size can also be measured by total number of coverage conditions to be satisfied, which ranged from 176 to 923, mean 479. On average, the different types of coverage made up these percentages of the total:

Branch 7.5%; Loop 3.0%; Multi 2.7% Operator 16.1% ; Operand 70.7%

In the case of the application program, the manual page was used as a specification. In the other cases, there were no specifications, so I wrote them from the code. The specifications were written in a rigorous format as a list of preconditions (that describe allowable inputs to the program) and postconditions (each of which describes a particular result and the conditions under which it occurs). The format is partially derived from [Perry89]; an example is given in Appendix A.

2.2. The Test Procedure

The programs were tested in five stages. Each stage produced a new test suite that addressed the shortcomings of its predecessors.

2.2.1. Applying a Variant Black Box Technique

The first two test suites were developed using a variant black box testing technique. It is less a new technique than a codification of good practice. Its first stage follows these steps:

Black 1: Test conditions are methodically generated from the form of the specification. For example, a precondition of the form "X must be true" generates two test conditions: "X is true" and "X is false", regardless of what X actually is. These test conditions are further refined by processing connectives like AND and OR (using rules similar to cause-effect testing [Myers79]). For example, "X AND Y" generates three test cases:

X is true, Y is true. X is false, Y is true. X is true, Y is false.

Black 2: Next, test conditions are generated from the content of the specification. Specifications contain cliches [Rich90]. A search of a circular list is a typical cliche. Certain data types are also used in a cliched way. For example, the UNIX pathname as a slash-separated string is implicit in many specifications. Cliches are identified by looking at the nouns and verbs in the specification: "search", "list", "pathname".

The implementations of these cliches often contain cliched faults.1 For example:

(1) If the specification includes a search for an element of a circular list, one test condition is that the list does not include the element. The expectation is that the search might go into an infinite loop.

(2) If a function decomposes and reconstructs UNIX pathnames, one test condition is that it be given a pathname of the form "X//Y", because programmers often fail to remember that two slashes are equivalent to one.

Because experienced testers know cliched faults, they use them when generating tests. However, writing the cliches and faults down in a catalog reduces dependency on experience and memory. Such a catalog has been written; sample entries are given in Appendix B.

Black 3: These test conditions are combined into test cases. A test case is a precise description of particular input values and expected results.

The next stage is called broken box testing2. It exposes information that the specification hides from the user, but that the tester needs to know. For example, a user needn’t know that a routine uses a hash table, but a tester would want to probe hash table collision handling. There are two steps:

Broken 1: The code is scanned, looking for important operations and types that aren’t visible in the specification. Types are often recognized because of comments about the use of a variable. (A variable’s declaration does not contain all the information needed; an integer may be a count, a range, a percentage, or an index, each of which produces different test conditions.) Cliched operations are often distinct blocks of code separated by blank lines or comments. Of course, the key way you recognize a cliche is by having seen it often before. Once found, these cliches are then treated exactly as if they had been found in the specification. No attempt is made to find and satisfy coverage conditions. (The name indicates this: the box is broken open enough for us to see gross features, but we don’t look at detail.)

Broken 2: These new test conditions are combined into new test cases.

In production use, a tester presented with a specification and a finished program would omit step Black3. Test conditions would be derived from both the specification and the code, then combined together. This would minimize the size of the test suite. For this experiment, the two test suites were kept separate, in order to see what the contribution of looking at the code would be. This also simulates the more desirable case where the tests are designed before the code is written. (Doing so reduces elapsed time, since tests can be written while the code is being written. Further, the act of writing concrete tests often discovers errors in the specification, and it’s best to discover those early.) In this experiment, the separation is artificial. I wrote the specifications for six of the programs, laid them aside for a month, then wrote the test cases, hoping that willful forgetfulness would make the black box testing less informed by the implementation.

2.2.2. Applying Coverage

After the black and broken box tests were run, three coverage test suites were written. At each stage, cumulative coverage from the previous stages was measured, and a test suite that reached 100% feasible coverage on the next coverage goal was written and run. The first suite reached 100% feasible branch and loop coverage, the next 100% feasible multi-condition coverage, and the last 100% feasible weak mutation coverage. The stages are in order of increasing difficulty. There is no point in considering weak mutation coverage while there are still unexecuted branches, since eliminating the branches will eliminate many weak mutation conditions without the tester having to think about them.

In each stage, each coverage condition was first classified. This meant: (1) For impossible conditions, an argument was made that the condition was indeed impossible. This argument was not written down (which reduces the time required). (2) Only weak mutation conditions could be considered not worthwhile. The rule (necessarily imprecise) is given in the next section. In addition to the argument for infeasibility, an argument was made that the code was correct as written. (This was always trivial.) The arguments were not written down.

(3) For feasible conditions, a test condition was written down. It described the coverage condition in terms of the unit’s input variables.

After all test conditions were collected, they were combined into test cases in the usual way.

2.2.2.1. Feasibility Rules

Weak mutation coverage conditions were never ruled out because of the difficulty of eliminating them, but only when a convincing argument could be made that the required tests would have very little chance of revealing faults. That is, the argument is that the coupling effect will not hold for a condition. Here are three typical cases:

(1) Suppose array is an input array and array[0]=0 is the first statement in the program. If array[0] is initially always 0, GCT will complain that the initial and final value of the array are never different. However, a different initial value could never have any effect on the program.

(2) In one program, fopen() always returned the same file pointer. Since the program doesn’t manipulate the file pointer, except to pass it to fread() and fclose(), a different file pointer would not detect a fault.

(3) A constant 0 is in a line of code executed by only one test case. In that test case, an earlier loop leaves an index variable with the value 0. GCT complains that the constant might be replaced by the variable. That index variable is completely unused after the loop, it has been left with other values in other tests, and the results of the loop do not affect the execution of the statement in question. Writing a new test that also executes the statement, but with a different value of the index variable, is probably useless.

2.2.2.2. Weak mutation coverage

Weak mutation coverage tools can vary considerably in what they measure. GCT began with the singletoken transformations described in [Offutt88] and [Appelbe??], eliminating those that are not applicable to C. New transformations were added to handle C operators, structures, and unions. Space does not permit a full enumeration, but the extensions are straightforward. See [Agrawal89] for another way of applying mutation to C.

Three extensions increased the cost of weak mutation coverage:

(1) In an expression like (variable < expression ), GCT requires more than that variable ¹alternate. It requires that variable
(2) Weak sufficiency also applies to compound operands. For example, when considering the operator *ptr, GCT requires *ptr ¹*other_ptr , not just ptr ¹other_ptr . (Note: [Howden82] handles compound operands this way. It’s mentioned here because it’s not an obvious extension from the transformations given in [Offutt88], especially since it complicates the implementation somewhat.)

(3) Variable operands are required to actually vary; they cannot remain constant. See [Agrawal89] for another way of applying mutation to C.

2.2.3. What was Measured

For each stage, the following was measured:

(1) The time spent designing tests. This included time spent finding test conditions, ruling out infeasible coverage conditions, deciding that code not worth covering (e.g., potential weak mutation faults) was correct, and designing test cases from test conditions. The time spent actually writing the tests was not measured, since it is dominated by extraneous factors. (Can the program be tested from the command line, or does support code have to be written? Are the inputs easy to provide, like integers, or do they have to be built, like linked lists?)

(2) The number of test conditions and test cases written down.

(3) The percent of coverage, of all types, achieved. This is the percent of total coverage conditions, not just feasible ones.

(4) The number of feasible, impossible, and not worthwhile coverage conditions.

2.2.4. An Example

LC is a 272-line C program that counts lines of code and comments in C programs. It contains 923 coverage conditions.

The manpage was used as the starting specification; 101 test conditions were generated from it. These test conditions could be satisfied by 36 test cases. Deriving the test conditions and designing the test cases took

2.25 hours. Four faults were found.3 These coverages were achieved in black box testing:

More to see http://www.testing.com/writings/experience.pdf

Other Resource

... to read more articles, visit http://sqa.fyicenter.com/art/

EXPERIENCE WITH THE COST OF DIFFERENT COVERAGE GOALS FOR TESTING