Automated Test Generation And Verified Software ?
By: John Rushby
Abstract. Testing remains the principal means of verification in many
certification regimes. Formal methods of verification will coexist with
testing and should be developed in ways that improve, supplement, and
exploit the value of testing. I describe automated test generation, which
uses technology from formal methods to mechanize the construction of
test cases, and discuss some of the research challenges in this area.
By testing I mean observation of a program in execution under controlled conditions.
Observations are compared against an explicit or informal oracle to detect
bugs or confirm correctness. Much of the testing process is automated in modern
development environments, but construction of test cases (i.e., the specific
experiments to be performed) remains a largely manual process.
Testing is the method by which most software is verified today. This is true
for safety critical software as well as the commodity variety: the highest level of
flight critical software (DO-178B Level A) is required to be tested to a structural
code coverage criterion known as MC/DC (Modified Condition/Decision Coverage)
. And although formal methods of analysis and verification are becoming
sanctioned, even desired, by some certification regimes, testing continues to be
required also—because it can expose different kinds of problems (e.g., compiler
bugs), can examine the program in its system context, and increases the diversity
of evidence available.
The weakness of testing is well-known to the formal methods and verification
communities—it can only show the presence of bugs—but those communities are
now beginning to recognize its strength: it can show the presence of bugs—often,
very effectively. It is a great advantage in verification if the software to be verified
is actually correct, so inexpensive methods for revealing incorrectness early in
the development and verification process are necessary for verified software to
be economically viable.
Thus, testing is not a rival to formal methods of verification, but a valuable
and complementary adjunct. It is worthwhile to study how each can support the
other, both in the technology that they employ, and in their contribution to the
overall goal of cost-effective verification.
In this regard, the most significant recent development in testing has been
the application of technologies from verification (notably, model-checking, SAT
solving, and constraint satisfaction) to automate the generation of test cases.
Automated test generation poses urgent opportunities and challenges: there are
many technical challenges in achieving effective automation, there is a wealth of
opportunity in the different ways that automated testing can be used, and there
are serious implications for traditional certification regimes, and opportunities
for innovative ones; there are also opportunities for theoretical research in the
relationship between testing and verification, and for empirical inquiry into their
In this position paper, I briefly survey the topics mentioned above, and suggest
research directions for the development and use of automated test generation
2 Technology for Automated Test Generation
Much of the process of test execution and monitoring is automated in modern
software development practice. But the generation of test cases has remained
a labor-intensive manual task. Methods are now becoming available that can
automate this process.
A simple test-generation goal is to find an input that will drive execution of
a (deterministic, loop-free) program along a particular path in its control flow
graph. By performing symbolic execution along the desired path and conjoining
the predicates that guard its branch points, we can calculate the condition that
the desired test input must satisfy. Then, by constraint satisfaction, we can find
a specific input that provides the desired test case. This method generalizes to
find tests for other structural coverage criteria, and for programs with loops,
and for those that are reactive systems (i.e., that take an input at each step).
A major impetus for practical application of this approach was the realization
that (for finite state systems) it can be performed by an off-the-shelf model
checker: we simply check the property “always not P,” where P is a formula
that specifies the desired structural criterion, and the counterexample produced
by the model checker is then the test case desired . Different kinds of structural
or specification-based tests can be generated by choosing suitable P.
Using a model checker to generate tests in this way can be very straightforward
in model-based development, where we have an executable specification
for the program that is in, or is easily translated to, the language of a model
checker: the tests are generated from the executable specification, which then
provides the oracle when these are applied to the generated program. There are
many pragmatic issues in the selection of explicit-state, symbolic, or bounded
model checkers for this task  and it is, of course, possible to construct special-
ized test generators that use the technology of model checking but customize it
appropriately for this application.
The test generation task becomes more challenging when tests are to be
generated directly from a low-level program description, such as C code, when
the the path required is very long (e.g., when it is necessary to exhaust a loop
counter), when the program is not finite state, and when nondeterminism is
When tests are to be generated directly from C code, or similar, it is natural
to adopt techniques from software model checking. These seldom translate the
program directly into the language of the model checker but usually first abstract
it in some way. Predicate abstraction  is the most common approach,
and discovery of suitable predicates is automated very effectively in the lazyabstraction
approach . Abstractions for test generation are not necessarily
the same as those used for verification. For the latter, the abstraction needs to
be conservative (i.e., it should have more behaviors than the concrete program),
whereas in the former case we generally desire that any test generated from the
abstraction should be feasible in the concrete program (i.e., the abstraction may
have fewer behaviors than the concrete program) . This impacts the method
for constructing the abstraction, and the choice of theorem proving or constraint
satisfaction methods employed .
When very long test sequences are needed to reach a desired test target, it is
sometimes possible to generate them using specialized model checking methods
(e.g., those based on an ATPG engine ), or by generating the test incrementally
(so that each subproblem is within reach of the model checker). Some of the most
effective current approaches for generating long test sequences use combinations
of methods. For example, random test generation rapidly produces many long
paths through the program; to reach an uncovered test target, we find a location
“nearby” (e.g., measured by Hamming distance on the state variables) that has
been reached by random testing and then use model checking or constraint
satisfaction to extend the path from that nearby location to the one desired .
An alternative approach is to reduce the size of the model that represents the
program (it is easier to find longer paths in smaller models): this can be done
by standard model checking reductions such as slicing and cone of influence
reduction, and also by the predicate abstraction techniques mentioned above.
Traditional model checking technology must be extended or adapted when
the program is not finite state. In some cases, an infinite state bounded model
checker can be used (i.e., a bounded model checker that uses a decision procedure
for satisfiability modulo theories (SMT)  rather than a Boolean SAT
solver) . In other cases, such as those where inputs to the program are complex
data structures (e.g., trees represented as linked lists), we can randomly
or exhaustively generate all inputs up to some specified size. Straightforward
approaches can be very inefficient (e.g., very few randomly generated list structures
represent a valid red-black tree) and redundant (i.e., they generate many
inputs that are structurally “isomorphic” to each other), so that it is best to
that domain .
The test generation problem changes significantly when the program under
test is nondeterministic, or when part of the testing environment is not under
the control of the tester (e.g., testing an embedded system in its operational
environment). In these cases, we cannot generate test sequences independently
of their actual execution: it is necessary to observe the behavior of the system in
response to the test generated so far and to generate the next input in a way that
advances the purpose of the test. Thus, test generation becomes a problem of
controller synthesis; methods for solving this problem can use technology similar
to model checking but can seldom use an off-the-shelf model checker .
The problem becomes yet more difficult when the test environment includes
mechanical systems: for example, testing the shift controller of an automatic
gearbox in its full system context with a (real or simulated) gearbox attached.
Here, the test generation problem is escalated to one of controller synthesis in a
hybrid system (i.e., one whose description includes differential equations). This
is a challenging problem, but a plausible approach is to replace the hybrid system
elements of the modeled environment by conservative discrete approximations,
and then use methods for test generation in nondeterministic systems . As
in the case of predicate abstraction, the notion of “conservative” that is suitable
for test generation may differ from that used in verification.
3 Selection of Test Targets
The previous section has sketched how test cases can be generated automatically;
the next problem is to determine how to make good use of this capability.
One approach uses test generation to help developers explore their emerging designs
: a designer might say “show me a run that puts control at this point
with x 0.” This approach is very well-suited to model-based design environments
(i.e., those where the design is executable), but is less so for traditional
programming. An approach that has proven useful in traditional programming
is random test generation at the unit level. In some programming environments,
each unit is automatically subjected to random testing against desired properties
if these have been specified, or generic ones (e.g., no exceptions) as it is
checked in (Haskell QuickCheck  is the progenitor of this approach). A similar
approach can be used in theorem proving environments: before attempting
to prove a putative theorem, first try to refute it by random test generation 
(in PVS, this can also be tried during an interactive proof, if the current proof
goal looks intractable). These simple approaches are highly effective in practice.
More challenging tests can be achieved by exhaustive generation of inputs up
to some bounded size . In Extreme Programming, tests take on much of the
rˆole played by specifications in more traditional development methods , and
automated, incremental test generation can support this approach .
More traditional uses of testing are for systematic debugging, and for validation
and verification. In tests developed by humans, the first of these is generally
driven by some explicit or implicit hypotheses about likely kinds of bugs, while
the others are driven by systematic “coverage” of requirements and code.
One simple fault hypothesis is that errors are often made at the boundaries of
conditions (e.g., the substitution of < for ) and some automated test generators
target these cases . Another hypothesis is that compound decisions (e.g.,
A^B_C) may be constructed incorrectly so tests should target the “meaningful
impact”  of each condition within the decision (i.e., each must be shown able
to independently affect the outcome).1 It turns out that these ideas are related:
boundary testing for x y is equivalent to rewriting the decision as x < y_x = y
and then testing for meaningful impact of the two conditions. The classes of faults
detected by popular test criteria for compound decisions have been analyzed by
Kuhn  and extended by others [24, 25].
Requirements- or specification-based testing is most easily automated when
the requirements or specification are provided in executable form—as is commonly
done in model based development. Here, we can use the methods sketched
in Section 2 to generate tests that explore portions of the specified behavior. The
usual idea is that a good set of tests should thoroughly explore the control structure
of the specification; typical criteria for such structural coverage are to reach
every control state, to take every transition between control states, and more
elaborate variants that explore the conditions within the decisions that control
selection of transitions (as in the meaningful impact criteria mentioned earlier).
Structural coverage criteria can be augmented by “test purposes”  that describe
the kind of tests we want to generate (e.g., those in which the gear input
to a gearbox shift selector changes at each step, but only to an adjacent value), or
by predicates that describe relationships that should be explored (e.g., a queue
is empty, full, or in between) . Test purposes and predicates are related to
predicate abstraction and can be used to reduce the statespace of the model, and
thereby ease the model checking task underlying the test generation. Generating
a separate test for each coverage target produces inefficient test sets that contain
many short tests and much redundancy, so recent methods attempt to construct
more efficient “tours” that visit many targets in each test [3, 27].
Requirements-based testing is more difficult when requirements are specified
as properties. One approach is to translate the properties into automata (i.e.,
synchronous observers), then target structural coverage in the automata.
4 Testing for Verification
Certification regimes for which testing is an important component generally
require evidence that the testing has been thorough. DO-178B Level A (which
applies to the highest level of flight-critical software in civil aircraft) is typical:
it requires MC/DC code coverage. The expectation is that tests are generated
by consideration of requirements and their execution is monitored to measure
coverage of the code. As the industry moves toward model-based development,
... to read more articles, visit http://sqa.fyicenter.com/art/