Test Smarter, Not Harder
By: Scott Sehlhorst
Introduction: Complexity Leads to Futility
Imagine we are developing a web page for customizing a laptop purchase.
If you’ve never configured a laptop online before, take a look at Dell’s “customize it”
page for an entry level laptop. The web page presents eleven questions to the user that have
from two to seven responses each. The user has to choose from two options in the first
control, two in the second, and so on, up to seven possible choices for the last control.
When we look at all of the controls combined, the user has to make
(2,2,2,2,2,3,2,2,3,4,7) choices. This is a simple configuration problem. The number of
possible laptop configurations that could be requested by the user is the product of all of
the choices. In this very simple page, there are 32,256 possibilities. At the time of this
writing, the page for customizing Dell’s high-end laptop has a not dissimilar set of controls,
with more choices in each control: (3,3,3,2,4,2,4,2,2,3,7,4,4). The user of this page can
request any of 2,322,432 different laptop configurations! If Dell were to add one more
control presenting five different choices, there would be over ten million possible configurations.
Creating a test suite that tries all 2.3 million combinations for a high-end laptop
could be automated, but even if every test took one tenth of a second to run, the suite would
take over 64 hours! Dell changes their product offerings in less time than that.
Then again, if we use a server farm to distribute the test suite across ten machines, we
could run it in about six and a half hours. Ignoring the fact that we would be running this
type of test for each customization page Dell has, six and a half hours is not unreasonable.
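The counting above is easy to check in a couple of lines of Python. The tuples of per-control choice counts come straight from the pages described above, and 0.1 seconds per test is the stated assumption:

```python
from math import prod

# Choices per control on the two Dell customization pages described above.
entry_level = (2, 2, 2, 2, 2, 3, 2, 2, 3, 4, 7)
high_end = (3, 3, 3, 2, 4, 2, 4, 2, 2, 3, 7, 4, 4)

print(prod(entry_level))  # → 32256 possible configurations
print(prod(high_end))     # → 2322432 possible configurations

# Exhaustively testing the high-end page at 0.1 s per test:
hours = prod(high_end) * 0.1 / 3600
print(round(hours, 1))    # → 64.5 hours on one machine
```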
Validating the two million results is where the really big problem is waiting for us.
We can’t rely on people to manually validate all of the outputs–it is just too expensive. We
could write another program, which inspects those outputs and evaluates them using a
rules-based system (“If the user selects 1GB of RAM, then the configuration must include
1GB of RAM” and “The price for the final system must be adjusted by the price-impact of
1GB of RAM relative to the base system price for this model.”).
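As a sketch of that idea (the rule set, field names, and prices here are invented for illustration; they are not Dell's actual data or any particular tool's API), a rules-based output checker might look like:

```python
# Hypothetical rules-based output validator: each rule inspects one
# requested configuration and the page's computed result, and returns
# a violation message or None. All names and prices are illustrative.

BASE_PRICE = 899.00          # assumed base price for this model
RAM_UPGRADE_PRICE = 50.00    # assumed price impact of the 1GB RAM option

def rule_ram_matches(request, result):
    # "If the user selects 1GB of RAM, the configuration must include 1GB."
    if request["ram_gb"] != result["ram_gb"]:
        return f"RAM mismatch: asked {request['ram_gb']}GB, got {result['ram_gb']}GB"
    return None

def rule_price_adjusted(request, result):
    # "The final price must be the base price adjusted by the option's impact."
    expected = BASE_PRICE + (RAM_UPGRADE_PRICE if request["ram_gb"] == 1 else 0)
    if abs(result["price"] - expected) > 0.005:
        return f"Price mismatch: expected {expected:.2f}, got {result['price']:.2f}"
    return None

RULES = [rule_ram_matches, rule_price_adjusted]

def validate(request, result):
    # Run every rule; return the list of violations (empty list == pass).
    return [v for rule in RULES if (v := rule(request, result)) is not None]

# Example: a correctly priced 1GB configuration passes all rules.
print(validate({"ram_gb": 1}, {"ram_gb": 1, "price": 949.00}))  # → []
```

The point of the sketch is the maintenance cost the article describes: every repositioning or repricing of the product line means editing the rule functions, not just the test inputs.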
There are some good rules-based validation tools out there, but they are either
custom software, or so general as to require a large investment to make them applicable to
a particular customer. With a rules-based inspection system, we have the cost of
maintaining the rules. The validation rules are going to have to be updated regularly, as
Dell changes the way they position, configure, and price their laptops.
Since we aren’t Dell, we don’t have the scale (billions of dollars of revenue) to justify
this level of investment. The bottom line for us is that we can’t afford to exhaustively test
every combination. Dell’s shareholders require them to grow their business, and these
configuration pages are the vehicle by which Dell generates billions of dollars in revenue.
They have to test it. The cost of errors (crashes, lost sales, mis-priced items, invalid
combinations of features) is too high. With this level of risk, the cost of not testing (the cost
of poor quality) is extremely high.
We Can’t Afford to Test It
I was able to attend a training session with Kent Beck a few years ago. I was also honored
to be able to enjoy a large steak and some cold beer with him that night after the training.
When asked how he responds to people who complain about the cost of quality, Kent told
us he has a very simple answer: “If testing costs more than not testing then don’t do it.”
I agree. There are few situations where the cost of quality exceeds the cost of poor
quality. These are situations where the needed infrastructure, test-development time, and
maintenance costs outweigh the expected cost of having a bug. (The “expected cost” is the
likelihood, as a percentage, of the bug manifesting in the field, multiplied by the cost of
dealing with the bug.)
The techniques described in this article are designed to reduce the cost of quality, to
make it even less likely that “not testing” is the best answer.
Just Test Everything, It’s Automated!
Two “solutions” that we have to consider are to test nothing and to test everything. We
would consider testing nothing if we can’t afford to test the software. When people don’t
appreciate the complexities of testing or the limitations of automated testing, they are
inclined to want to “test everything.” Testing everything is much easier said than done.
Have you ever been on a project where the manager said something like, “I demand
full testing coverage of the software. Our policy is zero tolerance. We won’t have bad
quality on my watch.”?
What we struggle with here is the lack of appreciation for what it means to have “full
coverage” or any other guarantee of a particular defect rate.
There are no absolutes in a sufficiently complex system–but that’s OK. There are
statistics, confidence levels, and risk-management plans. As engineers and software
developers, our brains are wired to deal with the expected, likely, and probable futures. We
have to help our less-technical brethren understand these concepts–or at least put them in
perspective.
We may get asked, “Why can’t we just test every combination of inputs to make sure
we get the right outputs? We have an automated test suite–just fill it up and run it!”
We need to resist the urge to respond by saying, “Monkeys with typewriters will have
completed the works of Shakespeare before we finish a single run of our test suite!”
Solving the Problem
There are a lot of applications that have millions or billions of combinations of inputs. They
have automated testing. They have solutions to this problem. We just finished discussing
how impractical it is to test exhaustively, so how do companies test their complex
applications?
In the rest of the article, we will explore the following approaches to solving the problem:
• Random sampling
• Pairwise testing
• N-wise testing
We will also explore the impact that changing the order of operations has on our testing
approach, and the methods for testing when the sequence matters.
Random Sampling
Early on in the software testing world, someone realized that by randomly checking
different combinations of inputs, they would eventually find the bugs. Imagine software
that has one million possible combinations of inputs (about half as complex as our previous
example). Each random sample would give us 0.0001% coverage of all possible user
sessions. If we run 1,000 tests, we would still only have 0.1% coverage of the application.
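A random-sampling harness over a configuration space takes only a few lines. In this sketch, the per-control choice counts reuse the entry-level laptop example, and `run_session` is a hypothetical stand-in for whatever hook drives your system under test:

```python
import random

# Choice counts per control, from the entry-level laptop example.
CONTROLS = (2, 2, 2, 2, 2, 3, 2, 2, 3, 4, 7)

def random_configuration(rng):
    # Pick one option index for each control, uniformly at random.
    return tuple(rng.randrange(n) for n in CONTROLS)

def run_random_tests(num_tests, run_session, seed=0):
    # Draw num_tests random configurations and collect the failures.
    rng = random.Random(seed)  # seeded, so any failures are reproducible
    failures = []
    for _ in range(num_tests):
        config = random_configuration(rng)
        if not run_session(config):
            failures.append(config)
    return failures

# Toy stand-in for the system under test: "passes" every session.
failures = run_random_tests(1_000, run_session=lambda config: True)
print(len(failures))  # → 0
```

Seeding the generator is the one design choice worth copying: a failing random test you cannot reproduce is nearly useless.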
Thankfully, statistics can help us make statements about our quality levels. But we
can’t use “coverage” as our key measurement of quality. We have to think about things a
little bit differently. What we want to do is express a level of confidence about a level of
quality. We need to determine the sample size, or number of tests, that we need to run to
make a statistical statement about the quality of the application.
First we define a quality goal–we want to assure that our software is 99% bug free.
That means that up to 1% of the user sessions would exhibit a bug. To be 100% confident
that this statement is true, we would need to test at least 99% of the possible user sessions,
or over 990,000 tests.
By adding a level of confidence to our analysis, we can use sampling (selecting a
subset of the whole, and extrapolating those results as being characteristic of the whole) to
describe the quality of our software. We will leverage the mathematical work that has been
developed to determine how to run polls.
We define our goal to be that we have 99% confidence that the software is 99% bug
free. The 99% level of confidence means that if we ran our sample repeatedly, 99% of the
time, the results would be within the margin of error. Since our goal is 99% bug free code,
we will test for 100% passing of tests, with a 1% margin of error.
How many samples do we need, if there are one million combinations, to identify the
level of quality with a 99% confidence, and a 1% margin of error? The math for this is
readily available, and calculators for determining sample size are online and free. Using
this polling approach, we find that the number of samples we require to determine the
quality level with a 1% error and 99% confidence is 16,369.
If we test 16,369 user sessions and find 100% success, we have established, with 99%
confidence, that our quality is at least at the 99% level. We claim only 99% quality–even
though every sampled test passed–because the 1% margin of error applies to our observed
100% pass rate.
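The sample-size arithmetic can be reproduced with Cochran's formula plus a finite-population correction. Using z = 2.58 for 99% confidence and worst-case p = 0.5 reproduces the 16,369 figure; more precise tables use z ≈ 2.576 and give a slightly smaller answer:

```python
import math

def sample_size(population, z=2.58, margin=0.01, p=0.5):
    # Cochran's sample-size formula for estimating a proportion:
    # n0 = z^2 * p * (1 - p) / margin^2, with worst-case variance p = 0.5.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite-population correction: the pool of possible user sessions
    # is large but bounded, which trims the required sample slightly.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# One million combinations, 99% confidence, 1% margin of error.
print(sample_size(1_000_000))  # → 16369
```

Note how weakly the answer depends on the population: a billion combinations needs only 16,641 samples, which is why the approach scales.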
This approach scales to very large numbers of combinations. Consider the following
table, where our goal is to establish 99% confidence in a 99% quality level. Each row in the
table represents an increasingly complex software application, where complexity is
defined as the number of unique combinations of possible inputs.