Test Smarter, Not Harder
By: Scott Sehlhorst
Introduction: Complexity Leads to Futility
Imagine we are developing a web page for customizing a laptop purchase.
If you’ve never configured a laptop online before, take a look at Dell’s “customize it”
page for an entry level laptop. The web page presents eleven questions to the user that have
from two to seven responses each. The user has to choose from two options in the first
control, two in the second, and so on, up to seven possible choices for the last control.
When we look at all of the controls combined, the user has to make
(2,2,2,2,2,3,2,2,3,4,7) choices. This is a simple configuration problem. The number of
possible laptop configurations that could be requested by the user is the product of all of
the choices. In this very simple page, there are 32,256 possibilities. At the time of this
writing, the page for customizing Dell’s high-end laptop has a not dissimilar set of controls,
with more choices in each control: (3,3,3,2,4,2,4,2,2,3,7,4,4). The user of this page can
request any of 2,322,432 different laptop configurations! If Dell were to add one more
control presenting five different choices, there would be over ten million possible configurations.
Creating a test suite that tries all 2.3 million combinations for a high-end laptop
could be automated, but even if every test took one tenth of a second to run, the suite would
take over 64 hours! Dell changes their product offerings in less time than that.
Then again, if we use a server farm to distribute the test suite across ten machines, we
could run it in about six and a half hours. Ignoring the fact that we would be running this
type of test for each customization page Dell has, six and a half hours is not unreasonable.
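The counting above is easy to check in a couple of lines of Python. The tuples of per-control choice counts come straight from the pages described above, and 0.1 seconds per test is the stated assumption:

```python
from math import prod

# Choices per control on the two Dell customization pages described above.
entry_level = (2, 2, 2, 2, 2, 3, 2, 2, 3, 4, 7)
high_end = (3, 3, 3, 2, 4, 2, 4, 2, 2, 3, 7, 4, 4)

print(prod(entry_level))  # → 32256 possible configurations
print(prod(high_end))     # → 2322432 possible configurations

# Exhaustively testing the high-end page at 0.1 s per test:
hours = prod(high_end) * 0.1 / 3600
print(round(hours, 1))    # → 64.5 hours on one machine
```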
Validating the two million results is where the really big problem is waiting for us.
We can’t rely on people to manually validate all of the outputs–it is just too expensive. We
could write another program, which inspects those outputs and evaluates them using a
rules-based system (“If the user selects 1GB of RAM, then the configuration must include
1GB of RAM” and “The price for the final system must be adjusted by the price-impact of
1GB of RAM relative to the base system price for this model.”).
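As a sketch of that idea (the rule set, field names, and prices here are invented for illustration; they are not Dell's actual data or any particular tool's API), a rules-based output checker might look like:

```python
# Hypothetical rules-based output validator: each rule inspects one
# requested configuration and the page's computed result, and returns
# a violation message or None. All names and prices are illustrative.

BASE_PRICE = 899.00          # assumed base price for this model
RAM_UPGRADE_PRICE = 50.00    # assumed price impact of the 1GB RAM option

def rule_ram_matches(request, result):
    # "If the user selects 1GB of RAM, the configuration must include 1GB."
    if request["ram_gb"] != result["ram_gb"]:
        return f"RAM mismatch: asked {request['ram_gb']}GB, got {result['ram_gb']}GB"
    return None

def rule_price_adjusted(request, result):
    # "The final price must be the base price adjusted by the option's impact."
    expected = BASE_PRICE + (RAM_UPGRADE_PRICE if request["ram_gb"] == 1 else 0)
    if abs(result["price"] - expected) > 0.005:
        return f"Price mismatch: expected {expected:.2f}, got {result['price']:.2f}"
    return None

RULES = [rule_ram_matches, rule_price_adjusted]

def validate(request, result):
    # Run every rule; return the list of violations (empty list == pass).
    return [v for rule in RULES if (v := rule(request, result)) is not None]

# Example: a correctly priced 1GB configuration passes all rules.
print(validate({"ram_gb": 1}, {"ram_gb": 1, "price": 949.00}))  # → []
```

The point of the sketch is the maintenance cost the article describes: every repositioning or repricing of the product line means editing the rule functions, not just the test inputs.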
There are some good rules-based validation tools out there, but they are either
custom software, or so general as to require a large investment to make them applicable to
a particular customer. With a rules-based inspection system, we have the cost of
maintaining the rules. The validation rules are going to have to be updated regularly, as
Dell changes the way they position, configure, and price their laptops.
Since we aren’t Dell, we don’t have the scale (billions of dollars of revenue) to justify
this level of investment. The bottom line for us is that we can’t afford to exhaustively test
every combination. Dell’s shareholders require them to grow their business, and these
configuration pages are the vehicle by which Dell generates billions of dollars in revenue.
They have to test it. The cost of errors (crashes, lost sales, mis-priced items, invalid
combinations of features) is too high. With this level of risk, the cost of not testing (the cost
of poor quality) is extremely high.
We Can’t Afford to Test It
I was able to attend a training session with Kent Beck a few years ago. I was also honored
to be able to enjoy a large steak and some cold beer with him that night after the training.
When asked how he responds to people who complain about the cost of quality, Kent told
us he has a very simple answer: “If testing costs more than not testing then don’t do it.”
I agree. There are few situations where the cost of quality exceeds the cost of poor
quality. These are situations where the needed infrastructure, test-development time, and
maintenance costs outweigh the expected cost of having a bug. (The “expected cost” is the
likelihood, as a percentage, of the bug manifesting in the field, multiplied by the cost of
dealing with the bug.)
The techniques described in this article are designed to reduce the cost of quality, to
make it even less likely that “not testing” is the best answer.
Just Test Everything, It’s Automated!
Two “solutions” that we have to consider are to test nothing and to test everything. We
would consider testing nothing if we can’t afford to test the software. When people don’t
appreciate the complexities of testing or the limitations of automated testing, they are
inclined to want to “test everything.” Testing everything is much easier said than done.
Have you ever been on a project where the manager said something like, “I demand
full testing coverage of the software. Our policy is zero tolerance. We won’t have bad
quality on my watch.”?
What we struggle with here is the lack of appreciation for what it means to have “full
coverage” or any other guarantee of a particular defect rate.
There are no absolutes in a sufficiently complex system–but that’s OK. There are
statistics, confidence levels, and risk-management plans. As engineers and software
developers, our brains are wired to deal with the expected, likely, and probable futures. We
have to help our less-technical brethren understand these concepts–or at least put them in
perspective.
We may get asked, “Why can’t we just test every combination of inputs to make sure
we get the right outputs? We have an automated test suite–just fill it up and run it!”
We need to resist the urge to respond by saying, “Monkeys with typewriters will have
completed the works of Shakespeare before we finish a single run of our test suite!”
Solving the Problem
There are a lot of applications that have millions or billions of combinations of inputs. They
have automated testing. They have solutions to this problem. We just finished discussing
how impractical it is to test exhaustively, so how do companies test their complex
applications?
In the rest of the article, we will explore the following approaches to solving the problem:
• Random sampling
• Pairwise testing
• N-wise testing
We will also explore the impact that changing the order of operations has on our testing
approach, and the methods for testing when the sequence matters.
Random Sampling
Early on in the software testing world, someone realized that by randomly checking
different combinations of inputs, they would eventually find the bugs. Imagine software
that has one million possible combinations of inputs (about half as complex as our previous
example). Each random sample would give us 0.0001% coverage of all possible user
sessions. If we run 1,000 tests, we would still only have 0.1% coverage of the application.
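A random-sampling harness over a configuration space takes only a few lines. In this sketch, the per-control choice counts reuse the entry-level laptop example, and `run_session` is a hypothetical stand-in for whatever hook drives your system under test:

```python
import random

# Choice counts per control, from the entry-level laptop example.
CONTROLS = (2, 2, 2, 2, 2, 3, 2, 2, 3, 4, 7)

def random_configuration(rng):
    # Pick one option index for each control, uniformly at random.
    return tuple(rng.randrange(n) for n in CONTROLS)

def run_random_tests(num_tests, run_session, seed=0):
    # Draw num_tests random configurations and collect the failures.
    rng = random.Random(seed)  # seeded, so any failures are reproducible
    failures = []
    for _ in range(num_tests):
        config = random_configuration(rng)
        if not run_session(config):
            failures.append(config)
    return failures

# Toy stand-in for the system under test: "passes" every session.
failures = run_random_tests(1_000, run_session=lambda config: True)
print(len(failures))  # → 0
```

Seeding the generator is the one design choice worth copying: a failing random test you cannot reproduce is nearly useless.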
Thankfully, statistics can help us make statements about our quality levels. But we
can’t use “coverage” as our key measurement of quality. We have to think about things a
little bit differently. What we want to do is express a level of confidence about a level of
quality. We need to determine the sample size, or number of tests, that we need to run to
make a statistical statement about the quality of the application.
First we define a quality goal–we want to assure that our software is 99% bug free.
That means that up to 1% of the user sessions would exhibit a bug. To be 100% confident
that this statement is true, we would need to test at least 99% of the possible user sessions,
or over 990,000 tests.
By adding a level of confidence to our analysis, we can use sampling (selecting a
subset of the whole, and extrapolating those results as being characteristic of the whole) to
describe the quality of our software. We will leverage the mathematical work that has been
developed to determine how to run polls.
We define our goal to be that we have 99% confidence that the software is 99% bug
free. The 99% level of confidence means that if we ran our sample repeatedly, 99% of the
time, the results would be within the margin of error. Since our goal is 99% bug free code,
we will test for 100% passing of tests, with a 1% margin of error.
How many samples do we need, if there are one million combinations, to identify the
level of quality with a 99% confidence, and a 1% margin of error? The math for this is
readily available, and calculators for determining sample size are online and free. Using
this polling approach, we find that the number of samples we require to determine the
quality level with a 1% error and 99% confidence is 16,369.
If we test 16,369 user sessions and find 100% success, we have established, with 99%
confidence, that our quality is at least at the 99% level. We claim only 99% quality–even
though every sampled test passed–because the 1% margin of error applies to our observed
100% pass rate.
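The sample-size arithmetic can be reproduced with Cochran's formula plus a finite-population correction. Using z = 2.58 for 99% confidence and worst-case p = 0.5 reproduces the 16,369 figure; more precise tables use z ≈ 2.576 and give a slightly smaller answer:

```python
import math

def sample_size(population, z=2.58, margin=0.01, p=0.5):
    # Cochran's sample-size formula for estimating a proportion:
    # n0 = z^2 * p * (1 - p) / margin^2, with worst-case variance p = 0.5.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite-population correction: the pool of possible user sessions
    # is large but bounded, which trims the required sample slightly.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# One million combinations, 99% confidence, 1% margin of error.
print(sample_size(1_000_000))  # → 16369
```

Note how weakly the answer depends on the population: a billion combinations needs only 16,641 samples, which is why the approach scales.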
This approach scales to very large numbers of combinations. Consider the following
table, where our goal is to establish 99% confidence in a 99% quality level. Each row in the
table represents an increasingly complex software application, where complexity is
defined as the number of unique combinations of possible inputs.