
Measurement of the Extent of Testing

By: Cem Kaner

The Measurement Framework

To evaluate any proposed measure (or metric), I propose that we ask the following ten questions. For additional details, see Kaner (1999):

1. What is the purpose of this measure? Some measures are used on a confidential basis between friends. The goal is to spot trends, and perhaps to follow those up with additional investigation, coaching, or exploration of new techniques. A measure used only for this purpose can be useful and safe even if it is relatively weak and indirect. Higher standards must apply as the measurements become more public or as consequences (rewards or punishments) become attached to them.

2. What is the scope of this measure? Circumstances differ across groups, projects, managers, companies, countries. The wider the range of situations and people you want to cover with the method, the wider the range of issues that can invalidate or be impacted by the measure.

3. What attribute are we trying to measure? If you only have a fuzzy idea of what you are trying to measure, your measure will probably bear only a fuzzy relationship to whatever you had in mind.

4. What is the natural scale of the attribute? We might measure a table’s length in inches, but what units should we use for extent of testing?

5. What is the natural variability of the attribute? If you measure two supposedly identical tables, their lengths are probably slightly different. Similarly, your weight varies a little bit from day to day. What are the inherent sources of variation of "extent of testing"?

6. What instrument are we using to measure the attribute and what reading do we take from the instrument? You might measure length with a ruler and come up with a reading (a measurement) of 6 inches.

7. What is the natural scale of the instrument? Fenton & Pfleeger (1997) discuss this in detail.

8. What is the natural variability of the readings? This is normally studied in terms of "measurement error."

9. What is the relationship of the attribute to the instrument? What mechanism causes an increase in the reading as a function of an increase in the attribute? If we increase the attribute by 20%, what will show up in the next measurement? Will we see a 20% increase? Any increase?

10. What are the natural and foreseeable side effects of using this instrument? If we change our circumstances or behavior in order to improve the measured result, what impact are we going to have on the attribute? Will a 20% increase in our measurement imply a 20% improvement in the underlying attribute? Can we change our behavior in a way that optimizes the measured result but without improving the underlying attribute at all? What else will we affect when we do what we do to raise the measured result? Austin (1996) explores the dangers of measurement–the unintended side effects that result–in detail, across industries. Hoffman (2000) describes several specific side effects that he has seen during his consultations to software companies.

The phrase "directness of measurement" is often bandied about in the literature. If "directness of measurement" means anything, it must imply something about the monotonicity (preferably linearity?) of the relationship between attribute and instrument as considered in both #9 and #10. The phrases "hard measure" and "soft measure" are also bandied about a lot. Let me suggest that a measure is "hard" to the extent that we can describe and experimentally investigate a mechanism that underlies the relationship between an attribute and its measure and that accurately predicts the rate of change in one variable as a function of the other. Exploring the Question

When someone asks us for a report on the amount of testing that we’ve completed, what do they mean? The question is ambiguous. At an early point in a class he teaches, James Bach has students (experienced testers) do some testing of a simple program. At the end of the exercise, James asks them (on a scale from 0 to 100, where 100 means completely tested) how much testing they’ve done, or how much they would get done if they tested for another 8 hours. Different students give strikingly different answers based on essentially the same data. They also justify their answers quite differently. We found the same broad variation at STMR (a collection of experienced test managers and test management consultants). If someone asks you how much of the testing you (or your group) have done, and you say 30%, you might be basing your answer on any combination of the following dimensions (or on some others that I’ve missed):

* Product coverage: In principle, this compares the testing done with the total amount of testing of the product that could be done. But that can’t be right because the amount of testing possible is infinite (Kaner, 1997) and 30% of infinity is unachievable. Instead, this percentage is offered with reference to a model (which might be implicit or explicit) of what would be good enough testing. Of the population of tests needed for sufficient testing, 30% have been done. This measure is inherently dynamic. At the start of testing, you might decide that certain tests need not be done. During testing, you’ll probably discover that some of your assumptions were wrong. If someone finds bugs that you didn’t expect or you learn something new about market expectations, you might decide to add new tests. If you do, then what you used to call 30% coverage might now be better estimated as 20%. (For example, if 90 of 300 planned tests have been run, you are at 30%; add 150 new tests to the plan and those same 90 tests now cover only 20% of the 450 planned.)

* Agreement-based: Compare to what has been promised (or planned and committed to). Given a list of tasks that you’ve agreed to do (for example, see Kaner, 1996), what percentage have you finished? However, if you add tasks or deepen your work in some areas, your percentage "complete" will need revision.

* Project-history based: Compare this project with some others that are similar or that can bracket your expectations for this one.

* Risk-based: One form of risk-based estimation translates into agreement-based estimation. You do a risk analysis of the product, derive a set of tests (or of areas to be tested, with time boxes for the work), and then report what you’ve done against what you intended or agreed. (For example, see Amland, 1999.) A different estimator is less formal. When someone asks you how much you have gotten done, rather than saying 30%, you tell the person about the most critical areas of risk remaining in the product. Some areas are untested, others are undertested, others are awaiting difficult bug fixes. Assuming that nothing else bad comes up, you say, it will take about X weeks to deal with these, and then the product will probably be ready to release.

* Evaluation (of testing). This is another form of risk evaluation. How has the testing been? Are you finding surprises that indicate that your testing is weaker than expected? Should you reconsider your plans for the testing effort?

* Results: How much testing have you gotten done? For example, you might report bug statistics or statistics on tests completed.

* Results and obstacles: When someone asks you how much testing you’ve actually gotten done, what is their purpose for collecting that information? Perhaps it is predictive (she wants an estimate of when the product can be shipped) or perhaps it is diagnostic or disciplinary. If the hidden question is not just "How far are you?" but is also "Why have you only gotten this far?" then your report of 30% might mean "30% of what was humanly possible to do during this period" and it might be accompanied by a list of obstacles that wasted your time or blocked some testing tasks.

* Effort: You report that you’ve tested for 400 hours or for 4 calendar weeks. Again, depending on the purpose of the question, this might be the best thing to report. Perhaps the underlying question is whether you (or the testing group) are doing much work (or are able to focus your time on testing tasks). Or perhaps you work under a fixed budget or schedule, and you are reporting that you have consumed 30% of your resources. (If you are testing 10 projects at the same time, such measures can be particularly useful.)

Another Look at the Question

When we ask, "What is a good measure of the extent of testing?" what are we looking for? There are several related questions to consider:

* The measurement question: What are we trying to measure (what is the attribute) and how can we measure it?

* The communication question: How should we tell management how much testing is done and how much is left?

* The control question: Can we use measures of the extent of testing to help us control the testing project? Which measure(s) will be effective?

* The side effect question: What risks are inherent in measuring the extent of testing? In what ways might the process of measuring (and of managing on the basis of the measures) have results that we didn’t intend or anticipate?

* The organizational effects question: In what ways might the use of this type of measure affect the organization?

* The requirements question: How does this type of measure serve our testing requirements? How do our requirements push us to choose measures?

These questions overlap with the measurement framework analysis, but in discussions they seem to stimulate different answers and different ideas. Therefore, for now, I’m keeping them as an independently useful list.

The Rest of this Paper

The material that follows lists and organizes some of the ideas and examples that we (see the Acknowledgement, above) collected or developed over the last year. I have filtered out many suggestions but the lists that remain are still very broad. My intent is to show a range of thinking, to provide you with a collection of ideas from many sources, not (yet) to recommend that you use a particular measure or combination of them. The ordering and grouping of ideas here are for convenience. The material could be reorganized in several other ways, and probably in some better ways. Suggestions are welcome.

Let me stress that I do not endorse or recommend the listed measures. I think that some of them are likely to cause more harm than good. My objective is to list ideas that reasonable people in the field have found useful, even if reasonable people disagree over their value. If you are considering using one, or using a combination of them, I suggest that you evaluate the proposed measure using the measurement framework described above.

The sections below do not make a good enough connection with the software measurement literature. Several sources (for example, Zuse, 1997, and Ross Collard’s software testing course notes) provide additional measures or details. As I suggested at the start, this paper reports work in progress. The next report will provide details that this one lacks.

Coverage-Based Measures

"Coverage" is sometimes interpreted in terms of a specific measure, usually statement coverage or branch coverage. You achieve 100% statement coverage if you execute every statement (such as, every line) in the program. You achieve 100% branch coverage if you execute every statement and take every branch from one statement to another. (So, if there were four ways to reach a given statement, you would try all four.) I’ll call this type of coverage (coverage based on statements, branches, perhaps also logical conditions), code coverage.

Code coverage is a tidy measure–it is easy to count, unambiguous, and easy to explain. Unfortunately, this measure carries risks (Marick, 1999). It is easy (and not uncommon) to write a set of relatively weak tests that hit all of the statements and conditions but don’t necessarily hit them very hard. Additionally, code coverage is incomplete. For examples of incompleteness:

* Testing the lines of code that are there does not necessarily reveal the problems arising from the code that is not there. Marick (2000) summarizes data from cases in which 22% to 54% of the errors found were faults of omission.

* You can achieve 100% code coverage while missing errors that would have been found by a simple data flow analysis. (Richard Bender provides a clear and simple example of this in his excellent course on Requirements Based Testing.)

* Code coverage doesn’t address interrupts (there is an implicit branch from every statement in the program to the interrupt handler and back, but because it is implicit–wired into the processor rather than written into the program directly–it just doesn’t show up in a test of every visible line of code) or other multi-tasking issues.

* Table-driven programming is puzzling for code coverage because much of the work done is in the table entries, which are neither lines of code nor branches.

* User interface errors, device incompatibilities, and other interactions with the environment are likely to be under-considered in a test suite based on code coverage.

* Errors that take time to make themselves visible, such as wild pointers or stack corruption or memory leaks, might not yield visible failures until the same sub-path has been executed many times. Repeatedly hitting the same lines of code and the same branches doesn’t add any extent-of-testing credit to the code coverage measure (see the sketch below).
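The following minimal, made-up Python sketch illustrates that last point; the cache and rendering function are invented for the example.

    # Hypothetical example: a defect that code coverage cannot flag, because
    # it only becomes visible after the same lines run many times.
    _cache = []  # grows without bound; stands in for a memory leak

    def render_page(data):
        _cache.append(data)          # executed on every call, never released
        return "<page>%s</page>" % data

    # A single call to render_page() covers every line and branch, so the
    # coverage tool reports 100%. The failure (memory exhaustion) appears
    # only after many thousands of calls, which adds nothing to the
    # coverage number.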

Sometimes, the most important coverage measure has nothing to do with code coverage. For example, I worked on a product that had to print well. This was an essential benefit of the product. We selected 80 printers for detailed compatibility testing. We tracked the percentage of those 80 printers on which the program passed its tests. This was a coverage measure, but it had nothing to do with lines of code. We might pass through exactly the same code (in our program) when testing two printers but fail with only one of them (because of problems in their driver or firmware).

You have a coverage measure if you can imagine any kind of testing that can be done, and a way to calculate what percent of that kind of testing you’ve done. Similarly, you have a coverage measure if you can calculate what percentage of testing you’ve done of some aspect of the program or its environment. As with the 80 printers, you might artificially restrict the population of tests (we decided that testing 80 printers was good enough under the circumstances) and compute the percentage of that population that you have run.
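As a trivial, hypothetical sketch (the counts below are invented), a coverage measure of this kind is just the fraction of a deliberately restricted population that has been exercised:

    # Hypothetical printer-coverage report: percent of a restricted
    # population (80 printers) tested and passing.
    planned_printers = 80
    results = {"tested": 62, "passed": 57}   # made-up numbers

    tested_pct = 100.0 * results["tested"] / planned_printers
    passed_pct = 100.0 * results["passed"] / planned_printers
    print("Printer coverage: %.0f%% tested, %.0f%% passing"
          % (tested_pct, passed_pct))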

Here are some examples of coverage measures. Some of this list comes from Kaner (1995), which provides some additional context and discussion.

* Line coverage: Test every line of code (or statement coverage: test every statement).

* Coverage of non-dead code, code that can be reached by a user.

* Branch coverage: Test every line, and every branch on multi-branch lines.

* N-length sub-path coverage: Test every sub-path through the program of length N. For example, in a 10,000 line program, test every possible 10-line sequence of execution.

* Path coverage: Test every path through the program, from entry to exit. The number of paths is impossibly large to test. (See Myers, 1979, and Chapter 2 of Kaner, Falk, and Nguyen, 1993).

* Multicondition or predicate coverage: Force every logical operand to take every possible value. Two different conditions within the same test may result in the same branch, and so branch coverage would only require the testing of one of them. (See Myers, 1979, for multiple condition coverage, and Beizer, 1990).

* Trigger every assertion check in the program: Use impossible data if necessary.

* Loop coverage: "Detect bugs that exhibit themselves only when a loop is executed more than once." (Marick, 1995, p. 146)

* Every module, object, component, tool, subsystem, etc. This seems obvious until you realize that many programs rely on off-the-shelf components. The programming staff doesn’t have the source code to these components, so measuring line coverage is impossible. At a minimum (which is what is measured here), you need a list of all these components and test cases that exercise each one at least once.

* Fuzzy decision coverage. If the program makes heuristically-based or similarity-based decisions, and uses comparison rules or data sets that evolve over time, check every rule several times over the course of training.

* Relational coverage. "Checks whether the subsystem has been exercised in a way that tends to detect off-by-one errors" such as errors caused by using < instead of <=. (Marick, 1995, p. 147) A small illustration follows the list below. This coverage includes:

o Every boundary on every input variable. (Boundaries are classically described in numeric terms, but any change-point in a program can be a boundary. If the program works one way on one side of the change-point and differently on the other side, what does it matter whether the change-point is a number, a state variable, an amount of disk space or available memory, or a change in a document from one typeface to another, etc.? Kaner, Falk, & Nguyen, 1993, p. 399-401.)

o Every boundary on every output variable.

o Every boundary on every variable used in intermediate calculations.
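Here is a minimal, made-up Python illustration of the off-by-one point; the item limit and function name are invented.

    # Hypothetical off-by-one (relational) error, exposed by boundary tests.
    MAX_ITEMS = 10

    def can_add_item(current_count):
        # Intended rule: reject once the order already holds MAX_ITEMS items.
        # Off-by-one bug: '<=' should be '<'.
        return current_count <= MAX_ITEMS

    # Boundary tests on the input variable catch the error; a mid-range test
    # such as can_add_item(5) would not.
    assert can_add_item(9) is True
    assert can_add_item(10) is False, "the <= comparison lets an 11th item in"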

* Data coverage. At least one test case for each data item / variable / field in the program.

* Constraints among variables: Let X and Y be two variables in the program. X and Y constrain each other if the value of one restricts the values the other can take. For example, if X is a transaction date and Y is the transaction’s confirmation date, Y can’t occur before X.

* Each appearance of a variable. Suppose that you can enter a value for X on three different data entry screens, the value of X is displayed on another two screens, and it is printed in five reports. Change X at each data entry screen and check the effect everywhere else X appears.

* Every type of data sent to every object. A key characteristic of object-oriented programming is that each object can handle any type of data (integer, real, string, etc.) that you pass to it. So, pass every conceivable type of data to every object.

* Handling of every potential data conflict. For example, in an appointment calendaring program, what happens if the user tries to schedule two appointments at the same date and time?

* Handling of every error state. Put the program into the error state, check for effects on the stack, available memory, handling of keyboard input. Failure to handle user errors well is an important problem, partially because about 90% of industrial accidents are blamed on human error or risk-taking. (Dhillon, 1986, p. 153) Under the legal doctrine of foreseeable misuse (this doctrine is cleanly explained in Brown, 1991), the manufacturer is liable in negligence if it fails to protect the customer from the consequences of a reasonably foreseeable misuse of the product.

* Every complexity / maintainability metric against every module, object, subsystem, etc. There are many such measures. Jones (1991, p. 238-341) lists 20 of them. Beizer (1990) provides a sympathetic introduction to these measures. Glass (1992) and Grady & Caswell (1987) provide valuable perspective. People sometimes ask whether any of these statistics are grounded in a theory of measurement or have practical value. (For example, Kaner, Falk, & Nguyen, 1993, p. 47-48; also Glass, 1992, p. 303, "Software metrics to date have not produced any software quality results which are useful in practice.") However, it is clear that, in practice, some organizations find them an effective tool for highlighting code that needs further investigation and might need redesign. (For example, see Grady & Caswell, 1987, and Grady, 1992, p. 87-90.)

* Conformity of every module, subsystem, etc. against every corporate coding standard. Several companies believe that it is useful to measure characteristics of the code, such as total lines per module, ratio of lines of comments to lines of code, frequency of occurrence of certain types of statements, etc. A module that doesn’t fall within the "normal" range might be summarily rejected (bad idea) or re-examined to see if there’s a better way to design this part of the program.

* Table-driven code. The table is a list of addresses or pointers or names of modules. In a traditional CASE statement, the program branches to one of several places depending on the value of an expression. In the table-driven equivalent, the program would branch to the place specified in, say, location 23 of the table. The table is probably in a separate data file that can vary from day to day or from installation to installation. By modifying the table, you can radically change the control flow of the program without recompiling or relinking the code. Some programs drive a great deal of their control flow this way, using several tables. (A minimal dispatch-table sketch follows the examples below.) Coverage measures? Some examples:

o check that every expression selects the correct table element

o check that the program correctly jumps or calls through every table element

o check that every address or pointer that is available to be loaded into these tables is valid (no jumps to impossible places in memory, or to a routine whose starting address has changed)

o check the validity of every table that is loaded at any customer site.
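A small, hypothetical Python sketch of table-driven control flow (the handler names and message codes are invented) shows why table coverage is separate from line or branch coverage:

    # Hypothetical table-driven dispatch. The "table" maps message codes to
    # handlers; changing the table changes behavior without touching the
    # mainline code, so mainline code coverage says little about the table.
    def open_account(req):  return "opened"
    def close_account(req): return "closed"
    def audit(req):         return "audited"

    DISPATCH_TABLE = {   # in real systems this often lives in a data file
        "OPEN": open_account,
        "CLOSE": close_account,
        "AUDIT": audit,
    }

    def handle(code, req):
        handler = DISPATCH_TABLE.get(code)
        if handler is None:
            raise ValueError("no handler for %s" % code)
        return handler(req)

    # Table coverage asks: has every entry been exercised, and does every
    # selecting expression pick the entry the designers intended?
    for code in DISPATCH_TABLE:
        handle(code, req={})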

* Every interrupt. An interrupt is a special signal that causes the computer to stop the program in progress and branch to an interrupt handling routine. Later, the program restarts from where it was interrupted. Interrupts might be triggered by hardware events (I/O or signals from the clock that a specified interval has elapsed) or software (such as error traps). Generate every type of interrupt in every way possible to trigger that interrupt.

* Every interrupt at every task, module, object, or even every line. The interrupt handling routine might change state variables, load data, use or shut down a peripheral device, or affect memory in ways that could be visible to the rest of the program. The interrupt can happen at any time–between any two lines, or when any module is being executed. The program may fail if the interrupt is handled at a specific time. (Example: what if the program branches to handle an interrupt while it’s in the middle of writing to the disk drive?) The number of test cases here is huge, but that doesn’t mean you don’t have to think about this type of testing. This is path testing through the eyes of the processor (which asks, "What instruction do I execute next?" and doesn’t care whether the instruction comes from the mainline code or from an interrupt handler) rather than path testing through the eyes of the reader of the mainline code. Especially in programs that have global state variables, interrupts at unexpected times can lead to very odd results.

* Every anticipated or potential race. (Here as in many other areas, see Appendix 1 of Kaner, Falk, & Nguyen, 1993 for a list and discussion of several hundred types of bugs, including interrupt-related, race-condition-related, etc.) Imagine two events, A and B. Both will occur, but the program is designed under the assumption that A will always precede B. This sets up a race between A and B–if B ever precedes A, the program will probably fail. To achieve race coverage, you must identify every potential race condition and then find ways, using random data or systematic test case selection, to attempt to drive B to precede A in each case. Races can be subtle. Suppose that you can enter a value for a data item on two different data entry screens. User 1 begins to edit a record, through the first screen. In the process, the program locks the record in Table 1. User 2 opens the second screen, which calls up a record in a different table, Table 2. The program is written to automatically update the corresponding record in the Table 1 when User 2 finishes data entry. Now, suppose that User 2 finishes before User 1. Table 2 has been updated, but the attempt to synchronize Table 1 and Table 2 fails. What happens at the time of failure, or later if the corresponding records in Table 1 and 2 stay out of synch?

* Every time-slice setting. In some systems, you can control the grain of switching between tasks or processes. The size of the time quantum that you choose can make race bugs, time-outs, interrupt-related problems, and other time-related problems more or less likely. Of course, coverage is a difficult problem here because you aren’t just varying time-slice settings through every possible value. You also have to decide which tests to run under each setting. Given a planned set of test cases per setting, the coverage measure looks at the number of settings you’ve covered.

* Varied levels of background activity. In a multiprocessing system, tie up the processor with competing, irrelevant background tasks. Look for effects on races and interrupt handling. As with time-slice settings, your coverage analysis must specify:

o categories of levels of background activity (figure out something that makes sense) and

o all timing-sensitive testing opportunities (races, interrupts, etc.).

* Each processor type and speed. Which processor chips do you test under? What tests do you run under each processor? You are looking for:

o speed effects, like the ones you look for with background activity testing, and

o consequences of processors’ different memory management rules, and

o floating point operations, and

o any processor-version-dependent problems that you can learn about.

* Every opportunity for file / record / field locking.

* Every dependency on the locked (or unlocked) state of a file, record or field.

* Every opportunity for contention for devices or resources.

* Performance of every module / task / object: Test the performance of a module then retest it during the next cycle of testing. If the performance has changed significantly, you are either looking at the effect of a performance-significant redesign or at a symptom of a new bug.

* Free memory / available resources / available stack space at every line or on entry into and exit out of every module or object.

* Execute every line (branch, etc.) under the debug version of the operating system: This shows illegal or problematic calls to the operating system.

* Vary the location of every file. What happens if you install or move one of the program’s component, control, initialization or data files to a different directory or drive or to another computer on the network?

* Check the release disks for the presence of every file. It’s amazing how often a file vanishes. If you ship the product on different media, check for all files on all media.

* Every embedded string in the program. Use a utility to locate embedded strings. Then find a way to make the program display each string.

* Every menu option.

* Every dialogue or other data entry screen.

Operation of every function / feature / data handling operation under:

* Every program preference setting.

* Every character set, code page setting, or country code setting.

* The presence of every memory resident utility (inits, TSRs).

* Each operating system version.

* Each distinct level of multi-user operation.

* Each network type and version.

* Each level of available RAM.

* Each type / setting of virtual memory management.

Compatibility

* Compatibility with every previous version of the program.

* Ability to read every type of data available in every readable input file format. If a file format is subject to subtle variations (e.g. CGM) or has several sub-types (e.g. TIFF) or versions (e.g. dBASE), test each one.

* Write every type of data to every available output file format. Again, beware of subtle variations in file formats–if you’re writing a CGM file, full coverage would require you to test your program’s output’s readability by every one of the main programs that read CGM files.

* Every typeface supplied with the product. Check all characters in all sizes and styles. If your program adds typefaces to a collection of fonts that are available to several other programs, check compatibility with the other programs (nonstandard typefaces will crash some programs).

* Every type of typeface compatible with the program. For example, you might test the program with (many different) TrueType and Postscript typefaces, and fixed-sized bitmap fonts.

* Every piece of clip art in the product. Test each with this program. Test each with other programs that should be able to read this type of art.

* Every sound / animation provided with the product. Play them all under different device (e.g. sound) drivers / devices. Check compatibility with other programs that should be able to play this clip-content.

* Every supplied (or constructible) script to drive other machines / software (e.g. macros) / BBS’s and information services (communications scripts).

* All commands available in a supplied communications protocol.

* Recognized characteristics. For example, every speaker’s voice characteristics (for voice recognition software) or writer’s handwriting characteristics (handwriting recognition software) or every typeface (OCR software).

* Every type of keyboard and keyboard driver.

* Every type of pointing device and driver at every resolution level and ballistic setting.

* Every output feature with every sound card and associated drivers.

* Every output feature with every type of printer and associated drivers at every resolution level.

* Every output feature with every type of video card and associated drivers at every resolution level.

* Every output feature with every type of terminal and associated protocols.

* Every output feature with every type of video monitor and monitor-specific drivers at every resolution level.

* Every color shade displayed or printed to every color output device (video card / monitor / printer / etc.) and associated drivers at every resolution level. And check the conversion to grey scale or black and white.

* Every color shade readable or scannable from each type of color input device at every resolution level.

* Every possible feature interaction between video card type and resolution, pointing device type and resolution, printer type and resolution, and memory level. This may seem excessively complex, but I’ve seen crash bugs that occur only under the pairing of specific printer and video drivers at a high resolution setting. Other crashes required pairing of a specific mouse and printer driver, pairing of mouse and video driver, and a combination of mouse driver plus video driver plus ballistic setting.

* Every type of CD-ROM drive, connected to every type of port (serial / parallel / SCSI) and associated drivers.

* Every type of writable disk drive / port / associated driver. Don’t forget the fun you can have with removable drives or disks.

* Compatibility with every type of disk compression software. Check error handling for every type of disk error, such as full disk.

* Every voltage level from analog input devices.

* Every voltage level to analog output devices.

* Every type of modem and associated drivers.

* Every FAX command (send and receive operations) for every type of FAX card under every protocol and driver.

* Every type of connection of the computer to the telephone line (direct, via PBX, etc.; digital vs. analog connection and signalling); test every phone control command under every telephone control driver.

* Tolerance of every type of telephone line noise and regional variation (including variations that are out of spec) in telephone signaling (intensity, frequency, timing, other characteristics of ring / busy / etc. tones).

* Every variation in telephone dialing plans.

* Every possible keyboard combination. Sometimes you’ll find trap doors that the programmer used as hotkeys to call up debugging tools; these hotkeys may crash a debuggerless program. Other times, you’ll discover an Easter Egg (an undocumented, probably unauthorized, and possibly embarrassing feature). The broader coverage measure is every possible keyboard combination at every error message and every data entry point. You’ll often find different bugs when checking different keys in response to different error messages.

* Recovery from every potential type of equipment failure. Full coverage includes each type of equipment, each driver, and each error state. For example, test the program’s ability to recover from full disk errors on writable disks. Include floppies, hard drives, cartridge drives, optical drives, etc. Include the various connections to the drive, such as IDE, SCSI, MFM, parallel port, and serial connections, because these will probably involve different drivers.

Computation

* Function equivalence. For each mathematical function, check the output against a known good implementation of the function in a different program. Complete coverage involves equivalence testing of all testable functions across all possible input values.

* Zero handling. For each mathematical function, test when every input value, intermediate variable, or output variable is zero or near-zero. Look for severe rounding errors or divide-by-zero errors.

* Accuracy of every graph, across the full range of graphable values. Include values that force shifts in the scale.

Information Content

* Accuracy of every report. Look at the correctness of every value, the formatting of every page, and the correctness of the selection of records used in each report.

* Accuracy of every message.

* Accuracy of every screen.

* Accuracy of every word and illustration in the manual.

* Accuracy of every fact or statement in every data file provided with the product.

* Accuracy of every word and illustration in the on-line help.

* Every jump, search term, or other means of navigation through the on-line help.

Threats, Legal Risks

* Check for every type of virus / worm that could ship with the program.

* Every possible kind of security violation of the program, or of the system while using the program.

* Check for copyright permissions for every statement, picture, sound clip, or other creation provided with the program.

Usability tests of:

* Every feature / function of the program.

* Every part of the manual.

* Every error message.

* Every on-line help topic.

* Every graph or report provided by the program.

Localizability / localization tests:

* Every string. Check the program’s ability to display and use this string if it is modified by changing the length, using high or low ASCII characters, different capitalization rules, etc.

* Compatibility with text handling algorithms under other languages (sorting, spell checking, hyphenating, etc.).

* Every date, number and measure in the program.

* Hardware and drivers, operating system versions, and memory-resident programs that are popular in other countries.

* Every input format, import format, output format, or export format that would be commonly used in programs that are popular in other countries.

* Cross-cultural appraisal of the meaning and propriety of every string and graphic shipped with the program.

Verifications

* Verification of the program against every program requirement and published specification. (How well are we checking these requirements? Are we just hitting line items or getting to the essence of them?)

* Verify against every business objective associated with the program (these may or may not be listed in the requirements).

* Verification of the program against user scenarios. Use the program to do real tasks that are challenging and well-specified. For example, create key reports, pictures, page layouts, or other documents to match ones that have been featured by competitive programs as interesting output or applications.

* Verification against every regulation (IRS, SEC, FDA, etc.) that applies to the data or procedures of the program.

Coverage of specific types of tests:

* Automation coverage: Percent of code (lines, branches, paths, etc.) covered by the pool of automated tests for this product.

* Regression coverage: Percent of code that is covered by the standard suite of regression tests used with this program.

* Scenario (or soap opera) coverage: Percent of code that is covered by the set of scenario tests developed for this program.

* Coverage associated with the set of planned tests:

o Percent that the planned tests cover non-dead code (code that can be reached by a customer).

o Extent to which tests of the important lines, branches, paths, conditions (etc.–important in the eyes of the tester) cover the total population of lines, branches, etc.

o Extent to which the planned tests cover the important lines, branches, paths, conditions, etc.

o Extent to which the scenario tests cover the important lines, branches, paths, conditions, or other variables of interest.

Inspection coverage

1. How much code has been inspected.

2. How many of the requirements inspected/reviewed.

3. How many of the design documents inspected.

4. How many of the unit tests inspected.

5. How many of the black box tests inspected.

6. How many of the automated tests inspected.

7. How many test artifacts reviewed by developers.

User-focused testing

* How many of the types of users have been covered or simulated.

* How many of the possible use cases have been covered.

Agreement-Based Measures

Agreement-based measures start from an agreement about what testing will and will not be done. The agreement might be recorded in a formal test plan or in a memo cosigned by different stakeholders. Or it might simply be a work list that the test group has settled on. The list might be fully detailed, spelling out every test case. Or it might list more general areas of work and describe the depth of testing (and the time budget) for each one. The essence of agreement-based measures is progress against a plan.

All of these measures are proportions: amount of work done divided by the amount of work planned. We can convert these to effort reports easily enough just by reporting the amount of work done.
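As a minimal, hypothetical sketch (the areas and counts are invented), an agreement-based report is simply work done divided by work agreed to, per area and overall:

    # Hypothetical agreement-based progress report: executed vs. planned tests.
    planned  = {"install": 40, "reports": 120, "import/export": 60}
    executed = {"install": 40, "reports": 40,  "import/export": 15}

    for area, plan in planned.items():
        print("%s: %.0f%% of planned tests executed"
              % (area, 100.0 * executed[area] / plan))

    total_plan = sum(planned.values())
    total_done = sum(executed.values())
    print("overall: %.0f%% (%d of %d)"
          % (100.0 * total_done / total_plan, total_done, total_plan))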

* Percent of test requirements turned into tests.

* Percent of specification items turned into tests.

* Percent of planned tests developed.

* Percent of tests planned that have been executed.

* Percent of tests executed that have passed.

* Percent of tests passing, but that are old (haven’t been rerun recently).

* Percent (and list) of tests not yet executed.

* Percent (list, whatever) of deliverables given to the client vs. total expected, where deliverables include bugs, scripts, reports and documentation (e.g. the matrix).

* Test hours (or number of tests executed) per priority level (compared to the proportional levels of work intended for different priorities).

* Test hours (or number of tests executed) per feature area (compared to the time intended for different areas).

* Pattern of progress of testing (how much planned, how much done, how much left to do) across different test types across feature (or activity) areas.

* Timesheet report of how tester hours were spent this week, against a pre-existing set of categories of tasks (time consumption types).

* All test cases written / run.

* Compare current results / status to product release criteria.

* It’s Tuesday — time to ship.

Effort-Based Measures

These can be turned into agreement-based measures if we have an expected level of effort for comparison to the actual level completed.

* How much code must be tested.

* How many spec pages / manual pages are associated with the material to be tested.

* How many changes have been specified for the new version of an old product? How many changes are not specified but are being made?

* How different is the new test plan from the old test plan.

* Number of hours spent testing.

* Amount of overtime per tester.

* Compare the amount of overtime worked by testers and developers.

* Number of test cycles completed.

* Number of tests performed in each cycle.

* Duration of each test cycle.

* Number of tests executed on production hardware.

* Number of test sessions (or some other unit of exploratory testing effort). See Bach (2000) for a discussion of exploratory testing sessions.

* Number of functional areas in which we’re doing exploratory testing on a weekly basis.

* Average cost of finding and/or fixing a bug.

* Time spent maintaining automated tests {total, to date, per test cycle}.

* Time to add new tests. How long it takes to add manual or automated tests to your population of tests. Include the time it takes to document the test, to file it in your source control system (if you have one), to plan the test, including checking whether it is redundant with current tests, etc.

* Time spent debugging the tests.

* Time spent checking failures to see if you are reporting duplicates.

* Time spent by testers writing defect reports.

Project-History Based Measures

If you’ve run several projects, you can compare today’s project with historical results. When someone claims that the project has reached "beta", you might compare current status with the state of the other products when they reached the beta milestone. For example, some groups add features until the last minute. At those companies, "all code complete" is not a criterion for alpha or beta milestones. But you might discover that previous projects were 90% code complete 10 weeks before their ship date. Many of the measures flagged as based on risk, effort, or agreement can be used as project-history based if you have measured those variables previously, under circumstances that allow for meaningful comparison, and have the data available for comparison.

* How much time spent, compared to time expected from prior projects

* Counts of defects discovered, fixed, open compared to previous project patterns. For example, number (or proportion) of defects open at alpha or beta milestone. These reports are widely used and widely abused. Significant side effects (impacts on the test planning, bug hunting, bug reporting, and bug triage processes) seem to be common, even though they are not often discussed in print. For a recent discussion that focuses on bug counts, see Hoffman (2000). For an excellent but more general discussion, see Austin (1996).

* Arrival rates of bugs (how many bugs opened) per week, compared to rates on previous projects. I have the same concerns about the curves that show bugs per week as I have about the single-point (e.g. beta milestone) comparisons.

* How many (or percent of) modules or components coded compared to previous releases at this point in the project.

Risk-Based Measures

As I am using the term, the risk measures focus on risk remaining in the project. Imagine this question: If we were to release the product today, without further testing, what problems would we anticipate? Similarly, if we were to release the product on its intended release date, what problems would we anticipate?

* Of the test cases planned using hazard or risk analysis, how many have been written, run, or passed?

* In a group that creates new tests throughout the testing project, have we reached a point in which we don’t have any more questions? This might tell us that there is a problem with the test group, or it might tell us that we are close to being finished. (There might be plenty of errors left in the product, but if we’re out of ideas, we’re not going to find those errors.)

* How many bugs have been found in third party software that you include with or run with your products?

* How many failures found per unit (per test case, per thousand pages printed, per thousand lines of code, per report, per transaction, per hour, etc.)

* Mean time (transactions, etc.) between failures?

* Length of time before system failure (or other detectable failures) in a stochastic test or a load or stress test.

* Rate at which you are finding new bugs.

* Rate at which previously fixed bugs are reappearing.

* Number of complaints from beta testers.

* Number of bugs found in the first manufactured copy of the product. If you release a product to manufacturing, you should test the first few manufactured copies, before sending the rest to customers. Manufacturing errors happen. The disks (books, collaterals) that you sent out for duplication may or may not match what you get back.

* Number of bugs found in a random selection of manufactured copies. In most products, all of the manufactured copies are identical. Some products, however, vary from copy to copy. Even if the only difference across products is in the serial number, problems might show up on one copy that don’t show up on others. The more extensively customized each copy is, the more relevant is this type of testing.

* After the product is released, we obtain data from the field that tell us whether we should reopen the product, such as:

o Angry letters to the CEO.

o Published criticism of the product.

o Rate of customer complaints for technical support.

o Number of "surprise" defects (problems found by customers that were not found pre-release by testers).

* Number of "surprisingly serious" defects (deferred problems that have generated more calls or angrier calls than anticipated.)

* Perceived risk. For example, the group might provide a rough estimate of risk for each area of the product on a 5-point scale (from low risk to very high risk), in order to think through the pattern of work left to be done in a short time available.

* Desperation level of triage team. The triage team is the group that decides whether or not to fix each defect. Toward the end of the project (especially if this "end" is a long time after the initially scheduled ship date), this team might be pushed very hard to stop fixing and just ship it. The more desperate the team is to ship the product soon, the higher the risk that the product will ship in bad shape.

* Number of defects resolved without being fixed.

* Frequency of tester outbursts. This measure is informal but can be extremely important. As a poor quality product marches to release, many testers or test groups feel increasing pressure to find defects that will block the shipment. Frustration mounts as perfectly "good" bugs are deferred. Indication of high stress among the testers is a signal of danger in the product or the review process.

* Level of confusion or dissent among the development team as a whole or among sub-groups. (You measure this by talking with people, asking them how they feel about the product and the project, and its projected release date.)

Obstacle Reports

Obstacles are risks to the testing project (or to the development project as a whole). These are the things that make it hard to do the testing or fixing well. This is not intended as a list of everything that can go wrong on a project. Instead, it is a list of common problems that make the testing less efficient.

* Turnover of development staff (programmers, testers, writers, etc.)

* Number of marketing VPs per release. (Less flippantly, what is the rate of turnover among executives who influence the design of the product?)

* Layoffs of testing (or other development) staff.

* Number of organizational changes over the life of the project.

* Number of people who influence product release, and level of consensus about the product among them.

* Appropriateness of the tester to programmer ratio. (Note: I’ve seen successful ratios ranging from 1:5 through 5:1. It depends on the balance of work split between the groups and the extent to which the programmers are able to get most of the code right the first time.)

* Number of testers who speak English (if you do your work in English).

* Number of testers who speak the same language as the programmers.

* How many bugs are found by isolated (such as remote or offsite) testers compared to testers who are co-located with or in good communication with the programmers?

* Number of tests blocked by defects.

* List of defects blocking tests.

* Defect fix percentage (if low).

* Slow defect fix rate compared to find rate.

* Average age of open bugs.

* Average time from initial report of defect until fix verification.

* Number of times a bug is reopened (if high).

* Number of promotions and demotions of defect priority.

* Number of failed attempts to get builds to pass smoke tests.

* Number of changes to (specifications or) requirements during testing.

* Percentage of tests changed by modified requirements.

* Time lost to development issues (such as lack of specifications or features not yet coded).

* Time required to test a typical emergency fix.

* Percentage of time spent on testing emergency fixes.

* Percentage of time spent providing technical support for pre-release users (such as beta testers).

* How many billable hours per week (for a consulting firm) or the equivalent task-focused hours (for in-house work) are required of the testers and how does this influence or interfere with their work?

* Ability of test environment team (or information systems support team) to build the system to be tested as specified by the programming team.

* Time lost to environment issues (such as difficulties obtaining test equipment or configuring test systems, or defects in the operating system, device drivers, file system, or other 3rd-party system software).

Evaluation-of-Testing Based Measures

These help you assess the testing effort. How hard are the testers working, how well are they doing it, and what could they improve? A high number for one of these measures might be good for one group and bad for another. For example, in a company that relies on an outside test lab to design a specialized set of tests for a very technical area, we’d expect a high bug find rate from third party test cases. In a company that thinks of its testing as more self-contained, a high rate from third parties is a warning flag.

* Number of crises involving test tools.

* Number of defects related to test environment.

* Number of bugs found by boundary or negative tests vs. feature tests.

* Number of faults found in localized versions compared to base code.

* Number of faults found in inspected vs. uninspected areas.

* Number of defects found by inspection of the testing artifacts.

* Number of defects discovered during test design (rather than in later testing).

* Number of defects found by 3rd party test cases.

* Number of defects found by 3rd party test group vs. your group.

* Number of defects found by developers.

* Number of defects found by developers after unit testing.

* Backlog indicators. For example, how many unverified bug fixes are there (perhaps as a ratio of new bug fixes submitted) or what is the mean time to verify a fix?

* How many phone calls were generated during beta.

* Number of surprise bugs (bugs you didn’t know about) reported by beta testers. (The content of these calls indicates holes in testing or, possibly, weaknesses in the risk analysis that allowed a particular bug to be deferred.)

* After the product is released:

o Number of "surprise" defects (problems found by customers that were not found pre-release by testers).

o Number of "surprisingly serious" defects (deferred problems that have generated more calls or angrier calls than anticipated).

o Angry letters to the CEO. (Did testers mis-estimate severity?)

o Published criticism of the product. (Did testers mis-estimate visibility?)

o Rate of customer complaints for technical support. (Did testers mis-estimate customer impact?)

* Number of direct vs. indirect bug finds (were you looking for that bug or did you stumble on it as a side effect of some other test?)

* Number of irreproducible bugs (perhaps as a percentage of total bugs found).

* Number of noise bugs (issues that did not reflect software errors).

* Number of duplicate bugs being reported.

* Rumors of off-the-record defects (defects discovered but not formally reported or tracked. The discoveries might be by programmers or by testers who are choosing not to enter bugs into the tracking system–a common problem in companies that pay special attention to bug counts.)

* Number of test cases that can be automated.

* Cyclomatic complexity of automated test scripts.

* Size (e.g. lines of code) of test automation code. Appraisal of the code’s maintainability and modularity.

* Existence of requirements analyses, requirements documents, specifications, test plans and other software engineering processes and artifacts generated for the software test automation effort.

* Percentage of time spent by testers writing defect reports.

* Percentage of time spent by testers searching for bugs, writing bug reports, or doing focused test planning on this project. (Sometimes, you’re getting a lot less testing on a project than you think. Your staff may be so overcommitted that they have almost no time to focus and make real progress on anything. Other times, the deallocation of resources is intentional. For example, one company has an aggressive quality control group who publish bug curves and comparisons across projects. To deal with the political hassles posed by this external group, the software development team sends the testers to the movies whenever the open bug counts get too high or the fix rates get too low.)

* Percentage of bugs found by planned vs. exploratory vs. unplanned methods, compared to the percentage that you intended.

* What test techniques were applied compared to the population of techniques that you think are applicable compared to the techniques the testers actually know.

* Comparative effectiveness: what problems were found by one method of testing or one source vs. another.

* Comparative effectiveness over time: compare the effectiveness of different test methods over the life cycle of the product. We might expect simple function tests to yield more bugs early in testing and complex scenario tests to be more effective later.

* Complexity of causes over time: is there a trend that bugs found later have more complex causes (for example, require a more complex set of conditions) than bugs found earlier? Are there particular patterns of causes that should be tested for earlier?

* Differential bug rates across predictors: defect rates tracked across different predictors. For example, we might predict that applications with high McCabe-complexity numbers would have more bugs. Or we might predict that applications that were heavily changed or that were rated by programmers as more fragile, etc., would have more bugs.

* Delta between the planned and actual test effort.

* Ability of testers to articulate the test strategy.

* Ability of the programmers and other developers to articulate the test strategy.

* Approval of the test effort by an experienced, respected tester.

* Estimate the prospective effectiveness of the planned tests. (How good do you think these are?)

* Estimate the effectiveness of the tests run to date (subjective evaluation by testers, other developers).

* Estimate confidence that the right quality product will ship on time (subjective evaluation by testers, other developers).

Results Reports

So what has the test group accomplished? Most of the bug count metrics belong here.

* Build acceptance test criteria and results.

* Benchmark performance results.

* Number of builds submitted for testing.

* Number of builds tested.

* Number of revisions (drafts) of the test plan.

* Number of test cases created / coded / documented.

* Number of reusable test cases created.

* Percent or number of reusable test artifacts in this project.

* Number of times reusable artifacts are reused.

* Number of defects in test cases.

* List of failing (defective) tests.

* Number of promotions and demotions of defect priority.

* Tester to developer ratio.

* Defects per 1000 lines of code.

* Total bugs found / fixed / deferred / rejected.

* Find and close rate by severity.

* Total number of open bugs.

* Number of open bugs by {feature, subsystem, build, database, supplier}.

* Number of open bugs assigned to {programmer, tester, someone else, total}.

* Number of open bugs by platform.

* Defect find rate per week (open defects/week).

* Cumulative defects each week (total defects found/time).

* Defect find rate normalized by the amount of time spent testing.

* Number of bugs found from regression testing.

* Number of bugs fixed or failed from regression testing.

* Number of bugs found in beta/real customer testing over time.

* Percent of tests per build {passing, failing, not yet analyzed}.

* Number of defects found in requirements / specifications / other development planning documents.

* Number of defects found in unit testing.

* Number of bugs found after a ship candidate has been declared.

* Number of defects found in LAN vs. modem testing. (Similarly for other configuration variations.)

* Percentage of bugs found per project phase.

* Number or percentage of bugs found by programmers.

* Number of bugs found by programmers after unit testing.

* Number of bugs found in first article (first item manufactured by manufacturing).

* Proportion of unreviewed bugs that become critical after review. (Some companies don’t have a review process. Others have each bug report reviewed by (depending on the company) another tester, a programmer, or the triage team as a whole.)

Progress Reporting Examples

The following examples are based on ideas presented at LAWST and STMR, but I’ve made some changes to simplify the description, or because additional ideas came up at the meeting (or since) that seem to me to extend the report in useful ways. Occasionally, the difference between my version and the original occurs because (oops) I misunderstood the original presentation.

For more, go to: http://www.kaner.com/pnsqc.html

