Statistical significance
Figure 15.6 Normal Distribution
Statistical Significance
Mathematically calculating statistical significance, or reliability, based on sample size is
a task that is too arduous and complex for most commercially driven software-
development projects. Fortunately, there is a commonsense approach that is both efficient
and accurate enough to identify the most significant concerns related to statistical
significance. Unless you have a good reason to use a mathematically rigorous calculation
for statistical significance, a commonsense approximation is generally sufficient. In
support of the commonsense approach described below, consider this excerpt from a
StatSoft, Inc. (http://www.statsoftinc.com) discussion on the topic:
"There is no way to avoid arbitrariness in the final decision as to what level of significance
will be treated as really 'significant.' That is, the selection of some level of significance,
up to which the results will be rejected as invalid, is arbitrary."
Typically, it is fairly easy to add iterations to performance tests to increase the total
number of measurements collected; the best way to ensure statistical significance is
simply to collect additional data if there is any doubt about whether or not the collected
data represents reality. Whenever possible, ensure that you obtain a sample size of at least
100 measurements from at least two independent tests.
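The sample-size guideline above can be checked before any further analysis. The following Python sketch is illustrative only: the function name `has_enough_data` and its parameters are hypothetical, and it assumes the 100 measurements are counted in total across all independent tests rather than per test.

```python
def has_enough_data(test_runs, min_measurements=100, min_tests=2):
    """Check the rule of thumb: at least 100 measurements from at least two tests.

    test_runs is a list of lists, one list of measurements per independent test.
    Assumption: the 100-measurement minimum applies to the combined total.
    """
    non_empty_runs = [run for run in test_runs if run]
    total_measurements = sum(len(run) for run in non_empty_runs)
    return len(non_empty_runs) >= min_tests and total_measurements >= min_measurements


if __name__ == "__main__":
    # Two hypothetical test runs with 60 and 50 response-time measurements.
    runs = [[1.2] * 60, [1.3] * 50]
    print(has_enough_data(runs))
```

If the check fails, the simplest remedy is the one the text recommends: add iterations and collect more data.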
No strict rule determines which results are statistically similar unless you use complex
equations, and those call for volumes of data that commercially driven software projects
rarely have the time or resources to collect. If you doubt the significance or reliability
of the data after evaluating two test executions whose results were expected to be similar,
the following approach is reasonable: compare results from at least five test executions
and apply the rules of thumb below to determine whether the test results are similar
enough to be considered reliable:
1. If more than 20 percent (or one out of five) of the test-execution results appear not to
   be similar to the others, something is generally wrong with the test environment, the
   application, or the test itself.
2. If a 90th percentile value for any test execution is greater than the maximum or less
   than the minimum value for any of the other test executions, that data set is probably
   not statistically similar.
3. If measurements from a test are noticeably higher or lower, when charted side-by-side,
   than the results of the other test executions, it is probably not statistically
   similar.
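
The first two rules of thumb lend themselves to a quick automated check. The following Python sketch is one possible reading of them, not a definitive implementation: the function names, the nearest-rank percentile method, and the sample data are assumptions made for illustration. Read literally, rule 2 would also flag every normal execution as soon as one execution is an outlier, because each normal 90th percentile falls outside the outlier's range. To avoid that, the sketch flags an execution only when its 90th percentile falls outside the range of more than half of the other executions; that majority threshold is also an assumption. Rule 3 relies on visual comparison and is not covered.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p90_outside_range(current, other):
    """Rule 2 for one pair: is the 90th percentile of `current` outside min..max of `other`?"""
    p90 = percentile(current, 90)
    return p90 > max(other) or p90 < min(other)


def dissimilar_executions(executions):
    """Return indexes of executions that disagree with more than half of the others.

    Assumption: the majority threshold keeps a single outlying execution
    from marking every other execution as dissimilar.
    """
    flagged = []
    for i, current in enumerate(executions):
        others = [run for j, run in enumerate(executions) if j != i]
        disagreements = sum(p90_outside_range(current, other) for other in others)
        if disagreements > len(others) / 2:
            flagged.append(i)
    return flagged


def results_look_reliable(executions, max_dissimilar_share=0.20):
    """Rule 1: suspect a problem if more than 20 percent of executions are dissimilar."""
    flagged = dissimilar_executions(executions)
    reliable = len(flagged) / len(executions) <= max_dissimilar_share
    return reliable, flagged


if __name__ == "__main__":
    # Hypothetical response times in seconds for five test executions.
    normal = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 1.9, 2.0]
    slow = [value + 2.0 for value in normal]
    reliable, flagged = results_look_reliable([normal, normal, normal, normal, slow])
    print("reliable:", reliable, "dissimilar executions:", flagged)
```

An execution flagged by this check is a prompt to chart the data and investigate the environment, application, or test, not a verdict on its own.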