Challenges of Testing with Production Data
By: Rex Black
A number of RBCS clients find that obtaining good test data poses many challenges. For any large-scale system, testers usually cannot create sufficient and sufficiently diverse test data by hand, that is, one record at a time. While data-generation tools exist and can create almost unlimited amounts of data, the data so generated often do not exhibit the same diversity and distribution of values as production data. For these reasons, many of our clients consider production data ideal for testing, particularly for systems where large sets of records have accumulated over years of use, under various revisions of the current systems and of the systems that preceded them.
However, to use production data, we must preserve privacy. Production data often contain personal data about individuals, which must be handled securely. Requiring secure data handling during testing activities, though, imposes undesirable inefficiencies and constraints. Therefore, many organizations want to anonymize (scramble) the production data prior to using it for testing.
Anonymization leads to the next set of challenges, though. The process must be secure, in the sense that it cannot be reversed should the data fall into the wrong hands. For example, simply substituting the next digit or the next letter in sequence would be obvious to anyone (it doesn’t take long to deduce that “Kpio Cspxo” is actually “John Brown”), which makes the de-anonymization process trivial.
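To make the weakness concrete, here is a minimal Python sketch of the next-letter, next-digit substitution described above. The function name `shift_text` is an illustrative assumption; the point is only that the same routine, run with the opposite offset, undoes the scramble.

```python
def shift_text(text: str, offset: int) -> str:
    """Shift ASCII letters and digits by `offset` positions, wrapping around."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + offset) % 26 + base))
        elif ch.isascii() and ch.isdigit():
            out.append(chr((ord(ch) - ord("0") + offset) % 10 + ord("0")))
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)


scrambled = shift_text("John Brown", 1)   # "Kpio Cspxo"
recovered = shift_text(scrambled, -1)     # "John Brown"
print(scrambled, "->", recovered)
```

Anyone who guesses the offset, or simply tries all 25 letter offsets, recovers the original values.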
In addition, Kpio Cspxo and other similar nonsense scrambles make poor test data, because they are not realistic. The anonymization process must preserve the usefulness of the data for localization and functional testing, which often involves preserving its meaning and meaningfulness. For example, if the anonymization process changes “John Brown” to “Lester Camden,” we still have a male name, entirely usable for functional testing. If it changes “John Brown” to “Charlotte Dostoyevsky,” though, it has imposed a gender change on John, and if his logical record includes a gender field, we have now damaged the data.
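One possible way to keep a replacement name consistent with the record's gender field is to draw it from a gender-specific pool of realistic names. The following Python sketch is illustrative only: the name pools, the "M"/"F" codes, the key, and the function `pseudonym_first_name` are all assumptions made for the example, not features of any particular tool.

```python
import hashlib
import hmac

# Hypothetical replacement pools; a real pool would be far larger.
NAME_POOLS = {
    "M": ["Lester", "Victor", "Howard", "Marcus"],
    "F": ["Charlotte", "Irene", "Martha", "Sylvia"],
}


def pseudonym_first_name(first_name: str, gender: str, secret_key: bytes) -> str:
    """Pick a replacement name from the pool that matches the gender field."""
    pool = NAME_POOLS[gender]  # raises KeyError for an unexpected gender code
    # A keyed hash makes the choice repeatable for the same input and key,
    # without embedding an obvious letter-by-letter pattern in the output.
    digest = hmac.new(secret_key, first_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


key = b"example-key-kept-outside-the-test-environment"  # assumption
print(pseudonym_first_name("John", "M", key))  # always a name from the "M" pool
```

With a pool this small, different production names will sometimes receive the same replacement, so this sketch addresses realism and gender consistency only; it does not by itself guarantee that record counts are preserved.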
Preserving the meaning of the data has another important implication. It must be possible to construct queries, views, and joins of these anonymized data that correspond directly to queries, views, and joins of the production data. For example, if a query for all records with the first name “John” and the last name “Brown” returned 20 records against production data, a query for all records with the first name “Lester” and the last name “Camden” must return 20 records against anonymized data. Failure to honor this corollary of the meaning and meaningfulness requirement can result in major problems when using the data for some types of functional tests, as well as any kind of performance, reliability, or load test.
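The count-preservation requirement implies that the substitution must be one-to-one: each distinct production value gets exactly one replacement, and no two production values share a replacement. The Python sketch below shows one way this might be done with an in-memory mapping table. The sample records, the replacement pools, and the helper `build_one_to_one_map` are assumptions for illustration; gender handling is left out for brevity, and the mapping table itself would need to be protected or destroyed, since it reverses the anonymization.

```python
import random
from collections import Counter


def build_one_to_one_map(values, replacement_pool):
    """Assign each distinct value its own unique replacement from the pool."""
    distinct = sorted(set(values))
    pool = sorted(set(replacement_pool))
    if len(pool) < len(distinct):
        raise ValueError("replacement pool is smaller than the set of distinct values")
    # SystemRandom draws from the operating system's entropy source, so the
    # assignment cannot be reproduced from a guessable seed.
    chosen = random.SystemRandom().sample(pool, len(distinct))
    return dict(zip(distinct, chosen))


# Hypothetical production records: (first name, last name).
production = [("John", "Brown")] * 3 + [("John", "Smith")] * 2 + [("Peter", "Brown")]

first_map = build_one_to_one_map([f for f, _ in production], ["Lester", "Victor", "Howard"])
last_map = build_one_to_one_map([l for _, l in production], ["Camden", "Ashby", "Thorne"])

anonymized = [(first_map[f], last_map[l]) for f, l in production]

# The query that matched "John Brown" has an exact counterpart after anonymization.
john_brown_before = Counter(production)[("John", "Brown")]
john_brown_after = Counter(anonymized)[(first_map["John"], last_map["Brown"])]
print(john_brown_before, john_brown_after)
```

Because both maps are one-to-one, every combination of first and last name keeps its original record count, which is the property that functional, performance, and load tests rely on.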