Adventures with Testing BI/DW Applications
By: Raj Kamal/Nakul Madan
The major objective of this paper is to document a generalized test process that can be followed across the business intelligence and data warehouse domain. This paper excludes the automation and performance aspects of ETL testing.
The test objective is to enable customers to make intelligent decisions based on accurate and timely analysis of data.
The test focus should be on verification and validation of the business transformations applied to the data that support accurate and timely decisions, e.g. identifying the items with the most sales in a particular area within the last two years.
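One common way to verify such a transformation is to recompute the business rule independently of the ETL code and compare the result against what the warehouse reports. The sketch below assumes hypothetical source rows and a made-up "top seller per region in the last two years" rule purely for illustration:

```python
from collections import defaultdict
from datetime import date

# Hypothetical source rows: (item, region, sale_date, quantity)
source_rows = [
    ("widget", "north", date(2023, 5, 1), 10),
    ("gadget", "north", date(2023, 6, 1), 25),
    ("widget", "north", date(2020, 1, 1), 99),  # outside the two-year window
    ("widget", "south", date(2024, 2, 1), 7),
]

def top_seller_per_region(rows, cutoff):
    """Independently recompute the business rule: total quantity per
    (region, item) for sales on/after the cutoff, then pick the
    best-selling item in each region."""
    totals = defaultdict(int)
    for item, region, sale_date, qty in rows:
        if sale_date >= cutoff:
            totals[(region, item)] += qty
    best = {}
    for (region, item), qty in totals.items():
        if region not in best or qty > best[region][1]:
            best[region] = (item, qty)
    return best

expected = top_seller_per_region(source_rows, date(2023, 1, 1))
# Compare `expected` against the top sellers reported by the warehouse.
print(expected)  # {'north': ('gadget', 25), 'south': ('widget', 7)}
```

The value of this check comes from computing the expectation by a different route than the ETL under test, so a shared bug cannot mask itself.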
Consolidation & frequent retrieval of data (OLAP) takes precedence over frequent storage/rare retrieval (OLTPs)
The emphasis here is mostly on consolidating and modelling data from various disparate sources into OLAP form to support faster retrieval, in contrast with the frequent-storage, rare-retrieval pattern of OLTP systems.
Freshness and accuracy of the data is the key to success
Timely availability of accurate and recent data is extremely critical for BI/DW applications to support on-time decisions. The Service Level Agreement (SLA) for the availability of the latest and historic data has to be met, even though the volume of data and the size of the warehouse remain largely unpredictable due to their dynamic nature.
Need to maintain history of data; Space required is huge
Data warehouses typically maintain the history of the data, so the storage requirement is enormous compared to transactional systems, which primarily focus on recent data of immediate relevance.
Performance of Retrieval is important: De-normalization preferred over Normalization
Warehouses are typically de-normalized with fewer tables, using star and/or snowflake schemas, in contrast to OLTP systems, which follow Codd's well-known data normalization approach. The data in the warehouse is often stored multiple times, in its most granular form, to improve the performance of data retrieval.
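A minimal sketch of a star schema, using Python's built-in sqlite3 with made-up table and column names: one fact table plus a deliberately denormalized dimension, so a single join answers an aggregate query that a fully normalized model would need several joins for.

```python
import sqlite3

# A minimal star schema: one fact table plus a denormalized dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT       -- denormalized: category stored per product
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_amount REAL
);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 20.0);
""")

# One join reaches every attribute needed for the rollup.
rows = con.execute("""
    SELECT d.category, SUM(f.sale_amount)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)  # [('books', 20.0), ('hardware', 150.0)]
```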
Importance of Data Security
PII (Personally Identifiable Information) and other sensitive information are of HBI (High Business Impact) to customers. Maintaining the confidentiality of PII fields such as customer name, account details and contact details is among the top priorities for any DW application. The data has to be closely analyzed, and programs designed to protect PII data and expose only the required information.
Data volume and complexity grows over time
In this global economy, mergers, expansions and acquisitions have become quite common. Over a period of time, multiple sources get added and a single repository is constructed to consolidate all the data in one place. As the data grows, the complexity of understanding its syntax and semantics increases exponentially. The complex transformation logic needed to tackle this may further impact user query performance.
Upstream changes often lead to failures
Any change made to the design of upstream data sources directly impacts the integration process, which in turn forces modification of the existing schema and/or transformation logic. This eventually makes it impossible to meet the SLA on time. Another constraint is the availability of the data sources themselves during any unplanned outage.
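One mitigation worth testing is a schema-drift check that runs before any load starts, comparing the schema advertised by the upstream source against the contract the ETL was built for. The sketch below assumes a hypothetical schema contract expressed as a column-to-type mapping:

```python
# Hypothetical contract the ETL was built against.
EXPECTED_SCHEMA = {"order_id": "INTEGER", "amount": "REAL", "region": "TEXT"}

def schema_drift(expected, actual):
    """Report columns that were dropped, added, or changed type upstream."""
    missing = set(expected) - set(actual)
    added = set(actual) - set(expected)
    changed = {c for c in set(expected) & set(actual) if expected[c] != actual[c]}
    return missing, added, changed

# Simulated upstream change: `region` renamed, `amount` retyped.
upstream = {"order_id": "INTEGER", "amount": "TEXT", "sales_region": "TEXT"}
missing, added, changed = schema_drift(EXPECTED_SCHEMA, upstream)
print(missing, added, changed)  # {'region'} {'sales_region'} {'amount'}
```

Failing fast on drift turns a silent mid-load breakage into an actionable alert before the SLA window is consumed.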
Upstream Data Quality Issues
Quite often, the quality of the upstream data to be acquired is itself in question. It has been noticed that primary keys are not as unique as expected, and that duplicate or malformed data does exist in source systems.
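Checks like these can be automated as part of acquisition testing. The sketch below, using made-up rows and a simplistic e-mail pattern purely for illustration, flags duplicate primary keys and malformed values in a source feed:

```python
import re

# Hypothetical source feed with both defects present.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},   # malformed value
    {"id": 2, "email": "b@example.com"},  # duplicate primary key
]

def quality_report(rows, pk="id"):
    """Return (duplicate primary keys, pk values of rows with a
    malformed email). The email regex is deliberately simplistic."""
    seen, dup_pks, malformed = set(), set(), []
    email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    for row in rows:
        if row[pk] in seen:
            dup_pks.add(row[pk])
        seen.add(row[pk])
        if not email_re.match(row["email"]):
            malformed.append(row[pk])
    return dup_pks, malformed

dups, bad = quality_report(rows)
print(dups, bad)  # {2} [2]
```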
Data Retention (Archival & Purge) Policy increases maintenance and storage cost
The data archival and purging policy is determined by business needs; if the data is required for a longer duration, the cost of maintaining it gradually increases with time. Data partitioning techniques need to be applied to ensure that performance doesn't degrade over a period of time.
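The interaction between partitioning and retention can be illustrated in miniature: when fact rows are bucketed into yearly partitions, purging becomes "drop the whole partition" rather than row-by-row deletes. All names and the three-year retention policy below are assumptions for the sketch:

```python
from collections import defaultdict
from datetime import date

RETENTION_YEARS = 3  # assumed policy: purge partitions older than 3 years

def partition_by_year(rows):
    """Bucket fact rows into yearly partitions so archival/purge can
    drop an entire partition instead of deleting row by row."""
    parts = defaultdict(list)
    for sale_date, amount in rows:
        parts[sale_date.year].append((sale_date, amount))
    return parts

rows = [(date(2020, 3, 1), 10.0), (date(2024, 7, 1), 20.0), (date(2024, 9, 9), 5.0)]
parts = partition_by_year(rows)

current_year = 2025  # fixed here so the example is deterministic
purgeable = [y for y in parts if current_year - y > RETENTION_YEARS]
print(sorted(parts), purgeable)  # [2020, 2024] [2020]
```

Retention tests can then verify that exactly the partitions outside the policy window are purged, and nothing newer.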
Data Freshness required can be quite costly for NRTR (Near Real-Time Reports)
For many time-critical applications in sectors like stock exchanges and banking, it's important that the operational information presented in trending reports, scorecards and dashboards reflects the latest information from transactional sources across the globe, so that accurate decisions can be made in a timely fashion. This requires very frequent processing of the source data, which can be cumbersome and cost-intensive.
... to read more articles, visit http://sqa.fyicenter.com/art/