Testing Big Data: Three Fundamental Components
By: Alexander Panchenko
Big Data is a big topic in software development today. In practice, however, software testers may not yet fully understand what Big Data actually is. What testers do know is that they need a plan for testing it. The problem is the lack of a clear understanding of what to test and how deeply a tester should go. Some key questions must be answered before going down this path. Since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools?
As a software tester, it is imperative to first have a clear definition of Big Data. Many of us mistakenly believe that Big Data is just a large amount of information. This is incorrect. For example, a 2 petabyte Oracle database alone doesn’t constitute a Big Data situation; it is just a high-load one. More precisely, Big Data is a set of approaches, tools and methods for processing high volumes of structured and (most importantly) unstructured data. The key difference between Big Data and “ordinary” high-load systems is the ability to create flexible queries.
The Big Data trend first appeared five years ago in the U.S., when researchers from Google announced their achievement in the scientific journal Nature. Without relying on the results of medical tests, they were able to track the spread of influenza-like illness in the U.S. population by analyzing the number of related Google search queries.
Today, Big Data can be described by three “Vs”: Volume, Variety and Velocity. In other words, you have to process an enormous amount of data of various formats at high speed. The processing of Big Data, and therefore its testing, can be split into three basic components.
The process is illustrated below by an example based on the open source Apache Hadoop software framework:
Loading the initial data into the Hadoop Distributed File System (HDFS).
Execution of Map-Reduce operations.
Rolling out the output results from the HDFS.
Loading the Initial Data into HDFS
In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into HDFS, where it is split into multiple files. The checks at this stage are:
Verify that the required data was extracted from the original system and there was no data corruption.
Validate that the data files were loaded into the HDFS correctly.
Check how the files are partitioned and verify that they are replicated to different data nodes.
Determine the most complete set of data that needs to be checked. For a step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
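The first two checks above can be sketched in Python. This is a minimal illustration, not a prescribed method: it assumes newline-delimited records and uses local files as a stand-in for both the source extract and the files loaded into HDFS (in a real project the loaded files would be read through an HDFS client). The names fingerprint and verify_load are hypothetical. Because the data is split into multiple files on load, the comparison uses a record count and an order-independent checksum rather than a file-by-file comparison.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Tuple

MODULUS = 2 ** 256  # keeps the running checksum at a fixed size


def fingerprint(paths: Iterable[Path]) -> Tuple[int, int]:
    """Return (record_count, order-independent checksum) for a set of files.

    Assumption: one record per line. Per-record hashes are summed, so the
    result does not depend on record order or on how records are split
    across files, and duplicate records are still counted.
    """
    count = 0
    checksum = 0
    for path in paths:
        with path.open("rb") as handle:
            for line in handle:
                record = line.rstrip(b"\r\n")
                if not record:
                    continue  # ignore blank lines
                count += 1
                record_hash = hashlib.sha256(record).digest()
                checksum = (checksum + int.from_bytes(record_hash, "big")) % MODULUS
    return count, checksum


def verify_load(source_paths: Iterable[Path], loaded_paths: Iterable[Path]) -> List[str]:
    """Compare source extracts with loaded files; an empty list means they agree."""
    source_count, source_sum = fingerprint(source_paths)
    loaded_count, loaded_sum = fingerprint(loaded_paths)
    problems = []
    if source_count != loaded_count:
        problems.append(
            f"record count differs: source={source_count}, loaded={loaded_count}"
        )
    if source_sum != loaded_sum:
        problems.append("record contents differ (checksum mismatch)")
    return problems
```

A tester would call verify_load with the list of source extracts and the list of loaded part files, and treat any returned message as a failed check.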
Execution of Map-Reduce Operations
In this step, you process the initial data using a Map-Reduce operation to obtain the desired result. Map-Reduce is a data processing concept for condensing large volumes of data into useful aggregated results. The checks at this stage are:
Check the required business logic on a standalone unit and then on a set of units.
Validate the Map-Reduce process to ensure that the “key-value” pairs are generated correctly.
Check the aggregation and consolidation of data after the “reduce” operation is performed.
Compare the output data with initial files to make sure that the output file was generated and its format meets all the requirements.
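To make the key-value and aggregation checks concrete, here is a small pure-Python sketch of the Map-Reduce model. It does not use Hadoop; it only mimics the map, shuffle and reduce phases in memory, with a word count standing in for the business logic. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) key-value pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Aggregate all values of one key into a single total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in memory and return the aggregated result."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

With this shape, the mapper can be tested on a single line (are the pairs correct?), and the whole job can be tested on a small input whose expected totals are known in advance, which mirrors the "standalone unit first, then a set of units" order described above.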
A well-suited tool for verifying the data is Hive. Testers prepare queries in the SQL-style Hive Query Language (HQL) and run them against HBase to verify that the output complies with the requirements. HBase is a NoSQL database that can serve as the input and output for Map-Reduce jobs.
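The idea of SQL-style verification queries can be illustrated as follows. Note the assumptions: this sketch uses Python's built-in sqlite3 module and an in-memory table purely as a stand-in, so that it runs without a cluster; it does not connect to Hive or HBase. The table job_output, its columns, and the two requirements being checked (each key appears once after the reduce step; counts are never missing or negative) are hypothetical.

```python
import sqlite3


def find_violations(connection: sqlite3.Connection, query: str) -> list:
    """Run a verification query; any returned row is a requirement violation."""
    return connection.execute(query).fetchall()


# In-memory SQLite table standing in for a job output table (hypothetical schema).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE job_output (page TEXT, hits INTEGER)")
connection.executemany(
    "INSERT INTO job_output VALUES (?, ?)",
    [("/home", 120), ("/about", 15), ("/home", 7)],
)

# Requirement 1 (assumed): after the reduce step, each key appears exactly once.
DUPLICATE_KEYS = (
    "SELECT page, COUNT(*) AS n FROM job_output "
    "GROUP BY page HAVING COUNT(*) > 1"
)

# Requirement 2 (assumed): aggregated counts are never missing or negative.
INVALID_COUNTS = "SELECT page, hits FROM job_output WHERE hits IS NULL OR hits < 0"
```

Writing each requirement as a query that should return no rows keeps the checks easy to read and easy to automate: a non-empty result points directly at the offending records.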
You can also use other Big Data processing frameworks as an alternative to Map-Reduce. Spark and Storm are good examples of substitutes for this programming model, as they provide similar functionality and are compatible with the Hadoop ecosystem.
Rolling out the Output Results from HDFS
This final step includes unloading the data that was generated in the second step and loading it into the downstream system, which may be a data repository for generating reports or a transactional analysis system for further processing:
Inspect the aggregated data to make sure that it has been loaded into the required system and was not distorted along the way.
Validate that the reports include all the required data, and that all indicators refer to concrete measures and are displayed correctly.
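One way to check that the data was not distorted on its way to the downstream system is to reconcile the two sides key by key. The sketch below assumes that both the exported results and the downstream records have already been read into dictionaries mapping a key to a numeric measure; how they are read depends on the systems involved and is not shown. The function name reconcile and the tolerance parameter are hypothetical.

```python
from typing import Dict, List


def reconcile(
    exported: Dict[str, float],
    downstream: Dict[str, float],
    tolerance: float = 1e-9,
) -> List[str]:
    """Compare exported results with downstream records.

    Returns a list of discrepancies; an empty list means both sides agree.
    """
    problems = []
    # Keys that were exported but never arrived downstream.
    for key in sorted(exported.keys() - downstream.keys()):
        problems.append(f"missing downstream: {key}")
    # Keys that exist downstream but were never exported.
    for key in sorted(downstream.keys() - exported.keys()):
        problems.append(f"unexpected downstream: {key}")
    # Keys present on both sides whose values differ beyond the tolerance.
    for key in sorted(exported.keys() & downstream.keys()):
        if abs(exported[key] - downstream[key]) > tolerance:
            problems.append(
                f"value differs for {key}: {exported[key]} != {downstream[key]}"
            )
    return problems
```

The tolerance exists because numeric values may be rounded or converted between types during the load; for exact integer counts it can be set to zero.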
Test data for a Big Data project can be obtained in two ways: copying actual production data or creating data exclusively for testing purposes. Software testers prefer the former, because the conditions are as realistic as possible and it becomes easier to cover a larger number of test scenarios. However, not all companies are willing to provide real data, since they prefer to keep some information confidential. In that case, you must create the test data yourself or request artificial data. The main drawback of this approach is that artificial business scenarios built on limited data inevitably restrict testing, so some defects will be detected only by real users.
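When production data is not available, artificial test data can be generated with a small script. The following is a minimal sketch under stated assumptions: the record layout (a web log entry with user_id, page, status and latency_ms), the value ranges and the function name are all invented for illustration. A fixed seed makes the generated data reproducible, so a failing test can be rerun against exactly the same input.

```python
import random
from typing import Dict, List

# Hypothetical value domains for an artificial web log.
PAGES = ["/home", "/search", "/cart", "/checkout"]
STATUSES = [200, 200, 200, 404, 500]  # repeated entries weight toward success


def generate_log_records(count: int, seed: int = 42) -> List[Dict[str, object]]:
    """Generate reproducible artificial web log records."""
    rng = random.Random(seed)  # local generator, does not touch global state
    records = []
    for _ in range(count):
        records.append(
            {
                "user_id": f"user-{rng.randint(1, 50):03d}",
                "page": rng.choice(PAGES),
                "status": rng.choice(STATUSES),
                "latency_ms": rng.randint(5, 2000),
            }
        )
    return records
```

Data generated this way covers only the scenarios its author thought of, which is exactly the limitation described above.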
As speed is one of Big Data’s main characteristics, performance testing is mandatory. A huge volume of data and an infrastructure similar to the production infrastructure are usually set up for performance testing. Where acceptable, the data is copied directly from production.