Big Data: How to Test the Elephant?
By: Alexander Panchenko
Big Data is a big topic in software development today. In practice, however, software testers may not yet fully understand what exactly Big Data is. A tester knows that a plan is needed for testing it, but since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools? This article tries to answer these questions.
As more and more companies adopt Big Data as a solution for data analysis, the question arises: how do you determine a proper testing strategy for controlling this heavyweight “elephant”? The problem for software testing is magnified by a lack of clear understanding of what to test and how deep a tester should go.
As a software tester, you need a clear definition of Big Data. Many of us mistakenly believe that Big Data is just a large amount of information. That is not accurate: a 2-petabyte Oracle database by itself is not Big Data, but simply a high-load database. To be precise, Big Data is a set of approaches, tools, and methods for processing high volumes of structured and, most importantly, unstructured data. The key difference between Big Data and “ordinary” high-load systems is the ability to create flexible queries.
Big Data can be described by three “V”s: Volume, Variety, and Velocity. In other words, you have to process an enormous amount of data in various formats at high speed.
The processing of Big Data, and therefore its software testing process, can be split into three basic components. The process is illustrated below with an example based on the open-source Apache Hadoop framework:
1. Loading the initial data into the HDFS (Hadoop Distributed File System)
2. Execution of Map-Reduce operations
3. Rolling out the output results from the HDFS
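The second step above, executing MapReduce operations, can be sketched in miniature. The example below is a minimal, local word-count in the style of Hadoop Streaming (where mappers and reducers exchange plain key/value pairs); the function names are illustrative and not part of any Hadoop API, and a real job would run distributed across the cluster rather than in one process.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts per word. The input must be sorted
    by key, which is what the Hadoop shuffle/sort phase guarantees."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "data needs testing"]
    print(dict(reducer(mapper(sample))))
    # → {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'testing': 1}
```

For the tester, a local harness like this is useful because the business logic of a MapReduce job can be exercised on small, known inputs before it is validated at cluster scale.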
Loading the initial data into HDFS
In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into the HDFS, where it is split into multiple files. During this step, the tester's tasks include:
* Verifying that the required data was extracted from the original system and that no data was corrupted;
* Validating that the data files were loaded into the HDFS correctly;
* Checking that the files were partitioned and copied to different data nodes;
* Determining the most complete set of data that needs to be checked.

For step-by-step validation, you can use tools such as Datameer, Talend, or Informatica.
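The first two checks in the list, completeness of the extract and correctness of the load, often reduce to comparing record counts and content between the source and what landed in HDFS. The sketch below illustrates the idea with in-memory record lists as stand-ins; in a real test they would be read from the source system and from the HDFS files, and the XOR-based checksum is just one simple order-independent choice, not a standard Hadoop utility.

```python
import hashlib

def checksum(records):
    """Order-independent content checksum: XOR of per-record SHA-256
    digests, so the file-split order in HDFS does not matter."""
    digest = 0
    for rec in records:
        digest ^= int(hashlib.sha256(rec.encode("utf-8")).hexdigest(), 16)
    return digest

def validate_load(source_records, hdfs_records):
    """Return (counts_match, content_matches) for a loaded data set."""
    counts_match = len(source_records) == len(hdfs_records)
    content_matches = checksum(source_records) == checksum(hdfs_records)
    return counts_match, content_matches

if __name__ == "__main__":
    src = ["u1,click,2024-01-01", "u2,view,2024-01-01"]
    loaded = ["u2,view,2024-01-01", "u1,click,2024-01-01"]  # order differs
    print(validate_load(src, loaded))  # → (True, True)
```

A count match with a checksum mismatch points at corruption during the load; a count mismatch points at dropped or duplicated records. Note that a simple XOR checksum cannot distinguish two data sets that differ only by a duplicated pair of identical records, so stricter tests may need a full sorted comparison.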