Tuesday 31 May 2016

Building High Performance Big Data and Analytics Systems

Introduction

       Big Data and Analytics systems are fast emerging as some of the most critical systems in an organization’s IT environment. But with such huge amounts of data come performance challenges. If Big Data systems cannot be used to make or forecast critical business decisions, or to surface the insights into business value hidden under huge amounts of data at the right time, then these systems lose their relevance. This blog post discusses some of the critical performance considerations in a technology-agnostic way. It presents techniques and guidelines that can be applied during the different phases of a Big Data system (i.e. data extraction, data cleansing, processing, storage, as well as presentation). These are generic guidelines that any Big Data professional can use to ensure that the final system meets its performance requirements.

What is Big Data

          Big Data is one of the most common terms in the IT world these days. Though different terms and definitions are used to explain Big Data, in principle they all conclude the same point: with the generation of huge amounts of data, from both structured and unstructured sources, traditional approaches to handling and processing this data are no longer sufficient.

Big Data systems are generally considered to have five main characteristics of data, commonly called the 5 Vs of data: Volume, Variety, Velocity, Veracity, and Value.

According to Gartner, high volume can be defined as follows: “Bigdata is high volume when the processing capacity of the native data-capture technology and processes is insufficient for delivering business value to subsequent use cases. High volume also occurs when the existing technology was specifically engineered for addressing such volumes – a successful big data solution”.

Building Blocks of a Big Data System

         A Big Data system comprises a number of functional blocks that give the system the capability to acquire data from diverse sources, pre-process this data (e.g. cleansing, validation), store the data, run processing and analytics on this stored data (e.g. performing predictive analytics, generating recommendations for online users, and so on), and finally present and visualize the summarized and aggregated results.

The following figure depicts these high-level components of a Big Data system.

Big Data System
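These functional blocks can be sketched as a simple pipeline. The sketch below is purely illustrative; every function name and record field here is an assumption for demonstration, not part of any specific Big Data framework.

```python
# Illustrative sketch of the functional blocks of a Big Data system,
# chained into one pipeline. All names and fields are hypothetical.

def acquire(sources):
    """Data acquisition: collect raw records from diverse sources."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Pre-processing (cleansing/validation): drop invalid records."""
    return [r for r in records if r.get("value") is not None]

def store(records, warehouse):
    """Storage: persist pre-processed records (here, an in-memory list)."""
    warehouse.extend(records)
    return warehouse

def analyze(warehouse):
    """Processing and analytics: aggregate values per key."""
    totals = {}
    for r in warehouse:
        totals[r["key"]] = totals.get(r["key"], 0) + r["value"]
    return totals

def present(totals):
    """Presentation: summarize results in a user-friendly form."""
    return [f"{key}: {value}" for key, value in sorted(totals.items())]

# Example run through all the blocks:
sources = [
    [{"key": "a", "value": 1}, {"key": "b", "value": None}],
    [{"key": "a", "value": 2}, {"key": "b", "value": 3}],
]
report = present(analyze(store(preprocess(acquire(sources)), [])))
```

In a real system each block would be a distributed component (e.g. a message queue feeding a cluster store), but the data flow between the blocks is the same.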


Data Processing and Analysis

     Once cleansed and de-duplicated, the pre-processed data is available for the final processing and for applying the required analytical functions. Some of the steps involved in this stage are de-normalizing the cleansed data, performing some form of correlation amongst different sets of data, aggregating the results over pre-defined time intervals, applying machine learning (ML) algorithms, performing predictive analytics, and so on.
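One of the steps mentioned above, aggregating results over pre-defined time intervals, can be sketched as below. The window size and the event format are illustrative assumptions; production systems would typically perform this in a distributed engine rather than on one node.

```python
# Sketch of time-interval aggregation: events are bucketed into fixed
# windows and their values summed. Window size is a hypothetical choice.
from collections import defaultdict
from datetime import datetime

def floor_to_window(ts, minutes=5):
    """Align a timestamp to the start of its fixed-size window."""
    return ts.replace(minute=(ts.minute // minutes) * minutes,
                      second=0, microsecond=0)

def aggregate_by_interval(events, minutes=5):
    """Sum (timestamp, value) event pairs within fixed time windows."""
    buckets = defaultdict(int)
    for ts, value in events:
        buckets[floor_to_window(ts, minutes)] += value
    return dict(buckets)

events = [
    (datetime(2016, 5, 31, 10, 1), 4),
    (datetime(2016, 5, 31, 10, 3), 6),
    (datetime(2016, 5, 31, 10, 7), 5),
]
totals = aggregate_by_interval(events)
# Two windows: 10:00-10:05 sums to 10, 10:05-10:10 sums to 5.
```

Pre-computing such windowed aggregates is what later lets the presentation layer read small summary tables instead of scanning raw data.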

In the following sections, this blog will present some of the best practices for carrying out data processing and analysis, to achieve better performance in a Big Data System.

Visualization and Presentation

The last step of a Big Data flow is to view the output of the different analytical functions. This step involves reading the pre-computed aggregated results (or other such entities) and presenting them in the form of user-friendly tables, charts, and similar visualizations that make the results easy to interpret and understand.
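As a minimal sketch of this step, the snippet below reads a dictionary of pre-computed aggregates and renders it as a text table with a proportional bar, standing in for a real charting front end. The metric names and values are invented for illustration.

```python
# Sketch of the presentation step: pre-computed aggregates rendered as
# a user-friendly table. Metric names and values are hypothetical.

def render_table(aggregates, width=20):
    """Render {label: value} aggregates as 'label  value  bar' rows."""
    peak = max(aggregates.values())
    rows = []
    for label, value in sorted(aggregates.items()):
        bar = "#" * round(width * value / peak)  # bar scaled to the peak
        rows.append(f"{label:<12} {value:>8} {bar}")
    return "\n".join(rows)

aggregates = {"page_views": 1200, "signups": 300, "purchases": 90}
table = render_table(aggregates)
print(table)
```

Because the heavy aggregation was done in the earlier processing stage, this step only formats small summaries, which is what keeps dashboards responsive.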



Data Acquisition


