Building High Performance Big Data and Analytics Systems
Introduction
Big Data and Analytics systems are fast emerging as some of the most critical systems in an organization's IT environment. But with such huge amounts of data come performance challenges. If Big Data systems cannot be used to make or forecast critical business decisions, or to surface, at the right time, the business insights hidden in that data, they lose their relevance. This blog post discusses some of the critical performance considerations in a technology-agnostic way. It presents techniques and guidelines that can be applied during the different phases of a Big Data system (i.e. data extraction, data cleansing, processing, storage, as well as presentation). These are generic guidelines that any Big Data professional can use to ensure that the final system meets its performance requirements.
What is Big Data
Big Data is one of the most common terms in the IT world these days. Though different terms and definitions are used to explain Big Data, in principle they all converge on the same point: with the generation of huge amounts of data, from both structured and unstructured sources, traditional approaches to handling and processing this data are no longer sufficient.
Big Data systems are generally considered to have five main characteristics of data, commonly called the 5 Vs of data: Volume, Variety, Velocity, Veracity, and Value.
According to Gartner, high Volume can be defined as follows: “Big data is high volume when the processing capacity of the native data-capture technology and processes is insufficient for delivering business value to subsequent use cases. High volume also occurs when the existing technology was specifically engineered for addressing such volumes – a successful big data solution”.
Building Blocks of a Big Data System
A Big Data system comprises a number of functional blocks that give the system the capability to acquire data from diverse sources, pre-process it (e.g. cleansing and validation), store it, run processing and analytics on the stored data (e.g. performing predictive analytics, generating recommendations for online users, and so on), and finally present and visualize the summarized and aggregated results.
The following figure depicts these high-level components of a Big Data system.

[Figure: Big Data System]
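These functional blocks can be sketched end to end as a minimal pipeline. The stage functions and sample records below are illustrative assumptions for this post, not part of any specific product or framework:

```python
# Minimal sketch of a Big Data pipeline's functional blocks (illustrative;
# the stage names and sample records are assumptions, not a real product API).

def acquire():
    # Data acquisition: pull raw records from diverse sources.
    return [{"user": "u1", "amount": "10.5"},
            {"user": "u2", "amount": "bad"},
            {"user": "u1", "amount": "10.5"}]

def preprocess(records):
    # Pre-processing: validate and cleanse (drop unparsable amounts, de-dupe).
    seen, clean = set(), []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # validation: skip malformed records
        key = (r["user"], amount)
        if key not in seen:  # de-duplication
            seen.add(key)
            clean.append({"user": r["user"], "amount": amount})
    return clean

def analyze(records):
    # Processing and analytics: aggregate total spend per user.
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def present(totals):
    # Presentation: summarized results in a user-friendly form.
    return [f"{user}: {total:.2f}" for user, total in sorted(totals.items())]

report = present(analyze(preprocess(acquire())))
```

In a real system each stage would be a distributed component rather than a function call, but the data flow between blocks is the same.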
Data Processing and Analysis
Once cleansed and de-duplicated, the pre-processed data is available for final processing and for applying the required analytical functions. Steps involved in this stage include de-normalizing the cleansed data, correlating different sets of data, aggregating results over pre-defined time intervals, applying machine-learning algorithms, performing predictive analytics, and so on.
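Two of these steps, de-normalizing by joining reference data into event records and then aggregating over fixed time intervals, can be sketched as follows. The event and user data here are hypothetical assumptions for illustration:

```python
from datetime import datetime

# Hypothetical inputs (assumptions for illustration): raw events plus a
# reference table whose attributes we de-normalize into each event.
events = [
    {"user_id": 1, "ts": "2024-01-01T10:05:00", "value": 3.0},
    {"user_id": 1, "ts": "2024-01-01T10:40:00", "value": 2.0},
    {"user_id": 2, "ts": "2024-01-01T11:10:00", "value": 5.0},
]
users = {1: "gold", 2: "silver"}  # user_id -> customer segment

def denormalize(events, users):
    # Join the user's segment into each event so that later processing
    # stages need no per-record lookups against the reference table.
    return [{**e, "segment": users[e["user_id"]]} for e in events]

def aggregate_hourly(events):
    # Aggregate values into pre-defined one-hour buckets per segment.
    buckets = {}
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0)
        key = (e["segment"], hour.isoformat())
        buckets[key] = buckets.get(key, 0.0) + e["value"]
    return buckets

hourly = aggregate_hourly(denormalize(events, users))
```

Pre-computing such interval aggregates is a common performance technique: downstream consumers read small summary tables instead of re-scanning raw events.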
In the following sections, this blog presents some best practices for carrying out data processing and analysis to achieve better performance in a Big Data system.
Visualization and Presentation
The last step of a Big Data flow is to view the output of the different analytical functions. This step involves reading the pre-computed, aggregated results (or other such entities) and presenting them as user-friendly tables, charts, and similar formats that make the results easy to interpret and understand.
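As a minimal sketch of this step, the snippet below reads pre-computed aggregates (here a hypothetical in-memory dict standing in for the results store) and renders them as a simple, readable table. The metric name and sample values are assumptions:

```python
# Presentation step sketch: read pre-computed aggregates and render them
# as a plain-text table. The data and metric name are illustrative.
aggregates = {"2024-01-01": 125.0, "2024-01-02": 98.5, "2024-01-03": 143.25}

def render_table(aggregates, metric="total"):
    # Fixed-width columns keep the output readable in logs and terminals.
    header = f"{'date':<12}{metric:>10}"
    rows = [f"{day:<12}{value:>10.2f}"
            for day, value in sorted(aggregates.items())]
    return "\n".join([header] + rows)

table = render_table(aggregates)
```

Because the heavy aggregation was done upstream, this layer stays fast: it only formats a small result set, which is exactly the property a performant presentation tier should have.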