Thursday, July 4, 2013

Big Data Analytics - For Beginners

“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” Source: “Big Data: The Next Frontier for Innovation, Competition, and Productivity”, McKinsey, May 2011


Data is moving from a relational to a chaotic world. We already have a huge amount of data stored in structured form in traditional relational databases, but unstructured, complex data from mixed sources and in multiple formats (text files, logs, binary, XML, etc.) poses a serious problem. The challenge grows as data volumes move from terabytes (jokingly called “terror bytes” some time ago because of their size) to petabytes. On top of that, organizations today have a huge data management problem, with data in silos and scattered everywhere. The ability to stitch together multiple sources of data is going to be the game changer.

The world desperately needed an answer to these challenges: a way to store, process and compute over data cheaply and quickly, irrespective of its size, format, structure or schema.

Apache Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

MapReduce: At its core, MapReduce takes a query over a dataset, distributes it and runs it in parallel over multiple nodes. Distributing the query solves the problem of datasets too large for a single machine's size and capacity. MapReduce can also be found inside MPP and NoSQL databases, such as Vertica or MongoDB.
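To make that concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API; the class names and paths are illustrative, not taken from any particular distribution. The mapper emits (word, 1) pairs and the reducer sums them per word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the 1s emitted for each word across all mappers
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate locally before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory, e.g. in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory; must not exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Hadoop schedules each map task on a node that holds a block of the input, which is what makes the distribution transparent to the person writing the query.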

Hadoop Distributed File System (HDFS™): For that computation to take place, each server must have access to the data. HDFS replicates data redundantly across the cluster, and when a node completes a calculation it writes its results back into HDFS. There are no restrictions on the data HDFS stores: it may be unstructured and schemaless.
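Applications reach HDFS through the same Java FileSystem API the framework uses internally. Below is a minimal sketch that writes a file and reads it back, assuming a reachable NameNode at the hypothetical address hdfs://namenode:8020; in practice the address is picked up from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS replicates its blocks across the cluster
    Path path = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello, hdfs\n".getBytes("UTF-8"));
    }

    // Read it back from whichever datanode holds a replica
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}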

Pig: Pig is a programming language that simplifies the common tasks of loading data, transforming it and storing the final results. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Pig gives the developer more agility when exploring large datasets, allowing succinct scripts that transform data flows for incorporation into larger applications, and it drastically cuts the amount of code needed compared to direct use of Hadoop's Java APIs.
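Pig Latin scripts normally run from the Grunt shell or as standalone scripts, but they can also be embedded in Java through the PigServer class. Below is a minimal sketch that counts hits per IP address in a web log; the input file name and its field layout are assumptions made for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigHitCounter {
  public static void main(String[] args) throws Exception {
    // LOCAL runs on one machine; ExecType.MAPREDUCE would submit to a cluster
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load space-delimited log lines into named, typed fields (hypothetical layout)
    pig.registerQuery("logs = LOAD 'access_log' USING PigStorage(' ') "
        + "AS (ip:chararray, timestamp:chararray, url:chararray);");

    // Group by IP and count the hits in each group
    pig.registerQuery("by_ip = GROUP logs BY ip;");
    pig.registerQuery("hits = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS n;");

    // Registered statements only execute when a store (or dump) is requested
    pig.store("hits", "hits_per_ip");
  }
}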

Hadoop and its companion projects at a glance:

Ambari: Deployment, configuration and monitoring
Flume: Collection and import of log and event data
HBase: Column-oriented database scaling to billions of rows
HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
HDFS: Distributed redundant file system for Hadoop
Hive: Data warehouse with SQL-like access (see the JDBC sketch after this list)
Mahout: Library of machine learning and data mining algorithms
MapReduce: Parallel computation on server clusters
Oozie: Orchestration and workflow management
Pig: High-level programming language for Hadoop computations
Sqoop: Imports data from relational databases
Whirr: Cloud-agnostic deployment of clusters
ZooKeeper: Configuration management and coordination
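As an example of Hive's SQL-like access, queries can be issued from Java over JDBC. A minimal sketch, assuming a HiveServer2 instance at a hypothetical address and a pre-existing tweets table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; hive-jdbc must be on the classpath
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical server address and database name
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-server:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL-like query into MapReduce jobs behind the scenes
         ResultSet rs = stmt.executeQuery(
             "SELECT user, COUNT(*) AS n FROM tweets GROUP BY user")) {
      while (rs.next()) {
        System.out.println(rs.getString("user") + "\t" + rs.getLong("n"));
      }
    }
  }
}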

Who should use Hadoop?

Typically, any organization with more than 2 terabytes of data should consider Hadoop. "Anything more than 100 [terabytes], you absolutely want to be looking at Hadoop," said Josh Sullivan, a Vice President at Booz Allen Hamilton and founder of the Hadoop-DC Meetup group.

Case: Twitter

“Twitter users generate 12 terabytes of data a day - about four petabytes per year. And that amount is multiplying every year.”
With this massive amount of user-generated data, Twitter has to store it on clusters of machines rather than on a single hard drive, and it uses Cloudera's Hadoop distribution to power those clusters.
Twitter uses the data it collects to answer many kinds of questions, from simple computations, such as counting the requests and searches it serves each day, to complex comparative analyses, such as determining how different groups of users use the service or whether certain features help turn casual users into frequent ones. Other areas of deep interest include working out which tweets get retweeted and telling humans apart from bots.


Frequently Asked Questions:

Programming using R
Revolution Analytics has developed “ConnectR for Hadoop,” a collection of capabilities that brings the power of advanced R analytics to Hadoop distributions, including those from its partners Cloudera, Hortonworks, IBM BigInsights and Intel. ConnectR for Hadoop provides the ability to manipulate Hadoop data stores directly from HDFS and HBase, and gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
With ConnectR for Hadoop and Revolution R Enterprise 6, R users can:
  • Interface directly with the HDFS filesystem from R.
  • Import big-data tables into R from Hadoop filestores via HBase.
  • Create big-data analytics by writing map-reduce tasks directly in the R language.


Programming using SAS
SAS' support for Hadoop is centered on a singular goal: helping you know more – faster – so you can make better decisions. Beyond accessing this tidal wave of data, SAS products and services create seamless and transparent access to more Hadoop capabilities such as the Pig and Hive languages and the MapReduce framework. SAS provides the framework for a richer visual and interactive Hadoop experience, making it easier to gain insights and discover trends.