Thursday, July 4, 2013

Big Data Analytics - For Beginners

“By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” Source: “Big data: The next frontier for innovation, competition, and productivity”, McKinsey, May 2011

Big data is moving from a relational to a chaotic world. Today, we already have a huge amount of data stored in a structured format in traditional relational databases, but unstructured, complex data from mixed sources and in multiple formats (text files, logs, binary, XML etc.) poses a huge problem. The challenge grows as data volumes move from terabytes (nicknamed "Terror Bytes" some time ago because of their size) to petabytes. On top of that, organizations today have a HUGE data management problem, with data held in silos and scattered everywhere. The ability to stitch together multiple sources of data is going to be the game changer.

The world desperately needed an answer to these challenges: a cheaper and faster way to store, process and compute on data irrespective of its size, format, structure or schema.

Apache Hadoop
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

MapReduce: At its core, MapReduce can take a query over a dataset, split it up and run it in parallel across multiple nodes. Distributing the query solves the issue of size and capacity. MapReduce can also be found inside MPP and NoSQL databases, such as Vertica or MongoDB.
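The map, shuffle and reduce steps can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names are made up for illustration, and the "nodes" are simulated by a loop over chunks of text.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node;
    # here the nodes are simulated by iterating over the chunks in turn.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    return reduce_phase(shuffle(mapped))


chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, adding more chunks (and more machines to process them) does not change the logic, which is what lets the approach scale with data size.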

Hadoop Distributed File System (HDFS™):  For that computation to take place, each server must have access to the data. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node will write its results back into HDFS. There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless.

PIG: Pig is a programming language that simplifies the tasks of loading data, transforming data and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Pig gives the developer more agility in exploring large datasets. It allows succinct scripts for transforming data flows to be developed and incorporated into larger applications, and it drastically cuts the amount of code needed compared with direct use of Hadoop’s Java APIs.
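The load, transform and store pattern described above can be sketched as a small data flow. The example below is plain Python rather than Pig Latin, and everything in it is an assumption made for illustration: the log format ("date level message"), the function names and the sample lines.

```python
import csv
import io
from collections import Counter


def load(lines):
    """LOAD: split each raw log line into (date, level, message)."""
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) == 3:  # skip malformed lines
            yield tuple(parts)


def transform(records):
    """FILTER + GROUP + COUNT: number of ERROR records per date."""
    return Counter(date for date, level, _ in records if level == "ERROR")


def store(counts, out):
    """STORE: write the result as tab-separated rows, sorted by date."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for date, n in sorted(counts.items()):
        writer.writerow([date, n])


# Hypothetical semi-structured log, including one malformed line.
raw_log = [
    "2013-07-01 ERROR disk full",
    "2013-07-01 INFO job started",
    "2013-07-02 ERROR timeout",
    "2013-07-01 ERROR disk full",
    "garbage",
]

out = io.StringIO()
store(transform(load(raw_log)), out)
result = out.getvalue()
print(result)
```

A Pig script expresses the same three stages as a handful of statements and leaves the distribution of the work across the cluster to Hadoop.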

A complete list of Hadoop modules:

Ambari: Deployment, configuration and monitoring
Flume: Collection and import of log and event data
HBase: Column-oriented database scaling to billions of rows
HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
HDFS: Distributed redundant file system for Hadoop
Hive: Data warehouse with SQL-like access
Mahout: Library of machine learning and data mining algorithms
MapReduce: Parallel computation on server clusters
Pig: High-level programming language for Hadoop computations
Oozie: Orchestration and workflow management
Sqoop: Imports data from relational databases
Whirr: Cloud-agnostic deployment of clusters
ZooKeeper: Configuration management and coordination

Who should use Hadoop?

Typically, any organization with more than 2 terabytes of data should consider Hadoop. "Anything more than 100 [terabytes], you absolutely want to be looking at Hadoop," said Josh Sullivan, a Vice President at Booz Allen Hamilton and founder of the Hadoop-DC Meetup group.

Case : Twitter

“Twitter users generate 12 terabytes of data a day - about four petabytes per year. And that amount is multiplying every year.”
With this massive amount of user-generated data, Twitter has to store it on clusters rather than on a single hard drive. Twitter uses Cloudera's Hadoop distribution to power its clusters.
Twitter uses all the data it collects to answer many questions. These range from simple computations, such as the number of requests and searches it serves every day, to complex comparative user analysis, such as determining how different users use the service or whether certain features help turn casual users into frequent users. Other areas of deep interest include determining which tweets get retweeted and differentiating between humans and bots.
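As a quick sanity check on the quoted volume, the daily figure can be converted to a yearly one. The sketch assumes decimal units (1 petabyte = 1,000 terabytes) and a 365-day year; neither is stated in the quote.

```python
TB_PER_DAY = 12       # figure from the quote above
DAYS_PER_YEAR = 365   # assumption: non-leap year
TB_PER_PB = 1000      # assumption: decimal units (use 1024 for binary)

tb_per_year = TB_PER_DAY * DAYS_PER_YEAR
pb_per_year = tb_per_year / TB_PER_PB

print(f"{tb_per_year} TB per year, about {pb_per_year:.2f} PB per year")
```

Under these assumptions the result is a little over four petabytes a year, in line with the "about four petabytes" in the quote.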

Frequently Asked Questions:

Programming using R
Revolution Analytics has developed “ConnectR for Hadoop,” a collection of capabilities that bring the power of advanced R analytics to Hadoop distributions, including those from its partners Cloudera, Hortonworks, IBM BigInsights and Intel. ConnectR for Hadoop provides the ability to manipulate Hadoop data stores directly from HDFS and HBase, and gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
With RevoConnectR for Hadoop and Revolution R Enterprise 6, R users can:
  • Interface directly with the HDFS filesystem from R.
  • Import big-data tables into R from Hadoop filestores via HBase.
  • Create big-data analytics by writing map-reduce tasks directly in the R language.

Programming using SAS
SAS' support for Hadoop is centered on a singular goal: helping you know more – faster – so you can make better decisions. Beyond accessing this tidal wave of data, SAS products and services create seamless and transparent access to more Hadoop capabilities such as the Pig and Hive languages and the MapReduce framework. SAS provides the framework for a richer visual and interactive Hadoop experience, making it easier to gain insights and discover trends.

Friday, June 28, 2013

Big Data Analytics : R and SAS programming

Big Data Analytics : Statistical R and SAS programming
When we discuss cost, we can't avoid the constant bickering over the choice of the right statistical software environment, namely R or SAS, and the pros and cons of each. In statistical analytics capability, SAS and R command the same respect, and we must agree that on some occasions one leads the other. However, it is argued that some of the latest cutting-edge techniques available in R are not available in SAS. We are not going to add to the huge amount of information already available on this topic; we would rather give clients worldwide the choice of the software in which they have already invested time or money.
Market Equations India offers clients a combination of rich Industry experience and a committed group of intellectuals from business, science and mathematics disciplines that are passionate about analytics and are comfortable and current with programming using different statistical software including R, SAS, SPSS and MATLAB.

Our expertise with the techniques used in R programming includes:
  • Reading data from various source files
  • Evaluating the cumulative distribution function, the probability density function and the quantile function
  • Examining the distribution of a set of data: stem and leaf plot
  • One or two sample tests: box plot, t-test, F-test, two-sample Wilcoxon test, Two-sample Kolmogorov-Smirnov test
  • Grouping, loops and conditional execution: if statements, for loops, repeat, and while loops
  • Writing R functions
  • Statistical modelling: regression analysis and the analysis of variance, generalized linear models, nonlinear regression models
  • Creating data graphics: High-level plotting functions, Low-level plotting functions, Interactive graphics functions
  • Accessing and installing R packages
  • Debugging
  • Organizing and commenting R code
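To make the distribution functions in the list above concrete: for the normal distribution, R provides them as dnorm (density), pnorm (cumulative distribution) and qnorm (quantile). The sketch below evaluates the same three functions in Python using the standard library's NormalDist class. It illustrates the concepts only and is not R code.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

density = z.pdf(0.0)          # probability density at 0 (R: dnorm(0))
prob = z.cdf(1.96)            # P(Z <= 1.96)            (R: pnorm(1.96))
quantile = z.inv_cdf(0.975)   # 97.5th percentile       (R: qnorm(0.975))

print(f"pdf(0)         = {density:.4f}")
print(f"cdf(1.96)      = {prob:.4f}")
print(f"inv_cdf(0.975) = {quantile:.4f}")
```

The quantile function is the inverse of the cumulative distribution function, so feeding the output of one into the other returns the starting value.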
Our knowledge in R programming extends to its comprehensive list of concepts including:
Accessing built-in datasets, Additive models, Analysis of variance, Arithmetic functions and operators, Arrays, Binary operators, Box plots, Character vectors, Concatenating lists, Control statements, Customizing the environment, Data frames, Density estimation, Determinants, Diverting input and output, Dynamic graphics, Eigenvalues and eigenvectors, Empirical CDFs, Generalized linear models, Generalized transpose of an array, Generic functions, Graphics device drivers, Graphics parameters, Grouped expressions, Indexing of and by arrays, Indexing vectors, Kolmogorov-Smirnov test, Least squares fitting, Linear equations, Linear models, Lists, Local approximating regressions, Loops and conditional execution, Matrices, Matrix multiplication, Maximum likelihood, Missing values, Mixed models, Named arguments, Namespace, Nonlinear least squares, One- and two-sample tests, Ordered factors, Outer products of arrays, Probability distributions, QR decomposition, Quantile-quantile plots, Reading data from files, Regular sequences, Removing objects, Robust regression, Search path, Shapiro-Wilk test, Singular value decomposition, Statistical models, Student's t test, Tabulation, Tree-based models, Updating fitted models, Wilcoxon test, Workspace, Writing functions.
Case Study : Statistical R Programming:
Market Equations helps a United Kingdom (UK) based e-retailer institutionalize sales and marketing analytics by building a correlation model linking Facebook "likes" and "fan" growth to sales. The model helps the retailer allocate marketing spend to the channels that maximize returns, reduce the cost of holding excess inventory, and retain clients by eliminating the possibility of stock-outs.

Thursday, June 27, 2013

Customer Preferences : Outsourcing Max Diff Analysis

Outsourcing Max-Differential Analysis to understand customer preferences

Maximum Difference Scaling (MaxDiff) is a statistical exercise in which respondents are shown sets of features or attributes (such as product features, product preference and usage) and pick the most important and the least important item in each set; importance scores are derived from these choices. Unlike a standard rating-scale exercise, this produces preference scores that show the relative importance of the attributes. A hierarchical Bayesian technique is used to derive the importance scores at the respondent level.

Case Study : Max Differential Analysis on survey data for a large utility vehicle manufacturer in the US

Objective : To collect opinions from current or potential consumers on several new product concepts and to better understand customer needs in a utility vehicle product.

Process & Methodology: Each respondent was given a set of questions containing some utility vehicle attributes (below) and asked to indicate the most and least important attribute.


  • Analysis and summary plots were obtained for responses from EACH of the questions.
  • Primary analysis was performed using a multinomial logit model to obtain the importance value of each attribute on a percent-share utility scale (the values add up to 100). These values are the easiest to interpret and were obtained by a probability-based rescaling of the raw utility scores.
  • Count analysis, starting from simple proportions of least and most important attributes, was also presented in support of the primary model-based analysis.
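The two analyses above can be sketched in a few lines. The case study does not give its exact formulas, so the code below rests on assumptions: shares are computed with the standard multinomial logit form exp(u) / sum of exp(u), scaled to 100, and the count score is taken as (times chosen most important minus times chosen least important) divided by times shown. The attribute names, utilities and responses are invented for illustration.

```python
import math
from collections import Counter


def percent_shares(raw_utilities):
    """Rescale raw logit utilities to shares that sum to 100."""
    # Subtracting the maximum before exponentiating improves numerical
    # stability and does not change the resulting shares.
    top = max(raw_utilities.values())
    weights = {a: math.exp(u - top) for a, u in raw_utilities.items()}
    total = sum(weights.values())
    return {a: 100.0 * w / total for a, w in weights.items()}


def count_scores(responses):
    """Best-minus-worst count per attribute, divided by times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for r in responses:
        shown.update(r["shown"])
        best[r["best"]] += 1
        worst[r["worst"]] += 1
    return {a: (best[a] - worst[a]) / shown[a] for a in shown}


# Hypothetical raw utilities from a fitted model.
utilities = {"towing": 1.2, "price": 0.4, "storage": -0.3, "colour": -1.3}
shares = percent_shares(utilities)

# Hypothetical responses: attributes shown, plus the most/least important pick.
responses = [
    {"shown": ["towing", "price", "colour"], "best": "towing", "worst": "colour"},
    {"shown": ["price", "storage", "colour"], "best": "price", "worst": "colour"},
    {"shown": ["towing", "storage", "price"], "best": "towing", "worst": "storage"},
]
scores = count_scores(responses)

for name in utilities:
    print(f"{name:8s} share={shares[name]:5.1f}  count score={scores[name]:+.2f}")
```

If the model fits well, the two measures should rank the attributes in broadly the same order, which is why the count analysis works as a supporting check.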

Outcome: Our analysis helped the utility vehicle manufacturer better understand its target audience and devise need-based strategies from customer feedback on the features that had the highest importance value for the customer.

Customer Analytics

Customer Analytics Outsourcing Services

"Data today is being termed as the "new currency", the "new oil", the new "natural resource" and yet it is surprising that most organizations do not use this huge arsenal of data available to improve decision making and drive results." 

Embracing data in any size, shape or form helps organizations transform their huge customer data inventory into actionable insights through the use of extensive data analytics and predictive modeling services. 

Organizations that institutionalize extensive Customer Analytics into their decision management and reporting systems stand out and stay ahead of their competition. They have a clear and precise understanding of their customer base, and they treat data as a business asset to be nurtured through analytics-driven data transformation that delivers actionable insights and impact-based results.

Customer Analytics & Reporting outsourcing services include:

  • Campaign Design & Tracking
  • Customer Segmentation & Profiling
  • Life Time Value Modeling
  • Propensity Modeling
  • Customer Churn Analytics
  • Customer Loyalty Analytics
  • Customer Satisfaction Analytics
  • Spend Optimization Analytics
  • Retention Prediction Scorecards
  • Revival Scorecards and Segmentation
  • Early Warning Churn Prediction Model
  • Market Basket Analysis
  • Cross-sell / Up-sell
  • New/Inline Product Forecasts
  • Cross Channel Effectiveness
  • Demand, Supply and Inventory Planning

You may find the below Case Studies worth a read.

Case Study: Customer Churn Analytics for a large Telecommunications provider in the United States
Market Equations India developed a Customer Churn Analysis Scorecard service for a large telecom service provider in the United States to identify key churn drivers and help it retain subscribers by implementing churn prevention strategies.
Case Study: Cross Sell Analytics strategies on a financial portfolio
Market Equations India helps a leading financial services group leverage its huge customer database to attract customers towards its various other financial products using innovative and smartly designed cross sell strategies. 
Case Study: Customer Portfolio analytics and Loan performance optimization services
Market Equations India helps one of the largest Car rental dealerships in the US build incisive and comprehensive predictive models to help the dealership predict profitable future loans while avoiding unprofitable loans, design optimum pricing strategies and optimize portfolio performance to maximize revenue. 