Integrating R and Hadoop
Why R on Hadoop?
Storing and processing large amounts of data is challenging for existing statistical applications such as R
Statistical applications that hold data in memory on a single machine struggle to handle Big Data
Data management tools lack analytical and statistical capabilities
Both R and Hadoop have their own working environments
R provides the analytics and statistics functionality
Hadoop provides a framework for distributed storage (HDFS) and distributed processing (MapReduce)
Integrating R with Hadoop bridges the gap between these two applications
Analyse Hadoop data using R
Because R is one of the best-known statistical software packages, an analyst working with Hadoop may also want to use existing R packages on Hadoop data
R is among the most comprehensive statistical analysis packages available
R is free and open source software
R packages are powerful and widely used for statistical and data analysis
R can be used for parallel computing across multiple cores and clusters
Integration combines the analytical power of R with the distributed processing power of Hadoop, making the pair suitable for Big Data analytics
Enabling R on Hadoop
Functionality from open source R packages can be used when writing mapper and reducer functions
R and Hadoop can be integrated using:
• RHadoop
• RHIPE
• Segue
• R with Hadoop Streaming
Options for R on Hadoop
RHadoop Overview
RHadoop is an open source project that allows programmers to use MapReduce functionality directly in R code
Collection of R packages:
• rhdfs
• rmr2
• rhbase
• plyrmr
Mostly implemented in native R
When to use RHadoop
• For data exploration
• For data aggregation
• To make use of the parallel framework in Hadoop
• To sample data
RHadoop is mainly used for managing data and performing data analysis tasks with the Hadoop framework
RHadoop Packages Overview
The rhdfs package provides basic connectivity to HDFS
Helps to browse, read, write, and modify files stored in HDFS
Its functions largely mirror standard HDFS commands
File manipulation: hdfs.copy, hdfs.move, hdfs.delete, hdfs.put, hdfs.get
Directory handling: hdfs.dircreate, hdfs.mkdir
About rhdfs
• library(rhdfs)           # Load the rhdfs library
• hdfs.init()              # Initialize the rhdfs package
• hdfs.ls('/')             # List the files and directories under the HDFS root
• hdfs.mkdir()             # Create a new directory in HDFS (pass the directory path)
• hdfs.rm()                # Remove a file or directory from HDFS (pass the path)
• help(package = 'rhdfs')  # List all functions of the rhdfs package
More examples later...
Sample rhdfs functions
The rmr2 package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster
• Focuses on the analysis of very large data sets
• An alternative to Java for writing MapReduce programs
• Uses the Hadoop Streaming API to write MapReduce jobs in R
• All components communicate via key-value pairs
• By default, it supports some HDFS data loading functions
About rmr2
MapReduce workflow in rmr2
The rmr2 package creates a client-side environment for MapReduce to execute map and reduce functions
Allows these functions to access variables outside their scope
Work with inputs and outputs of MapReduce
Enables programmers to write R objects to HDFS and read them back into R
Function Categories in rmr2
For storing and retrieving data
• to.dfs(): writes R objects to HDFS
• from.dfs(): reads data, such as MapReduce output, from HDFS into the R session
For MapReduce
• mapreduce(): defines and executes MapReduce jobs
• keyval(): creates key-value pairs (keys() and values() extract the two components)
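The storage helpers above can be tried without a cluster. The following is a minimal sketch, assuming the rmr2 package is installed and using its local backend (an assumption made here so the example runs on a single machine); the variable names are illustrative only.

```r
library(rmr2)

# Assumption: run against rmr2's local backend instead of a real Hadoop cluster.
rmr.options(backend = "local")

# to.dfs() writes an R object out and returns a handle to the stored data.
small.ints <- to.dfs(1:10)

# from.dfs() reads the data back into the R session as a key-value structure.
result <- from.dfs(small.ints)
read.back <- values(result)

# keyval() builds key-value pairs; keys() and values() pull them apart again.
kv <- keyval(c("a", "b"), c(1, 2))
kv.keys <- keys(kv)
kv.values <- values(kv)
```

On a real cluster the same calls apply once the backend option is left at its Hadoop default.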
MapReduce function syntax in rmr2
Syntax of the rmr2 function: mapreduce(input, output, map, reduce, input.format, output.format)
input: HDFS path for the input data
output: HDFS path for the output data
map/reduce: the map and reduce functions applied to the data
input.format/output.format: data format, e.g. text, csv, json
Typically, the map and reduce functions call the keyval helper function to ensure their output is key-value pairs
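A word count is a common way to see these arguments working together. This is a minimal sketch, assuming rmr2 is installed and run on its local backend; the three input lines are made-up sample data, and the input is an in-memory object written with to.dfs() rather than an HDFS path.

```r
library(rmr2)

# Assumption: local backend, so no Hadoop cluster is required for this sketch.
rmr.options(backend = "local")

# Hypothetical sample input written out with to.dfs().
lines <- to.dfs(c("big data with r", "r on hadoop", "hadoop and big data"))

wc <- mapreduce(
  input = lines,
  # Map: split each line into words and emit (word, 1) pairs.
  map = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
  # Reduce: called once per word with all of its counts; emit (word, total).
  reduce = function(word, counts) keyval(word, sum(counts))
)

# Read the job output back and turn it into a named vector of counts.
res <- from.dfs(wc)
word.counts <- setNames(values(res), keys(res))
```

Both functions return their results through keyval(), which is what keeps every stage of the job communicating in key-value pairs.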
Text Analytics using RHadoop
How Text Mining Works with R and Hadoop
Lexical statistics: the study of word frequencies
Data mining techniques are used to identify relationships and patterns
Sentiment analysis is used to understand the underlying attitude
Tools like R and SAS offer statistical functionality
Handling large data sets requires newer technologies such as Hadoop
Text Analysis Process
Information Extraction
Data Mining
Text Data Pre-Processing
Post Processing Analysis
Steps Involved
Sentiment Analysis
The study of analysing people's opinions, sentiments, attitudes, appraisals, and evaluations
• Also known as opinion mining
• An important component of text mining
• Extracts opinions and sentiment from end-user reviews
• Sentiment is further classified as positive, negative, or neutral
Parameters used in Sentiment Analysis
The process of sentiment analysis involves classifying a given text on the basis of the following parameters:
• Polarity, which can be positive, negative, or neutral
• Emotional states, which can be sad, angry, or happy
• Scaling system or numeric values
• Subjectivity/objectivity
• Features based on key entities, such as the durability of furniture, the screen size of a cell phone, or the lens quality of a camera
How Sentiment Analysis Works
A Simple Sentiment Algorithm: This algorithm assigns a sentiment score by simply counting the number of occurrences of “positive” and “negative” words in a sentence
“I bought an iPhone few days back. It is really nice. The touch screen and voice quality are really cool. It is so better than my old Blackberry phone which was so hard to type with tiny keys. However iPhone is a bit expensive.”
Positive words: nice, cool, better
Negative words: hard, expensive
Sentence sentiment score: total positive - total negative (3 - 2 = 1)
Sentence sentiment polarity: Positive
Overall score: sum of all sentence sentiment scores
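The counting scheme above can be sketched in a few lines of base R. This is a minimal illustration, not a production scorer: the function name score.sentiment and the tiny word lists are assumptions taken from the example, and the tokenizer simply lowercases the text and splits on non-letters.

```r
# Word lists taken from the example above (illustrative, not a real lexicon).
positive.words <- c("nice", "cool", "better")
negative.words <- c("hard", "expensive")

# Hypothetical helper: score = number of positive words - number of negative words.
score.sentiment <- function(text, pos, neg) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[words != ""]
  sum(words %in% pos) - sum(words %in% neg)
}

review <- paste(
  "I bought an iPhone few days back. It is really nice.",
  "The touch screen and voice quality are really cool.",
  "It is so better than my old Blackberry phone which was so hard to type with tiny keys.",
  "However iPhone is a bit expensive."
)

# Score the whole review at once.
score <- score.sentiment(review, positive.words, negative.words)
polarity <- if (score > 0) "Positive" else if (score < 0) "Negative" else "Neutral"

# Alternatively, score each sentence and sum the results for an overall score.
sentences <- unlist(strsplit(review, "[.!?]+"))
sentence.scores <- vapply(sentences, score.sentiment, integer(1),
                          pos = positive.words, neg = negative.words,
                          USE.NAMES = FALSE)
overall <- sum(sentence.scores)
```

Scoring sentence by sentence and summing gives the same total here as scoring the whole review, since each word is counted exactly once either way.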
Process workflow of this sentiment analysis
Workflow
Any questions?