Integrating R & Hadoop - Text Mining & Sentiment Analysis

Integrating R and Hadoop

Why R on Hadoop?

Storing and processing large amounts of data is a challenging job for existing statistical computer applications such as R

Statistical applications that hold data in memory on a single machine struggle to handle Big Data

Data management tools lack analytical and statistical capabilities

Both R and Hadoop have their own working environments

R provides the analytics and statistics functionality

Hadoop provides distributed storage (HDFS) and distributed processing (MapReduce)

Integrating R with Hadoop bridges the gap between these two applications

Analyse Hadoop data using R

Because R is one of the best-known statistical software packages, an analyst working with Hadoop may also want to use existing R packages with Hadoop

R is one of the most comprehensive statistical analysis packages available

R is free and open source software

R packages are powerful and widely used for statistical and data analysis

R can be used for parallel computing across multiple cores and clusters

Integration combines R's analytics with Hadoop's distributed processing, making the pair suitable for Big Data analytics

Enabling R on Hadoop

Functionality from open source R packages can be used when writing mapper and reducer functions

R and Hadoop can be integrated using:

• RHadoop
• RHIPE
• Segue
• R with Hadoop Streaming

Options for R on Hadoop

RHadoop Overview

RHadoop is an open source project that allows programmers to use MapReduce functionality directly in R code

Collection of R packages:

• rhdfs
• rmr2
• rhbase
• plyrmr

Mostly implemented in native R

When to use RHadoop

• For data exploration
• For data aggregation
• To make use of the parallel framework in Hadoop
• To sample data

RHadoop is mainly used for managing data and performing data analysis tasks with the Hadoop framework

RHadoop Packages Overview

About rhdfs

The rhdfs package provides basic connectivity to HDFS

Helps to browse, read, write, and modify files stored in HDFS

Its functions mirror standard HDFS commands:

• File manipulation: hdfs.copy, hdfs.move, hdfs.delete, hdfs.put, hdfs.get
• Directory handling: hdfs.dircreate, hdfs.mkdir


Sample rhdfs functions

• library(rhdfs) # Load the rhdfs library
• hdfs.init() # Initialize the rhdfs package
• hdfs.ls('/') # List the files and directories under the HDFS root
• hdfs.mkdir('/tmp/rhdfs-demo') # Create a new directory in HDFS (illustrative path)
• hdfs.rm('/tmp/rhdfs-demo') # Remove that directory from HDFS
• help(package = 'rhdfs') # List all functions in the rhdfs package

More examples later...


About rmr2

The rmr2 package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster

• Focuses on the analysis of very large data sets
• An alternative to Java for writing MapReduce programs
• Uses the Hadoop Streaming API to run MapReduce jobs written in R
• All components communicate via key-value pairs
• By default, it includes some functions for loading data to and from HDFS

MapReduce workflow in rmr2

The rmr2 package creates a client-side environment for MapReduce to execute map and reduce functions

Allows these functions to access variables defined outside their own scope

Works with the inputs and outputs of MapReduce jobs

Enables programmers to write R variables to HDFS and read them back

Function Categories in rmr2

For storing and retrieving data

• to.dfs(): To write R objects to HDFS
• from.dfs(): To read MapReduce output from HDFS into the R session

For MapReduce

• mapreduce(): For defining and executing MapReduce jobs
• keyval(): To create key-value pairs (keys() and values() extract the two parts)
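As a minimal sketch of the storage helpers, the snippet below round-trips a few key-value pairs through to.dfs() and from.dfs(). It assumes rmr2 is installed and switches to rmr2's local backend so that no Hadoop cluster is needed; the sample keys and values are made up for illustration.

```r
library(rmr2)

# Assumption: use the local backend so the example runs without a cluster.
rmr.options(backend = "local")

# Build key-value pairs (illustrative data) and write them to the
# (local stand-in for the) distributed file system.
pairs  <- keyval(c("a", "b", "c"), c(1, 2, 3))
handle <- to.dfs(pairs)

# Read the pairs back into the R session and split them into keys and values.
back <- from.dfs(handle)
print(keys(back))
print(values(back))
```

On a real cluster the same calls apply once the backend is left at its default ("hadoop"); to.dfs() and from.dfs() are intended for small data and for testing, not for moving very large data sets.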

MapReduce function syntax in rmr2

Syntax of the rmr2 function: mapreduce(input, output, map, reduce, input.format, output.format)

input: HDFS path for the input data

output: HDFS path for the output data

map / reduce: Map and reduce functions applied to the data

input.format / output.format: Data format, e.g. text, csv, json

Typically, the map and reduce functions use the keyval() helper to ensure that their output is key-value pairs
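The call pattern above can be sketched with a small job that sums the squares of odd and even numbers separately. This is only an illustration under assumptions: rmr2 is installed, the local backend is used instead of a cluster, and the names sq.map and sq.reduce are hypothetical. The output argument is omitted, so rmr2 writes to a temporary location.

```r
library(rmr2)

# Assumption: local backend, so no Hadoop cluster is required.
rmr.options(backend = "local")

# Put a small input where mapreduce() can read it.
input <- to.dfs(1:10)

# Map: key each number by its parity (0 = even, 1 = odd), value is its square.
sq.map <- function(k, v) keyval(v %% 2, v^2)

# Reduce: for each parity key, add up the squares.
sq.reduce <- function(k, vv) keyval(k, sum(vv))

out    <- mapreduce(input = input, map = sq.map, reduce = sq.reduce)
result <- from.dfs(out)

# Name each total by its key for easier reading.
totals <- setNames(values(result), keys(result))
print(totals)
```

Both functions return keyval(...), which is what keeps every stage of the job in key-value form.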

Text Analytics using RHadoop

How Text Mining Works with R and Hadoop

Lexical statistics: the study of word frequencies

Data mining techniques are used to identify relationships and patterns

Sentiment analysis is used to understand the underlying attitude

Tools like R and SAS offer statistical functionality

Handling large data sets requires technologies such as Hadoop
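Word frequency, the simplest lexical statistic, maps naturally onto MapReduce. The sketch below counts words with rmr2, assuming the package is installed and using the local backend; the three input lines and the names wc.map and wc.reduce are invented for illustration.

```r
library(rmr2)

# Assumption: local backend, so no Hadoop cluster is required.
rmr.options(backend = "local")

# Illustrative input: a few short lines of text.
lines <- c("the phone is nice",
           "the screen is cool",
           "the phone is expensive")

# Map: split each line into lowercase words and emit (word, 1).
wc.map <- function(k, text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]   # drop empty strings left by the split
  keyval(words, 1)
}

# Reduce: add up the 1s for each word.
wc.reduce <- function(word, counts) keyval(word, sum(counts))

out  <- mapreduce(input = to.dfs(lines), map = wc.map, reduce = wc.reduce)
res  <- from.dfs(out)
freq <- setNames(values(res), keys(res))
print(freq)
```

For text files already stored in HDFS, the input would instead be an HDFS path read with input.format = "text".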

Text Analysis Process

Steps Involved

• Information Extraction
• Data Mining
• Text Data Pre-Processing
• Post Processing Analysis

Sentiment Analysis

The study of analysing people's opinions, sentiments, attitudes, appraisals, and evaluations

• Also known as opinion mining
• An important component of text mining
• Extracts opinions and sentiment from end-user reviews
• Sentiment is further classified as positive, negative, or neutral

Parameters used in Sentiment Analysis

The process of sentiment analysis involves classification of given text on the basis of the following parameters:

• Polarity, which can be positive, negative, or neutral
• Emotional states, which can be sad, angry, or happy
• Scaling system or numeric values
• Subjectivity/objectivity
• Features based on key entities, such as the durability of furniture, the screen size of a cell phone, or the lens quality of a camera

How Sentiment Analysis Works

A Simple Sentiment Algorithm: This algorithm assigns a sentiment score by counting the occurrences of positive and negative words in each sentence

“I bought an iPhone few days back. It is really nice. The touch screen and voice quality are really cool. It is so better than my old Blackberry phone which was so hard to type with tiny keys. However iPhone is a bit expensive.”

• Positive words: nice, cool, better
• Negative words: hard, expensive
• Sentence sentiment score: total positive - total negative (3 - 2 = 1)
• Sentence sentiment polarity: Positive
• Overall score: sum of all sentence sentiment scores
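The counting rule above can be sketched in base R. The word lists contain only the words from this example (a real lexicon would be far larger), the function name score_sentence is hypothetical, and sentences are split on end punctuation. The 3 - 2 tally above covers the whole review, which equals the sum of the per-sentence scores computed here.

```r
# Illustrative word lists taken from the example; not a real lexicon.
positive_words <- c("nice", "cool", "better")
negative_words <- c("hard", "expensive")

# Score one sentence: (number of positive words) - (number of negative words).
score_sentence <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

review <- paste(
  "I bought an iPhone few days back. It is really nice.",
  "The touch screen and voice quality are really cool.",
  "It is so better than my old Blackberry phone which was so hard to type with tiny keys.",
  "However iPhone is a bit expensive."
)

# Split the review into sentences on ., ! or ? and score each one.
sentences <- unlist(strsplit(review, "[.!?]+\\s*"))
scores    <- vapply(sentences, score_sentence, integer(1), USE.NAMES = FALSE)

# Overall score is the sum of the sentence scores; its sign gives the polarity.
overall  <- sum(scores)
polarity <- if (overall > 0) "positive" else if (overall < 0) "negative" else "neutral"

print(scores)
print(overall)
print(polarity)
```

Because the score depends only on word counts, the same function could serve as the body of a map step when reviews are processed in bulk.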

Workflow

[Figure: process workflow of this sentiment analysis]

Any questions?