Big Data Science 1
Big Data Science: Fundamentals, Techniques, and Challenges
(HDFS & Map-Reduce)
2014. 6. 27.
Incheon Paik
University of Aizu, Japan
Tutorial, IEEE SERVICE 2014 Anchorage, Alaska
Big Data Science 2
Contents
What is Big Data?
Big Data Source
Clustered File System & Distributed File
System
Hadoop Distributed File System (HDFS)
Map-Reduce Operation
Application of Map-Reduce
Big Data Science 3
Overview of Big Data Science
What is Big Data?
Recently, very large and complex data sets have been generated from nature, sensors, social networks, and enterprises, increasingly supported by high-speed computers and networks.
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Big Data Science 4
Overview of Big Data Science
Several Areas
Meteorology
Genomics
Connectomics
Complex physics simulations
Biological and environmental research
Internet search
Finance and business informatics
Data Set Sources
Ubiquitous information-sensing mobile devices
Aerial sensory technologies (remote sensing)
Software logs
Cameras
Microphones
Radio-frequency identification readers
Wireless sensor networks
Big Data Science 5
Clustered File System
A file system which is shared by being
simultaneously mounted on multiple servers.
Clustered file systems can provide features like
Location-independent addressing
Redundancy, which improves reliability or reduces the complexity of the other parts of the cluster
Parallel file systems are a type of clustered file system that spreads data across multiple storage nodes, usually for redundancy or performance.
Clustered File System (DAS/NAS/SAN)
Big Data Science 6
Source: www.cgw.com
Fundamental Differences in DAS/NAS/SAN
Big Data Science 7
Source: he.wikipedia.org
Storage Area Network
Big Data Science 8
Source: pccgroup.com
Big Data Science 9
Shared-Disk / Storage Area Network
A shared-disk file-system uses a storage-area
network (SAN) to provide direct disk access from
multiple computers at the block level.
Access control and translation from file-level
operations (application) to block-level operations (by
SAN) must take place on the client node.
Adds a mechanism for concurrency control which gives a consistent and serializable view of the file system, avoiding corruption and unintended data loss.
Usually employs some sort of fencing mechanism to prevent data corruption in case of node failures.
Big Data Science 10
Shared-Disk / Storage Area Network
The SAN may use any of a number of block-level
protocols, including SCSI, iSCSI, HyperSCSI,
ATA over Ethernet, Fibre Channel, etc.
There are different architectural approaches to a
shared-disk file-system.
Distribute file information across all the servers in a
cluster (fully distributed).
Utilize a centralized metadata server.
Both achieve the same result of enabling all servers to
access all the data on a shared storage device.
Big Data Science 11
Network Attached Storage (NAS)
Source: www.boot.lv
Big Data Science 12
Network Attached Storage (NAS)
Provides both storage and a file system, like a
shared disk file system on top of a SAN.
Typically uses file-based protocols (as opposed to
block-based protocols a SAN would use) such as
NFS, SMB/CIFS, AFP, or NCP.
Design Considerations
Avoiding single point of failure: Fault tolerance and high
availability by data replication of one sort or another
Performance: fast disk-access time and small amount of
CPU-processing time over distributed structure
Concurrency: for consistent and efficient multiple accesses
to the same file or block by concurrency control or locking
which may either be built into the file system or provided by
an add-on protocol
Big Data Science 13
NAS vs. SAN
Source: turbotekcomputer.com
Big Data Science 14
Distributed File System
Distributed file systems do not share block level
access to the same storage but use a network
protocol.
These are commonly known as network file
systems, even though they are not the only file
systems that use the network to send data.
Distributed file systems can restrict access to the
file system depending on access lists or
capabilities on both the servers and the clients,
depending on how the protocol is designed.
Big Data Science 15
Distributed File System
The difference between a distributed file
system and a distributed data store
Distributed file system (DFS) allows files to be
accessed using the same interfaces and
semantics as local files - e.g.
mounting/unmounting, listing directories,
read/write at byte boundaries, system's native
permission model.
Distributed data stores, by contrast, require
using a different API or library and have
different semantics (most often those of a
database).
Big Data Science 16
Distributed File System
Design Goal of DFS
Access transparency: Clients are unaware that files
are distributed and can access them in the same way
as local files are accessed.
Location transparency: A consistent name space
exists encompassing local as well as remote files. The
name of a file does not give its location.
Concurrency transparency: All clients have the
same view of the state of the file system. This means
that if one process is modifying a file, any other
processes on the same system or remote systems
that are accessing the files will see the modifications
in a coherent manner.
Big Data Science 17
Distributed File System
Design Goal of DFS
Failure transparency: The client and client programs
should operate correctly after a server failure.
Heterogeneity: File service should be provided across
different hardware and operating system platforms.
Scalability: The file system should work well in small
environments (1 machine, a dozen machines) and also
scale gracefully to huge ones (hundreds through tens
of thousands of systems).
Replication transparency: To support scalability, we
may wish to replicate files across multiple servers.
Clients should be unaware of this.
Migration transparency: Files should be able to move
around without the client's knowledge.
Big Data Science 18
Distributed File System Examples
GFS (Google Inc.)
Ceph (Inktank)
Windows Distributed File System (DFS) (Microsoft)
FhGFS (Fraunhofer)
GlusterFS (Red Hat)
Lustre
Ibrix
Big Data Science 19
Existing file systems for large file storage, such as NAS or SAN, are expensive and require very high-performance servers.
HDFS can combine web-server-class commodity hosts into one large disk storage.
Some high-performance large storage systems will still be needed: SAN is suitable for DBMS workloads, NAS for safe file storage.
It can process big data concurrently using Map-Reduce.
Four Fundamental Objectives of HDFS
Error recovery
Access data using streaming
Large data storage
Data integrity
Hadoop Distributed File System (HDFS)
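As a minimal illustration of how a program talks to HDFS (this sketch is not from the original slides; the NameNode address and paths are assumptions), a file can be written through Hadoop's Java FileSystem API, and HDFS then splits it into blocks and replicates them transparently:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");           // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
            out.writeUTF("Hello HDFS");                          // HDFS splits and replicates the data
        }
        System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}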
Big Data Science 20
Hadoop Distributed File System
Big Data Science 21
Block Structured File System
[Figure] A 320 MB file is divided into five blocks (Block 1 to Block 5); HDFS replicates each block on three of the four data nodes, so every data node stores a different subset of the blocks.
File Copy Structure on HDFS
Big Data Science 22
Block Structured File System
Source:bradhedlund.com
Big Data Science 23
Block Structured File System
Source:bradhedlund.com
Big Data Science 24
HDFS, NoSQL, Map-Reduce
NoSQL
A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases.
Motivations for this approach include simplicity of
design, horizontal scaling and finer control over
availability.
NoSQL databases are often highly optimized key–
value stores intended for simple retrieval and
appending operations, with the goal being significant
performance benefits in terms of latency and
throughput.
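To make the key-value access pattern concrete, the following is a minimal in-memory sketch in Java; it only illustrates retrieval and appending by key and is not the API of any particular NoSQL product.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class SimpleKeyValueStore {
    // No schema, no SQL, no joins: values are looked up and appended by key only.
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();

    public void append(String key, String value) {              // appending operation
        store.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    public List<String> get(String key) {                        // simple retrieval operation
        return store.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        SimpleKeyValueStore kv = new SimpleKeyValueStore();
        kv.append("user:42", "clicked:/home");
        kv.append("user:42", "clicked:/search");
        System.out.println(kv.get("user:42"));                   // [clicked:/home, clicked:/search]
    }
}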
Big Data Science 25
Map Operation
What is a Map operation?
Doing something to every element in an array is a common operation. The operation can be considered as a Map function.
int[] a = {1, 2, 3};
for (int i = 0; i < a.length; i++)
    a[i] = a[i] * 2;
The new value of the variable a would be {2, 4, 6}.
This can also be written as a function.
Map-Reduce Operation
Source:bigdatauniversity.com
Big Data Science 26
Map Operation
What is a Map operation?
Doing something to every element in an array is a common operation. The operation can be considered as a Map function.
int[] a = {1, 2, 3};
for (int i = 0; i < a.length; i++)
    a[i] = fx(a[i]);
...like this, where fx is a function defined as:
function fx(x) { return x * 2; }
The new value of the variable a would be {2, 4, 6}.
All of this can also be converted into a "map" function.
Big Data Science 27
Map Operation
What is a Map operation?
...like this, where fx is a function passed as an argument:
function map(fx, a) {
    for (i = 0; i < a.length; i++)
        a[i] = fx(a[i]);
}
Here fx is a function whose definition is included in the call.
You can invoke this map function as follows:
map(function(x) { return x * 2; }, a);
Big Data Science 28
Reduce Operation
What is a Reduce operation?
Another common operation on arrays is to combine all their values:
function sum(a) {
    s = 0;
    for (i = 0; i < a.length; i++)
        s += a[i];
    return s;
}
This can also be written so that the combining step is a function.
Big Data Science 29
Reduce Operation
What is a Reduce operation?
Another common operation on arrays is to combine all their values:
function sum(a) {
    s = 0;
    for (i = 0; i < a.length; i++)
        s = fx(s, a[i]);
    return s;
}
Like this, where a function fx is defined so that it adds its arguments:
function fx(a, b) { return a + b; }
The whole function sum can also be rewritten so that fx is passed as an argument, as a reduce operation:
reduce(function(a, b) { return a + b; }, a, 0);
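The same two ideas can be written in Java (an illustrative sketch using the standard Stream API, not part of the original slides): mapping applies a function to every element, and reducing folds all elements into one value.
import java.util.Arrays;

public class MapReduceIdea {
    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        // "Map": apply a function to every element of the array.
        int[] doubled = Arrays.stream(a).map(x -> x * 2).toArray();   // {2, 4, 6}
        // "Reduce": combine all elements into one value, starting from 0.
        int sum = Arrays.stream(doubled).reduce(0, (s, x) -> s + x);  // 12
        System.out.println(Arrays.toString(doubled) + ", sum = " + sum);
    }
}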
Big Data Science 30
Submitting a MapReduce Job
Source:bigdatauniversity.com
Big Data Science 31
Shuffle Process
[Figure] Input splits feed Map1, Map2 and Map3; the map output data passes through a memory buffer, is partitioned and spilled to files, then merged, copied to the reducers, and sorted before Reduce1 and Reduce2 run.
Stages: 1. Split Creation, 2. Map, 3. Spill, 4. Merge, 5. Copy, 6. Sort, 7. Reduce
Source: Beginning Hadoop Programming
MapReduce – Type and Data Flow
Lists
Big Data Science 32
Source:bigdatauniversity.com
HADOOP MapReduce Program : WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
// Mapper Implementation
public static class Map extends
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
// Reducer Implementation
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable(0);
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values)
sum += value.get();
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Parameter Passing
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
args = parser.getRemainingArgs();
// Creating of Job
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class);
// Assign Classes for Mapper/Reducer
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// Other Setting for Job
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// Input/output paths come from the remaining command-line arguments
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Submit the job and wait for completion
job.waitForCompletion(true);
}
}
Big Data Science 33
Big Data Science 34
Map-Reduce Architecture on Hadoop
[Figure] A Client submits jobs to the Job Tracker; each Data Node runs a Task Tracker, and each Task Tracker executes the Map and Reduce tasks assigned to it.
Source: Beginning Hadoop Programming
Big Data Science 35
Client: the Map-Reduce API provided by Hadoop, and the Map-Reduce program executed by a user.
Job Tracker
A Map-Reduce program is managed as a work unit called a "Job".
Manages scheduling and monitoring of all jobs on the Hadoop cluster.
Usually runs on the Name Node server, but not necessarily.
Description of Job Tracker
When a user requests a new job, the Job Tracker calculates how many Maps and Reduces will be executed for the job.
Map-Reduce Architecture on Hadoop
Big Data Science 36
Description of Job Tracker
A Map-Reduce program executed by a client through Hadoop is managed as a unit called a "Job".
The Job Tracker schedules and monitors all jobs registered on the Hadoop cluster, but it does not have to run on the Name Node server.
It decides which Task Trackers will execute the Maps and Reduces and assigns the job to those Task Trackers; the Task Trackers then execute the Map-Reduce programs.
The Job Tracker communicates with the Task Trackers through heartbeats to check Task Tracker status and job execution information.
Map-Reduce Architecture on Hadoop
Big Data Science 37
Task Tracker
A daemon program that executes a user's Map-Reduce program on a Data Node.
Creates the Map tasks and Reduce tasks that the Job Tracker requested.
Executes the tasks (Maps and Reduces) by running a new JVM for each.
Workflow of Map-Reduce
A user executes a job by invoking the waitForCompletion method.
A Job Client object is created in the Job interface, and the object contacts the Job Tracker to execute the job.
Map-Reduce Architecture on Hadoop
Big Data Science 38
Map-Reduce Operation on Hadoop
[Figure] A Job Client on the client node submits a Job to the Job Tracker on the Name Node; Task Trackers on the Data Nodes run the Mapper, Partitioner, and Reducer over key-value pairs and write the output data.
Steps: 1. Map-Reduce job execute, 2. New job submit, 3. Input split creation, 4. Job assignment, 5. Mapper execution, 6. Sort and merge, 7. Reducer execution, 8. Output data save
Source: Beginning Hadoop Programming
Big Data Science 39
Workflow of Map-Reduce
The Job Tracker returns a new job ID to the Job Client, and the Job Client checks the output path defined by the user.
The Job Client calculates the input splits (the portions of the input file to be processed by each Map task) for the job's input data. After calculating the input splits, it saves the input split information, the configuration file, and the Map-Reduce JAR file to HDFS, and notifies the Job Tracker that preparation for starting the job is complete.
The Job Tracker registers the job in the queue, and the job scheduler initializes the job from the queue. The job scheduler then creates Map tasks according to the input split information and assigns them IDs.
Map-Reduce Architecture on Hadoop
Big Data Science 40
Workflow of Map-Reduce
The Task Tracker reports its status to the Job Tracker periodically by calling the heartbeat method. The heartbeat information includes resource information about CPU, memory, and disk, the number of running tasks, the maximum number of available task slots, and whether new tasks can be executed.
The Task Tracker executes the Map task assigned by the Job Tracker. The Map task executes the logic in the map method and saves the output data to a memory buffer. At this point, a partitioner decides which Reduce task each output record should be transferred to and assigns it a suitable partition. The data in memory is sorted by key and saved to the local hard disk; when the save is complete, the spill files are sorted and merged into one file.
Map-Reduce Architecture on Hadoop
Big Data Science 41
Workflow of Map-Reduce
A Reduce task runs the Reducer class written by the user. The task can execute only when all Map output data is ready. When a Map task completes its output, it notifies the Task Tracker that ran it; the Task Tracker then reports the status of that Map task and the path of its output data to the Job Tracker.
The Reduce task copies the output data of all Map tasks and merges it. When the merge is complete, it executes the analysis logic by calling the reduce method.
The Reduce task saves its output data to HDFS under the name "part-nnnnn", where "nnnnn" is the partition ID.
Map-Reduce Architecture on Hadoop
Big Data Science 42
Data Types
Map-Reduce data types are optimized for network communication, and Hadoop provides the "WritableComparable" interface for them.
All data types used for keys and values in a Map-Reduce program should implement the "WritableComparable" interface.
The "WritableComparable" interface inherits the "Writable" and "Comparable" interfaces.
The "write" method serializes a data value, and the "readFields" method deserializes (reads back) the serialized data.
Component of Map-Reduce Programming
public interface Writable {
void write (DataOutput out) throws IOException;
void readFields (DataInput in) throws IOException;
}
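For illustration only (the class name and fields below are hypothetical, not from the slides), a custom key type implements WritableComparable by providing write, readFields, and compareTo:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (word, docId) usable as a MapReduce key type.
public class WordDocKey implements WritableComparable<WordDocKey> {
    private String word = "";
    private int docId;

    public void set(String word, int docId) { this.word = word; this.docId = docId; }

    public void write(DataOutput out) throws IOException {      // serialize
        out.writeUTF(word);
        out.writeInt(docId);
    }

    public void readFields(DataInput in) throws IOException {   // deserialize
        word = in.readUTF();
        docId = in.readInt();
    }

    public int compareTo(WordDocKey other) {                     // sort order used in the shuffle
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Integer.compare(docId, other.docId);
    }
}
In practice such a key should also override hashCode and equals so that partitioning and grouping behave consistently.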
Big Data Science 43
Data Types
Wrapper classes implementing the "WritableComparable" interface are provided for the common data types:
BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, NullWritable.
InputFormat
The "InputFormat" abstract class defines how input data is split and read so that it can be passed as input to the "map" method.
Component of Map-Reduce Programming
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}
Big Data Science 44
InputFormat
"getSplits": computes the input splits that the Map tasks will process.
"createRecordReader": creates a "RecordReader" object so that the "map" method can consume its input split as a list of key-value records.
The "map" method executes its analysis logic by reading keys and values from the RecordReader object.
Kinds of the InputFormat: TextInputFormat,
KeyValueTextInputFormat, NLineInputFormat,
DelegatingInputFormat, CombineFileInputFormat,
SequenceFileInputFormat,
SequenceFileAsBinaryInputFormat,
SequenceFileAsTextInputFormat
Component of Map-Reduce Programming
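A brief sketch (assuming a Job object named job, as in the WordCount example): the InputFormat is chosen when configuring the job; TextInputFormat is the default when nothing is set.
// Each input line is split at the first tab into key and value (illustrative choice).
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);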
Big Data Science 45
Mapper Class
Carries out the map function of Map-Reduce programming.
It receives input data consisting of keys and values, then processes and classifies the data to create a new list of key-value pairs.
Class definition: parameterized as <Input Key Type, Input Value Type, Output Key Type, Output Value Type>.
Context object: used to get information about the job and to read the input split record by record.
The RecordReader object is passed to the map method to read data in the form of keys and values.
A Map programmer overrides the map method.
When the run method is executed, the map method is called for every key-value pair read from the Context object.
Component of Map-Reduce Programming
Big Data Science 46
Mapper Class
Component of Map-Reduce Programming
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public class Context extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer, StatusReporter reporter,
InputSplit split) throws IOException, InterruptedException {
super(conf, taskid, reader, writer, committer, reporter, split);
}
}
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
}
Big Data Science 47
Partitioner
Decides to which Reduce task the output data of a Map task will be passed.
The partition is calculated as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
According to the result, the partition is created on the node where the Map task was executed, and the output data of the Map task is saved there. When the Map task is complete, the data in each partition is transmitted to the corresponding Reduce task over the network.
The getPartition method can be overridden to implement a different partitioning strategy (see the sketch after the code below).
Component of Map-Reduce Programming
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
public int getPartition(K2 key, V2 value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
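As a hedged sketch of an alternative strategy (hypothetical class, written against the newer org.apache.hadoop.mapreduce Partitioner rather than the older interface shown above): send all words starting with the same letter to the same Reduce task.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: partition word keys by their first character.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.getLength() == 0) {
            return 0;
        }
        int first = Character.toLowerCase(key.toString().charAt(0));
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Registered on the job with: job.setPartitionerClass(FirstLetterPartitioner.class);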
Big Data Science 48
Reducer
The Reducer class receives the output data of the Map tasks as its input and executes aggregation operations on it.
Class definition: parameterized as <Input Key Type, Input Value Type, Output Key Type, Output Value Type>.
Like the Mapper class, it has a Context object, which inherits from ReduceContext: it consults the job for information and exposes the input value list through a RawKeyValueIterator.
A RecordWriter is passed as a parameter so that the output of the reduce method can be written record by record.
A Reducer programmer overrides the reduce method.
Component of Map-Reduce Programming
Big Data Science 49
Reducer Class
Component of Map-Reduce Programming
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public class Context extends ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RawKeyValueIterator input, Counter inputKeyCounter, Counter inputValueCounter,
RecordWriter<KEYOUT,VALUEOUT> output, OutputCommitter committer, StatusReporter reporter,
RawComparator<KEYIN> comparator, Class<KEYIN> keyClass, Class<VALUEIN> valueClass
) throws IOException, InterruptedException {
super(conf, taskid, input, inputKeyCounter, inputValueCounter,
output, committer, reporter, comparator, keyClass, valueClass);
}
}
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context ) throws IOException, InterruptedException {
for(VALUEIN value: values) { context.write((KEYOUT) key, (VALUEOUT) value); }
}
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKey()) { reduce(context.getCurrentKey(), context.getValues(), context); }
cleanup(context);
}
} // end of class Reducer
Big Data Science 50
Combiner Class
Shuffle: the transfer of data between the Map tasks and the Reduce tasks.
To improve the performance of the whole job, the amount of data transferred over the network during the shuffle should be reduced.
The Combiner class takes the output data of the Mapper as input and produces reduced (locally aggregated) data on the local node before it is sent over the network, as in the sketch below.
The Reducer should produce the same output in both cases: with the Combiner and without it.
Component of Map-Reduce Programming
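A minimal sketch (reusing the WordCount classes shown earlier): because summing counts is associative and commutative, the Reduce class itself can serve as the Combiner, pre-aggregating <word, 1> pairs on the map-side node before the shuffle.
// In the WordCount driver, after creating the Job:
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);   // local pre-aggregation before data crosses the network
job.setReducerClass(Reduce.class);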
Big Data Science 51
OutputFormat
The output data format of a Map-Reduce job is selected with the setOutputFormatClass method of the Job interface, as in the sketch below.
An output data format is made by inheriting the abstract class "OutputFormat".
Kinds of OutputFormat: TextOutputFormat, SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, FilterOutputFormat, LazyOutputFormat, NullOutputFormat.
Most of these classes inherit the FileOutputFormat class, which inherits the OutputFormat class.
Component of Map-Reduce Programming
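A brief sketch (assuming the same Job object): the default is TextOutputFormat, which writes tab-separated key/value lines; choosing SequenceFileOutputFormat instead writes a binary file that is convenient as input to a following job.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// ...
job.setOutputFormatClass(SequenceFileOutputFormat.class);        // instead of the default TextOutputFormat
SequenceFileOutputFormat.setOutputPath(job, new Path("output")); // output directory is illustrative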
52
Document Frequency
But document frequency (df) may be better:
df = number of docs in the corpus containing the term
Word        cf      df
pandemic    10777   20
influenza   10440   4000
Document/collection frequency weighting is only possible in a known (static) collection.
So how do we make use of df?
Big Data Science
53
TF × IDF term weights
The TF × IDF measure combines:
Term Frequency (TF)
― or wf, some measure of term density in a doc
Inverse Document Frequency (IDF)
― a measure of the informativeness of a term: its rarity across the whole corpus
― could just be the reciprocal of the number of documents the term occurs in (IDFi = 1/DFi)
― but by far the most commonly used version is:
IDFi = log(n / DFi)
Big Data Science
54
Summary : TF × IDF (or tf.idf)
Assign a tf.idf weight to each term i in each
document d
Increases with the number of occurrences within a doc
Increases with the rarity of the term across the whole corpus
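Concretely, the standard weight combining the two factors (a textbook formula consistent with the slides) is:
w_{i,d} = \mathrm{tf}_{i,d} \times \log\left(\frac{n}{\mathrm{df}_i}\right)
where tf_{i,d} is the frequency of term i in document d, df_i is the number of documents containing term i, and n is the total number of documents in the corpus.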
Big Data Science
HADOOP MapReduce Program : Calculating TF-IDF
Function: Calculating TF-IDF using Map-Reduce on Hadoop
Input: words in documents // set of words that are split
Output: TF-IDF for each word // category of words set
1st Map function:
  for each word output <word@document title, 1>
1st Reduce function:
  for each word@document title
    n = number of occurrences of the word in the document
    output "word@document title, n"
2nd Map function:
  for each word@document title output <document title, word=n>
2nd Reduce function:
  for each document
    N = total number of words in the document
    for each word output "word@document title, n/N"
3rd Map function:
  for each word@document title output <word, document title=n/N>
3rd Reduce function:
  for each word
    D = total number of documents
    d = number of documents containing the word
    calculate the TF-IDF value using n, N, d and D
    output "word@document title, TF-IDF value"
Big Data Science 55
Big Data Science 56
Hadoop Eco System
Source: www.manojrpatil.com
Big Data Science 57
Hadoop Eco System
Hadoop HDFS and Map-Reduce are useful for big data processing, but they target Java developers and provide only low-level processing.
The ecosystem makes Hadoop more usable and accessible to non-specialists in a user-friendly manner.
The components are not all meant to be used together, but rather act as parts of a single organism; some may even seek to solve the same problem in different ways.
Big Data Science 58
YARN
Yet Another Resource Negotiator (YARN) addresses
problems with MapReduce 1.0’s architecture, specifically
with the JobTracker service.
YARN "split[s] up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons. The idea
is to have a global Resource Manager (RM) and per-
application ApplicationMaster (AM)." (source: Apache)
Thus, rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
Big Data Science 59
Hadoop Eco System
Avro
A framework for performing remote procedure calls and data
serialization. In the context of Hadoop, it can be used to pass
data from one program or language to another, e.g. from C to Pig.
BigTop
A project for packaging and testing the Hadoop ecosystem. Much
of BigTop's code was initially developed and released as part of
Cloudera's CDH distribution, but has since become its own
project at Apache.
Chukwa
A data collection and analysis system built on top of HDFS and
MapReduce. Tailored for collecting logs and other data from
distributed monitoring systems, Chukwa provides a workflow that
allows for incremental data collection, processing and storage in
Hadoop.
Big Data Science 60
Hadoop Eco System
Drill
A distributed system for executing interactive analysis over large-
scale datasets. Some explicit goals of the Drill project are to
support real-time querying of nested data and to scale to clusters
of 10,000 nodes or more.
Flume
A tool for harvesting, aggregating and moving large amounts of
log data in and out of Hadoop.
HBase
Based on Google's Bigtable, HBase "is an open-source,
distributed, versioned, column-oriented store" that sits on top of
HDFS. HBase is column-based rather than row-based, which
enables high-speed execution of operations performed over
similar values across massive data sets, e.g. read/write
operations that involve all rows but only a small subset of all
columns.
Big Data Science 61
Hadoop Eco System
HCatalog
A metadata and table storage management service for HDFS.
HCatalog depends on the Hive metastore and exposes it to other
services such as MapReduce and Pig with plans to expand to
HBase using a common data model.
Hive
Provides a warehouse structure and SQL-like access for data in
HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's
query language, HiveQL, compiles to MapReduce. It also allows
user-defined functions (UDFs).
Oozie
A job coordinator and workflow manager for jobs executed in
Hadoop, which can include non-MapReduce jobs.
Big Data Science 62
Hadoop Eco System
Mahout
Mahout is a scalable machine-learning and data mining library.
There are currently four main groups of algorithms in Mahout:
― recommendations, a.k.a. collective filtering
― classification, a.k.a categorization
― clustering
― frequent itemset mining, a.k.a parallel frequent pattern mining
Pig
Framework consisting of a high-level scripting language (Pig
Latin) and a run-time environment that allows users to execute
MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin
is a higher-level language that compiles to MapReduce.
Big Data Science 63
Hadoop Eco System
Sqoop
Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both
directions between relational systems and HDFS or other Hadoop
data stores, e.g. Hive or HBase.
ZooKeeper
A service for maintaining configuration information, naming,
providing distributed synchronization and providing group services.
Spark (UC Berkeley)
A parallel computing program which can operate over any Hadoop
input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an
open-source project at the U.C. Berkeley AMPLab, and in its own
words, Spark "was initially developed for two applications where
keeping data in memory helps: iterative algorithms, which are
common in machine learning, and interactive data mining."
Big Data Science 64
Apache Pig
High-Level Data Flow Language
Made of two components:
Data processing language Pig Latin
Complier to translate Pig Latin to MapReduce
It abstracts the programmer from specific details and allows them to focus on data processing
Big Data Science 65
Pig in the Hadoop Ecosystem
Source: sigmanac.com
Big Data Science 66
Pig Latin
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
Big Data Science 67
Apache Hive: SQL for Hadoop
Data Warehousing Layer on top of Hadoop
Allows analysis and queries using a SQL-like language
It is best suited for data analysts familiar with SQL who need to do ad-hoc queries, summarization, and data analysis
Source: sigmanac.com
Big Data Science 68
Hive Architecture
Big Data Science 69
Hive Example
CREATE TABLE users (name STRING, age INT);
CREATE TABLE pages (user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
SORT BY clicks DESC
LIMIT 10;
Big Data Science 70
Mahout in Hadoop
Copy source: http://cloud.watch.impress.co.jp/ Copyright: NTT Data Co.
Mahout Package
71
What does Mahout provide:
Several Data Mining Algorithms
Classification
Clustering
Association Analysis
Recommendations
Others
Classification
72
Assigning objects to one of several predefined categories:
Politics, Economics, Sports, ... of Reuters newspaper articles
Some figures of galaxies
Classifying credit card transactions as legitimate or fraudulent
Classifying code as malicious or not
Algorithms
Naïve Bayes
Decision Tree
SVM
Linear Regression
Neural Network
Rule-based Methods
Clustering
Grouping objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Partitional Clustering / Hierarchical Clustering
K-Means, Fuzzy K-Means, Density-Based, ...
Different distance measures: Manhattan, Euclidean, other domain-dependent methods ...
73
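For reference, the two distance measures named above are, for m-dimensional points x and y (standard definitions):
d_{\mathrm{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{m}(x_i - y_i)^2}, \qquad d_{\mathrm{Manhattan}}(x,y) = \sum_{i=1}^{m}\lvert x_i - y_i\rvert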
Association Analysis
Find the frequent item
sets in transaction DB
<milk, bread, cheese> are
sold frequently together
Apriori principle
Market analysis, access
pattern analysis, etc…
74
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Recommendation
75
Two Representative Types of Recommendation
Content-Based Recommendation
Collaborative Filtering (by ratings of similar users)
Trust-Based Recommendation
Online and Offline Support
Several Similarity Measures
Cosine, LLR, Pearson Correlation
Others
Outlier detection
Math library
Vectors, matrices, etc.
Noise reduction
76
Big Data Science 77
Mahout Example: K-Means Clustering
Big data application support: Map-Reduce programming, Pig, Hive, etc.
Some algorithms, such as collaborative filtering, clustering, classification, and association, are difficult to implement directly.
Example of K-Means Clustering
Big Data Science 78
Mahout Example: K-Means Clustering
Short Introduction to Installation and Clustering of the Human Development Report Data Set
Mahout source download: https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
Extract the files and set the MAHOUT_HOME variable.
Getting the data set: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.
Execution of the K-Means example:
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Big Data Science 79
Initializing the K-Means Algorithm
public static void main(String[] args) throws Exception {
  Path output = new Path("output");
  Configuration conf = new Configuration();
  HadoopUtil.delete(conf, output);
  run(conf, new Path("testdata"), output, new EuclideanDistanceMeasure(), 6, 0.5, 10);
}

@Override
public int run(String[] args) throws Exception {
  addInputOption();
  addOutputOption();
  addOption(DefaultOptionCreator.distanceMeasureOption().create());
  addOption(DefaultOptionCreator.numClustersOption().create());
  addOption(DefaultOptionCreator.t1Option().create());
  addOption(DefaultOptionCreator.t2Option().create());
  addOption(DefaultOptionCreator.convergenceOption().create());
  addOption(DefaultOptionCreator.maxIterationsOption().create());
  addOption(DefaultOptionCreator.overwriteOption().create());
  Map<String, String> argMap = parseArguments(args);
  if (argMap == null) { return -1; }
  Path input = getInputPath();
  Path output = getOutputPath();
  String measureClass = getOption(DefaultOptionCreator.DISTANCE_MEASURE_OPTION);
  if (measureClass == null) {
    measureClass = SquaredEuclideanDistanceMeasure.class.getName();
  }
  double convergenceDelta = Double.parseDouble(getOption(DefaultOptionCreator.CONVERGENCE_DELTA_OPTION));
  int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
  if (hasOption(DefaultOptionCreator.OVERWRITE_OPTION)) {
    HadoopUtil.delete(getConf(), output);
  }
  DistanceMeasure measure = ClassUtils.instantiateAs(measureClass, DistanceMeasure.class);
  if (hasOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION)) {
    int k = Integer.parseInt(getOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION));
    run(getConf(), input, output, measure, k, convergenceDelta, maxIterations);
  } else {
    double t1 = Double.parseDouble(getOption(DefaultOptionCreator.T1_OPTION));
    double t2 = Double.parseDouble(getOption(DefaultOptionCreator.T2_OPTION));
    run(getConf(), input, output, measure, t1, t2, convergenceDelta, maxIterations);
  }
  return 0;
}
Big Data Science 80
Setting Up the K-Means Algorithm
public static void run(Configuration conf, Path input, Path output,
    DistanceMeasure measure, int k, double convergenceDelta, int maxIterations)
    throws Exception {
  Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
  log.info("Preparing Input");
  InputDriver.runJob(input, directoryContainingConvertedInput,
      "org.apache.mahout.math.RandomAccessSparseVector");
  log.info("Running random seed to get initial clusters");
  Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);
  log.info("Running KMeans");
  KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output, measure,
      convergenceDelta, maxIterations, true, false);
  // Run ClusterDumper to print the resulting clusters
  ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf, output, maxIterations),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}
Big Data Science 81
Example of Visualization of Clustering
Big Data Science 82
Research Issues on Big Data Infrastructure and Application
Design of Efficient Map-Reduce for Important Algorithms:
Counting (Word Count / TF-IDF) / Sort
Several Data Mining Algorithms (Classification / Clustering / Association)
Operations on Graphs / Networks
Other Applications
Application of Hadoop Map-Reduce to Several Domains:
Web / SNS data / Bio / Sensors / IoT
Many Domains: Triple Engine for WoD / Situation Awareness (SA) on Big Data
Situation Awareness on Big Data
Big Data Science
Three layers for situation awareness
[Figure] From bottom to top: the World layer provides services for data from the Web / Smart Grid / IoT / Cloud; the Perception layer turns Data into Information as processed information and entity metadata; the Comprehension layer produces learned relationship metadata through integration and ontology learning; the Projection layer produces inferred metadata and new facts/rules through data mining, machine learning, reasoning and prediction, yielding semantics, understanding and insight.
Big Data Science
Layers for Situation Awareness on Smart Grid Services
Smart Grid Network / Sensors, Internet of Things
Web Services / SOAP, RESTful Services
Perception Layer / Data Mining, Signal Processing
Comprehension Layer / Ontology Learning, Inference
Projection Layer / Knowledge Extraction, Inference
Big Data Science
Active Situation Awareness
86
[Figure] Layers (top to bottom): Projection, Comprehension (Situation), Perception, World (Facebook / Twitter web data services). Example predicates in the figure include recommendToParticipateTheEvent(Building, Event), needReplyTo(ITM), checkHisEvent(ITM), hasEvent(Building, Event), isRare(Event), giveHotTopic(ITM, aTopicInHisBlog), sayCelebration(ITM, myBlog), Stand(People, LongLine), isAt(People, Building), and Wrote(ITM, myBlog).
Big Data Science
A Framework for ASA on Big Data
Big Data Infrastructure by Hadoop and Data Mining Algorithms on it
Big Data Infrastructure by Hadoop
Smart Grid Infrastructure
Consistent Web APIs for the ASA Upper Layer
- What is the data format of the SG? - How can we get data from the SG? - How do we map between the SG and Hive?
- How can we get data from Hive?
- How do we map between Hive and Services?
Big Data Science
ETL Map-Reduce Engine
To ASA Perception Layer
Big Data Infrastructure
Big Data from SNS/Web/Sensors
Events with Time and Location (ETL) Identification & Collector
Big Data Reasoner
ETL Query
Big Data Science
Big Data Infrastructure
The <K,V> schemes for Event-Time-Location are:
S1: <E, <T,L>>, S2: <T, <E,L>>, S3: <L, <E,T>>, where E is event, T time, and L location.
Function: Retrieving E, T, L using Map-Reduce on Hadoop
Input: SNS data set // set of words that are split
Output: E/T/L related to the event // category of words set
Map function for Event:
  for each word output <event@input, <T,L> pair>
Reduce function for Event:
  for each event@input
    reason over the <T,L> pair set using temporal and spatial relationship reasoning
    output file "reasoned <T,L> pair set"
- The Map-Reduce functions for the other elements (Time, Location) can be obtained in the same fashion; a sketch of the Event mapper follows.
Map-Reduce Function for Event, Time and Location
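A heavily hedged Java sketch of the Event map function above (all names and the assumed record layout are hypothetical, and the imports are the same as in the WordCount example); it only illustrates emitting <event@input, <T,L>> pairs under the S1 scheme:
// Assumes each SNS record is "event<TAB>time<TAB>location" (an assumption, not the slides' format).
public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 3) return;                                 // skip malformed records
        String event = fields[0], time = fields[1], location = fields[2];
        context.write(new Text(event + "@input"), new Text(time + "," + location));
    }
}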
Big Data Science
Big Data Science 90
Research Issues on Big Data Infrastructure and Application
Optimization of Big Data Infrastructure
Locality/Network Aware Big Data
Infrastructure
Pipelining of Partitioning/Shuffling of Map-
Reduce
Considering Map-Oriented/Reduce-Oriented
Operation
Block & Job Scheduling for Optimization
Network-Aware Optimal Data Allocation for Map-Reduce
Big Data Science
Why heavy network load?
Data movement costs in the map phase:
If the grouped data are distributed by Hadoop's random strategy, the shaded map tasks with either remote data access or queueing delay are the performance barriers; whereas if these data are evenly distributed, the MapReduce program can avoid these barriers.
Intermediate data shuffling cost across racks in the reduce phase:
TTj stands for Task Tracker node j; PRi:TTj stands for a partition produced at TTj and hashed to reducer Ri.
Shuffling moves PR0:TT0 from TT0 to TT3, PR1:TT2 from TT2 to TT0, and PR1:TT3 from TT3 to TT0; in addition, PR1:TT1 will be shuffled from TT1 to TT0, and PR0:TT3 will remain local to TT3.
Network load: off-rack > rack-local > node-local
Big Data Science
Example of Minimizing Cost by Data Replacement
Big Data Science
Research Problem
The research problem is:
To develop the mathematical models and theories;
To develop the learning algorithm;
For the optimization of the objective function defined in Eq. (1) with constraints (2) and (3).
Big Data Science
Optimization Example
0 0 0 1 0 1 0 0 0 0 1 0
0 0 1 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 1 0 1 0 0
1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 1 1
0 1 0 0 1 0 0 0 1 0 0 1
1 0 0 1 0 0 0 1 0 0 1 0
0 0 1 0 0 0 1 0 1 0 0 0
1 0 0 0 1 0 0 0 0 1 0 0
0 0 1 0 0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1 0 1 0 0
0 1 0 0 0 0 1 0 0 0 0 1
0 0 0 1 0 1 0 0 0 0 1 0
It may be shown intuitively:
Replicas of each data block are distributed across the main server node clusters.
Data blocks that share high similarity will be placed closer together.
Big Data Science
Big Data Science 96
Tom White, Hadoop, O'Reilly, 2011
Srinath Perera, Thilina Gunarathne, Hadoop Map-
Reduce Programming, Packt Publishing, 2013
J.H Jeong, Beginning Hadoop Programming:
Development and Operations, Wiki Books, 2012
Big Data University, http://bigdatauniversity.com/
I. Paik, R. Sawa, K. Ofuji, and N. Yen, Introduction
to Big Data Science, Graduate School Course,
http://ebiz.u-aizu.ac.jp/lecture/2014-
1/BigDataScience/
References