Big Data Science 1
Big Data Science: Fundamentals, Techniques, and Challenges
(HDFS & Map-Reduce)
2014. 6. 27.
Incheon Paik
University of Aizu, Japan
Tutorial, IEEE SERVICE 2014 Anchorage, Alaska
Big Data Science 2
Contents
What is Big Data?
Big Data Source
Clustered File System & Distributed File
System
Hadoop Distributed File System (HDFS)
Map-Reduce Operation
Application of Map-Reduce
Big Data Science 3
Overview of Big Data Science
What is Big Data?
Recently, very large and complex data sets have been generated from nature, sensors, social networks, and enterprises, increasingly supported by high-speed computers and networks.
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Big Data Science 4
Overview of Big Data Science
Several Areas
Meteorology
Genomics
Connectomics
Complex physics simulations
Biological and environmental research
Internet search
Finance and business informatics
Data Set Sources
Ubiquitous information-sensing mobile devices
Aerial sensory technologies (remote sensing)
Software logs
Cameras
Microphones
Radio-frequency identification readers
Wireless sensor networks
Big Data Science 5
Clustered File System
A file system which is shared by being
simultaneously mounted on multiple servers.
Clustered file systems can provide features like
Location-independent addressing
Redundancy, which improves reliability or reduces the complexity of the other parts of the cluster
Parallel file systems are a type of clustered file system that spreads data across multiple storage nodes, usually for redundancy or performance.
Clustered File System (DAS/NAS/SAN)
Big Data Science 6
Source: www.cgw.com
Fundamental Differences in DAS/NAS/SAN
Big Data Science 7
Source: he.wikipedia.org
Storage Area Network
Big Data Science 8
Source: pccgroup.com
Big Data Science 9
Shared-Disk / Storage Area Network
A shared-disk file-system uses a storage-area
network (SAN) to provide direct disk access from
multiple computers at the block level.
Access control and translation from file-level
operations (application) to block-level operations (by
SAN) must take place on the client node.
Adds a mechanism for concurrency control which gives a consistent and serializable view of the file system, avoiding corruption and unintended data loss.
Usually employs some sort of fencing mechanism to prevent data corruption in case of node failures.
Big Data Science 10
Shared-Disk / Storage Area Network
The SAN may use any of a number of block-level
protocols, including SCSI, iSCSI, HyperSCSI,
ATA over Ethernet, Fibre Channel, etc.
There are different architectural approaches to a
shared-disk file-system.
Distribute file information across all the servers in a
cluster (fully distributed).
Utilize a centralized metadata server.
Both achieve the same result of enabling all servers to
access all the data on a shared storage device.
Big Data Science 11
Network Attached Storage (NAS)
Source: www.boot.lv
Big Data Science 12
Network Attached Storage (NAS)
Provides both storage and a file system, like a
shared disk file system on top of a SAN.
Typically uses file-based protocols (as opposed to
block-based protocols a SAN would use) such as
NFS, SMB/CIFS, AFP, or NCP.
Design Considerations
Avoiding single point of failure: Fault tolerance and high
availability by data replication of one sort or another
Performance: fast disk-access time and small amount of
CPU-processing time over distributed structure
Concurrency: for consistent and efficient multiple accesses
to the same file or block by concurrency control or locking
which may either be built into the file system or provided by
an add-on protocol
Big Data Science 13
NAS vs. SAN
Source: turbotekcomputer.com
Big Data Science 14
Distributed File System
Distributed file systems do not share block level
access to the same storage but use a network
protocol.
These are commonly known as network file
systems, even though they are not the only file
systems that use the network to send data.
Distributed file systems can restrict access to the
file system depending on access lists or
capabilities on both the servers and the clients,
depending on how the protocol is designed.
Big Data Science 15
Distributed File System
The difference between a distributed file
system and a distributed data store
Distributed file system (DFS) allows files to be
accessed using the same interfaces and
semantics as local files - e.g.
mounting/unmounting, listing directories,
read/write at byte boundaries, system's native
permission model.
Distributed data stores, by contrast, require
using a different API or library and have
different semantics (most often those of a
database).
Big Data Science 16
Distributed File System
Design Goal of DFS
Access transparency: Clients are unaware that files
are distributed and can access them in the same way
as local files are accessed.
Location transparency: A consistent name space
exists encompassing local as well as remote files. The
name of a file does not give its location.
Concurrency transparency: All clients have the
same view of the state of the file system. This means
that if one process is modifying a file, any other
processes on the same system or remote systems
that are accessing the files will see the modifications
in a coherent manner.
Big Data Science 17
Distributed File System
Design Goal of DFS
Failure transparency: The client and client programs
should operate correctly after a server failure.
Heterogeneity: File service should be provided across
different hardware and operating system platforms.
Scalability: The file system should work well in small
environments (1 machine, a dozen machines) and also
scale gracefully to huge ones (hundreds through tens
of thousands of systems).
Replication transparency: To support scalability, we
may wish to replicate files across multiple servers.
Clients should be unaware of this.
Migration transparency: Files should be able to move
around without the client's knowledge.
Big Data Science 18
Distributed File System Examples
GFS (Google Inc.)
Ceph (Inktank)
Windows Distributed File System (DFS) (Microsoft)
FhGFS (Fraunhofer)
GlusterFS (Red Hat)
Lustre
Ibrix
Big Data Science 19
Existing file systems for large file storage, such as NAS or SAN, are expensive and require very high-performance servers.
HDFS can combine web-server-class commodity hosts into one large disk storage.
Some high-performance large storage systems will still be needed: SAN is suitable for DBMS workloads, NAS for safe file storage.
It can process big data concurrently using Map-Reduce.
Four Fundamental Objectives of HDFS
Error recovery
Access data using streaming
Large data storage
Data integrity
Hadoop Distributed File System (HDFS)
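As a minimal illustration of how a program talks to HDFS (this sketch is not from the original slides; the NameNode address and paths are assumptions), a file can be written through Hadoop's Java FileSystem API, and HDFS then splits it into blocks and replicates them transparently:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");      // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");           // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
            out.writeUTF("Hello HDFS");                          // HDFS splits and replicates the data
        }
        System.out.println("Replication factor: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}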
Big Data Science 20
Hadoop Distributed File System
Big Data Science 21
Block Structured File System
[Figure] A 320 MB file is divided into five blocks (Block 1 to Block 5); HDFS replicates each block on three of the four data nodes, so every data node stores a different subset of the blocks.
File Copy Structure on HDFS
Big Data Science 22
Block Structured File System
Source:bradhedlund.com
Big Data Science 23
Block Structured File System
Source:bradhedlund.com
Big Data Science 24
HDFS, NoSQL, Map-Reduce
NoSQL
A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases.
Motivations for this approach include simplicity of
design, horizontal scaling and finer control over
availability.
NoSQL databases are often highly optimized key–
value stores intended for simple retrieval and
appending operations, with the goal being significant
performance benefits in terms of latency and
throughput.
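To make the key-value access pattern concrete, the following is a minimal in-memory sketch in Java; it only illustrates retrieval and appending by key and is not the API of any particular NoSQL product.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class SimpleKeyValueStore {
    // No schema, no SQL, no joins: values are looked up and appended by key only.
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();

    public void append(String key, String value) {              // appending operation
        store.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    public List<String> get(String key) {                        // simple retrieval operation
        return store.getOrDefault(key, List.of());
    }

    public static void main(String[] args) {
        SimpleKeyValueStore kv = new SimpleKeyValueStore();
        kv.append("user:42", "clicked:/home");
        kv.append("user:42", "clicked:/search");
        System.out.println(kv.get("user:42"));                   // [clicked:/home, clicked:/search]
    }
}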
Big Data Science 25
Map Operation
What is a Map operation?
Doing something to every element in an array is a common operation. The operation can be considered as a Map function.
int[] a = {1, 2, 3};
for (int i = 0; i < a.length; i++)
    a[i] = a[i] * 2;
The new value of the variable a would be {2, 4, 6}.
This can also be written as a function.
Map-Reduce Operation
Source:bigdatauniversity.com
Big Data Science 26
Map Operation
What is a Map operation?
Doing something to every element in an array is a common operation. The operation can be considered as a Map function.
int[] a = {1, 2, 3};
for (int i = 0; i < a.length; i++)
    a[i] = fx(a[i]);
...like this, where fx is a function defined as:
function fx(x) { return x * 2; }
The new value of the variable a would be {2, 4, 6}.
All of this can also be converted into a "map" function.
Big Data Science 27
Map Operation
What is a Map operation?
...like this, where fx is a function passed as an argument:
function map(fx, a) {
    for (i = 0; i < a.length; i++)
        a[i] = fx(a[i]);
}
Here fx is a function whose definition is included in the call.
You can invoke this map function as follows:
map(function(x) { return x * 2; }, a);
Big Data Science 28
Reduce Operation
What is a Reduce operation?
Another common operation on arrays is to combine all their values:
function sum(a) {
    s = 0;
    for (i = 0; i < a.length; i++)
        s += a[i];
    return s;
}
This can also be written so that the combining step is a function.
Big Data Science 29
Reduce Operation
What is a Reduce operation?
Another common operation on arrays is to combine all their values:
function sum(a) {
    s = 0;
    for (i = 0; i < a.length; i++)
        s = fx(s, a[i]);
    return s;
}
Like this, where a function fx is defined so that it adds its arguments:
function fx(a, b) { return a + b; }
The whole function sum can also be rewritten so that fx is passed as an argument, as a reduce operation:
reduce(function(a, b) { return a + b; }, a, 0);
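The same two ideas can be written in Java (an illustrative sketch using the standard Stream API, not part of the original slides): mapping applies a function to every element, and reducing folds all elements into one value.
import java.util.Arrays;

public class MapReduceIdea {
    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        // "Map": apply a function to every element of the array.
        int[] doubled = Arrays.stream(a).map(x -> x * 2).toArray();   // {2, 4, 6}
        // "Reduce": combine all elements into one value, starting from 0.
        int sum = Arrays.stream(doubled).reduce(0, (s, x) -> s + x);  // 12
        System.out.println(Arrays.toString(doubled) + ", sum = " + sum);
    }
}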
Big Data Science 30
Submitting a MapReduce Job
Source:bigdatauniversity.com
Big Data Science 31
Shuffle Process
[Figure] Input splits feed Map1, Map2 and Map3; the map output data passes through a memory buffer, is partitioned and spilled to files, then merged, copied to the reducers, and sorted before Reduce1 and Reduce2 run.
Stages: 1. Split Creation, 2. Map, 3. Spill, 4. Merge, 5. Copy, 6. Sort, 7. Reduce
Source: Beginning Hadoop Programming
MapReduce – Type and Data Flow
Lists
Big Data Science 32
Source:bigdatauniversity.com
HADOOP MapReduce Program : WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
// Mapper Implementation
public static class Map extends
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
// Reducer Implementation
public static class Reduce extends
Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable(0);
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values)
sum += value.get();
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Parameter Passing
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
args = parser.getRemainingArgs();
// Creating of Job
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class);
// Assign Classes for Mapper/Reducer
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// Other Setting for Job
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
// Input/output paths come from the remaining command-line arguments
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Submit the job and wait for completion
job.waitForCompletion(true);
}
}
Big Data Science 33
Big Data Science 34
Map-Reduce Architecture on Hadoop
[Figure] A Client submits jobs to the Job Tracker; each Data Node runs a Task Tracker, and each Task Tracker executes the Map and Reduce tasks assigned to it.
Source: Beginning Hadoop Programming
Big Data Science 35
Client: the Map-Reduce API provided by Hadoop, and the Map-Reduce program executed by a user.
Job Tracker
A Map-Reduce program is managed as a work unit called a "Job".
Manages scheduling and monitoring of all jobs on the Hadoop cluster.
Usually runs on the Name Node server, but not necessarily.
Description of Job Tracker
When a user requests a new job, the Job Tracker calculates how many Maps and Reduces will be executed for the job.
Map-Reduce Architecture on Hadoop
Big Data Science 36
Description of Job Tracker
A Map-Reduce program executed by a client through Hadoop is managed as a unit called a "Job".
The Job Tracker schedules and monitors all jobs registered on the Hadoop cluster, but it does not have to run on the Name Node server.
It decides which Task Trackers will execute the Maps and Reduces and assigns the job to those Task Trackers; the Task Trackers then execute the Map-Reduce programs.
The Job Tracker communicates with the Task Trackers through heartbeats to check Task Tracker status and job execution information.
Map-Reduce Architecture on Hadoop
Big Data Science 37
Task Tracker
A daemon program that executes a user's Map-Reduce program on a Data Node.
Creates the Map tasks and Reduce tasks that the Job Tracker requested.
Executes the tasks (Maps and Reduces) by running a new JVM for each.
Workflow of Map-Reduce
A user executes a job by invoking the waitForCompletion method.
A Job Client object is created in the Job interface, and the object contacts the Job Tracker to execute the job.
Map-Reduce Architecture on Hadoop
Big Data Science 38
Map-Reduce Operation on Hadoop
[Figure] A Job Client on the client node submits a Job to the Job Tracker on the Name Node; Task Trackers on the Data Nodes run the Mapper, Partitioner, and Reducer over key-value pairs and write the output data.
Steps: 1. Map-Reduce job execute, 2. New job submit, 3. Input split creation, 4. Job assignment, 5. Mapper execution, 6. Sort and merge, 7. Reducer execution, 8. Output data save
Source: Beginning Hadoop Programming
Big Data Science 39
Workflow of Map-Reduce
The Job Tracker returns a new job ID to the Job Client, and the Job Client checks the output path defined by the user.
The Job Client calculates the input splits (the portions of the input file to be processed by each Map task) for the job's input data. After calculating the input splits, it saves the input split information, the configuration file, and the Map-Reduce JAR file to HDFS, and notifies the Job Tracker that preparation for starting the job is complete.
The Job Tracker registers the job in the queue, and the job scheduler initializes the job from the queue. The job scheduler then creates Map tasks according to the input split information and assigns them IDs.
Map-Reduce Architecture on Hadoop
Big Data Science 40
Workflow of Map-Reduce
The Task Tracker reports its status to the Job Tracker periodically by calling the heartbeat method. The heartbeat information includes resource information about CPU, memory, and disk, the number of running tasks, the maximum number of available task slots, and whether new tasks can be executed.
The Task Tracker executes the Map task assigned by the Job Tracker. The Map task executes the logic in the map method and saves the output data to a memory buffer. At this point, a partitioner decides which Reduce task each output record should be transferred to and assigns it a suitable partition. The data in memory is sorted by key and saved to the local hard disk; when the save is complete, the spill files are sorted and merged into one file.
Map-Reduce Architecture on Hadoop
Big Data Science 41
Workflow of Map-Reduce
A Reduce task runs the Reducer class written by the user. The task can execute only when all Map output data is ready. When a Map task completes its output, it notifies the Task Tracker that ran it; the Task Tracker then reports the status of that Map task and the path of its output data to the Job Tracker.
The Reduce task copies the output data of all Map tasks and merges it. When the merge is complete, it executes the analysis logic by calling the reduce method.
The Reduce task saves its output data to HDFS under the name "part-nnnnn", where "nnnnn" is the partition ID.
Map-Reduce Architecture on Hadoop
Big Data Science 42
Data Types
Map-Reduce data types are optimized for network communication, and Hadoop provides the "WritableComparable" interface for them.
All data types used for keys and values in a Map-Reduce program should implement the "WritableComparable" interface.
The "WritableComparable" interface inherits the "Writable" and "Comparable" interfaces.
The "write" method serializes a data value, and the "readFields" method deserializes (reads back) the serialized data.
Component of Map-Reduce Programming
public interface Writable {
void write (DataOutput out) throws IOException;
void readFields (DataInput in) throws IOException;
}
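For illustration only (the class name and fields below are hypothetical, not from the slides), a custom key type implements WritableComparable by providing write, readFields, and compareTo:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key (word, docId) usable as a MapReduce key type.
public class WordDocKey implements WritableComparable<WordDocKey> {
    private String word = "";
    private int docId;

    public void set(String word, int docId) { this.word = word; this.docId = docId; }

    public void write(DataOutput out) throws IOException {      // serialize
        out.writeUTF(word);
        out.writeInt(docId);
    }

    public void readFields(DataInput in) throws IOException {   // deserialize
        word = in.readUTF();
        docId = in.readInt();
    }

    public int compareTo(WordDocKey other) {                     // sort order used in the shuffle
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Integer.compare(docId, other.docId);
    }
}
In practice such a key should also override hashCode and equals so that partitioning and grouping behave consistently.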
Big Data Science 43
Data Types
Wrapper classes implementing the "WritableComparable" interface are provided for the common data types:
BooleanWritable, ByteWritable, DoubleWritable, FloatWritable, IntWritable, LongWritable, Text, NullWritable.
InputFormat
The "InputFormat" abstract class defines how input data is split and read so that it can be passed as input to the "map" method.
Component of Map-Reduce Programming
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}
Big Data Science 44
InputFormat
"getSplits": computes the input splits that the Map tasks will process.
"createRecordReader": creates a "RecordReader" object so that the "map" method can consume its input split as a list of key-value records.
The "map" method executes its analysis logic by reading keys and values from the RecordReader object.
Kinds of the InputFormat: TextInputFormat,
KeyValueTextInputFormat, NLineInputFormat,
DelegatingInputFormat, CombineFileInputFormat,
SequenceFileInputFormat,
SequenceFileAsBinaryInputFormat,
SequenceFileAsTextInputFormat
Component of Map-Reduce Programming
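A brief sketch (assuming a Job object named job, as in the WordCount example): the InputFormat is chosen when configuring the job; TextInputFormat is the default when nothing is set.
// Each input line is split at the first tab into key and value (illustrative choice).
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);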
Big Data Science 45
Mapper Class
Carries out the map function of Map-Reduce programming.
It receives input data consisting of keys and values, then processes and classifies the data to create a new list of key-value pairs.
Class definition: parameterized as <Input Key Type, Input Value Type, Output Key Type, Output Value Type>.
Context object: used to get information about the job and to read the input split record by record.
The RecordReader object is passed to the map method to read data in the form of keys and values.
A Map programmer overrides the map method.
When the run method is executed, the map method is called for every key-value pair read from the Context object.
Component of Map-Reduce Programming
Big Data Science 46
Mapper Class
Component of Map-Reduce Programming
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public class Context extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer, StatusReporter reporter,
InputSplit split) throws IOException, InterruptedException {
super(conf, taskid, reader, writer, committer, reporter, split);
}
}
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
}
Big Data Science 47
Partitioner
Decides to which Reduce task the output data of a Map task will be passed.
The partition is calculated as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
According to the result, the partition is created on the node where the Map task was executed, and the output data of the Map task is saved there. When the Map task is complete, the data in each partition is transmitted to the corresponding Reduce task over the network.
The getPartition method can be overridden to implement a different partitioning strategy (see the sketch after the code below).
Component of Map-Reduce Programming
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
public int getPartition(K2 key, V2 value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
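As a hedged sketch of an alternative strategy (hypothetical class, written against the newer org.apache.hadoop.mapreduce Partitioner rather than the older interface shown above): send all words starting with the same letter to the same Reduce task.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: partition word keys by their first character.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.getLength() == 0) {
            return 0;
        }
        int first = Character.toLowerCase(key.toString().charAt(0));
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// Registered on the job with: job.setPartitionerClass(FirstLetterPartitioner.class);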
Big Data Science 48
Reducer
The Reducer class receives the output data of the Map tasks as its input and executes aggregation operations on it.
Class definition: parameterized as <Input Key Type, Input Value Type, Output Key Type, Output Value Type>.
Like the Mapper class, it has a Context object, which inherits from ReduceContext: it consults the job for information and exposes the input value list through a RawKeyValueIterator.
A RecordWriter is passed as a parameter so that the output of the reduce method can be written record by record.
A Reducer programmer overrides the reduce method.
Component of Map-Reduce Programming
Big Data Science 49
Reducer Class
Component of Map-Reduce Programming
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public class Context extends ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RawKeyValueIterator input, Counter inputKeyCounter, Counter inputValueCounter,
RecordWriter<KEYOUT,VALUEOUT> output, OutputCommitter committer, StatusReporter reporter,
RawComparator<KEYIN> comparator, Class<KEYIN> keyClass, Class<VALUEIN> valueClass
) throws IOException, InterruptedException {
super(conf, taskid, input, inputKeyCounter, inputValueCounter,
output, committer, reporter, comparator, keyClass, valueClass);
}
}
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context ) throws IOException, InterruptedException {
for(VALUEIN value: values) { context.write((KEYOUT) key, (VALUEOUT) value); }
}
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKey()) { reduce(context.getCurrentKey(), context.getValues(), context); }
cleanup(context);
}
} // end of class Reducer
Big Data Science 50
Combiner Class
Shuffle: the transfer of data between the Map tasks and the Reduce tasks.
To improve the performance of the whole job, the amount of data transferred over the network during the shuffle should be reduced.
The Combiner class takes the output data of the Mapper as input and produces reduced (locally aggregated) data on the local node before it is sent over the network, as in the sketch below.
The Reducer should produce the same output in both cases: with the Combiner and without it.
Component of Map-Reduce Programming
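A minimal sketch (reusing the WordCount classes shown earlier): because summing counts is associative and commutative, the Reduce class itself can serve as the Combiner, pre-aggregating <word, 1> pairs on the map-side node before the shuffle.
// In the WordCount driver, after creating the Job:
job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);   // local pre-aggregation before data crosses the network
job.setReducerClass(Reduce.class);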
Big Data Science 51
OutputFormat
The output data format of a Map-Reduce job is selected with the setOutputFormatClass method of the Job interface, as in the sketch below.
An output data format is made by inheriting the abstract class "OutputFormat".
Kinds of OutputFormat: TextOutputFormat, SequenceFileOutputFormat, SequenceFileAsBinaryOutputFormat, FilterOutputFormat, LazyOutputFormat, NullOutputFormat.
Most of these classes inherit the FileOutputFormat class, which inherits the OutputFormat class.
Component of Map-Reduce Programming
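A brief sketch (assuming the same Job object): the default is TextOutputFormat, which writes tab-separated key/value lines; choosing SequenceFileOutputFormat instead writes a binary file that is convenient as input to a following job.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// ...
job.setOutputFormatClass(SequenceFileOutputFormat.class);        // instead of the default TextOutputFormat
SequenceFileOutputFormat.setOutputPath(job, new Path("output")); // output directory is illustrative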
52
Document Frequency
But document frequency (df) may be better:
df = number of docs in the corpus containing the term
Word        cf      df
pandemic    10777   20
influenza   10440   4000
Document/collection frequency weighting is only possible in a known (static) collection.
So how do we make use of df?
Big Data Science
53
TF × IDF term weights
The TF × IDF measure combines:
Term Frequency (TF)
― or wf, some measure of term density in a doc
Inverse Document Frequency (IDF)
― a measure of the informativeness of a term: its rarity across the whole corpus
― could just be the reciprocal of the number of documents the term occurs in (IDFi = 1/DFi)
― but by far the most commonly used version is:
IDFi = log(n / DFi)
Big Data Science
54
Summary : TF × IDF (or tf.idf)
Assign a tf.idf weight to each term i in each
document d
Increases with the number of occurrences within a doc
Increases with the rarity of the term across the whole corpus
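Concretely, the standard weight combining the two factors (a textbook formula consistent with the slides) is:
w_{i,d} = \mathrm{tf}_{i,d} \times \log\left(\frac{n}{\mathrm{df}_i}\right)
where tf_{i,d} is the frequency of term i in document d, df_i is the number of documents containing term i, and n is the total number of documents in the corpus.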
Big Data Science
HADOOP MapReduce Program : Calculating TF-IDF
Function: Calculating TF-IDF using Map-Reduce on Hadoop
Input: words in documents // set of words that are split
Output: TF-IDF for each word // category of words set
1st Map function:
  for each word output <word@document title, 1>
1st Reduce function:
  for each word@document title
    n = number of occurrences of the word in the document
    output "word@document title, n"
2nd Map function:
  for each word@document title output <document title, word=n>
2nd Reduce function:
  for each document
    N = total number of words in the document
    for each word output "word@document title, n/N"
3rd Map function:
  for each word@document title output <word, document title=n/N>
3rd Reduce function:
  for each word
    D = total number of documents
    d = number of documents containing the word
    calculate the TF-IDF value using n, N, d and D
    output "word@document title, TF-IDF value"
Big Data Science 55
Big Data Science 56
Hadoop Eco System
Source: www.manojrpatil.com
Big Data Science 57
Hadoop Eco System
Hadoop HDFS and Map-Reduce are useful for big data processing, but they target Java developers and provide only low-level processing.
The ecosystem makes Hadoop more usable and accessible to non-specialists in a user-friendly manner.
The components are not all meant to be used together, but rather act as parts of a single organism; some may even seek to solve the same problem in different ways.
Big Data Science 58
YARN
Yet Another Resource Negotiator (YARN) addresses
problems with MapReduce 1.0’s architecture, specifically
with the JobTracker service.
YARN "split[s] up the two major functionalities of the
JobTracker, resource management and job
scheduling/monitoring, into separate daemons. The idea
is to have a global Resource Manager (RM) and per-
application ApplicationMaster (AM)." (source: Apache)
Thus, rather than burdening a single node with handling
scheduling and resource management for the entire
cluster, YARN now distributes this responsibility across
the cluster.
Big Data Science 59
Hadoop Eco System
Avro
A framework for performing remote procedure calls and data
serialization. In the context of Hadoop, it can be used to pass
data from one program or language to another, e.g. from C to Pig.
BigTop
A project for packaging and testing the Hadoop ecosystem. Much
of BigTop's code was initially developed and released as part of
Cloudera's CDH distribution, but has since become its own
project at Apache.
Chukwa
A data collection and analysis system built on top of HDFS and
MapReduce. Tailored for collecting logs and other data from
distributed monitoring systems, Chukwa provides a workflow that
allows for incremental data collection, processing and storage in
Hadoop.
Big Data Science 60
Hadoop Eco System
Drill
A distributed system for executing interactive analysis over large-
scale datasets. Some explicit goals of the Drill project are to
support real-time querying of nested data and to scale to clusters
of 10,000 nodes or more.
Flume
A tool for harvesting, aggregating and moving large amounts of
log data in and out of Hadoop.
HBase
Based on Google's Bigtable, HBase "is an open-source,
distributed, versioned, column-oriented store" that sits on top of
HDFS. HBase is column-based rather than row-based, which
enables high-speed execution of operations performed over
similar values across massive data sets, e.g. read/write
operations that involve all rows but only a small subset of all
columns.
Big Data Science 61
Hadoop Eco System
HCatalog
A metadata and table storage management service for HDFS.
HCatalog depends on the Hive metastore and exposes it to other
services such as MapReduce and Pig with plans to expand to
HBase using a common data model.
Hive
Provides a warehouse structure and SQL-like access for data in
HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's
query language, HiveQL, compiles to MapReduce. It also allows
user-defined functions (UDFs).
Oozie
A job coordinator and workflow manager for jobs executed in
Hadoop, which can include non-MapReduce jobs.
Big Data Science 62
Hadoop Eco System
Mahout
Mahout is a scalable machine-learning and data mining library.
There are currently four main groups of algorithms in Mahout:
― recommendations, a.k.a. collective filtering
― classification, a.k.a categorization
― clustering
― frequent itemset mining, a.k.a parallel frequent pattern mining
Pig
Framework consisting of a high-level scripting language (Pig
Latin) and a run-time environment that allows users to execute
MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin
is a higher-level language that compiles to MapReduce.
Big Data Science 63
Hadoop Eco System
Sqoop
Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both
directions between relational systems and HDFS or other Hadoop
data stores, e.g. Hive or HBase.
ZooKeeper
A service for maintaining configuration information, naming,
providing distributed synchronization and providing group services.
Spark (UC Berkeley)
A parallel computing program which can operate over any Hadoop
input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an
open-source project at the U.C. Berkeley AMPLab, and in its own
words, Spark "was initially developed for two applications where
keeping data in memory helps: iterative algorithms, which are
common in machine learning, and interactive data mining."
Big Data Science 64
Apache Pig
High-Level Data Flow Language
Made of two components:
Data processing language Pig Latin
Complier to translate Pig Latin to MapReduce
It abstracts the programmer from specific details and allows them to focus on data processing
Big Data Science 65
Pig in the Hadoop Ecosystem
Source: sigmanac.com
Big Data Science 66
Pig Latin
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
Big Data Science 67
Apache Hive: SQL for Hadoop
Data Warehousing Layer on top of Hadoop
Allows analysis and queries using a SQL-like language
It is best suited for data analysts familiar with SQL who need to do ad-hoc queries, summarization, and data analysis
Source: sigmanac.com
Big Data Science 68
Hive Architecture
Big Data Science 69
Hive Example
CREATE TABLE users (name STRING, age INT);
CREATE TABLE pages (user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
SORT BY clicks DESC
LIMIT 10;
Big Data Science 70
Mahout in Hadoop
Copy source: http://cloud.watch.impress.co.jp/ Copyright: NTT Data Co.
Mahout Package
71
What does Mahout provide:
Several Data Mining Algorithms
Classification
Clustering
Association Analysis
Recommendations
Others
Classification
72
Assigning objects to one of several predefined categories:
Politics, Economics, Sports, ... of Reuters newspaper articles
Some figures of galaxies
Classifying credit card transactions as legitimate or fraudulent
Classifying code as malicious or not
Algorithms
Naïve Bayes
Decision Tree
SVM
Linear Regression
Neural Network
Rule-based Methods
Clustering
Grouping objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
Partitional Clustering / Hierarchical Clustering
K-Means, Fuzzy K-Means, Density-Based, ...
Different distance measures: Manhattan, Euclidean, other domain-dependent methods ...
73
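For reference, the two distance measures named above are, for m-dimensional points x and y (standard definitions):
d_{\mathrm{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{m}(x_i - y_i)^2}, \qquad d_{\mathrm{Manhattan}}(x,y) = \sum_{i=1}^{m}\lvert x_i - y_i\rvert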
Association Analysis
Find the frequent item
sets in transaction DB
<milk, bread, cheese> are
sold frequently together
Apriori principle
Market analysis, access
pattern analysis, etc…
74
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Recommendation
75
Two Representative Types of Recommendation
Content-Based Recommendation
Collaborative Filtering (by ratings of similar users)
Trust-Based Recommendation
Online and Offline Support
Several Similarity Measures
Cosine, LLR, Pearson Correlation
Others
Outlier detection
Math library
Vectors, matrices, etc.
Noise reduction
76
Big Data Science 77
Mahout Example: K-Means Clustering
Big data application support: Map-Reduce programming, Pig, Hive, etc.
Some algorithms, such as collaborative filtering, clustering, classification, and association, are difficult to implement directly.
Example of K-Means Clustering
Big Data Science 78
Mahout Example: K-Means Clustering
Short Introduction to Installation and Clustering of the Human Development Report Data Set
Mahout source download: https://cwiki.apache.org/confluence/display/MAHOUT/Downloads
Extract the files and set the MAHOUT_HOME variable.
Getting the data set: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.
Execution of the K-Means example:
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Big Data Science 79
Initializing the K-Means Algorithm
public static void main(String[] args) throws Exception {
  Path output = new Path("output");
  Configuration conf = new Configuration();
  HadoopUtil.delete(conf, output);
  run(conf, new Path("testdata"), output, new EuclideanDistanceMeasure(), 6, 0.5, 10);
}

@Override
public int run(String[] args) throws Exception {
  addInputOption();
  addOutputOption();
  addOption(DefaultOptionCreator.distanceMeasureOption().create());
  addOption(DefaultOptionCreator.numClustersOption().create());
  addOption(DefaultOptionCreator.t1Option().create());
  addOption(DefaultOptionCreator.t2Option().create());
  addOption(DefaultOptionCreator.convergenceOption().create());
  addOption(DefaultOptionCreator.maxIterationsOption().create());
  addOption(DefaultOptionCreator.overwriteOption().create());
  Map<String, String> argMap = parseArguments(args);
  if (argMap == null) { return -1; }
  Path input = getInputPath();
  Path output = getOutputPath();
  String measureClass = getOption(DefaultOptionCreator.DISTANCE_MEASURE_OPTION);
  if (measureClass == null) {
    measureClass = SquaredEuclideanDistanceMeasure.class.getName();
  }
  double convergenceDelta = Double.parseDouble(getOption(DefaultOptionCreator.CONVERGENCE_DELTA_OPTION));
  int maxIterations = Integer.parseInt(getOption(DefaultOptionCreator.MAX_ITERATIONS_OPTION));
  if (hasOption(DefaultOptionCreator.OVERWRITE_OPTION)) {
    HadoopUtil.delete(getConf(), output);
  }
  DistanceMeasure measure = ClassUtils.instantiateAs(measureClass, DistanceMeasure.class);
  if (hasOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION)) {
    int k = Integer.parseInt(getOption(DefaultOptionCreator.NUM_CLUSTERS_OPTION));
    run(getConf(), input, output, measure, k, convergenceDelta, maxIterations);
  } else {
    double t1 = Double.parseDouble(getOption(DefaultOptionCreator.T1_OPTION));
    double t2 = Double.parseDouble(getOption(DefaultOptionCreator.T2_OPTION));
    run(getConf(), input, output, measure, t1, t2, convergenceDelta, maxIterations);
  }
  return 0;
}
Big Data Science 80
Setting Up the K-Means Algorithm
public static void run(Configuration conf, Path input, Path output,
    DistanceMeasure measure, int k, double convergenceDelta, int maxIterations)
    throws Exception {
  Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
  log.info("Preparing Input");
  InputDriver.runJob(input, directoryContainingConvertedInput,
      "org.apache.mahout.math.RandomAccessSparseVector");
  log.info("Running random seed to get initial clusters");
  Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);
  log.info("Running KMeans");
  KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output, measure,
      convergenceDelta, maxIterations, true, false);
  // Run ClusterDumper to print the resulting clusters
  ClusterDumper clusterDumper = new ClusterDumper(finalClusterPath(conf, output, maxIterations),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}
Big Data Science 81
Example of Visualization of Clustering
Big Data Science 82
Research Issues on Big Data Infrastructure and Application
Design of Efficient Map-Reduce for Important Algorithms:
Counting (Word Count / TF-IDF) / Sort
Several Data Mining Algorithms (Classification / Clustering / Association)
Operations on Graphs / Networks
Other Applications
Application of Hadoop Map-Reduce to Several Domains:
Web / SNS data / Bio / Sensors / IoT
Many Domains: Triple Engine for WoD / Situation Awareness (SA) on Big Data
Situation Awareness on Big Data
Big Data Science
Three layers for situation awareness
[Figure] From bottom to top: the World layer provides services for data from the Web / Smart Grid / IoT / Cloud; the Perception layer turns Data into Information as processed information and entity metadata; the Comprehension layer produces learned relationship metadata through integration and ontology learning; the Projection layer produces inferred metadata and new facts/rules through data mining, machine learning, reasoning and prediction, yielding semantics, understanding and insight.
Big Data Science
Layers for Situation Awareness on Smart Grid Services
Smart Grid Network / Sensors, Internet of Things
Web Services / SOAP, RESTful Services
Perception Layer / Data Mining, Signal Processing
Comprehension Layer / Ontology Learning, Inference
Projection Layer / Knowledge Extraction, Inference
Big Data Science
Active Situation Awareness
86
[Figure] Layers (top to bottom): Projection, Comprehension (Situation), Perception, World (Facebook / Twitter web data services). Example predicates in the figure include recommendToParticipateTheEvent(Building, Event), needReplyTo(ITM), checkHisEvent(ITM), hasEvent(Building, Event), isRare(Event), giveHotTopic(ITM, aTopicInHisBlog), sayCelebration(ITM, myBlog), Stand(People, LongLine), isAt(People, Building), and Wrote(ITM, myBlog).
Big Data Science
A Framework for ASA on Big Data
Big Data Infrastructure by Hadoop and Data Mining Algorithms on it
Big Data Infrastructure by Hadoop
Smart Grid Infrastructure
Consistent Web APIs for the ASA Upper Layer
- What is the data format of the SG? - How can we get data from the SG? - How do we map between the SG and Hive?
- How can we get data from Hive?
- How do we map between Hive and Services?
Big Data Science
ETL Map-Reduce Engine
To ASA Perception Layer
Big Data Infrastructure
Big Data from SNS/Web/Sensors
Events with Time and Location (ETL) Identification & Collector
Big Data Reasoner
ETL Query
Big Data Science
Big Data Infrastructure
The <K,V> schemes for Event-Time-Location are:
S1: <E, <T,L>>, S2: <T, <E,L>>, S3: <L, <E,T>>, where E is event, T time, and L location.
Function: Retrieving E, T, L using Map-Reduce on Hadoop
Input: SNS data set // set of words that are split
Output: E/T/L related to the event // category of words set
Map function for Event:
  for each word output <event@input, <T,L> pair>
Reduce function for Event:
  for each event@input
    reason over the <T,L> pair set using temporal and spatial relationship reasoning
    output file "reasoned <T,L> pair set"
- The Map-Reduce functions for the other elements (Time, Location) can be obtained in the same fashion; a sketch of the Event mapper follows.
Map-Reduce Function for Event, Time and Location
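A heavily hedged Java sketch of the Event map function above (all names and the assumed record layout are hypothetical, and the imports are the same as in the WordCount example); it only illustrates emitting <event@input, <T,L>> pairs under the S1 scheme:
// Assumes each SNS record is "event<TAB>time<TAB>location" (an assumption, not the slides' format).
public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 3) return;                                 // skip malformed records
        String event = fields[0], time = fields[1], location = fields[2];
        context.write(new Text(event + "@input"), new Text(time + "," + location));
    }
}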
Big Data Science
Big Data Science 90
Research Issues on Big Data Infrastructure and Application
Optimization of Big Data Infrastructure
Locality/Network Aware Big Data
Infrastructure
Pipelining of Partitioning/Shuffling of Map-
Reduce
Considering Map-Oriented/Reduce-Oriented
Operation
Block & Job Scheduling for Optimization
Network-Aware Optimal Data Allocation for Map-Reduce
Big Data Science
Why heavy network load?
Data movement costs in the map phase:
If the grouped data are distributed by Hadoop's random strategy, the shaded map tasks with either remote data access or queueing delay are the performance barriers; whereas if these data are evenly distributed, the MapReduce program can avoid these barriers.
Intermediate data shuffling cost across racks in the reduce phase:
TTj stands for Task Tracker node j; PRi:TTj stands for a partition produced at TTj and hashed to reducer Ri.
Shuffling moves PR0:TT0 from TT0 to TT3, PR1:TT2 from TT2 to TT0, and PR1:TT3 from TT3 to TT0; in addition, PR1:TT1 will be shuffled from TT1 to TT0, and PR0:TT3 will remain local to TT3.
Network load: off-rack > rack-local > node-local
Big Data Science
Example of Minimizing Cost by Data Replacement
Big Data Science
Research Problem
The research problem is:
To develop the mathematical models and theories;
To develop the learning algorithm;
For the optimization of the objective function defined in Eq. (1) with constraints (2) and (3).
Big Data Science
Optimization Example
0 0 0 1 0 1 0 0 0 0 1 0
0 0 1 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 1 0 1 0 0
1 1 0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 1 1
0 1 0 0 1 0 0 0 1 0 0 1
1 0 0 1 0 0 0 1 0 0 1 0
0 0 1 0 0 0 1 0 1 0 0 0
1 0 0 0 1 0 0 0 0 1 0 0
0 0 1 0 0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1 0 1 0 0
0 1 0 0 0 0 1 0 0 0 0 1
0 0 0 1 0 1 0 0 0 0 1 0
It may be shown intuitively:
Replicas of each data block are distributed across the main server node clusters.
Data blocks that share high similarity will be placed closer together.
Big Data Science
Big Data Science 96
Tom White, Hadoop, O'Reilly, 2011
Srinath Perera, Thilina Gunarathne, Hadoop Map-
Reduce Programming, Packt Publishing, 2013
J.H Jeong, Beginning Hadoop Programming:
Development and Operations, Wiki Books, 2012
Big Data University, http://bigdatauniversity.com/
I. Paik, R. Sawa, K. Ofuji, and N. Yen, Introduction
to Big Data Science, Graduate School Course,
http://ebiz.u-aizu.ac.jp/lecture/2014-
1/BigDataScience/
References