Hadoop Ecosystem
Ran Silberman Dec. 2014
What types of ecosystems exist?
● Systems that are based on MapReduce
● Systems that replace MapReduce
● Complementary databases
● Utilities
● See complete list here
Systems based on MapReduce
Hive
● Part of the Apache project
● General SQL-like syntax for querying HDFS or other large databases
● Each SQL statement is translated to one or more MapReduce jobs (in some cases none)
● Supports pluggable Mappers, Reducers, and SerDes (Serializer/Deserializer)
● Pro: convenient for analytics people who use SQL
Hive Architecture
Hive Usage
Start a Hive shell:
$ hive
Create a Hive table:
hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email STRING);
Show all tables:
hive> SHOW TABLES;
Add a new column to the table:
hive> ALTER TABLE tikal ADD COLUMNS (description STRING);
Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;
Query employees that have worked more than a year:
hive> SELECT name FROM tikal WHERE (unix_timestamp() - unix_timestamp(startdate)) > 365 * 24 * 60 * 60;
Pig
● Part of the Apache project
● A programming language that is compiled into one or more MapReduce jobs
● Supports User Defined Functions (a UDF sketch follows the usage examples below)
● Pro: more convenient to write than pure MapReduce
Pig Usage
Start a Pig shell (grunt is the Pig Latin shell prompt):
$ pig
grunt>
Load an HDFS data file:
grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
       as (id,name,startdate,email,description);
Dump the data to the console:
grunt> DUMP employees;
Query employees that started more than a year ago (assuming startdate holds a Unix timestamp in seconds):
grunt> employees_more_than_1_year = FILTER employees
       BY (ToUnixTime(CurrentTime()) - (long)startdate) > 365L * 24 * 60 * 60;
grunt> DUMP employees_more_than_1_year;
Store the query result in a new file:
grunt> STORE employees_more_than_1_year INTO '/home/hduser/employees_more_than_1_year';
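Pig's User Defined Function support (noted above) lets a script call custom Java code. A minimal sketch, assuming a hypothetical UpperCase class packaged into a jar named myudfs.jar:

// hypothetical Pig UDF that upper-cases its first argument
package com.example;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Pig UDFs conventionally return null for null/empty input
    if (input == null || input.size() == 0 || input.get(0) == null)
      return null;
    return ((String) input.get(0)).toUpperCase();
  }
}

Register the jar and call the UDF by its full class name (the cast to chararray matches the UDF's String argument):
grunt> REGISTER myudfs.jar;
grunt> upper_names = FOREACH employees GENERATE com.example.UpperCase((chararray)name);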
Cascading
● An infrastructure with an API that is compiled to one or more MapReduce jobs
● Provides a graphical view of the MapReduce jobs workflow
● Ways to tweak settings and improve performance of the workflow
● Pros:
○ Hides the MapReduce API and joins jobs
○ Graphical view and performance tuning
MapReduce workflow
● MapReduce framework operates exclusively on Key/Value pairs
● There are three phases in the workflow:
○ map
○ combine
○ reduce
(input) <k1, v1> => map => <k2, v2> => combine => <k2, v2> => reduce => <k3, v3> (output)
WordCount in MapReduce Java API
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
WordCount in MapReduce Java Cont.
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
WordCount in MapReduce Java Cont.
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
MapReduce workflow example.
Let’s consider two text files:
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01
Hello World Bye World
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Mapper code
public void map(Object key, Text value, Context context
                ) throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);
  }
}
Mapper output
For two files there will be two mappers.
For the given sample input, the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
Set Combiner
We defined a combiner in the code:
job.setCombinerClass(IntSumReducer.class);
Combiner output
The output of each map is passed through the local combiner for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Reducer code
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
Reducer output
The reducer sums up the values. The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The Cascading core components
● Tap (Data resource)
○ Source (Data input)
○ Sink (Data output)
● Pipe (Data stream)
● Filter (Data operation)
● Flow (assembly of Taps and Pipes)
WordCount in Cascading Visualization
source (Document Collection), sink (Word Count), pipes (Tokenize, Count)
WordCount in Cascading Cont.
// define source and sink Taps
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );
// For each input Tuple,
// parse out each word into a new Tuple with the field name "word".
// Regular expressions are optional in Cascading.
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );
WordCount in Cascading
// For every Tuple group,
// count the number of occurrences of "word" and store the result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties; tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
Diagram of Cascading Flow
Scalding
● Extension to Cascading
● Programming language is Scala instead of Java
● Good for functional programming paradigms in data applications
● Pro: code can be very compact!
WordCount in Scalding
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TypedPipe.from(TextLine(args("input")))
.flatMap { line => line.split("""\s+""") }
.groupBy { word => word }
.size
.write(TypedTsv(args("output")))
}
Summingbird
● An open source project from Twitter
● An API that is compiled to Scalding and to Storm topologies
● Can be written in Java or Scala
● Pro: useful when you want to use the Lambda Architecture and write one code base that runs on both Hadoop and Storm
WordCount in Summingbird
def wordCount[P <: Platform[P]]
(source: Producer[P, String], store: P#Store[String, Long]) =
source.flatMap { sentence =>
toWords(sentence).map(_ -> 1L)
}.sumByKey(store)
Systems that replace MapReduce
Spark
● Part of the Apache project
● Replaces MapReduce with its own engine that works much faster without compromising consistency
● Architecture not based on MapReduce but rather on two concepts: RDD (Resilient Distributed Dataset) and DAG (Directed Acyclic Graph)
● Pros:
○ Works much faster than MapReduce
○ Fast-growing community
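To contrast with the MapReduce WordCount above, here is a minimal WordCount sketch against the Spark 1.x Java API (Java 8 lambdas assumed; input and output paths come from the command line):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word count");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // transformations only build the DAG over RDDs; nothing executes yet
    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")))
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);
    // the action triggers actual execution of the DAG
    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}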
Impala
● Open source from Cloudera
● Used for interactive queries with SQL syntax
● Replaces MapReduce with its own Impala Server
● Pro: can get much faster response time for SQL over HDFS than Hive or Pig
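Impala accepts largely the same SQL dialect as Hive, so queries like the Hive examples above apply; Cloudera also documents access through a HiveServer2-compatible JDBC driver. A hedged sketch, assuming the documented default Impala JDBC port 21050 and a hypothetical host name:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuery {
  public static void main(String[] args) throws Exception {
    // register the HiveServer2 driver (hive-jdbc jar on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // 21050 is Impala's default port for JDBC/HiveServer2 clients
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://impala-host:21050/;auth=noSasl");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT name FROM tikal");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}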
Impala benchmark
Note: Impala runs over Parquet!
Impala replaces MapReduce
Impala architecture
● Impala architecture was inspired by Google Dremel
● MapReduce is great for functional programming, but not efficient for SQL
● Impala replaces MapReduce with a Distributed Query Engine that is optimized for fast queries
Dremel architecture
Dremel: Interactive Analysis of Web-Scale Datasets
Impala architecture
Presto, Drill, Tez
● Several more alternatives:
○ Presto by Facebook
○ Apache Drill pushed by MapR
○ Apache Tez pushed by Hortonworks
● All are alternatives to Impala and do more or less the same: provide faster response time for queries over HDFS
● Each of the above claims to have very fast results
● Be careful of the benchmarks they publish: to get better results they use indexed data rather than sequential files in HDFS (e.g., ORC File, Parquet, HBase)
Complementary Databases
HBase
● Apache project
● NoSQL cluster database that can grow linearly
● Can store billions of rows × millions of columns
● Storage is based on HDFS
● API integrates with MapReduce
● Pros:
○ Strongly consistent reads/writes
○ Good for high-speed counter aggregations
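A minimal sketch of the HBase Java client API from the 0.9x era, assuming a pre-created table named employees with a column family named info; it demonstrates a write, an atomic counter increment (the mechanism behind the high-speed counter aggregations noted above), and a strongly consistent read:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "employees");  // hypothetical, pre-created table
    // write one cell
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ran"));
    table.put(put);
    // atomic server-side counter increment
    table.incrementColumnValue(Bytes.toBytes("row-1"),
        Bytes.toBytes("info"), Bytes.toBytes("visits"), 1L);
    // strongly consistent read of the row
    Result result = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}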
Parquet
● Apache (incubator) project, initiated by Twitter & Cloudera
● Columnar file format - writes one column at a time
● Integrated with the Hadoop ecosystem (MapReduce, Hive)
● Supports Avro, Thrift and Protocol Buffers
● Pro: keeps I/O to a minimum by reading from disk only the data required for the query
Columnar format (Parquet)
Advantages of Columnar formats
● Better compression, as data is more homogeneous.
● I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data.
● When storing data of the same type in each column, we can use encodings better suited to the modern processors’ pipeline by making instruction branching more predictable.
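Building on the Avro integration noted above, a minimal write sketch against the pre-Apache parquet-avro 1.x API (the Employee schema and output path are assumptions; newer releases moved these classes to org.apache.parquet and a builder API):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // hypothetical Avro schema describing the records
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
        new Path("employees.parquet"), schema);
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", 1L);
    record.put("name", "Ran");
    writer.write(record);  // values are buffered and laid out column by column
    writer.close();
  }
}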
Utilities
Flume
● Cloudera product
● Used to collect files from distributed systems and send them to a central repository
● Designed for integration with HDFS but can write to other file systems
● Supports listening to TCP and UDP sockets
● Main use case: collect distributed logs to HDFS (see the configuration sketch below)
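A minimal agent configuration sketch for that use case; the agent name (agent1), the tailed log path, and the HDFS URL are hypothetical:

# one source, one in-memory channel, one HDFS sink
agent1.sources = logsrc
agent1.channels = mem
agent1.sinks = hdfssink

# tail an application log
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app.log
agent1.sources.logsrc.channels = mem

# buffer events in memory
agent1.channels.mem.type = memory

# write the events to HDFS
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://hostname:54310/flume/logs
agent1.sinks.hdfssink.channel = mem

Start the agent with:
$ flume-ng agent --name agent1 --conf-file flume.conf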
Avro
● An Apache project
● Data serialization by schema
● Supports rich data structures, defined in JSON-like syntax
● Supports schema evolution
● Integrated with the Hadoop I/O API
● Similar to Thrift and Protocol Buffers
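A minimal serialization sketch using the Avro Java API; the Employee record is a hypothetical example of the JSON-like schema syntax:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // schema defined in Avro's JSON syntax
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    GenericRecord employee = new GenericData.Record(schema);
    employee.put("id", 1L);
    employee.put("name", "Ran");
    // write a schema-tagged container file
    DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
        new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("employees.avro"));
    writer.append(employee);
    writer.close();
  }
}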
Oozie
● An Apache project
● Workflow scheduler for Hadoop jobs
● Very close integration with the Hadoop API
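A minimal workflow.xml sketch that runs a single MapReduce action; the workflow name, input/output paths, and the ${jobTracker}/${nameNode} parameters are hypothetical and come from the job's properties file:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="wordcount-wf">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/joe/wordcount/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/joe/wordcount/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>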
Mesos
● Apache project
● Cluster manager that abstracts resources
● Integrated with Hadoop to allocate resources
● Scalable to 10,000 nodes
● Supports physical machines, VMs, Docker
● Multi-resource scheduler (memory, CPU, disk, ports)
● Web UI for viewing cluster status