© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark
YARN
Hadoop Machine Types
[Figure: network diagram of a cluster with a core switch and top-of-rack switches]
• Master nodes run Hadoop master processes to manage and coordinate cluster services and tasks.
• Slave nodes run Hadoop slave processes and provide cluster resources to perform data processing.
• Client machines have client-side software used to access a cluster to process data.
How Hadoop Processes Data
• Hadoop has historically processed data using
MapReduce.
• MapReduce has been the basis for Hadoop’s data
processing scalability.
– MapReduce processes the data on each slave node in parallel
and then aggregates the results.
• The secret to performance and scalability is to move the processing to
the data rather than move the data to the processing.
• Doing so significantly reduces network I/O traffic.
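The idea of moving the processing to the data can be sketched in a few lines of plain Python. This is only a toy model, not Hadoop code: the node names, the sample lines, and the helper functions `map_on_node` and `reduce_counts` are all made up for illustration, and the nodes are processed one after another rather than in parallel.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each slave node holds its own slice of the data locally.
node_data = {
    "node1": ["yarn tez spark", "hdfs yarn"],
    "node2": ["spark spark hdfs"],
    "node3": ["tez yarn"],
}

def map_on_node(lines):
    """Map step: runs where the data lives and returns a small partial count."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merges two partial results into one."""
    return left + right

# Only the compact partial counts leave each node, not the raw lines.
partials = [map_on_node(lines) for lines in node_data.values()]
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

In this sketch the raw lines never leave their node; only the small per-node counters are combined, which is the reason network traffic stays low.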
Hadoop Version 2.x
• Hadoop 2.x has two core
components.
– HDFS provides distributed,
scalable, and highly available data
storage.
– YARN provides distributed,
scalable, and highly available
processing.
[Figure: Hadoop 2.x stack. DATA MANAGEMENT: HDFS (Hadoop Distributed File System) with YARN: Data Operating System on top, drawn over a grid of nodes numbered 1 to N. DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase), Stream (Storm), Search (Solr), and Others (In-Memory Analytics, ISV engines). Tez appears twice in the diagram.]
HDFS is a Distributed File System
[Figure: a large data file is split into blocks A, B, and C, which are stored on the slave nodes (DataNodes) as dataA, dataB, and dataC. The master node (NameNode) holds the block locations. MR tasks are shown alongside the data.]
• HDFS automatically:
– Splits large files into blocks
– Spreads blocks across the cluster
– Tracks block locations
– Replicates blocks (not shown)
• Distributed applications like MapReduce get block information to access and analyze data.
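The four HDFS behaviors listed on this slide (split, spread, track, replicate) can be illustrated with a small simulation. This is a sketch under stated assumptions, not HDFS code: the block size and replication factor are assumed values chosen for illustration, the DataNode names are invented, `place_blocks` is a hypothetical helper, and the round-robin placement is a simplification of what a real cluster does.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size, for illustration only
REPLICATION = 3       # assumed number of copies kept of each block

def place_blocks(file_size_mb, datanodes):
    """Return a NameNode-style map of block id -> DataNodes holding a replica."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)  # split the file
    locations = {}
    for block_id in range(num_blocks):
        # Simplified round-robin placement; assumes REPLICATION <= len(datanodes)
        # so that every replica of a block lands on a different node.
        replicas = [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(REPLICATION)
        ]
        locations[block_id] = replicas  # track where each block lives
    return locations

datanodes = ["dn1", "dn2", "dn3", "dn4"]
block_locations = place_blocks(500, datanodes)
for block_id, replicas in block_locations.items():
    print(block_id, replicas)
```

A distributed application would consult a map like `block_locations` to decide which node should process which block.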
Hadoop Data Operating System
• Apache Hadoop YARN is the data operating system for
Hadoop 2.
• YARN is:
– Responsible for scheduling
tasks and managing CPU
and memory resources
– Designed to enable multiple
distributed applications to utilize
cluster resources in a shared,
secure, and multi-tenant manner
[Figure: Hadoop 2.x stack. DATA MANAGEMENT: HDFS (Hadoop Distributed File System) with YARN: Data Operating System on top, drawn over a grid of nodes numbered 1 to N. DATA ACCESS: Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase), Stream (Storm), Search (Solr), and Others (In-Memory Analytics, ISV engines). Tez appears twice in the diagram.]
A Little History
• In Hadoop version 1.x, MapReduce was more than just a
data processing application.
– MapReduce was also
the Hadoop cluster’s
scheduler and resource
manager.
• In Hadoop 2.x, YARN
replaced MapReduce
for scheduling and
resource management.
[Figure: Hadoop 1.x compared with Hadoop 2.x. Hadoop 1.x: "MapReduce: Scheduling and Resource Management" over HDFS (Hadoop Distributed File System), with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), and NoSQL (HBase). Hadoop 2.x: "YARN: Data Operating System" over HDFS, with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase), Stream (Storm), Search (Solr), and Others (In-Memory Analytics, ISV engines). Tez appears twice in the Hadoop 2.x diagram.]
Why the Move to YARN?
• YARN is a generic scheduler and resource manager to
support applications other than just MapReduce.
• MapReduce is not suitable for every type of data
processing workload.
– The problem is that MapReduce is by nature batch processing.
Batch is not suitable for:
• Processing streaming data
• Performing real-time analytics
• Record fetching
• High-speed iterative processing
Hadoop Before YARN
• Separate clusters were often
deployed, which:
– Ensured different workloads
received sufficient resources
– Wasted time and money on
additional deployment and
management tasks
– Created data silos that forced
additional data transfers
[Figure: two separate clusters, clusterA and clusterB, one for batch processing and one for interactive processing, with data ingest, a transfer between the clusters, and results.]
Hadoop After YARN
• YARN transformed Hadoop
into a generic, distributed
operating system.
– HDFS is a distributed file
system.
– YARN is a distributed
scheduler.
– The combination gives a single
Hadoop cluster multi-tenant
capability to run distributed
applications of many types.
[Figure: batch, real-time, streaming, and iterative applications running on YARN distributed processing over HDFS distributed storage.]
Tez
Tez, an Alternative to MapReduce
• Tez is an alternative
to the traditional
MapReduce
framework.
– It meets the demands
for fast response
times and extreme
throughput at
petabyte scale.
[Figure: Hadoop 1.x compared with Hadoop 2.x. Hadoop 1.x: "MapReduce: Scheduling and Resource Management" over HDFS (Hadoop Distributed File System), with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), and NoSQL (HBase). Hadoop 2.x: "YARN: Data Operating System" over HDFS, with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase), Stream (Storm), Search (Solr), and Others (In-Memory Analytics, ISV engines). Tez appears twice in the Hadoop 2.x diagram.]
Inefficiencies in MapReduce
• To understand how Tez accelerates query processing it is
helpful to understand some inefficiencies in MapReduce.
– These inefficiencies make MapReduce suitable only for batch
processing.
• Causes of MapReduce inefficiencies are:
– HDFS and local storage use
– Requirement of map phase before reduce phase
– Hadoop containers
• A container is an abstraction that represents a discrete amount of slave node CPU and memory resources.
• Resources in one container are logically isolated from the resources of other containers.
• Applications run inside containers.
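The container abstraction described above can be modeled with two small data classes. This is a minimal sketch, not the YARN API: `Container`, `SlaveNode`, `allocate`, the application names, and the resource figures are all hypothetical and exist only to show a fixed amount of CPU and memory being carved out of one node.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    app: str          # the application running inside this container
    vcores: int       # CPU share reserved for the container
    memory_mb: int    # memory reserved for the container

@dataclass
class SlaveNode:
    vcores: int
    memory_mb: int
    containers: list = field(default_factory=list)

    def allocate(self, app, vcores, memory_mb):
        """Grant a container only if the node still has enough free resources."""
        used_cores = sum(c.vcores for c in self.containers)
        used_mem = sum(c.memory_mb for c in self.containers)
        if used_cores + vcores > self.vcores or used_mem + memory_mb > self.memory_mb:
            return None  # request cannot be satisfied on this node
        container = Container(app, vcores, memory_mb)
        self.containers.append(container)
        return container

node = SlaveNode(vcores=8, memory_mb=16384)
first = node.allocate("mapreduce-job", vcores=4, memory_mb=8192)
second = node.allocate("tez-query", vcores=4, memory_mb=8192)
third = node.allocate("spark-app", vcores=1, memory_mb=1024)  # node is already full
print(first, second, third)
```

Each granted container has its own reserved slice of the node, so the third request is refused once the first two have consumed all of the node's resources.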
MapReduce and HDFS
• MapReduce uses HDFS
storage to store temporary
data between MapReduce
jobs.
• Local storage is used to
store temporary data
between map and reduce
phases.
– Storage I/O adds significant
overhead to the overall job.
[Figure: a sequence of MapReduce jobs, each with map (M) and reduce (R) tasks, with HDFS between jobs and temporary data between the map and reduce phases.]
Tez and HDFS
[Figure: the same chains of map (M) and reduce (R) stages drawn twice. "Map and Reduce over MapReduce" shows HDFS between the stages; "Map and Reduce over Tez" shows fewer intermediate HDFS steps.]
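The difference between the two halves of this comparison can be approximated with a toy model that only counts storage round trips. It is a sketch under a simplifying assumption, namely that a chain of MapReduce jobs writes every job's output to HDFS while a single Tez DAG writes only the final result. The functions `run_as_mapreduce_chain` and `run_as_single_dag` and the three stages are hypothetical and do not use any Hadoop or Tez API.

```python
def run_as_mapreduce_chain(stages, data):
    """Run each stage as its own job, materializing every job's output."""
    hdfs_writes = 0
    for stage in stages:
        data = stage(data)
        hdfs_writes += 1  # each job writes its output to HDFS before the next starts
    return data, hdfs_writes

def run_as_single_dag(stages, data):
    """Run all stages as one DAG, writing only the final result."""
    for stage in stages:
        data = stage(data)  # intermediate data is handed on without an HDFS round trip
    return data, 1

stages = [
    lambda rows: [r for r in rows if r % 2 == 0],  # filter
    lambda rows: [r * 10 for r in rows],           # transform
    lambda rows: [sum(rows)],                      # aggregate
]
rows = list(range(10))
mr_result, mr_writes = run_as_mapreduce_chain(stages, rows)
dag_result, dag_writes = run_as_single_dag(stages, rows)
print(mr_result, dag_result, mr_writes, dag_writes)
```

Both runs produce the same answer; only the number of simulated HDFS writes differs, which is the storage I/O overhead the slides attribute to MapReduce.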
Tez is Simple
• Tez is a completely client-side implementation.
– Tez is a set of client-side libraries.
– There is no server to deploy or manage.
• Tez is not meant for end-users.
– Developers use the Tez API to create better end-user
applications.
– Tez applications:
• Support batch and interactive data processing applications
• Integrate with YARN
• Perform well in a mixed application workload cluster
Spark
Apache Spark
• Apache Spark is an open source, general purpose
processing engine used to build and run fast and
sophisticated applications.
– It features a simple set of APIs to write applications in Scala,
Java, or Python.
• The processing engine and applications run on Hadoop 2.
– It leverages Hadoop’s horizontal scale out capabilities.
• It is YARN-ready.
– You can process a single copy of data in multiple ways using
the same cluster.
Spark RDD – Scalability and Performance
• To leverage Hadoop’s horizontal
scalability:
– Spark processes data in a Resilient
Distributed Dataset (RDD).
• It is a fault-tolerant collection of data elements.
– An RDD is stored in memory or on disk.
– Each RDD is distributed across Hadoop
slave nodes.
• Enables parallel processing across the cluster
[Figure: RDD partitions spread across the RAM of several nodes. Labels: on-disk RDD, 10x MapReduce performance; in-memory RDD, 100x MapReduce performance.]
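The way an RDD is split across nodes so that each piece can be processed independently can be sketched in plain Python. `MiniRDD` is a hypothetical stand-in written for this illustration; it is not the Spark API, it keeps everything in local memory, and threads merely play the role of slave nodes.

```python
from concurrent.futures import ThreadPoolExecutor

class MiniRDD:
    """Hypothetical, in-memory stand-in for an RDD; not the Spark API."""

    def __init__(self, partitions):
        self.partitions = partitions  # one list of elements per "slave node"

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Deal elements round-robin so each partition gets a similar share.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Each partition is transformed independently of the others,
        # so the work can be handed to separate workers.
        with ThreadPoolExecutor() as pool:
            new_parts = list(
                pool.map(lambda part: [fn(x) for x in part], self.partitions)
            )
        return MiniRDD(new_parts)

    def collect(self):
        # Gather the elements of all partitions back into one list.
        return [x for part in self.partitions for x in part]

rdd = MiniRDD.parallelize(list(range(8)), num_partitions=4)
squares = rdd.map(lambda x: x * x)
print(squares.partitions)
print(sorted(squares.collect()))
```

Because no partition depends on another during `map`, adding nodes (here, partitions) is what lets the same job scale out horizontally.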
Spark High-Level Tools
• The Spark Engine supports
four high-level tools to build
applications.
– Spark SQL
– Spark Streaming
– MLlib
– GraphX
[Figure: four tools on top of the Apache Spark Engine: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation).]
Spark SQL
– Use Spark SQL for interactive or batch queries on streaming or
historical data.
• Perform queries in Scala, Java, and Python programs using integrated APIs.
• It queries structured data as a SchemaRDD.
– A SchemaRDD is an RDD of row objects that has an associated schema.
– SchemaRDDs are registered as tables and used in FROM clauses in SQL statements.
– SchemaRDDs can be used in relational queries, as well as in standard RDD functions.
– Spark SQL reuses an existing Apache Hive frontend and metastore.
• This makes it compatible with existing Hive data, queries, and UDFs.
– Spark SQL includes a server mode with standard ODBC and JDBC
connectors.
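The idea of rows with an associated schema being registered as a table and then used in a FROM clause can be shown without a Spark installation. The sketch below is not Spark SQL: it swaps in Python's built-in `sqlite3` module as a stand-in, and the `people` table, its columns, and its rows are invented for the example.

```python
import sqlite3

# Row objects plus an associated schema, loosely mirroring the SchemaRDD idea.
schema = [("name", "TEXT"), ("age", "INTEGER")]
rows = [("Ana", 34), ("Raj", 19), ("Mei", 27)]

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{col} {col_type}" for col, col_type in schema)
conn.execute(f"CREATE TABLE people ({columns})")   # register the rows as a table
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# The registered table can now appear in a FROM clause.
adults = conn.execute(
    "SELECT name FROM people WHERE age >= 21 ORDER BY name"
).fetchall()
print(adults)
conn.close()
```

In Spark SQL the table would be backed by a distributed SchemaRDD rather than a local database, but the query side looks much the same.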
Decisions, decisions, decisions…
Data Processing Options – Spark vs. Tez
• Three Common Options
– Hive on Tez
– Hive on Spark
– Spark SQL
Hive on Tez vs. Hive on Spark
• Hive on Tez outperforms Hive on Spark
– Hive tends to be bound by CPU rather than I/O, especially with
the introduction of columnar file formats
– Spark spends time translating from RDDs to Hive’s native “Row
Containers”
• This ends up consuming more CPU, disk, and network I/O
– Tez is a framework for building special-purpose engines,
whereas Spark is a general-purpose engine
• Hive on Tez is optimized for typical Hive operations
Hive on Tez vs. Spark SQL
• Depends on size of dataset
– Less than 200 GB, Spark SQL wins
– 200 GB and greater, Hive on Tez wins
• The larger the dataset, the greater the discrepancy in performance
• http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
Tez vs. Spark
BUT…
• Spark, like all other Hadoop projects, is evolving.
Performance metrics are likely to change
– …as will those for Tez applications, etc.
• Your mileage will vary, and performance variance today
may not be the same as performance variance tomorrow
– Beware of the word “always”