Apache Hadoop YARN and Tez: Future of Data Processing with Hadoop
Chris Harris [email protected]
Agenda
• Overview
• Status Quo
• Architecture
• Improvements and Updates
• Tez
Hadoop MapReduce Classic
• JobTracker
  – Manages cluster resources and job scheduling
• TaskTracker
  – Per-node agent
  – Manages tasks
Current Limitations
• Scalability
  – Maximum cluster size – 4,000 nodes
  – Maximum concurrent tasks – 40,000
  – Coarse synchronization in JobTracker
• Single point of failure
  – Failure kills all queued and running jobs
  – Jobs need to be re-submitted by users
  – Restart is very tricky due to complex state
Current Limitations
• Hard partition of resources into map and reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms
  – Iterative applications implemented using MapReduce are 10x slower
  – Hacks for the likes of MPI/graph processing
• Lack of wire-compatible protocols
  – Client and cluster must be of the same version
  – Applications and workflows cannot migrate to different clusters
Requirements
• Reliability
• Availability
• Utilization
• Wire Compatibility
• Agility & Evolution
  – Ability for customers to control upgrades to the grid software stack
• Scalability – clusters of 6,000–10,000 machines
  – Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks
  – 100,000+ concurrent tasks
  – 10,000 concurrent jobs
Design Centre
• Split up the two major functions of JobTracker
  – Cluster resource management
  – Application life-cycle management
• MapReduce becomes user-land library
Concepts
• Application
  – An application is a job submitted to the framework
  – Example: a MapReduce job
• Container
  – Basic unit of allocation
  – Example: container A = 2GB, 1 CPU
  – Fine-grained resource allocation
  – Replaces the fixed map/reduce slots
Architecture
• Resource Manager
  – Global resource scheduler
  – Hierarchical queues
• Node Manager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – E.g. the MapReduce Application Master
Architecture
[Architecture diagram, shown in three progressive builds: clients submit jobs to the ResourceManager; per-node NodeManagers report node status; each application's App Master sends resource requests to the ResourceManager and launches containers on the NodeManagers; MapReduce status flows from the containers back to the App Master.]
Improvements vis-à-vis classic MapReduce
• Utilization
  – Generic resource model: data locality (node, rack, etc.), memory, CPU, disk b/w, network b/w
  – Removes the fixed partition of map and reduce slots
• Scalability
  – Application life-cycle management is very expensive
  – Partitions resource management and application life-cycle management
  – Application management is distributed
  – Hardware trends – currently run clusters of 4,000 machines
    • 6,000 machines in 2012 > 12,000 machines in 2009
    • <16+ cores, 48/96G, 24TB> vs. <8 cores, 16G, 4TB>
• Fault tolerance and availability
  – Resource Manager
    • No single point of failure – state saved in ZooKeeper (coming soon)
    • Application Masters are restarted automatically on RM restart
  – Application Master
    • Optional failover via application-specific checkpoint
    • MapReduce applications pick up where they left off via state saved in HDFS
• Wire compatibility
  – Protocols are wire-compatible
  – Old clients can talk to new servers
  – Rolling upgrades
Improvements vis-à-vis classic MapReduce
• Innovation and Agility
  – MapReduce now becomes a user-land library
  – Multiple versions of MapReduce can run in the same cluster (à la Apache Pig)
    • Faster deployment cycles for improvements
  – Customers upgrade MapReduce versions on their schedule
  – Users can customize MapReduce
To Talk to Tez, You Have to First Talk YARN
• New processing layer in Hadoop 2.0 that decouples resource management from application management
• Created to manage resource needs across all uses
• Ensures predictable performance & QoS for all apps
• Enables apps to run “IN” Hadoop rather than “ON” it
  – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.
Applications Run Natively IN Hadoop
[Stack diagram]
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Application types running on YARN: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …)
Tez: High Throughput and Low Latency
Tez runs in YARN to accelerate both high-throughput AND low-latency processing.
Tez - Core Idea
• Task with pluggable Input, Processor & Output
• A YARN ApplicationMaster runs a DAG of Tez tasks
• Tez Task = <Input, Processor, Output>
Tez Hive Performance
• Low-level data-processing execution engine on YARN
• Base for MapReduce, Hive, Pig, Cascading, etc.
• Re-usable data processing primitives (e.g. sort, merge, intermediate data management)
• All of a Hive SQL query can be expressed as a single job
  – Jobs are no longer interrupted (efficient pipeline)
  – Avoids writing intermediate output to HDFS when performance outweighs job re-start cost (speed and network/disk usage savings)
  – Breaks the MR contract to turn MRMRMR into MRRR (flexible DAG)
• Removes task and job overhead (a 10s saving is huge for a 2s query!)
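As a minimal sketch of how a query is pointed at Tez once Hive can run on it (an assumption here: the hive.execution.engine property ships with Hive 0.13 and later, after the Hive 11 used in the demo below):

-- Hedged sketch: requires a Hive release built with Tez support;
-- hive.execution.engine is not available in Hive 11 itself.
SET hive.execution.engine=tez;
-- The same HiveQL now compiles to a single Tez DAG instead of a chain
-- of MR jobs; intermediate results stream between DAG vertices instead
-- of being written to HDFS between jobs.
SELECT col5, avg(col6) FROM store_sales_fact GROUP BY col5;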
Tez Service
• MR query startup is expensive
  – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s)
• Tez Service
  – An always-on pool of Application Masters
  – Hive (and soon others, like Pig) jobs are executed on an Application Master from the pool instead of starting a new Application Master (saving precious seconds)
  – Container pre-allocation / warm containers (see the sketch after this list)
• In summary…
  – Removes job-launch overhead (Application Master pool)
  – Removes task-launch overhead (pre-warmed containers)
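A hedged sketch of how warm containers are switched on in later Hive-on-Tez releases (assumption: the hive.prewarm.* properties, which appeared after this deck; earlier Tez Service previews wired this up differently):

-- Hedged sketch: property names from later Hive-on-Tez releases.
SET hive.prewarm.enabled=true;        -- keep a pool of warm containers
SET hive.prewarm.numcontainers=10;    -- size of the pre-allocated pool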
Stinger Initiative: Making Apache Hive 100X Faster
[Stinger roadmap diagram – timeline: Current, 3–9 months, 9–18 months]
• Hive
  – Base optimizations: generate simplified DAGs, in-memory hash joins
  – More SQL: subqueries
  – Deep analytics: SQL-compatible types, SQL-compatible windowing
  – Query planner: intelligent cost-based optimizer
  – Vector query engine: optimized for modern processor architectures
• Hadoop
  – YARN: next-gen Hadoop data processing framework
  – Tez: express data processing tasks more simply, eliminate disk writes
  – Hive Tez Service: pre-warmed containers, low-latency dispatch
  – ORCFile: column store, high compression, predicate/filter pushdowns
  – Buffer caching: cache accessed data, optimized for the vector engine
Demo Benchmark Spec
• The TPC-DS benchmark data + query set
• Query 27 – a big table (store_sales) joins lots of small tables
  – a.k.a. a star-schema join
• What does Query 27 do? For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, and average coupon amount for a given gender, marital status, education and customer demographic.
SELECT col5, avg(col6)
FROM store_sales_fact ssf
join item_dim on (ssf.col1 = item_dim.col1)
join date_dim on (ssf.col2 = date_dim.col2)
join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)
join store_dim on (ssf.col4 = store_dim.col4)
GROUP BY col5
ORDER BY col5
LIMIT 100;
Query 27 - Star Schema Join
• Derived from TPC-DS Query 27
[Diagram: table sizes – the fact table is 41 GB; the dimension tables are 58 MB, 11 MB, 80 MB and 106 KB]
Benchmark Cluster Specs
• 6-node HDP 1.3 cluster
  – 2 master nodes
  – 4 data/slave nodes
• Slave node specs
  – Dual processors
  – 14 GB
Query 27 Execution Before Hive 11 – Text Format
• The query spawned 5 MR jobs
• The intermediate output of each job is written to HDFS
• Job 1 of 5 – mappers stream the fact table and the first dimension table, sorted by join key; reducers perform the join between the two tables
• Job 2 of 5 – mappers and reducers join the output of job 1 with the second dimension table
• The last job performs the order-by and group-by operations
• Query response time: 21 minutes
• 149 mappers were executed in total
Query 27 Execution With Hive 11 – Text Format
• The query spawned 1 job with Hive 11, compared to 5 MR jobs with Hive 10
• Job 1 of 1 – each mapper loads the 4 small dimension tables into memory and streams parts of the large fact table; the joins then occur in the mapper, hence the name MapJoin
• Large increase in performance with Hive 11: query time went down from 21 minutes to about 4 minutes
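A hedged sketch of the standard Hive properties behind this behaviour (exact thresholds and defaults vary by release):

-- Let Hive automatically convert joins against small tables into
-- in-memory map joins.
SET hive.auto.convert.join=true;
-- Tables under this size (in bytes) count as small enough to be loaded
-- into each mapper's memory.
SET hive.mapjoin.smalltable.filesize=25000000;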
Query 27 Execution With Hive 11 – RC Format
• Converting from Text to the RC file format decreased the size of the data set from 38 GB to 8.21 GB
• A smaller file means less I/O, so the query time decreased from 246 seconds to 136 seconds
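A minimal sketch of the conversion (the target table name is illustrative, not from the deck; STORED AS RCFILE is standard HiveQL):

-- Hedged sketch: materialize an RCFile copy of the text-format table.
CREATE TABLE store_sales_fact_rc
STORED AS RCFILE
AS SELECT * FROM store_sales_fact;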
Query 27 Execution With Hive 11 – ORC Format
• The ORC file type packs data more tightly than RCFile: the size of the data set decreased from 8.21 GB to 2.83 GB
• A smaller file means less I/O, so the query time decreased from 136 seconds to 104 seconds
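The same conversion targeting ORC, as a hedged sketch (ORC ships with Hive 11; the orc.compress table property selects the compression codec):

-- Hedged sketch: ORC copy with ZLIB compression (table name illustrative).
CREATE TABLE store_sales_fact_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB")
AS SELECT * FROM store_sales_fact;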
Query 27 Execution With Hive 11 – RC Format, Partitioned, Bucketed and Sorted
• Partitioning the data decreases the input size to 1.73 GB
• A smaller input means less I/O, so the query time decreased from 136 seconds to 104 seconds
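A hedged sketch of such a layout (partition and bucket columns are illustrative guesses based on the schematic query above; swap STORED AS RCFILE for STORED AS ORC to get the variant on the next slide):

-- Hedged sketch: in Hive 0.x, bucketing must be enforced at load time.
SET hive.enforce.bucketing=true;
CREATE TABLE store_sales_fact_part (
  col1 INT, col3 INT, col4 INT, col5 STRING, col6 DOUBLE)
PARTITIONED BY (col2 INT)           -- e.g. the date join key
CLUSTERED BY (col1) SORTED BY (col1) INTO 32 BUCKETS
STORED AS RCFILE;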
Query 27 Execution With Hive 11 – ORC Format, Partitioned, Bucketed and Sorted
• Partitioned data with the ORC file format produces the smallest input size: 687 MB, down from 43 GB
• The smallest input size gives the fastest response time for the query: 79 seconds
Summary of Results
File Type                        | MR Jobs | Input Size | Mappers | Time
Text/Hive 10                     | 5       | 43.1 GB    | 179     | 21 minutes
Text/Hive 11                     | 1       | 38 GB      | 151     | 246 seconds
RC/Hive 11                       | 1       | 8.21 GB    | 76      | 136 seconds
ORC/Hive 11                      | 1       | 2.83 GB    | 38      | 104 seconds
RC/Hive 11/Partitioned/Bucketed  | 1       | 1.73 GB    | 19      | 104 seconds
ORC/Hive 11/Partitioned/Bucketed | 1       | 687 MB     | 27      | 79.62 seconds
Resources
hadoop-2.0.3-alpha: http://hadoop.apache.org/common/releases.html
Release documentation: http://hadoop.apache.org/common/docs/r2.0.3-alpha/
Blogs:
http://hortonworks.com/blog/category/apache-hadoop/yarn/
http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
Thank You! Questions & Answers