Introduction to YARN

Post on 16-Apr-2017


Bhupesh Chawda
bhupesh@apache.org

DataTorrent

Introduction to YARN: Next Gen Hadoop

Image Source: https://memegenerator.net/instance/64508420

Why YARN

Hadoop v1 (MR1) Architecture

● Job Tracker
  ○ Manages cluster resources
  ○ Job scheduling
  ○ Bottleneck
● Task Tracker
  ○ Per-node agent
  ○ Manages tasks
  ○ Map / Reduce task slots

[Diagram: MR1 architecture with two Clients, one JobTracker, and three TaskTrackers, each running Tasks. Arrows are labeled "Job Submission" and "MapReduce Status".]

Limitations with MR1

• Scalability
  Maximum cluster size: 4,000 nodes
  Maximum concurrent tasks: 40,000
• Availability - Job Tracker is a single point of failure (SPOF)
• Resource Utilization - Map / Reduce slots
• Runs only MapReduce applications
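The resource utilization limitation can be illustrated with a toy calculation. This is a minimal sketch, not Hadoop code: the node size, the slot counts, and both function names are made up for illustration. It assumes a map task can only occupy a map slot and a reduce task only a reduce slot, while generic capacity can be used by either kind of task.

```python
# Toy model, not Hadoop code: all numbers and names here are illustrative assumptions.

def run_with_slots(map_slots, reduce_slots, pending_map, pending_reduce):
    """MR1 style: map tasks fit only in map slots, reduce tasks only in reduce slots."""
    running_map = min(map_slots, pending_map)
    running_reduce = min(reduce_slots, pending_reduce)
    return running_map + running_reduce


def run_with_containers(total_capacity, pending_map, pending_reduce):
    """Generic capacity: any pending task may use any free unit."""
    return min(total_capacity, pending_map + pending_reduce)


# A hypothetical node with 4 map slots and 2 reduce slots (6 units in total),
# during a map-heavy phase: 10 map tasks waiting, no reduce tasks yet.
slot_tasks = run_with_slots(map_slots=4, reduce_slots=2, pending_map=10, pending_reduce=0)
container_tasks = run_with_containers(total_capacity=6, pending_map=10, pending_reduce=0)

print(f"fixed slots: {slot_tasks} of 6 units busy")
print(f"containers:  {container_tasks} of 6 units busy")
```

In this made-up scenario the two reduce slots sit idle even though map tasks are waiting, which is the kind of waste the slot model can cause.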

Why YARN (Cont…)

Introducing YARN

● YARN - Yet Another Resource Negotiator
● Framework that facilitates writing arbitrary distributed processing frameworks and applications.
● YARN applications/frameworks: e.g. MapReduce2, Apache Spark, Apache Giraph, Apache Apex, etc.

Image Source: http://tm.durusau.net/?cat=1525

Hadoop beyond Batch

YARN for better resource utilization

More applications than MapReduce

Comparing MapReduce with YARN

MapReduce                  YARN
Job Tracker                Resource Manager + Application Master
Task Tracker               Node Manager
Map Slot / Reduce Slot     Container

Backward Compatibility Maintained!

● Existing MapReduce jobs run as-is on the YARN framework

● No Job Tracker and Task Tracker processes

• Resource Manager
  Manages and allocates cluster resources
  Application scheduling
  Applications Manager
• Node Manager
  Per-machine agent
  Manages the life cycle of containers
  Monitors resources
• Application Master
  Per-application
  Manages application scheduling and task execution
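The split between a central Resource Manager that allocates cluster resources and per-machine agents that host containers can be sketched as simple bookkeeping. This is a toy model, not the YARN API: `NodeState`, `ToyResourceManager`, the memory sizes, and the "most free memory first" placement policy are all assumptions made for illustration. The real scheduler is pluggable and considerably richer.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """What this toy Resource Manager knows about one Node Manager (hypothetical)."""
    name: str
    total_mb: int
    used_mb: int = 0

    @property
    def free_mb(self):
        return self.total_mb - self.used_mb


class ToyResourceManager:
    """Tracks free memory per node and grants container requests (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 1

    def allocate(self, mb):
        """Grant a container of `mb` megabytes, or return None if no node can fit it."""
        candidates = [n for n in self.nodes if n.free_mb >= mb]
        if not candidates:
            return None
        # Assumed policy: place the container on the node with the most free memory.
        node = max(candidates, key=lambda n: n.free_mb)
        node.used_mb += mb
        container_id = f"container_{self.next_id}"
        self.next_id += 1
        return container_id, node.name


rm = ToyResourceManager([NodeState("node1", 8192), NodeState("node2", 4096)])
results = [rm.allocate(3072) for _ in range(4)]
for r in results:
    print(r)
```

With these made-up sizes, the first two requests land on node1, the third on node2, and the fourth is refused because no node has 3072 MB free.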

Hadoop v2 (YARN) Architecture

Image Source: hadoop.apache.org

Application Submission workflow

[Diagram: a YarnClient, a Resource Manager node (Applications Manager + Scheduler), and two Node Manager nodes. The Application Master and two Containers run on the Node Manager nodes.]

Legend: RM = Resource Manager, NM = Node Manager, AM = Application Master; a separate arrow style marks heartbeats.

1) Submit application
2) Launch Application Master
3) AM registers with RM
4) AM negotiates for containers
5) Launch container
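The five steps above can be traced with a small simulation. This is a sketch under stated assumptions, not the YARN client API: the three classes only share names with the YARN components, their methods (`submit`, `register`, `negotiate`, `launch`) are hypothetical, containers are handed out round-robin, and heartbeats are left out entirely.

```python
class NodeManager:
    """Per-machine agent in this toy model; it only records what it is asked to launch."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def launch(self, what):
        self.log.append(f"{self.name}: launch {what}")


class ResourceManager:
    def __init__(self, node_managers, log):
        self.node_managers = node_managers
        self.log = log

    def submit(self, app_name, containers_needed):
        self.log.append(f"1) client submits {app_name}")
        self.log.append("2) RM launches the Application Master")
        self.node_managers[0].launch(f"AM for {app_name}")
        ApplicationMaster(app_name, self).run(containers_needed)

    def register(self, app_name):
        self.log.append(f"3) AM for {app_name} registers with RM")

    def negotiate(self, count):
        self.log.append(f"4) AM negotiates for {count} containers")
        # Assumed policy: hand out Node Managers round-robin.
        return [self.node_managers[i % len(self.node_managers)] for i in range(count)]


class ApplicationMaster:
    def __init__(self, app_name, rm):
        self.app_name = app_name
        self.rm = rm

    def run(self, containers_needed):
        self.rm.register(self.app_name)
        granted = self.rm.negotiate(containers_needed)
        self.rm.log.append("5) launch containers")
        for i, nm in enumerate(granted, start=1):
            nm.launch(f"container {i} of {self.app_name}")


log = []
nms = [NodeManager("nm1", log), NodeManager("nm2", log)]
ResourceManager(nms, log).submit("demo-app", containers_needed=3)
print("\n".join(log))
```

The printed log lists the numbered steps in order, with the Node Manager launch events interleaved after steps 2 and 5.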

Application Masters - One for each Application Type

MapReduce Application → MapReduce Application Master (already provided by Hadoop as a backward compatibility option for MapReduce)
Apex Application → Apex Application Master (StrAM) (provided by Apache Apex)
Flink Application → Flink Application Master
Giraph Application → Giraph Application Master

Key Takeaways

● YARN enables non-MapReduce applications to run in a distributed fashion
● Each application first asks for a container for the Application Master
  ○ The Application Master then talks to YARN to get the resources needed by the application
  ○ Once YARN allocates the requested containers to the Application Master, the Application Master starts the application components in those containers
● Hadoop is no longer just batch processing!

References

● Simple YARN code example
  ○ https://github.com/hortonworks/simple-yarn-app
● Document references
  ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
  ○ http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
  ○ http://www.slideshare.net/
● Acknowledgements
  ○ Priyanka Gugale, DataTorrent - Slide deck

Thank You!!

Please send your questions to: bhupesh@apache.org / bhupesh@datatorrent.com