
Session9_MR2_Sep25

Date post: 02-Jun-2018
Upload: mbscribd2011

8/11/2019 Session9_MR2_Sep25

http://slidepdf.com/reader/full/session9mr2sep25 1/28

Inspire…Educate…Transform.

The best place for students to learn Applied Engineering http://www.insofe.ed

Dr. Sreerama KV Murthy

September 25, 2013

Engineering Big Data: Online Batch

Session 9: Map Reduce 2

CEO, Teqnium Consultancy Services


Refresher: What is MapReduce?

MapReduce is a programming model that Google has used successfully in processing its “big-data” sets (~20 petabytes, i.e. about 20,000 terabytes, per day)

Users specify the computation in terms of a map and a reduce function

The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines

The underlying system also handles machine failures, efficient communications, and performance issues.

Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

CCSCNE 2009, Plattsburgh, April 24 2009. B. Ramamurthy & K. Madurai
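
A minimal sketch of the model in plain Python, assuming the classic word-count task (it is not taken from the slides). The names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the single-process driver only stands in for the runtime that would normally parallelize the work across a cluster and handle failures.

```python
from collections import defaultdict


def map_fn(_key, line):
    """User-supplied map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """User-supplied reduce: add up all counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


# Input records are (byte offset, line) pairs; the mapper ignores the offset.
lines = [(0, "the quick brown fox"), (20, "the lazy dog"), (33, "The end")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the framework would provide.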


The Five MapReduce Daemons

1. NameNode

• Holds the metadata for HDFS

2. Secondary NameNode

• Performs housekeeping functions for the NameNode. It is not a backup or hot standby for the NameNode.

3. DataNode

• Stores actual HDFS data blocks

4. JobTracker

• Manages MapReduce jobs, distributes individual tasks to machines, etc.

5. TaskTracker 

• Instantiates and monitors individual Map and Reduce tasks

“Master Nodes” in the cluster run one of the blue daemons above (NameNode, Secondary NameNode, JobTracker).

“Slave Nodes” run both of the non-blue daemons (DataNode and TaskTracker).

Each daemon runs in its own Java virtual machine.


Five Daemons of MapReduce.. contd.


 YARN (MR2)


CDH-3 Map Reduce Daemons


MapReduce NextGen aka YARN aka MRv2

• Divides the two major functions of the JobTracker - resource management and job life-cycle management - into separate components

• Released in Hadoop-0.23

• The new ResourceManager manages the global assignment of compute resources to applications.

• The ResourceManager has two main components: Scheduler and ApplicationsManager.

• The Scheduler is responsible for allocating resources to various running applications subject to constraints of capacities, queues etc.


MRv2 aka YARN: JobTracker Redefined


YARN aka MRv2 (contd.)

• Scheduler performs no tracking of status for the application, and offers no guarantees about restarting failed tasks.

• The per-application ApplicationMaster manages the application’s scheduling and coordination

• The per-machine NodeManager daemon manages the user processes on that machine.

• An application is either a single MR job or a DAG of such jobs.

• The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor tasks.

• CDH4 continues to support the original MapReduce framework (i.e. the JobTracker and TaskTrackers). The old framework is referred to as MRv1.
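
The division of labour described in the YARN bullets can be sketched as a toy model. This is an illustration under stated assumptions, not the YARN API: the class names only mirror the roles, and the slot counts, task names and `allocate`/`launch`/`run` methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-machine daemon: owns the container slots of one machine."""
    name: str
    free_slots: int

    def launch(self, task):
        # A real NodeManager would start and supervise a user process here.
        return f"{task}@{self.name}"


@dataclass
class ResourceManager:
    """Toy global scheduler: hands out slots, tracks no application status."""
    nodes: list

    def allocate(self, wanted):
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < wanted:
                node.free_slots -= 1
                granted.append(node)
        return granted


@dataclass
class ApplicationMaster:
    """Toy per-application coordinator: negotiates slots and places tasks."""
    tasks: list
    placements: dict = field(default_factory=dict)

    def run(self, resource_manager):
        pending = [t for t in self.tasks if t not in self.placements]
        granted = resource_manager.allocate(len(pending))
        for task, node in zip(pending, granted):
            self.placements[task] = node.launch(task)
        return self.placements


cluster = ResourceManager([NodeManager("n1", 2), NodeManager("n2", 1)])
app = ApplicationMaster(["map-0", "map-1", "reduce-0", "reduce-1"])
placements = app.run(cluster)
print(placements)
```

With three free slots and four tasks, one task stays pending. Retrying it is the job of the `ApplicationMaster`, since the scheduler in this sketch, like the one described above, keeps no per-application state.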


A COUPLE OF USE CASES: Map-Reduce


 Yahoo: Running Production WebMap

• Search needs a graph of the “known” web

– Invert edges, compute link text, whole graph heuristics

• Periodic batch job using Map/Reduce

– Uses a chain of ~100 map/reduce jobs

• Scale

– 1 trillion edges in graph

– Largest shuffle is 450 TB

– Final output is 300 TB compressed

– Runs on 10,000 cores

– Raw disk used 5 PB

• Written mostly using Hadoop’s C++ interface


 Yahoo Research Clusters

• Mostly data mining/machine learning jobs

• Most research jobs are not Java:

– 42% Streaming

• Uses Unix text processing to define map and reduce

– 28% Pig

• Higher level dataflow scripting language

– 28% Java

– 2% C++
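
A rough sketch of the Streaming style follows. In Hadoop Streaming the mapper and reducer are ordinary executables that read lines on standard input and write tab-separated key/value lines on standard output. Here both sides are written as Python functions over lists of lines so the example runs on its own; the function names and the sample text are assumptions, and in a real job each function would be a separate script looping over `sys.stdin`.

```python
from itertools import groupby


def streaming_mapper(lines):
    """Mapper side: read raw text lines, write 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def streaming_reducer(sorted_lines):
    """Reducer side: input arrives sorted by key, one 'key<TAB>value' per line."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort (the shuffle), then reduce.
text = ["pig java pig", "java streaming pig"]
mapped = sorted(streaming_mapper(text))
reduced = list(streaming_reducer(mapped))
print(reduced)
```

The same pipeline can be imitated in a shell with `cat input | mapper | sort | reducer`, which is why Unix text tools fit this interface so well.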


Terabyte Sort Benchmark

• Started by Jim Gray at Microsoft in 1998

• Sorting 10 billion 100 byte records (1 TB in total)

• Hadoop won the general category in 209 seconds

– 910 nodes

– 2 quad-core Xeons @ 2.0 GHz / node

– 4 SATA disks / node

– 8 GB RAM / node

– 1 Gb Ethernet / node

– 40 nodes / rack

– 8 Gb Ethernet uplink / rack

• Previous record was 297 seconds

• Only hard parts were:

– Getting a total order

– Converting the data generator to map/reduce
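
"Getting a total order" is hard because each reducer sorts only its own share of the keys. One common approach, assumed here because the slide does not say how the benchmark entry solved it, is range partitioning with split points taken from a sample of the input. The helper names, sample size and reducer count below are hypothetical.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sorted sample of the input."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Route a key to the reducer whose key range contains it."""
    return bisect.bisect_right(split_points, key)


rng = random.Random(0)
keys = [rng.randrange(1_000_000) for _ in range(10_000)]
split_points = choose_split_points(rng.sample(keys, 500), num_reducers=4)

# Each "reducer" sorts only its own range of keys.
buckets = [[] for _ in range(4)]
for key in keys:
    buckets[partition(key, split_points)].append(key)
sorted_parts = [sorted(bucket) for bucket in buckets]

# Concatenating the reducer outputs in order yields a globally sorted result.
total_order = [key for part in sorted_parts for key in part]
print(total_order == sorted(keys))
```

Sampling matters for balance rather than correctness: any increasing split points give a total order, but well-chosen ones keep the reducers' workloads similar.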


NOW….THE

SHAKE-UP QUIZ !! 


KEY-VALUE PAIRS


Keys and Values


MapReduce – In more detail


MR: Logical Execution


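Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

map -> (a, 1) (b, 2)
map -> (c, 3) (c, 6)
map -> (a, 5) (c, 2)
map -> (b, 7) (c, 8)

Shuffle and Sort: aggregate values by keys

a -> [1, 5]    b -> [2, 7]    c -> [2, 3, 6, 8]

reduce -> (r1, s1)
reduce -> (r2, s2)
reduce -> (r3, s3)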
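
The shuffle-and-sort step can be traced with the values from the diagram, reading the flattened figure as four map tasks that emit (a, 1) (b, 2), (c, 3) (c, 6), (a, 5) (c, 2) and (b, 7) (c, 8). This is a sketch under that reading. The diagram only labels reducer outputs as (r, s), so the summing reducer below is an assumption.

```python
from collections import defaultdict

# Intermediate pairs emitted by the four map tasks in the diagram.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]


def shuffle_and_sort(outputs):
    """Group every value under its key; keys and values are put in sorted order.

    Values are sorted here only to match the diagram; a real framework
    sorts by key and does not by default promise any order among values.
    """
    grouped = defaultdict(list)
    for task_output in outputs:
        for key, value in task_output:
            grouped[key].append(value)
    return {key: sorted(grouped[key]) for key in sorted(grouped)}


def reduce_sum(key, values):
    """Hypothetical reducer: add up the values collected for one key."""
    return key, sum(values)


grouped = shuffle_and_sort(map_outputs)
reduced = [reduce_sum(key, values) for key, values in grouped.items()]
print(grouped)
print(reduced)
```

Each reducer sees one key together with all of its values, regardless of which map task produced them.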


SIMPLE MAPPERS & REDUCERS


Mappers

• Mappers run on nodes which hold their portion of the data locally, to avoid network traffic

• Multiple Mappers run in parallel, each processing a portion of the input data

• Mapper reads data in the form of key/value pairs

– Mapper may use, or completely ignore, the input key.

– E.g., a standard pattern is to read a line of a file at a time. The key is then the byte offset into the file at which the line starts. The value is the contents of the line itself. Typically the key is considered irrelevant.

• It outputs zero or more key/value pairs

– let map(k, v) = emit(k.toUpper(), v.toUpper())

– ('foo', 'bar') -> ('FOO', 'BAR')
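
The points above can be sketched in Python. `read_lines_with_offsets` is a hypothetical stand-in for a line-oriented input reader that produces (byte offset, line) records, `upper_mapper` is the upper-casing example from the slide, and `line_length_mapper` is an assumed example of a mapper that ignores its input key.

```python
def read_lines_with_offsets(data: bytes):
    """Yield (byte offset of the line start, line text) for each line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def upper_mapper(key, value):
    """The slide's example: emit(k.toUpper(), v.toUpper())."""
    yield key.upper(), value.upper()


def line_length_mapper(_offset, line):
    """Ignores the byte-offset key and keys the output by the line text."""
    yield line, len(line)


records = list(read_lines_with_offsets(b"foo bar\nbaz\n"))
upper_pairs = list(upper_mapper("foo", "bar"))
length_pairs = [pair for key, value in records
                for pair in line_length_mapper(key, value)]
print(records)
print(upper_pairs)
print(length_pairs)
```

Writing mappers as generators makes "zero or more key/value pairs" natural: a mapper that yields nothing simply drops its input.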


Other Examples: Explode mapper
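
The body of this slide did not survive extraction. As a sketch only, assuming that "explode" means one input pair is turned into several output pairs, a mapper could emit one pair per character of the value. The function name is hypothetical.

```python
def explode_mapper(key, value):
    """Assumed behaviour: emit one (key, character) pair per character."""
    for char in value:
        yield key, char


exploded = list(explode_mapper("foo", "bar"))
print(exploded)
```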


Example: Filter mapper 
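
The body of this slide is also missing. A minimal sketch, assuming a filter mapper passes a pair through unchanged when a predicate holds and emits nothing otherwise. The primality predicate and the function names are illustrative choices, not taken from the slide.

```python
def is_prime(n):
    """Trial division; adequate for small illustrative values."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def filter_mapper(key, value):
    """Assumed behaviour: emit the pair only when the value is prime."""
    if is_prime(value):
        yield key, value


kept = list(filter_mapper("foo", 7))
dropped = list(filter_mapper("baz", 10))
print(kept, dropped)
```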


Example: Changing Keyspaces
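
No body text was extracted for this slide either. As a sketch, assuming "changing keyspaces" means the output key need not come from the same domain as the input key, a mapper could replace a string key with the length of the value. The function name is hypothetical.

```python
def length_keyed_mapper(_key, value):
    """Assumed behaviour: discard the input key, key the output by len(value)."""
    yield len(value), value


rekeyed = list(length_keyed_mapper("foo", "bar"))
print(rekeyed)
```

After such a mapper, the shuffle groups records by length, so each reducer would receive all values of a given length.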


International School of Engineering

Plot No 63/A, 1st Floor, Road No 13, Film Nagar, Jubilee Hills, Hyderabad - 500033

For Individuals: +91-9502334561/63

For Corporates: +91-9618483483

Web: http://www.insofe.edu.in

Facebook: https://www.facebook.com/Insofe

Twitter: https://twitter.com/INSOFEedu

YouTube: http://www.youtube.com/InsofeVideos

SlideShare: http://www.slideshare.net/INSOFE

This presentation may contain references to findings of various reports available in the public domain. INSOFE makes no representation as to their accuracy or that the organization subscribes to those findings.