
Hadoop Overview kdd2011

Description:
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
Transcript
Page 1: Hadoop Overview kdd2011

Modeling with Hadoop

Vijay K. Narayanan Principal Scientist, Yahoo! Labs, Yahoo!

Milind Bhandarkar Chief Architect, Greenplum Labs, EMC

Page 2: Hadoop Overview kdd2011

2

Session 1: Overview of Hadoop

• Motivation

•  Hadoop

• Map-Reduce

•  Distributed File System

• Next Generation MapReduce

• Q & A

Page 3: Hadoop Overview kdd2011

Session 2: Modeling with Hadoop

• Types of learning in MapReduce

• Algorithms in MapReduce framework

• Data parallel algorithms

• Sequential algorithms

• Challenges and Enhancements

3

Page 4: Hadoop Overview kdd2011

Session 3: Hands On Exercise

•  Spin-up Single Node Hadoop cluster in a Virtual Machine

• Write a regression trainer

•  Train model on a dataset

4

Page 5: Hadoop Overview kdd2011

Overview of Apache Hadoop

Page 6: Hadoop Overview kdd2011

6

Hadoop At Yahoo! (Some Statistics)

•  40,000 + machines in 20+ clusters

•  Largest cluster is 4,000 machines

•  170 Petabytes of storage

•  1000+ users

•  1,000,000+ jobs/month

Page 7: Hadoop Overview kdd2011

BEHIND EVERY CLICK

Page 8: Hadoop Overview kdd2011
Page 9: Hadoop Overview kdd2011

Who Uses Hadoop ?

Page 10: Hadoop Overview kdd2011

10

Why Hadoop ?

Page 11: Hadoop Overview kdd2011

Big Datasets (Data-Rich Computing theme proposal, J. Campbell et al., 2007)

Page 12: Hadoop Overview kdd2011

Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)

Page 13: Hadoop Overview kdd2011

Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)

Page 14: Hadoop Overview kdd2011

14

Motivating Examples

Page 15: Hadoop Overview kdd2011

Yahoo! Search Assist

Page 16: Hadoop Overview kdd2011

16

Search Assist

•  Insight: Related concepts appear close together in text corpus

•  Input: Web pages

•  1 Billion Pages, 10K bytes each

•  10 TB of input data

• Output: List(word, List(related words))

Page 17: Hadoop Overview kdd2011

17

// Input: List(URL, Text)
foreach URL in Input :
    Words = Tokenize(Text(URL));
    foreach word in Words :
        Insert (word, Next(word, Words)) in Pairs;
        Insert (word, Previous(word, Words)) in Pairs;
// Result: Pairs = List(word, RelatedWord)
Group Pairs by word;
// Result: List(word, List(RelatedWords))
foreach word in Pairs :
    Count RelatedWords in GroupedPairs;
// Result: List(word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    Choose Top 5 Pairs;
// Result: List(word, Top5(RelatedWords))

Search Assist
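The pseudocode above can be run end-to-end on a toy corpus in plain Python, with no Hadoop involved; `related_words`, the whitespace tokenization, and the sample pages are illustrative choices, not from the tutorial:

```python
from collections import defaultdict

def related_words(pages, k=5):
    """For each word, count its immediate neighbors across all pages
    and keep the top k most frequent."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in pages:
        tokens = text.split()
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                counts[word][tokens[i + 1]] += 1   # Next(word)
            if i > 0:
                counts[word][tokens[i - 1]] += 1   # Previous(word)
    return {
        w: [rw for rw, _ in sorted(c.items(), key=lambda p: -p[1])[:k]]
        for w, c in counts.items()
    }

pages = ["apache hadoop mapreduce", "apache hadoop streaming", "hadoop pig"]
top = related_words(pages)
print(top["hadoop"])  # most related word is "apache" (co-occurs twice)
```

At web scale the Group/Count/Sort steps are exactly what the MapReduce shuffle and reduce phases perform.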

Page 18: Hadoop Overview kdd2011

People You May Know

Page 19: Hadoop Overview kdd2011

19

People You May Know

•  Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith

•  if you don’t know Joe Smith already

• Numbers:

•  100 MM users

•  Average connections per user is 100

Page 20: Hadoop Overview kdd2011

20

// Input: List(UserName, List(Connections))
foreach u in UserList :                   // 100 MM
    foreach x in Connections(u) :         // 100
        foreach y in Connections(x) :     // 100
            if (y not in Connections(u)) :
                Count(u, y)++;            // 1 Trillion Iterations
Sort (u, y) in descending order of Count(u, y);
Choose Top 3 y;
Store (u, {y0, y1, y2}) for serving;

People You May Know
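The same friend-of-friend counting can be sketched on a toy graph in Python; `suggest` and the five-user graph are illustrative, not from the tutorial:

```python
from collections import Counter

def suggest(connections, user, k=3):
    """Count friends-of-friends of `user` who are not already connections."""
    counts = Counter()
    for x in connections[user]:
        for y in connections[x]:
            if y != user and y not in connections[user]:
                counts[y] += 1
    return [y for y, _ in counts.most_common(k)]

graph = {
    "ann": {"bob", "carol"},
    "bob": {"ann", "dave"},
    "carol": {"ann", "dave", "eve"},
    "dave": {"bob", "carol"},
    "eve": {"carol"},
}
print(suggest(graph, "ann"))  # ['dave', 'eve'] -- dave is reachable via both friends
```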

Page 21: Hadoop Overview kdd2011

21

Performance

•  101 Random accesses for each user

•  Assume 1 ms per random access

•  100 ms per user

•  100 MM users

•  100 days on a single machine
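A quick check of the slide's arithmetic (the 101 accesses are the user's own connection list plus one list per connection):

```python
users = 100_000_000          # 100 MM users
accesses_per_user = 101      # own list + 100 connections' lists
seconds_per_access = 0.001   # 1 ms per random access

seconds_total = users * accesses_per_user * seconds_per_access
days = seconds_total / 86_400
print(round(days))  # ~117 -- on the order of 100 days, as the slide says
```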

Page 22: Hadoop Overview kdd2011

22

MapReduce Paradigm

Page 23: Hadoop Overview kdd2011

23

Map & Reduce

•  Primitives in Lisp (& Other functional languages) 1970s

•  Google Paper 2004

•  http://labs.google.com/papers/mapreduce.html

Page 24: Hadoop Overview kdd2011

24

Output_List = Map (Input_List)

Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

Map

Page 25: Hadoop Overview kdd2011

25

Output_Element = Reduce (Input_List)

Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385

Reduce
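Both primitives exist as Python built-ins, which makes the Square/Sum example easy to check:

```python
from functools import reduce

nums = list(range(1, 11))
squares = list(map(lambda x: x * x, nums))   # Map: element-wise, order-independent
total = reduce(lambda a, b: a + b, squares)  # Reduce: folds the list to one value
print(squares)  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(total)    # 385
```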

Page 26: Hadoop Overview kdd2011

26

Parallelism

• Map is inherently parallel

•  Each list element processed independently

•  Reduce is inherently sequential

•  Unless processing multiple lists

•  Grouping to produce multiple lists

Page 27: Hadoop Overview kdd2011

27

// Input: http://hadoop.apache.org
Pairs = Tokenize_And_Pair ( Text ( Input ) )

Output = {(apache, hadoop) (hadoop, mapreduce) (hadoop, streaming) (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline) (hadoop, java) (DFS, namenode) (datanode, block) (replication, default)...}

Search Assist Map

Page 28: Hadoop Overview kdd2011

28

// Input: GroupedList (word, GroupedList(words))
CountedPairs = CountOccurrences (word, RelatedWords)

Output = {(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...}

Search Assist Reduce

Page 29: Hadoop Overview kdd2011

29

Issues with Large Data

• Map Parallelism: Chunking input data

•  Reduce Parallelism: Grouping related data

•  Dealing with failures & load imbalance

Page 30: Hadoop Overview kdd2011
Page 31: Hadoop Overview kdd2011

31

Apache Hadoop

•  January 2006: Subproject of Lucene

•  January 2008: Top-level Apache project

•  Stable Version: 0.20.203

•  Latest Version: 0.22 (Coming soon)

Page 32: Hadoop Overview kdd2011

32

Apache Hadoop

•  Reliable, Performant Distributed file system

• MapReduce Programming framework

•  Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...

Page 33: Hadoop Overview kdd2011

33

Problem: Bandwidth to Data

•  Scan 100TB Datasets on 1000 node cluster

•  Remote storage @ 10MB/s = 165 mins

•  Local storage @ 50-200MB/s = 33-8 mins

• Moving computation is more efficient than moving data

• Need visibility into data placement
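A quick check of the slide's arithmetic (100 TB spread over 1000 nodes means each node scans 100 GB; the slide rounds 167 down to 165):

```python
dataset = 100e12              # 100 TB
nodes = 1000
per_node = dataset / nodes    # 100 GB scanned per node, in parallel

def scan_minutes(bytes_per_sec):
    return per_node / bytes_per_sec / 60

print(round(scan_minutes(10e6)))   # remote @ 10 MB/s  -> ~167 min
print(round(scan_minutes(50e6)))   # local  @ 50 MB/s  -> ~33 min
print(round(scan_minutes(200e6)))  # local  @ 200 MB/s -> ~8 min
```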

Page 34: Hadoop Overview kdd2011

34

Problem: Scaling Reliably

•  Failure is not an option, it’s a rule !

•  1000 nodes, MTBF < 1 day

•  4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM)

• Need fault tolerant store with reasonable availability guarantees

•  Handle hardware faults transparently

Page 35: Hadoop Overview kdd2011

35

Hadoop Goals

•  Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes

•  Economical: Commodity components only

•  Reliable

•  Engineering reliability into every application is expensive

Page 36: Hadoop Overview kdd2011

36

Hadoop MapReduce

Page 37: Hadoop Overview kdd2011

37

Think MapReduce

•  Record = (Key, Value)

•  Key : Comparable, Serializable

•  Value: Serializable

•  Input, Map, Shuffle, Reduce, Output

Page 38: Hadoop Overview kdd2011

38

cat /var/log/auth.log* | \
  grep "session opened" | \
  cut -d' ' -f10 | \
  sort | \
  uniq -c > ~/userlist

Seems Familiar ?

Page 39: Hadoop Overview kdd2011

39

Map

•  Input: (Key1, Value1)

• Output: List(Key2, Value2)

•  Projections, Filtering, Transformation

Page 40: Hadoop Overview kdd2011

40

Shuffle

•  Input: List(Key2, Value2)

• Output

•  Sort(Partition(List(Key2, List(Value2))))

•  Provided by Hadoop

Page 41: Hadoop Overview kdd2011

41

Reduce

•  Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

•  Aggregation
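The Map, Shuffle, and Reduce stages just described can be imitated locally in a few lines of Python, using word count as the running example; the function names are illustrative, and `shuffle` stands in for the sort/group machinery Hadoop provides:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # (key1, value1) -> List(key2, value2): tokenize each line
    return [(word, 1) for _, line in records for word in line.split()]

def shuffle(pairs):
    # Sort and group by key2 -- the part the framework does for you
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped):
    # (key2, List(value2)) -> (key3, value3): aggregate
    return [(k, sum(vs)) for k, vs in grouped]

records = [(0, "hadoop mapreduce"), (1, "hadoop streaming")]
print(reduce_phase(shuffle(map_phase(records))))
# [('hadoop', 2), ('mapreduce', 1), ('streaming', 1)]
```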

Page 42: Hadoop Overview kdd2011

42

Hadoop Streaming

•  Hadoop is written in Java

•  Java MapReduce code is “native”

• What about Non-Java Programmers ?

•  Perl, Python, Shell, R

•  grep, sed, awk, uniq as Mappers/Reducers

•  Text Input and Output

Page 43: Hadoop Overview kdd2011

43

Hadoop Streaming

•  Thin Java wrapper for Map & Reduce Tasks

•  Forks actual Mapper & Reducer

•  IPC via stdin, stdout, stderr

•  Key.toString() \t Value.toString() \n

•  Slower than Java programs

•  Allows for quick prototyping / debugging

Page 44: Hadoop Overview kdd2011

44

$ bin/hadoop jar hadoop-streaming.jar \
    -input in-files -output out-dir \
    -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'

Hadoop Streaming
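The same wrapper runs scripts in any language. A Python equivalent of the sed/uniq scripts above might look like this sketch; `mapper` and `reducer` are illustrative generators that follow the Streaming contract (tab-separated key/value lines, reducer input already sorted by key):

```python
def mapper(lines):
    # Emit "word\t1" for every word -- like: sed -e 's/ /\n/g' | grep .
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key; sum the counts per word -- like: uniq -c
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

In actual mapper.py and reducer.py scripts each generator would be driven by stdin and stdout, e.g. `for out in mapper(sys.stdin): print(out)`.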

Page 45: Hadoop Overview kdd2011

45

Hadoop Distributed File System (HDFS)

Page 46: Hadoop Overview kdd2011

46

HDFS

•  Data is organized into files and directories

•  Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes

•  HDFS exposes block placement so that computation can be migrated to data

Page 47: Hadoop Overview kdd2011

47

HDFS

•  Blocks are replicated (default 3) to handle hardware failure

•  Replication for performance and fault tolerance (Rack-Aware placement)

•  HDFS keeps checksums of data for corruption detection and recovery
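The default rack-aware placement (first replica on the writer's node, second on a node in a different rack, third on another node of that same remote rack) can be sketched as a toy function; this deliberately simplifies HDFS's real placement policy, which chooses randomly among candidates, and all names are illustrative:

```python
def place_replicas(nodes_by_rack, writer_rack, writer_node, replicas=3):
    """Toy rack-aware placement. nodes_by_rack maps rack -> list of nodes.
    Picks the first eligible candidate where HDFS would pick randomly."""
    chosen = [writer_node]                               # 1st: writer's node
    remote_rack = next(r for r in nodes_by_rack if r != writer_rack)
    remote_nodes = nodes_by_rack[remote_rack]
    chosen.append(remote_nodes[0])                       # 2nd: different rack
    for n in remote_nodes[1:]:                           # 3rd: same remote rack,
        chosen.append(n)                                 #      different node
        break
    return chosen[:replicas]

cluster = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2"]}
print(place_replicas(cluster, "rackA", "a1"))  # ['a1', 'b1', 'b2']
```

Placing two of three replicas on one remote rack keeps cross-rack write traffic low while still surviving the loss of an entire rack.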

Page 48: Hadoop Overview kdd2011

48

HDFS

• Master-Worker Architecture

•  Single NameNode

• Many (Thousands) DataNodes

Page 49: Hadoop Overview kdd2011

49

HDFS Master (NameNode)

• Manages filesystem namespace

•  File metadata (i.e. “inode”)

• Mapping inode to list of blocks + locations

•  Authorization & Authentication

•  Checkpoint & journal namespace changes

Page 50: Hadoop Overview kdd2011

50

Namenode

• Mapping of datanode to list of blocks

• Monitor datanode health

•  Replicate missing blocks

•  Keeps ALL namespace in memory

•  60M objects (File/Block) in 16GB

Page 51: Hadoop Overview kdd2011

51

Datanodes

•  Handle block storage on multiple volumes & block integrity

•  Clients access the blocks directly from data nodes

•  Periodically send heartbeats and block reports to Namenode

•  Blocks are stored as underlying OS’s files

Page 52: Hadoop Overview kdd2011

HDFS Architecture

Page 53: Hadoop Overview kdd2011

53

Next Generation MapReduce

Page 54: Hadoop Overview kdd2011

MapReduce Today (Courtesy: Arun Murthy, Hortonworks)

Page 55: Hadoop Overview kdd2011

55

Why ?

•  Scalability Limitations today

• Maximum cluster size: 4000 nodes

• Maximum Concurrent tasks: 40,000

•  Job Tracker SPOF

•  Fixed map and reduce containers (slots)

•  Punishes pleasantly parallel apps

Page 56: Hadoop Overview kdd2011

56

Why ? (contd)

• MapReduce is not suitable for every application

•  Fine-Grained Iterative applications

•  HaLoop: Hadoop in a Loop

• Message passing applications

•  Graph Processing

Page 57: Hadoop Overview kdd2011

57

Requirements

• Need scalable cluster resources manager

•  Separate scheduling from resource management

• Multi-Lingual Communication Protocols

Page 58: Hadoop Overview kdd2011

58

Bottom Line

• @techmilind: #mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen)

•  Expect different programming paradigms to be implemented

•  Including MPI (soon)

Page 59: Hadoop Overview kdd2011

Architecture (Courtesy: Arun Murthy, Hortonworks)

Page 60: Hadoop Overview kdd2011

60

The New World

•  Resource Manager

•  Allocates resources (containers) to applications

•  Node Manager

•  Manages containers on nodes

•  Application Master

•  Specific to paradigm, e.g. MapReduce application master, MPI application master, etc.

Page 61: Hadoop Overview kdd2011

61

Container

•  In current terminology: A Task Slot

•  Slice of the node’s hardware resources

•  # of cores, virtual memory, disk size, disk and network bandwidth, etc.

•  Currently, only memory usage is sliced

Page 62: Hadoop Overview kdd2011

62

Questions ?

