Post on 17-May-2015
HADOOP: Framework and Applications
Prepared by: TEAM HADOOP slide1/22
CONTENTS
WHY HADOOP?
INTRODUCTION TO MapReduce
WHAT?
“... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.”
- Tom White
Source: Hadoop: The Definitive Guide
WHAT?
Hadoop contains many subprojects:
Hadoop Common
Chukwa
HBase
ZooKeeper
Pig
Hive
MapReduce
We will focus on MapReduce
WHO & WHEN?
Pre-2004: Cutting and Cafarella develop open-source projects for web-scale indexing, crawling, and search.
2004: Jeffrey Dean and Sanjay Ghemawat introduce the MapReduce model used internally at Google.
2006: Hadoop becomes an official Apache project; Cutting joins Yahoo!, and Yahoo! adopts Hadoop.
TRENDS
WHO USES IT?
Roughly how long to read 1TB from a commodity hard disk?
Around 4 hours.
With Hadoop: 62 seconds.
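As a rough sanity check on the two figures above, the sketch below works out what they imply. The decimal terabyte is an assumption, and the function and variable names are illustrative; the slides do not state a disk throughput or how the 62-second figure was obtained.

```python
# Back-of-the-envelope arithmetic for the two figures on the slide.
TERABYTE_BYTES = 10**12          # assumption: decimal terabyte
SINGLE_DISK_SECONDS = 4 * 3600   # "around 4 hours"
HADOOP_SECONDS = 62              # "62 seconds" with Hadoop


def implied_throughput_mb_per_s(total_bytes, seconds):
    """Average read rate in MB/s needed to move total_bytes in seconds."""
    return total_bytes / seconds / 10**6


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second figure is than the first."""
    return slow_seconds / fast_seconds


disk_rate = implied_throughput_mb_per_s(TERABYTE_BYTES, SINGLE_DISK_SECONDS)
ratio = speedup(SINGLE_DISK_SECONDS, HADOOP_SECONDS)

print(f"Implied single-disk rate: {disk_rate:.1f} MB/s")  # 69.4 MB/s
print(f"Speedup with Hadoop: about {ratio:.0f}x")         # about 232x
```

If the speedup came purely from reading disks in parallel at the same per-disk rate (an assumption that ignores coordination and network overhead), it would correspond to roughly 232 disks working at once.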
INTRODUCTION TO MapReduce
"Break large problem into smaller parts, solve in parallel, combine results."
Typical scenario
How many times does the word ‘IT’ appear? You could count by hand, but in a 30,000-page document, could you?
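The counting scenario can be sketched with the "break into parts, solve in parallel, combine" idea. This is a minimal single-machine illustration, not how Hadoop itself is implemented: the helper names (split_into_chunks, count_in_chunk, parallel_count) are hypothetical, matching is exact and case-sensitive with no punctuation handling, and Python threads only show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(words, n_chunks):
    """Break the word list into at most n_chunks roughly equal parts."""
    size = max(1, -(-len(words) // n_chunks))  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]


def count_in_chunk(chunk, target):
    """Solve the small problem: count exact matches in one chunk."""
    return sum(1 for word in chunk if word == target)


def parallel_count(text, target, n_chunks=4):
    """Split the text, count each chunk in a worker, then add the partial counts."""
    # Splitting on whole words means no word is cut at a chunk boundary.
    chunks = split_into_chunks(text.split(), n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial_counts = list(pool.map(lambda c: count_in_chunk(c, target), chunks))
    return sum(partial_counts)  # combine results


print(parallel_count("IT is what IT is and IT works", "IT"))  # 3
```

The same three steps carry over to a cluster: the chunks become blocks of the file stored on different machines, and the partial counts are combined at the end.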
MapReduce: Typical Illustration
MapReduce paradigm
Input
Map
Shuffle/Sort
Reduce
Output
MapReduce paradigm
Map: transforms an input record into intermediate (key, value) pairs.
Reduce: transforms all records for a given key into the final output.
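The Map and Reduce roles can be illustrated with an in-memory word count. This is a sketch of the programming model only, not the Hadoop API: the function names (map_record, shuffle_and_sort, reduce_key, run_job) are hypothetical, and lowercasing the words is an added assumption. Note that one map call may emit zero or more pairs.

```python
from collections import defaultdict


def map_record(line):
    """Map: turn one input record into intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_key(key, values):
    """Reduce: collapse all values for one key into a final output record."""
    return key, sum(values)


def run_job(records):
    """Input -> Map -> Shuffle/Sort -> Reduce -> Output, all in one process."""
    intermediate = [pair for record in records for pair in map_record(record)]
    return [reduce_key(key, values) for key, values in shuffle_and_sort(intermediate)]


result = run_job(["it works", "does it scale", "it does"])
print(result)  # [('does', 2), ('it', 3), ('scale', 1), ('works', 1)]
```

Because each map call looks at one record and each reduce call looks at one key, both stages can be spread across many machines without changing the two functions.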
MapReduce principles
Move code to data (local computation)
Allow programs to scale transparently w.r.t. size of input
Abstract away fault tolerance, synchronization, etc.
Implementation: Hardware
Prepared by: TEAM HADOOP sroy choudhury7@gmail.com slide 19/22
MapReduce: strengths
Batch, offline jobs
Write-once, read-many across full data set
Usually, though not always, simple computations
I/O bound by disk/network bandwidth
What it’s not:
High-performance parallel computing, e.g. MPI
Low-latency random access relational database
Always the right solution
THANK YOU!
QUESTIONS?