Post on 15-Jul-2015
transcript
Counting at ScaleAPAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 6, 2013
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 1 / 27
Last week
Claim:
Solving the counting problem at scale enables you to investigatemany interesting questions in the social sciences
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 2 / 27
Learning to count
Last week:
Counting at small/medium scales on a single machine
This week:
Counting at large scales in parallel
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 3 / 27
Learning to count
Last week:
Counting at small/medium scales on a single machine
This week:
Counting at large scales in parallel
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 3 / 27
What?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 4 / 27
What?
“... to create building blocks for programmers who justhappen to have lots of data to store, lots of data toanalyze, or lots of machines to coordinate, and whodon’t have the time, the skill, or the inclination tobecome distributed systems experts to build theinfrastructure to handle it.”
-Tom WhiteHadoop: The Definitive Guide
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 5 / 27
What?
Hadoop contains many subprojects:
We’ll focus on distributed computation with MapReduce.
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 6 / 27
Who/when?
An overly brief history
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 7 / 27
Who/when?
pre-2004Doug Cutting and Mike Cafarella develop open source projects for
web-scale indexing, crawling, and search
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 7 / 27
Who/when?
2004Dean and Ghemawat publish MapReduce programming model,
used internally at Google
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 7 / 27
Who/when?
2006Hadoop becomes official Apache project, Cutting joins Yahoo!,
Yahoo adopts Hadoop
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 7 / 27
Who/when?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 7 / 27
Where?
http://wiki.apache.org/hadoop/PoweredBy
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 8 / 27
Why?
Why yet another solution?
(I already use too many languages/environments)
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 9 / 27
Why?
Why a distributed solution?
(My desktop has TBs of storage and GBs of memory)
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 9 / 27
Why?
Roughly how long to read 1TB from a commodity hard disk?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 10 / 27
Why?
Roughly how long to read 1TB from a commodity hard disk?
1
2
Gb
sec× 1
8
B
b× 3600
sec
hr≈ 225
GB
hr
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 10 / 27
Why?
Roughly how long to read 1TB from a commodity hard disk?
≈ 4hrs
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 10 / 27
Why?
http://bit.ly/petabytesort
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 11 / 27
Typical scenario
Store, parse, and analyze high-volume server logs,
e.g. how many search queries match “icwsm”?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 12 / 27
MapReduce: 30k ft
Break large problem into smaller parts, solve in parallel, combineresults
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 13 / 27
Typical scenario
“Embarassingly parallel”(or nearly so)
node 1local read filter
node 2local read filter
node 3local read filter
node 4local read filter
}collect results
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 14 / 27
Typical scenario++
How many search queries match “icwsm”, grouped by month?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 15 / 27
MapReduce: example
20091201,4.2.2.1,"icwsm 2010"20100523,2.4.1.2,"hadoop"20100101,9.7.6.5,"tutorial"20091125,2.4.6.1,"data"20090708,4.2.2.1,"open source"20100124,1.2.2.4,"washington dc"
20100522,2.4.1.2,"conference"20091008,4.2.2.1,"2009 icwsm"20090807,4.2.2.1,"apache.org"20100101,9.7.6.5,"mapreduce"20100123,1.2.2.4,"washington dc"20091121,2.4.6.1,"icwsm dates"
20090807,4.2.2.1,"distributed"20091225,4.2.2.1,"icwsm"20100522,2.4.1.2,"media"20100123,1.2.2.4,"social"20091114,2.4.6.1,"d.c."20100101,9.7.6.5,"new year's"
Mapmatching records to(YYYYMM, count=1)
200912, 1
200910, 1200911, 1
200912, 1
200910, 1...
200912, 1200912, 1
...200911, 1
200910, 1...200912, 2
...200911, 1
Shuffleto collect all recordsw/ same key (month)
Reduceresults by adding
count values for each key
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 16 / 27
MapReduce: paradigm
Programmer specifies map and reduce functions
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 17 / 27
MapReduce: paradigm
Map: tranforms input record to intermediate (key, value) pair
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 17 / 27
MapReduce: paradigm
Shuffle: collects all intermediate records by key
Record assigned to reducers by hash(key) % num reducers
Reducers perform a merge sort to collect records with same key
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 17 / 27
MapReduce: paradigm
Reduce: transforms all records for given key to final output
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 17 / 27
MapReduce: paradigm
Distributed read, shuffle, and write are transparent to programmer
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 17 / 27
MapReduce: principles
• Move code to data (local computation)
• Allow programs to scale transparently w.r.t size of input
• Abstract away fault tolerance, synchronization, etc.
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 18 / 27
MapReduce: strengths
• Batch, offline jobs
• Write-once, read-many across full data set
• Usually, though not always, simple computations
• I/O bound by disk/network bandwidth
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 19 / 27
!MapReduce
What it’s not:
• High-performance parallel computing, e.g. MPI
• Low-latency random access relational database
• Always the right solution
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 20 / 27
Word count
dog 2-- 1the 3brown 1fox 2jumped 1lazy 2jumps 1over 2quick 1that 1who 1? 1
the quick brown foxjumps over the lazy dogwho jumped over thatlazy dog -- the fox ?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 21 / 27
Word count
Map: for each line, output each word and count (of 1)
the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?
the 1quick 1brown 1fox 1---------jumps 1over 1the 1lazy 1dog 1---------who 1jumped 1over 1---------that 1lazy 1dog 1-- 1the 1fox 1? 1
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 21 / 27
Word count
Shuffle: collect all records for each word
the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?
-- 1---------? 1---------brown 1---------dog 1dog 1---------fox 1fox 1---------jumped 1---------jumps 1---------lazy 1lazy 1---------over 1over 1---------quick 1---------that 1---------the 1the 1the 1---------who 1
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 21 / 27
Word count
Reduce: add counts for each word
-- 1---------? 1---------brown 1---------dog 1dog 1---------fox 1fox 1---------jumped 1---------jumps 1---------lazy 1lazy 1---------over 1over 1---------quick 1---------that 1---------the 1the 1the 1---------who 1
-- 1? 1brown 1dog 2fox 2jumped 1jumps 1lazy 2over 2quick 1that 1the 3who 1
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 21 / 27
Word count
dog 1dog 1----------- 1---------the 1the 1the 1---------brown 1---------fox 1fox 1---------jumped 1---------lazy 1lazy 1---------jumps 1---------over 1over 1---------quick 1---------that 1---------? 1---------who 1
dog 2-- 1the 3brown 1fox 2jumped 1lazy 2jumps 1over 2quick 1that 1who 1? 1
the quick brown fox--------------------------------jumps over the lazy dog--------------------------------who jumped over that--------------------------------lazy dog -- the fox ?
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 21 / 27
WordCount.java
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 22 / 27
Hadoop streaming
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 23 / 27
Hadoop streaming
MapReduce for *nix geeks1:
# cat data | map | sort | reduce
• Mapper reads input data from stdin
• Mapper writes output to stdout
• Reducer receives input, sorted by key, on stdin
• Reducer writes output to stdout
1http://bit.ly/michaelnollJake Hofman (Columbia University) Counting at Scale February 6, 2013 24 / 27
wordcount.sh
Locally:
# cat data | tr " " "\n" | sort | uniq -c
⇓
Distributed:
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 25 / 27
wordcount.sh
Locally:
# cat data | tr " " "\n" | sort | uniq -c
⇓
Distributed:
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 25 / 27
Transparent scaling
Use the same code on MBs locally or TBs across thousandsof machines.
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 26 / 27
wordcount.py
Jake Hofman (Columbia University) Counting at Scale February 6, 2013 27 / 27