Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 6 September 29, 2011
Matt Lease
School of Information
University of Texas at Austin
ml at ischool dot utexas dot edu
Jason Baldridge
Department of Linguistics
University of Texas at Austin
Jasonbaldridge at gmail dot com
Acknowledgments
Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park
Some figures courtesy of the following excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Today’s Agenda
• Automatic Spelling Correction
– Review: Information Retrieval (IR)
• Boolean Search
• Vector Space Modeling
• Inverted Indexing in MapReduce
– Probabilistic modeling via noisy channel
• Index Compression
– Order inversion in MapReduce
• In-class exercise
• Hadoop: Pipelined & Chained jobs
Automatic Spelling Correction
Automatic Spelling Correction
Three main stages
Error detection
Candidate generation
Candidate ranking / choose best candidate
Usage cases
Flagging possible misspellings / spell checker
Suggesting possible corrections
Automatically correcting (inferred) misspellings
• “as you type” correction
• web queries
• real-time closed captioning
• …
Types of spelling errors
Unknown words: “She is their favorite acress in town.”
Can be identified using a dictionary…
…but could be a valid word not in the dictionary
Dictionary could be automatically constructed from large corpora
• Filter out rare words (misspellings, or valid but unlikely)…
• Why filter out rare words that are valid?
Unknown words violating phonotactics:
e.g. “There isn’t enough room in this tonw for the both of us.”
Given dictionary, could automatically construct “n-gram dictionary”
of all character n-grams known in the language
• e.g. English words don’t end with “nw”, so flag tonw
Incorrect homophone: “She drove their.”
Valid word, wrong usage; infer appropriateness from context
Typing errors reflecting kayout of leyboard
Candidate generation
How to generate possible corrections for acress?
Inspiration: how do people do it?
People may suggest words like actress, across, access, acres,
caress, and cress – what do these have in common?
What about “blam” and “zigzag”?
Two standard strategies for candidate generation
Minimum edit distance
• Generate all candidates within 1+ edit step(s)
• Possible edit operations: insertion, deletion, substitution, transposition, …
• Filter through a dictionary
• See Peter Norvig’s post: http://norvig.com/spell-correct.html
Character ngrams: see next slide…
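A minimal Java sketch of the minimum-edit-distance strategy, in the spirit of Norvig's post (class names, the alphabet, and the dictionary filter are illustrative assumptions, not course code):

import java.util.*;

public class EditCandidates {
  private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

  // All strings within one edit (deletion, transposition, substitution, insertion) of w.
  static Set<String> edits1(String w) {
    Set<String> out = new HashSet<>();
    for (int i = 0; i <= w.length(); i++) {
      String left = w.substring(0, i), right = w.substring(i);
      if (!right.isEmpty())
        out.add(left + right.substring(1));                                      // deletion
      if (right.length() > 1)
        out.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));  // transposition
      for (char c : ALPHABET.toCharArray()) {
        if (!right.isEmpty())
          out.add(left + c + right.substring(1));                                // substitution
        out.add(left + c + right);                                               // insertion
      }
    }
    return out;
  }

  // Keep only candidates that are real words.
  static Set<String> candidates(String typo, Set<String> dictionary) {
    Set<String> cands = new HashSet<>();
    for (String e : edits1(typo))
      if (dictionary.contains(e)) cands.add(e);
    return cands;
  }
}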
Character ngram Spelling Correction
Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]
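A small Java sketch of one way to extract such boundary-marked trigrams (the padding convention is an assumption; the examples above use '#'):

import java.util.ArrayList;
import java.util.List;

public class CharNgrams {
  // Character trigrams of a word, with '#' marking word start and end,
  // e.g. trigrams("blam") -> [#bl, bla, lam, am#].
  static List<String> trigrams(String word) {
    String padded = "#" + word + "#";
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= padded.length(); i++)
      grams.add(padded.substring(i, i + 3));
    return grams;
  }
}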
Uhm, IR model???
Review…
Abstract IR Architecture
(Figure: documents pass through a representation function offline to build the document representation and index; a query passes through a representation function online; a comparison function matches the query representation against the document representation, via the index, to produce results.)
Document Boolean Representation
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.
But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
McDonalds
fat
fries
new
french
Company
Said
nutrition
…
“Bag of Words”
Boolean Retrieval
dogs Doc 1
dolphins Doc 2
football Doc 3
football dolphins Doc 4
Inverted Index: Boolean Retrieval
one fish, two fish Doc 1
red fish, blue fish Doc 2
cat in the hat Doc 3
green eggs and ham Doc 4

Resulting postings (term → sorted docIDs):
blue → [2]
cat → [3]
egg → [4]
fish → [1, 2]
green → [4]
ham → [4]
hat → [3]
one → [1]
red → [2]
two → [1]
Inverted Indexing via MapReduce
Map output (key: term, value: docID), one pair per distinct term per document:
one fish, two fish Doc 1 → (one, 1), (two, 1), (fish, 1)
red fish, blue fish Doc 2 → (red, 2), (blue, 2), (fish, 2)
cat in the hat Doc 3 → (cat, 3), (hat, 3)

Shuffle and Sort: aggregate values by keys

Reduce output (term → postings):
fish → [1, 2]
one → [1]
two → [1]
red → [2]
blue → [2]
cat → [3]
hat → [3]
Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
Scalability Bottleneck
Desired output format: <term, [doc1, doc2, …]>
Just emitting each <term, docID> pair won’t produce this
How to produce this without buffering?
Side-effect: write directly to HDFS instead of emitting
Complications?
• Persistent data must be cleaned up if reducer restarted…
Using the Inverted Index
Boolean Retrieval: to execute a Boolean query
Build query syntax tree
For each clause, look up postings
Traverse postings and apply Boolean operator
Efficiency analysis
Start with shortest posting first
Postings traversal is linear (if postings are sorted)
• Oops… we didn’t actually do this in building our index…
( blue AND fish ) OR ham
(Figure: query syntax tree with OR at the root over ham and an AND node; the AND node covers blue and fish. Postings traversed for the AND clause: blue → [2], fish → [1, 2].)
Inverted Indexing in MapReduce
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Emit(term t; P)
Inverted Indexing in MapReduce: try 2
1: class Mapper
2: procedure Map(docid n; doc d)
3: H = new Set
4: for all term t in doc d do
5: H.add(t)
6: for all term t in H do
7: Emit(term t, n)
1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3: List P = docids.values()
4: Sort(P)
5: Emit(term t; P)
(Another) Scalability Bottleneck
Reducer buffers all docIDs associated with a term (in order to sort them)
What if term occurs in many documents?
Secondary sorting
Use composite key
Partition function
Key Comparator
Side-effect: write directly to HDFS as before…
Inverted index for spelling correction
Like search, spelling correction must be fast
How can we quickly identify candidate corrections?
Inverted index (II): map each character ngram to the list of all words containing it
#ac -> { act, across, actress, acquire, … }
acr -> { across, acrimony, macro, … }
cre -> { crest, acre, acres, … }
res -> { arrest, rest, rescue, restaurant, … }
ess -> { less, lesson, necessary, actress, … }
ss# -> { less, mess, moss, across, actress, … }
How do we build the inverted index in MapReduce?
Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
Exercise
Write a MapReduce algorithm for creating an inverted
index for trigram spelling correction, given a corpus
Also other alternatives, e.g. in-mapper combining, pairs
Is MapReduce even necessary for this?
Dictionary vs. token frequency
Map(String docid, String text):
for each word w in text:
for each trigram t in w:
Emit(t, w)
Reduce(String trigram, Iterator<Text> values):
Emit(trigram, values.toSet)
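A hedged Hadoop translation of this pseudo-code, using the newer org.apache.hadoop.mapreduce API rather than the old API used later in this lecture (class names and tokenization are illustrative):

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (trigram, word) for every boundary-marked trigram of every word.
public class TrigramIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().toLowerCase().split("\\s+")) {
      if (word.isEmpty()) continue;
      String padded = "#" + word + "#";
      for (int i = 0; i + 3 <= padded.length(); i++)
        context.write(new Text(padded.substring(i, i + 3)), new Text(word));
    }
  }
}

// Reducer: collapse the words seen for a trigram into a set (the posting list).
class TrigramIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text trigram, Iterable<Text> words, Context context)
      throws IOException, InterruptedException {
    Set<String> unique = new TreeSet<String>();
    for (Text w : words) unique.add(w.toString());
    context.write(trigram, new Text(unique.toString()));
  }
}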
Spelling correction as Boolean search
Given inverted index, how to find set of possible corrections?
Compute union of all words indexed by any of its character ngrams
= Boolean search
• Query “acress” → “#ac OR acr OR cre OR res OR ess OR ss#”
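A minimal sketch of that union lookup over an in-memory index (the map representation and names are assumptions for illustration):

import java.util.*;

public class CandidateLookup {
  // Union of postings for each trigram of the typo (a Boolean OR query).
  // 'index' maps a character trigram to the set of dictionary words containing it.
  static Set<String> corrections(String typo, Map<String, Set<String>> index) {
    Set<String> result = new HashSet<>();
    String padded = "#" + typo + "#";
    for (int i = 0; i + 3 <= padded.length(); i++) {
      Set<String> postings = index.get(padded.substring(i, i + 3));
      if (postings != null) result.addAll(postings);
    }
    return result;
  }
}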
Are all corrections equally likely / good?
Ranked Information Retrieval
Order documents by probability of relevance
Estimate relevance of each document to the query
Rank documents by relevance
How do we estimate relevance?
Vector space paradigm
Approximate relevance by vector similarity (e.g. cosine)
Represent queries and documents as vectors
Rank documents by vector similarity to the query
Vector Space Model
Assumption: Documents that are “close” in vector space
“talk about” the same things
(Figure: documents d1–d5 as vectors in a space spanned by terms t1, t2, t3, with angles θ and φ between document vectors.)
Retrieve documents based on how close the document
vector is to the query vector (i.e., similarity ~ “closeness”)
Similarity Metric
Use “angle” between the vectors
Given pre-normalized vectors, just compute inner product
$\mathrm{sim}(d_j, d_k) \;=\; \cos\theta \;=\; \frac{d_j \cdot d_k}{\lVert d_j \rVert \, \lVert d_k \rVert} \;=\; \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$

For pre-normalized (unit-length) vectors this reduces to the inner product:

$\mathrm{sim}(d_j, d_k) \;=\; \sum_{i=1}^{n} w_{i,j}\, w_{i,k}$
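A small Java sketch of cosine similarity over sparse weight vectors (for un-normalized vectors; with pre-normalized vectors only the dot product is needed):

import java.util.Map;

public class Cosine {
  // Cosine similarity between two sparse term-weight vectors.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, na = 0, nb = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;   // shared components contribute to the dot product
      na += e.getValue() * e.getValue();
    }
    for (double w : b.values()) nb += w * w;
    return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}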
Boolean Character ngram correction
Boolean Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is set of character ngrams
Let’s use n=3 (trigram), with # to mark word start/end
Examples
across: [#ac, acr, cro, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]
Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
Spelling Correction in Vector Space
Assumption: Words that are “close together” in ngram
vector space have similar orthography
(Figure: as before, dictionary words as vectors in a character-ngram space spanned by t1, t2, t3, with angles θ and φ between vectors.)
Therefore, retrieve words in the dictionary based on how
close the word is to the typo (i.e., similarity ~ “closeness”)
Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
Use simple count here, what else might we do?
IR Term Weighting
Term weights consist of two components
Local: how important is the term in this document?
Global: how important is the term in the collection?
Here’s the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights
How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
TF.IDF Term Weighting
$w_{i,j} \;=\; \mathrm{tf}_{i,j} \cdot \log \frac{N}{n_i}$

where
$w_{i,j}$: weight assigned to term i in document j
$\mathrm{tf}_{i,j}$: number of occurrences of term i in document j
$N$: number of documents in the entire collection
$n_i$: number of documents containing term i
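A worked instance on the four-document toy collection shown below, assuming base-2 logarithms (the log base is not fixed on the slide): for “fish” in Doc 1, $w_{\text{fish},1} = 2 \cdot \log_2 \tfrac{4}{2} = 2$; for “blue” in Doc 2, $w_{\text{blue},2} = 1 \cdot \log_2 \tfrac{4}{1} = 2$.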
Inverted Index: TF.IDF

one fish, two fish Doc 1
red fish, blue fish Doc 2
cat in the hat Doc 3
green eggs and ham Doc 4

Postings now carry term frequencies, with document frequency (df) per term
(term [df] → list of (docID, tf)):
blue [1] → [(2, 1)]
cat [1] → [(3, 1)]
egg [1] → [(4, 1)]
fish [2] → [(1, 2), (2, 2)]
green [1] → [(4, 1)]
ham [1] → [(4, 1)]
hat [1] → [(3, 1)]
one [1] → [(1, 1)]
red [1] → [(2, 1)]
two [1] → [(1, 1)]
Inverted Indexing via MapReduce
Map output (key: term, value: docID), one pair per distinct term per document:
one fish, two fish Doc 1 → (one, 1), (two, 1), (fish, 1)
red fish, blue fish Doc 2 → (red, 2), (blue, 2), (fish, 2)
cat in the hat Doc 3 → (cat, 3), (hat, 3)

Shuffle and Sort: aggregate values by keys

Reduce output (term → postings):
fish → [1, 2]
one → [1]
two → [1]
red → [2]
blue → [2]
cat → [3]
hat → [3]
Inverted Indexing via MapReduce (2)

Map output (key: term, value: (docID, tf)):
one fish, two fish Doc 1 → (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
red fish, blue fish Doc 2 → (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
cat in the hat Doc 3 → (cat, (3, 1)), (hat, (3, 1))

Shuffle and Sort: aggregate values by keys

Reduce output (term → postings with tf):
fish → [(1, 2), (2, 2)]
one → [(1, 1)]
two → [(1, 1)]
red → [(2, 1)]
blue → [(2, 1)]
cat → [(3, 1)]
hat → [(3, 1)]
Inverted Indexing: Pseudo-Code
Further exacerbates the earlier scalability issues …
Ranked Character ngram correction
Vector space Information Retrieval (IR) model
Query=typo word
Document collection = dictionary (i.e. set of valid words)
Representation: word is vector of character ngram value
Rank candidate corrections according to vector similarity (cosine)
Trigram Examples
across: [#ac, acr, cro, oss, ss#]
acress: [#ac, acr, cre, res, ess, ss#]
actress: [#ac, act, ctr, tre, res, ess, ss#]
blam: [#bl, bla, lam, am#]
mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
“value” here expresses relative importance of different
vector components for the similarity comparison
What else might we do? TF.IDF for character n-grams?
TF.IDF for character n-grams
Think about what makes an ngram more discriminating
e.g. in acquire, acq and cqu are more indicative than qui and ire.
Schematically, we want something like:
• acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]
Possible solution: TF-IDF, where
TF is the frequency of the ngram in the word
IDF is based on how many vocabulary words the ngram occurs in (fewer words → higher weight)
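A hedged sketch of building such an ngram-IDF table over a dictionary (padding convention and log base are assumptions):

import java.util.*;

public class NgramIdf {
  // Inverse "document" frequency over the dictionary: rare ngrams
  // (e.g. "acq") get higher weight than common ones (e.g. "ire").
  static Map<String, Double> ngramIdf(Set<String> dictionary) {
    Map<String, Integer> df = new HashMap<>();
    for (String word : dictionary) {
      String padded = "#" + word + "#";
      Set<String> seen = new HashSet<>();
      for (int i = 0; i + 3 <= padded.length(); i++)
        seen.add(padded.substring(i, i + 3));
      for (String g : seen) df.merge(g, 1, Integer::sum);   // count each ngram once per word
    }
    Map<String, Double> idf = new HashMap<>();
    int n = dictionary.size();
    for (Map.Entry<String, Integer> e : df.entrySet())
      idf.put(e.getKey(), Math.log((double) n / e.getValue()));
    return idf;
  }
}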
Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
Correction Beyond Orthography
So far we’ve focused on orthography alone
The context of a typo also tells us a great deal
How can we compare contexts?
Idea: use the co-occurrence matrices built during HW2
We have a vector of co-occurrence counts for each word
Extract a similar vector for the typo given its immediate context
• “She is their favorite acress in town.”
acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]
Possible enhancement: make vectors sensitive to word order
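A minimal sketch of extracting the typo's context vector from one sentence (window size, casing, and stopword handling are design choices left open here):

import java.util.HashMap;
import java.util.Map;

public class ContextVector {
  // Bag-of-words context vector for a typo occurrence, ignoring word order,
  // e.g. for "She is their favorite acress in town" with typoIndex = 4:
  // { she:1, is:1, their:1, favorite:1, in:1, town:1 }
  static Map<String, Integer> contextVector(String[] sentence, int typoIndex) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i < sentence.length; i++) {
      if (i == typoIndex) continue;
      counts.merge(sentence[i].toLowerCase(), 1, Integer::sum);
    }
    return counts;
  }
}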
Combining evidence
We have orthographic similarity and contextual similarity
We can do a simple weighted combination of the two, e.g.:
How to do this more efficiently?
Compute top candidates based on simOrth
Take top k for consideration with simContext
…or other way around…
The combined model might also be expressed by a similar
probabilistic model…
$\mathrm{simCombine}(d_j, d_k) \;=\; \lambda \cdot \mathrm{simOrth}(d_j, d_k) \;+\; (1 - \lambda) \cdot \mathrm{simContext}(d_j, d_k)$   (with λ an interpolation weight)
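A small sketch of the two-stage idea, assuming simOrth and simContext scores have already been computed for the candidates (λ and k are free parameters):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Rerank {
  // Stage 1: keep the k candidates with highest orthographic similarity.
  // Stage 2: re-rank those k by the weighted combination with contextual similarity.
  static List<String> rerank(Map<String, Double> simOrth,
                             Map<String, Double> simContext,
                             double lambda, int k) {
    List<String> ranked = new ArrayList<>(simOrth.keySet());
    ranked.sort((a, b) -> Double.compare(simOrth.get(b), simOrth.get(a)));
    List<String> topK = new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    topK.sort((a, b) -> Double.compare(combined(b, simOrth, simContext, lambda),
                                       combined(a, simOrth, simContext, lambda)));
    return topK;
  }

  static double combined(String c, Map<String, Double> orth,
                         Map<String, Double> ctx, double lambda) {
    return lambda * orth.get(c) + (1 - lambda) * ctx.getOrDefault(c, 0.0);
  }
}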
Paradigm: Noisy-Channel Modeling
$\hat{S} \;=\; \arg\max_{S} P(S \mid O) \;=\; \arg\max_{S} P(S)\, P(O \mid S)$
Want to recover most likely latent (correct) source
word underlying the observed (misspelled) word
P(S): language model gives probability distribution
over possible (candidate) source words
P(O|S): channel model gives probability of each
candidate source word being “corrupted” into the
observed typo
Noisy Channel Model for correction
We want to rank candidates by P(cand | typo)
Using Bayes law, the chain rule, an independence
assumption, and logs, we have:
$P(\mathrm{cand} \mid \mathrm{typo}, \mathrm{context}) \;=\; \dfrac{P(\mathrm{cand}, \mathrm{typo}, \mathrm{context})}{P(\mathrm{typo}, \mathrm{context})}$
$\;\propto\; P(\mathrm{typo} \mid \mathrm{cand}, \mathrm{context}) \cdot P(\mathrm{cand}, \mathrm{context})$
$\;\approx\; P(\mathrm{typo} \mid \mathrm{cand}) \cdot P(\mathrm{cand}, \mathrm{context})$
$\;=\; P(\mathrm{typo} \mid \mathrm{cand}) \cdot P(\mathrm{cand} \mid \mathrm{context}) \cdot P(\mathrm{context})$
$\;\propto\; P(\mathrm{typo} \mid \mathrm{cand}) \cdot P(\mathrm{cand} \mid \mathrm{context})$
$\;\Rightarrow\;$ rank by $\log P(\mathrm{typo} \mid \mathrm{cand}) + \log P(\mathrm{cand} \mid \mathrm{context})$
Probabilistic vs. vector space model
Both measure orthographic & contextual “fit” of the
candidate given the typo and its usage context
Noisy channel: rank by $\log P(\mathrm{typo} \mid \mathrm{cand}) + \log P(\mathrm{cand} \mid \mathrm{context})$
IR approach: rank by $\mathrm{simCombine}(d_j, d_k) = \lambda \cdot \mathrm{simOrth}(d_j, d_k) + (1 - \lambda) \cdot \mathrm{simContext}(d_j, d_k)$
Both can benefit from “big” data (i.e. bigger samples)
Better estimates of probabilities and population frequencies
Usual probabilistic vs. non-probabilistic tradeoffs
Principled theory and methodology for modeling and estimation
How to extend the feature space to include additional information?
• Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
Index Compression
Postings Encoding
Conceptually: fish → 1, 9, 21, 34, 35, 80, …
In practice: fish → 1, 8, 12, 13, 1, 45, …
• Instead of document IDs, encode deltas (or d-gaps) between successive IDs
• But it's not obvious that this alone saves space…
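A minimal sketch of going between docIDs and d-gaps, e.g. [1, 9, 21, 34, 35, 80] ↔ [1, 8, 12, 13, 1, 45]:

public class DGaps {
  // docIDs -> d-gaps (difference from the previous docID)
  static int[] toGaps(int[] docIds) {
    int[] gaps = new int[docIds.length];
    int prev = 0;
    for (int i = 0; i < docIds.length; i++) { gaps[i] = docIds[i] - prev; prev = docIds[i]; }
    return gaps;
  }

  // d-gaps -> docIDs (running sum)
  static int[] fromGaps(int[] gaps) {
    int[] docIds = new int[gaps.length];
    int sum = 0;
    for (int i = 0; i < gaps.length; i++) { sum += gaps[i]; docIds[i] = sum; }
    return docIds;
  }
}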
Overview of Index Compression
Byte-aligned vs. bit-aligned
Non-parameterized bit-aligned
Unary codes
γ (gamma) codes
δ (delta) codes
Parameterized bit-aligned
Golomb codes
Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
But First... General Data Compression
Run Length Encoding
7 7 7 8 8 9 = (7, 3), (8,2), (9,1)
Binary Equivalent
0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3
Good with sparse binary data
Huffman Coding
Optimal when data is distributed by negative powers of two
e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8
• a = 0, b = 10, c= 110, d=111
Prefix codes: no codeword is the prefix of another codeword
• If we read 0, we know it's an “a”; the following bits start a new codeword
• Similarly 10 is a b (no other codeword starts with 10), etc.
• Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
Unary Codes
Encode number as a run of 1s, specifically…
x ≥ 1 coded as x−1 ones, followed by a zero-bit terminator
1 = 0
2 = 10
3 = 110
4 = 1110
...
Great for small numbers… horrible for large numbers
Overly-biased for very small gaps
γ codes
x ≥ 1 is coded in two parts: unary length : offset
Start with x in binary; removing the highest-order bit gives the offset
Length is the number of binary digits, encoded in unary
Concatenate length + offset codes
Example: 9 in binary is 1001
Offset = 001
Length = 4, in unary code = 1110
γ code = 1110:001
Another example: 7 (111 in binary)
• offset = 11, length = 3 (110 in unary) → γ code = 110:11
Analysis
Offset = ⌊log₂ x⌋ bits
Length = ⌊log₂ x⌋ + 1 bits (in unary)
Total = 2⌊log₂ x⌋ + 1 bits (97 bits, 75 bits, …)
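A small sketch of γ-encoding a positive integer (string output with a ':' separator just for readability; a real index packs the bits):

public class GammaCode {
  // γ code of x >= 1: the length ⌊log2 x⌋ + 1 in unary, then the ⌊log2 x⌋
  // low-order bits of x (the offset). E.g. encode(9) -> "1110:001".
  static String encode(int x) {
    String binary = Integer.toBinaryString(x);   // e.g. 9 -> "1001"
    String offset = binary.substring(1);         // drop highest-order bit -> "001"
    StringBuilder unary = new StringBuilder();
    for (int i = 0; i < binary.length() - 1; i++) unary.append('1');
    unary.append('0');                           // length in unary -> "1110"
    return offset.isEmpty() ? unary.toString() : unary + ":" + offset;
  }
}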
δ codes
As with γ codes, two parts: unary length & offset
Offset is the same as before
Length is encoded by its γ code
Example: 9 (= 1001 in binary)
Offset = 001
Length = 4 (100 in binary): its γ code has offset = 00 and length 3 = 110 in unary
• γ code of the length = 110:00
δ code = 110:00:001
Comparison
γ codes better for smaller numbers
δ codes better for larger numbers
Golomb Codes
x ≥ 1, parameter b
x encoded in two parts
Part 1: q = ⌊(x − 1) / b⌋, code q + 1 in unary
Part 2: remainder r = x − qb − 1 (with r < b), coded in truncated binary
Truncated binary defines a prefix code
if b is a power of 2
• easy case: truncated binary = regular binary
else
• First 2^(⌊log₂ b⌋ + 1) − b values encoded in ⌊log₂ b⌋ bits
• Remaining values encoded in ⌊log₂ b⌋ + 1 bits
Let’s see some examples
Golomb Code Examples
b = 3, r ∈ [0:2]
First 2^(⌊log₂ 3⌋ + 1) − 3 = 2² − 3 = 1 value, in ⌊log₂ 3⌋ = 1 bit
First 1 value in 1 bit: 0
Remaining 3 − 1 = 2 values in 1 + 1 = 2 bits with prefix 1: 10, 11
b = 5, r ∈ [0:4]
First 2^(⌊log₂ 5⌋ + 1) − 5 = 2³ − 5 = 3 values, in ⌊log₂ 5⌋ = 2 bits
First 3 values in 2 bits: 00, 01, 10
Remaining 5 − 3 = 2 values in 2 + 1 = 3 bits with prefix 11: 110, 111
• Two prefix bits needed since the single leading 1 is already used in “10”
b = 6, r ∈ [0:5]
First 2^(⌊log₂ 6⌋ + 1) − 6 = 2³ − 6 = 2 values, in ⌊log₂ 6⌋ = 2 bits
First 2 values in 2 bits: 00, 01
Remaining 6 − 2 = 4 values in 2 + 1 = 3 bits with prefix 1: 100, 101, 110, 111
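A hedged sketch of Golomb encoding with truncated-binary remainders, matching the examples above (string output with ':' for readability only):

public class GolombCode {
  // Golomb code of x >= 1 with parameter b: quotient q + 1 in unary,
  // then remainder r in truncated binary. E.g. encode(9, 3) -> "110:11".
  static String encode(int x, int b) {
    int q = (x - 1) / b;
    int r = x - q * b - 1;
    StringBuilder code = new StringBuilder();
    for (int i = 0; i < q; i++) code.append('1');  // q+1 in unary = q ones ...
    code.append('0');                              // ... plus a 0 terminator
    code.append(':');
    int k = 31 - Integer.numberOfLeadingZeros(b);  // ⌊log2 b⌋
    int u = (1 << (k + 1)) - b;                    // number of "short" remainders
    if (r < u) code.append(toBits(r, k));          // short: k bits
    else code.append(toBits(r + u, k + 1));        // long: k+1 bits
    return code.toString();
  }

  static String toBits(int v, int width) {
    StringBuilder s = new StringBuilder(Integer.toBinaryString(v));
    while (s.length() < width) s.insert(0, '0');
    return s.toString();
  }
}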
Comparison of Coding Schemes
x     Unary        γ          δ           Golomb b=3   Golomb b=6
1     0            0          0           0:0          0:00
2     10           10:0       100:0       0:10         0:01
3     110          10:1       100:1       0:11         0:100
4     1110         110:00     101:00      10:0         0:101
5     11110        110:01     101:01      10:10        0:110
6     111110       110:10     101:10      10:11        0:111
7     1111110      110:11     101:11      110:0        10:00
8     11111110     1110:000   11000:000   110:10       10:01
9     111111110    1110:001   11000:001   110:11       10:100
10    1111111110   1110:010   11000:010   1110:0       10:101
Witten, Moffat, Bell, Managing Gigabytes (1999)
See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10
Index Compression: Performance
Witten, Moffat, Bell, Managing Gigabytes (1999)
Comparison of index size (bits per pointer):

           Bible    TREC
Unary      262      1918
Binary     15       20
γ          6.51     6.63
δ          6.23     6.38
Golomb     6.09     5.84

Use Golomb codes for d-gaps, γ codes for term frequencies
Optimal b ≈ 0.69 (N/df): a different b for every term!
Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2,070 MB)
Where are we without compression?
(key) → (values), docIDs buffered under a single key:
fish → 1, 9, 21, 34, 35, 80, …

(keys) → (values), one composite key per posting:
(fish, 1), (fish, 9), (fish, 21), (fish, 34), (fish, 35), (fish, 80), …

How is this different?
• Let the framework do the sorting
• Directly write postings to disk
• Term frequency implicitly stored
Index Compression in MapReduce
Need df to compress posting for each term
How do we compute df?
Count the # of postings in reduce(), then compress
Problem?
Order Inversion Pattern
In the mapper:
Emit “special” key-value pairs to keep track of df
In the reducer:
Make sure “special” key-value pairs come first: process them to
determine df
Remember: proper partitioning!
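One hedged sketch of the pieces this needs, using the newer Hadoop API: a composite (term, docID) key whose “special” pair sorts first, and a partitioner on the term alone (class and field names are illustrative, not the course code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical composite key: (term, docid), with docid = -1 reserved for the
// "special" df pair so that it sorts before all real postings of the term.
public class TermDocPair implements WritableComparable<TermDocPair> {
  public Text term = new Text();
  public IntWritable docid = new IntWritable();

  public void write(DataOutput out) throws IOException { term.write(out); docid.write(out); }
  public void readFields(DataInput in) throws IOException { term.readFields(in); docid.readFields(in); }

  public int compareTo(TermDocPair other) {
    int c = term.compareTo(other.term);                              // term first; then docid,
    return c != 0 ? c : Integer.compare(docid.get(), other.docid.get()); // so the special pair (-1) comes first
  }
}

// Partition on the term only, so a term's special pair and all of its postings
// reach the same reducer (which then sees the df before any posting).
class TermPartitioner extends Partitioner<TermDocPair, Writable> {
  public int getPartition(TermDocPair key, Writable value, int numReduceTasks) {
    return (key.term.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}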
Getting the df: Modified Mapper
one fish, two fish Doc 1

Input document…
Emit normal key-value pairs…  key = (term, docID) → value = positions:
(fish, 1) → [2, 4]
(one, 1) → [1]
(two, 1) → [3]
Emit “special” key-value pairs to keep track of df… (key = term only, value = df contribution):
fish → [1]
one → [1]
two → [1]
Getting the df: Modified Reducer
First, compute the df by summing contributions from all “special” key-value pairs…
fish → [63] [82] [27] …

…then the term's postings arrive, (key) → (value):
(fish, 1) → [2, 4]
(fish, 9) → [9]
(fish, 21) → [1, 8, 22]
(fish, 34) → [23]
(fish, 35) → [8, 41]
(fish, 80) → [2, 9, 76]

Write postings directly to disk
Compress postings incrementally as they arrive
Important: properly define sort order to make sure “special” key-value pairs come first!
Where have we seen this before?
In-class Exercise
Exercise: where have all the ngrams gone?
For each observed (word) trigram in the collection, output its observed (docID, wordIndex) locations

Possible Tools:
* pairs/stripes?
* combining?
* secondary sorting?
* order inversion?
* side effects?

Input:
one fish two fish Doc 1
one fish two salmon Doc 2
two fish two fish Doc 3

Output:
one fish two → [(1,1), (2,1)]
fish two fish → [(1,2), (3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]
Exercise: shingling
Given observed (docID, wordIndex) ngram locations, for each document, for each of its ngrams (in order), give the list of ngram locations for that ngram

Possible Tools:
* pairs/stripes?
* combining?
* secondary sorting?
* order inversion?
* side effects?

Input:
one fish two → [(1,1), (2,1)]
fish two fish → [(1,2), (3,2)]
fish two salmon → [(2,2)]
two fish two → [(3,1)]

Output:
Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]
Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]
Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]
Exercise: shingling (2)
How can we recognize when longer ngrams are aligned across documents?
Example
doc 1: a b c d e
doc 2: a b c d f
doc 3: e b c d f
doc 4: a b c d e
Find “a b c d” in docs 1, 2, and 4;
“b c d f” in docs 2 and 3;
“a b c d e” in docs 1 and 4
class Alignment
    int index       // start position in this document
    int length      // sequence length in ngrams
    int otherID     // ID of other document
    int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
    Set<Alignment> alignments = empty set
    index = 0
    NgramExtender(int docID) { _docID = docID }
    close() { foreach Alignment a, emit(_docID, a) }
    AlignNgrams(List<Ngram> ngrams)  // call this function iteratively, in order of ngrams observed in this document
    ...
@inproceedings{Kolak:2008,
  author    = {Kolak, Okan and Schilit, Bill N.},
  title     = {Generating links by mining quotations},
  booktitle = {19th ACM Conference on Hypertext and Hypermedia},
  year      = {2008},
  pages     = {117--126}
}
class Alignment
    int index       // start position in this document
    int length      // sequence length in ngrams
    int otherID     // ID of other document
    int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
    Set<Alignment> alignments = empty set
    index = 0
    NgramExtender(int docID) { _docID = docID }
    close() { foreach Alignment a, emit(_docID, a) }

    // call this function iteratively, in order of ngrams observed in this document
    AlignNgrams(List<Ngram> ngrams)
        ++index
        foreach Alignment a in alignments
            Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
            if (ngrams.contains(next))     // extend alignment
                a.length += 1
                ngrams.remove(next)
            else                           // terminate alignment
                emit(_docID, a)
                alignments.remove(a)
        foreach ngram in ngrams            // start a new alignment for each unmatched ngram
            alignments.add(new Alignment(index, 1, ngram.docID, ngram.position))
Sequences of MapReduce Jobs
Building more complex MR algorithms
Monolithic single Map + single Reduce
What we’ve done so far
Fitting all computation to this model can be difficult and ugly
We generally strive for modularization when possible
What else can we do?
Pipeline: [Map → Reduce] [Map → Reduce] … (multiple sequential jobs)
Chaining: [Map+ → Reduce → Map*]
• 1 or more Mappers
• 1 Reducer
• 0 or more Mappers
Pipelined Chain: [Map+ → Reduce → Map*] [Map+ → Reduce → Map*] …
Express arbitrary dependencies between jobs
Modularization and WordCount
General benefits of modularization
Re-use for easier/faster development
Consistent behavior across applications
Easier/faster to maintain/extend for benefit of many applications
Even basic word count can be broken down
Pre-processing
• How will we tokenize? Perform stemming? Remove stopwords?
Main computation: count tokens and group counts by word
Post-processing
• Transform the values? (e.g. log-damping)
Let’s separate tokenization into its own module
Many other tasks can likely benefit
First approach: pipeline…
Pipeline WordCount Modules
Tokenize: Tokenizer Mapper
• String -> List[String]
• Keep doc ID key
• E.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])
• No Reducer

Count: Observer Mapper
• List[String] -> List[(String, Int)]
• E.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”,1), (“10”, 1), (“cats”,1), (“sleep”,1)]
LongSumReducer
• Sum token counts
• E.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)
Pipeline WordCount in Hadoop
Two distinct jobs: tokenize and count
Data sharing between jobs via persistent output
Can use combiners and partitioners as usual (won’t bother here)
Let’s use SequenceFileOutputFormat rather than TextOutputFormat
sequence of binary key-value pairs; faster / smaller
tokenization output will stick around unless we delete it
Tokenize job
Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer
Output goes to directory we specify
Files will be read back in by the counting job
Output is array of tokens
We need to make a suitable Writable for String arrays
Count job
Input types defined by the input SequenceFile (don’t need to be specified)
Mapper is trivial
observes tokens from incoming data
Key: (docid) & Value: (Array of Strings, encoded as a Writable)
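One way such a Writable for String arrays might look (a sketch; the TextArrayWritable used in the course code may differ):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// ArrayWritable needs to know the element class to deserialize; wrapping it
// with Text gives a serializable array of strings.
public class TextArrayWritable extends ArrayWritable {
  public TextArrayWritable() { super(Text.class); }

  public TextArrayWritable(String[] strings) {
    super(Text.class);
    Text[] texts = new Text[strings.length];
    for (int i = 0; i < strings.length; i++) texts[i] = new Text(strings[i]);
    set(texts);
  }
}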
Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";
// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);
// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);
JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
Pipeline jobs in Hadoop
Old API
JobClient.runJob(..) does not return until the job finishes
New API
Use Job rather than JobConf
Use job.waitForCompletion instead of JobClient.runJob
Why Old API?
In 0.20.2, chaining only possible under old API
We want to re-use the same components for chaining (next…)
Chaining in Hadoop
Map+ Reduce Map*
1 or more Mappers
• Can use IdentityMapper
1 reducer
• No reducers: conf.setNumReduceTasks(0)?
0 or more Mappers
Usual combiners and partitioners
By default, data passed between
Mappers by usual writing of
intermediate data to disk
Can always use side-effects…
There is a better, built-in way to bypass
this and pass (Key,Value) pairs by
reference instead
• Requires different Mapper semantics!
(Figure: two parallel task pipelines, each flowing Mapper 1 → Intermediates → Reducer → Mapper 2 → Mapper 3 → Persistent Output.)
Hadoop: ChainMapper & ChainReducer
The examples below use JobConf objects (deprecated in Hadoop 0.20.2);
there is no undeprecated replacement in 0.20.2…
They work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...
boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper (job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);
JobClient.runJob(job);
Chaining in Hadoop
Let’s continue our running example:
Mapper 1: Tokenize
Mapper 2: Observe (count) words
Reducer: same IntSum reducer as always
Mapper 3: log-dampen counts
• We didn’t have this in our pipeline example but we’ll add here…
Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));
// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;
JobConf map1 = new JobConf(false); // tokenization
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,
LongWritable.class, Text.class,
LongWritable.class, TextArrayWritable.class, passByRef, map1);
JobConf map2 = new JobConf(false); // Add token observer job
ChainMapper.addMapper(chainJob, TrivialWordObserver.class,
LongWritable.class, TextArrayWritable.class,
Text.class, LongWritable.class, passByRef, map2);
JobConf reduce = new JobConf(false); // Set the int sum reducer
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
Text.class, LongWritable.class, passByRef, reduce);
JobConf map3 = new JobConf(false); // log-scaling of counts
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
Text.class, FloatWritable.class, passByRef, map3);
JobClient.runJob(chainJob);
Hadoop Chaining: Pass by Reference
Chaining allows possible optimization
Chained mappers run in same JVM thread, so opportunity to avoid
serialization to/from disk with pipelined jobs
Also lesser benefit of avoiding extra object destruction / construction
Gotchas
OutputCollector.collect(K k, V v) promises not to alter the content of k and v
But if Map1 passes (k,v) by reference to Map2 via collect(),
Map2 may alter (k,v) & thereby violate the contract
What to do?
Option 1: Honor the contract – don’t alter input (k,v) in Map2
Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()
Document carefully to avoid later changes silently breaking this…
Setting Dependencies Between Jobs
JobControl and Job provide the mechanism
New API: no JobConf, create Job from Configuration, …
// create jobconf1 and jobconf2 as appropriate
// …
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
job2.addDependingJob(job1);
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.run();
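Since JobControl.run() blocks, a common pattern (a sketch, not from the course code) is to drive it from its own thread and poll for completion instead of calling run() directly:

Thread workflow = new Thread(jbcntrl);   // JobControl implements Runnable
workflow.start();
while (!jbcntrl.allFinished()) {
  try { Thread.sleep(1000); } catch (InterruptedException e) { break; }
  // could inspect jbcntrl.getFailedJobs() here and bail out early
}
jbcntrl.stop();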
Higher Level Abstractions
Pig: language and execution environment for expressing
MapReduce data flows. (pretty much the standard)
See White, Chapter 11
Cascading: another environment with a higher level of
abstraction for composing complex data flows
See White, Chapter 16, pp 539-552
Cascalog: query language based on Cascading that uses
Clojure (a JVM-based LISP variant)
Word count in Cascalog
Certainly more concise – though you need to grok the syntax.
(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))