Probabilistic Counting with Randomized Storage
Benjamin Van Durme and Ashwin Lall
Thursday, July 16, 2009
Transcript
Page 1

Probabilistic Counting with Randomized Storage

Benjamin Van Durme and Ashwin Lall

Thursday, July 16, 2009

Page 2

Data Overload

• Lots of text (and images, audio, ...) is good

• But how to process it all?

• Approximate algorithms!

2

Make the best of what you’ve got

Page 3

Data Overload

• Lots of text (and images, audio, ...) is good

• But how to process it all?

• Approximate algorithms!

2

More data equals better results

Make the best of what you’ve got

Page 4

Data Overload

• Lots of text (and images, audio, ...) is good

• But how to process it all?

• Approximate algorithms!

2

More data equals better results

Buy/rent a data center?

Make the best of what you’ve got

Page 5

Bulky Data

3

1980  1985  1990  1995  2000  2005  ...

Page 6

Bulky Data in Small Space

4

1980  1985  1990  1995  2000  2005  ...

Page 7

Bulky Data in Small Space Online?

5

1980 ... 2000 ...

+ + +

Page 8

Outline

• Storing Static Counts

• Counting Online

• Experiments

• Additional Comments

6

Page 9

Outline

• Storing Static Counts

• Counting Online

• Experiments

• Additional Comments

7

Page 10

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

8
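The whole idea fits in a few lines of code. Below is a minimal Bloom filter sketch in Python for illustration only (it is not the implementation behind these slides); simulating the k hash functions by salting SHA-1 is an assumption made for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes set bits in an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, x):
        # Simulate k independent hash functions by salting one digest.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1

    def lookup(self, y):
        # One-sided error: never a false negative, occasionally a false
        # positive when all k positions were already set by other items.
        return all(self.bits[pos] for pos in self._positions(y))

bf = BloomFilter(m=1000, k=3)
bf.insert("the dog")
print(bf.lookup("the dog"))   # True
print(bf.lookup("the cat"))   # False, with high probability
```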

Page 11

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

Insert(x)

8

Page 12

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

Insert(x)

8

Page 13

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

Insert(x)

8

Page 14

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

8

Page 15

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

Lookup(y)

8

Page 16

Bloom Filters [Bloom ’70]

• Records set membership.

• No false negatives.

• Some false positives.

• Think hashtables, where you throw away the key.

Lookup(y)

8

Page 17

• Bloom filters are nice when you can tolerate a small false-positive rate.

• And your x’s are large.

• For example, Language Modeling.

9

Bloom Filters ...

Insert(x)

Page 18

Motivation: n-grams for MT

10

...the dog

dog barked
barked at

...

Page 19

Motivation: n-grams for MT

11

...the dog 97

dog barked 42
barked at 58

...

Page 20

Motivation: n-grams for MT...

the dog 97
dog barked 42
barked at 58
...

狗叫了...

The cat barked ...

The dog barked ... Dog barked ...

??

Page 21

Motivation: n-grams for MT

13

...the dog 97

dog barked 42
barked at 58

...

狗叫了...

The cat barked ...

The dog barked ... Dog barked ...

??

Page 22

Motivation: n-grams for MT

14

...the dog 97

dog barked 42
barked at 58

...

狗叫了...

The cat barked ...

The dog barked ... Dog barked ...

??

Page 23

Storing Counts with Bloom Filters

15

ACL 2007

Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 512–519, Prague, Czech Republic, June 2007. © 2007 Association for Computational Linguistics

Randomised Language Modelling for Statistical Machine Translation

David Talbot and Miles Osborne
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh, EH8 9LW, UK
[email protected], [email protected]

Abstract

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation.

We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate n-grams. Our solutions in both cases retain the one-sided error guarantees of the BF while taking advantage of the Zipf-like distribution of word frequencies to reduce the space requirements.

1 Introduction

Language modelling (LM) is a crucial component in statistical machine translation (SMT). Standard n-gram language models assign probabilities to translation hypotheses in the target language, typically as smoothed trigram models, e.g. (Chiang, 2005). Although it is well-known that higher-order LMs and models trained on additional monolingual corpora can yield better translation performance, the challenges in deploying large LMs are not trivial. Increasing the order of an n-gram model can result in an exponential increase in the number of parameters; for corpora such as the English Gigaword corpus, for instance, there are 300 million distinct trigrams and over 1.2 billion 5-grams. Since a LM may be queried millions of times per sentence, it should ideally reside locally in memory to avoid time-consuming remote or disk-based look-ups.

Against this background, we consider a radically different approach to language modelling: instead of explicitly storing all distinct n-grams, we store a randomised representation. In particular, we show that the Bloom filter (Bloom (1970); BF), a simple space-efficient randomised data structure for representing sets, may be used to represent statistics from larger corpora and for higher-order n-grams to complement a conventional smoothed trigram model within an SMT decoder. [1]

The space requirements of a Bloom filter are quite spectacular, falling significantly below information-theoretic error-free lower bounds while query times are constant. This efficiency, however, comes at the price of false positives: the filter may erroneously report that an item not in the set is a member. False negatives, on the other hand, will never occur: the error is said to be one-sided.

In this paper, we show that a Bloom filter can be used effectively for language modelling within an SMT decoder and present the log-frequency Bloom filter, an extension of the standard Boolean BF that ...

[1] For extensions of the framework presented here to stand-alone smoothed Bloom filter language models, we refer the reader to a companion paper (Talbot and Osborne, 2007).


Algorithm 1 Training frequency BF
Input: S_train, {h1, ..., hk} and BF = ∅
Output: BF
for all x ∈ S_train do
  c(x) ← frequency of n-gram x in S_train
  qc(x) ← quantisation of c(x) (Eq. 1)
  for j = 1 to qc(x) do
    for i = 1 to k do
      hi(x) ← hash of event {x, j} under hi
      BF[hi(x)] ← 1
    end for
  end for
end for
return BF

3.1 Log-frequency Bloom filter

The efficiency of our scheme for storing n-gram statistics within a BF relies on the Zipf-like distribution of n-gram frequencies in natural language corpora: most events occur an extremely small number of times, while a small number are very frequent.

We quantise raw frequencies, c(x), using a logarithmic codebook as follows,

  qc(x) = 1 + ⌊log_b c(x)⌋.    (1)

The precision of this codebook decays exponentially with the raw counts and the scale is determined by the base of the logarithm b; we examine the effect of this parameter in experiments below.

Given the quantised count qc(x) for an n-gram x, the filter is trained by entering composite events consisting of the n-gram appended by an integer counter j that is incremented from 1 to qc(x) into the filter. To retrieve the quantised count for an n-gram, it is first appended with a count of 1 and hashed under the k functions; if this tests positive, the count is incremented and the process repeated. The procedure terminates as soon as any of the k hash functions hits a 0 and the previous count is reported. The one-sided error of the BF and the training scheme ensure that the actual quantised count cannot be larger than this value. As the counts are quantised logarithmically, the counter will be incremented only a small number of times. The training and testing routines are given here as Algorithms 1 and 2 respectively.

Errors for the log-frequency BF scheme are one-sided: frequencies will never be underestimated.

Algorithm 2 Test frequency BF
Input: x, MAXQCOUNT, {h1, ..., hk} and BF
Output: Upper bound on qc(x) ∈ S_train
for j = 1 to MAXQCOUNT do
  for i = 1 to k do
    hi(x) ← hash of event {x, j} under hi
    if BF[hi(x)] = 0 then
      return j − 1
    end if
  end for
end for

The probability of overestimating an item's frequency decays exponentially with the size of the overestimation error d (i.e. as f^d for d > 0) since each erroneous increment corresponds to a single false positive and d such independent events must occur together.

3.2 Sub-sequence filtering

The error analysis in Section 2 focused on the false positive rate of a BF; if we deploy a BF within an SMT decoder, however, the actual error rate will also depend on the a priori membership probability of items presented to it. The error rate Err is,

  Err = Pr(x ∉ S_train | Decoder) · f.

This implies that, unlike a conventional lossless data structure, the model's accuracy depends on other components in the system and how it is queried.

We take advantage of the monotonicity of the n-gram event space to place upper bounds on the frequency of an n-gram prior to testing for it in the filter and potentially truncate the outer loop in Algorithm 2 when we know that the test could only return positive in error.

Specifically, if we have stored lower-order n-grams in the filter, we can infer that an n-gram cannot be present if any of its sub-sequences tests negative. Since our scheme for storing frequencies can never underestimate an item's frequency, this relation will generalise to frequencies: an n-gram's frequency cannot be greater than the frequency of its least frequent sub-sequence as reported by the filter,

  c(w1, ..., wn) ≤ min {c(w1, ..., w_(n−1)), c(w2, ..., wn)}.

We use this to reduce the effective error rate of BF-LMs that we use in the experiments below.
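Algorithms 1 and 2 compress into a few lines of Python. The sketch below is only an illustration of the procedure described above, not the authors' code; the salted-hash construction and the parameter names (counts, k, m, b) are assumptions made for the example.

```python
import hashlib
from math import floor, log

def positions(x, j, k, m):
    # Hash the composite event {x, j} under k salted hash functions.
    for i in range(k):
        digest = hashlib.sha1(f"{i}|{j}|{x}".encode()).hexdigest()
        yield int(digest, 16) % m

def train_frequency_bf(counts, k, m, b=2.0):
    """Algorithm 1: insert each n-gram x once per unary level 1..qc(x)."""
    bf = [0] * m
    for x, c in counts.items():
        qc = 1 + floor(log(c, b))            # Eq. (1): logarithmic quantisation
        for j in range(1, qc + 1):
            for pos in positions(x, j, k, m):
                bf[pos] = 1
    return bf

def test_frequency_bf(bf, x, k, m, max_qcount=32):
    """Algorithm 2: probe levels upward; report the last fully-set level."""
    for j in range(1, max_qcount + 1):
        if any(bf[pos] == 0 for pos in positions(x, j, k, m)):
            return j - 1                     # one-sided: never an underestimate
    return max_qcount

bf = train_frequency_bf({"the dog": 97, "dog barked": 42, "barked at": 58}, k=3, m=4096)
print(test_frequency_bf(bf, "dog barked", k=3, m=4096))
# 6, barring false positives: qc = 1 + floor(log2 42) = 6
```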



Page 24

Storing Counts

• Multiple layers of Bloom filters.

• Store the exponent in unary.

16

...

c(x) ≈ b^(qc(x)−1)

qc(x) = 1

qc(x) = 2

qc(x) = 3...
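A worked example (my numbers, not from the slides): with base b = 2, an n-gram seen 42 times quantises to qc(x) = 1 + ⌊log_2 42⌋ = 6, so it is entered into layers 1 through 6; a lookup that finds six consecutive set layers then reports c(x) ≈ 2^(6−1) = 32.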

Page 25

Outline

• Storing Static Counts

• Counting Online

• Experiments

• Additional Comments

17

Page 26

Spectral Bloom Filter

18

SIGMOD 2003

The Spectral Bloom Filter (SBF) replaces the bit vector V with a vector of m counters, C.

Page 27

Spectral Bloom Filter [Cohen & Matias ’03]

19

Page 28

Insert(x)

1 1 1

Spectral Bloom Filter [Cohen & Matias ’03]

Page 29

Insert(x)

2 2 2

Spectral Bloom Filter [Cohen & Matias ’03]

Page 30

Insert(x)

3 3 3

Spectral Bloom Filter [Cohen & Matias ’03]

Page 31

Insert(x)

4 4 4

Spectral Bloom Filter [Cohen & Matias ’03]

Page 32

Insert(y)

5 5 4 1

Spectral Bloom Filter [Cohen & Matias ’03]

Page 33

Lookup(x)

5 5 4 1

Spectral Bloom Filter [Cohen & Matias ’03]

Page 34

Lookup(x)

5 5 4 1

Spectral Bloom Filter [Cohen & Matias ’03]
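For illustration, a minimal counting variant in Python (a sketch of the idea, not Cohen & Matias's implementation; the salted-hash construction is assumed). Lookup takes the minimum over the k addressed counters, since collisions can only inflate them.

```python
import hashlib

class SpectralBloomFilter:
    """The bit vector is replaced by m small counters; lookup = min over k."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, x):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        for pos in self._positions(x):
            self.counters[pos] += 1

    def lookup(self, x):
        # Collisions only ever add to a counter, so the smallest of the k
        # values is the tightest (still one-sided) estimate of the count.
        return min(self.counters[pos] for pos in self._positions(x))

sbf = SpectralBloomFilter(m=1000, k=3)
for _ in range(4):
    sbf.insert("x")
sbf.insert("y")
print(sbf.lookup("x"), sbf.lookup("y"))   # 4 1, barring collisions
```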

Page 35

Collect Counts Online

• Count in log-scale, to save space.

27

Page 36

Collect Counts Online

• Count in log-scale, to save space.

• Robert Morris (1978) gave us a way to do this.

[State diagram for Morris counting: states represent values 1, b, b^2, ...; from each state the counter advances with probability b^(-1), b^(-2), b^(-3), ... and stays put with probability 1 − b^(-1), 1 − b^(-2), 1 − b^(-3), ...]

28
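A sketch of Morris's probabilistic increment in Python (illustrative only; the class and parameter names are mine):

```python
import random

class MorrisCounter:
    """Store only an exponent f; increment it with probability b**(-f)."""
    def __init__(self, b=2.0):
        self.b, self.f = b, 0

    def increment(self):
        if random.random() < self.b ** (-self.f):
            self.f += 1

    def estimate(self):
        # Unbiased estimate of the number of increments seen so far.
        return (self.b ** self.f - 1) / (self.b - 1)

mc = MorrisCounter(b=2.0)
for _ in range(1000):
    mc.increment()
print(mc.f, round(mc.estimate()))   # f around 10; estimate near 1000
```

A register of w bits can therefore represent counts up to roughly (b^(2^w − 1) − 1)/(b − 1) instead of 2^w − 1, at the cost of randomness in the estimate.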

Page 37

Morris Bloom Counter

• Spectral Bloom Filter,

• but with Morris style updating.

29

Lookup(x)

5 5 4 1

Page 38

Morris Bloom Counter

30

Lookup(x)

15 15 7 1

5 5 4 1

c(x) ≈ (b^f − 1)/(b − 1)

• Spectral Bloom Filter,

• but with Morris style updating.
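For example (my numbers): with b = 2, a lookup whose smallest register holds f = 4 reports c(x) ≈ (2^4 − 1)/(2 − 1) = 15, even though no register ever stored a value larger than 4.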

Page 39

Morris Bloom Counter

• Same amount of space as Spectral Bloom Filter,

31

Lookup(x)

15 15 7 1

5 5 4 1

Page 40

Morris Bloom Counter

• Same amount of space as Spectral Bloom Filter,

• gives exponentially larger max-count,

32

Lookup(x)

15 15 7 1

5 5 4 1

Page 41

Morris Bloom Counter

• Same amount of space as Spectral Bloom Filter,

• gives exponentially larger max-count,

• but false positives can therefore have higher relative error.

33

Lookup(x)

15 15 7 1

5 5 4 1

Page 42

Reduce False Positive Rate

34

• Morris Bloom Counter,

Page 43

Reduce False Positive Rate

35

• Morris Bloom Counter,

• split into layers,

Page 44

Reduce False Positive Rate

36

Insert(x)

• Morris Bloom Counter,

• split into layers,

• with different hash functions per layer.

Page 45

Reduce False Positive Rate

37

Insert(x)

• Morris Bloom Counter,

• split into layers,

• with different hash functions per layer.

Page 46

Reduce False Positive Rate

38

Insert(x)

• Morris Bloom Counter,

• split into layers,

• with different hash functions per layer.

Page 47

Reduce False Positive Rate

39

Insert(x)

• Morris Bloom Counter,

• split into layers,

• with different hash functions per layer.

Page 48

Talbot Osborne Morris Bloom (TOMB)Counter

• Combination of Morris Bloom Counter with Talbot Osborne count storage.

• Stay tuned for related work by Talbot.

40
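As a rough illustration of how the pieces compose, here is a much-simplified TOMB-style counter in Python: unary levels stored in a single Bloom filter (every layer of height one), advanced with Morris-style probabilistic increments. This is a sketch of the general idea only, not the exact parametrization analysed in the paper.

```python
import hashlib
import random

class SimpleTOMB:
    """Simplified TOMB-style counter: Talbot-Osborne unary levels in a Bloom
    filter, advanced with Morris-style probabilistic increments."""
    def __init__(self, m, k, levels, b=2.0):
        self.m, self.k, self.levels, self.b = m, k, levels, b
        self.bits = [0] * m

    def _positions(self, x, level):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}|{level}|{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def _value(self, x):
        # Climb levels while every addressed bit is set (cf. Algorithm 2).
        for j in range(1, self.levels + 1):
            if any(self.bits[p] == 0 for p in self._positions(x, j)):
                return j - 1
        return self.levels

    def insert(self, x):
        f = self._value(x)
        # Morris-style update: advance to the next level with probability b^-f.
        if f < self.levels and random.random() < self.b ** (-f):
            for p in self._positions(x, f + 1):
                self.bits[p] = 1

    def estimate(self, x):
        return (self.b ** self._value(x) - 1) / (self.b - 1)

tomb = SimpleTOMB(m=8192, k=3, levels=16)
for _ in range(500):
    tomb.insert("the dog")
print(round(tomb.estimate("the dog")))   # roughly 500, plus Morris/Bloom noise
```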

Page 49

Tradeoff

41

• Trade number of layers for expressivity.

M = Σ_i (2^(h_i) − 1)
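Reading off the formula: a layer of height h_i is a vector of h_i-bit counters, so it can contribute at most 2^(h_i) − 1 to a stored value; summing over layers gives the largest representable (quantised) count M, as the configurations on the following slides illustrate.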

Page 50

Tradeoff

42

h = 4

0, 1, 2, ..., 14, 15

• Trade number of layers for expressivity.

M = Σ_i (2^(h_i) − 1)

Page 51

Tradeoff

43

h1 = 2

h2 = 2

0, 1, 2, 3, 4, 5, 6

• Trade number of layers for expressivity.

M = Σ_i (2^(h_i) − 1)

Page 52

Tradeoff

44

• Trade number of layers for expressivity.

h1 = 1

h2 = 1

h3 = 1

h4 = 1

0, 1, 2, 3, 4

M = Σ_i (2^(h_i) − 1)

Page 53

Tradeoff

45

• Trade number of layers for expressivity.

h1 = 1

h2 = 3

1, 2, 3, ..., 7, 8

M = Σ_i (2^(h_i) − 1)

Page 54

“Layers”

• Layers are a useful visualization.

• In practice, consecutive layers of equal height are implemented as single vectors with sets of hash functions.

46

h1, h2, h3, h4 = 1

h5, h6, h7 = 3

Page 55

Outline

• Storing Static Counts

• Counting Online

• Experiments

• Additional Comments

47

Page 56

Experiment: Count Accuracy

48

100 MB 500 MB

Count all trigrams in Gigaword, randomly query 1,000 values, compare to truth

[Two plots (100 MB and 500 MB counters): log frequency versus rank for the 1,000 queried trigrams; rank 0–1000 on the x-axis, log frequency 0–8 on the y-axis.]

Page 57

Experiment: MT

SIZE (MB)   I>0     I>1     I>2     I>3
100         86.5%   74.2%   66.1%   43.5%
500         26.9%   6.7%    1.8%    0.3%
2,000       10.9%   0.9%    0.1%    0.0%

Table 1: False positive rates when using indicator functions I>0, ..., I>3. A perfect counter has a rate of 0.0% using I>0.

TRUE    260MB   100MB   50MB    25MB    NO LM
22.75   22.93   22.27   21.59   19.06   17.35
-       22.88   21.92   20.52   18.91   -
-       22.34   21.82   20.37   18.69   -

Table 2: BLEU scores using language models based on true counts, compared to approximations using various size TOMB counters. Three trials for each counter are reported (recall Morris counting is probabilistic, and thus results may vary between similar trials).

4.3 Language Models for Machine Translation

As an example of approximate counts in practice, we follow Talbot and Osborne [2007] in constructing n-gram language models for Machine Translation (MT). Experiments compared the use of unigram, bigram and trigram counts stored explicitly in hashtables, to those collected using TOMB counters allowed varying amounts of space. Counters had five layers of height one, followed by five layers of height three, with 75% of available space allocated to the first five layers. Smoothing was performed using Absolute Discounting [Ney et al., 1994] with an ad hoc discount value of 0.75.

The resultant language models were substituted for the trigram model used in the experiments of Post and Gildea [2008], with counts collected over the same approximately 833 thousand sentences described therein. Explicit, non-compressed storage of these counts required 260 MB. Case-insensitive BLEU-4 scores were computed for those authors' DEV/10 development set, a collection of 371 Chinese sentences comprised of twenty words or less. While more advanced language modeling methods exist (see, e.g., [Yuret, 2008]), our concern here is specifically on the impact of approximate counting with respect to a given framework, relative to the use of actual values. [6]

As shown in Table 2, performance declines as a function of counter size, verifying that the tradeoff between space and accuracy in applications explored by Talbot and Osborne extends to approximate counts collected online.

5 Conclusions

Building on existing work in randomized count storage, we have presented a general model for probabilistic counting over large numbers of elements in the context of limited space. We have defined a parametrizable structure, the Talbot Osborne Morris Bloom (TOMB) counter, and presented analysis along with experimental results displaying its ability to trade space for loss in reported count accuracy.

Future work includes looking at optimal classes of counters for particular tasks and element distributions. While motivated by needs within the Computational Linguistics community, there are a variety of fields that could benefit from methods for space efficient counting. For example, we've recently begun experimenting with visual n-grams using vocabularies built from SIFT features, based on images from the Caltech-256 Object Category Dataset [Griffin et al., 2007].

Finally, developing clever methods for buffered inspection will allow for online parameter estimation, a required ability if TOMB counters are to be best used successfully with no knowledge of the target stream distribution a priori.

[6] Post and Gildea report a trigram-based BLEU score of 26.18, using more sophisticated smoothing and backoff techniques.

Acknowledgements: The first author benefited from conversations with David Talbot concerning the work of Morris and Bloom, as well as with Miles Osborne on the emerging need for randomized storage. Daniel Gildea and Matt Post provided general feedback and assistance in experimentation.

References

[Bloom, 1970] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13:422–426, 1970.
[Cohen and Matias, 2003] Saar Cohen and Yossi Matias. Spectral Bloom Filters. In Proceedings of SIGMOD, 2003.
[Flajolet, 1985] Philippe Flajolet. Approximate counting: a detailed analysis. BIT, 25(1):113–134, 1985.
[Goyal et al., 2009] Amit Goyal, Hal Daume III, and Suresh Venkatasubramanian. Streaming for large scale NLP: Language Modeling. In Proceedings of NAACL, 2009.
[Graff, 2003] David Graff. English Gigaword. Linguistic Data Consortium, Philadelphia, 2003.
[Griffin et al., 2007] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 Object Category Dataset. Technical report, California Institute of Technology, 2007.
[Manku and Motwani, 2002] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of VLDB, 2002.
[Morris, 1978] Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840–842, 1978.
[Ney et al., 1994] Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, 8:1–38, 1994.
[Post and Gildea, 2008] Matt Post and Daniel Gildea. Parsers as language models for statistical machine translation. In Proceedings of AMTA, 2008.
[Talbot and Brants, 2008] David Talbot and Thorsten Brants. Randomized language models via perfect hash functions. In Proceedings of ACL, 2008.
[Talbot and Osborne, 2007] David Talbot and Miles Osborne. Randomised Language Modelling for Statistical Machine Translation. In Proceedings of ACL, 2007.
[Talbot, 2009] David Talbot. Bloom Maps for Big Data. PhD thesis, University of Edinburgh, 2009.
[Wikimedia Foundation, 2004] Wikimedia Foundation. Wikipedia: The free encyclopedia. http://en.wikipedia.org, 2004.
[Yuret, 2008] Deniz Yuret. Smoothing a tera-word language model. In Proceedings of ACL, 2008.

49

Build counters with varying amounts of memory

(based on system of Post & Gildea ’08)

Page 58


Three runs per counter size

50

Experiment: MT ...

Page 59


51

[Bar chart of BLEU scores: True, 260MB, 100MB, 50MB, 25MB, No LM (y-axis 0–23). Averages over the three TOMB trials: 260MB 22.72, 100MB 22.00, 50MB 20.83, 25MB 18.89.]

Experiment: MT ...

Page 60

Outline

• Storing Static Counts

• Counting Online

• Experiments

• Additional Comments

52

Page 61

Related

• Applies the method of Manku and Motwani '02.

• Track most frequent elements in stream.

• Rare elements discarded.

• Strong guarantee on counts for top elements.

53

NAACL 2009

Streaming for large scale NLP: Language Modeling

Amit Goyal, Hal Daume III, and Suresh Venkatasubramanian
University of Utah, School of Computing
{amitg,hal,suresh}@cs.utah.edu

Abstract

In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (8 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting high-n approximate small language model is as effective as models obtained from other count pruning methods.

1 Introduction

In many NLP problems, we are faced with the challenge of dealing with large amounts of data. Many problems boil down to computing relative frequencies of certain items on this data. Items can be words, patterns, associations, n-grams, and others. Language modeling (Chen and Goodman, 1996), noun-clustering (Ravichandran et al., 2005), constructing syntactic rules for SMT (Galley et al., 2004), and finding analogies (Turney, 2008) are examples of some of the problems where we need to compute relative frequencies. We use language modeling as a canonical example of a large-scale task that requires relative frequency estimation.

Computing relative frequencies seems like an easy problem. However, as corpus sizes grow, it becomes a highly computationally expensive task.

Cutoff   Size     BLEU    NIST    MET
Exact    367.6m   28.73   7.691   56.32
2        229.8m   28.23   7.613   56.03
3        143.6m   28.17   7.571   56.53
5        59.4m    28.33   7.636   56.03
10       18.3m    27.91   7.546   55.64
100      1.1m     28.03   7.607   55.91
200      0.5m     27.62   7.550   55.67

Table 1: Effect of count-based pruning on SMT performance using EAN corpus. Results are according to BLEU, NIST and METEOR (MET) metrics. Bold #s are not statistically significantly worse than the exact model.

Brants et al. (2007) used 1500 machines for a day to compute the relative frequencies of n-grams (summed over all orders from 1 to 5) from 1.8 TB of web data. Their resulting model contained 300 million unique n-grams.

It is not realistic using conventional computing resources to use all the 300 million n-grams for applications like speech recognition, spelling correction, information extraction, and statistical machine translation (SMT). Hence, one of the easiest ways to reduce the size of this model is to use count-based pruning, which discards all n-grams whose count is less than a pre-defined threshold. Although count-based pruning is quite simple, yet it is effective for machine translation. As we do not have a copy of the web, we will use a portion of Gigaword, i.e. EAN (see Section 4.1), to show the effect of count-based pruning on performance of SMT (see Section 5.1). Table 1 shows that using a cutoff of 100 produces a model of size 1.1 million n-grams with a BLEU score of 28.03. If we compare this with an exact model of size 367.6 million n-grams, we see an increase of 0.8 points in BLEU (95% statistical significance level ...
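The pruning strategy the bullets above describe (Manku and Motwani's lossy counting) is easy to sketch; the Python below is a simplified illustration under assumed parameter names, not the implementation used by Goyal et al.

```python
def lossy_count(stream, epsilon=0.001):
    """Simplified lossy counting: keep counters only for frequent items; any
    reported count undercounts the true count by at most epsilon * N."""
    width = int(1 / epsilon)                 # items per bucket
    counts, bucket = {}, 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]   # [count, maximum undercount]
        if n % width == 0:                   # bucket boundary: prune rare items
            counts = {x: c for x, c in counts.items() if c[0] + c[1] > bucket}
            bucket += 1
    return {x: c[0] for x, c in counts.items()}

print(lossy_count(["the dog"] * 90 + ["rare"] * 10, epsilon=0.05))
```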

Page 62

Data that is not text

• Not just for Comp. Ling.

• E.g., count n-grams over “vocabularies” based on SIFT features.

54

Page 63

Humans

• People store large amounts of information in their heads,

• and they do it online.

• Space-efficient online counting provides an additional area for interfacing with the Cog. Sci. community.

55

Page 64

Acknowledgements

• Ashwin Lall (co-author)

56

Page 65

Acknowledgements

• Ashwin Lall (co-author)

• David Talbot,Miles Osborne

57

Page 66

Acknowledgements

• Ashwin Lall (co-author)

• David Talbot,Miles Osborne

• Matt Post, Nick Morsillo,Dan Gildea

58

Page 67

Questions?

59

www.cs.rochester.edu/~vandurme

www.cc.gatech.edu/~alall
