An NLP Approach to the Evaluation of Web Corpora

Page 1: An NLP Approach to the Evaluation of Web Corpora

An NLP Approach to the Evaluation of Web Corpora

Stefan Evert

Corpus Linguistics Group, Department Germanistik & Komparatistik, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

[email protected]

Leuven, 17 February 2015

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 1 / 64

Page 2: An NLP Approach to the Evaluation of Web Corpora

Collaborators: parts of this presentation are based on the following studies

Biemann, Chris; Bildhauer, Felix; Evert, Stefan; Goldhahn, Dirk; Quasthoff, Uwe; Schäfer, Roland; Simon, Johannes; Swiezinski, Leonard; Zesch, Torsten (2013). Scalable construction of high-quality Web corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2), 23–59.

Bartsch, Sabine and Evert, Stefan (2014). Towards a Firthian notion of collocation. In A. Abel and L. Lemnitzer (eds.), Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern. 2. Arbeitsbericht des wissenschaftlichen Netzwerks Internetlexikografie, OPAL 02/2014. Institut für Deutsche Sprache, Mannheim, pp. 48–61.

Lapesa, Gabriella and Evert, Stefan (2014). A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2, 531–545.

Bartsch, Sabine; Evert, Stefan; Proisl, Thomas; Uhrig, Peter (2015). (Association) measure for measure: comparing collocation dictionaries with co-occurrence data for a better understanding of the notion of collocation. Presentation at ICAME 36 Conference, Trier, Germany.

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 2 / 64

Page 3: An NLP Approach to the Evaluation of Web Corpora

Introduction The Web as Corpus

Why Web corpora?

because more data are better data (Church and Mercer 1993)

Properties of the Web
- Internet English, distribution of Web genres, hyperlink graph
- Web corpus = random sample of the (public) WWW

Computer-mediated communication (CMC)
- Twitter, Facebook, chatroom logs, discussion groups, ...
- many Web genres share aspects of interactive CMC
- Web corpus = targeted collection of CMC genres

As replacement for linguistic reference corpora
- main goal of the early WaC(ky) community
- cheaper, larger and more up-to-date than traditional corpora
- Web corpus should be similar to reference corpus

Scaling up NLP training data (Banko and Brill 2001)
- 1964: 1 million words (Brown Corpus)
- 1995: 100 million words (British National Corpus)
- 2003: 1,000+ million words (English Gigaword, WaCky)
- 2006: 1,000,000 million words (Google Web 1T 5-Grams)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 3 / 64

Page 9: An NLP Approach to the Evaluation of Web Corpora

Introduction The Web as Corpus

Is bigger always better?

From small, clean and well designed ...
- British National Corpus (BNC)
- movie subtitles, newspapers, ...

... to large and messy ...
- WaCky, WebBase, COW, TenTen, GloWbE, Aranea, ...
- sampling frame unclear, lack of metadata
- boilerplate, duplicates, non-standard language

... to huge n-gram databases
- largest corpora only available as n-gram databases, e.g. Google's 1-trillion-word Web corpus (Web 1T 5-Grams)
- tend to be even messier, often w/o linguistic annotation
- lack of context, incomplete because of frequency threshold

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 4 / 64

Page 12: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

The Google Web 1T 5-Gram database (Brants and Franz 2006)

word 1       word 2       word 3     f
supplement   depend       on         193
supplement   depending    on         174
supplement   depends      entirely   94
supplement   depends      on         338
supplement   derived      from       2668
supplement   des          coups      77
supplement   described    in         200

excerpt from file 3gm-0088.gz

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 5 / 64

Page 13: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Web1T5 made Easy, but not for the computer (Evert 2010)

word 1       word 2       word 3     f
supplement   depend       on         193
supplement   depending    on         174
supplement   depends      entirely   94
supplement   depends      on         338
supplement   derived      from       2668
supplement   des          coups      77
supplement   described    in         200

This looks very much like a relational database table

So why not just put the data into an off-the-shelf RDBMS?
- built-in indexing for quick access
- powerful query language SQL

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 6 / 64

Page 14: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Web1T5 made Easy, but not for the computer (Evert 2010)

word          id
depend        6094
depending     3571
depends       3846
...           ...
on            14
...           ...
supplement    5095

id 1    id 2    id 3    f
5095    6094    14      193
5095    3571    14      174
5095    3846    4585    94
5095    3846    14      338
5095    4207    27      2668
5095    2298    62481   77
5095    1840    11      200

Use numeric ID coding as in IR / large-corpus query engines

More efficient to store, index and sort in RDBMS

Frequency-sorted lexicon is beneficial for variable-length coding of integer IDs (used by SQLite)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 7 / 64
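
The ID-coding scheme above can be illustrated with a minimal Python sketch using the standard-library sqlite3 module. The table and column names (vocab, ngrams) mirror the toy excerpt but are purely illustrative, not the actual Web1T5-Easy schema: words are numbered by descending frequency so that frequent words receive small IDs, and each n-gram is stored as a tuple of integer IDs.

import sqlite3
from collections import Counter

# Toy 3-gram counts (word1, word2, word3, frequency), as in the Web1T5 excerpt.
ngrams = [
    ("supplement", "depend", "on", 193),
    ("supplement", "depending", "on", 174),
    ("supplement", "depends", "on", 338),
    ("supplement", "derived", "from", 2668),
]

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE vocab (id INTEGER PRIMARY KEY, w TEXT UNIQUE, f INTEGER)")
cur.execute("CREATE TABLE ngrams (w1 INTEGER, w2 INTEGER, w3 INTEGER, f INTEGER)")

# Frequency-sorted lexicon: the most frequent word gets the smallest ID,
# which keeps SQLite's variable-length integer encoding compact.
word_freq = Counter()
for w1, w2, w3, f in ngrams:
    for w in (w1, w2, w3):
        word_freq[w] += f
ids = {}
for rank, (w, f) in enumerate(word_freq.most_common(), start=1):
    ids[w] = rank
    cur.execute("INSERT INTO vocab VALUES (?, ?, ?)", (rank, w, f))

# Store n-grams as integer ID triples instead of strings and index each position.
cur.executemany(
    "INSERT INTO ngrams VALUES (?, ?, ?, ?)",
    [(ids[w1], ids[w2], ids[w3], f) for w1, w2, w3, f in ngrams],
)
for col in ("w1", "w2", "w3"):
    cur.execute(f"CREATE INDEX idx_{col} ON ngrams ({col})")
con.commit()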

Page 15: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Web1T5-Easy database encoding procedure

Pre-processing (normalisation, filtering, ...)
⇓ Numeric ID coding & database insertion [1d 23h]
⇓ Collapse duplicate rows (from normalisation) [6d 7h]
⇓ Indexing of each n-gram position [3d 2h]
⇓ Statistical analysis for query optimisation [not useful]
⇓ Build database of co-occurrence frequencies [ca. 3d]

Total size of the resulting database: 211 GiB

Carried out in spring 2009 on a quad-core Opteron 2.6 GHz with 16 GiB RAM; should be faster on a state-of-the-art server with the latest version of SQLite.

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 8 / 64

Page 16: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Querying the database

It’s easy to search the database for patterns like

association ... Xal Y

with a “simple” SQL query:

SELECT w3, w4, SUM(f) AS freq FROM ngrams
  WHERE w1 IN (SELECT id FROM vocab WHERE w='association')
    AND w3 IN (SELECT id FROM vocab WHERE w LIKE '%al')
  GROUP BY w3, w4 ORDER BY freq DESC;

Web1T5-Easy implements a more user-friendly query language:

association ? %al *

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 9 / 64
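
The simple query language maps onto SQL of the kind shown above. A hedged Python sketch that issues the 'association ? %al *' query and maps the numeric IDs back to word forms; it assumes an ID-coded database with a vocab table and an ngrams table whose columns w1..w5 hold vocabulary IDs, which is only an approximation of the real Web1T5-Easy layout.

import sqlite3

def association_al_query(con: sqlite3.Connection, limit: int = 20):
    """Run the query 'association ? %al *' against a hypothetical ID-coded n-gram table."""
    sql = """
        SELECT w3, w4, SUM(f) AS freq FROM ngrams
        WHERE w1 IN (SELECT id FROM vocab WHERE w = 'association')
          AND w3 IN (SELECT id FROM vocab WHERE w LIKE '%al')
        GROUP BY w3, w4 ORDER BY freq DESC LIMIT ?
    """
    rows = con.execute(sql, (limit,)).fetchall()
    lookup = dict(con.execute("SELECT id, w FROM vocab"))  # ID -> word form
    return [(lookup[w3], lookup[w4], freq) for w3, w4, freq in rows]

# Usage (assuming a database file built along the lines of the earlier sketch,
# but with columns w1..w5 for the 5-gram table):
# con = sqlite3.connect("web1t5.sqlite")
# for w3, w4, freq in association_al_query(con):
#     print(w3, w4, freq)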

Page 19: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Web1T5-Easy demo: http://corpora.linguistik.uni-erlangen.de/demos/cgi-bin/Web1T5/Web1T5_freq.perl

Dutch Twitter N-Grams: http://www.let.rug.nl/~gosse/Ngrams/ngrams.html

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 10 / 64

Page 20: An NLP Approach to the Evaluation of Web Corpora

Introduction Google Web 1T 5-Grams

Web1T5-Easy query performance

Web1T5-Easy query                           cold cache   warm cache
corpus linguistics                          0.11 s       0.01 s
web as corpus                               1.29 s       0.44 s
time of *                                   2.71 s       1.09 s
%ly good fun                                181.03 s     24.37 s
[sit,sits,sat,sitting] * ? chair            1.16 s       0.31 s
* linguistics (association ranking)         11.42 s      0.05 s
university of * (association ranking)       1.48 s       0.48 s

(64-bit Linux server with 2.6 GHz AMD Opteron CPUs, 16 GiB RAM and fast local hard disk; based on timing information from the public Web interface.)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 11 / 64

Page 21: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Evaluating the “quality” of Web corpora

Statistical properties
- type-token distributions, n-gram frequencies, other markers
- representativeness (as sample of the Web)
- genre distribution (traditional vs. Web genres)

Corpus comparison
- between Web corpora (→ reliability)
- between Web corpus and reference corpus
- compared to within-corpus variation

Training data for NLP application
- larger amount of training data is often beneficial
- confounding factors (NLP algorithm, training regime, ...)

Linguistic evaluation of Web corpora
- as substitute for / extension of reference corpus
- need linguistic tasks that can be judged quantitatively and that make immediate use of corpus frequency data

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 12 / 64

Page 26: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Linguistic evaluation of Web corpora

1. Frequency comparison
- "good" Web corpora should agree with reference corpus on core phenomena → correlation between frequency counts
- e.g. Basic English vocabulary, compound nouns, ...

2. Identification of multiword expressions (MWE)
- well-known NLP task based on co-occurrence statistics
- some gold standard data sets available
- e.g. "phrasal verbs", lexical collocations, ...

3. Distributional semantic models (DSM)
- hypothesis: semantic similarity ∼ distributional similarity
- distribution quantified by co-occurrences with other words
- DSMs can be evaluated in various shared tasks

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 13 / 64
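
For the third task, a tiny sketch of how distributional similarity can be quantified: the co-occurrence counts with other words form a vector per target word, and similarity between two words is measured, for example, as the cosine of the angle between their vectors. The counts below are invented and serve only to illustrate the computation.

import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse co-occurrence vectors (word -> count)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Invented rows of a term-term co-occurrence matrix, restricted to a few context words.
dog = {"bark": 12, "leash": 7, "walk": 20, "tail": 9}
cat = {"purr": 9, "leash": 1, "walk": 5, "tail": 11}
car = {"drive": 25, "engine": 14, "road": 18}
print(cosine(dog, cat), cosine(dog, car))  # the semantically closer pair scores higher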

Page 27: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Linguistic evaluation of Web corpora

1 Frequency comparisonI “good” Web corpora should agree with reference corpus on

core phenomena Ü correlation between frequency countsI e.g. Basic English vocabulary, compound nouns, . . .

2 Identification of multiword expressions (MWE)I well-know NLP task based on co-occurrence statisticsI some gold standard data sets availableI e.g. “phrasal verbs”, lexical collocations, . . .

3 Distributional semantic models (DSM)I hypothesis: semantic similarity ∼ distributional similarityI distribution quantified by co-occurrences with other wordsI DSMs can be evaluated in various shared tasks

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 13 / 64

Page 28: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Linguistic evaluation of Web corpora

1 Frequency comparisonI “good” Web corpora should agree with reference corpus on

core phenomena Ü correlation between frequency countsI e.g. Basic English vocabulary, compound nouns, . . .

2 Identification of multiword expressions (MWE)I well-know NLP task based on co-occurrence statisticsI some gold standard data sets availableI e.g. “phrasal verbs”, lexical collocations, . . .

3 Distributional semantic models (DSM)I hypothesis: semantic similarity ∼ distributional similarityI distribution quantified by co-occurrences with other wordsI DSMs can be evaluated in various shared tasks

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 13 / 64

Page 29: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Research questions

Are English Web corpora a substitute for the BNC?

What are the differences between Web corpora?

Does size matter more than content?

How useful are n-gram databases?
- esp. detrimental effects of frequency thresholds

How important is (automatic) linguistic annotation?

Do Web corpora offer better coverage?

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 14 / 64

Page 30: An NLP Approach to the Evaluation of Web Corpora

Introduction Evaluation

Corpora in the evaluation

Corpus                                                    Annotation   Size
Ref: British National Corpus (Aston and Burnard 1998)     C&C          0.1 G
English Movie Subtitles (DESC v2)                         C&C          0.1 G
Gigaword newspaper corpus (2nd edition)                                2.0 G
English Wackypedia                                        Malt         1.0 G
Wackypedia subset (WP500)                                 Malt         0.2 G
ukWaC (Baroni et al. 2009)                                Malt         2.0 G
WebBase (Han et al. 2013)                                              3.0 G
UKCOW 2012 (Schäfer and Bildhauer 2012)                                4.0 G
Joint Web corpus                                                       10.0 G
Web 1T 5-Grams (Brants and Franz 2006)                                 1000.0 G
LCC n-gram database                                                    1.0 G
ENCOW 2014                                                Malt         10.0 G
Google Books 2012 EN (Lin et al. 2012)                    Malt         900.0 G
Google Books 2012 GB                                      Malt         100.0 G

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 15 / 64

Page 44: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency Comparison

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 16 / 64

Page 45: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Comparison of frequency counts

Scatterplots of (log) frequencies in BNC vs. other corpora
- Pearson correlation r from regression f_ukWaC ∼ β · f_BNC etc.
- only consider items that occur in both corpora (→ low coverage is not penalized directly)

Test data sets
- Basic English words (lemmatized)
- inflected forms of Basic English words
- binary compound nouns extracted from WordNet 3.0

Morphological query expansion for unannotated n-grams

query                                         f
hear sound                                    36,304
[hear,hears,heard,hearing] [sound,sounds]     95,453

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 17 / 64
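
A minimal Python sketch of this comparison, with invented frequency counts: restrict the comparison to items attested in both corpora and compute the Pearson correlation of the log10 frequencies. (The actual study fits a regression of the form f_Web ∼ β · f_BNC; the sketch below only computes the correlation coefficient.)

import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented per-word frequencies in the reference corpus and in a Web corpus.
f_bnc = {"food": 14360, "house": 49120, "rain": 5216, "kettle": 364, "quiz": 512}
f_web = {"food": 1.2e6, "house": 3.9e6, "rain": 3.1e5, "kettle": 2.1e4}

# Only items occurring in both corpora are compared, so low coverage
# ('quiz' missing from the Web counts here) is not penalized directly.
shared = sorted(f_bnc.keys() & f_web.keys())
x = [math.log10(f_bnc[w]) for w in shared]
y = [math.log10(f_web[w]) for w in shared]
print(f"r = {pearson(x, y):.4f} over {len(shared)} shared items")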

Page 49: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. English Movie Subtitles (DESC v2) (y-axis); r = 0.8384; effective size: 0.04 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 50: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. Gigaword News Corpus (2nd ed.) (y-axis); r = 0.8910; effective size: 1.25 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 51: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. Wackypedia (y-axis); r = 0.8883; effective size: 0.66 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 52: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. Wackypedia subset (WP500) (y-axis); r = 0.8875; effective size: 0.16 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 53: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. ukWaC Web corpus (y-axis); r = 0.9497; effective size: 1.68 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 54: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. WebBase corpus (y-axis); r = 0.9144; effective size: 2.50 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 55: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. UKCOW Web corpus (y-axis); r = 0.9470; effective size: 3.21 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 56: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. Joint Web corpus (10G) (y-axis); r = 0.9328; effective size: 6.65 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 57: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. Google Web 1T 5-Grams (y-axis); r = 0.8885; effective size: 1167.79 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 58: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (lemmatized)

[Scatterplot: British National Corpus (x-axis) vs. LCC N-Grams (y-axis); r = 0.8888; effective size: 2.28 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 18 / 64

Page 59: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (word forms)

[Scatterplot: British National Corpus (x-axis) vs. LCC N-Grams (y-axis); r = 0.9161; effective size: 0.69 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 19 / 64

Page 60: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: Basic English (word forms)

[Scatterplot: British National Corpus (x-axis) vs. Google Web 1T 5-Grams (y-axis); r = 0.9129; effective size: 552.34 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 19 / 64

Page 61: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. English Movie Subtitles (DESC v2) (y-axis); r = 0.2706; effective size: 0.03 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 62: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. Gigaword News Corpus (2nd ed.) (y-axis); r = 0.5394; effective size: 0.85 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 63: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. Wackypedia (y-axis); r = 0.5544; effective size: 0.73 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 64: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. Wackypedia subset (WP500) (y-axis); r = 0.5486; effective size: 0.19 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 65: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. ukWaC Web corpus (y-axis); r = 0.7466; effective size: 1.42 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 66: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. WebBase corpus (y-axis); r = 0.6064; effective size: 1.92 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 67: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. UKCOW Web corpus (y-axis); r = 0.7330; effective size: 2.40 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 68: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. Joint Web corpus (10G) (y-axis); r = 0.6925; effective size: 5.91 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 69: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. Google Web 1T 5-Grams (y-axis); r = 0.6080; effective size: 1610.12 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 70: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Frequency comparison

Frequency comparison: compound nouns (WordNet)

[Scatterplot: British National Corpus (x-axis) vs. LCC N-Grams (y-axis); r = 0.4406; effective size: 1.00 billion words; dashed lines indicate acceptable frequency difference within one order of magnitude]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 20 / 64

Page 71: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

MWE Identification & Collocations

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 21 / 64

Page 72: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Collocations

Collocation: frequent co-occurrence within short span of up to 5 words (Firth 1957; Sinclair 1966, 1991)
- plays important role in lexicography, corpus linguistics, language description, word sense disambiguation, ...
- key feature for MWE identification
- collocation database is also a sparse representation of a distributional semantic model (term-term matrix)

Web1T5 only provides exact co-occurrence frequencies for immediately adjacent bigrams (e.g. * day and day *)

Approximate counts for distance n from the (n+1)-gram table

day ? ? * and * ? ? day

→ quasi-collocations

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 22 / 64
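
A small Python sketch of the quasi-collocation idea under these assumptions: given an (n+1)-gram frequency table, co-occurrence counts of a node word with collocates at distance n are approximated from n-grams that have the node at one end and the collocate at the other (the patterns 'day ? ? *' and '* ? ? day' above, for distance 3). The toy counts are invented.

from collections import Counter

def quasi_collocates(ngram_table, node, distance):
    """Approximate co-occurrence counts of `node` with collocates at the given
    distance from a (distance+1)-gram table: node at the first position and
    collocate at the last, or vice versa."""
    counts = Counter()
    for words, f in ngram_table:
        if len(words) != distance + 1:
            continue
        if words[0] == node:       # right span: day ? ? *
            counts[words[-1]] += f
        if words[-1] == node:      # left span:  * ? ? day
            counts[words[0]] += f
    return counts

toy_4grams = [
    (("day", "after", "the", "storm"), 42),
    (("day", "in", "the", "life"), 350),
    (("every", "single", "working", "day"), 17),
]
print(quasi_collocates(toy_4grams, "day", distance=3))
# Counter({'life': 350, 'storm': 42, 'every': 17})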

Page 75: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Quasi-collocations database

Web1T5-Easy: pre-compiled database of quasi-collocations
I brute-force, multi-pass algorithm
I runtime approx. 3 days on a server with 16 GiB RAM

Flexible collocational span L4, . . . , L1 / R1, . . . , R4
I separate count for each collocate and position
I co-occurrence frequency in a user-defined span and association scores are calculated on the fly (see the sketch below)
I benefits from tight integration of Perl & SQLite

Standard association measures: X², G², t, MI, Dice

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 23 / 64
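A hedged sketch of how such a database can serve user-defined spans on the fly (the real Web1T5-Easy schema and its Perl/SQLite code may differ): one row per (node, collocate, position), summed over the requested span at query time.

import sqlite3

con = sqlite3.connect("quasi_collocations.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS colloc (
        node      TEXT,
        collocate TEXT,
        position  INTEGER,   -- -4 .. -1 = L4 .. L1, 1 .. 4 = R1 .. R4
        freq      INTEGER
    )""")
con.execute("CREATE INDEX IF NOT EXISTS idx_node ON colloc(node, position)")

def span_frequencies(node, left=3, right=3):
    """Co-occurrence frequencies for an L<left>/R<right> span, summed on the fly."""
    rows = con.execute(
        """SELECT collocate, SUM(freq) FROM colloc
           WHERE node = ? AND position BETWEEN ? AND ?
           GROUP BY collocate ORDER BY SUM(freq) DESC""",
        (node, -left, right))
    return dict(rows.fetchall())

# Association scores (see the following slides) can then be computed from these
# span frequencies together with the marginal frequencies of node and collocate.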

Page 76: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Quasi-collocations demo

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 24 / 64

Page 77: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

English verb-particle constructions (VPC) consisting of head verb + one obligatory prepositional particle
I hand in, back off, wake up, set aside, carry on, . . .

Data set of 3,078 candidate VPC types
I extracted from the written part of the BNC with a combination of tagger-, chunker-, and parser-based methods

Manually annotated as compositional / non-compositional
I baseline: 14.3% non-compositional VPC (440 / 3078)
I compositional: carry around, fly away, refer back, . . .
I further distinction of transitive/intransitive VPC not used

Evaluation: candidate ranking based on each corpus
I surface co-occurrence (L0,R3) + POS filter (except Web1T5)
I standard association measures: G², t, MI, Dice, X², f (see the sketch below)
I precision/recall graphs; overall quality: average precision (AP)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 25 / 64
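For reference, a small Python sketch of the standard association measures used for candidate ranking (f, MI, t, Dice, X², G²), computed from the 2×2 contingency table of a (verb, particle) pair. The formulas are the usual textbook definitions and may differ in minor details from the implementation behind the following figures.

import math

def association_scores(o11, f1, f2, n):
    """o11 = pair frequency, f1/f2 = marginal frequencies, n = sample size."""
    e11 = f1 * f2 / n
    obs = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    exp = [e11, f1 - e11, f2 - e11, n - f1 - f2 + e11]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp) if e > 0)
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0 and e > 0)
    return {
        "f":    o11,
        "MI":   math.log2(o11 / e11) if o11 > 0 else float("-inf"),
        "t":    (o11 - e11) / math.sqrt(o11) if o11 > 0 else 0.0,
        "Dice": 2 * o11 / (f1 + f2),
        "X2":   chi2,
        "G2":   g2,
    }

# e.g. association_scores(o11=120, f1=2_000, f2=50_000, n=100_000_000)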


Page 81: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

  verb      particle        X²   TP
  bandy     about      4857.75   –
  chat      away       1105.58   –
  chat      on            7.30   –
  fish      out        1535.60   +
  improve   by          576.06   –
  move      forth       231.54   –
  play      around      394.98   –
  scrape    through     278.53   –
  shuffle   out          85.15   –
  start     over         75.88   +
  transfer  to        13934.37   –
  weld      on           30.83   –

n-best list (n = 12)
P = 5/12 = 41.7%
R = 5/440 = 1.1%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 26 / 64

Page 82: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

  verb     particle        X²↓   TP
  talk     about      950906.9   –
  lean     forward    510113.9   –
  want     to         477739.8   –
  sort     out        406072.7   +
  base     on         398035.3   –
  depend   on         330956.6   –
  sit      down       329143.2   +
  go       to         289818.9   –
  slow     down       282418.3   +
  lag      behind     257224.2   +
  be       by         242827.7   –
  set      aside      242238.1   +

n-best list (n = 12)
P = 5/12 = 41.7%
R = 5/440 = 1.1%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 26 / 64
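A short Python sketch of the evaluation metrics used from here on, assuming the candidates are given as (association score, true positive?) pairs: n-best precision and recall as in the table above, and average precision over the full ranking, with an optional recall cut-off as in the AP50 figure used later for the BBI experiments. The exact tie-breaking and cut-off conventions of the original evaluation are assumptions here.

def n_best_precision_recall(candidates, n):
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    tp_total = sum(1 for _, tp in candidates if tp)
    tp_n = sum(1 for _, tp in ranked[:n] if tp)
    return tp_n / n, tp_n / tp_total            # precision, recall

def average_precision(candidates, max_recall=1.0):
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    tp_total = sum(1 for _, tp in candidates if tp)
    hits, ap_sum = 0, 0.0
    for rank, (_, tp) in enumerate(ranked, start=1):
        if tp:
            hits += 1
            ap_sum += hits / rank               # precision at this recall point
            if hits / tp_total >= max_recall:   # e.g. max_recall=0.5 for AP50
                break
    return ap_sum / hits if hits else 0.0

# For the 12-best list above: precision 5/12 = 41.7%, recall 5/440 = 1.1%.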


Pages 84–90: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

[Precision-recall graph: BNC L0/R3 (English VPC); curves for G², t, MI, Dice, X²]

Average Precision: G²: 31.06%, Dice: 30.13%, X²: 32.12%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 27 / 64

Pages 91–100: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

[Precision-recall graphs (L0/R3 span; curves for G², t, MI, Dice, X², f) for: BNC, DESC, Gigaword, Wackypedia, WP500, ukWaC, WebBase, UKCOW, Joint Web, Web1T5]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 28 / 64

Page 101: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

Why does Web1T5 perform so badly in this task despite its size? Possible explanations include:

Co-occurrence counts are underestimated for larger windows because of the frequency threshold in the n-gram database → quasi-collocations are a poor approximation

No part-of-speech annotation is available to filter candidates

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 29 / 64

Pages 102–108: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on English VPC extraction task (Baldwin 2008)

[Precision-recall graphs (L0/R3 span; curves for G², t, MI, Dice, X², f) for: Web1T5, LCC, LCC (f ≥ 5), LCC (f ≥ 10), LCC-tagged (f ≥ 10), BNC, BNC-raw]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 30 / 64

Page 109: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on BBI collocation identification task (Benson et al. 1986)

Identification of lexical collocations as habitual, recurrent word combinations (Firth 1957)
I essential for advanced learners (→ idiomatic English)
I different from lexicalised MWE
I semi-compositional or fully compositional
I no clear-cut linguistic criteria or tests available

Gold standard: BBI combinatory dict. (Benson et al. 1986)
I BBI was compiled manually based on lexicographer intuitions
I lexical collocations automatically extracted from the scanned BBI, manually checked for 224 selected node words (→ Bartsch224)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 31 / 64


Page 111: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on BBI collocation identification task (Benson et al. 1986)

Candidate extraction and ranking
I for each of the 224 nodes, extract all co-occurrences with the 7,711 words that occur as collocates somewhere in the BBI
I different spans: syntactic, L3/R3, L5/R5, L10/R10, sentence
I full list of candidates ranked according to standard association measures (not per individual node) → precision-recall graphs
I composite: AP50 = average precision up to 50% recall, selecting the best measure for each data set

Dictionary-based evaluation problematic (Evert 2004, 139f)
I provides a lower bound on n-best precision
I coverage of native speaker intuitions by corpus data
I may be biased against recent corpora, Web texts, etc.

Manual validation of ranked collocates for selected nodes
I work in progress, using a custom Web-based annotation tool

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 32 / 64


Page 114: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on BBI collocation identification task (Benson et al. 1986)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 33 / 64

Pages 115–119: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on BBI: collocational span (Benson et al. 1986)

[Precision-recall graphs for the British National Corpus [100M], gold standard: BBI; curves for G², t, MI, MI², X², f; one panel per collocational span]

  collocational span     baseline   coverage
  sentence window           0.38%      98.0%
  L10/R10                   0.48%      97.3%
  L5/R5                     0.61%      96.4%
  L3/R3                     0.79%      95.3%
  syntactic dependency      1.31%      91.7%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 34 / 64

Pages 120–135: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on BBI: different corpora (L3/R3) (Benson et al. 1986)

[Precision-recall graphs, L3/R3 span, gold standard: BBI; curves for G², t, MI, MI², X², f; one panel per corpus]

  corpus                                              baseline   coverage
  British National Corpus [100M]                         0.79%      95.3%
  English Movie Subtitles (DESC v2) [80M]                1.32%      84.3%
  Gigaword News Corpus (2nd ed.)                         0.49%      81.9%
  Wackypedia [1G]                                        0.48%      97.6%
  Wackypedia subset (WP500) [200M]                       0.72%      94.2%
  ukWaC Web corpus [2G]                                  0.38%      99.0%
  WebBase corpus [3G]                                    0.32%      98.9%
  UKCOW Web corpus [4G]                                  0.30%      98.7%
  Joint Web corpus [10G]                                 0.26%      99.3%
  ENCOW Web corpus [10G]                                 0.26%      98.9%
  Google Web 1T 5-Grams (quasi-coll.) [1000G]            0.35%      97.7%
  LCC N-Grams (full coll.) [1G]                          0.45%      98.7%
  LCC N-Grams (quasi-coll., f ≥ 5) [1G]                  1.84%      84.1%
  LCC N-Grams (quasi-coll., f ≥ 10) [1G]                 2.71%      74.3%
  Google Books 2012 BrE (quasi-coll.) [100G]             0.61%      95.3%
  Google Books 2012 BrE (quasi-coll., raw) [100G]        0.74%      95.0%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 35 / 64

Page 136: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Replication on Oxford Collocations Dictionary (2nd ed.) (McIntosh et al. 2009)

Criticism against BBI
I lexicographer intuitions may be incomplete
I outdated (published 1986, native speakers primed in the 1950s)
I biased against recent corpora

Additional gold standard: OCD2 (McIntosh et al. 2009)
I corpus-based + manual validation (1st ed. was based on the BNC)
I up to date, covers a much broader range of collocates

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 36 / 64


Page 138: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Replication on Oxford Collocations Dictionary (2nd ed.) (McIntosh et al. 2009)

Compilation of gold standard
I automatic extraction from the XML version of the dictionary
I collocates explicitly marked → no manual validation necessary
I Bartsch224 nodes accepted as headwords and as collocates (OCD follows Hausmann's (1989) base-collocate distinction)

Candidate extraction and ranking
I same candidate data as for the BBI experiments (lazy researcher)
→ incomplete coverage of the OCD2 gold standard: 4,636 out of 18,515 collocations discarded (= 25.0%)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 37 / 64


Pages 140–144: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on OCD2: collocational span (McIntosh et al. 2009)

[Precision-recall graphs for the British National Corpus [100M], gold standard: OCD2; curves for G², t, MI, MI², X², f; one panel per collocational span]

  collocational span     baseline   coverage
  sentence window           1.88%      99.6%
  L10/R10                   2.36%      99.4%
  L5/R5                     3.02%      99.0%
  L3/R3                     3.91%      98.5%
  syntactic dependency      6.60%      95.8%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 38 / 64

Pages 145–160: An NLP Approach to the Evaluation of Web Corpora

Evaluation results MWE identification

Evaluation on OCD2: different corpora (L3/R3) (McIntosh et al. 2009)

[Precision-recall graphs, L3/R3 span, gold standard: OCD2; curves for G², t, MI, MI², X², f; one panel per corpus]

  corpus                                              baseline   coverage
  British National Corpus [100M]                         3.91%      98.5%
  English Movie Subtitles (DESC v2) [80M]                6.26%      83.0%
  Gigaword News Corpus (2nd ed.)                         2.29%      79.0%
  Wackypedia [1G]                                        2.33%      98.9%
  Wackypedia subset (WP500) [200M]                       3.53%      96.3%
  ukWaC Web corpus [2G]                                  1.82%      99.6%
  WebBase corpus [3G]                                    1.55%      99.6%
  UKCOW Web corpus [4G]                                  1.47%      99.5%
  Joint Web corpus [10G]                                 1.26%      99.7%
  ENCOW Web corpus [10G]                                 1.25%      99.6%
  Google Web 1T 5-Grams (quasi-coll.) [1000G]            1.69%      99.3%
  LCC N-Grams (full coll.) [1G]                          2.17%      99.6%
  LCC N-Grams (quasi-coll., f ≥ 5) [1G]                  9.17%      86.9%
  LCC N-Grams (quasi-coll., f ≥ 10) [1G]                13.39%      76.2%
  Google Books 2012 BrE (quasi-coll.) [100G]             2.99%      97.4%
  Google Books 2012 BrE (quasi-coll., raw) [100G]        3.59%      96.2%

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 39 / 64

Page 161: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Distributional Semantics

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 40 / 64

Page 162: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Distributional semantics

Distributional hypothesis (Harris 1954): the meaning of a word can be inferred from its distribution across contexts

"You shall know a word by the company it keeps!" — (Firth 1957)

Reality check: What is the mystery word?
I He handed her her glass of XXXXX.
I Nigel staggered to his feet, face flushed from too much XXXXX.
I Malbec, one of the lesser-known XXXXX grapes, responds well to Australia's sunshine.
I I dined off bread and cheese and this excellent XXXXX.
I The drinks were delicious: blood-red XXXXX as well as light, sweet Rhenish.

XXXXX = claret
I all examples from the BNC (carefully selected & slightly edited)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 41 / 64


Pages 165–169: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Distributional semantics

A computer can (sometimes) do the same, with sufficient amounts of corpus data and full collocational profiles

             get   see   use   hear   eat   kill
              w1    w2    w3     w4    w5     w6
  knife       51    20    84      0     3      0
  cat         52    58     4      4     6     26
  ???        115    83    10     42    33     17
  boat        59    39    23      4     0      0
  cup         98    14     6      2     1      0
  pig         12    17     3      2     9     27
  banana      11     2     2      0    18      0

sim(???, knife) = 0.770
sim(???, pig) = 0.939
sim(???, cat) = 0.961

??? = dog

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 42 / 64
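A minimal sketch of the similarity computation behind this example: cosine similarity between the row vectors of the term-term matrix. The values printed on the slide (0.770, 0.939, 0.961) were presumably computed from weighted or otherwise transformed counts, so plain cosine on the raw counts below will not reproduce them exactly.

import math

rows = {                  # get  see  use  hear  eat  kill
    "knife":  [ 51,  20,  84,   0,   3,   0],
    "cat":    [ 52,  58,   4,   4,   6,  26],
    "???":    [115,  83,  10,  42,  33,  17],
    "boat":   [ 59,  39,  23,   4,   0,   0],
    "cup":    [ 98,  14,   6,   2,   1,   0],
    "pig":    [ 12,  17,   3,   2,   9,  27],
    "banana": [ 11,   2,   2,   0,  18,   0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for word in ("knife", "pig", "cat"):
    print(word, round(cosine(rows["???"], rows[word]), 3))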

Page 170: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Distributional semantics with Web1T5

Basis of a distributional semantic model (DSM): term-term co-occurrence matrix of collocational profiles
I very sparse: e.g. a 250k × 100k matrix with 24.2 billion cells, but only 245.4 million cells (≈ 1%) have nonzero values

We've already computed collocational profiles
I 32 GiB collocations database = sparse co-occurrence matrix
I export matrix with 25k target words (rows) and 50k high-frequency word forms as features (columns)

DSM implemented in R (with the wordspace package); see the sketch below
I column-compressed sparse matrix
I log G² weights, L2-normalized, angular distance (= cosine), 500 latent dimensions + 50 skipped (randomized SVD)
I parameter settings according to Lapesa and Evert (2014)
I needs 10 GiB RAM and less than an hour

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 43 / 64
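The talk's DSM is built with the R wordspace package; the Python sketch below only illustrates the same pipeline (sparse co-occurrence matrix, association-score weighting, L2 row normalisation, randomized SVD keeping 500 latent dimensions after skipping the first 50, angular distance between row vectors). The log1p weighting is a stand-in for the log G² scores and, like the other details, an assumption for illustration.

import numpy as np
from scipy.sparse import csc_matrix
from sklearn.utils.extmath import randomized_svd

def build_dsm(M, n_dim=500, n_skip=50):
    """M: sparse co-occurrence matrix (targets x features) with raw counts."""
    M = csc_matrix(M, dtype=np.float64)
    M.data = np.log1p(M.data)                    # placeholder for log(G²) weights
    norms = np.sqrt(M.multiply(M).sum(axis=1)).A1
    M = csc_matrix(M.multiply(1.0 / np.maximum(norms, 1e-12)[:, None]))  # L2 rows
    U, S, _ = randomized_svd(M, n_components=n_dim + n_skip, random_state=42)
    return U[:, n_skip:] * S[n_skip:]            # skip the first n_skip dimensions

def angular_distance(emb, i, j):
    u, v = emb[i], emb[j]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))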


Page 173: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

DSM with Web1T5: nearest neighbours

Neighbours of linguistics (cosine angle):

+ sociology (24.6), sociolinguistics (24.6), criminology (29.5),anthropology (30.8), mathematics (31.2), phonetics (33.1),phonology (33.2), philology (33.2), literatures (33.5),gerontology (35.3), proseminar (35.5), geography (35.8),humanities (35.9), archaeology (35.9), science (36.5), . . .

Neighbours of spaniel (cosine angle):

+ terrier (23.0), schnauzer (26.5), pinscher (27.0), weimaraner(28.3), keeshond (29.1), pomeranian (29.4), pekingese (29.6),bichon (30.1), vizsla (30.5), labradoodle (30.6), apso (31.1),spaniels (32.0), frise (32.0), yorkie (32.1), sheepdog (32.3),dachshund (32.4), retriever (32.7), whippet (32.9), havanese(33.1), westie (34.5), mastiff (34.6), dandie (34.7), chihuahua(34.9), dinmont (35.0), elkhound (35.0), . . .

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 44 / 64


Page 175: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

DSM with Web1T5: semantic map (data from ESSLLI 2008 shared task on concrete noun categorization)

[Semantic map (Web1T5): 2-D projection of concrete nouns, grouped into the categories bird, ground animal, fruit/tree, green, tool, and vehicle (e.g. chicken, eagle, dog, elephant, cherry, banana, lettuce, potato, knife, hammer, boat, car, . . . )]

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 45 / 64

Page 176: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Evaluating distributional similarity

Correlation with human similarity ratings
I RG65: Rubenstein and Goodenough (1965)
I WordSim-353: Finkelstein et al. (2002)

Multiple-choice tasks
I TOEFL synonym questions
I SAT analogy questions

Decision tasks for consistent/inconsistent prime
I SPP: Semantic Priming Project (Hutchison et al. 2013)
I GEK: Generalized Event Knowledge (Ferretti et al. 2001; McRae et al. 2005; Hare et al. 2009)

Noun clustering vs. semantic classification
I AP: Almuhareb/Poesio (Almuhareb 2006)
I Battig: Battig/Montague norms (Van Overschelde et al. 2004)
I ESSLLI: basic-level concrete nouns (ESSLLI '08 shared task)

Psycholinguistic norms & experiments
I free association norms
I property norms
I priming effects (∆RT)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 46 / 64


Page 181: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Evaluating distributional similarity

Correlation with human smilarity ratings+ RG65: Rubenstein and Goodenough (1965)+ WordSim-353: Finkelstein et al. (2002)

Multiple-choice tasks+ TOEFL synonym questionsI SAT analogy questions

Decision tasks for consistent/inconsistent prime+ SPP: Semantic Priming Project (Hutchison et al. 2013)+ GEK: Generalized Event Knowledge (Ferretti et al. 2001;

McRae et al. 2005; Hare et al. 2009)Noun clustering vs. semantic classification

+ AP: Almuhareb/Poesio (Almuhareb 2006)+ Battig: Battig/Montague norms (Van Overschelde et al. 2004)+ ESSLLI: basic-level concrete nouns (ESSLLI ’08 shared task)

Psycholinguistic norms & experimentsI free association normsI property normsI priming effects (∆RT)

S. Evert ([email protected]) Evaluation of Web Corpora 17 Feb 2015 47 / 64

Page 182: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Correlation with human similarity ratings

Direct comparison with semantic similarity ratings (WordSim-353, Finkelstein et al. 2002)
- 353 noun-noun pairs with “relatedness” ratings
- rated on a scale of 0–10 by 16 test subjects
- closely related: money/cash, soccer/football, type/kind, . . .
- unrelated: king/cabbage, noon/string, sugar/approach, . . .
- NB: not all “nouns” are nouns in a traditional sense (five, live, eat, stupid, . . . )

Correlation with DSM distances for different corpora
- DSM parameters: term-term matrix, L4/R4 surface window, 30k feature terms, log G2 weighting, cosine similarity, SVD to 500 dimensions / 50 skipped (Lapesa and Evert 2014)
- quantitative measure: Spearman’s rank correlation ρ (robust against non-linearities)

(a minimal sketch of this correlation evaluation is given below)
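The sketch below illustrates only the evaluation step: cosine similarities are turned into angular distances (as plotted on the following slides) and correlated with the human ratings using Spearman’s ρ. The inputs (a word-to-vector mapping and the list of rated pairs) are assumed for illustration; this is not the exact pipeline of Lapesa and Evert (2014).

import numpy as np
from scipy.stats import spearmanr

def angular_distance(u, v):
    # angle between two vectors in degrees, a monotone transform of cosine similarity
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def wordsim_correlation(vectors, pairs):
    # pairs: list of (word1, word2, human_rating) triples, e.g. from WordSim-353
    distances, ratings, not_found = [], [], 0
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            distances.append(angular_distance(vectors[w1], vectors[w2]))
            ratings.append(rating)
        else:
            not_found += 1   # the "pairs not found" reported with each plot
    rho, p = spearmanr(distances, ratings)
    return abs(rho), p, not_found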


Page 184: An NLP Approach to the Evaluation of Web Corpora

Evaluation results Distributional Semantics

Correlation with human relatedness ratings (Finkelstein et al. 2002)

[One scatter plot per corpus: DSM angular distance in degrees (y-axis) against human rating (x-axis) for the WordSim-353 pairs. The correlation statistics reported on the individual plots:]

Corpus                                        |rho|    |r| range        pairs not found
British National Corpus                       0.644    0.500 .. 0.640   1
English Movie Subtitles (DESC v2)             0.615    0.463 .. 0.612   1
Gigaword News Corpus (2nd ed.)                0.667    0.502 .. 0.642   2
Wackypedia                                    0.716    0.551 .. 0.680   0
Wackypedia subset (WP500)                     0.699    0.534 .. 0.667   0
ukWaC Web corpus                              0.698    0.554 .. 0.682   0
WebBase corpus                                0.689    0.540 .. 0.672   2
UKCOW Web corpus                              0.715    0.544 .. 0.675   2
Joint Web corpus (10G)                        0.710    0.568 .. 0.693   0
Google Web 1T 5-Grams (quasi-collocations)    0.677    0.525 .. 0.660   0
LCC N-Grams (full collocations)               0.662    0.509 .. 0.647   5
LCC N-Grams (quasi-collocations, f >= 5)      0.542    0.296 .. 0.474   17
LCC N-Grams (quasi-collocations, f >= 10)     0.486    0.249 .. 0.434   19

(p = 0.0000 in all cases)

Page 197: An NLP Approach to the Evaluation of Web Corpora

Evaluation results What’s wrong with Web1T5?

Why is Web1T5 so terrible despite its size?

Insufficient boilerplate removal & de-duplication:

    from * to *
    from collectibles to cars    9,443,572
    from collectables to cars    8,844,838
    from time to time            5,678,941
    from left to right             793,957
    from start to finish           749,705
    from a to z                    572,917
    from year to year              486,669
    from top to bottom             372,935

“Traditional” Web corpora are much better:

    Google            ≈ 121,000,000 hits
    Google.de         ≈ 119,600,000 hits
    Web 1T 5-Grams       18,288,410 hits
    ukWaC                         3 hits
    BNC                           0 hits

(a minimal sketch of the kind of de-duplication that prevents such inflated counts follows below)
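Counts like these arise when the same boilerplate text is counted once for every page on which it appears. As a rough illustration (assumed inputs and helper names, not the Web1T5 pipeline), the sketch below counts 5-grams while skipping exact duplicate documents; real Web-corpus pipelines additionally apply near-duplicate detection and boilerplate removal before counting.

from collections import Counter
import hashlib

def count_5grams(documents):
    counts = Counter()
    seen = set()
    for doc in documents:
        # skip exact duplicate documents before counting
        fingerprint = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        tokens = doc.split()
        for i in range(len(tokens) - 4):
            counts[tuple(tokens[i:i + 5])] += 1
    return counts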


Page 202: An NLP Approach to the Evaluation of Web Corpora

Evaluation results What’s wrong with Web1T5?

Why is Web1T5 so terrible?

Which words are semantically similar to hot (in a DSM)?
- I hope there are no minors in the room!

big (29.5), butt (31.1), ass (31.1), wet (31.2), naughty (31.6), pussy(31.6), sexy (31.6), chicks (32.0), cock (32.2), ebony (32.3), fat (32.4),girls (32.4), asian (32.7), cum (33.1), babes (33.2), dirty (33.2), bikini(33.3), granny (33.4), teen (33.8), pics (33.8), gras (34.1), fucking(34.1), galleries (34.2), fetish (34.3), babe (34.3), blonde (34.5), pussies(34.5), whores (34.6), fuck (34.6), horny (34.7)

Please don’t ask about cats and dogs . . .
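The list above is simply a k-nearest-neighbour query on the distributional model. A minimal sketch of such a query, assuming a matrix M of L2-normalised row vectors and a parallel vocab list (illustrative only, not the code used for this slide):

import numpy as np

def nearest_neighbours(M, vocab, word, k=30):
    # M: matrix with one L2-normalised vector per row, in the order of vocab
    idx = vocab.index(word)
    sims = M @ M[idx]            # cosine similarity to every word in the vocabulary
    order = np.argsort(-sims)    # most similar first
    return [vocab[i] for i in order if i != idx][:k]

# e.g. nearest_neighbours(M, vocab, "hot")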


Page 203: An NLP Approach to the Evaluation of Web Corpora

Overview

Overview & Discussion


Page 204: An NLP Approach to the Evaluation of Web Corpora

Overview Frequency comparison

Result overview: Frequency comparison

[Plot: Frequency Comparison: Basic English (lemmatized); y-axis: Pearson correlation (0.4–1.0); corpora: DESC, Gigaword, WP500, Wackypedia, ukWaC, WebBase, UKCOW, Joint Web, LCC, Web1T5]

[Plot: Frequency Comparison: Basic English (word forms); same y-axis and corpora]

[Plot: Frequency Comparison: WordNet noun compounds; y-axis: Pearson correlation; same corpora]

[Plot: Frequency Comparison: WordNet noun compounds, coverage analysis; y-axis: coverage (%); corpora: BNC plus the corpora above]

Page 208: An NLP Approach to the Evaluation of Web Corpora

Overview MWE identification

Result overview: MWE identification

[Plot: MWE Extraction: English VPC (L0/R3 span); y-axis: average precision with the best association measure (20–40); corpora: BNC, DESC, Gigaword, WP500, Wackypedia, ukWaC, WebBase, UKCOW, Joint Web, LCC, tagged LCC, Web1T5]

[Plot: MWE Extraction: English VPC (L0/R3 span), POS-tagging and frequency threshold; y-axis: average precision (best AM); corpora: BNC, untagged BNC, LCC, LCC [f >= 5], LCC [f >= 10], tagged LCC, tagged LCC [f >= 5], tagged LCC [f >= 10], Web1T5]

[Plot: MWE Extraction: English VPC (L0/R3 span), coverage analysis; y-axis: coverage (%); same corpora as the previous plot]

(a generic sketch of average precision over a ranked candidate list follows below)
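The precision scores in these plots come from ranking co-occurrence candidates with an association measure and comparing the ranking against a gold standard (here an English VPC data set, on the following slides the BBI and OCD2 collocation dictionaries). Below is a minimal sketch of uninterpolated average precision over such a ranking; the inputs are assumptions for illustration, and the AP 50 figures reported later are a restricted variant of this computation, so the sketch is not the exact evaluation script.

def average_precision(ranked_candidates, gold):
    # ranked_candidates: candidate pairs sorted by association score (best first)
    # gold: set of pairs listed in the gold standard
    hits, ap_sum = 0, 0.0
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in gold:
            hits += 1
            ap_sum += hits / rank    # precision at this recall point
    return ap_sum / hits if hits else 0.0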

Page 211: An NLP Approach to the Evaluation of Web Corpora

Overview MWE identification

Result overview: BBI collocations

[Plot: AP 50 | L3/R3 span | gold: BBI; y-axis: AP 50 (%) (0–50); corpora: BNC, DESC, Gigaword, WP500, Wiki, UKWAC, WEBBASE, UKCOW, JOINT, ENCOW, LCC, LCC5, LCC10, WEB1T5, BooksGB, BooksGBwf]

[Plot: AP 50 | BNC / ENCOW | gold: BBI; AP 50 (%) by collocational span (syntactic, L3/R3, L5/R5, L10/R10, sentence) for BNC vs. ENCOW]

[Plot: AP 50 | syntactic dependency | gold: BBI; AP 50 (%) for BNC, DESC, WP500, Wiki, UKWAC, ENCOW, ENCOWgp, BooksGB, BooksGBwf, BooksEN, BooksENwf; a second version of the plot contrasts syntactic dependency with the L3/R3 span]

[Plot: Coverage of BBI gold standard, coverage analysis; coverage (%) for the corpora of the first plot]

Page 216: An NLP Approach to the Evaluation of Web Corpora

Overview MWE identification

Result overview: OCD2 collocations

[Plot: AP 50 | L3/R3 span | gold: OCD2; y-axis: AP 50 (%) (0–80); corpora: BNC, DESC, Gigaword, WP500, Wiki, UKWAC, WEBBASE, UKCOW, JOINT, ENCOW, LCC, LCC5, LCC10, WEB1T5, BooksGB, BooksGBwf]

[Plot: AP 50 | BNC / ENCOW | gold: OCD2; AP 50 (%) by collocational span (syntactic, L3/R3, L5/R5, L10/R10, sentence) for BNC vs. ENCOW]

[Plot: AP 50 | syntactic dependency | gold: OCD2; AP 50 (%) for BNC, DESC, WP500, Wiki, UKWAC, ENCOW, ENCOWgp, BooksGB, BooksGBwf, BooksEN, BooksENwf; a second version of the plot contrasts syntactic dependency with the L3/R3 span]

[Plot: Coverage of OCD2 gold standard, coverage analysis; coverage (%) for the corpora of the first plot]

Page 221: An NLP Approach to the Evaluation of Web Corpora

Overview Distributional Semantics

Result overview: Distributional semantics

[Plots; corpora in each panel: BNC, DESC, Gigaword, WP500, Wackypedia, ukWaC, WebBase, UKCOW, Joint Web, LCC, LCC [f >= 5], LCC [f >= 10], Web1T5]

[Plot: DSM Evaluation: WordSim-353; y-axis: rank correlation ρ (%) (40–80)]

[Plot: DSM Evaluation: multiple choice (TOEFL synonyms); y-axis: accuracy (50–100); annotated “don’t take this too seriously”]

[Plot: DSM Evaluation: multiple choice (SPP); y-axis: accuracy (70–100)]

[Plot: DSM Evaluation: multiple choice (GEK); y-axis: accuracy (70–100)]

[Plot: DSM Evaluation: noun clustering (AP); y-axis: cluster purity (%) (40–80)]

[Plot: DSM Evaluation: noun clustering (Battig); y-axis: cluster purity (%) (60–100); annotated “don’t take this too seriously”]

Page 227: An NLP Approach to the Evaluation of Web Corpora

Overview Distributional Semantics

Thank You!


Page 228: An NLP Approach to the Evaluation of Web Corpora

References

Almuhareb, Abdulrahman (2006). Attributes in Lexical Acquisition. Ph.D. thesis, University of Essex.

Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.

Baldwin, Timothy (2008). A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 1–2, Marrakech, Morocco.

Banko, Michele and Brill, Eric (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 26–33, Toulouse, France.

Baroni, Marco; Bernardini, Silvia; Ferraresi, Adriano; Zanchetta, Eros (2009). The WaCky Wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.

Bartsch, Sabine and Evert, Stefan (2014). Towards a Firthian notion of collocation. In A. Abel and L. Lemnitzer (eds.), Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern, number 2/2014 in OPAL – Online publizierte Arbeiten zur Linguistik, pages 48–61. Institut für Deutsche Sprache, Mannheim.

Benson, Morton; Benson, Evelyn; Ilson, Robert (1986). The BBI Combinatory Dictionary of English: A Guide to Word Combinations. John Benjamins, Amsterdam, New York.

Biemann, Chris; Bildhauer, Felix; Evert, Stefan; Goldhahn, Dirk; Quasthoff, Uwe; Schäfer, Roland; Simon, Johannes; Swiezinski, Leonard; Zesch, Torsten (2013). Scalable construction of high-quality Web corpora. Journal for Language Technology and Computational Linguistics (JLCL), 28(2), 23–59.

Brants, Thorsten and Franz, Alex (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia, PA. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13.

Church, Kenneth W. and Mercer, Robert L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1–24.

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

Evert, Stefan (2010). Google Web 1T5 n-grams made easy (but not for the computer). In Proceedings of the 6th Web as Corpus Workshop (WAC-6), pages 32–40, Los Angeles, CA.

Ferretti, Todd; McRae, Ken; Hatherell, Ann (2001). Integrating verbs, situation schemas, and thematic role concepts. Journal of Memory and Language, 44(4), 516–547.

Finkelstein, Lev; Gabrilovich, Evgeniy; Matias, Yossi; Rivlin, Ehud; Solan, Zach; Wolfman, Gadi; Ruppin, Eytan (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. In Studies in linguistic analysis, pages 1–32. The Philological Society, Oxford. Reprinted in Palmer (1968), pages 168–205.

Han, Lushan; Kashyap, Abhay L.; Finin, Tim; Mayfield, James; Weese, Johnathan (2013). UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics. Corpus available from http://ebiquity.umbc.edu/resource/html/id/351.

Hare, Mary; Jones, Michael; Thomson, Caroline; Kelly, Sarah; McRae, Ken (2009). Activating event knowledge. Cognition, 111(2), 151–167.

Harris, Zellig (1954). Distributional structure. Word, 10(23), 146–162. Reprinted in Harris (1970, 775–794).

Hausmann, Franz Josef (1989). Le dictionnaire de collocations. In F. J. Hausmann, O. Reichmann, H. E. Wiegand, and L. Zgusta (eds.), Wörterbücher, Dictionaries, Dictionnaires. Ein internationales Handbuch, Handbücher zur Sprach- und Kommunikationswissenschaft, pages 1010–1019. de Gruyter, Berlin, New York.

Hutchison, Keith A.; Balota, David A.; Neely, James H.; Cortese, Michael J.; Cohen-Shikora, Emily R.; Tse, Chi-Shing; Yap, Melvin J.; Bengson, Jesse J.; Niemeyer, Dale; Buchanan, Erin (2013). The semantic priming project. Behavior Research Methods, 45(4), 1099–1114.

Lapesa, Gabriella and Evert, Stefan (2014). A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2, 531–545.

Lin, Yuri; Michel, Jean-Baptiste; Aiden, Erez Lieberman; Orwant, Jon; Brockman, Will; Petrov, Slav (2012). Syntactic annotations for the Google Books ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174, Jeju Island, Korea. Association for Computational Linguistics. Data sets available from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

McIntosh, Colin; Francis, Ben; Poole, Richard (eds.) (2009). Oxford Collocations Dictionary for students of English. Oxford University Press.

McRae, Ken; Hare, Mary; Elman, Jeffrey L.; Ferretti, Todd (2005). A basis for generating expectancies for verbs from nouns. Memory & Cognition, 33(7), 1174–1184.

Rubenstein, Herbert and Goodenough, John B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.

Schäfer, Roland and Bildhauer, Felix (2012). Building large corpora from the web using a new efficient tool chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12), pages 486–493, Istanbul, Turkey. ELRA.

Sinclair, John (1991). Corpus, Concordance, Collocation. Oxford University Press, Oxford.

Sinclair, John McH. (1966). Beginning the study of lexis. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.), In Memory of J. R. Firth, pages 410–430. Longmans, London.

Van Overschelde, James; Rawson, Katherine; Dunlosky, John (2004). Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50, 289–335.

