1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds.

Date post: 30-Dec-2015
1

Googleology is bad science

Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds

2

Web as language resource

Replaceable or replacable? check

3

Very very large
Most languages
Most language types
Up-to-date
Free
Instant access

4

How to use the web?

Google or other commercial search engines (CSEs)

not

5

Using CSEs

No setup costs
Start querying today

Methods:
Hit counts
‘Snippets’
Metasearch engines, WebCorp
Find pages and download

6

Googleology

CSE hit counts for language modelling
36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)

Finding noun-noun relations
“we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)

Small community of researchers
Corpora mailing list
Very interesting work
Intense interest in query syntax
Creativity and person-years

7

The Trouble with Google

Not enough instances: max 1000
Not enough queries: max 1000 per day with API
Not enough context: 10-word snippet around search term
Ridiculous sort order: search term in titles and headings
Untrustworthy hit counts
Limited search syntax: no regular expressions

Linguistically dumb:
not lemmatised (aime/aimer/aimes/aimons/aimez/aiment …)
not POS-tagged
not parsed
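With text held locally, the "no regular expressions" and "not lemmatised" limits disappear: one pattern can cover a whole set of inflected forms. A minimal sketch, assuming only the six forms of French "aimer" named above and a toy corpus invented for illustration:

```python
import re
from collections import Counter

# The six forms named on the slide (the slide's list is open-ended).
FORMS = ["aime", "aimer", "aimes", "aimons", "aimez", "aiment"]

# One pattern for all listed forms, anchored on word boundaries.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(FORMS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def count_forms(sentences):
    """Count each listed form across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts

# Toy corpus, invented for illustration only.
corpus = [
    "Nous aimons le café.",
    "Ils aiment lire et elle aime écrire.",
    "Vous aimez aimer.",
]
counts = count_forms(corpus)
lemma_total = sum(counts.values())  # frequency of the lemma, summed over forms
```

A search engine would need one query per form, each with its own unreliable hit count; here the per-form and per-lemma counts come from a single pass.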

8

Appeal: zero-cost entry, just start googling

Reality: high-quality work requires high-cost methodology

9

Also:

No replicability
Methods, stats not published
At mercy of commercial corporation

10

Also:

No replicability
Methods, stats not published
At mercy of commercial corporation
Bad science

11

The 5-grams

A present from Google
All 1-, 2-, 3-, 4-, 5-grams with freq >= 40 in a terabyte of English
A large dataset
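To make the frequency cut-off concrete, here is a minimal sketch of counting 1- to 5-grams and keeping those seen at least 40 times. It is an illustration only, not how Google built the dataset; the function names and the toy token stream are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def frequent_ngrams(tokens, min_freq=40, max_n=5):
    """Keep only n-grams whose frequency is at least min_freq."""
    all_counts = ngram_counts(tokens, max_n)
    return {gram: c for gram, c in all_counts.items() if c >= min_freq}

# Toy token stream, invented for illustration: four words repeated 40 times.
tokens = ("the web is big " * 40).split()
kept = frequent_ngrams(tokens, min_freq=40)
```

In the toy stream, n-grams that straddle a repetition boundary occur only 39 times and fall below the threshold, which shows how a hard cut-off silently drops near-threshold items.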

12

Prognosis

Next 3 years
Exciting new ideas
Dazzlingly clever uses
Drives progress in NLP

13

Prognosis

Next 3 years
Exciting new ideas
Dazzlingly clever uses

After 5+ years
A chain round our necks
Cf Penn Treebank (others? Brickbats?)
Resource-led vs. ideas-led research

14

How to use the web?

Google or other commercial search engines (CSEs)

not

15

Language and the web

Web is mostly linguistic
Text on web << whole web (in GB)
Not many TB of text
Special hardware not needed
We are the experts

16

Community-building
ACL SIGWAC
WAC Kool Ynitiative (WaCKY)
Mailing list
Open source

WAC workshops
WAC1, Birmingham 2005
WAC2, Trento (EACL), April 2006
WAC3, Louvain, Sept 15-16 2007

17

Proof of concept: DeWaC, ItWaC

1.5 B words each, German and Italian
Marco Baroni, Bologna (+ AK)

18

What is out there?

What text types?
Some are new: chatroom
Proportions: is it overwhelmed by porn? How much?
Hard question

19

What is out there

The web
a social, cultural, political phenomenon
new, little understood
a legitimate object of science
mostly language

We are well placed
A lot of people will be interested

Let’s study the web
source of language data
apply our tools for web use (dictionaries, MT)
use the web as infrastructure

20

How to do it: Components

1. Web crawler
2. Filters and classifiers
   • De-duplication
3. Linguistic processing
   • Lemmatise, POS-tag, parse
4. Database
   • Indexing
   • User interface

21

1. Crawling

How big is your hard disk?
When will your sysadmin ban you?

DeWaC/ItWaC
Open source crawler: Heritrix

22

1.1 Seeding the crawl

Mid-frequency words
Spread of text types
Formal and informal, not just newspaper

DeWaC
Words from newspaper corpus
Words from list with “kitchen” vocab

Use Google to get seeds for crawls
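The seeding step can be sketched as: pick words from a mid-frequency band of a word list, then combine them into random queries whose result URLs become crawl seeds. The sketch below stops at building the queries. The rank band, the two-word query size, and all function names are assumptions for illustration, not the settings used for DeWaC or ItWaC.

```python
import random

def mid_frequency_words(freq_list, low_rank, high_rank):
    """Return words whose frequency rank lies in [low_rank, high_rank)."""
    ranked = sorted(freq_list, key=lambda pair: pair[1], reverse=True)
    return [word for word, _count in ranked[low_rank:high_rank]]

def seed_queries(words, n_queries, words_per_query=2, seed=0):
    """Build random multi-word queries; each query uses distinct words."""
    rng = random.Random(seed)  # fixed seed so the same queries can be regenerated
    return [
        " ".join(rng.sample(words, words_per_query))
        for _ in range(n_queries)
    ]

# Toy frequency list, invented for illustration: (word, count) pairs.
toy_freqs = [(f"w{i}", 10_000 - i) for i in range(100)]
mid_band = mid_frequency_words(toy_freqs, low_rank=20, high_rank=60)
queries = seed_queries(mid_band, n_queries=5)
```

Running this once over a newspaper-derived list and once over a "kitchen" vocabulary list would give two query sets, which is one way to push the seeds towards both formal and informal text.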

23

2. Filtering

Non-‘running-text’ stripping
Function word filtering
Porn filtering
De-duplication
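The slide does not say how de-duplication was done. One common approach, shown here purely as an assumed illustration, compares documents by their overlapping word n-grams ("shingles") and drops any document too similar to one already kept. The shingle size and threshold are arbitrary choices.

```python
def shingles(text, n=5):
    """Return the set of word n-grams in a text (whole text if shorter than n)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.8, n=5):
    """Keep the first document of each near-duplicate group.

    Pairwise comparison is quadratic, so this only suits small samples.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return kept

# Toy documents, invented for illustration.
docs = [
    "a b c d e f g h",
    "a b c d e f g h",
    "completely different text about crawling the web today",
]
unique_docs = deduplicate(docs)
```

At the billion-word scale a pairwise loop like this is not feasible; hashing or sampling the shingles would be needed, which this sketch leaves out.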

24

2.1 Filtering: Sentences

What is the text that we want?
Lists? Links? Catalogues? …
For linguistics, NLP: in sentences
Use function words
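The idea behind "use function words" is that running text in sentences is dense in words like "the" and "of", while menus, link lists and catalogues are not. A minimal sketch for English, assuming a small hand-picked function-word list and arbitrary thresholds; none of these values come from the talk.

```python
# Small illustrative list; a real filter would use a fuller, per-language list.
FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "it", "for",
    "was", "on", "with", "as", "be", "by", "at", "this", "are", "or",
}

def function_word_ratio(text):
    """Share of tokens that are function words (0.0 for empty text)."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)

def looks_like_running_text(text, min_ratio=0.25, min_tokens=10):
    """Accept a page only if it is long enough and dense in function words."""
    if len(text.split()) < min_tokens:
        return False
    return function_word_ratio(text) >= min_ratio

# Toy inputs, invented for illustration.
prose = ("The crawler starts with a list of seeds and it follows "
         "the links that are found on each page.")
menu = ("Home Products Contact Login Cart Shoes Boots Sandals "
        "Hats Gloves Scarves Socks")
```

Here `looks_like_running_text(prose)` is true and `looks_like_running_text(menu)` is false.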

25

2.2 Filtering: CLEANEVAL

“Text cleaning”
Lots to be done, not glamorous
Many kinds of dirt needing many kinds of filter

Open competition/shared task
Who can produce the cleanest text?!
Input: arbitrary web pages
“Gold standard”: paragraph-marked plain text, prepared by people

Workshop Sept 2007. Do join us!
http://cleaneval.sigwac.org.uk

26

3. Linguistic processing

Lemmatise, POS-tag, parse
Find leading NLP group for each language
Be nice to them
Use their tools

27

Database, interface

Solved problem (at least for 1.5 BW)
Sketch Engine

28

“Despite all the disadvantages, it’s still so much bigger”

29

How much bigger?

Method
Sample words: 30
Mid-to-high freq
Not common words in other major lgs
Min 5 chars
Compare freqs, Google vs ItWaC/DeWaC

30

Google results (Italian)

Arbitrariness
Repeat identical searches
9/30: > 10% difference
6/30: > 100% difference
API: typically 1/18th ‘manual’ figure

Language filter
mista, bomba, clima: mostly non-Italian pages
Use MAX and MIN of 6 lg-filtered results

31

Clima =
Computational logic in multi-agent systems
Centre for Legumes in Mediterranean Agriculture
(5-char limit too short)

32

Ratios, Google:DeWaC

WORD          MAX    MIN      RAW    CLEAN
------------------------------------------
besuchte     10.5    3.8    81840    18228
stirn        3.38   0.62    32320    11137
gerufen      7.14   3.72    66720    27187
verringert   6.86   3.46    52160    15987
bislang      24.4   11.6   239000    90098
brach        4.36   2.26    44520    19824
------------------------------------------

MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe
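The per-word ratios behind this table can be recomputed directly. A minimal sketch, assuming the MAX and MIN columns are in millions of pages as the legend states; the function and key names are illustrative.

```python
# (word, Google MAX in millions, Google MIN in millions, DeWaC RAW, DeWaC CLEAN)
ROWS = [
    ("besuchte", 10.5, 3.8, 81840, 18228),
    ("stirn", 3.38, 0.62, 32320, 11137),
    ("gerufen", 7.14, 3.72, 66720, 27187),
    ("verringert", 6.86, 3.46, 52160, 15987),
    ("bislang", 24.4, 11.6, 239000, 90098),
    ("brach", 4.36, 2.26, 44520, 19824),
]

def ratios(row):
    """Google:DeWaC ratios for one word, plus the raw:clean shrinkage."""
    word, g_max, g_min, raw, clean = row
    max_raw = g_max * 1_000_000 / raw  # Google MAX count per raw DeWaC document
    min_raw = g_min * 1_000_000 / raw  # Google MIN count per raw DeWaC document
    return {
        "word": word,
        "max_raw": max_raw,
        "min_raw": min_raw,
        "midpoint": (max_raw + min_raw) / 2,
        "raw_clean": raw / clean,
    }

table = [ratios(row) for row in ROWS]
for r in table:
    print(f'{r["word"]:<12}{r["max_raw"]:8.1f}{r["min_raw"]:8.1f}'
          f'{r["midpoint"]:8.1f}{r["raw_clean"]:6.2f}')
```

The spread between `max_raw` and `min_raw` for a single word is a direct measure of how far the Google counts disagree with themselves.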

33

ItWaC:Google ratio, best estimate

For each of 30 words
Calculate ratio, max:raw
Calculate ratio, min:raw
Take mid-point and average: 1:33 or 3%

Calculate raw:vert
Average = 4.4
Half (for conservativeness/uncertainty) = 2.2

3% x 2.2 = 6.6%
ItWaC:Google = 6.6%
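The arithmetic above, restated as a small function. The parameter names are illustrative, and the halving step is kept as an explicit factor so it is visible how much the final figure depends on it.

```python
def corpus_share_of_google(midpoint_share, raw_to_final, caution=0.5):
    """Scale the raw-count share by a cautious fraction of the raw:final ratio.

    midpoint_share: averaged mid-point of the max:raw and min:raw ratios,
                    expressed as corpus/Google (0.03 on the slide).
    raw_to_final:   average raw:vert ratio (4.4 on the slide).
    caution:        fraction of that ratio actually applied (one half on the slide).
    """
    return midpoint_share * raw_to_final * caution

itwac_share = corpus_share_of_google(0.03, 4.4)
print(f"ItWaC:Google = {itwac_share:.1%}")
```

With `caution=1.0` the same inputs would give 13.2%, so the conservative halving doubles the implied size of the Italian web.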

34

Italian web size

ItWaC = 1.67b words
Google indexes 1.67/.066 = 25 bn words of sentential non-dupe Italian

35

German web size

Analysis as for Italian
DeWaC: 3% of Google
DeWaC = 1.41b words
Google indexes 1.41/.03 = 44 bn words of sentential non-dupe German
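The web-size estimates are a single division: corpus size over the corpus's estimated share of Google's index. A minimal sketch using the figures quoted on the slides; the function name is illustrative.

```python
def indexed_words_bn(corpus_words_bn, share_of_google):
    """Estimated words (in billions) indexed by Google for one language."""
    return corpus_words_bn / share_of_google

italian_bn = indexed_words_bn(1.67, 0.066)  # ItWaC size and share from the slides
german_bn = indexed_words_bn(1.41, 0.03)    # DeWaC size and rounded 3% share

print(f"Italian: about {italian_bn:.0f} bn words")
print(f"German:  about {german_bn:.0f} bn words")
```

With the rounded 3% share, the German division comes out at about 47 bn rather than the 44 bn on the slide. The slide's figure presumably rests on an unrounded share slightly above 3%; that is an inference, not something the slides state.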

36

Effort

ItWaC, DeWaC
Less than 6 person months
Developing the method
(EnWaC: in progress)

37

Plan
ACL adopts it (like ACL Anthology) (LDC?)
Say: 3 core staff, 3 years

Goals could be:
English: 2% G-scale (still biggest part)
6 other major languages: 30% G-scale
30 other languages: 10% G-scale

Online for
Searching as in SkE
Specifying, downloading subcorpora for intensive NLP: “corpora on demand”

Don’t quote me

38

Logjams

Cleaning
See CLEANEVAL

Text type
“What kind of page is it?”
Critical but under-researched
WebDoc proposal (with Serge Sharoff, Tony Hartley) (a different talk)

39

Moral

Google, CSEs are wonderful
Start today
but bad science

Not
Good science, reliable counts
We (the NLP community) have the skills
With collective effort, mid-sized project
Google-scale is achievable

40

Thank you

http://www.sketchengine.co.uk

41

Scale and speed, LSE

Commercial search engines
banks of computers
highly optimised code
but this is for performance
no downtime
instant responses to millions of queries

This proposal
crawling: once a year
downtime: acceptable
not so many users

42

…but it’s not representative

The web is not representative
but nor is anything else

Text type variation
under-researched, lacking in theory
Atkins Clear Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001

Text type is an issue across NLP
Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

43

Oxford English Corpus
Method as above
Whole domains chosen and harvested
control over text type
1 billion words
Public launch April 2006
Loaded into Sketch Engine

44

Oxford English Corpus

45

Oxford English Corpus

46

Examples

DeWaC, ItWaC
Baroni and Kilgarriff, EACL 2006

Serge Sharoff, Leeds Univ UK
English Chinese Russian English French Spanish, all searchable online

Oxford English Corpus

47

Options for academics

Give up
Niche markets, obscure languages
Leave the mainstream to the big guys

Work out how to work on that scale
Web is free, data availability not a problem

