Googleology is bad science
Adam Kilgarriff, Lexical Computing Ltd; Universities of Sussex, Leeds


1

Googleology is bad science

Adam Kilgarriff, Lexical Computing Ltd, Universities of Sussex, Leeds

2

Web as language resource

'Replaceable' or 'replacable'? Check on the web.

3

Very, very large
Most languages
Most language types
Up-to-date
Free
Instant access

4

How to use the web?

Google or other commercial search engines (CSEs)?

Not.

5

Using CSEs

No setup costs; start querying today

Methods:
hit counts
'snippets'
metasearch engines, WebCorp
find pages and download

6

Googleology

CSE hit counts for language modelling: 36 queries to estimate freq(fulfil, obligation), to each of Google and AltaVista (Keller & Lapata 2003); see the sketch at the end of this slide

Finding noun-noun relations: "we issue exact phrase Google queries of type noun2 THAT * noun1" (Nakov and Hearst 2006)

Small community of researchers; Corpora mailing list
Very interesting work; intense interest in query syntax
Creativity and person-years
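To make the flavour of this hit-count work concrete, here is a minimal sketch of how the exact-phrase query strings might be generated. The inflection and determiner lists are illustrative assumptions, not Keller & Lapata's published 36-query design (this particular combination yields 32 strings).

```python
from itertools import product

# Illustrative variants (assumed, not the published query set) for
# estimating freq(fulfil, obligation) from CSE hit counts.
verb_forms = ["fulfil", "fulfils", "fulfilled", "fulfilling"]
determiners = ["", "the ", "a ", "an "]          # "" = no determiner
noun_forms = ["obligation", "obligations"]

queries = ['"%s %s%s"' % (v, d, n)
           for v, d, n in product(verb_forms, determiners, noun_forms)]

# Each string would be sent to a CSE as an exact-phrase query and the
# reported hit counts summed to approximate the pair's frequency.
for q in queries:
    print(q)
```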

7

The Trouble with Google

not enough instances: max 1,000 hits returned

not enough queries: max 1,000 per day with the API

not enough context: 10-word snippet around the search term

ridiculous sort order: pages with the search term in titles and headings come first

untrustworthy hit counts

limited search syntax: no regular expressions

linguistically dumb:
not lemmatised (aime/aimer/aimes/aimons/aimez/aiment …)
not POS-tagged
not parsed
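For contrast, a hedged sketch of the kind of query a lemmatised, POS-tagged corpus supports and a CSE does not: matching every surface form of a lemma, or a regular expression over tokens. The tiny tagged corpus here is invented purely for illustration.

```python
import re

# Toy corpus of (token, lemma, POS) triples; a real web corpus would be
# lemmatised and POS-tagged during linguistic processing.
corpus = [("je", "je", "PRO"), ("aime", "aimer", "VER"),
          ("les", "le", "DET"), ("corpus", "corpus", "NOM"),
          ("nous", "nous", "PRO"), ("aimons", "aimer", "VER"),
          ("le", "le", "DET"), ("web", "web", "NOM")]

# Every occurrence of the lemma "aimer", whatever the surface form:
# one line here, but not expressible as a single CSE query.
print([tok for tok, lemma, pos in corpus if lemma == "aimer" and pos == "VER"])

# Regular expressions over surface forms are equally straightforward.
pattern = re.compile(r"^aim(e|es|ons|ez|ent|er)$")
print([tok for tok, _, _ in corpus if pattern.match(tok)])
```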

8

Appeal: zero-cost entry, just start googling

Reality: high-quality work needs a high-cost methodology

9

Also:

No replicability
Methods, stats not published
At the mercy of a commercial corporation

10

Also:

No replicability
Methods, stats not published
At the mercy of a commercial corporation
Bad science

11

The 5-grams

A present from Google: all 1-, 2-, 3-, 4- and 5-grams with frequency >= 40 in a terabyte of English

A large dataset
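A hedged sketch of how such n-gram count files are typically consumed, assuming the common one-n-gram-per-line format with a tab-separated count; the file name is purely illustrative.

```python
from collections import defaultdict

def load_ngram_counts(path):
    """Read lines of the form 'w1 w2 ... wn<TAB>count' into a dict."""
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, _, count = line.rstrip("\n").rpartition("\t")
            if ngram and count.isdigit():
                counts[ngram] += int(count)
    return counts

# counts = load_ngram_counts("5gm-0001")   # illustrative file name
# print(counts.get("in the middle of the", 0))
```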

12

Prognosis

Next 3 years:
exciting new ideas
dazzlingly clever uses
drives progress in NLP

13

Prognosis

Next 3 years:
exciting new ideas
dazzlingly clever uses

After 5+ years: a chain round our necks

Cf. the Penn Treebank (others? brickbats?)

Resource-led vs. ideas-led research

14

How to use the web?

Google or other commercial search engines (CSEs)?

Not.

15

Language and the web

The web is mostly linguistic
Text on the web << whole web (in GB)

Not many TB of text
Special hardware not needed

We are the experts

16

Community-building
ACL SIGWAC
WAC Kool Ynitiative (WaCKY)

Mailing list
Open source

WAC workshops:
WAC1, Birmingham, 2005
WAC2, Trento (EACL), April 2006
WAC3, Louvain, 15-16 Sept 2007

17

Proof of concept: DeWaC, ItWaC

1.5 billion words each, German and Italian
Marco Baroni, Bologna (+ AK)

18

What is out there?

What text types? Some are new: chatrooms. In what proportions?

Is it overwhelmed by porn? How much? A hard question

19

What is out there: the web

a social, cultural, political phenomenon
new, little understood
a legitimate object of science
mostly language

we are well placed
a lot of people will be interested

Let's study the web:
as a source of language data
apply our tools for web use (dictionaries, MT)
use the web as infrastructure

20

How to do it: Components

1. Web crawler
2. Filters and classifiers
   de-duplication
3. Linguistic processing
   lemmatise, POS-tag, parse
4. Database
   indexing
   user interface
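A minimal sketch of how these four components chain together; every function name here is a placeholder for the real component named above, not the WaCky toolchain itself.

```python
# Skeleton of a web-as-corpus build pipeline (placeholders only).

def crawl(seed_urls):
    """1. Web crawler: fetch pages starting from seed URLs."""
    raise NotImplementedError

def filter_and_classify(pages):
    """2. Filters and classifiers: strip non-text, filter porn, de-duplicate."""
    raise NotImplementedError

def linguistic_processing(documents):
    """3. Lemmatise, POS-tag and parse each document."""
    raise NotImplementedError

def index(annotated_documents):
    """4. Load into an indexed database with a user interface."""
    raise NotImplementedError

def build_corpus(seed_urls):
    pages = crawl(seed_urls)
    documents = filter_and_classify(pages)
    annotated = linguistic_processing(documents)
    index(annotated)
```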

21

1. Crawling

How big is your hard disk? When will your sysadmin ban you?

DeWaC/ItWaC: open-source crawler, Heritrix

22

1.1 Seeding the crawl

Mid-frequency words
Spread of text types: formal and informal, not just newspaper

DeWaC:
words from a newspaper corpus
words from a list with "kitchen" vocabulary

Use Google to get seeds for crawls (sketched below)
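A minimal sketch of the seeding step, under the assumption that random tuples of mid-frequency words are sent to a search engine and the returned URLs become crawl seeds; the function and parameter names are invented for illustration.

```python
import random

def seed_queries(midfreq_words, n_queries=1000, words_per_query=3, seed=0):
    """Build random word-tuple queries whose result URLs will seed the crawl.

    `midfreq_words` should mix newspaper vocabulary with informal
    ("kitchen") vocabulary so the seeds cover a spread of text types.
    """
    rng = random.Random(seed)
    return [" ".join(rng.sample(midfreq_words, words_per_query))
            for _ in range(n_queries)]

# queries = seed_queries(newspaper_words + kitchen_words)
# For each query, the URLs returned by whatever search API is available
# would be collected, de-duplicated, and handed to the crawler as seeds.
```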

23

2. Filtering

Non-'running-text' stripping
Function word filtering
Porn filtering
De-duplication

24

2.1 Filtering: Sentences

What is the text that we want? Lists? Links? Catalogues? …

For linguistics and NLP: text in sentences

Use function words (a sketch follows below)
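A hedged sketch of a function-word filter: keep a page only if common function words account for a reasonable share of its tokens, a cheap proxy for connected, sentential text. The word list and threshold are illustrative choices, not the DeWaC/ItWaC settings.

```python
# Illustrative English function-word list; each language needs its own.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "on", "with", "as", "was", "at"}

def looks_like_running_text(text, min_tokens=30, threshold=0.15):
    """Return True if the page looks like connected prose."""
    tokens = text.lower().split()
    if len(tokens) < min_tokens:          # too short to judge
        return False
    hits = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return hits / len(tokens) >= threshold
```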

25

2.2 Filtering: CLEANEVAL “Text cleaning”

Lots to be done, not glamorous
Many kinds of dirt needing many kinds of filter

Open competition/shared task: who can produce the cleanest text?!
Input: arbitrary web pages
Gold standard: paragraph-marked plain text, prepared by people

Workshop, Sept 2007. Do join us! http://cleaneval.sigwac.org.uk
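As a taste of the task, a deliberately naive cleaner that drops scripts and styles and emits paragraph-marked plain text; real CLEANEVAL entries are far more sophisticated, and the output convention here is only an approximation of the gold-standard format.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Very rough cleaner: skip <script>/<style>, break on block tags."""
    SKIP = {"script", "style"}
    BREAKS = {"p", "div", "br", "li", "h1", "h2", "h3", "td"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAKS:
            self.chunks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks[-1].append(data)

    def text(self):
        paras = (" ".join(c).split() for c in self.chunks)
        return "\n".join("<p> " + " ".join(p) for p in paras if p)

# cleaner = ParagraphExtractor(); cleaner.feed(raw_html); print(cleaner.text())
```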

26

3. Linguistic processing

Lemmatise, POS-tag, parse
Find the leading NLP group for each language
Be nice to them
Use their tools

27

Database, interface

A solved problem (at least for 1.5 billion words): the Sketch Engine

28

“Despite all the disadvantages, it’s still so much bigger”

29

How much bigger?

Method: sample words

30 words: mid-to-high frequency; not common words in other major languages; min 5 characters

Compare frequencies, Google vs ItWaC/DeWaC
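A minimal sketch of how such a sample might be drawn, assuming a frequency-ranked word list per language; the rank window and all thresholds are illustrative, not the actual selection criteria beyond those stated on the slide.

```python
def sample_words(freq_ranked_words, other_lang_common_words,
                 n=30, min_rank=1000, max_rank=5000, min_len=5):
    """Pick mid-to-high-frequency words of >= 5 characters that are not
    common words in other major languages (rank window is illustrative)."""
    chosen = []
    for rank, word in enumerate(freq_ranked_words, start=1):
        if rank < min_rank:
            continue
        if rank > max_rank or len(chosen) == n:
            break
        if len(word) >= min_len and word not in other_lang_common_words:
            chosen.append(word)
    return chosen

# words = sample_words(italian_freq_list, other_major_language_words)
# For each word, compare its Google count with its ItWaC/DeWaC count.
```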

30

Google results (Italian): arbitrariness

Repeat identical searches:
9/30: > 10% difference
6/30: > 100% difference

API: typically 1/18th of the 'manual' figure

Language filter:
mista, bomba, clima: mostly non-Italian pages
use MAX and MIN of 6 language-filtered results

31

clima = Computational Logic in Multi-Agent Systems; Centre for Legumes in Mediterranean Agriculture (the 5-character limit is too short)

32

Ratios, Google:DeWaC

WORD        MAX    MIN    RAW     CLEAN
---------------------------------------
besuchte    10.5   3.8    81840   18228
stirn       3.38   0.62   32320   11137
gerufen     7.14   3.72   66720   27187
verringert  6.86   3.46   52160   15987
bislang     24.4   11.6   239000  90098
brach       4.36   2.26   44520   19824
---------------------------------------

MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe

33

ItWaC:Google ratio, best estimate

For each of 30 words:
calculate the ratio MAX:RAW
calculate the ratio MIN:RAW

Take the mid-point and average over the 30 words: 1:33, or 3%

Calculate RAW:vert (vert = the final cleaned corpus)
Average = 4.4; halve it (for conservativeness/uncertainty) = 2.2

3% x 2.2 = 6.6%

ItWaC:Google = 6.6%

34

Italian web size

ItWaC = 1.67 billion words
Google indexes 1.67/0.066 = 25 billion words of sentential, non-duplicate Italian
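The arithmetic behind these two slides, spelled out with the numbers exactly as reported (nothing is recomputed from raw data here):

```python
# ItWaC:Google ratio and implied size of the Italian indexed web.
itwac_vs_google_counts = 0.03     # "1:33, or 3%": ItWaC vs raw Google counts
raw_to_vert = 4.4                 # average RAW:vert shrinkage when text is cleaned
conservative_shrinkage = raw_to_vert / 2              # halved for caution: 2.2

# If Google's own counts would shrink similarly once cleaned and
# de-duplicated, ItWaC's share of sentential, non-duplicate Italian is:
itwac_share = itwac_vs_google_counts * conservative_shrinkage   # 0.066

itwac_words = 1.67e9
print("ItWaC:Google = %.1f%%" % (100 * itwac_share))                   # 6.6%
print("Google-indexed Italian = %.0f bn words" % (itwac_words / itwac_share / 1e9))  # 25 bn
```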

35

German web size

Analysis as for Italian
DeWaC: 3% of Google
DeWaC = 1.41 billion words
Google indexes 1.41/0.03 = 44 billion words of sentential, non-duplicate German

36

Effort

ItWaC, DeWaC: less than 6 person-months, including developing the method

(EnWaC: in progress)

37

Plan

ACL adopts it (like the ACL Anthology) (LDC?)
Say: 3 core staff, 3 years
Goals could be:

English: 2% of Google scale (still the biggest part)
6 other major languages: 30% of Google scale
30 other languages: 10% of Google scale

Online for:
searching, as in the Sketch Engine
specifying and downloading subcorpora for intensive NLP: "corpora on demand"

Don’t quote me

38

Logjams

Cleaning: see CLEANEVAL

Text type: "what kind of page is it?"
Critical but under-researched
WebDoc proposal

(with Serge Sharoff, Tony Hartley) (a different talk)

39

Moral

Google, CSEs are wonderful: start today
but not bad science

Good science, reliable counts:
we (the NLP community) have the skills
with collective effort and a mid-sized project, Google scale is achievable

40

Thank you

http://www.sketchengine.co.uk

41

Scale and speed: CSEs

Commercial search engines:

banks of computers
highly optimised code

but this is for performance:
no downtime
instant responses to millions of queries

This proposal:
crawling once a year
downtime acceptable
not so many users

42

…but it's not representative

The web is not representative, but nor is anything else

Text type variation:
under-researched, lacking in theory
Atkins, Clear & Ostler 1993 on the design brief for the BNC;
Biber 1988; Baayen 2001; Kilgarriff 2001

Text type is an issue across NLP

Web: the issue is acute because, as against the BNC or WSJ, we simply don't know what is there

43

Oxford English Corpus

Method as above
Whole domains chosen and harvested: control over text type

1 billion words
Public launch April 2006
Loaded into the Sketch Engine

44

Oxford English Corpus

45

Oxford English Corpus

46

Examples

DeWaC, ItWaC: Baroni and Kilgarriff, EACL 2006

Serge Sharoff, Leeds University: UK English, Chinese, Russian, English, French, Spanish, all searchable online

Oxford English Corpus

47

Options for academics

Give up:
niche markets, obscure languages
leave the mainstream to the big guys

Or work out how to work at that scale:
the web is free, so data availability is not a problem