Token-level noise in large web corpora

and non-destructive normalization for linguistic applications

Felix Bildhauer and Roland Schäfer
SFB 632/A2, German Grammar and Linguistics (FU Berlin)

CANS, Lancaster, July 22, 2013

Token-level noise and non-destructive normalization

COW project: http://hpsg.fu-berlin.de/cow/

texrex (current version: texrex-hyperhyper): http://sourceforge.net/projects/texrex/

Our brand new book on web corpora: http://sites.morganclaypool.com/wcc/

http://www.morganclaypool.com/toc/hlt/2/1

Felix Bildhauer, Roland Schäfer 2014, SFB 632/A2, German Grammar and Linguistics (FU Berlin) 1/45

Overview

Introduction: Noise in web corpora

HyDRA – Hyphenation removal

Spellingbee – Spelling correction

Non-destructive normalization

Introduction: Noise in web corpora

Dimensions of noisiness

N tokens: 9,108,097,177
N types: 63,569,767
N hapax legomena: 39,988,127

Proliferation of types: type and token counts for the German web corpus DECOW2012 (Schäfer and Bildhauer [2012, 2013])

similar results in Liu and Curran [2006]
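
To put the three counts in proportion, the ratios can be computed directly. The following is a small illustrative sketch; the variable names are ours and do not come from the talk.

```python
# Counts as given for DECOW2012 on the slide.
n_tokens = 9_108_097_177
n_types = 63_569_767
n_hapax = 39_988_127

# Share of types that occur exactly once (hapax legomena).
hapax_share = n_hapax / n_types
# Average number of tokens per type.
tokens_per_type = n_tokens / n_types

print(f"hapax share of types: {hapax_share:.1%}")
print(f"tokens per type: {tokens_per_type:.1f}")
```

With these counts, just under 63% of all types are hapax legomena, which is why the classification of hapaxes matters for estimating noise.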

Noise in web corpora

Source                       %      95% CI (±%)
misspelling                  20.0   5.0
tokenization error           17.6   4.7
non-word                      7.6   3.3
foreign-language material     6.8   3.1
rare word                    46.8   6.2
number                        1.2   1.3

Classification of hapax legomena in DECOW2012; estimated proportions of different categories (n = 250), with 95% confidence interval (CI)

Schäfer and Bildhauer [2013]
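
The slide does not say how the confidence intervals were computed. As an assumption on our part, the normal-approximation (Wald) interval for a proportion reproduces the tabulated half-widths for n = 250. A minimal sketch, with a hypothetical helper name:

```python
import math

def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Proportions from the table (sample of n = 250 hapax legomena).
shares = {
    "misspelling": 0.200,
    "tokenization error": 0.176,
    "non-word": 0.076,
    "foreign-language material": 0.068,
    "rare word": 0.468,
    "number": 0.012,
}

for label, p in shares.items():
    half_width = wald_half_width(p, 250)
    print(f"{label:27s} {p:6.1%}  ± {100 * half_width:.1f}")
```

For very small proportions such as the "number" class, the Wald interval is known to be unreliable, so that row should be read with some caution.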

Classification of errors in POS tagging

The background of the work presented here:

§ improve linguistic post-processing (considerably lower quality in web corpora, Giesbrecht and Evert, 2009)

§ allow users to also retrieve misspelled words etc.

Class                        %
non-standard orthography     32.3
lexicon gaps                 19.8
foreign language material    18.9
emoticons                    13.7
named entities                5.4
tokenization errors           3.1
other                         6.8

Sample drawn from a sub-corpus of DECOW2012 containing predominantly informal language.

Breakdown of non-standard orthography (DECOW2012)

Class                       %
genre-specific spellings    59.2
omitted whitespace          13.4
variants                    19.7
ordinary typos               7.56

HyDRA – Hyphenation removal

Hyphenated words in web corpora

§ sources: pasted material from word processors, etc.

§ disadvantage: no line endings as an additional hint; Grefenstette and Tapanainen [1994] too naïve

§ little discussion available, e.g., Zamorano et al. [2011]

Examples of type I: Merge

§ Seiten- streifen → Seitenstreifen (hard shoulder)

§ an- wählen → anwählen (select/dial)

§ E- missionen → Emissionen (emissions)

§ Physio- kratie → Physiokratie (Physiocracy)

All examples are from DECOW2012 (German) or UKCOW2012 (English)

Examples of type II: Concatenate

§ Philipps- Lagerverkauf → Philipps-Lagerverkauf (Philips stock sale)

§ U- Bootalarm → U-Bootalarm (submarine alert)

§ 5- Alpha-Reduktase-Hemmer → 5-Alpha-Reduktase-Hemmer (5-alpha-reductase inhibitor)

§ 18- karätigem → 18-karätigem (18-carat)

Examples of type III: Leavealone

§ deutsch- u. bald auch der englischsprachigen Blogosphäre
  the German- and soon also the English-speaking blogosphere

§ Film- und Entertainment-Gesellschaft
  movie(-) and entertainment society

§ die Innen- gegenüber der Außenentwicklung
  the domestic(-) versus the foreign development

§ weder ein TV- noch ein Radiosender
  neither a TV(-) nor a radio station

§ jeder Sport- insbesondere Volleyballbegeisterte
  each sports(-), especially volleyball fan

English examples

§ It has a graph- ing facility for scatterplots (Merge)

§ any child whose self- esteem needs a boost (Concatenate)

§ I called upon my Uranus- Neptune entity (Concatenate)

§ horseriding, and hang- and paragliding (Leavealone)

§ some cases- and interpretation - of classic 1960s D-class movies (actually type IV: Dashify, currently ignored, i.e., treated as Leavealone)

HyDRA – Hyphenation Detection and Repair Application

The common HyDRA component uses the frequency of a bigram b (of the form A- B) and the frequencies of its dehyphenation transformations in the corpus:

§ frequency of the bigram itself: f(b)

§ frequency of the concatenation of the bigram: f(C(b))

§ frequency of the merge of the bigram: f(M(b))

§ currently, not even the frequencies of the two parts are used

§ currently raw frequencies, no (smoothed) probabilities

Decision

The HyDRA API offers one function, hydra(), which returns the most frequent of b, C(b), and M(b).

This was our first “baseline” attempt ...

and we left it at that.
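
The decision rule can be sketched in a few lines of Python. This is not the actual HyDRA code: the helper names, the string-keyed frequency table, the toy counts, and the tie-breaking in favour of the unchanged bigram are all assumptions made for illustration.

```python
from collections import Counter

def merge(bigram: str) -> str:
    """M(b): drop hyphen and space, e.g. 'Seiten- streifen' -> 'Seitenstreifen'."""
    left, right = bigram.split("- ", 1)
    return left + right

def concatenate(bigram: str) -> str:
    """C(b): keep the hyphen, drop the space, e.g. 'U- Bootalarm' -> 'U-Bootalarm'."""
    left, right = bigram.split("- ", 1)
    return left + "-" + right

def hydra(bigram: str, freq: Counter) -> str:
    """Return the most frequent of b, C(b), and M(b).

    Ties are resolved in favour of b (Leavealone), because max() returns
    the first maximal element; this tie-breaking is our assumption.
    """
    candidates = [bigram, concatenate(bigram), merge(bigram)]
    return max(candidates, key=lambda form: freq[form])

# Invented toy counts, for illustration only.
freq = Counter({
    "Seiten- streifen": 3,
    "Seiten-streifen": 2,
    "Seitenstreifen": 1200,
    "Film- und": 5000,
})

print(hydra("Seiten- streifen", freq))  # Merge wins
print(hydra("Film- und", freq))         # Leavealone wins
```

Because a Counter returns 0 for unseen keys, forms that never occur in the corpus simply lose against any attested alternative.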

Evaluation

§ data and frequencies from a DECOW2012 slice (1.5 bn tokens)

§ n=684

§ type accuracy 63.7%

§ token accuracy 99.6% (tokens of the types from the sample in the whole corpus)

§ primary reason for low type accuracy: many nominal compounds separated at “-” are not concatenated when the concatenated form is unseen in the corpus

§ example: Foto- Frau → Foto-Frau

Solutions considered/chosen

§ use unigram frequencies of parts of the bigram (probably useless)

§ ... or use a language-specific rule for German based on mixed capitalization of German nouns

§ one simple exception rule, quite effective: If both parts have mixed capitalization, Concatenate!

§ Type-Accuracy 91.8% (+28.1%)

§ Token-Accuracy 99.9% (+0.03%)

§ downside: does not generalize to other languages; rules and evaluations for other languages missing
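
A possible reading of the exception rule is sketched below. Two things are assumptions on our part: that "mixed capitalization" means an initial capital followed by lower-case letters, and that the rule overrides whatever the frequency-based decision returned (passed in here as `fallback`). The function names are hypothetical.

```python
def has_mixed_capitalization(word: str) -> bool:
    """Assumed reading: first letter upper case, remaining letters lower case."""
    return len(word) > 1 and word[0].isupper() and word[1:].islower()

def apply_exception_rule(left: str, right: str, fallback: str) -> str:
    """Concatenate 'Left- Right' if both parts look like German nouns.

    Otherwise return the fallback, i.e. the frequency-based decision.
    """
    if has_mixed_capitalization(left) and has_mixed_capitalization(right):
        return f"{left}-{right}"
    return fallback

# 'Foto- Frau': both parts capitalized, so the rule forces Concatenate.
print(apply_exception_rule("Foto", "Frau", fallback="Foto- Frau"))
# 'an- wählen': neither part capitalized, so the fallback is kept.
print(apply_exception_rule("an", "wählen", fallback="anwählen"))
```

The rule relies on German noun capitalization, which is why it does not carry over to languages without that convention.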

Spellingbee – Spelling correction

Correct form (frequency)   Misspelled form         N    Edit distance
übernimmt (297440)         ubernimmt               81   1
                           überninmmt              6    1
                           übernimmnt              1    1
                           überniemt               6    1
                           öbernimmt               6    1
                           übernimnmt              5    1
                           übernimmet              11   1
                           überniehmt              2    2
                           überniemt               6    1
                           übernihmt               17   1
                           überniiiiiiiimmmmmmmt   1    12
                           überniommt              4    1
                           übernimrnt              4    2
                           überniummt              2    1
                           überninnt               2    2
                           übernimmtt              2    1
                           ...

Some spelling variants of German übernimmt (‘takes over’) from DECOW2012.
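
The slide does not name the distance metric. Plain Levenshtein distance (unit-cost insertion, deletion, substitution) is consistent with rows such as überniehmt (2) and überninnt (2), so the sketch below assumes it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

for variant in ["ubernimmt", "überniehmt", "überninnt"]:
    print(variant, levenshtein(variant, "übernimmt"))
```

Note that the result depends on how umlauts are encoded: the counts above assume precomposed characters (one code point for ü).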

Automatic spelling correction

1. Identify potential misspellings

2. Select candidate for replacement

2.1 Produce candidates

2.2 Rank candidates

3. Decision: replace “misspelling” with best candidate?

Order also reflects the complexity of these steps.

Identifying potential misspellings

§ using Enchant library

§ consult dictionaries:
  § Aspell enGB
  § Aspell enUS
  § additional custom word list

Ranking candidates

Enchant/Aspell returns a ranked set of candidates.

→ re-rank candidates using context information

Language models:

§ from “clean” part of UKCOW2012 corpus (tokens)

§ unigrams (4.2 m types)

§ bigrams (74 m types)

Candidate re-ranking

... be necessarily brief but efven that being so does not excuse ...

Candidates for efven: even, eleven, Evan, oven, ...

Conditional probabilities:

P(candidate | preceding)
P(following | candidate)
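
A minimal sketch of re-ranking by the product of the two conditional probabilities follows. It uses unsmoothed maximum-likelihood estimates from bigram and unigram counts; the function names and all counts are invented for illustration, and the talk does not say whether smoothing was applied.

```python
from collections import Counter

def cond_prob(bigrams: Counter, unigrams: Counter, first: str, second: str) -> float:
    """Maximum-likelihood estimate of P(second | first); 0.0 if 'first' is unseen."""
    if unigrams[first] == 0:
        return 0.0
    return bigrams[(first, second)] / unigrams[first]

def rerank(candidates, preceding, following, bigrams, unigrams):
    """Sort candidates by P(candidate | preceding) * P(following | candidate)."""
    def score(candidate: str) -> float:
        return (cond_prob(bigrams, unigrams, preceding, candidate)
                * cond_prob(bigrams, unigrams, candidate, following))
    # sorted() is stable, so candidates with equal scores keep their input order.
    return sorted(candidates, key=score, reverse=True)

# Invented toy counts, standing in for the UKCOW2012 language models.
unigrams = Counter({"but": 1000, "even": 500, "eleven": 50, "Evan": 20, "oven": 40})
bigrams = Counter({
    ("but", "even"): 30, ("even", "that"): 25,
    ("but", "eleven"): 1, ("oven", "that"): 1,
})

ranked = rerank(["eleven", "even", "Evan", "oven"], "but", "that", bigrams, unigrams)
print(ranked)
```

With raw counts, any candidate whose bigram with a neighbour is unseen gets a score of zero, so a real system would likely need smoothing or a fallback to the spell checker's original rank.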

Ranking: Evaluation

Best predictor: product of conditional probabilities

§ Test set: 1007 potential misspellings, 85 real misspellings (8.4%)

§ Baseline: Aspell’s best candidate: 82.4% correct

§ Our re-ranking: 88.2% correct

But: replacement is a rare event.

§ real misspellings of capitalized words: 1.82%

§ do not flag capitalized mid-sentence tokens as misspellings

§ real misspellings among remaining tokens: 32.17%

Replace or leave as is?

§ extract more information about candidate and misspelling

§ use balanced training set (50% replace, 50% leave-alone)

§ model the decision with a logistic regression

Logistic regression: predictors

Model trained on 310 instances.

§ Aspell’s original rank

§ edit distance

§ number of alternative candidates

§ capitalization of misspelling

§ frequency of candidate in document

§ frequency of misspelling in document

§ bigrams: conditional probabilities of the misspelling

§ bigrams: conditional probabilities of the candidate

§ product of bigram conditional probabilities
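
How such predictors feed a replace/leave-alone decision can be sketched as follows. The coefficients, the intercept, the feature subset, and the 0.5 threshold are placeholders chosen for illustration; they are not the fitted values from the talk.

```python
import math

def replace_probability(features: dict, weights: dict, intercept: float) -> float:
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder coefficients, NOT fitted values from the talk.
weights = {"aspell_rank": -0.5, "edit_distance": -1.0, "bigram_product": 2.0}
intercept = 0.5

# One hypothetical instance: top-ranked candidate, one edit away, good context fit.
features = {"aspell_rank": 1, "edit_distance": 1, "bigram_product": 0.8}

p = replace_probability(features, weights, intercept)
decision = "replace" if p >= 0.5 else "leave alone"
print(f"P(replace) = {p:.2f} -> {decision}")
```

In practice the coefficients would be estimated from the annotated training instances, and the threshold could be raised to trade recall for precision.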

Model evaluation

Data                                            Correct   Precision   Recall   F1
On training data                                .81       .75         .92      .83
On unseen test data (unbalanced: 30% replace)   .63       .45         .77      .56
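
As a reminder of how the last column relates to the other two, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values on the slide; small differences in the last digit are to be expected, since precision and recall are themselves rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

train_f1 = f1_score(0.75, 0.92)
test_f1 = f1_score(0.45, 0.77)

print(f"training F1: {train_f1:.3f}")
print(f"test F1:     {test_f1:.3f}")
```

The drop on unseen data is driven mainly by precision: the model still finds most real misspellings but proposes many replacements that should have been left alone.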

Summary spelling correction

§ context information helps in choosing the right candidate

§ toughest problem with web data: replace vs. leave-alone

§ so far, the model does not generalize well to unseen data

Next steps:

§ restrict domain to more predictable cases

§ incorporate noisy channel model for frequent misspellings

§ use POS information

Non-destructive normalization

Reasons for non-destructive normalization

§ our goal: carefully sampled and processed web corpora for fundamental research (theoretical linguistics, linguistic web characterization)

§ noise or distortion introduced through processing is intolerable

§ Leave major destructive design decisions to the user!

Web data specific research

§ findings in Schäfer and Sayatz (submitted): non-standard cliticized forms of the German indefinite article are frequent in web data, totally absent elsewhere

§ ein → n, einem → nem, etc.

§ longstanding morpho-syntactic and graphemic hypotheses made testable

§ In such cases, aggressive destructive normalization removes features from the corpus which make it unique!

Non-destructive normalization in COW2013

Spelling correction can be represented as an annotation layer. (Dehyphenation cannot, at least not efficiently.)

Word          POS    Lemma       Corr.Word    Corr.POS   Corr.Lemma
...                              ...
the           DT     the         the          DT         the
players       NNS    player      players      NNS        player
play          VBP    play        play         VBP        play
.             SENT   .           .            SENT       .
The           DT     the         The          DT         the
FA            NP     FA          FA           NP         FA
does          VBZ    do          does         VBZ        do
abosolutley   JJ     <unknown>   absolutely   RB         absolutely
nothing       NN     nothing     nothing      NN         nothing
to            TO     to          to           TO         to
help          VB     help        help         VB         help
Clubs         NNS    club        Clubs        NNS        club
...                              ...
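
The idea of keeping the original and the corrected annotation side by side can be sketched with a simple record type. The class and function names are hypothetical and say nothing about how COW2013 is actually stored or queried; the three rows are taken from the table.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    word: str
    pos: str
    lemma: str
    corr_word: str
    corr_pos: str
    corr_lemma: str

tokens = [
    Token("does", "VBZ", "do", "does", "VBZ", "do"),
    Token("abosolutley", "JJ", "<unknown>", "absolutely", "RB", "absolutely"),
    Token("nothing", "NN", "nothing", "nothing", "NN", "nothing"),
]

def find(tokens: List[Token], form: str, corrected: bool) -> List[int]:
    """Positions where the chosen layer (original or corrected) equals 'form'."""
    layer = "corr_word" if corrected else "word"
    return [i for i, tok in enumerate(tokens) if getattr(tok, layer) == form]

def corrected_positions(tokens: List[Token]) -> List[int]:
    """Positions where spelling correction changed the word form."""
    return [i for i, tok in enumerate(tokens) if tok.word != tok.corr_word]

print(find(tokens, "absolutely", corrected=True))   # found via the corrected layer
print(find(tokens, "absolutely", corrected=False))  # not found in the original layer
print(corrected_positions(tokens))
```

Because nothing is overwritten, a user can search the normalized layer for recall, search the original layer for non-standard forms, or compare the two to study the misspellings themselves.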

Summary

§ Web corpora are noisy, but the noise is valuable data, and the valuable data is noise.

§ High-quality dehyphenation is surprisingly simple for German and need not/cannot be executed non-destructively.

§ Spelling correction is (still) as difficult as we knew it was.

§ Huge non-destructively normalized web corpora are possible and, in fact, available (soon).

References

E. Giesbrecht and S. Evert. Part-of-speech (POS) tagging – a solved task? An evaluation of POS taggers for the German Web as Corpus. In I. Alegria, I. Leturia, and S. Sharoff, editors, Proceedings of the Fifth Web as Corpus Workshop (WAC5), pages 27–35, San Sebastián, 2009. Elhuyar Fundazioa.

G. Grefenstette and P. Tapanainen. What is a word? What is a sentence? In Proceedings of 3rd Conference on Computational Lexicography and Text Research, 1994.

V. Liu and J. R. Curran. Web text corpus for natural language processing. In 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006, pages 233–240, 2006.

R. Schäfer and F. Bildhauer. Building large corpora from the web using a new efficient tool chain. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), pages 486–493, Istanbul, 2012. ELRA.

R. Schäfer and F. Bildhauer. Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan and Claypool, San Francisco, 2013.

R. Schäfer and U. Sayatz. Die Kurzformen des Indefinitartikels im Deutschen, submitted.

J. P. Zamorano, E. del Rosal García, and I. A. Lara. Design and development of Iberia: a corpus of scientific Spanish. Corpora, 6:145–158, 2011.
