
Date post: 25-Feb-2016
Page 1: Corpus-based computational linguistics or computational corpus linguistics? Joakim Nivre Uppsala University Department of Linguistics and Philology

Corpus-based computational linguistics or computational corpus linguistics?

Joakim Nivre

Uppsala University
Department of Linguistics and Philology

Page 2

Outline

• Different worlds?
  – Corpus-based computational linguistics
  – Computational corpus linguistics
  – Similarities and differences
  – Opportunities for collaboration
• Computational linguistics – an example
  – Dependency-based syntactic analysis
  – Machine learning

Page 3

Different worlds?

Page 4

Corpora and computers

• The empirical revolution in (computational) linguistics:
  – Increased use of empirical data
  – Development of large corpora
  – Annotation of corpus data (syntactic, semantic)
• Underlying causes:
  – Technical development:
    • Availability of machine-readable text (and digitized speech)
    • Computational capacity:
      – Storage
      – Processing
  – Scientific shift:
    • Criticism of armchair linguistics
    • Development of statistical language models

Page 5

Computational corpus linguistics

• Goal:
  – Knowledge of language
    • Descriptive studies
    • Theoretical hypothesis testing
• Means:
  – Corpus data as a source of knowledge of language
    • Descriptive statistics
    • Statistical inference for hypothesis testing
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and visualization (for humans)
    • Statistical analysis (descriptive and inferential)

Page 6

Corpus-based computational linguistics

• Goal:
  – Computer programs that process natural language
    • Practical applications (translation, summarization, …)
    • Models of language learning and use
• Means:
  – Corpus data as a source of knowledge of language:
    • Statistical inference for model parameters (estimation)
  – Computer programs for processing corpus data
    • Corpus development and annotation
    • Search and information extraction (for computers)
    • Statistical analysis (estimation/machine learning)

Page 7

Corpus processing 1

• Corpus development:
  – Tokenization (minimal units, words, etc.)
  – Segmentation (on several levels)
  – Normalization (e.g., abbreviations, orthography, multi-word units; graphical elements, metadata, etc.)
• Annotation:
  – Part-of-speech tagging (word → word class)
  – Lemmatization (word → base form/lemma)
  – Syntactic analysis (sentence → syntactic representation)
  – Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
  – Automatic analysis (often based on other corpus data)
  – Manual validation (and correction)

Page 8

Corpus processing 2

• Searching and sorting:
  – Search methods:
    • String matching
    • Regular expressions
    • Dedicated query languages
    • Special-purpose programs
  – Results:
    • Concordances
    • Frequency lists
• Visualization:
  – Textual:
    • Concordances, etc.
  – Graphical:
    • Diagrams, syntax trees, etc.
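
As an illustration of regular-expression search producing the two result types named above, here is a minimal Python sketch. The tokenizer, the function names (`tokenize`, `frequency_list`, `concordance`), the bracketed keyword format, and the sample text are all assumptions made for this example; they are not taken from any particular corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Return (token, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def concordance(tokens, pattern, width=2):
    """Return one keyword-in-context line per token that fully matches the regex."""
    regex = re.compile(pattern)
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical two-sentence "corpus" for demonstration only.
text = "Economic news had little effect on financial markets. The news was economic."
tokens = tokenize(text)
freq = frequency_list(tokens)
kwic = concordance(tokens, r"econom\w*", width=2)

for line in kwic:
    print(line)
```

A dedicated query language would add search over annotation layers (for example part-of-speech tags), which this token-level sketch does not attempt.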

Page 9

Corpus processing 3

• Statistical analysis:
  – Descriptive statistics
    • Frequency tables and diagrams
  – Statistical inference
    • Hypothesis testing (t-test, χ², Mann-Whitney, etc.)
    • Machine learning:
      – Probabilistic: Estimate probability distributions
      – Discriminative: Approximate the mapping from input to output
      – Induction of lexical and grammatical resources (e.g. collocations, valency frames)
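
To make the hypothesis-testing item concrete, the following Python sketch computes Pearson's χ² statistic for a 2×2 contingency table, a common way to ask whether a word is distributed differently in two corpora. The function name `chi_square_2x2` and the counts are hypothetical, invented purely for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic

# Hypothetical counts: a word occurs 30 times in 1000 tokens of corpus A
# and 10 times in 1000 tokens of corpus B.
stat = chi_square_2x2(30, 970, 10, 990)

# 3.841 is the critical value of chi-square with 1 degree of freedom at the 5% level.
significant = stat > 3.841
print(round(stat, 3), significant)
```

With these made-up counts the statistic is about 10.2, so the difference in relative frequency would count as significant at the 5% level.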

Page 10

User Requirements

• Corpus linguists
  – Software
    • Accessible
    • Easy to use
    • General
  – Output
    • Suitable for humans
    • Perspicuous (graphical visualization)
  – Functions
    • Specific search
    • Descriptive statistics
• Computational linguists
  – Software
    • Efficient
    • Modifiable
    • Specific
  – Output
    • Suitable for computers
    • Well-defined format (annotated text)
  – Functions
    • Exhaustive search
    • Statistical learning

Page 11

Summary

• Different goals:
  – Study language
  – Create computer programs
• … give (partly) different requirements:
  – Accessible and usable (for humans)
  – Efficient and standardized (for computers)
• … but (partly) the same needs:
  – Corpus development and annotation
  – Searching, sorting, and statistical analysis

Page 12

Symbiosis?

• What can computational linguists do for corpus linguists?
  – Technical and general linguistic competence
  – Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
  – Linguistic and language-specific competence
  – Manual validation of automatic analysis
• What can they achieve together?
  – Automatic annotation improves precision in corpus linguistics
  – Manual validation improves precision in computational linguistics
  – A virtuous circle?

Page 13

Computational linguistics – an example

Page 14

Dependency analysis

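[Figure: dependency tree for the example sentence, shown here as a table]

ID  Word       POS   Head  Type
0   ROOT
1   Economic   JJ    2     NMOD
2   news       NN    3     SBJ
3   had        VBD   0     ROOT
4   little     JJ    5     NMOD
5   effect     NN    3     OBJ
6   on         IN    5     NMOD
7   financial  JJ    8     NMOD
8   markets    NNS   6     PMOD
9   .          .     3     P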

Page 15

Inductive dependency parsing

• Deterministic syntactic analysis (parsing):
  – Algorithm for deriving dependency structures
  – Requires decision function in choice situations
  – All decisions are final (deterministic)
• Inductive machine learning:
  – Decision function based on previous experience
  – Generalize from examples (successive refinement)
  – Examples = Annotated sentences (treebank)
  – No grammar – just analogy

Page 16

Algorithm

• Data structures:
  – Queue of unanalyzed words (next = first in queue)
  – Stack of partially analyzed words (top = on top of stack)
• Start state:
  – Empty stack
  – All words in queue
• Algorithm steps:
  – Shift: Put next on top of stack (push)
  – Reduce: Remove top from stack (pop)
  – Right: Put next on top of stack (push); link top → next
  – Left: Remove top from stack (pop); link next → top
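
The four steps can be sketched in a few lines of Python. This is an illustrative reading of the slide, not MaltParser's implementation: the function name `parse`, the `NAME:LABEL` step encoding, and the final attachment of headless words to an artificial root at position 0 are assumptions, and the sketch does not check preconditions (for example, that Left is only applied to a word without a head). The step sequence is one ordering that yields the tree of the "Economic news" example sentence.

```python
def parse(words, steps):
    """Apply a sequence of parser steps; return {dependent: (head, label)}."""
    queue = list(range(1, len(words) + 1))   # token positions, 1-based
    stack = []                               # start state: empty stack
    arcs = {}
    for step in steps:
        name, _, label = step.partition(":")
        if name == "SHIFT":                  # put next on top of stack
            stack.append(queue.pop(0))
        elif name == "REDUCE":               # remove top from stack
            stack.pop()
        elif name == "RIGHT":                # link top -> next, then push next
            arcs[queue[0]] = (stack[-1], label)
            stack.append(queue.pop(0))
        elif name == "LEFT":                 # link next -> top, then pop top
            arcs[stack.pop()] = (queue[0], label)
        else:
            raise ValueError(f"unknown step: {step}")
    # Assumption: words left without a head attach to the artificial root (0).
    for position in range(1, len(words) + 1):
        arcs.setdefault(position, (0, "ROOT"))
    return arcs

words = ["Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
steps = ["SHIFT", "LEFT:NMOD", "SHIFT", "LEFT:SBJ", "SHIFT", "SHIFT",
         "LEFT:NMOD", "RIGHT:OBJ", "RIGHT:NMOD", "SHIFT", "LEFT:NMOD",
         "RIGHT:PMOD", "REDUCE", "REDUCE", "REDUCE", "RIGHT:P"]
arcs = parse(words, steps)

for dependent in sorted(arcs):
    head, label = arcs[dependent]
    print(dependent, words[dependent - 1], head, label)
```

Each word is pushed at most once and popped at most once, so the number of steps grows linearly with sentence length. In a real parser the `steps` list is not given in advance; each step is chosen by the decision function.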

Page 17

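Algorithm example

ROOT(0) Economic(1) news(2) had(3) little(4) effect(5) on(6) financial(7) markets(8) .(9)

POS: JJ NN VBD JJ NN IN JJ NNS .

Steps (LA = Left, RA = Right), in the order that derives the tree:
SHIFT, LA(NMOD), SHIFT, LA(SBJ), SHIFT, SHIFT, LA(NMOD), RA(OBJ), RA(NMOD), SHIFT, LA(NMOD), RA(PMOD), REDUCE, REDUCE, REDUCE, RA(P)

[Figure: the resulting tree, with arcs labelled NMOD, SBJ, NMOD, OBJ, NMOD, NMOD, PMOD and P, and "had" attached to ROOT]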

Page 18

Decision function

• Non-determinism:
  – [Figure: "eats pizza with …", with an OBJ arc already built; the next step could be RA(ATT) or RE]
• Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
  – Grammar?
  – Inductive generalization!

Page 19

Machine learning

• Decision function:
  – (Queue, Stack, Graph) → Step
• Model:
  – (Queue, Stack, Graph) → (f1, …, fn)
• Classifier:
  – (f1, …, fn) → Step
• Learning:
  – { ((f1, …, fn), Step) } → Classifier

Page 20

Model

• Parts of speech: t1, top, next, n1, n2, n3

• Dependency types: t.hd, t.ld, t.rd, n.ld

• Word forms: top, next, top.hd, n1

[Figure: the stack (…, t1, top) and the queue (next, n1, n2, n3, …), with th marking the head of top and arcs labelled hd, ld, rd on top and ld on next]
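
The mapping from a parser state to the features listed on this slide might be sketched as follows. The interpretation of the feature names is an assumption: t1 is taken to be the stack item below top, n1 to n3 the queue items after next, hd the arc from a word's head, and ld/rd the leftmost/rightmost dependent. The function name `extract_features`, the `NIL` placeholder, and the feature keys are hypothetical.

```python
NIL = "NIL"  # placeholder value for missing context (assumed convention)

def extract_features(stack, queue, pos, form, arcs):
    """Map (stack, queue, graph) to a feature dictionary.

    stack and queue hold token positions; pos and form map positions to
    tags and word forms; arcs maps dependent -> (head, label).
    """
    def item(seq, index):
        try:
            return seq[index]
        except IndexError:
            return None

    def outer_dep_label(head, side):
        """Label of the leftmost or rightmost dependent of head, if any."""
        if head is None:
            return NIL
        deps = [d for d, (h, _) in arcs.items() if h == head]
        if side == "left":
            deps = [d for d in deps if d < head]
            return arcs[min(deps)][1] if deps else NIL
        deps = [d for d in deps if d > head]
        return arcs[max(deps)][1] if deps else NIL

    top, t1 = item(stack, -1), item(stack, -2)
    nxt, n1, n2, n3 = (item(queue, i) for i in range(4))
    top_head = arcs[top][0] if top in arcs else None
    return {
        # Parts of speech
        "p(t1)": pos.get(t1, NIL), "p(top)": pos.get(top, NIL),
        "p(next)": pos.get(nxt, NIL), "p(n1)": pos.get(n1, NIL),
        "p(n2)": pos.get(n2, NIL), "p(n3)": pos.get(n3, NIL),
        # Dependency types
        "d(t.hd)": arcs[top][1] if top in arcs else NIL,
        "d(t.ld)": outer_dep_label(top, "left"),
        "d(t.rd)": outer_dep_label(top, "right"),
        "d(n.ld)": outer_dep_label(nxt, "left"),
        # Word forms
        "w(top)": form.get(top, NIL), "w(next)": form.get(nxt, NIL),
        "w(top.hd)": form.get(top_head, NIL), "w(n1)": form.get(n1, NIL),
    }

# Example state: "Economic news had little effect" processed, "on" is next.
words = ["Economic", "news", "had", "little", "effect",
         "on", "financial", "markets", "."]
tags = ["JJ", "NN", "VBD", "JJ", "NN", "IN", "JJ", "NNS", "."]
form = {i + 1: w for i, w in enumerate(words)}
pos = {i + 1: t for i, t in enumerate(tags)}
arcs = {1: (2, "NMOD"), 2: (3, "SBJ"), 4: (5, "NMOD"), 5: (3, "OBJ")}
features = extract_features([3, 5], [6, 7, 8, 9], pos, form, arcs)
print(features)
```

The resulting dictionary corresponds to the vector (f1, …, fn) that a classifier maps to a parser step.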

Page 21

Memory-based learning

• Memory-based learning and classification:
  – Learning is storing experiences in memory.
  – Problem solving is achieved by reusing solutions of similar problems experienced in the past.
• TIMBL (Tilburg Memory-Based Learner):
  – Basic method: k-nearest neighbor
  – Parameters:
    • Number of neighbors (k)
    • Distance metrics
    • Weighting of attributes, values and instances

Page 22

Learning example

• Instance base:
  1. (a, b, a, c) → A
  2. (a, b, c, a) → B
  3. (b, a, c, c) → C
  4. (c, a, b, c) → A
• New instance:
  5. (a, b, b, a)
• Distances (number of mismatching attributes):
  1. D(1, 5) = 2
  2. D(2, 5) = 1
  3. D(3, 5) = 4
  4. D(4, 5) = 3
• k-NN:
  1. 1-NN(5) = B
  2. 2-NN(5) = A/B
  3. 3-NN(5) = A
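
The numbers on this slide can be reproduced with a short Python sketch of k-nearest-neighbor classification, assuming the distance is the count of mismatching attribute values (an overlap metric). This is plain k-NN with unweighted voting, not TiMBL's implementation; the names `overlap_distance` and `knn_classes` are hypothetical.

```python
from collections import Counter

def overlap_distance(x, y):
    """Number of positions at which two instances differ."""
    return sum(a != b for a, b in zip(x, y))

def knn_classes(instance_base, new_instance, k):
    """Return the class(es) with the most votes among the k nearest instances."""
    ranked = sorted(instance_base,
                    key=lambda item: overlap_distance(item[0], new_instance))
    votes = Counter(label for _, label in ranked[:k])
    best = max(votes.values())
    return sorted(label for label, count in votes.items() if count == best)

instance_base = [
    (("a", "b", "a", "c"), "A"),
    (("a", "b", "c", "a"), "B"),
    (("b", "a", "c", "c"), "C"),
    (("c", "a", "b", "c"), "A"),
]
new_instance = ("a", "b", "b", "a")

distances = [overlap_distance(x, new_instance) for x, _ in instance_base]
results = {k: knn_classes(instance_base, new_instance, k) for k in (1, 2, 3)}
print(distances, results)
```

With k = 2 the vote is tied between A and B, which is why the function returns a list of classes rather than a single class.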

Page 23

Experimental evaluation

• Inductive dependency analysis:
  – Deterministic algorithm
  – Memory-based decision function
• Data:
  – English:
    • Penn Treebank, WSJ (1M words)
    • Converted to dependency structure
  – Swedish:
    • Talbanken, Professional prose (100k words)
    • Dependency structure based on MAMBA annotation

Page 24

Results

• English:
  – 87.3% of all words got the correct head
  – 85.6% of all words got the correct head and label
• Swedish:
  – 85.9% of all words got the correct head
  – 81.6% of all words got the correct head and label
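
The two measures reported here (often called unlabeled and labeled attachment score) can be computed as in the Python sketch below. The function name `attachment_scores` and the predicted analysis are hypothetical; the predictions are invented to show one wrong head and one wrong label, and are not output from any parser.

```python
def attachment_scores(gold, predicted):
    """Return (share of words with correct head, share with correct head and label).

    gold and predicted are lists of (head, label) pairs, one pair per word.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / len(gold), both_ok / len(gold)

# Gold analysis of "Economic news had little effect on financial markets ."
gold = [(2, "NMOD"), (3, "SBJ"), (0, "ROOT"), (5, "NMOD"), (3, "OBJ"),
        (5, "NMOD"), (8, "NMOD"), (6, "PMOD"), (3, "P")]

# Hypothetical parser output: word 5 has a wrong label, word 6 a wrong head.
predicted = list(gold)
predicted[4] = (3, "PRD")
predicted[5] = (3, "VMOD")

unlabeled, labeled = attachment_scores(gold, predicted)
print(f"{unlabeled:.1%} correct head, {labeled:.1%} correct head and label")
```

The second score can never exceed the first, since a word with a correct head and label necessarily has a correct head.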

Page 25

Dependency types: English

• High precision (86% ≤ F):
    VC (auxiliary verb → main verb)     95.0%
    NMOD (noun modifier)                91.0%
    SBJ (verb → subject)                89.3%
    PMOD (complement of preposition)    88.6%
    SBAR (complementizer → verb)        86.1%
• Medium precision (73% ≤ F ≤ 83%):
    ROOT                                82.4%
    OBJ (verb → object)                 81.1%
    VMOD (adverbial)                    76.8%
    AMOD (adj/adv modifier)             76.7%
    PRD (predicative complement)        73.8%
• Low precision (F < 70%):
    DEP (other)

Page 26

Dependency types: Swedish

• High precision (84% ≤ F):
    IM (infinitive marker → infinitive)  98.5%
    PR (preposition → noun)              90.6%
    UK (complementizer → verb)           86.4%
    VC (auxiliary verb → main verb)      86.1%
    DET (noun → determiner)              89.5%
    ROOT                                 87.8%
    SUB (verb → subject)                 84.5%
• Medium precision (76% ≤ F ≤ 80%):
    ATT (noun modifier)                  79.2%
    CC (coordination)                    78.9%
    OBJ (verb → object)                  77.7%
    PRD (verb → predicative)             76.8%
    ADV (adverbial)                      76.3%
• Low precision (F < 70%):
    INF, APP, XX, ID
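
Per-type F values like those above combine precision and recall for each dependency type. The Python sketch below shows one way to compute them, assuming a word counts as correct for a type only when both head and label match the gold analysis; that criterion, the function name `f_score_per_type`, and the predicted analysis are assumptions for illustration.

```python
from collections import Counter

def f_score_per_type(gold, predicted):
    """Return {label: F} where F is the harmonic mean of precision and recall."""
    correct, in_gold, in_pred = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        in_gold[g[1]] += 1
        in_pred[p[1]] += 1
        if g == p:                      # head and label both correct
            correct[g[1]] += 1
    scores = {}
    for label in set(in_gold) | set(in_pred):
        precision = correct[label] / in_pred[label] if in_pred[label] else 0.0
        recall = correct[label] / in_gold[label] if in_gold[label] else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

# Gold analysis of "Economic news had little effect on financial markets ."
gold = [(2, "NMOD"), (3, "SBJ"), (0, "ROOT"), (5, "NMOD"), (3, "OBJ"),
        (5, "NMOD"), (8, "NMOD"), (6, "PMOD"), (3, "P")]

# Hypothetical parser output with two errors (words 5 and 6).
predicted = list(gold)
predicted[4] = (3, "PRD")
predicted[5] = (3, "VMOD")

scores = f_score_per_type(gold, predicted)
for label in sorted(scores):
    print(label, round(scores[label], 3))
```

In this toy case NMOD has perfect precision but recall 3/4, giving F = 6/7, while OBJ scores 0 because its only gold instance was mislabeled.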

Page 27

Corpus annotation

• How good is 85%?
  – Good enough to save time for manual annotators
  – Good enough to improve search precision
  – Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
  – By annotation of more data, which facilitates machine learning
  – By refined linguistic analysis of the structures to be annotated and the errors made

Page 28

MaltParser

• Software for inductive dependency parsing:
  – Freely available (open source)
    • http://maltparser.org
  – Evaluated on close to 30 different languages
  – Used for annotating corpora at Uppsala University

