Date post: 30-Jan-2018
Word-Classes and Part-of-Speech Tagging

Christopher Brewster
University of Sheffield
Computer Science Department
Natural Language Processing Group

[email protected]


Lecture Outline

• Definition and Example
• Motivation
• Word-classes
• A Basic Tagging System
• Transformation-Based Tagging
• Tagging Unknown Words


Definition

“the process of assigning a part-of-speech or other lexical class marker to each word in a corpus” – D. Jurafsky and J.H. Martin, 2000, Speech and Language Processing

WORDS: the girl kissed the boy on the cheek

TAGS: N V P ART …


An Example

word      lemma     tag
The       the       +DET
girl      girl      +NOUN
kissed    kiss      +VPAST
the       the       +DET
boy       boy       +NOUN
on        on        +PREP
the       the       +DET
cheek     cheek     +NOUN

from http://www.xrce.xerox.com/research/mltt/toolhome.html


Motivation: the uses of Tagging

• Speech synthesis – pronunciation
• Speech recognition – class-based N-grams
• Information retrieval – stemming
• Word-sense disambiguation
• Corpus analysis of language & lexicography


Word Classes

• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …

• Open vs. Closed classes.
– Closed e.g. determiners: a, an, the
  pronouns: she, he, I, others
  prepositions: on, under, over, near, by, at, from, to, with


Word Classes: Tag sets

• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Simple morphology = more ambiguity = fewer tags
– Some tagging approaches (e.g. constraint grammar based) make fewer distinctions, e.g. conflating adverbs, particles and interjections


Word Classes: Tag set example

Tag    Description              Example
CC     coordin. conjunction     and, but, or
CD     cardinal number          one, two, three
DT     determiner               a, the
EX     existential ‘there’      there
FW     foreign word             mea culpa
IN     preposition              of, in, by
JJ     adjective                yellow
JJR    adj. compar.             bigger
NN     noun singular or mass    llama
NNS    noun, plural             llamas

from the Penn Treebank part-of-speech tag set.


The Problem

• Words often have more than one word class: this
– This is a nice day = PR
– This day is nice = ADJ
– You can go this far. = ADV


Word Class Ambiguity (in the Brown Corpus)

Unambiguous (1 tag)     35,340
Ambiguous (2-7 tags)     4,100
    2 tags               3,760
    3 tags                 264
    4 tags                  61
    5 tags                  12
    6 tags                   2
    7 tags                   1 (still)

from DeRose (1988)
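
As a quick check on the figures above, the per-row counts can be summed and the share of unambiguous entries computed. This is only an arithmetic sketch; it assumes the figures count distinct word types rather than running tokens, and the variable names are illustrative.

```python
# Counts by number of tags, as quoted above from DeRose (1988).
# Assumption: each count is a number of distinct word types.
types_by_tag_count = {1: 35340, 2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

# The 2-7 tag rows should add up to the "Ambiguous" total.
ambiguous = sum(n for tags, n in types_by_tag_count.items() if tags > 1)
total = sum(types_by_tag_count.values())
unambiguous_share = types_by_tag_count[1] / total

print(ambiguous)                    # 4100
print(total)                        # 39440
print(round(unambiguous_share, 3))  # 0.896
```

The result, roughly 90% with a single tag, is the figure that the most-frequent-tag baseline later relies on.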


A Basic System: the PARTS program

• “PARTS – A System for Assigning Word Classes to English Texts”, L.L. Cherry

• Uses list of function words, and list of suffixes and auxiliaries as key sources of information

• many combination classes e.g. noun_adj
• words that are members of more than 2 classes are initially assigned unk


The PARTS program: input

• List of function words and irregular verbs with tags:

able, adj       will, aux     or, conj      outside, prep
every, adj      do, auxv      but, conj     up, prep
own, adj        be, be        begun, ed     over, prep
ago, adj_adv    and, conj     bitten, ed    until, prep_adv

• List of suffixes with most probable tag for words of that suffix.

ic, adj       ship, noun       age, noun    ment, noun
ance, noun    ant, noun_adj    ize, verb    ary, adj

– suffixes chosen by hand
– if most words with suffix have only 1 or 2 tags, this single or combined class assigned, exceptions added to exception list
– exception list has many obscure words

• A text


The PARTS program: step 1 pre-processing

1. tokenises words and sentences
• word = string of characters separated by blanks or punctuation
• sentence = string of words ending in .?! (other punctuation is treated as a comma)

2. marks capitalised words not starting sentences as noun_adj

3. marks hyphenated words as noun_adj

4. looks up function words & irregular verbs in the list


The PARTS program: step 2 suffix analysis

1. applies to words NOT assigned tags in step 1
2. look up suffix list
3. unassigned words go on to step 3
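
Steps 1 and 2 can be sketched roughly as follows. This is not Cherry's implementation: the two dictionaries are small excerpts of the lists shown on the input slide, the function name `initial_class` is hypothetical, and preferring the longest matching suffix is an assumption made here.

```python
# Excerpts of the hand-built lists from the input slide (not the full lists).
FUNCTION_WORDS = {"able": "adj", "will": "aux", "or": "conj", "outside": "prep",
                  "do": "auxv", "but": "conj", "over": "prep", "until": "prep_adv"}
SUFFIXES = {"ic": "adj", "ship": "noun", "age": "noun", "ment": "noun",
            "ance": "noun", "ant": "noun_adj", "ize": "verb", "ary": "adj"}


def initial_class(word, sentence_initial):
    """Return a (possibly combined) class, or None if steps 1-2 assign nothing."""
    lower = word.lower()
    # Step 1: capitalised words that do not start a sentence.
    if word[0].isupper() and not sentence_initial:
        return "noun_adj"
    # Step 1: hyphenated words.
    if "-" in word:
        return "noun_adj"
    # Step 1: function words and irregular verbs from the list.
    if lower in FUNCTION_WORDS:
        return FUNCTION_WORDS[lower]
    # Step 2: suffix list; trying longer suffixes first is an assumption.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return SUFFIXES[suffix]
    return None  # unassigned words go on to step 3


print(initial_class("payment", False))    # noun
print(initial_class("Sheffield", False))  # noun_adj
print(initial_class("dog", False))        # None
```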


The PARTS program: step 3 word class assignment

1. finds verb in the sentence (using auxiliary)
2. finds nouns
3. applies a set of rules of form:

verb_adj & ~a => verb
“if the word has been assigned the class verb_adj and the verb has not been recognised in the sentence, assign verb to it”
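
A minimal sketch of how one such rule might be applied to the classes of a single sentence. It reads `a` as "a verb has already been recognised in the sentence", following the gloss above; the function name and the left-to-right scan are assumptions, not details taken from PARTS.

```python
def apply_verb_rule(classes):
    """Apply `verb_adj & ~a => verb` over the word classes of one sentence."""
    result = list(classes)
    verb_seen = "verb" in result  # the condition written `a` in the rule
    for i, cls in enumerate(result):
        if cls == "verb_adj" and not verb_seen:
            result[i] = "verb"
            verb_seen = True  # later verb_adj words no longer match ~a
    return result


print(apply_verb_rule(["art", "noun", "verb_adj", "art", "noun"]))
# ['art', 'noun', 'verb', 'art', 'noun']
print(apply_verb_rule(["pron", "verb", "verb_adj"]))
# ['pron', 'verb', 'verb_adj']
```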


The PARTS program: results and example

• 95% correct assignment
• 41.5% of errors arise from noun-adjective confusion
• Example: They act as messengers for the legislators.

They     act     as          messengers    for         the    legislators
pronp    unk     prep_adv    nv_pl         prep_adv    art    nv_pl
pron     verb    prep        noun          prep        art    noun


Other methods: Stochastic Tagging

• Not based on rules, but on probability of a certain tag occurring given … various possibilities.

• Necessitates a TRAINING CORPUS, i.e. a hand-tagged text in order to derive probabilities.

• Problem: no probabilities for words not in corpus
• Problem: Bad results if training corpus is very different from test corpus


Stochastic tagging

• Method: Choose most frequent tag in training text for each word.
– Result: 90% accuracy
– Reason: cf. figures on word class ambiguity where 90% of words have only one tag
– Therefore: this is a baseline, and any other method must do significantly better
– cf. HMM tagging (lecture of Nick Webb)
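
The most-frequent-tag baseline can be sketched in a few lines. The training sentences below are invented for illustration, the function names are hypothetical, and falling back to NN for unseen words is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training text."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, lexicon, default="NN"):
    """Tag known words from the lexicon; unseen words get the default tag."""
    return [(w, lexicon.get(w.lower(), default)) for w in words]


# Toy training data, invented for illustration only.
train = [
    [("the", "DT"), ("race", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("race", "VB"), ("tomorrow", "NN")],
    [("the", "DT"), ("race", "NN"), ("was", "VBD"), ("long", "JJ")],
]
lexicon = train_baseline(train)
print(tag_baseline(["They", "race", "tomorrow"], lexicon))
# [('They', 'PRP'), ('race', 'NN'), ('tomorrow', 'NN')]
```

Here "race" is mis-tagged as NN because that is its more frequent tag in the toy data, which is exactly the kind of error a context-sensitive method has to correct.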


Transformation-Based Learning Tagging (Brill Tagging)

• Combination of rule-based AND stochastic tagging methodologies
– Like rule-based because rules are used to specify tags in a certain environment
– Like stochastic approach because machine learning is used, with a tagged corpus as input

• Input:
– a tagged corpus
– a dictionary (with the most frequent tags)


TBL: Rule Application

• Example rules:
– Change NN to VB when previous tag is TO
– For example: race has the following probabilities in the Brown corpus:
  • P(NN|race) = .98
  • P(VB|race) = .02

… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
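
A minimal sketch of applying the example rule to the sentence above. The function name and argument order are hypothetical, and the sketch lets each change take effect immediately as it scans left to right, which is one possible convention rather than a detail stated here.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out


sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
retagged = apply_rule(sentence, "NN", "VB", "TO")
print(retagged)
# [('is', 'VBZ'), ('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```

Only "race" changes: "tomorrow" is also tagged NN, but its previous tag is not TO.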


TBL: Rule Learning

• 2 parts to a rule:
– Triggering environment
– Rewrite rule

• The range of Triggering environments or templates (from Manning & Schütze 1999:363):

Schema    ti-3   ti-2   ti-1   ti   ti+1   ti+2   ti+3
1 to 9    one * per schema row (column positions not recoverable in this transcript)


TBL: Rule Learning (2)

• Templates are like underspecified rules:

– Replace tag X with tag Y, provided tag Z or word Z’ appears in some position

• Rules are learned in ordered sequence – whichever gives best net improvement at each iteration of the learning algorithm.

• Rules may interact i.e. Rule 1 may make a change which provides context for Rule 2 to fire.

• Rules are compact (a few hundred) and can be inspected by humans (vs. impossibility of inspecting HMM transition probabilities)


TBL: the Algorithm

• Step 1: Label every word with most likely tag (from dictionary)
• Step 2: Check every possible transformation & select one which most improves tagging (with respect to hand-tagged corpus)
• Step 3: Re-tag corpus applying the rules
• Repeat 2-3 until some stopping criterion is reached, e.g. x % correct with respect to training corpus
• RESULT: a sequence of transformation rules
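
The loop above can be sketched as a greedy search. This is a toy version, not Brill's implementation: it uses a single template ("change X to Y when the previous tag is Z"), treats the corpus as one flat sequence, and stops when no rule reduces the error count or a hypothetical `max_rules` limit is reached. All names and the toy data are illustrative.

```python
def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag); returns a new list of tags."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(words, gold, lexicon, max_rules=10):
    # Step 1: label every word with its most likely tag (from the dictionary).
    current = [lexicon[w] for w in words]
    tagset = sorted(set(gold) | set(current))
    learned = []
    while len(learned) < max_rules:
        # Step 2: score every instantiation of the template, keep the best.
        best_rule, best_errors = None, errors(current, gold)
        for x in tagset:
            for y in tagset:
                if x == y:
                    continue
                for z in tagset:
                    e = errors(apply_rule(current, (x, y, z)), gold)
                    if e < best_errors:
                        best_rule, best_errors = (x, y, z), e
        if best_rule is None:  # stopping criterion: no rule helps
            break
        # Step 3: re-tag the corpus with the selected rule.
        current = apply_rule(current, best_rule)
        learned.append(best_rule)
    return learned  # RESULT: an ordered sequence of transformation rules


# Toy hand-tagged corpus and dictionary, invented for illustration.
words = ["to", "race", "the", "race"]
gold = ["TO", "VB", "DT", "NN"]
lexicon = {"to": "TO", "race": "NN", "the": "DT"}
rules = learn_rules(words, gold, lexicon)
print(rules)  # [('NN', 'VB', 'TO')]
```

Scoring every candidate against the whole corpus at each iteration is what makes learning slow in a naive version like this one.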


TBL: Problems

• Execution Speed: TBL tagger is slow compared to HMM approach
– Solution: compile the rules to a Finite State Transducer (FST)

• Learning Speed: Brill’s implementation takes over a day (600k tokens)


Tagging Unknown Words

• New words added to (newspaper) language 20+ per month.
• Plus many proper names …
• Increases error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a probability distribution similar to hapax legomena
• Method 3: use capitalisation, suffixes, etc. This works very well for morphologically complex languages
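
Method 2 can be sketched by estimating a tag distribution from the words that occur exactly once in the training text and using it for any unseen word. The corpus below is invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def hapax_tag_distribution(tagged_words):
    """Estimate P(tag | unknown word) from words seen exactly once."""
    word_counts = Counter(word for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words
                         if word_counts[word] == 1)
    total = sum(hapax_tags.values())
    return {tag: n / total for tag, n in hapax_tags.items()}


# Toy tagged text, invented for illustration only.
corpus = [("the", "DT"), ("llama", "NN"), ("saw", "VBD"), ("the", "DT"),
          ("yellow", "JJ"), ("llamas", "NNS"), ("the", "DT"), ("boy", "NN")]
dist = hapax_tag_distribution(corpus)
print(dist["NN"])                # 0.4
print(max(dist, key=dist.get))   # NN
```

In this toy data the most likely tag for an unknown word comes out as NN, which is consistent with Method 1, but the distribution also keeps some probability for the other open-class tags.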


Further Reading

• Introductory:
– Jurafsky, Daniel & James H. Martin, Speech and Language Processing, Prentice Hall: 2000, Chapter 8, pp. 285-322
– Manning, Christopher & Hinrich Schütze, Foundations of Statistical Natural Language Processing, Chap. 10, pp. 341-380

• Texts:
– Brill, Eric. Transformation-based error-driven learning and natural language processing: A case-study in part-of-speech tagging. Computational Linguistics 21:543-565, 1995
– Cherry, L. PARTS: a system for assigning word classes to English text. AT&T memorandum. 1978
– Church, K. A stochastic parts program and noun phrase parser for unrestricted text. Second Conference on Applied NLP, Austin, 1988
– Garside, Roger, Geoffrey Sampson and Geoffrey Leech (eds) The Computational Analysis of English: a corpus-based approach. London: 1987

Also check the papers referred to in the Introductory references.

