SIMS 290-2: Applied Natural Language Processing
Marti Hearst, October 13, 2004

Transcript

1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst, October 13, 2004

 

2

Today

Finish hand-built rule systems
Machine learning approaches to information extraction:

– Sliding windows
– Rule learners (older)
– Feature-based ML (more recent)

IE tools

3 Adapted from slides by Cunningham & Bontcheva

Two kinds of NE approaches

Knowledge Engineering

Rule-based, developed by experienced language engineers
Makes use of human intuition
Requires only a small amount of training data
Development can be very time-consuming
Some changes may be hard to accommodate

Learning Systems

Use statistics or other machine learning
Developers do not need LE expertise
Requires large amounts of annotated training data
Some changes may require re-annotation of the entire training corpus
Annotators are cheap (but you get what you pay for!)

4 Adapted from slides by Cunningham & Bontcheva

Baseline: list lookup approach

A system that recognises only entities stored in its lists (gazetteers).
Advantages: simple, fast, language-independent, easy to retarget (just create lists)
Disadvantages: impossible to enumerate all names; collection and maintenance of lists; cannot deal with name variants; cannot resolve ambiguity
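The list-lookup baseline can be sketched in a few lines; the gazetteer entries below are invented for illustration:

```python
# Minimal gazetteer-lookup NE tagger (entries are invented examples).
GAZETTEERS = {
    "LOC": {"paris", "london", "berlin"},
    "ORG": {"ibm", "reuters"},
}

def lookup_tag(tokens):
    """Tag each token with the first gazetteer list it appears in, else 'O'."""
    tags = []
    for tok in tokens:
        label = "O"
        for etype, names in GAZETTEERS.items():
            if tok.lower() in names:
                label = etype
                break
        tags.append(label)
    return tags

print(lookup_tag(["Reuters", "reported", "from", "Paris"]))
# -> ['ORG', 'O', 'O', 'LOC']
```

Retargeting really is just swapping the lists, which is the advantage noted above; the disadvantages are equally visible here: a person named "Paris" would still be tagged LOC.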

5 Adapted from slides by Cunningham & Bontcheva

Creating Gazetteer Lists

Online phone directories and yellow pages for person and organisation names (e.g., [Paskaleva02])
Location lists:

US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g., [Manov03])
UN site: http://unstats.un.org/unsd/citydata
Global Discovery database from Europa Technologies Ltd, UK (e.g., [Ignat03])

Automatic collection from annotated training data

6 Adapted from slides by Cunningham & Bontcheva

Rule-based Example: FACILE

FACILE – used in MUC-7 [Black et al 98]
Uses Inxight's LinguistiX tools for tagging and morphological analysis
A database provides external information, with a role similar to a gazetteer
Linguistic info per token, encoded as a feature vector:

Text offsets
Orthographic pattern (first/all capitals, mixed, lowercase)
Token and its normalised form
Syntax – category and features
Semantics – from database or morphological analysis
Morphological analyses

Example:
(1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (ˆPER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
PER_CIV_F – female civilian (from database)

7 Adapted from slides by Cunningham & Bontcheva

FACILE

Context-sensitive rules written in special rule notation, executed by an interpreter

Writing rules in PERL is too error-prone and hard

Rules of the kind: A => B\C/D, where:

A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector
B and D are the left and right context respectively, and can be empty
B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported

[syn=NP, sem=ORG] (0.9) => \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] / ;

8 Adapted from slides by Cunningham & Bontcheva

FACILE

# Rule for the mark-up of person names when the first name is not
# present or known from the gazetteers, e.g. 'Mr J. Cass':

[SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
  # _F, _I, _M, _S are variables that transfer info from the RHS
=>
  [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE] \
  [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
  [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
  [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
  [SYN=NAME, TOKEN=_M]?,
  [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]  # proper name, not recognised by a rule
  / ;

9 Adapted from slides by Cunningham & Bontcheva

FACILE

Preference mechanism:
The rule with the highest score is preferred
Longer matches are preferred to shorter matches
The result is always a single semantic categorisation of the named entity in the text
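A minimal sketch of such a preference mechanism, assuming each candidate produced by the rules is a (start, end, score, label) tuple (that representation is an assumption, not FACILE's actual data structure):

```python
def prefer(candidates):
    """FACILE-style preference among overlapping candidate entities:
    the highest rule score wins; ties are broken by the longer match."""
    # candidates: iterable of (start, end, score, label)
    return max(candidates, key=lambda c: (c[2], c[1] - c[0]))

cands = [(0, 2, 0.9, "ORG"), (0, 3, 0.9, "ORG"), (1, 3, 0.8, "LOC")]
print(prefer(cands))  # -> (0, 3, 0.9, 'ORG'): same score as (0, 2), but longer
```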

Evaluation (MUC-7 scores):
Organization: 86% precision, 66% recall
Person: 90% precision, 88% recall
Location: 81% precision, 80% recall
Dates: 93% precision, 86% recall

10 Slide adapted from William Cohen's

Extraction by Sliding Window

11 Slide adapted from William Cohen's

Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g., looking for the seminar location

12 Slide adapted from William Cohen's


A Naïve Bayes Sliding Window Model [Freitag 1997]

[Figure: a sliding window over the text "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …". Tokens w_{t-m} … w_{t-1} form the prefix, w_t … w_{t+n} the contents ("Wean Hall Rm 5409"), and w_{t+n+1} … w_{t+n+m} the suffix.]

Other examples of sliding window: [Baluja et al 2000] (decision tree over individual words & their context)

Try all "reasonable" windows (vary length, position)
Estimate Pr(LOCATION|window) using Bayes rule
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr("Place" in prefix | LOCATION)
If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
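A toy sketch of the scoring step, with hand-set probabilities standing in for quantities that a real system would estimate from annotated training data:

```python
import math

# Hand-set, illustrative probabilities (a real system estimates these
# from annotated data); DEFAULT is crude smoothing for unseen words.
P_LOC = 0.1
p_prefix = {"Place": 0.5, ":": 0.4}      # Pr(word in prefix | LOCATION)
p_content = {"Wean": 0.3, "Hall": 0.3}   # Pr(word in contents | LOCATION)
p_suffix = {"Speaker": 0.4}              # Pr(word in suffix | LOCATION)
DEFAULT = 0.01

def log_score(prefix, contents, suffix):
    """Naive-Bayes-style log score for 'contents is a LOCATION',
    assuming independence of prefix, content, and suffix words."""
    s = math.log(P_LOC)
    for w in prefix:
        s += math.log(p_prefix.get(w, DEFAULT))
    for w in contents:
        s += math.log(p_content.get(w, DEFAULT))
    for w in suffix:
        s += math.log(p_suffix.get(w, DEFAULT))
    return s

good = log_score(["Place", ":"], ["Wean", "Hall"], ["Speaker"])
bad = log_score(["at", "3:30"], ["pm", "in"], ["the"])
print(good > bad)  # the labeled window scores higher than a random one
```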

16 Slide adapted from William Cohen's

Naïve Bayes Sliding Window Results


Domain: CMU UseNet Seminar Announcements

Field F1:
Person Name: 30%
Location: 61%
Start Time: 98%

17 Slide adapted from William Cohen's

SRV: a realistic sliding-window-classifier IE system

What windows to consider?
All windows containing as many tokens as the shortest example, but no more tokens than the longest example

How to represent a classifier? It might:
Restrict the length of the window
Restrict the vocabulary or formatting used before/after/inside the window
Restrict the relative order of tokens
Use inductive logic programming techniques to express all these…

<title>Course Information for CS213</title>

<h1>CS 213 C++ Programming</h1>

[Freitag AAAI ‘98]

18 Slide adapted from William Cohen's

SRV: a rule-learner for sliding-window classification

Primitive predicates used by SRV:
token(X,W), allLowerCase(W), numerical(W), …
nextToken(W,U), previousToken(W,V)

HTML-specific predicates:
inTitleTag(W), inH1Tag(W), inEmTag(W), …
emphasized(W) = "inEmTag(W) or inBTag(W) or …"
tableNextCol(W,U) = "U is some token in the column after the column W is in"
tablePreviousCol(W,V), tableRowHeader(W,T), …

19

Automatic Pattern-Learning Systems

[Diagram: a Trainer takes Language Input plus Answers and produces a Model; a Decoder applies the Model to new Language Input to produce Answers.]

20

Automatic Pattern-Learning Systems

Pros:
Portable across domains
Tend to have broad coverage
Robust in the face of degraded input
Automatically find appropriate statistical patterns
System knowledge not needed by those who supply the domain knowledge

Cons:
Annotated training data, and lots of it, is needed
Isn't necessarily better or cheaper than a hand-built solution

Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas):

learn lexico-syntactic patterns from templates


21 Slide adapted from Chris Manning's

Rapier [Califf & Mooney, AAAI-99]

Rapier learns three regex-style patterns for each slot:
Pre-filler pattern
Filler pattern
Post-filler pattern

22 Slide adapted from Chris Manning's

Features for IE Learning Systems

Part of speech: syntactic role of a specific word
Semantic classes: synonyms or other related words

"Price" class: price, cost, amount, …
"Month" class: January, February, March, …, December
"US State" class: Alaska, Alabama, …, Washington, Wyoming
WordNet: large on-line thesaurus containing (among other things) semantic classes

23 Slide adapted from Chris Manning's

Rapier rule matching example

"…sold to the bank for an undisclosed amount…"
POS: vb pr det nn pr det jj nn
SClass: price

"…paid Honeywell an undisclosed price…"
POS: vb nnp det jj nn
SClass: price

Pre-filler: 1) tag: {nn,nnp}, 2) list: length 2
Filler: 1) word: "undisclosed", tag: jj
Post-filler: 1) sem: price

24 Slide adapted from Chris Manning's

Rapier Rules: Details

Rapier rule :=
  pre-filler pattern
  filler pattern
  post-filler pattern

pattern := subpattern+
subpattern := constraint+
constraint :=
  Word – exact word that must be present
  Tag – matched word must have the given POS tag
  Class – semantic class of the matched word
  Can specify disjunction with "{…}"
  List length N – between 0 and N words satisfying the other constraints
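A simplified sketch of matching one such rule, assuming the input is a list of (token, POS tag, semantic class) triples; Rapier's variable-length list constraints are omitted, and only single-token constraints are handled:

```python
# Sketch of Rapier-style constraint matching. The rule below is a
# simplified version of the "an undisclosed price" example from the slide.

def satisfies(constraint, token):
    """Check one constraint (word/tag/sem) against one token triple."""
    word, tag, sem = token
    if "word" in constraint and word.lower() not in constraint["word"]:
        return False
    if "tag" in constraint and tag not in constraint["tag"]:
        return False
    if "sem" in constraint and sem != constraint["sem"]:
        return False
    return True

def match_filler(tokens, pre, fill, post):
    """Return the filler tokens where pre-filler, filler, and post-filler
    patterns match consecutively (fixed-length patterns only)."""
    rule = pre + fill + post
    for i in range(len(tokens) - len(rule) + 1):
        window = tokens[i:i + len(rule)]
        if all(satisfies(c, t) for c, t in zip(rule, window)):
            return [t[0] for t in window[len(pre):len(pre) + len(fill)]]
    return None

sent = [("paid", "vb", None), ("Honeywell", "nnp", None),
        ("an", "det", None), ("undisclosed", "jj", None),
        ("price", "nn", "price")]
pre = [{"tag": {"det"}}]            # pre-filler: a determiner (simplified)
fill = [{"word": {"undisclosed"}}]  # filler: the exact word
post = [{"sem": "price"}]           # post-filler: semantic class "price"
print(match_filler(sent, pre, fill, post))  # -> ['undisclosed']
```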

25 Slide adapted from Chris Manning's

Rapier’s Learning Algorithm

Input: a set of training examples (a list of documents annotated with "extract this substring")
Output: a set of rules

Init: Rules = a rule that exactly matches each training example
Repeat several times:
Seed: select M examples randomly and generate the K most-accurate maximally-general filler-only rules (pre-filler = post-filler = match anything)
Grow: for N = 1, 2, 3, …, try to improve the K best rules by adding N words of pre-filler or post-filler context
Keep: Rules = Rules + the best of the K rules – subsumed rules

26 Slide adapted from Chris Manning's

Learning example (one iteration)

Two examples:
'… located in Atlanta, Georgia …'
'… offices in Kansas City, Missouri …'

[Figure: Init starts from maximally specific rules (high precision, low recall); the Seed step produces maximally general rules (low precision, high recall); the Grow step converges on an appropriately general rule (high precision, high recall).]

27 Slide adapted from William Cohen's

Rapier results: Precision vs. # Training Examples

28 Slide adapted from William Cohen's

Rapier results: Recall vs. # Training Examples

29 Slide adapted from William Cohen's

Summary: Rule-learning approaches to sliding-window classification

SRV, Rapier, and WHISK [Soderland KDD ‘97]

Representations for classifiers allow restriction of the relationships between tokens, etc.
Representations are carefully chosen subsets of even more powerful representations
Use of these "heavyweight" representations is complicated, but seems to pay off in results

30

Successors to MUC

CoNLL: Conference on Computational Natural Language Learning

Different topics each year:
2002, 2003: Language-independent NER
2004: Semantic role recognition
2001: Identify clauses in text
2000: Chunking boundaries

– http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002, …)
– Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics

ACE: Automated Content Extraction
Entity Detection and Tracking

– Sponsored by NIST– http://wave.ldc.upenn.edu/Projects/ACE/

Several others recently See http://cnts.uia.ac.be/conll2003/ner/

31

CoNLL-2003

Goal: identify boundaries and types of named entities
People, Organizations, Locations, Misc.

Experiment with incorporating external resources (gazetteers) and unlabeled data

Data:
Uses IOB notation
4 pieces of info for each term: Word, POS, Chunk, EntityType
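A short sketch of decoding the IOB notation back into entity spans (B-X starts an entity of type X, I-X continues it, O is outside; the example sentence is a standard CoNLL-style illustration):

```python
def iob_to_spans(tokens, tags):
    """Recover (entity_text, type) spans from a sequence of IOB tags."""
    spans, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and cur_type != tag[2:]):
            if cur:                       # close any open entity
                spans.append((" ".join(cur), cur_type))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-"):
            cur.append(tok)               # continue the open entity
        else:                             # "O": close any open entity
            if cur:
                spans.append((" ".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        spans.append((" ".join(cur), cur_type))
    return spans

toks = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad"]
tags = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC"]
print(iob_to_spans(toks, tags))
# -> [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```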

32

Details on Training/Test Sets

Reuters Newswire + European Corpus Initiative

Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003

33

Summary of Results

16 systems participated

Machine learning techniques:
Combinations of Maximum Entropy Models (5) + Hidden Markov Models (4) + Winnow/Perceptron (4)
Others used once were Support Vector Machines, Conditional Random Fields, Transformation-Based Learning, AdaBoost, and memory-based learning
Combining techniques often worked well

Features:
Choice of features is at least as important as the ML method
Top-scoring systems used many types
No one feature stands out as essential (other than the words themselves)


34

35

Use of External Information

Improvement from using gazetteers vs. unlabeled data was nearly equal
Gazetteers were less useful for German than for English (the English ones were of higher quality)


36

Precision, Recall, and F-Scores

[Figure: precision, recall, and F-scores per system; * marks systems whose scores are not significantly different.]


37

Combining Results

What happens if we combine the results of all of the systems?

Used a majority vote of 5 systems for each set

English: F = 90.30 (14% error reduction over the best system)
German: F = 74.17 (6% error reduction over the best system)

Top four systems in more detail …
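A per-token majority vote of this kind can be sketched as follows; the tie-breaking policy (prefer the first system's tag) is an assumption for illustration, not taken from the shared-task paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Per-token majority vote over several systems' tag sequences.
    Ties go to the tag proposed by the earliest-listed system."""
    combined = []
    for tags in zip(*predictions):        # one tuple of tags per token
        counts = Counter(tags)
        best = max(counts.values())
        # first system's tag wins among the tied most-common tags
        combined.append(next(t for t in tags if counts[t] == best))
    return combined

sys_a = ["B-PER", "O", "B-LOC"]
sys_b = ["B-PER", "O", "O"]
sys_c = ["B-ORG", "O", "B-LOC"]
print(majority_vote([sys_a, sys_b, sys_c]))  # -> ['B-PER', 'O', 'B-LOC']
```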

38

Zhang and Johnson

Experimented with the effects of different features
Used a learning method they developed called Robust Risk Minimization

Related to the Winnow method
Used it to predict the class label t_i associated with each token w_i
Estimates P(t_i = c | x_i) for every possible class label c, where x_i is a feature vector associated with token i
x_i can include information about previous tags

Found that relatively simple, language-independent features get you much of the way

39

Zhang and Johnson

Simple features include:
The tokens themselves, in a window of +/- 2
The previous 2 predicted tags
The conjunction of the previous tag and the current token
Initial capitalization of tokens, in a window of +/- 2

More elaborate features include:
Word "shape" information: initial caps, all caps, all digits, digits containing punctuation
Token prefix (length 3-4) and suffix (length 1-4)
POS and chunking info (chunk bag-of-words at the current token)
Marked-up entities from training data
Other dictionaries

40

Language independent

41

Florian, Ittycheria, Jing, Zhang

Combined four machine learning algorithms
The best-performing was the Zhang & Johnson RRM

Voting algorithm:
– Giving them all equal-weight votes worked well
– So did using the RRM algorithm to choose among them
English F-measure went from 89.94 to 91.63

Did well with the supplied features; did even better with some complex additional features:

The output of 2 other NER systems
– Trained on 1.7M annotated words in 32 categories
A list of gazetteers

Improved English F-measure to 93.9
– (21% error reduction)

42

Effects of Unknown Words

Florian et al. note that German is harder:
It has more unknown words
All nouns are capitalized

43

Klein, Smarr, Nguyen, Manning

The standard approach for unknown words is to extract features like suffixes, prefixes, and capitalization
Idea: use all-character n-grams, rather than words, as the primary representation

Integrates unknown words seamlessly into the model
Improved the results of their classifier by 25%
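A sketch of the character n-gram representation; the padding symbol and n=3 are illustrative choices, not the paper's exact settings:

```python
def char_ngrams(word, n=3, pad="#"):
    """All character n-grams of a padded, lowercased word. An unknown
    word still shares n-grams with known words, which is the point of
    using them as the primary representation."""
    w = pad + word.lower() + pad
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("Grace"))  # -> ['#gr', 'gra', 'rac', 'ace', 'ce#']
```

Even if "Grace" was never seen in training, n-grams like "ace" and "ce#" were, so the model has evidence about the word's likely class.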

44

Balancing n-grams with Other Evidence

Example: "morning at Grace Road"
Need the classifiers to determine that "Grace" is part of a location rather than a person
Used a Conditional Markov Model (aka Maximum Entropy Model)
Also added other "shape" information:

"20-month" -> d-x
"Italy" -> Xx
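A sketch of a shape function producing the mappings above; collapsing runs of repeated symbols is the convention implied by the examples ("20" -> "d", "month" -> "x"):

```python
import re

def word_shape(word):
    """Collapse a token to its 'shape': X for uppercase letters, x for
    lowercase, d for digits; other characters are kept as-is, and runs
    of the same symbol are collapsed to a single symbol."""
    shape = ""
    for ch in word:
        if ch.isupper():
            shape += "X"
        elif ch.islower():
            shape += "x"
        elif ch.isdigit():
            shape += "d"
        else:
            shape += ch
    return re.sub(r"(.)\1+", r"\1", shape)

print(word_shape("20-month"))  # -> 'd-x'
print(word_shape("Italy"))     # -> 'Xx'
```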

45

46

47

Chieu and Ng

Used a Maximum Entropy approach
Estimates probabilities based on the principle of making as few assumptions as possible
But allows specification of constraints between features and outcomes (derived from training data)

Used a rich feature set, like those already discussed
Interesting additional features:

Lists derived from the training set
"Global" features: look at how the words appear elsewhere within the document

The paper doesn't say which of these features do well

48

Lists Derived from Training Data

UNI (useful unigrams):
Top 20 words that precede instances of each class
Computed using a correlation metric

UBI (useful bigrams): pairs of preceding words
CITY OF, ARRIVES IN
The bigram has a higher probability of preceding the class than the unigram alone
– CITY OF is better evidence than just OF

NCS (useful name class suffixes):
Tokens that frequently terminate a class
– INC, COMMITTEE
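A sketch of building a UNI-style list. The paper computes it with a correlation metric; plain frequency is used here as a simplification, and the toy tagged corpus is invented:

```python
from collections import Counter

def preceding_unigrams(tagged_corpus, target_class, top_k=20):
    """Collect the words that most often immediately precede tokens of
    target_class. (Simplification: frequency instead of the paper's
    correlation metric.)"""
    counts = Counter()
    for sent in tagged_corpus:           # sent: list of (word, tag) pairs
        for i in range(1, len(sent)):
            _, tag = sent[i]
            if tag == target_class:
                counts[sent[i - 1][0].lower()] += 1
    return [w for w, _ in counts.most_common(top_k)]

corpus = [
    [("city", "O"), ("of", "O"), ("Boston", "LOC")],
    [("arrives", "O"), ("in", "O"), ("Paris", "LOC")],
    [("flew", "O"), ("to", "O"), ("Rome", "LOC")],
    [("in", "O"), ("Madrid", "LOC")],
]
print(preceding_unigrams(corpus, "LOC"))  # 'in' ranks first (2 occurrences)
```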

49

Using Other Occurrences within the Document

Zone:
Where is the token from? (headline, author, body)

Unigrams:
Whether UNI holds for an occurrence of w elsewhere

Bigrams:
Whether UBI holds for an occurrence of w elsewhere

Suffix:
Whether NCS holds for an occurrence of w elsewhere

InitCaps:
A way to check whether a word is capitalized due to its position in the sentence or not. Also check the first word in a sequence of capitalized words.
– Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement.

50

MUC Redux

Task: fill slots of templates

MUC-4 (1992):
All systems were hand-engineered
One MUC-6 entry used learning; it failed miserably

51

52

MUC Redux

Fast forward 12 years … now use ML!
Chieu et al. show a machine learning approach that can do as well as most of the hand-engineered MUC-4 systems

Uses state-of-the-art components:
– Sentence segmenter
– POS tagger
– NER
– Statistical parser
– Co-reference resolution

Features look at syntactic context:
– Use subject-verb-object information
– Use head-words of NPs

Train classifiers for each slot type

Chieu, Hai Leong, Ng, Hwee Tou, & Lee, Yoong Keok (2003). Closing the Gap: Learning-Based Information Extraction Rivaling Knowledge-Engineering Methods. In Proceedings of ACL-03.

53

The best systems took 10.5 person-months of hand-coding!

54

IE Techniques: Summary

Machine learning approaches are doing well, even without comprehensive word lists
You can develop a pretty good starting list with a bit of web-page scraping

Features mainly have to do with the preceding and following tags, as well as syntax and word "shape"
The latter is somewhat language-dependent

With enough training data, results are getting pretty decent on well-defined entities
ML is the way of the future!

55

IE Tools

Research tools:

GATE
– http://gate.ac.uk/

MinorThird
– http://minorthird.sourceforge.net/

Alembic (only NE tagging)
– http://www.mitre.org/tech/alembic-workbench/

Commercial? I don't know which ones work well

56 Adapted from slides by Cunningham & Bontcheva

NE Annotation Tools - GATE

57 Adapted from slides by Cunningham & Bontcheva

NE Annotation Tools - Alembic

58 Adapted from slides by Cunningham & Bontcheva

NE Annotation Tools – Alembic (2)

59 Adapted from slides by Cunningham & Bontcheva

GATE

GATE – University of Sheffield's open-source infrastructure for language processing

Automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
Has a finite-state pattern-action rule language
Has an example rule-based system called ANNIE
ANNIE modified for MUC guidelines – 89.5% F-measure on the MUC-7 corpus

60 Adapted from slides by Cunningham & Bontcheva

NE Components

The ANNIE system – a reusable and easily extendable set of components

61 Adapted from slides by Cunningham & Bontcheva

Gate’s Named Entity Grammars

Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
Hand-coded rules applied to annotations to identify NEs
Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
Use of contextual information
Finds person names, locations, organisations, dates, addresses

62 Adapted from slides by Cunningham & Bontcheva

Named Entities in GATE

63 Adapted from slides by Cunningham & Bontcheva

Named Entity Coreference

