2
Today
Finish hand-built rule systems
Machine learning approaches to information extraction: sliding windows, rule learners (older), feature-based ML (more recent)
IE tools
Slide 3 (adapted from slides by Cunningham & Bontcheva)
Two kinds of NE approaches
Knowledge Engineering
- rule based, developed by experienced language engineers
- makes use of human intuition
- requires only a small amount of training data
- development can be very time-consuming
- some changes may be hard to accommodate
Learning Systems
- uses statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)
Slide 4 (adapted from slides by Cunningham & Bontcheva)
Baseline: list lookup approach
A system that recognises only entities stored in its lists (gazetteers).
Advantages: simple, fast, language independent, easy to retarget (just create lists).
Disadvantages: impossible to enumerate all names; collection and maintenance of the lists; cannot deal with name variants; cannot resolve ambiguity.
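The list-lookup baseline can be sketched in a few lines. The gazetteer entries and the greedy longest-match strategy below are illustrative assumptions, not a description of any particular system; note how it has exactly the listed weaknesses (no variants, no ambiguity resolution).

```python
# Hypothetical minimal gazetteer (list-lookup) NE tagger.
GAZETTEER = {
    ("New", "York"): "LOCATION",
    ("Carnegie", "Mellon", "University"): "ORGANIZATION",
    ("Paris",): "LOCATION",
}

def list_lookup(tokens):
    """Greedy longest-match lookup; returns (start, end, type) spans."""
    max_len = max(len(k) for k in GAZETTEER)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return spans

print(list_lookup("He moved from Paris to New York".split()))
# -> [(3, 4, 'LOCATION'), (5, 7, 'LOCATION')]
```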
Slide 5 (adapted from slides by Cunningham & Bontcheva)
Creating Gazetteer Lists
Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
Location lists:
- US GEOnet Names Server (GNS) data: 3.9 million locations with 5.37 million names (e.g. [Manov03])
- UN site: http://unstats.un.org/unsd/citydata
- Global Discovery database from Europa Technologies Ltd, UK (e.g. [Ignat03])
Automatic collection from annotated training data
Slide 6 (adapted from slides by Cunningham & Bontcheva)
Rule-based Example: FACILE
FACILE was used in MUC-7 [Black et al 98]. It uses Inxight's LinguistiX tools for tagging and morphological analysis, plus a database for external information, in a role similar to a gazetteer.
Linguistic info per token is encoded as a feature vector:
- Text offsets
- Orthographic pattern (first/all capitals, mixed, lowercase)
- Token and its normalised form
- Syntax: category and features
- Semantics: from the database or morphological analysis
- Morphological analyses
Example:
(1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
PER_CIV_F = female civilian (from the database)
Slide 7 (adapted from slides by Cunningham & Bontcheva)
FACILE
Context-sensitive rules written in special rule notation, executed by an interpreter
(Writing rules directly in Perl would be too error-prone and hard.)
Rules of the kind A => B\C/D, where:
- A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector
- B and D are the left and right context respectively, and can be empty
- B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
[syn=NP, sem=ORG] (0.9) =>\ [norm="university"],[token="of"],[sem=REGION|COUNTRY|CITY] / ;
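A toy interpreter for the "university of REGION|COUNTRY|CITY" rule above might look like the following sketch. The token feature vectors and the matching logic are simplified assumptions for illustration, not FACILE's actual data structures.

```python
# Toy interpreter for one FACILE-style rule:
#   [syn=NP, sem=ORG] (0.9) => \ [norm="university"],[token="of"],
#                                [sem=REGION|COUNTRY|CITY] / ;
def matches(constraint, tok):
    """A token satisfies a constraint if every attribute agrees."""
    return all(tok.get(k) in v if isinstance(v, set) else tok.get(k) == v
               for k, v in constraint.items())

RULE_BODY = [  # the C part; contexts B and D are empty in this rule
    {"norm": "university"},
    {"token": "of"},
    {"sem": {"REGION", "COUNTRY", "CITY"}},
]

def apply_rule(tokens):
    """Return (start, end, sem, score) for each match of the rule body."""
    hits = []
    for i in range(len(tokens) - len(RULE_BODY) + 1):
        if all(matches(c, t) for c, t in zip(RULE_BODY, tokens[i:i + 3])):
            hits.append((i, i + 3, "ORG", 0.9))
    return hits

toks = [{"token": "University", "norm": "university"},
        {"token": "of"},
        {"token": "Sheffield", "sem": "CITY"}]
print(apply_rule(toks))  # -> [(0, 3, 'ORG', 0.9)]
```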
Slide 8 (adapted from slides by Cunningham & Bontcheva)
FACILE
# Rule for the mark-up of person names when the first name is not
# present or known from the gazetteers: e.g. 'Mr J. Cass'
[SYN=PROP,SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S] #_F, _I, _M, _S are variables, transfer info from RHS
=> [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE] \
   [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
   [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
   [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
   [SYN=NAME, TOKEN=_M]?,
   [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]  # proper name, not recognised by a rule
   / ;
Slide 9 (adapted from slides by Cunningham & Bontcheva)
FACILE
Preference mechanism:
- The rule with the highest score is preferred
- Longer matches are preferred to shorter matches
- The result is always a single semantic categorisation of the named entity in the text
Evaluation (MUC-7 scores):
- Organization: 86% precision, 66% recall
- Person: 90% precision, 88% recall
- Location: 81% precision, 80% recall
- Dates: 93% precision, 86% recall
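For reference, MUC-style precision and recall pairs are usually summarised as F1, the harmonic mean of the two. A quick check on the figures above:

```python
# F1 as the harmonic mean of precision and recall: F1 = 2PR / (P + R).
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# FACILE's MUC-7 Person score: 90% precision, 88% recall
print(round(f1(0.90, 0.88), 3))  # -> 0.89
# Location: 81% precision, 80% recall
print(round(f1(0.81, 0.80), 3))  # -> 0.805
```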
Slide 11 (adapted from William Cohen's slides)
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g., looking for the seminar location.
Slides 12-14 (adapted from William Cohen's slides): the same announcement, with the window sliding over successive positions.
Slide 15 (adapted from William Cohen's slides)
A Naïve Bayes Sliding Window Model [Freitag 1997]
(Figure: "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …" with tokens labelled w_{t-m} … w_{t-1} | w_t … w_{t+n} | w_{t+n+1} … w_{t+n+m}, i.e. the prefix, contents, and suffix of the window.)
Other examples of sliding windows: [Baluja et al 2000] (a decision tree over individual words and their context).
- Estimate Pr(LOCATION | window) using Bayes' rule
- Try all "reasonable" windows (vary length, position)
- Assume independence for length, prefix words, suffix words, content words
- Estimate from data quantities like Pr("Place" in prefix | LOCATION)
- If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.
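The scoring step can be sketched as below, assuming independence of prefix, content, and suffix words given the class. The probability tables are made-up illustrative numbers, not values estimated from data as Freitag's system would do.

```python
import math

P_PREFIX = {"Place": 0.30, ":": 0.25}   # Pr(word in prefix | LOCATION)
P_CONTENT = {"Wean": 0.10, "Hall": 0.12, "Rm": 0.08, "5409": 0.01}
P_SUFFIX = {"Speaker": 0.20}
DEFAULT = 1e-4                          # crude smoothing for unseen words

def log_score(prefix, content, suffix):
    """Log Pr(window | LOCATION) under the independence assumption."""
    parts = ([(P_PREFIX, w) for w in prefix] +
             [(P_CONTENT, w) for w in content] +
             [(P_SUFFIX, w) for w in suffix])
    return sum(math.log(table.get(w, DEFAULT)) for table, w in parts)

s = log_score(["Place", ":"], ["Wean", "Hall", "Rm", "5409"], ["Speaker"])
# extract the window if s clears a tuned threshold
```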
Slide 16 (adapted from William Cohen's slides)
Naïve Bayes Sliding Window Results
(The same CMU seminar announcement as above, shown with the extracted fields highlighted.)
Domain: CMU UseNet Seminar Announcements
Field F1:
- Person Name: 30%
- Location: 61%
- Start Time: 98%
Slide 17 (adapted from William Cohen's slides)
SRV: a realistic sliding-window-classifier IE system
What windows to consider? All windows containing at least as many tokens as the shortest example, but no more tokens than the longest example.
How to represent a classifier? It might:
- Restrict the length of the window
- Restrict the vocabulary or formatting used before/after/inside the window
- Restrict the relative order of tokens
- Use inductive logic programming techniques to express all of these
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
[Freitag AAAI '98]
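The candidate-window policy above ("at least as many tokens as the shortest example, no more than the longest") is easy to enumerate; this sketch just generates the spans, leaving classification to a separate model.

```python
# Candidate-window generation: every span whose length lies between the
# shortest and longest training example.
def candidate_windows(tokens, min_len, max_len):
    return [(i, i + n) for n in range(min_len, max_len + 1)
            for i in range(len(tokens) - n + 1)]

# 10 tokens, spans of length 2-4: 9 + 8 + 7 candidates
print(len(candidate_windows(list(range(10)), 2, 4)))  # -> 24
```

Note the cost: the number of candidates grows with both text length and the spread of example lengths, which is why SRV restricts the windows it considers.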
Slide 18 (adapted from William Cohen's slides)
SRV: a rule-learner for sliding-window classification
Primitive predicates used by SRV: token(X,W), allLowerCase(W), numerical(W), …, nextToken(W,U), previousToken(W,V)
HTML-specific predicates: inTitleTag(W), inH1Tag(W), inEmTag(W), …
- emphasized(W) = "inEmTag(W) or inBTag(W) or …"
- tableNextCol(W,U) = "U is some token in the column after the column W is in"
- tablePreviousCol(W,V), tableRowHeader(W,T), …
19
Automatic Pattern-Learning Systems
(Diagram: at training time, language input plus answers feed a Trainer, which produces a Model; at run time, language input feeds a Decoder that uses the Model to produce answers.)
20
Automatic Pattern-Learning Systems
Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge not needed by those who supply the domain knowledge
Cons:
- Annotated training data, and lots of it, is needed
- Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas). These learn lexico-syntactic patterns from templates.
Slide 21 (adapted from Chris Manning's slides)
Rapier [Califf & Mooney, AAAI-99]
Rapier learns three regex-style patterns for each slot:
- Pre-filler pattern
- Filler pattern
- Post-filler pattern
Slide 22 (adapted from Chris Manning's slides)
Features for IE Learning Systems
Part of speech: the syntactic role of a specific word.
Semantic classes: synonyms or other related words, e.g.:
- "Price" class: price, cost, amount, …
- "Month" class: January, February, March, …, December
- "US State" class: Alaska, Alabama, …, Washington, Wyoming
WordNet: a large online thesaurus containing (among other things) semantic classes.
Slide 23 (adapted from Chris Manning's slides)
Rapier rule matching example
"…sold to the bank for an undisclosed amount…"
POS: vb pr det nn pr det jj nn
SClass: price
"…paid Honeywell an undisclosed price…"
POS: vb nnp det jj nn
SClass: price
Pre-filler:  1) tag: {nn,nnp}  2) list: length 2
Filler:      1) word: "undisclosed", tag: jj
Post-filler: 1) sem: price
Slide 24 (adapted from Chris Manning's slides)
Rapier Rules: Details
Rapier rule := pre-filler pattern, filler pattern, post-filler pattern
pattern := subpattern+
subpattern := constraint+
constraint :=
- Word: exact word that must be present
- Tag: matched word must have the given POS tag
- Class: semantic class of the matched word
- Disjunction can be specified with "{…}"
- List length N: between 0 and N words satisfying the other constraints
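Matching one such constraint against a token can be sketched as below. The dictionary-based constraint encoding and the tiny semantic-class table are illustrative assumptions, not Rapier's actual representation.

```python
# Toy semantic class lookup (WordNet-like classes would be used in practice).
SEM_CLASSES = {"price": {"price", "cost", "amount"}}

def constraint_ok(constraint, token, tag):
    """Check a Rapier-style constraint: word / tag / semantic class,
    each possibly a {...} disjunction encoded here as a set."""
    if "word" in constraint and token.lower() not in constraint["word"]:
        return False
    if "tag" in constraint and tag not in constraint["tag"]:
        return False
    if "sem" in constraint and token.lower() not in SEM_CLASSES[constraint["sem"]]:
        return False
    return True

# Filler pattern from the example above: word "undisclosed" with tag jj
c = {"word": {"undisclosed"}, "tag": {"jj"}}
print(constraint_ok(c, "undisclosed", "jj"))  # -> True
print(constraint_ok(c, "undisclosed", "nn"))  # -> False
```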
Slide 25 (adapted from Chris Manning's slides)
Rapier’s Learning Algorithm
Input: a set of training examples (a list of documents annotated with "extract this substring").
Output: a set of rules.
Init: Rules = a rule that exactly matches each training example.
Repeat several times:
- Seed: select M examples randomly and generate the K most-accurate maximally-general filler-only rules (pre-filler = post-filler = match anything)
- Grow: for N = 1, 2, 3, …, try to improve the K best rules by adding N context words of pre-filler or post-filler context
- Keep: Rules = Rules plus the best of the K rules, minus subsumed rules
Slide 26 (adapted from Chris Manning's slides)
Learning example (one iteration)
2 examples:
'… located in Atlanta, Georgia…'
'… offices in Kansas City, Missouri…'
(Figure: Init starts from maximally specific rules (high precision, low recall); Seed produces maximally general rules (low precision, high recall); Grow converges on an appropriately general rule (high precision, high recall).)
Slide 29 (adapted from William Cohen's slides)
Summary: Rule-learning approaches to sliding-window classification
SRV, Rapier, and WHISK [Soderland KDD ‘97]
- Representations for classifiers allow restriction of the relationships between tokens, etc.
- Representations are carefully chosen subsets of even more powerful representations
- Use of these "heavyweight" representations is complicated, but seems to pay off in results
30
Successors to MUC
CoNLL: Conference on Computational Natural Language Learning
- Different topics each year: 2000: chunking boundaries; 2001: identifying clauses in text; 2002, 2003: language-independent NER; 2004: semantic role recognition
- http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002, …)
- Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics.
ACE: Automated Content Extraction
- Entity detection and tracking
- Sponsored by NIST
- http://wave.ldc.upenn.edu/Projects/ACE/
Several others recently; see http://cnts.uia.ac.be/conll2003/ner/
31
CoNLL-2003
Goal: identify the boundaries and types of named entities: people, organizations, locations, miscellaneous.
Also: experiment with incorporating external resources (gazetteers) and unlabeled data.
Data: uses IOB notation, with 4 pieces of info for each term:
Word POS Chunk EntityType
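Reading such rows and recovering entity spans from the IOB tags can be sketched as follows; the sample rows are invented in the CoNLL-2003 style (word, POS, chunk, entity).

```python
# Collect (start, end, type) entity spans from IOB-tagged rows.
ROWS = [("U.N.", "NNP", "I-NP", "I-ORG"),
        ("official", "NN", "I-NP", "O"),
        ("Ekeus", "NNP", "I-NP", "I-PER"),
        ("heads", "VBZ", "I-VP", "O"),
        ("for", "IN", "I-PP", "O"),
        ("Baghdad", "NNP", "I-NP", "I-LOC")]

def iob_spans(rows):
    spans, current = [], None
    for i, (_, _, _, ent) in enumerate(rows):
        # close the open span on O, on an explicit B- boundary,
        # or when the entity type changes
        if ent == "O" or ent.startswith("B-") or \
           (current and ent[2:] != current[2]):
            if current:
                spans.append(current)
                current = None
        if ent != "O":
            if current is None:
                current = [i, i + 1, ent[2:]]   # start a new span
            else:
                current[1] = i + 1              # extend the open span
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

print(iob_spans(ROWS))  # -> [(0, 1, 'ORG'), (2, 3, 'PER'), (5, 6, 'LOC')]
```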
32
Details on Training/Test Sets
Reuters Newswire + European Corpus Initiative
Sang and De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Proceedings of CoNLL-2003
33
Summary of Results
16 systems participated.
Machine learning techniques:
- Combinations of maximum entropy models (5), hidden Markov models (4), and Winnow/perceptron (4)
- Used once each: support vector machines, conditional random fields, transformation-based learning, AdaBoost, and memory-based learning
- Combining techniques often worked well
Features:
- The choice of features is at least as important as the ML method
- Top-scoring systems used many types
- No one feature stands out as essential (other than the words themselves)
34
35
Use of External Information
- Improvement from using gazetteers vs. unlabeled data was nearly equal
- Gazetteers were less useful for German than for English (higher quality)
36
Precision, Recall, and F-Scores
(Table of per-system precision, recall, and F-scores; an asterisk marks results that are not significantly different.)
37
Combining Results
What happens if we combine the results of all of the systems?
Used a majority vote of 5 systems for each set.
- English: F = 90.30 (14% error reduction over the best single system)
- German: F = 74.17 (6% error reduction over the best single system)
Top four systems in more detail …
38
Zhang and Johnson
Experimented with the effects of different features.
Used a learning method they developed called Robust Risk Minimization (related to the Winnow method).
Used it to predict the class label t_i associated with each token w_i:
- Estimate P(t_i = c | x_i) for every possible class label c, where x_i is a feature vector associated with token i
- x_i can include information about previous tags
Found that relatively simple, language-independent features get you much of the way.
39
Zhang and Johnson
Simple features include:
- The tokens themselves, in a window of +/- 2
- The previous 2 predicted tags
- The conjunction of the previous tag and the current token
- Initial capitalization of tokens, in a window of +/- 2
More elaborate features include:
- Word "shape" information: initial caps, all caps, all digits, digits containing punctuation
- Token prefixes (length 3-4) and suffixes (length 1-4)
- POS and chunking info (chunk bag-of-words at the current token)
- Marked-up entities from the training data
- Other dictionaries
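The simple feature set can be sketched as a feature dictionary per token; the feature names and encoding below are invented for illustration, not Zhang & Johnson's actual feature templates.

```python
def features(tokens, i, prev_tags):
    """Feature dict for token i given the previously predicted tags:
    +/-2 token window, previous 2 tags, tag-token conjunction,
    capitalization, and affix features."""
    t = tokens[i]
    f = {"w0": t,
         "prefix3": t[:3], "suffix4": t[-4:],      # affix features
         "initcap0": t[:1].isupper(),
         "prev_tags": tuple(prev_tags[-2:]),       # previous 2 tags
         "prevtag+w0": (prev_tags[-1] if prev_tags else "<s>", t)}
    for d in (-2, -1, 1, 2):                       # window of +/- 2
        if 0 <= i + d < len(tokens):
            f[f"w{d}"] = tokens[i + d]
            f[f"initcap{d}"] = tokens[i + d][:1].isupper()
    return f

feats = features("rally in New York today".split(), 2, ["O", "O"])
print(feats["w0"], feats["w1"], feats["initcap0"])  # -> New York True
```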
41
Florian, Ittycheria, Jing, Zhang
Combined four machine learning algorithms; the best-performing was the Zhang & Johnson RRM.
Voting algorithm:
- Giving them all equal-weight votes worked well
- So did using the RRM algorithm to choose among them
- English F-measure went from 89.94 to 91.63
Did well with the supplied features; did even better with some complex additional features:
- The output of 2 other NER systems, trained on 1.7M annotated words in 32 categories
- A list of gazetteers
These improved the English F-measure to 93.9 (a 21% error reduction).
42
Effects of Unknown Words
Florian et al. note that German is harder:
- It has more unknown words
- All nouns are capitalized
43
Klein, Smarr, Nguyen, Manning
The standard approach for unknown words is to extract features like suffixes, prefixes, and capitalization.
Idea: use all character n-grams, rather than words, as the primary representation.
- Integrates unknown words seamlessly into the model
- Improved the results of their classifier by 25%
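Extracting all character n-grams of a token is straightforward; the n-gram range and boundary markers below are illustrative choices, not necessarily those of Klein et al.

```python
# Represent a token by the set of all its character n-grams (n = 2..4),
# with <...> as boundary markers so affixes are distinguishable.
def char_ngrams(token, n_min=2, n_max=4):
    s = f"<{token}>"
    return {s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)}

grams = char_ngrams("Grace")
print("<G" in grams, "ce>" in grams)  # -> True True
```

An unknown word like "Graceland" shares many of these n-grams with known words, which is how the representation handles unseen vocabulary.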
44
Balancing n-grams with Other Evidence
Example: "morning at Grace Road". The classifier needs to determine that "Grace" is part of a location rather than a person.
- Used a conditional Markov model (aka maximum entropy model)
- Also added other "shape" information:
  "20-month" -> d-x
  "Italy" -> Xx
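A word-shape encoding matching the two examples above can be sketched by collapsing character runs; the exact collapsing rules are an illustrative assumption.

```python
import re

def shape(token):
    """Collapse a token to a coarse shape: X = uppercase run,
    x = lowercase run, d = digit run; other characters kept as-is."""
    s = re.sub(r"[A-Z]+", "X", token)
    s = re.sub(r"[a-z]+", "x", s)
    return re.sub(r"[0-9]+", "d", s)

print(shape("20-month"))  # -> d-x
print(shape("Italy"))     # -> Xx
```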
47
Chieu and Ng
Used a maximum entropy approach:
- Estimates probabilities based on the principle of making as few assumptions as possible
- But allows specification of constraints between features and outcomes (derived from training data)
Used a rich feature set, like those already discussed. Interesting additional features:
- Lists derived from the training set
- "Global" features: look at how the words appear elsewhere within the document
The paper doesn't say which of these features do well.
48
Lists Derived from Training Data
UNI (useful unigrams):
- The top 20 words that precede instances of that class
- Computed using a correlation metric
UBI (useful bigrams): pairs of preceding words
- e.g. CITY OF, ARRIVES IN
- The bigram has a higher probability of preceding the class than the unigram alone (CITY OF is better evidence than just OF)
NCS (useful name-class suffixes): tokens that frequently terminate a class
- e.g. INC, COMMITTEE
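Deriving a UNI-style list from tagged training data can be sketched as below. Plain frequency counting is used here as a simplification of the correlation metric mentioned above, and the toy corpus is invented.

```python
from collections import Counter

# (word, tag) pairs from a toy training set
TRAIN = [("the", "O"), ("city", "O"), ("of", "O"), ("Boston", "LOC"),
         ("arrives", "O"), ("in", "O"), ("Chicago", "LOC")]

def uni_list(pairs, cls, top=20):
    """Most frequent words immediately preceding tokens of class cls."""
    counts = Counter(prev for (prev, _), (_, tag)
                     in zip(pairs, pairs[1:]) if tag == cls)
    return [w for w, _ in counts.most_common(top)]

print(uni_list(TRAIN, "LOC"))  # the words preceding LOC tokens
```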
49
Using Other Occurrences within the Document
Zone: where is the token from? (headline, author, body)
Unigrams: whether UNI holds for an occurrence of w elsewhere
Bigrams: whether UBI holds for an occurrence of w elsewhere
Suffix: whether NCS holds for an occurrence of w elsewhere
InitCaps: a way to check whether a word is capitalized due to its position in the sentence or not; also check the first word in a sequence of capitalized words.
- Example: "Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement." (here "Even" is capitalized only because it begins the sentence)
50
MUC Redux
Task: fill the slots of templates.
MUC-4 (1992):
- All systems were hand-engineered
- One MUC-6 entry used learning; it failed miserably
52
MUC Redux
Fast forward 12 years … now use ML!
Chieu et al. show a machine learning approach that can do as well as most of the hand-engineered MUC-4 systems.
It uses state-of-the-art components:
- Sentence segmenter
- POS tagger
- NER
- Statistical parser
- Co-reference resolution
Features look at syntactic context:
- Use subject-verb-object information
- Use the head words of NPs
Classifiers are trained for each slot type.
Chieu, Hai Leong, Ng, Hwee Tou, & Lee, Yoong Keok (2003). Closing the Gap: Learning-Based Information Extraction Rivaling Knowledge-Engineering Methods, In (ACL-03).
54
IE Techniques: Summary
- Machine learning approaches are doing well, even without comprehensive word lists
- You can develop a pretty good starting list with a bit of web-page scraping
- Features mainly have to do with the preceding and following tags, as well as syntax and word "shape" (the latter is somewhat language dependent)
- With enough training data, results are getting pretty decent on well-defined entities
- ML is the way of the future!
55
IE Tools
Research tools:
- GATE: http://gate.ac.uk/
- MinorThird: http://minorthird.sourceforge.net/
- Alembic (NE tagging only): http://www.mitre.org/tech/alembic-workbench/
Commercial? I don't know which ones work well.
Slide 59 (adapted from slides by Cunningham & Bontcheva)
GATE
GATE is the University of Sheffield's open-source infrastructure for language processing.
- Automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- Has a finite-state pattern-action rule language
- Includes an example rule-based system called ANNIE
- ANNIE, modified for the MUC guidelines, achieved an 89.5% F-measure on the MUC-7 corpus
Slide 60 (adapted from slides by Cunningham & Bontcheva)
NE Components
The ANNIE system: a reusable and easily extendable set of components
Slide 61 (adapted from slides by Cunningham & Bontcheva)
Gate’s Named Entity Grammars
- Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
- Hand-coded rules are applied to annotations to identify NEs
- Annotations come from the format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
- Makes use of contextual information
- Finds person names, locations, organisations, dates, and addresses