ANLP Fall 2004 1 Marti Hearst
Computational Linguistics
James Pustejovsky
Brandeis University
COSCI 35
April 7, 2006
What is Computational Linguistics?
Computational Linguistics is the computational analysis of natural
languages.
Process information contained in natural language.
Can machines understand human language?
Define ‘understand’
Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.
Goals of this Lecture
Learn about the problems and possibilities of natural
language analysis:
What are the major issues?
What are the major solutions?
At the end you should:
Agree that language is subtle and interesting!
Know about some of the algorithms.
Know how difficult it can be!
It’s 2006, but we’re not anywhere close
to realizing the dream (or nightmare…) of 2001
Dave Bowman: “Open the pod bay doors.”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
Dave Bowman: “Open the pod bay doors, please, HAL.”
Why is NLP difficult?
Computers are not brains
There is evidence that much of language
understanding is built-in to the human brain
Computers do not socialize
Much of language is about communicating with people
Key problems:
Representation of meaning
Language presupposes knowledge about the world
Language only reflects the surface of meaning
Language presupposes communication between people
Hidden Structure
English plural pronunciation
Toy + s → toyz ; add z
Book + s → books ; add s
Church + s → churchiz ; add iz
Box + s → boxiz ; add iz
Sheep + s → sheep ; add nothing
What about new words?
Bach + ‘s → Bachs ; why not Bachiz?
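The pluralization pattern above can be sketched in Python. This is a toy, spelling-based approximation (real phonology operates on sounds, not letters); the ending lists and the irregular table are illustrative assumptions.

```python
# Toy sketch of English plural pronunciation, keyed off spelling.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")   # take "iz"
VOICELESS_ENDINGS = ("p", "t", "k", "f")         # take "s"
IRREGULAR = {"sheep": "sheep"}                   # memorized forms take precedence

def plural_pronunciation(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith(SIBILANT_ENDINGS):
        return word + "iz"   # church -> churchiz
    if word.endswith(VOICELESS_ENDINGS):
        return word + "s"    # book -> books
    return word + "z"        # toy -> toyz
```

Note how the memorized (irregular) form is checked before the rules, mirroring the rules-plus-memorization mechanism discussed later in these slides.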
Language subtleties
Adjective order and placement
A big black dog
A big black scary dog
A big scary dog
A scary big dog
A black big dog
Antonyms
Which sizes go together?
– Big and little
– Big and small
– Large and small
– Large and little
World Knowledge is subtle
He arrived at the lecture.
He chuckled at the lecture.
He arrived drunk.
He chuckled drunk.
He chuckled his way through the lecture.
He arrived his way through the lecture.
Words are ambiguous (have multiple meanings)
I know that.
I know that block.
I know that blocks the sun.
I know that block blocks the sun.
Headline Ambiguity
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
Ban on nude dancing on Governor’s desk
Local high school dropouts cut in half
The Role of Memorization
Children learn words quickly
As many as 9 words/day
Often only need one exposure to associate meaning
with word
– Can make mistakes, e.g., overgeneralization
“I goed to the store.”
Exactly how they do this is still under study
The Role of Memorization
Dogs can do word association too!
Rico, a border collie in Germany
Knows the names of each of 100 toys
Can retrieve items called out to him with over 90%
accuracy.
Can also learn and remember the names of
unfamiliar toys after just one encounter, putting
him on a par with a three-year-old child.
http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
But there is too much to memorize!
establish
establishment
– the establishment of the Church of England as the official state church
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
– a political philosophy that is opposed to the separation of church and state
Rules and Memorization
Current thinking in psycholinguistics is that we use a combination of rules and memorization
However, this is very controversial
Mechanism:
If there is an applicable rule, apply it
However, if there is a memorized version, that takes precedence. (Important for irregular words.)
– Artists paint “still lifes”
  Not “still lives”
– Past tense of
  think → thought
  blink → blinked
This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
Representation of Meaning
I know that block blocks the sun.
How do we represent the meanings of “block”?
How do we represent “I know”?
How does that differ from “I know that.”?
Who is “I”?
How do we indicate that we are talking about earth’s sun
vs. some other planet’s sun?
When did this take place? What if I move the block?
What if I move my viewpoint? How do we represent this?
How to tackle these problems?
The field was stuck for quite some time.
A new approach started around 1990. Well, not really new; the first time around, in the ’50s, they didn’t have the text, disk space, or GHz.
Main idea: combine memorizing and rules
How to do it: Get large text collections (corpora)
Compute statistics over the words in those collections
Surprisingly effective. Even better now with the Web.
Corpus-based Example: Pre-Nominal Adjective Ordering
Important for translation and generation
Examples: big fat Greek wedding
fat Greek big wedding
Some approaches try to characterize this with semantic rules, e.g.:
Age < color, value < dimension
Data-intensive approaches: Assume adjective ordering is independent of the noun they modify
Compare how often you see {a, b} vs {b, a}
Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04
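The count-comparison idea can be sketched in a few lines of Python. The corpus counts below are invented for illustration, not real corpus or web counts:

```python
from collections import Counter

# Toy "corpus" of observed pre-nominal adjective pairs, in the order seen.
corpus = [
    ("big", "fat"), ("big", "fat"), ("big", "greek"),
    ("fat", "greek"), ("fat", "big"),
]
counts = Counter(corpus)

def preferred_order(a, b):
    # Compare how often we saw <a, b> vs <b, a>; prefer the more frequent order.
    return (a, b) if counts[(a, b)] >= counts[(b, a)] else (b, a)

print(preferred_order("fat", "big"))  # ('big', 'fat')
```

With web-scale counts instead of a toy list, this is essentially the simple baseline the slide attributes to Keller and Lapata.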
Corpus-based Example: Pre-Nominal Adjective Ordering
Data-intensive approaches: Compare how often you see {a, b} vs {b, a}
What happens when you encounter an unseen pair?
– Shaw and Hatzivassiloglou ’99 use transitive closures
– Malouf ’00 uses a back-off bigram model
  P(<a,b>|{a,b}) vs. P(<b,a>|{a,b})
  He also uses morphological analysis, semantic similarity calculations, and positional probabilities
Keller and Lapata ’04 use just the very simple algorithm
– But they use the web as their training set
– Gets 90% accuracy on 1000 sequences
– As good as or better than the complex algorithms
Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04
Real-World Applications of NLP
Spelling Suggestions/Corrections
Grammar Checking
Synonym Generation
Information Extraction
Text Categorization
Automated Customer Service
Speech Recognition (limited)
Machine Translation
In the (near?) future:
Question Answering
Improving Web Search Engine results
Automated Metadata Assignment
Online Dialogs
Synonym Generation
Levels of Language
Sound Structure (Phonetics and Phonology)
The sounds of speech and their production
The systematic way that sounds are realized differently in different environments
Word Structure (Morphology)
From morphos = shape (not transform, as in morph)
Analyzes how words are formed from minimal units of meaning; also derivational rules
– dog + s = dogs; eat, eats, ate
Phrase Structure (Syntax)
From the Greek syntaxis, arrange together
Describes grammatical arrangements of words into hierarchical structure
Levels of Language
Thematic Structure
Getting closer to meaning
Who did what to whom – Subject, object, predicate
Semantic Structure
How the lower levels combine to convey meaning
Pragmatics and Discourse Structure
How language is used across sentences.
Parsing at Every Level
Transforming from a surface representation to an underlying
representation
It’s not straightforward to do any of these mappings!
Ambiguity at every level
– Word: is “saw” a verb or noun?
– Phrase: “I saw the guy on the hill with the telescope.”
� Who is on the hill?
– Semantic: which hill?
Tokens and Types
The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.
To avoid confusion use more precise terminology:
1. Word token: an occurrence of a word
2. Word Type: a vocabulary item
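The distinction is easy to check in Python for the example sentence:

```python
# Count word tokens vs word types in "my dog likes his dog".
sentence = "my dog likes his dog"
tokens = sentence.split()   # each occurrence is a token
types = set(tokens)         # each distinct vocabulary item is a type

print(len(tokens))  # 5 tokens
print(len(types))   # 4 types ("dog" occurs twice)
```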
Tokenization (continued)
Tokenization is harder than it seems
I’ll see you in New York.
The aluminum-export ban.
The simplest approach is to use “graphic words” (i.e., separate
words using whitespace)
Another approach is to use regular expressions to specify which
substrings are valid words.
NLTK provides a generic tokenization interface: TokenizerI
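As a sketch (plain Python, not NLTK’s actual TokenizerI interface), here is a regular-expression tokenizer that handles the examples above better than whitespace splitting; the pattern itself is an illustrative choice, not a standard one:

```python
import re

# Words (optionally hyphenated), clitics like 'll, or single punctuation marks.
PATTERN = re.compile(r"\w+(?:-\w+)*|'\w+|[^\w\s]")

def tokenize(text):
    return PATTERN.findall(text)

print(tokenize("I'll see you in New York."))
# vs. the "graphic words" baseline:
print("I'll see you in New York.".split())
```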
Terminology
Tagging
The process of associating labels with each token in a text
Tags
The labels
Tag Set
The collection of tags used for a particular task
Example
Typically a tagged text is a sequence of white-space separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
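Reading this base/tag format back in is a short sketch: split on whitespace, then split each token at the last "/" (the last, so a token like ,/, survives):

```python
# Parse a whitespace-separated base/tag text into (word, tag) pairs.
tagged = "The/at interior/nn ,/, is/bez truly/ql majestic/jj"

pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
words = [w for w, t in pairs]
tags = [t for w, t in pairs]

print(words)  # ['The', 'interior', ',', 'is', 'truly', 'majestic']
print(tags)   # ['at', 'nn', ',', 'bez', 'ql', 'jj']
```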
What does Tagging do?
1. Collapses distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
2. Introduces distinctions
• Ambiguities may be removed
• e.g., deal tagged with NN or VB
• e.g., deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction
Significance of Parts of Speech
A word’s POS tells us a lot about the word and its
neighbors:
Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind)
Helps in stemming
Limits the range of following words for Speech Recognition
Can help select nouns from a document for IR
Basis for partial parsing (chunked parsing)
Parsers can build trees directly on the POS tags instead of maintaining a lexicon
Choosing a tagset
The choice of tagset greatly affects the difficulty of
the problem
Need to strike a balance between
Getting better information about context (best:
introduce more distinctions)
Make it possible for classifiers to do their job (need
to minimize distinctions)
Some of the best-known Tagsets
Brown corpus: 87 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags
The Brown Corpus
The first digital corpus (1961)
Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long
From American books, newspapers, magazines
Representing genres:
– Science fiction, romance fiction, press reportage, scientific writing, popular lore
Penn Treebank
First syntactically annotated corpus
1 million words from Wall Street Journal
Part of speech tags and syntax trees
How hard is POS tagging?
Number of tags   Number of word types
1                35,340
2                3,760
3                264
4                61
5                12
6                2
7                1
In the Brown corpus:
– 11.5% of word types are ambiguous
– but 40% of word TOKENS are
Important Penn Treebank tags
Verb inflection tags
The entire Penn Treebank tagset
Tagging methods
Hand-coded
Statistical taggers
Brill (transformation-based) tagger
Default Tagger
We need something to use for unseen words
E.g., guess NNP for a word with an initial capital
How to do this?
Apply a sequence of regular expression tests
Assign the word to a suitable tag
If there are no matches…
Assign the most frequent tag for unknown words, NN
– Other common ones are verb, proper noun, adjective
Note the role of closed-class words in English
– Prepositions, auxiliaries, etc.
– New ones do not tend to appear.
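A minimal sketch of such a default tagger; the regular-expression tests below are illustrative guesses, not an authoritative rule set:

```python
import re

# Ordered regular-expression tests; the first match wins, so the
# initial-capital rule must come before the plural "-s" rule.
RULES = [
    (re.compile(r"^[A-Z]\w*$"), "NNP"),    # initial capital -> proper noun
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),  # numbers
    (re.compile(r"^\w+ing$"), "VBG"),      # -ing forms
    (re.compile(r"^\w+s$"), "NNS"),        # plural nouns
]

def default_tag(word):
    for pattern, tag in RULES:
        if pattern.match(word):
            return tag
    return "NN"  # no match: most frequent tag for unknown words

print(default_tag("Paris"))    # NNP
print(default_tag("running"))  # VBG
print(default_tag("blah"))     # NN
```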
Training vs. Testing
A fundamental idea in computational linguistics
Start with a collection labeled with the right answers (supervised learning)
Usually the labels are done by hand
“Train” or “teach” the algorithm on a subset of the labeled text.
Test the algorithm on a different set of data.
Why?
– If memorization worked, we’d be done.
– Need to generalize so the algorithm works on examples that you haven’t seen yet.
– Thus testing only makes sense on examples you didn’t train on.
Evaluating a Tagger
Tagged tokens – the original data
Untag (exclude) the data
Tag the data with your own tagger
Compare the original and new tags
Iterate over the two lists, checking for identity and counting
Accuracy = fraction correct
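The evaluation loop above amounts to a few lines of Python:

```python
# Tagger evaluation: compare gold tags with predicted tags token by token
# and report the fraction that agree.
def accuracy(gold_tags, predicted_tags):
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags)

print(accuracy(["DT", "NN", "VBD"], ["DT", "NN", "NN"]))  # 2 of 3 correct
```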
Language Modeling
Another fundamental concept in NLP
Main idea:
For a given language, some words are more likely than others to follow each other, or
You can predict (with some degree of accuracy) the probability that a given word will follow another word.
N-Grams
The N stands for how many terms are used
Unigram: 1 term
Bigram: 2 terms
Trigram: 3 terms
– Usually don’t go beyond this
You can use different kinds of terms, e.g.:
Character-based n-grams
Word-based n-grams
POS-based n-grams
Ordering
Often adjacent, but not required
We use n-grams to help determine the context in which some linguistic phenomenon happens.
E.g., look at the words before and after the period to see if it is the end of a sentence or not.
Features and Contexts
w_{n-2}  w_{n-1}  w_n  w_{n+1}
t_{n-2}  t_{n-1}  t_n  t_{n+1}
(the middle item is the FEATURE; the surrounding items are the CONTEXT)
Unigram Tagger
Trained using a tagged corpus to determine which tags are most
common for each word.
E.g. in tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
Performance is highly dependent on the quality of its training set.
Can’t be too small
Can’t be too different from texts we actually want to tag
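A unigram tagger of this kind can be trained in a few lines; the toy training data below mimics the “deal” counts mentioned above:

```python
from collections import Counter, defaultdict

# Train a unigram tagger: for each word, remember its most common tag
# in the tagged training corpus.
def train_unigram(tagged_corpus):
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Toy training data mimicking the WSJ counts for "deal" (11 NN, 1 VB).
train = [("deal", "NN")] * 11 + [("deal", "VB")] + [("the", "DT")] * 5
model = train_unigram(train)
print(model["deal"])  # NN
```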
Nth-Order Tagging
Order refers to how much context is used
It’s one less than the N in N-gram here because we use the target word itself as part of the context.
– 0th order = unigram tagger
– 1st order = bigrams
– 2nd order = trigrams
Bigram tagger
For tagging, in addition to considering the token’s type, the context also considers the tags of the n preceding tokens
What is the most likely tag for w_n, given w_{n-1} and t_{n-1}?
The tagger picks the tag which is most likely for that context.
Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater
P(race|VB)
P(race|NN)
Actual estimate from the Switchboard corpus:
P(race|NN) = .00041
P(race|VB) = .00003
Rule-Based Tagger
The Linguistic Complaint
Where is the linguistic knowledge of a tagger?
Just a massive table of numbers
Aren’t there any linguistic insights that could emerge from the data?
Could instead use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.
The Brill tagger
An example of TRANSFORMATION-BASED LEARNING
Very popular (freely available, works fairly well)
A SUPERVISED method: requires a tagged corpus
Basic idea: do a quick job first (using frequency), then revise it using contextual rules
Brill Tagging: In more detail
Start with simple (less accurate) rules…learn better ones from
tagged corpus
Tag each word initially with most likely POS
Examine set of transformations to see which improves tagging decisions compared to tagged corpus
Re-tag corpus using best transformation
Repeat until, e.g., performance doesn’t improve
Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text
An example
Examples:
It is expected to race tomorrow.
The race for outer space.
Tagging algorithm:
1. Tag all uses of “race” as NN (most likely tag in the Brown corpus)
• It is expected to race/NN tomorrow
• the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all
uses of “race” preceded by the tag TO:
• It is expected to race/VB tomorrow
• the race/NN for outer space
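The transformation in step 2 can be sketched as a simple rewrite over an initially tagged sequence (a simplified illustration of the idea, not the Brill tagger’s actual rule format):

```python
# Apply one transformation rule: change from_tag to to_tag whenever the
# previous token's tag is prev_tag.
def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

# Initial tagging gave "race" the most likely tag NN; the rule fixes it.
sent = [("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_transformation(sent, "NN", "VB", "TO"))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```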
Transformation-based learning in the Brill tagger
1. Tag the corpus with the most likely tag for each word
2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that
a. first tags using unigrams
b. then applies the learned transformations in order
Examples of learned transformations
Templates
Probabilities in Language Modeling
A fundamental concept in NLP
Main idea:
For a given language, some words are more likely than others to follow each other, or
You can predict (with some degree of accuracy) the probability that a given word will follow another word.
Next Word Prediction
From a NY Times story...
Stocks ...
Stocks plunged this ….
Stocks plunged this morning, despite a cut in interest rates
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall Street
began trading for the first time since last …
Stocks plunged this morning, despite a cut in
interest rates by the Federal Reserve, as Wall Street
began trading for the first time since last Tuesday's
terrorist attacks.
Human Word Prediction
Clearly, at least some of us have the ability to predict
future words in an utterance.
How?
Domain knowledge
Syntactic knowledge
Lexical knowledge
Claim
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques
In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)
Applications
Why do we want to predict a word, given some preceding words?
Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR
Theatre owners say popcorn/unicorn sales have doubled...
Assess the likelihood/goodness of a sentence – for text generation or machine translation.
The doctor recommended a cat scan.
El doctor recomendó una exploración del gato.
N-Gram Models of Language
Use the previous N-1 words in a sequence to predict the next word
Language Model (LM): unigrams, bigrams, trigrams, …
How do we train these models? Very large corpora
Simple N-Grams
Assume a language has V word types in its lexicon; how likely is word x to follow word y?
Simplest model of word probability: 1/V
Alternative 1: estimate likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
popcorn is more likely to occur than unicorn
Alternative 2: condition the likelihood of x occurring in the context of previous words (bigrams, trigrams,…)
mythical unicorn is more likely than mythical popcorn
A Word on Notation
P(unicorn)
Read this as “The probability of seeing the token unicorn”
Unigram tagger uses this.
P(unicorn|mythical)
Called the Conditional Probability.
Read this as “The probability of seeing the token unicorn given that you’ve seen the token mythical”
Bigram tagger uses this.
Related to the conditional frequency distributions that we’ve been working with.
Computing the Probability of a Word Sequence
Compute the product of component conditional probabilities?
P(the mythical unicorn) =
P(the) P(mythical|the) P(unicorn|the mythical)
The longer the sequence, the less likely we are to find it in a training corpus
P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)
Solution: approximate using n-grams
Bigram Model
Approximate P(unicorn|the mythical) by P(unicorn|mythical)
Markov assumption:
The probability of a word depends only on the probability of a limited history
Generalization:
The probability of a word depends only on the probability of the n previous words
– trigrams, 4-grams, …
– the higher n is, the more data needed to train
– backoff models
P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-1})
Using N-Grams
For N-gram models:
P(w_{n-1}, w_n) = P(w_n | w_{n-1}) P(w_{n-1})
By the Chain Rule we can decompose a joint probability, e.g. P(w_1, w_2, w_3):
P(w_1, w_2, …, w_n) = P(w_1 | w_2, w_3, …, w_n) P(w_2 | w_3, …, w_n) … P(w_{n-1} | w_n) P(w_n)
For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:
P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)
With the Markov approximation:
P(w_1 … w_n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
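A sketch of these bigram estimates in Python, using maximum-likelihood counts from a tiny invented corpus with <s> as the start symbol:

```python
from collections import Counter

# Toy corpus; <s> marks sentence starts. Counts below are maximum-likelihood
# estimates: P(cur | prev) = count(prev, cur) / count(prev).
corpus = "<s> the mythical unicorn <s> the mythical beast".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sequence_probability(words):
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# P(the|<s>) * P(mythical|the) * P(unicorn|mythical) = 1 * 1 * 0.5
print(sequence_probability(["<s>", "the", "mythical", "unicorn"]))  # 0.5
```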
Training and Testing
N-Gram probabilities come from a training corpus
overly narrow corpus: probabilities don't generalize
overly general corpus: probabilities don't reflect task or domain
A separate test corpus is used to evaluate the model, typically using standard metrics
held out test set; development test set
cross validation
results tested for statistical significance
Shallow (Chunk) Parsing
Goal: divide a sentence into a sequence of chunks.
Chunks are non-overlapping regions of a text
[I] saw [a tall man] in [the park].
Chunks are non-recursive
A chunk can not contain other chunks
Chunks are non-exhaustive
Not all words are included in chunks
Chunk Parsing Examples
Noun-phrase chunking: [I] saw [a tall man] in [the park].
Verb-phrase chunking: The man who [was in the park] [saw me].
Prosodic chunking: [I saw] [a tall man] [in the park].
Question answering: What [Spanish explorer] discovered [the Mississippi River]?
Shallow Parsing: Motivation
Locating information
e.g., text retrieval
– Index a document collection on its noun phrases
Ignoring information
Generalize in order to study higher-level patterns
– e.g. phrases involving “gave” in Penn treebank:
  gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NP
Sometimes a full parse has too much structure
– Too nested
– Chunks usually are not recursive
Representation: BIO (or IOB)
Trees
Comparison with Full Syntactic Parsing
Parsing is usually an intermediate stage
Builds structures that are used by later stages of processing
Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks
Parsing often provides more information than we need
Shallow parsing is an easier problem
Less word-order flexibility within chunks than between chunks
More locality:
– Fewer long-range dependencies
– Less context-dependence
– Less ambiguity
Chunks and Constituency
Constituents: [[a tall man] [ in [the park]]].
Chunks: [a tall man] in [the park].
A constituent is part of some higher unit in the hierarchical syntactic parse
Chunks are not constituents
Constituents are recursive
But, chunks are typically subsequences of constituents
Chunks do not cross major constituent boundaries
Chunking
Define a regular expression that matches the sequences of tags in a chunk
A simple noun phrase chunk regexp (note that <NN.*> matches any tag starting with NN):
<DT>? <JJ>* <NN.*>
Chunk all matching subsequences:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
If matching subsequences overlap, the first one gets priority
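A sketch of this tag-sequence matching in plain Python (not NLTK’s chunker API), using the NP regexp above:

```python
import re

# Chunk NPs by matching the tag sequence <DT>? <JJ>* <NN.*> over a string
# encoding of the tags; non-overlapping matches, leftmost first.
def np_chunk(tagged):
    tag_string = "".join("<%s>" % tag for word, tag in tagged)
    chunks = []
    for m in re.finditer(r"(<DT>)?(<JJ>)*<NN[^>]*>", tag_string):
        start = tag_string[:m.start()].count("<")   # index of first tag covered
        end = start + m.group().count("<")          # one past last tag covered
        chunks.append([w for w, t in tagged[start:end]])
    return chunks

sent = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
        ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(np_chunk(sent))  # [['the', 'little', 'cat'], ['the', 'mat']]
```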
Unchunking
Remove any chunk with a given pattern
e.g., unChunkRule(‘<NN|DT>+’, ‘Unchunk NNDT’)
Combine with chunk rule <NN|DT|JJ>+
Chunk all matching subsequences:
Input:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
Apply chunk rule
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
Apply unchunk rule:
[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
Chinking
A chink is a subsequence of the text that is not a chunk.
Define a regular expression that matches the sequences of tags in a chink
A simple chink regexp for finding NP chunks:
(<VB.?>|<IN>)+
First apply chunk rule to chunk everything
Input: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
ChunkRule(‘<.*>+’, ‘Chunk everything’)
[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]
Apply chink rule above:
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
(Chunk / Chink / Chunk)
Merging
Combine adjacent chunks into a single chunk
Define a regular expression that matches the sequences of tags on both sides of the point to be merged
Example: Merge a chunk ending in JJ with a chunk starting with NN
MergeRule(‘<JJ>’, ‘<NN>’, ‘Merge adjs and nouns’)
[the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
Splitting is the opposite of merging
Applying Chunking to Treebank Data
Classifying at Different Granularies
Text Categorization:
Classify an entire document
Information Extraction (IE):
Identify and classify small units within documents
Named Entity Extraction (NE):
A subset of IE
Identify and classify proper names
– People, locations, organizations
Example Problem: Looking for a Job
(Diagram: Martin Baker, a person, seeking a genomics job; employers’ job posting forms)
What is Information Extraction
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a “cancer” that stifled technological innovation.
Today, Microsoft claims to “love” the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers.
“We can be open source. We love the concept of shared source,” said Bill Veghte, a Microsoft VP. “That’s a super-important shift for us in terms of code access.”
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME    TITLE    ORGANIZATION
What is Information Extraction
Filling slots in a database from sub-segments of text. As a task:
(same news excerpt as above)
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
What is Information Extraction
Information Extraction = segmentation + classification + association
As a family of techniques:
(same news excerpt as above, with the extracted segments highlighted)
Highlighted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka “named entity extraction”
IE in Context
(Pipeline diagram:)
Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query/Search, Data mine
Supporting steps: Create ontology; Filter by relevance; Label training data; Train extraction models
Landscape of IE Tasks:Degree of Formatting
Grammatical sentencesand some formatting & links
Text paragraphswithout formatting
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Non-grammatical snippets,rich formatting & links Tables
Landscape of IE Tasks: Intended Breadth of Coverage
Web site specific (driven by formatting): Amazon.com book pages
Genre specific (driven by layout): Resumes
Wide, non-specific (driven by language): University names
Landscape of IE Tasks: Complexity
Closed set (e.g., U.S. states):
  He was born in Alabama…
  The big Wyoming sky…
Regular set (e.g., U.S. phone numbers):
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
Complex pattern (e.g., U.S. postal addresses):
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
  … was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
Landscape of IE Tasks: Single Field vs. Record
Single entity (“named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Example text: Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
State of the Art Performance: a sample
Named entity recognition from newswire text (Person, Location, Organization, …)
  F1 in the high 80’s or low- to mid-90’s
Binary relation extraction, e.g., Contained-in(Location1, Location2), Member-of(Person1, Organization1)
  F1 in the 60’s, 70’s, or 80’s
Web site structure recognition
  Extremely accurate performance obtainable
  Human effort (~10 min?) required on each site
Three generations of IE systems
Hand-Built Systems – Knowledge Engineering [1980s– ]
Rules written by hand
Require experts who understand both the systems and the domain
Iterative guess-test-tweak-repeat cycle
Automatic, Trainable Rule-Extraction Systems [1990s– ]
Rules discovered automatically using predefined templates, using automated rule learners
Require huge, labeled corpora (effort is just moved!)
Statistical Models [1997 – ]
Use machine learning to learn which features indicate boundaries and types of entities.
Learning usually supervised; may be partially unsupervised
Landscape of IE Techniques
Any of these models can be used to capture words, formatting or both.
Lexicons: test set membership, e.g., against a list of U.S. states (Alabama, Alaska, …, Wisconsin, Wyoming)
  Abraham Lincoln was born in Kentucky. → member?
Classify pre-segmented candidates: a classifier asks “which class?” for each candidate
  Abraham Lincoln was born in Kentucky.
Sliding window: a classifier asks “which class?” for each window; try alternate window sizes
  Abraham Lincoln was born in Kentucky.
Boundary models: classifiers predict BEGIN and END boundaries
  Abraham Lincoln was born in Kentucky.
Context-free grammars: most likely parse?
  Abraham Lincoln was born in Kentucky.
  (Parse tree over tags NNP NNP V V P NNP, grouped into NP, PP, VP, VP, S)
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
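The “most likely state sequence” question is standardly answered with the Viterbi algorithm. A tiny sketch with made-up states and probabilities (the numbers and the two-state tagset are illustrative assumptions):

```python
# Viterbi decoding for a small HMM: find the most likely state (tag) sequence
# for an observation (word) sequence.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 0.0), p)
                for p in states)
            best[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["NNP", "VBD"]
start_p = {"NNP": 0.8, "VBD": 0.2}
trans_p = {"NNP": {"NNP": 0.3, "VBD": 0.7}, "VBD": {"NNP": 0.6, "VBD": 0.4}}
emit_p = {"NNP": {"Lincoln": 0.9}, "VBD": {"was": 0.8}}
print(viterbi(["Lincoln", "was"], states, start_p, trans_p, emit_p))
# ['NNP', 'VBD']
```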