
Computational Linguistics - Brandeis University

Page 1: Computational Linguistics - Brandeis University

ANLP Fall 2004 1 Marti Hearst

Computational Linguistics

James Pustejovsky

Brandeis University

COSCI 35

April 7, 2006

What is Computational Linguistics?

Computational Linguistics is the computational analysis of natural languages.

Process information contained in natural language.

Can machines understand human language?

Define ‘understand’

Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.


Goals of this Lecture

Learn about the problems and possibilities of natural language analysis:

What are the major issues?

What are the major solutions?

At the end you should:

Agree that language is subtle and interesting!

Know about some of the algorithms.

Know how difficult it can be!

It’s 2006, but we’re not anywhere close to realizing the dream (or nightmare…) of 2001.


Dave Bowman: “Open the pod bay doors.”

HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

Dave Bowman: “Open the pod bay doors, please, HAL.”

Why is NLP difficult?

Computers are not brains

There is evidence that much of language understanding is built in to the human brain

Computers do not socialize

Much of language is about communicating with people

Key problems:

Representation of meaning

Language presupposes knowledge about the world

Language only reflects the surface of meaning

Language presupposes communication between people


Hidden Structure

English plural pronunciation

Toy + s → toyz ; add z

Book + s → books ; add s

Church + s → churchiz ; add iz

Box + s → boxiz ; add iz

Sheep + s → sheep ; add nothing

What about new words?

Bach + ‘s → Bachs ; why not Bachiz?
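The rule-plus-exception pattern above can be sketched as a toy lookup. This is only an illustration: the sound classes are supplied by hand here, whereas a real system would need a pronunciation dictionary.

```python
def plural_sound(word, final_sound):
    """Toy sketch of English plural pronunciation rules.

    `final_sound` is a hand-supplied phonetic class ("sibilant",
    "voiceless", or "voiced") -- an assumption standing in for a
    real pronunciation dictionary.
    """
    irregular = {"sheep": "sheep"}          # memorized exceptions win
    if word in irregular:
        return irregular[word]
    if final_sound == "sibilant":           # church, box -> add "iz"
        return word + "iz"
    if final_sound == "voiceless":          # book -> add "s"
        return word + "s"
    return word + "z"                       # toy (voiced) -> add "z"
```

Note how the memorized entry is checked before any rule applies, mirroring the precedence discussed later in these slides.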

Language subtleties

Adjective order and placement

A big black dog

A big black scary dog

A big scary dog

A scary big dog

A black big dog

Antonyms

Which sizes go together?

– Big and little

– Big and small

– Large and small

Large and little


World Knowledge is subtle

He arrived at the lecture.

He chuckled at the lecture.

He arrived drunk.

He chuckled drunk.

He chuckled his way through the lecture.

He arrived his way through the lecture.

Words are ambiguous (have multiple meanings)

I know that.

I know that block.

I know that blocks the sun.

I know that block blocks the sun.


Headline Ambiguity

Iraqi Head Seeks Arms

Juvenile Court to Try Shooting Defendant

Teacher Strikes Idle Kids

Kids Make Nutritious Snacks

British Left Waffles on Falkland Islands

Red Tape Holds Up New Bridges

Bush Wins on Budget, but More Lies Ahead

Hospitals are Sued by 7 Foot Doctors

Ban on nude dancing on Governor’s desk

Local high school dropouts cut in half

The Role of Memorization

Children learn words quickly

As many as 9 words/day

Often only need one exposure to associate meaning with word

– Can make mistakes, e.g., overgeneralization

“I goed to the store.”

Exactly how they do this is still under study


The Role of Memorization

Dogs can do word association too!

Rico, a border collie in Germany

Knows the names of each of 100 toys

Can retrieve items called out to him with over 90% accuracy.

Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html

But there is too much to memorize!

establish

establishment
– the Church of England as the official state church

disestablishment

antidisestablishment

antidisestablishmentarian

antidisestablishmentarianism
– a political philosophy that is opposed to the separation of church and state


Rules and Memorization

Current thinking in psycholinguistics is that we use a combination of rules and memorization

However, this is very controversial

Mechanism:

If there is an applicable rule, apply it

However, if there is a memorized version, that takes precedence. (Important for irregular words.)

– Artists paint “still lifes”
– Not “still lives”

– Past tense of:
– think → thought
– blink → blinked

This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.

Representation of Meaning

I know that block blocks the sun.

How do we represent the meanings of “block”?

How do we represent “I know”?

How does that differ from “I know that.”?

Who is “I”?

How do we indicate that we are talking about earth’s sun vs. some other planet’s sun?

When did this take place? What if I move the block?

What if I move my viewpoint? How do we represent this?


How to tackle these problems?

The field was stuck for quite some time.

A new approach started around 1990. Well, not really new, but the first time around, in the 50’s, they didn’t have the text, disk space, or GHz.

Main idea: combine memorizing and rules

How to do it: Get large text collections (corpora)

Compute statistics over the words in those collections

Surprisingly effective. Even better now with the Web.

Corpus-based Example: Pre-Nominal Adjective Ordering

Important for translation and generation

Examples: big fat Greek wedding

fat Greek big wedding

Some approaches try to characterize this as semantic rules, e.g.:

Age < color, value < dimension

Data-intensive approaches: Assume adjective ordering is independent of the noun they modify

Compare how often you see {a, b} vs {b, a}

Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04


Corpus-based Example: Pre-Nominal Adjective Ordering

Data-intensive approaches: Compare how often you see {a, b} vs {b, a}

What happens when you encounter an unseen pair?

– Shaw and Hatzivassiloglou ’99 use transitive closures

– Malouf ’00 uses a back-off bigram model

– P(<a,b>|{a,b}) vs. P(<b,a>|{a,b})

– He also uses morphological analysis, semantic similarity calculations, and positional probabilities

Keller and Lapata ’04 use just the very simple algorithm

– But they use the web as their training set

– Gets 90% accuracy on 1000 sequences

– As good as or better than the complex algorithms

Keller & Lapata, “The Web as a Baseline”, HLT-NAACL ’04
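The count-and-compare idea can be sketched in a few lines. The tiny "corpus" below is made up for illustration; Keller & Lapata used web counts instead.

```python
from collections import Counter

# Hypothetical tiny corpus standing in for web-scale counts.
corpus = "big fat wedding . fat big cake . big fat dog . big red ball".split()

# Count how often each adjacent word pair occurs.
bigrams = Counter(zip(corpus, corpus[1:]))

def order(a, b):
    """Pick whichever ordering of the two adjectives was seen more often."""
    return (a, b) if bigrams[(a, b)] >= bigrams[(b, a)] else (b, a)
```

Counter returns 0 for unseen pairs, so the comparison degrades gracefully; handling pairs unseen in *either* order is where the transitive-closure and back-off ideas above come in.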

Real-World Applications of NLP

Spelling Suggestions/Corrections

Grammar Checking

Synonym Generation

Information Extraction

Text Categorization

Automated Customer Service

Speech Recognition (limited)

Machine Translation

In the (near?) future:

Question Answering

Improving Web Search Engine results

Automated Metadata Assignment

Online Dialogs


Synonym Generation



Synonym Generation

Levels of Language

Sound Structure (Phonetics and Phonology): The sounds of speech and their production

The systematic way that sounds are differently realized in different environments.

Word Structure (Morphology): From morphos = shape (not transform, as in morph)

Analyzes how words are formed from minimal units of meaning; also derivational rules

– dog + s = dogs; eat, eats, ate

Phrase Structure (Syntax): From the Greek syntaxis, arrange together

Describes grammatical arrangements of words into hierarchical structure


Levels of Language

Thematic Structure

Getting closer to meaning

Who did what to whom: subject, object, predicate

Semantic Structure

How the lower levels combine to convey meaning

Pragmatics and Discourse Structure

How language is used across sentences.

Parsing at Every Level

Transforming from a surface representation to an underlying representation

It’s not straightforward to do any of these mappings!

Ambiguity at every level

– Word: is “saw” a verb or noun?

– Phrase: “I saw the guy on the hill with the telescope.”

– Who is on the hill?

– Semantic: which hill?


Tokens and Types

The term word can be used in two different ways:

1. To refer to an individual occurrence of a word

2. To refer to an abstract vocabulary item

For example, the sentence “my dog likes his dog” contains five occurrences of words, but four vocabulary items.

To avoid confusion use more precise terminology:

1. Word token: an occurrence of a word

2. Word Type: a vocabulary item
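The distinction is easy to see in code:

```python
sentence = "my dog likes his dog"

tokens = sentence.split()   # word tokens: individual occurrences
types = set(tokens)         # word types: distinct vocabulary items
```

Five tokens, four types, since "dog" occurs twice.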

Tokenization (continued)

Tokenization is harder than it seems

I’ll see you in New York.

The aluminum-export ban.

The simplest approach is to use “graphic words” (i.e., separate words using whitespace)

Another approach is to use regular expressions to specify which substrings are valid words.

NLTK provides a generic tokenization interface: TokenizerI
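A minimal regular-expression tokenizer in this spirit might look like the following. The pattern is purely illustrative, not NLTK's actual `TokenizerI` implementation: it keeps contractions and hyphenated words together and splits off punctuation.

```python
import re

# One alternation: a word (optionally extended by -'/ pieces, so
# "I'll" and "aluminum-export" stay whole), or a single punctuation mark.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Return all non-whitespace tokens matched by TOKEN, in order."""
    return TOKEN.findall(text)
```

Compare this with the "graphic words" approach: `"New York.".split()` would leave the period glued to "York".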


Terminology

Tagging

The process of associating labels with each token in a text

Tags

The labels

Tag Set

The collection of tags used for a particular task

Example

Typically a tagged text is a sequence of white-space separated base/tag tokens:

The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
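Reading such base/tag text back into (word, tag) pairs is a one-liner. Splitting on the *last* slash is a defensive choice, in case a base form itself contains one.

```python
# A short excerpt of whitespace-separated base/tag tokens.
tagged = "The/at Pantheon's/np interior/nn ,/, is/bez majestic/jj ./."

# rsplit("/", 1) splits each token at its last slash into [base, tag].
pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
```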


What does Tagging do?

1. Collapses Distinctions

• Lexical identity may be discarded

• e.g. all personal pronouns tagged with PRP

2. Introduces Distinctions

• Ambiguities may be removed

• e.g. deal tagged with NN or VB

• e.g. deal tagged with DEAL1 or DEAL2

3. Helps classification and prediction

Significance of Parts of Speech

A word’s POS tells us a lot about the word and its neighbors:

Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind)

Helps in stemming

Limits the range of following words for Speech Recognition

Can help select nouns from a document for IR

Basis for partial parsing (chunked parsing)

Parsers can build trees directly on the POS tags instead of maintaining a lexicon


Choosing a tagset

The choice of tagset greatly affects the difficulty of the problem

Need to strike a balance between

Getting better information about context (best: introduce more distinctions)

Making it possible for classifiers to do their job (need to minimize distinctions)

Some of the best-known Tagsets

Brown corpus: 87 tags

Penn Treebank: 45 tags

Lancaster UCREL C5 (used to tag the BNC): 61 tags

Lancaster C7: 145 tags


The Brown Corpus

The first digital corpus (1961)

Francis and Kucera, Brown University

Contents: 500 texts, each 2000 words long

From American books, newspapers, magazines

Representing genres:

– Science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank

First syntactically annotated corpus

1 million words from Wall Street Journal

Part of speech tags and syntax trees


How hard is POS tagging?

In the Brown corpus:
– 11.5% of word types are ambiguous
– 40% of word tokens are ambiguous

Number of tags   Number of word types
1                35,340
2                3,760
3                264
4                61
5                12
6                2
7                1

Important Penn Treebank tags


Verb inflection tags

The entire Penn Treebank tagset


Tagging methods

Hand-coded

Statistical taggers

Brill (transformation-based) tagger

Default Tagger

We need something to use for unseen words, e.g., guess NNP for a word with an initial capital.

How to do this? Apply a sequence of regular expression tests

Assign the word to a suitable tag

If there are no matches, assign the most frequent tag for unknown words, NN

– Other common ones are verb, proper noun, adjective

Note the role of closed-class words in English

– Prepositions, auxiliaries, etc.

– New ones do not tend to appear.
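The sequence-of-regular-expression-tests idea can be sketched as follows. The patterns, their order, and the tag choices here are illustrative, not a standard rule set.

```python
import re

# Tests are tried in order; the first match wins.
RULES = [
    (re.compile(r"^[A-Z]"), "NNP"),    # initial capital -> proper noun
    (re.compile(r".*ing$"), "VBG"),    # -ing ending -> gerund
    (re.compile(r".*s$"), "NNS"),      # -s ending -> plural noun
]

def default_tag(word):
    """Guess a tag for an unseen word via regexp tests; fall back to NN."""
    for pattern, tag in RULES:
        if pattern.match(word):
            return tag
    return "NN"                        # the most frequent tag for unknowns
```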


Training vs. Testing

A fundamental idea in computational linguistics

Start with a collection labeled with the right answers (supervised learning)

Usually the labels are done by hand

“Train” or “teach” the algorithm on a subset of the labeled text.

Test the algorithm on a different set of data. Why?

– If memorization worked, we’d be done.

– Need to generalize so the algorithm works on examples that you haven’t seen yet.

– Thus testing only makes sense on examples you didn’t train on.

Evaluating a Tagger

Tagged tokens – the original data

Untag (exclude) the data

Tag the data with your own tagger

Compare the original and new tags: iterate over the two lists, checking for identity and counting

Accuracy = fraction correct
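The evaluation loop above amounts to:

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Per the training-vs-testing point above, `gold` and `predicted` must come from data the tagger was *not* trained on.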


Language Modeling

Another fundamental concept in NLP

Main idea:

For a given language, some words are more likely than others to follow each other, or

You can predict (with some degree of accuracy) the probability that a given word will follow another word.

N-Grams: the N stands for how many terms are used

Unigram: 1 term

Bigram: 2 terms

Trigrams: 3 terms

– Usually don’t go beyond this

You can use different kinds of terms, e.g.: character-based n-grams

Word-based n-grams

POS-based n-grams

Ordering: often adjacent, but not required

We use n-grams to help determine the context in which some linguistic phenomenon happens.

E.g., look at the words before and after the period to see if it is the end of a sentence or not.


Features and Contexts

w_n-2     w_n-1     w_n     w_n+1
t_n-2     t_n-1     t_n     t_n+1

CONTEXT | FEATURE | CONTEXT

Unigram Tagger

Trained using a tagged corpus to determine which tags are most common for each word.

E.g. in tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1 time, and with VBP 1 time

Performance is highly dependent on the quality of its training set.

Can’t be too small

Can’t be too different from texts we actually want to tag
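A unigram tagger is essentially a table of per-word tag counts. The tiny training list below is made up for illustration, echoing the "deal" example only in spirit, not with real WSJ counts.

```python
from collections import Counter, defaultdict

# Illustrative (word, tag) training pairs -- not real corpus data.
training = [("deal", "NN"), ("deal", "NN"), ("deal", "VB"), ("a", "DT")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def unigram_tag(word, default="NN"):
    """Most common training tag for the word, else the default tag."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default
```

The `default` parameter is where a default tagger (previous slide) would plug in for unseen words.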


Nth Order Tagging

Order refers to how much context is used.

It’s one less than the N in N-gram here because we use the target word itself as part of the context.

– 0th order = unigram tagger
– 1st order = bigrams
– 2nd order = trigrams

Bigram tagger: For tagging, in addition to considering the token’s type, the context also considers the tags of the n preceding tokens

What is the most likely tag for w_n, given w_n-1 and t_n-1?

The tagger picks the tag which is most likely for that context.

Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race given its lexical frequency

Solution: we choose the tag that has the greater probability:

P(race|VB) vs. P(race|NN)

Actual estimate from the Switchboard corpus:

P(race|NN) = .00041

P(race|VB) = .00003


Rule-Based Tagger

The Linguistic Complaint: Where is the linguistic knowledge of a tagger?

Just a massive table of numbers

Aren’t there any linguistic insights that could emerge from the data?

Could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.

The Brill tagger

An example of TRANSFORMATION-BASED LEARNING

Very popular (freely available, works fairly well)

A SUPERVISED method: requires a tagged corpus

Basic idea: do a quick job first (using frequency), then revise it using contextual rules


Brill Tagging: In more detail

Start with simple (less accurate) rules… learn better ones from tagged corpus

Tag each word initially with most likely POS

Examine set of transformations to see which improves tagging decisions compared to tagged corpus

Re-tag corpus using best transformation

Repeat until, e.g., performance doesn’t improve

Result: tagging procedure (ordered list of transformations) which can be applied to new, untagged text

An example

Examples:

It is expected to race tomorrow.

The race for outer space.

Tagging algorithm:

1. Tag all uses of “race” as NN (most likely tag in the Brown corpus)

• It is expected to race/NN tomorrow

• the race/NN for outer space

2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:

• It is expected to race/VB tomorrow

• the race/NN for outer space
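The single transformation in this example can be written out directly. This is a sketch of one learned rule, not the Brill tagger's actual implementation.

```python
def apply_transformation(tagged):
    """Retag "race" from NN to VB whenever the preceding tag is TO."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if word == "race" and tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")
    return out

# "to race" gets repaired; "the race" is left alone.
sent = [("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
```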


Transformation-based learning in the Brill tagger

1. Tag the corpus with the most likely tag for each word

2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate

3. Apply that transformation to the training corpus

4. Repeat

5. Return a tagger that:

a. first tags using unigrams

b. then applies the learned transformations in order

Examples of learned transformations


Templates

Probabilities in Language Modeling

A fundamental concept in NLP

Main idea:

For a given language, some words are more likely than others to follow each other, or

You can predict (with some degree of accuracy) the probability that a given word will follow another word.


Next Word Prediction

From a NY Times story...

Stocks ...

Stocks plunged this ….

Stocks plunged this morning, despite a cut in interest rates

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last …

Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.


Human Word Prediction

Clearly, at least some of us have the ability to predict future words in an utterance.

How?

Domain knowledge

Syntactic knowledge

Lexical knowledge

Claim

A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques

In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)


Applications

Why do we want to predict a word, given some preceding words?

Rank the likelihood of sequences containing various alternative hypotheses, e.g. for ASR

Theatre owners say popcorn/unicorn sales have doubled...

Assess the likelihood/goodness of a sentence, for text generation or machine translation.

The doctor recommended a cat scan.

El doctor recomendó una exploración del gato.

N-Gram Models of Language

Use the previous N-1 words in a sequence to predict the next word

Language Model (LM): unigrams, bigrams, trigrams, …

How do we train these models? Very large corpora


Simple N-Grams

Assume a language has V word types in its lexicon; how likely is word x to follow word y?

Simplest model of word probability: 1/V

Alternative 1: estimate likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)

popcorn is more likely to occur than unicorn

Alternative 2: condition the likelihood of x occurring in the context of previous words (bigrams, trigrams,…)

mythical unicorn is more likely than mythical popcorn

A Word on Notation

P(unicorn)

Read this as “The probability of seeing the token unicorn”

Unigram tagger uses this.

P(unicorn|mythical)

Called the Conditional Probability.

Read this as “The probability of seeing the token unicorn given that you’ve seen the token mythical”

Bigram tagger uses this.

Related to the conditional frequency distributions that we’ve been working with.


Computing the Probability of a Word Sequence

Compute the product of component conditional probabilities?

P(the mythical unicorn) =

P(the) P(mythical|the) P(unicorn|the mythical)

The longer the sequence, the less likely we are to find it in a training corpus

P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)

Solution: approximate using n-grams

Bigram Model

Approximate by

P(unicorn|the mythical) by P(unicorn|mythical)

Markov assumption:

The probability of a word depends only on the probability of a limited history

Generalization:

The probability of a word depends only on the probability of the n previous words

– trigrams, 4-grams, …

– the higher n is, the more data needed to train

– backoff models

P(w_n | w_1 … w_n-1) ≈ P(w_n | w_n-1)


Using N-Grams

For N-gram models:

P(w_n-1, w_n) = P(w_n | w_n-1) P(w_n-1)

By the Chain Rule we can decompose a joint probability, e.g. P(w_1, w_2, w_3):

P(w_1, w_2, …, w_n) = P(w_1 | w_2, …, w_n) P(w_2 | w_3, …, w_n) … P(w_n-1 | w_n) P(w_n)

For bigrams then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:

P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)

P(w_1 … w_n) ≈ ∏ k=1..n P(w_k | w_k-1)
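That bigram product can be computed directly. The probabilities below are invented for illustration; real values would come from corpus counts.

```python
# Hypothetical bigram probabilities keyed by (previous word, word).
bigram_p = {
    ("<start>", "the"): 0.1,
    ("the", "mythical"): 0.01,
    ("mythical", "unicorn"): 0.2,
}

def sequence_probability(words):
    """Multiply the bigram probabilities along the sequence."""
    p = 1.0
    prev = "<start>"
    for w in words:
        p *= bigram_p.get((prev, w), 0.0)  # unseen bigram -> probability 0
        prev = w
    return p
```

The unseen-bigram-means-zero behavior is exactly why smoothing and back-off models matter in practice.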

Training and Testing

N-Gram probabilities come from a training corpus

overly narrow corpus: probabilities don't generalize

overly general corpus: probabilities don't reflect task or domain

A separate test corpus is used to evaluate the model, typically using standard metrics

held out test set; development test set

cross validation

results tested for statistical significance


Shallow (Chunk) Parsing

Goal: divide a sentence into a sequence of chunks.

Chunks are non-overlapping regions of a text

[I] saw [a tall man] in [the park].

Chunks are non-recursive

A chunk can not contain other chunks

Chunks are non-exhaustive

Not all words are included in chunks

Chunk Parsing Examples

Noun-phrase chunking: [I] saw [a tall man] in [the park].

Verb-phrase chunking: The man who [was in the park] [saw me].

Prosodic chunking:

[I saw] [a tall man] [in the park].

Question answering: What [Spanish explorer] discovered [the Mississippi River]?


Shallow Parsing: Motivation

Locating information

e.g., text retrieval

– Index a document collection on its noun phrases

Ignoring information

Generalize in order to study higher-level patterns

– e.g. phrases involving “gave” in Penn treebank:

– gave NP; gave up NP in NP; gave NP up; gave NP help; gave NP to NP

Sometimes a full parse has too much structure

– Too nested

– Chunks usually are not recursive

Representation: BIO (or IOB)

Trees


Comparison with Full Syntactic Parsing

Parsing is usually an intermediate stage

Builds structures that are used by later stages of processing

Full parsing is a sufficient but not necessary intermediate stage for many NLP tasks

Parsing often provides more information than we need

Shallow parsing is an easier problem

Less word-order flexibility within chunks than between chunks

More locality:

– Fewer long-range dependencies

– Less context-dependence

– Less ambiguity

Chunks and Constituency

Constituents: [[a tall man] [ in [the park]]].

Chunks: [a tall man] in [the park].

A constituent is part of some higher unit in the hierarchical syntactic parse

Chunks are not constituents

Constituents are recursive

But, chunks are typically subsequences of constituents

Chunks do not cross major constituent boundaries


Chunking

Define a regular expression that matches the sequences of tags in a chunk

A simple noun phrase chunk regexp (note that <NN.*> matches any tag starting with NN):

<DT>? <JJ>* <NN.*>

Chunk all matching subsequences:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

If matching subsequences overlap, the first one gets priority
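The matching idea can be hand-rolled as a left-to-right scan for the DT? JJ* NN+ pattern. NLTK's chunkers do this declaratively over tag patterns and trees; this sketch only shows the mechanics, and it allows a run of nouns where the slide's pattern shows one.

```python
def np_chunk(tagged):
    """Group [DT? JJ* NN*+] subsequences of (word, tag) pairs into chunks.

    Scanning left to right gives earlier matches priority, as in the
    overlap rule above.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":              # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                             # found at least one noun
            chunks.append(tagged[i:k])
            i = k
        else:                                 # no noun here; not a chunk
            i += 1
    return chunks
```

On the slide's sentence this brackets [the little cat] and [the mat], leaving "sat" and "on" unchunked.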

Unchunking

Remove any chunk with a given pattern, e.g., unChunkRule(‘<NN|DT>+’, ‘Unchunk NNDT’)

Combine with Chunk Rule <NN|DT|JJ>+

Chunk all matching subsequences. Input:

the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

Apply chunk rule

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]

Apply unchunk rule

[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN


Chinking

A chink is a subsequence of the text that is not a chunk.

Define a regular expression that matches the sequences of tags in a chink

A simple chink regexp for finding NP chunks:

(<VB.?>|<IN>)+

First apply chunk rule to chunk everything

Input: the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN

ChunkRule(‘<.*>+’, ‘Chunk everything’)

[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]

Apply chink rule above:

[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
        Chunk                 Chink         Chunk

Merging

Combine adjacent chunks into a single chunk. Define a regular expression that matches the sequences of tags on both sides of the point to be merged.

Example: Merge a chunk ending in JJ with a chunk starting with NN

MergeRule(‘<JJ>’, ‘<NN>’, ‘Merge adjs and nouns’)

[the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN

[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN

Splitting is the opposite of merging


Applying Chunking to Treebank Data


Classifying at Different Granularies

Text Categorization:

Classify an entire document

Information Extraction (IE):

Identify and classify small units within documents

Named Entity Extraction (NE):

A subset of IE

Identify and classify proper names

– People, locations, organizations


Example: The Problem:

Looking for a Job

Martin Baker, a person

Genomics job

Employers job posting form

What is Information Extraction

Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a “cancer” that stifled technological innovation.

Today, Microsoft claims to “love” the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

“We can be open source. We love the concept of shared source,” said Bill Veghte, a Microsoft VP. “That's a super-important shift for us in terms of code access.”

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION


What is Information Extraction

Filling slots in a database from sub-segments of text. As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a “cancer” that stifled technological innovation.

Today, Microsoft claims to “love” the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

“We can be open source. We love the concept of shared source,” said Bill Veghte, a Microsoft VP. “That's a super-important shift for us in terms of code access.”

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE

What is Information Extraction

Information Extraction = segmentation + classification + association

As a family of techniques:


Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

aka “named entity extraction”


IE in Context

[Pipeline diagram: a spider crawls a document collection; documents are filtered by relevance; IE (segment, classify, associate, cluster) loads a database; the database supports query/search and data mining. Supporting steps: create ontology, label training data, train extraction models.]

Landscape of IE Tasks: Degree of Formatting

Text paragraphs without formatting, e.g.:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

Grammatical sentences and some formatting & links

Non-grammatical snippets, rich formatting & links

Tables


Landscape of IE Tasks: Intended Breadth of Coverage

Web site specific (formatting): Amazon.com Book Pages
Genre specific (layout): Resumes
Wide, non-specific (language): University Names

Landscape of IE Tasks: Complexity

Closed set (U.S. states):
He was born in Alabama…
The big Wyoming sky…

Regular set (U.S. phone numbers):
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299

Complex pattern (U.S. postal addresses):
University of Arkansas, P.O. Box 140, Hope, AR 71802
Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (person names):
… was among the six houses sold by Hope Feldman that year.
Pawel Opalinski, Software Engineer at WhizBang Labs.
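For a regular set such as the phone numbers above, a single regular expression already captures most cases (a sketch; the toy pattern here would need many more variants in practice):

```python
import re

# Toy pattern for the "regular set" of U.S. phone numbers shown above.
phone_re = re.compile(r'\(?\d{3}\)?[ -]\d{3}-\d{4}')

texts = ['Phone: (413) 545-1323',
         'The CALD main office can be reached at 412-268-1299']
matches = [phone_re.search(t).group() for t in texts]
# matches -> ['(413) 545-1323', '412-268-1299']
```

Closed sets reduce to lexicon lookup, while ambiguous patterns (Hope as a person vs. a town) need context no regex can supply.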


Landscape of IE Tasks: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
Person: Jack Welch
Person: Jeffrey Immelt
Location: Connecticut

Binary relationship:
Relation: Person-Title (Person: Jack Welch; Title: CEO)
Relation: Company-Location (Company: General Electric; Location: Connecticut)

N-ary record:
Relation: Succession (Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt)
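The growing arity shows up directly in the records such a system emits; a sketch of the three output shapes as plain dictionaries (the field names follow the slide, the dictionary encoding is an assumption):

```python
# Single entity ("named entity" extraction): one field.
entity = {'Person': 'Jack Welch'}

# Binary relationship: two fields plus the relation name.
person_title = {'relation': 'Person-Title',
                'Person': 'Jack Welch',
                'Title': 'CEO'}

# N-ary record: a full template with several slots.
succession = {'relation': 'Succession',
              'Company': 'General Electric',
              'Title': 'CEO',
              'Out': 'Jack Welch',
              'In': 'Jeffrey Immelt'}
```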

State of the Art Performance: a sample

Named entity recognition from newswire text (Person, Location, Organization, …):
F1 in high 80's or low- to mid-90's

Binary relation extraction (Contained-in(Location1, Location2), Member-of(Person1, Organization1)):
F1 in 60's or 70's or 80's

Web site structure recognition:
Extremely accurate performance obtainable, but human effort (~10 min?) required on each site
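The F1 scores quoted here are the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.93 and recall 0.89 give an F1 in the low 90's:
score = round(f1(0.93, 0.89), 2)  # -> 0.91
```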


Three generations of IE systems

Hand-Built Systems – Knowledge Engineering [1980s– ]

Rules written by hand

Require experts who understand both the systems and the domain

Iterative guess-test-tweak-repeat cycle

Automatic, Trainable Rule-Extraction Systems [1990s– ]

Rules discovered automatically from predefined templates by automated rule learners

Require huge, labeled corpora (effort is just moved!)

Statistical Models [1997 – ]

Use machine learning to learn which features indicate boundaries and types of entities.

Learning usually supervised; may be partially unsupervised

Landscape of IE Techniques

Any of these models can be used to capture words, formatting, or both.

Lexicons: is a candidate span a member of a list?
Lexicon: Alabama, Alaska, …, Wisconsin, Wyoming
Abraham Lincoln was born in Kentucky. → member?
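Lexicon lookup is just set membership over tokens (or candidate spans); a minimal sketch with an abridged state list:

```python
# A toy lexicon: the closed set of U.S. state names (abridged).
states = {'Alabama', 'Alaska', 'Wisconsin', 'Wyoming', 'Kentucky'}

tokens = 'Abraham Lincoln was born in Kentucky .'.split()
labels = ['US-STATE' if tok in states else 'O' for tok in tokens]
# labels -> ['O', 'O', 'O', 'O', 'O', 'US-STATE', 'O']
```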

Classify pre-segmented candidates: a classifier assigns a class to each given span.
Abraham Lincoln was born in Kentucky. → Classifier: which class?

Sliding window: a classifier scans every window of text, trying alternate window sizes.
Abraham Lincoln was born in Kentucky. → Classifier: which class?
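A sliding-window extractor runs a classifier over every window of every size; a minimal sketch (the hard-coded classify function stands in for a trained model):

```python
def classify(window):
    # Stand-in for a trained classifier: a hard-coded lookup.
    return 'PERSON' if window == ('Abraham', 'Lincoln') else None

tokens = 'Abraham Lincoln was born in Kentucky .'.split()
hits = []
for size in (1, 2, 3):                      # try alternate window sizes
    for i in range(len(tokens) - size + 1):
        window = tuple(tokens[i:i + size])
        label = classify(window)
        if label is not None:
            hits.append((window, label))
# hits -> [(('Abraham', 'Lincoln'), 'PERSON')]
```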

Boundary models: a classifier marks token boundaries as BEGIN or END; entities are the BEGIN…END spans (here, "Abraham Lincoln" and "Kentucky").
Abraham Lincoln was born in Kentucky.
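Boundary tagging is often written in the equivalent BIO scheme (B = begins an entity, I = inside one, O = outside); decoding tags back into spans is a simple scan. The tag sequence below is a hypothetical classifier output, not from the lecture:

```python
tokens = 'Abraham Lincoln was born in Kentucky .'.split()

# Hypothetical per-token output of a boundary classifier.
tags = ['B', 'I', 'O', 'O', 'O', 'B', 'O']

# Decode contiguous B/I runs back into entity spans.
spans, begin = [], None
for i, tag in enumerate(tags + ['O']):      # sentinel flushes the last span
    if tag == 'B' or (tag == 'O' and begin is not None):
        if begin is not None:
            spans.append(' '.join(tokens[begin:i]))
        begin = i if tag == 'B' else None
# spans -> ['Abraham Lincoln', 'Kentucky']
```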

Context free grammars: parse the sentence; extraction corresponds to the best analysis.
Abraham Lincoln was born in Kentucky.
[Parse-tree diagram: POS tags NNP NNP V V P NNP combine into NP, PP, VP, and S.] Most likely parse?

Finite State Machines

Abraham Lincoln was born in Kentucky.

Most likely state sequence?
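"Most likely state sequence" is the Viterbi decoding problem; a tiny sketch over a made-up two-state model (all probabilities and the capitalization-based emission model are invented for illustration):

```python
import math

# A toy two-state HMM: is the current token part of a NAME or not?
states = ['NAME', 'OTHER']
start = {'NAME': 0.5, 'OTHER': 0.5}
trans = {'NAME': {'NAME': 0.6, 'OTHER': 0.4},
         'OTHER': {'NAME': 0.3, 'OTHER': 0.7}}

def emit(state, token):
    # Stand-in emission model: capitalized tokens look like names.
    cap = token[:1].isupper()
    if state == 'NAME':
        return 0.8 if cap else 0.2
    return 0.1 if cap else 0.9

def viterbi(tokens):
    # Dynamic program over (position, state), keeping the best path so far.
    V = [{s: (math.log(start[s] * emit(s, tokens[0])), [s]) for s in states}]
    for tok in tokens[1:]:
        row = {}
        for s in states:
            prev = max(states,
                       key=lambda p: V[-1][p][0] + math.log(trans[p][s]))
            score = (V[-1][prev][0] + math.log(trans[prev][s])
                     + math.log(emit(s, tok)))
            row[s] = (score, V[-1][prev][1] + [s])
        V.append(row)
    return max(V[-1].values())[1]

path = viterbi('Abraham Lincoln was born in Kentucky .'.split())
# path -> ['NAME', 'NAME', 'OTHER', 'OTHER', 'OTHER', 'NAME', 'OTHER']
```

Real statistical IE systems of this era used exactly this machinery (HMMs, later CRFs), with learned rather than hand-set parameters.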

