1
Natural Language Processing
Information Retrieval · Speech Recognition
Syntactic Parsing · Semantic Interpretation
CSE 592 Applications of AI, Winter 2003
2
Example Applications
• Spelling and grammar checkers
• Finding information on the WWW
• Spoken language control systems: banking, shopping
• Classification systems for messages, articles
• Machine translation tools
3
The Dream
4
Information Retrieval
(Thanks to Adam Carlson)
5
Motivation and Outline
• Background
– Definitions
• The Problem
– 100,000+ pages
• The Solution
– Ranking docs
– Vector space
– Probabilistic approaches
• Extensions
– Relevance feedback, clustering, query expansion, etc.
6
What is Information Retrieval
• Given a large repository of documents, how do I get at the ones that I want?
– Examples: Lexis/Nexis, medical reports, AltaVista
• Different from databases
– Unstructured (or semi-structured) data
– Information is (typically) text
– Requests are (typically) word-based
7
Information Retrieval Task
• Start with a set of documents
• User specifies information need
– Keyword query, Boolean expression, high-level description
• System returns a list of documents
– Ordered according to relevance
• Known as the ad hoc retrieval problem
8
Measuring Performance
• Precision
– Proportion of selected items that are correct: tp / (tp + fp)
• Recall
– Proportion of target items that were selected: tp / (tp + fn)
• Precision-recall curve
– Shows the tradeoff
[Diagram: the set of documents the system returned overlaps the set of actually relevant docs; the overlap is tp, returned-but-irrelevant is fp, relevant-but-missed is fn, everything else is tn]
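These definitions can be checked with a few lines of code; the counts below are hypothetical, chosen only to exercise the formulas.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); Recall = tp / (tp + fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: the system returned 50 docs, of which 40 are
# relevant (tp) and 10 are not (fp); 20 relevant docs were missed (fn).
p, r = precision_recall(tp=40, fp=10, fn=20)
print(p)  # 0.8
print(r)  # about 0.667
```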
9
Basic IR System
• Use word overlap to determine relevance
– Word overlap alone is inaccurate
• Rank documents by similarity to query
• Similarity computed using the Vector Space Model
10
Vector Space Model
• Represent documents as a matrix
– Words are rows
– Documents are columns
– Cell i,j contains the number of times word i appears in document j
– Similarity between two documents is the cosine of the angle between the vectors representing those documents
11
Vector Space Example
a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

           a b c d e f g h i
Interface  0 0 1 0 0 0 0 0 0
User       0 1 1 0 1 0 0 0 0
System     2 1 1 0 0 0 0 0 0
Human      1 0 0 1 0 0 0 0 0
Computer   0 1 0 1 0 0 0 0 0
Response   0 1 0 0 1 0 0 0 0
Time       0 1 0 0 1 0 0 0 0
EPS        1 0 1 0 0 0 0 0 0
Survey     0 1 0 0 0 0 0 0 1
Trees      0 0 0 0 0 1 1 1 0
Graph      0 0 0 0 0 0 1 1 1
Minors     0 0 0 0 0 0 0 1 1
12
Vector Space Example cont.
cos(θ_AB) = (A · B) / (|A| |B|)

[Diagram: documents a, b, and c plotted as vectors along the system, user, and interface axes]

           a b c
Interface  0 0 1
User       0 1 1
System     2 1 1
13
Similarity in Vector Space
cos(θ_AB) = (A · B) / (|A| |B|)

A · B = A_1 B_1 + A_2 B_2 + … + A_n B_n   (measures word overlap)

|A| = sqrt( Σ_{i=1..n} A_i² )   (normalizes for different length vectors)

Other metrics exist
14
Answering a Query Using Vector Space

• Represent query as vector
• Compute distances to all documents
• Rank according to distance
• Example: “computer system”

           Query  a b c d e f g h i
Interface    0    0 0 1 0 0 0 0 0 0
User         0    0 1 1 0 1 0 0 0 0
System       1    2 1 1 0 0 0 0 0 0
Human        0    1 0 0 1 0 0 0 0 0
Computer     1    0 1 0 1 0 0 0 0 0
Response     0    0 1 0 0 1 0 0 0 0
Time         0    0 1 0 0 1 0 0 0 0
EPS          0    1 0 1 0 0 0 0 0 0
Survey       0    0 1 0 0 0 0 0 0 1
Trees        0    0 0 0 0 0 1 1 1 0
Graph        0    0 0 0 0 0 0 1 1 1
Minors       0    0 0 0 0 0 0 0 1 1
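The whole pipeline can be sketched on the running example. The code below trims the vocabulary to documents a–d for brevity (counts copied from the matrix above), builds count vectors, and ranks the documents by cosine similarity to the query “computer system”:

```python
import math

# Term-document counts for documents a-d from the running example
vocab = ["interface", "user", "system", "human", "computer",
         "response", "time", "eps", "survey"]
docs = {
    "a": {"system": 2, "human": 1, "eps": 1},
    "b": {"user": 1, "system": 1, "computer": 1,
          "response": 1, "time": 1, "survey": 1},
    "c": {"interface": 1, "user": 1, "system": 1, "eps": 1},
    "d": {"human": 1, "computer": 1},
}

def vec(counts):
    """Turn a sparse word-count dict into a dense vector over vocab."""
    return [counts.get(w, 0) for w in vocab]

def cosine(u, v):
    """cos of the angle between u and v: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = vec({"computer": 1, "system": 1})
ranked = sorted(docs, key=lambda d: cosine(query, vec(docs[d])), reverse=True)
print(ranked)
```

On these counts, a and b tie for first place (each has dot product 2 with the query and the same vector length), followed by d, then c.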
15
Common Improvements
• The vector space model
– Doesn’t handle morphology (eat, eats, eating)
– Favors common terms
• Possible fixes
– Stemming: convert each word to a common root form
– Stop lists
– Term weighting
16
Handling Common Terms
• Stop list
– List of words to ignore: “a”, “and”, “but”, “to”, etc.
• Term weighting
– Words which appear everywhere aren’t very good discriminators – give higher weight to rare words
17
tf * idf
w_ik = tf_ik × log(N / n_k)

where, for term T_k and document D_i:
– tf_ik = frequency of term T_k in document D_i
– idf_k = inverse document frequency of term T_k in collection C
– N = total number of documents in collection C
– n_k = number of documents in C that contain T_k
– idf_k = log(N / n_k)
18
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:
log(10000 / 1)     = 4
log(10000 / 20)    ≈ 2.699
log(10000 / 5000)  ≈ 0.301
log(10000 / 10000) = 0
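The slide’s numbers come out of a base-10 logarithm; a minimal sketch reproducing them:

```python
import math

def tfidf(tf, N, n_k):
    """w_ik = tf_ik * log10(N / n_k): term frequency times inverse
    document frequency (base-10 log, matching the slide's numbers)."""
    return tf * math.log10(N / n_k)

# IDF values for a collection of N = 10000 documents
N = 10000
for n_k in (1, 20, 5000, 10000):
    print(n_k, round(math.log10(N / n_k), 3))
# 1 -> 4.0, 20 -> 2.699, 5000 -> 0.301, 10000 -> 0.0
```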
19
Probabilistic IR
• Vector space model robust in practice
• But mathematically ad hoc
– How to generalize to more complex queries?
(intel OR microsoft) AND (NOT stock)
• Alternative approach: model the problem as finding the documents with the highest probability of being relevant to the query
– Requires making some simplifying assumptions about the underlying probability distributions
– In certain cases can be shown to yield the same results as the vector space model
20
Probability Ranking Principle
For a given query Q, find the documents D that maximize the odds that the document is relevant (r):

O(r | Q, D) = [ P(Q | D, r) · P(r | D) ] / [ P(Q | D, ¬r) · P(¬r | D) ]
21
Probability Ranking Principle
For a given query Q, find the documents D that maximize the odds that the document is relevant (r):

O(r | Q, D) = [ P(Q | D, r) · P(r | D) ] / [ P(Q | D, ¬r) · P(¬r | D) ]

P(r | D): probability of document relevance to any query – i.e., the inherent quality of the document
22
Probability Ranking Principle
For a given query Q, find the documents D that maximize the odds that the document is relevant (r):

O(r | Q, D) = [ P(Q | D, r) · P(r | D) ] / [ P(Q | D, ¬r) · P(¬r | D) ]

P(Q | D, r): probability that if the document is indeed relevant, then the query is in fact Q

But where do we get that number?
23
Bayesian nets for text retrieval
[Diagram: a two-part Bayesian network. The Document Network layers documents (d1, d2) over the words they contain (w1, w2, w3), over concepts (c1, c2, c3). The Query Network combines concepts through query operators (q1, q2 – AND/OR/NOT) up to a single information-need node q0.]
24
Bayesian nets for text retrieval
[Same diagram as the previous slide]

The Document Network is computed once for the entire collection.
25
Bayesian nets for text retrieval
[Same diagram as the previous slide]

The Query Network is computed for each query.
26
Conditional Probability Tables
• P(d) = prior probability document d is relevant
– Uniform model: P(d) = 1 / number of docs
– In general, document quality P(r | d)
• P(w | d) = probability that a random word from document d is w
– Term frequency
• P(c | w) = probability that a given document word w has the same meaning as a query word c
– Thesaurus
• P(q | c1, c2, …) = canonical forms of the operators AND, OR, NOT, etc.
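When the parent beliefs are treated as independent probabilities, the canonical operator forms can be evaluated in closed form: AND multiplies the parents’ probabilities, OR takes the complement of the product of complements, and NOT flips. A sketch with made-up concept beliefs:

```python
from functools import reduce

def p_and(ps):
    """P(AND node true) = product of parent probabilities."""
    return reduce(lambda acc, p: acc * p, ps, 1.0)

def p_or(ps):
    """P(OR node true) = 1 - product of (1 - p) over parents."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), ps, 1.0)

def p_not(p):
    return 1.0 - p

# Made-up concept beliefs: P(reason)=0.8, P(trouble)=0.6, P(double)=0.3
print(round(p_and([0.8, 0.6]), 4))         # 0.48
print(round(p_or([0.6, 0.3]), 4))          # 0.72
print(round(p_and([0.8, p_not(0.3)]), 4))  # 0.56
```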
27
Example
[Diagram: Document Network – documents Hamlet and Macbeth over the words “reason”, “double”, “two”, and “trouble”; Query Network – query concepts combined through OR, NOT, and AND into the user query.]
28
Details
• Set head q0 of user query to “true”
• Compute posterior probability P(D | q0)
• “User information need” doesn’t have to be a query – can be a user profile, e.g., other documents the user has read
• Instead of just words, can include phrases, inter-document links
• Link matrices can be modified over time
– User feedback
– The promise of “personalization”
29
Extensions
• Meet demands of web-based systems
• Modified ranking functions for the web
• Relevance feedback
• Query expansion
• Document clustering
• Latent Semantic Indexing
• Other IR tasks
30
IR on the Web
• Query AltaVista with “Java”
– Almost 10^7 pages found
• Avoiding latency
– User wants (initial) results fast
• Solution
– Rank documents using word overlap
– Use a special data structure: the inverted index
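An inverted index maps each word to the documents containing it, so a keyword query touches only the posting lists it needs instead of scanning every page. A minimal sketch (toy documents with hypothetical ids):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "Java programming tutorial",
    2: "coffee from Java island",
    3: "Python programming tutorial",
}
index = build_inverted_index(docs)
print(sorted(index["java"]))                             # [1, 2]
# Conjunctive queries are posting-list intersections:
print(sorted(index["programming"] & index["tutorial"]))  # [1, 3]
```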
31
Improved Ranking on the Web
• Not just arbitrary documents
• Can use HTML tags and other properties
– Query term in <TITLE></TITLE>
– Query term in <IMG>, <HREF>, etc. tags
– Check date of document (prefer recent docs)
– PageRank (Google)
32
PageRank
• Idea: good pages link to other good pages
– Round 1: count in-links. Problems?
– Round 2: sum weighted in-links
– Round 3: and again, and again…
• Implementation: repeated random walk on a snapshot of the web
– weight ∝ frequency visited
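The random-walk idea can be sketched as power iteration: each page repeatedly passes a damped share of its current weight along its out-links. The damping factor 0.85 and the three-page graph below are illustrative assumptions, not from the slide.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration sketch of PageRank.
    links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Tiny web: everyone links to C, so C should rank highest
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```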
33
Relevance Feedback
• System returns initial set of documents
• User identifies relevant documents
• System refines query to get documents more like those identified by the user
– Add words common to relevant docs
– Reposition query vector closer to relevant docs
• Lather, rinse, repeat…
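The standard formula for repositioning the query vector is Rocchio’s update (not named on the slide): move the query toward the mean of the relevant docs and away from the mean of the nonrelevant ones. The weights and vectors below are illustrative.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    n = len(query)

    def mean(vectors):
        if not vectors:
            return [0.0] * n
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(n)]

    r, s = mean(relevant), mean(nonrelevant)
    return [alpha * query[i] + beta * r[i] - gamma * s[i] for i in range(n)]

# Vocabulary: [boat, ship, car]; user marked one doc relevant, one not.
q = [1.0, 0.0, 0.0]
new_q = rocchio(q, relevant=[[1.0, 1.0, 0.0]], nonrelevant=[[0.0, 0.0, 1.0]])
print(new_q)  # [1.75, 0.75, -0.15] -- "ship" now carries weight
```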
34
Query Expansion
• Given query, add words to improve recall
– Workaround for the synonym problem
• Example
– boat → boat OR ship
• Can involve user feedback or not
• Can use a thesaurus or other online source
– WordNet
35
Document Clustering
• Group similar documents
– Similar means “close in vector space”
• If a document is relevant, return the whole cluster
• Can be combined with relevance feedback
• GROUPER: http://www.cs.washington.edu/research/clustering
36
Clustering Algorithms
• K-means
Initialize k cluster centers
Loop
  Assign each document to the closest center
  Move cluster centers to better fit the assignment
Until little movement

• Hierarchical Agglomerative Clustering
Initialize each document to a singleton cluster
Loop
  Merge the two closest clusters
Until k clusters exist

Many ways to measure distance between clusters
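The K-means loop can be sketched in a few lines. For clarity this uses Euclidean distance on toy 2-d points; a real system would cluster tf-idf vectors, typically with cosine distance.

```python
import random

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=20, seed=0):
    """Assign points to the closest center, move each center to the
    mean of its cluster, repeat for a fixed number of iterations."""
    centers = random.Random(seed).sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        centers = [mean(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two obvious groups of 2-d "documents"
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```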
37
Latent Semantic Indexing
• Creates a modified vector space
• Captures transitive co-occurrence information
– If docs A & B don’t share any words with each other, but both share lots of words with doc C, then A & B will be considered similar
• Simulates query expansion and document clustering (sort of)
38
Variations on a Theme
• Text categorization
– Assign each document to a category
– Example: automatically put web pages in the Yahoo hierarchy
• Routing & filtering
– Match documents with users
– Example: news service that allows subscribers to specify “send news about high-tech mergers”
39
Speech Recognition
TO BE COMPLETED
40
Syntactic Parsing / Semantic Interpretation
TO BE COMPLETED
41
NLP Research Areas
• Morphology: structure of words
• Syntactic interpretation (parsing): create a parse tree of a sentence
• Semantic interpretation: translate a sentence into the representation language
– Pragmatic interpretation: take the current situation into account
– Disambiguation: there may be several interpretations; choose the most probable
42
Some Difficult Examples
• From the newspapers:
– Squad helps dog bite victim.
– Helicopter powered by human flies.
– Levy won’t hurt the poor.
– Once-sagging cloth diaper industry saved by full dumps.
• Ambiguities:
– Lexical: meanings of ‘hot’, ‘back’.
– Syntactic: I heard the music in my room.
– Referential: The cat ate the mouse. It was ugly.
43
Parsing
• Context-free grammars:
EXPR -> NUMBER
EXPR -> VARIABLE
EXPR -> (EXPR + EXPR)
EXPR -> (EXPR * EXPR)
• (2 + X) * (17 + Y) is in the grammar.
• (2 + (X)) is not.
• Why do we call them context-free?
44
Using CFG’s for Parsing
• Can natural language syntax be captured using a context-free grammar?
– Yes, no, sort of, for the most part, maybe.
• Words:
– Nouns, adjectives, verbs, adverbs
– Determiners: the, a, this, that
– Quantifiers: all, some, none
– Prepositions: in, onto, by, through
– Connectives: and, or, but, while
• Words combine together into phrases: NP, VP
45
An Example Grammar
• S -> NP VP
• VP -> V NP
• NP -> NAME
• NP -> ART N
• ART -> a | the
• V -> ate | saw
• N -> cat | mouse
• NAME -> Sue | Tom
46
Example Parse
• The mouse saw Sue.
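A toy parser that encodes this particular grammar directly can check the parse (a sketch for this grammar only, not a general CFG parser):

```python
# Lexicon from the example grammar on the previous slide
LEXICON = {"a": "ART", "the": "ART", "ate": "V", "saw": "V",
           "cat": "N", "mouse": "N", "Sue": "NAME", "Tom": "NAME"}

def parse_np(words, i):
    """Try NP -> NAME, then NP -> ART N, starting at position i."""
    if i < len(words) and LEXICON.get(words[i]) == "NAME":
        return ("NP", ("NAME", words[i])), i + 1
    if (i + 1 < len(words) and LEXICON.get(words[i]) == "ART"
            and LEXICON.get(words[i + 1]) == "N"):
        return ("NP", ("ART", words[i]), ("N", words[i + 1])), i + 2
    return None, i

def parse_s(words):
    """S -> NP VP, where VP -> V NP; must consume the whole sentence."""
    np, i = parse_np(words, 0)
    if np and i < len(words) and LEXICON.get(words[i]) == "V":
        obj, j = parse_np(words, i + 1)
        if obj and j == len(words):
            return ("S", np, ("VP", ("V", words[i]), obj))
    return None

tree = parse_s("the mouse saw Sue".split())
print(tree)
# ('S', ('NP', ('ART', 'the'), ('N', 'mouse')),
#       ('VP', ('V', 'saw'), ('NP', ('NAME', 'Sue'))))
```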
47
Ambiguity
• S -> NP VP
• VP -> V NP
• VP -> V NP NP
• NP -> N
• NP -> N N
• NP -> Det NP
• Det -> the
• V -> ate | saw | bought
• N -> cat | mouse | biscuits | Sue | Tom
“Sue bought the cat biscuits”
48
Example: Chart Parsing
• Three main data structures: a chart, a key list, and a set of edges
• Chart:
[Diagram: a table indexed by starting point (columns 1–4) and length (rows 1–4); each cell holds the name of the terminal or non-terminal spanning that substring]
49
Key List and Edges
• Key list: push-down stack of chart entries
– “the”, “box”, “floats”
• Edges: rules that can be applied to chart entries to build up larger entries
[Diagram: the chart from the previous slide with entries “the”, “box”, and “floats” at length 1, and a dotted-rule edge “det -> the o” showing progress through a rule]
50
Chart Parsing Algorithm
• Loop while entries in key list
– 1. Remove entry from key list
– 2. If entry already in chart, add edge list and break
– 3. Add entry from key list to chart
– 4. For all rules that begin with entry’s type, add an edge for that rule
– 5. For all edges that need the entry next, add an extended edge (see algorithm on right)
– 6. If the edge is finished, add an entry to the key list with type, start point, length, and edge list
• To extend an edge with chart entry c
– Create a new edge e’
– Set start(e’) to start(e)
– Set end(e’) to end(c)
– Set rule(e’) to rule(e) with “o” moved beyond c
– Set righthandside(e’) to righthandside(e) + c
51
Try it
• S -> NP VP
• VP -> V
• NP -> Det N
• Det -> the
• N -> box
• V -> floats
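A minimal bottom-up chart for this grammar can be filled cell by cell, indexed by starting point and length as on the earlier chart slide. This sketch records only which symbols span each substring, not the edges themselves:

```python
RULES = [("S", ("NP", "VP")), ("VP", ("V",)), ("NP", ("Det", "N"))]
LEXICON = {"the": "Det", "box": "N", "floats": "V"}

def chart_parse(words):
    """chart[start][length] = set of symbols covering words[start:start+length]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            cell = chart[start][length]
            if length == 1:
                cell.add(LEXICON[words[start]])
            # Combine two adjacent spans with a binary rule
            for split in range(1, length):
                for lhs, rhs in RULES:
                    if (len(rhs) == 2 and rhs[0] in chart[start][split]
                            and rhs[1] in chart[start + split][length - split]):
                        cell.add(lhs)
            # Close the cell under unit rules like VP -> V
            changed = True
            while changed:
                changed = False
                for lhs, rhs in RULES:
                    if len(rhs) == 1 and rhs[0] in cell and lhs not in cell:
                        cell.add(lhs)
                        changed = True
    return chart

chart = chart_parse("the box floats".split())
print("S" in chart[0][3])  # True: the whole sentence parses as S
```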
52
Semantic Interpretation
• Our goal: translate sentences into a logical form.
• But sentences convey more than true/false:
– It will rain in Seattle tomorrow.
– Will it rain in Seattle tomorrow?
• A sentence can be analyzed by:
– propositional content, and
– speech act: tell, ask, request, deny, suggest
53
Propositional Content
• We develop a logic-like language for representing propositional content:
– Word-sense ambiguity
– Scope ambiguity
• Proper names --> objects (John, Alon)
• Nouns --> unary predicates (woman, house)
• Verbs -->
– transitive: binary predicates (find, go)
– intransitive: unary predicates (laugh, cry)
• Quantifiers: most, some
• Example: “John loves Mary” --> Loves(John, Mary)
54
From Syntax to Semantics
• ADD SLIDES ON SEMANTIC INTERPRETATION
55
Word Sense Disambiguation
• ADD SLIDES!
56
Statistical NLP
• Consider the problem of part-of-speech tagging:
– “The box floats”
– “The” → Det; “box” → N; “floats” → V
• Given a sentence w(1,n), where w(i) is the i-th word, we want to find the tags t(i) assigned to each word w(i)
57
The Equations
• Find the t(1,n) that maximizes
– P[t(1,n) | w(1,n)] = P[w(1,n) | t(1,n)] · P[t(1,n)] / P[w(1,n)]
– So we only need to maximize P[w(1,n) | t(1,n)] · P[t(1,n)]
• Assume that
– A word depends only on its own tag
– A tag depends only on the previous tag
– We have:
• P[w(j) | w(1,j-1), t(1,j)] = P[w(j) | t(j)], and
• P[t(j) | w(1,j-1), t(1,j-1)] = P[t(j) | t(j-1)]
– Thus, we want to maximize
• P[w(n) | t(n)] · P[t(n) | t(n-1)] · P[w(n-1) | t(n-1)] · P[t(n-1) | t(n-2)] · …
58
Example

• “The box floats”: given a corpus (a training set)
– Assignment one: t(1) = Det, t(2) = V, t(3) = V
• P(V | Det) is rather low, and so is P(V | V), so this assignment is less likely
– Assignment two: t(1) = Det, t(2) = N, t(3) = V
• P(N | Det) is high, and P(V | N) is high, so this assignment is more likely!
– In general, can use Hidden Markov Models to find the probabilities
[Diagram: an HMM with states det, N, and V; “the” is emitted from det, “box” from N, and “floats” from V]
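The most likely tag sequence can be computed with the Viterbi algorithm over such an HMM. Every probability below is made up purely for illustration; a real tagger estimates them from a tagged corpus:

```python
# Hypothetical HMM parameters for the "The box floats" example
TAGS = ["det", "N", "V"]
P_TRANS = {  # P(tag | previous tag); "<s>" marks the sentence start
    ("<s>", "det"): 0.8, ("<s>", "N"): 0.15, ("<s>", "V"): 0.05,
    ("det", "N"): 0.9, ("det", "V"): 0.05, ("det", "det"): 0.05,
    ("N", "V"): 0.6, ("N", "N"): 0.3, ("N", "det"): 0.1,
    ("V", "det"): 0.4, ("V", "N"): 0.3, ("V", "V"): 0.3,
}
P_EMIT = {  # P(word | tag); "box" can be a noun or (rarely) a verb
    ("det", "the"): 0.5, ("N", "box"): 0.01, ("V", "box"): 0.002,
    ("V", "floats"): 0.004, ("N", "floats"): 0.001,
}

def viterbi(words, floor=1e-8):
    """Keep, for each tag, the best (probability, path) ending in it."""
    best = {"<s>": (1.0, [])}
    for w in words:
        new = {}
        for t in TAGS:
            e = P_EMIT.get((t, w), floor)
            new[t] = max(
                (p * P_TRANS.get((pt, t), floor) * e, prev + [t])
                for pt, (p, prev) in best.items())
        best = new
    return max(best.values())[1]

print(viterbi(["the", "box", "floats"]))  # ['det', 'N', 'V']
```

With these numbers the det-N-V path wins because P(N | det) and P(V | N) dominate the alternatives, matching the slide’s argument.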
59
Experiments
• Charniak and colleagues did some experiments on a collection of documents called the “Brown Corpus”, where tags are assigned by hand.
• 90% of the corpus is used for training and the other 10% for testing
• They show they can get 95% correctness with HMMs.
• A really simple algorithm – assign to w the tag t with the highest probability P(t | w) – already gets 91% correctness!
60
Natural Language Summary
• Parsing:
– Context-free grammars with features
• Semantic interpretation:
– Translate sentences into a logic-like language
– Use additional domain knowledge for word-sense disambiguation
– Use context to disambiguate references