Page 1: Natural Language Processing for IR & IR Evaluation

Advanced Topics in Information Retrieval
Natural Language Processing for IR & IR Evaluation

Vinay Setty Jannik Strötgen

[email protected] [email protected]

ATIR – April 28, 2016

Page 2: Natural Language Processing for IR & IR Evaluation

Organizational Things

please register – if you haven't done so
mail to atir16 (at) mpi-inf.mpg.de with (i) name, (ii) matriculation number, (iii) preferred email address
even if you do not want to get the ECTS points
important for announcements about assignments, rooms etc.

assignments
first assignment today
remember: we can only open PDFs
50% of points (not of exercises) with serious, presentable solutions

Page 3: Natural Language Processing for IR & IR Evaluation

Outline

1 Simple Linguistic Preprocessing

2 Linguistics

3 Further Linguistic (Pre-)Processing

4 NLP Pipeline Architectures

5 Evaluation Measures

Page 4: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

Page 5: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

different types of data: structured data vs. unstructured data (vs. semi-structured data)

structured data

typically refers to information in tables

Employee | Manager | Salary
Johnny   | Frank   | 50000
Jack     | Johnny  | 60000
Jim      | Johnny  | 50000

numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Johnny
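For contrast, a minimal sketch of such a structured query in Python (a hypothetical in-memory list standing in for a real database table):

```python
# Hypothetical in-memory stand-in for the employee table above.
employees = [
    {"employee": "Johnny", "manager": "Frank",  "salary": 50000},
    {"employee": "Jack",   "manager": "Johnny", "salary": 60000},
    {"employee": "Jim",    "manager": "Johnny", "salary": 50000},
]

# Salary < 60000 AND Manager = Johnny
hits = [e for e in employees
        if e["salary"] < 60000 and e["manager"] == "Johnny"]
print([e["employee"] for e in hits])  # ['Jim']
```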

Page 6: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

unstructured data
typically refers to "free text"
not just string matching queries

typical distinction:
structured data → "databases"
unstructured data → "information retrieval"

NLP foundations are important for IR

actually: semi-structured data
almost always some structure: title, bullets
facilitates semi-structured search

e.g., title contains NLP and bullet contains data
(not to mention the linguistic structure of text . . . )

Page 7: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

standard procedure in IR
starting point: documents and queries
pre-processing of documents and queries typically includes

– tokenization (e.g., splitting at white spaces and hyphens)
– stemming or lemmatization (group variants of the same word)
– stopword removal (get rid of words with little information)

this results in a bag (or sequence) of indexable terms

Page 9: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

standard procedure in IR
starting point: documents and queries
pre-processing of documents and queries typically includes

– tokenization (e.g., splitting at white spaces and hyphens)
– stemming or lemmatization (group variants of the same word)
– stopword removal (get rid of words with little information)

this results in a bag (or sequence) of indexable terms

many NLP concepts were mentioned in the previous lecture
today: linguistic / NLP foundations for IR

Page 10: Natural Language Processing for IR & IR Evaluation

Why NLP Foundations for IR?

goal of this lecture:
NLP concepts are not just buzzwords,
NLP concepts shall be understood

example:what’s the difference between lemmatization and stemming?

Page 11: Natural Language Processing for IR & IR Evaluation

Contents

1 Simple Linguistic Preprocessing
   Tokenization
   Lemmatization & Stemming

2 Linguistics

3 Further Linguistic (Pre-)Processing

4 NLP Pipeline Architectures

5 Evaluation Measures

Page 12: Natural Language Processing for IR & IR Evaluation

Tokenization

the task: given a character sequence, split it into pieces called tokens

tokens are often loosely referred to as terms/words
last lecture: "splitting at white spaces and hyphens"
seems to be trivial

type vs. token (vs. term)
token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
type: class of all tokens containing the same character sequence
term: (normalized) type included in the IR system's dictionary

Page 13: Natural Language Processing for IR & IR Evaluation

Tokenization – Example

type vs. token – example

a rose is a rose is a rose
how many tokens? 8
how many types? 3 ({a, is, rose})

type vs. token – example
A rose is a rose is a rose
knowing about normalization is important

set-theoretical view
tokens → multiset (multiset: bag of words)
types → set
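A minimal sketch of the distinction in Python (tokens as a list, types as a set):

```python
tokens = "a rose is a rose is a rose".split()  # the token multiset
types = set(tokens)                            # collapses to the type set

print(len(tokens))    # 8 tokens
print(sorted(types))  # ['a', 'is', 'rose'] -> 3 types
```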

Page 14: Natural Language Processing for IR & IR Evaluation

Tokenization – Example

tokenization – example: Mr. O'Neill thinks rumors about Chile's capital aren't amusing.

simple strategies (see the sketch below):
split at white spaces and hyphens
split on all non-alphanumeric characters:
mr | o | neill | thinks | rumors | about | chile | s | capital | aren | t | amusing

is that good? there are many alternatives
→ o | neill – oneill – neill – o'neill – o' | neill
→ aren | t – arent – are | n't – aren't

even simple (NLP) tasks are not trivial!

most important: queries and documents have to be preprocessed identically!
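A minimal sketch of the two simple strategies in Python, using the re module (real tokenizers are considerably more careful):

```python
import re

text = "Mr. O'Neill thinks rumors about Chile's capital aren't amusing."

# Strategy 1: split at white spaces and hyphens.
print(re.split(r"[\s\-]+", text))

# Strategy 2: split on all non-alphanumeric characters.
print([t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t])
# -> mr | o | neill | thinks | rumors | about | chile | s | capital | aren | t | amusing
```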

Page 15: Natural Language Processing for IR & IR Evaluation

Tokenization

queries and documents have to be preprocessed identically

tokenization choices determine which (Boolean) queries match
guarantees that a sequence of characters in the query matches the same sequence in the text

further issues:
what about hyphens? co-education vs. drag-and-drop
what about names? San Francisco, Los Angeles
tokenization is language-specific

– “this is a sequence of several words”

– noun compounds are not separated in German:
  "Lebensversicherungsgesellschaftsangestellter"
  vs. "life insurance company employee"

a compound splitter may improve IR

Page 16: Natural Language Processing for IR & IR Evaluation

Lemmatization & Stemming

tokenization is just one step during preprocessing; others:
lemmatization
stemming
stopword removal

lemmatization and stemming: two tasks, same goal
→ to group variants of the same word

what’s the difference?stemming vs. lemmatization

stem vs. lemma

Page 17: Natural Language Processing for IR & IR Evaluation

Lemma & Lemmatization

idea: reduce inflectional forms (all variants of a "word") to the base form

examples:
am, are, be, is → be
car, cars, car's, cars' → car

lemmatization: proper reduction to dictionary headword form

lemma: dictionary form of a set of words

Page 18: Natural Language Processing for IR & IR Evaluation

Stem & Stemming

idea: reduce terms to their "roots"

examples:
are → ar
automate, automates, automatic, automation → automat

stemming: suggests crude affix chopping

stem: root form of a set of words (not necessarily a word itself)

Page 19: Natural Language Processing for IR & IR Evaluation

Stemming and Lemmatization – Examples

the boy’s cars are different colors

lemmatized: the | boy | car | be | different | color

stemmed: the | boy | car | ar | differ | color

Page 20: Natural Language Processing for IR & IR Evaluation

Stemming and Lemmatization – Examples

for example compressed and compression are both accepted as equivalent to compress.

lemmatized: for | example | compress | and | compression | be | both | accept | as | equivalent | to | compress

stemmed: for | exampl | compress | and | compress | ar | both | accept | as | equival | to | compress

Page 21: Natural Language Processing for IR & IR Evaluation

Stemming

popular stemmersporter’s algorithm(http://tartarus.org/martin/PorterStemmer/)snowball (http://snowballstem.org/demo.html)

what’s better for IR? stemming or lemmatization?try it yourself!

Page 22: Natural Language Processing for IR & IR Evaluation

Stop Words

stop words
have little semantic content
are extremely frequent: the top 30 words account for about 30% of postings
occur in almost every document, i.e., are not discriminative

→ high document frequency

example of a stop word list:
a, an, and, are, as, at, be, by, for, from, has, he, in,
is, it, its, of, on, that, the, to, was, were, will, with

what types of words are these?

Page 23: Natural Language Processing for IR & IR Evaluation

Stop Word Removal

idea:
based on a stop list, remove all stop words, i.e., stop words are not part of the IR system's dictionary (see the sketch below)
saves a lot of memory
makes query processing much faster

trend (in particular in web search): no stop word removal
there are good compression techniques
there are good query optimization techniques

stop words are needed – examples:
King of Norway
let it be
to be or not to be
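A minimal sketch of stop word removal with the list above, illustrating why such queries break:

```python
stop_words = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "has", "he", "in", "is", "it", "its", "of", "on",
              "that", "the", "to", "was", "were", "will", "with"}

tokens = "to be or not to be".split()
print([t for t in tokens if t not in stop_words])
# ['or', 'not'] -- the famous query all but disappears
```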

Page 24: Natural Language Processing for IR & IR Evaluation

Contents

1 Simple Linguistic Preprocessing

2 Linguistics
   Parts-of-Speech
   Ambiguities
   Semantic Relations
   Named Entities

3 Further Linguistic (Pre-)Processing

4 NLP Pipeline Architectures

5 Evaluation Measures

Page 25: Natural Language Processing for IR & IR Evaluation

Parts-of-Speech

alternative distinction between stop words and others:
function words: used to make sentences grammatically correct
content words: carry the meaning of a sentence

function words: auxiliary verbs, prepositions, conjunctions, determiners, pronouns

content words: nouns, verbs, adjectives, adverbs

how many parts-of-speech are there?
between 8 and hundreds of different parts-of-speech
what's useful depends on the application and language

Page 26: Natural Language Processing for IR & IR Evaluation

Ambiguities

one word, one part-of-speech?
can we can fish in a can?
can: auxiliary, verb, noun

Page 28: Natural Language Processing for IR & IR Evaluation

Levels of Ambiguities

speech recognitionit’s hard to recognize speechit’s hard to wreck a nice beach

prepositional attachment:
the boy saw the man with the telescope

syntax / morphology:
time flies (noun / verb) like (verb / preposition) an arrow

word-level ambiguities:
"can": auxiliary, verb, noun

disambiguation: resolution of ambiguities

word-level ambiguities are most crucial for IR

Page 29: Natural Language Processing for IR & IR Evaluation

Semantic Relations between Words

synonyms → query for one, find documents with either one
different words, same meaning
car vs. automobile

homographs → disambiguate or diversify results
same spelling, different meaning
bank vs. bank

homophones → problem with spoken queries
same pronunciation, different meaning
there vs. their vs. they're

homonyms
same spelling, same pronunciation, different meaning

Page 30: Natural Language Processing for IR & IR Evaluation

Named Entities

entity: anything you can refer to with a name

location, person, organization
facilities, vehicles, songs, movies, products
(and domain-dependent ones: genes & proteins, ...)
sometimes: numbers, dates

relevant in IR: entities are popular and extremely frequent in queries

names are highly ambiguous:
Washington → place(s), person(s), (government)
Springfield

Page 31: Natural Language Processing for IR & IR Evaluation

Contents

1 Simple Linguistic Preprocessing

2 Linguistics

3 Further Linguistic (Pre-)Processing
   Normalizations
   Part-of-Speech Tagging
   Chunking
   Parsing – Syntactic Analysis

4 NLP Pipeline Architectures

5 Evaluation Measures

Page 32: Natural Language Processing for IR & IR Evaluation

Normalizations

indexed terms have to be normalized:
lemmatization
stemming

some things need to be done before that:
U.S.A. vs. USA
anti-discriminatory vs. antidiscriminatory
usa vs. USA

terms: normalization results in terms
a term is a normalized word type, an entry in an IR system's dictionary
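A minimal sketch of one such normalization rule for the examples above (just one possible equivalence-classing, not the only one):

```python
def normalize(token):
    # One possible equivalence class: drop periods and hyphens, lowercase.
    return token.replace(".", "").replace("-", "").lower()

print(normalize("U.S.A."))                   # usa
print(normalize("anti-discriminatory"))      # antidiscriminatory
print(normalize("USA") == normalize("usa"))  # True
```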

Page 33: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

idea:
the number of words in a language is unlimited
– few frequent words, many infrequent words

Zipf’s lawPn ∝ 1/na

the number of parts-of-speech is limited
– Dionysius Thrax of Alexandria (100 BC): 8 parts-of-speech
– in NLP: up to hundreds of part-of-speech tags (application- and language-dependent)

many words are ambiguous

example:
The/DET newspaper/NN published/VD ten/CD articles/NNS ./.
Can/AUX we/PRP can/VB fish/NN in/IN a/DET can/NN ./.

Page 34: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

part-of-speech tags allow for a higher degree of abstraction to estimate likelihoods

what's the likelihood of:
"an amazing" is followed by "goalkeeper"
"an amazing" is followed by "scored"
"determiner adjective" is followed by "noun"
"determiner adjective" is followed by "verb"

automatic assignment of part-of-speech tags
e.g., Penn Treebank tagset: 36 tags (+ 9 punctuation tags)
ambiguities can be resolved via contexts

Page 35: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

way to go:
input: sequence of (tokenized) words
output: chain of tokens with their part-of-speech tags
goal: most likely part-of-speech tags for the sequence
→ ambiguities shall be resolved
a typical classification problem

is it tough?
most word types in English are not ambiguous
but the most frequently occurring words in English are ambiguous
disambiguation is required

today’s taggersabout 97% accuracy (but highly domain-dependent)

Page 36: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

approaches:
rule-based taggers
probabilistic taggers
transformation-based taggers

probabilistic taggers:
given: manually annotated training data ("gold standard")
learn probabilities based on training data
estimate probabilities of POS tags given a word in a context
→ Hidden Markov Models

Page 37: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

Hidden Markov Models:
based on Bayesian inference
goal: given a sequence of tokens, assign a sequence of POS tags
given all possible tag sequences, which one is most likely?

$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$

using Bayes, we get

$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$
→ $\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$

assumptions:
the probability of a word depends on its own tag only:
$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
the probability of a tag depends on the previous tag only:
$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$

Page 38: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

Hidden Markov Models:
based on Bayesian inference
goal: given a sequence of tokens, assign a sequence of POS tags
given all possible tag sequences, which one is most likely?

$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$

maximum likelihood estimation based on a corpus:

$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$    $P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$
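A minimal sketch of these count-based estimates in Python (a hypothetical toy hand-tagged corpus; a real tagger would add smoothing and Viterbi decoding):

```python
from collections import Counter

# Hypothetical toy corpus of (word, tag) sentences.
corpus = [
    [("can", "AUX"), ("we", "PRP"), ("can", "VB"), ("fish", "NN"),
     ("in", "IN"), ("a", "DET"), ("can", "NN")],
]

emit = Counter()       # C(t, w)
trans = Counter()      # C(t_prev, t)
tag_count = Counter()  # C(t)

for sentence in corpus:
    prev = "<s>"
    for word, tag in sentence:
        emit[(tag, word)] += 1
        trans[(prev, tag)] += 1
        tag_count[tag] += 1
        prev = tag

def p_word_given_tag(w, t):
    return emit[(t, w)] / tag_count[t]

def p_tag_given_prev(t, prev):
    total = sum(c for (p, _), c in trans.items() if p == prev)
    return trans[(prev, t)] / total if total else 0.0

print(p_word_given_tag("can", "NN"))  # C(NN, can) / C(NN) = 1/2
print(p_tag_given_prev("VB", "PRP"))  # C(PRP, VB) / C(PRP, *) = 1.0
```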

Page 39: Natural Language Processing for IR & IR Evaluation

Part-of-Speech Tagging

in information retrieval:
determine content words in a query based on POS tags
helpful for named entity recognition → semantic search

Page 40: Natural Language Processing for IR & IR Evaluation

Chunking

(simple) grouping of tokens that belong together
most popular: noun phrase (NP) chunking
but also: verb phrases

example:
[ Paris ]NP [ has been ]VP [ a wonderful stop ]NP during [ my travel ]NP – just as [ New York City ]NP .

why chunking for IR?
simpler than full syntactic analysis
already provides some structure

Page 41: Natural Language Processing for IR & IR Evaluation

Parsing

goal: syntactic structure of a sentence

two views of linguistic structure:
constituency (phrase) structure

example (the man has the telescope):
The boy saw the man with the telescope
[ [ The boy ]NP [ [ saw ]VP [ [ the man ]NP [ with [ the telescope ]NP ]PP ]NP ]VP ]S

Page 42: Natural Language Processing for IR & IR Evaluation

Parsing

goal: syntactic structure of a sentence

two views of linguistic structure:
constituency (phrase) structure
dependency structure

example (the man has the telescope):
The boy saw the man with the telescope

helpful for IR? relation extraction for knowledge harvesting

(dependency tree over "The boy saw the man with the telescope", with ROOT, subj, obj, and det edges)

Page 43: Natural Language Processing for IR & IR Evaluation

Named Entity Recognition

tasks:
extraction → determine the boundaries
classification → assign a class (PER, LOC, ORG, . . . )

systems:
rule-based → with gazetteers, context-based rules (Mr.), . . .
machine learning → features: mixed case (eBay), ends in digit (A9), all caps (BMW), . . .
several tools available (e.g., Stanford NER)

extraction is good, but normalization is better

Page 44: Natural Language Processing for IR & IR Evaluation

Named Entity Normalization

same task, many names:
normalization
linking
resolution
grounding

example: Washington
/wiki/Washington,_D.C.
/wiki/Washington_%28state%29
/wiki/Washington_Irving
/wiki/Washington_Redskins
/wiki/George_Washington

tools: several tools available (AIDA, . . . )

Page 45: Natural Language Processing for IR & IR Evaluation

Contents

1 Simple Linguistic Preprocessing

2 Linguistics

3 Further Linguistic (Pre-)Processing

4 NLP Pipeline Architectures

5 Evaluation Measures

Page 46: Natural Language Processing for IR & IR Evaluation

NLP Pipeline Architectures

NLP tasks can often be split into multiple sub-tasks
e.g., dependency parsing:

– sentence splitting
– tokenization
– part-of-speech tagging
– parsing

several pre-processing components in Elasticsearch

pre-processing of corpora, e.g., for semantic search:
UIMA https://uima.apache.org/

GATE https://gate.ac.uk/

NLTK http://www.nltk.org/

Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/

Page 47: Natural Language Processing for IR & IR Evaluation

The Pipeline Principle – Why a (UIMA) Pipeline

... postponed to the information extraction lecture

Page 48: Natural Language Processing for IR & IR Evaluation

Contents

1 Simple Linguistic Preprocessing

2 Linguistics

3 Further Linguistic (Pre-)Processing

4 NLP Pipeline Architectures

5 Evaluation Measures
   Evaluating NLP Systems
   Evaluating IR Systems

Page 49: Natural Language Processing for IR & IR Evaluation

Evaluation Measures

what is “good” / “correct” in information retrieval?

Page 50: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

let’s start with a simple NLP task

example: given a sequence of tokens, mark the nouns

can a red rose be a tree or a fly or just a rose
gold annotations: the four nouns (rose, tree, fly, rose) are marked
example system output: five tokens marked as nouns, three of them correctly

how good is the system’s output?

Page 51: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

frequently used measures: precision, recall, f-score
based on evaluating all of the system's decisions

can a red rose be a tree or a fly or just a rose
gold annotations: the four nouns (rose, tree, fly, rose) are marked
example system output: five tokens marked as nouns, three of them correctly

correct decisions: 3 + 8 = 11?

Page 52: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

frequently used measures: precision, recall, f-score
based on evaluating all of the system's decisions

can a red rose be a tree or a fly or just a rose
gold annotations: the four nouns (rose, tree, fly, rose) are marked
example system output: five tokens marked as nouns, three of them correctly

we should count them separately

true positives: 3
true negatives: 8
false positives: 2
false negatives: 1

Page 53: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

confusion matrix:

                     ground truth
                     pos    neg
system   pos         TP     FP
         neg         FN     TN

$\text{precision} = \frac{TP}{TP+FP}$    $\text{recall} = \frac{TP}{TP+FN}$    $\text{f1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

or in words:
precision: ratio of instances correctly marked as positive by the system to all instances marked as positive by the system
recall: ratio of instances correctly marked as positive by the system to all instances marked as positive in the gold standard
f1-score: balanced harmonic mean

Page 54: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

can a red rose be a tree or a fly or just a rose
gold annotations: the four nouns (rose, tree, fly, rose) are marked
example system output: five tokens marked as nouns, three of them correctly

true positives: 3
true negatives: 8
false positives: 2
false negatives: 1

precision = 3 / (3+2) = 0.6
recall = 3 / (3+1) = 0.75
f1-score = (2 × 0.6 × 0.75) / (0.6 + 0.75) = 2/3
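The same computation as a small Python sketch:

```python
tp, fp, fn = 3, 2, 1

precision = tp / (tp + fp)  # 3/5 = 0.6
recall = tp / (tp + fn)     # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.6 0.75 0.666...
```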

Page 55: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in NLP

is precision then the accuracy?

$\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$

in our example:
precision = 0.6
accuracy = 11/14 ≈ 0.79

difference:
precision is about the system's decisions on the instances it marked as positive
accuracy is about the correctness of all decisions

what makes sense depends on the task

Page 56: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in IR

which of the measures make sense to evaluate IR: precision, recall, f1-score, accuracy?

what’s the goal of IR systems?is the information need satisfied?is the user happy?happiness is elusive to measure

what’s an alternative?relevance of search resultsnow: how to measure relevance?

Page 57: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in IR

measuring relevance with a benchmark:
a set of queries
a document collection
relevance judgments

TREC data sets are popular benchmarks

there are several issues, which we ignore (for now)

confusion matrix for IR

                         manual judgments
                         relevant   not relevant
system   relevant        TP         FP
         not relevant    FN         TN

Page 58: Natural Language Processing for IR & IR Evaluation

Evaluation Measures in IR

we can calculate:
precision
recall
f1-score
accuracy

but are we done?

shortcomings:
only for binary judgments (relevant / not relevant)
only for unranked results
how do we get manual judgments for all documents?

we need measures for ranked retrieval

Page 59: Natural Language Processing for IR & IR Evaluation

Measures for Ranked Retrieval

precision at k:
set a rank threshold k (e.g., 1, 3, 5, 10, 20, 50)
compute the percentage of relevant documents in the top k

$\text{precision@}k = \frac{\text{relevant documents in top } k}{k}$

ignores all documents ranked lower than k

example:

rank:    1 2 3 4 5 6 7 8 9 10 11
         n r r r n n n n r n  r

precision @1 = 0, @3 = 0.667, @5 = 0.6, @10 = 0.4
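A minimal sketch of precision at k for this example ranking:

```python
ranking = list("nrrrnnnnrnr")  # r = relevant, n = not relevant

def precision_at_k(ranking, k):
    return ranking[:k].count("r") / k

for k in (1, 3, 5, 10):
    print(k, round(precision_at_k(ranking, k), 3))
# 1 0.0 / 3 0.667 / 5 0.6 / 10 0.4
```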

Page 60: Natural Language Processing for IR & IR Evaluation

Measures for Ranked Retrieval

recall at k

analogous to precision at k
precision-recall curve (http://nlp.stanford.edu/IR-book/html/htmledition/img532.png)

Page 61: Natural Language Processing for IR & IR Evaluation

Measures for Ranked Retrieval

average precision:
precision at all ranks r with a relevant document
compute precision at k for each such r
(typically with a cut-off, i.e., lower ranks are not judged / considered)

example:

rank:    1 2 3 4 5 6 7 8 9 10 11
         n r r r n n n n r n  r

$AP = \frac{1/2 + 2/3 + 3/4 + 4/9 + 5/11}{5} \approx .56$

compute: P@2, P@3, P@4, P@9, P@11
number of relevant documents: 5
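And as a small sketch on the same ranking (n_relevant is the total number of relevant documents; MAP is then just the mean of AP over all queries):

```python
def average_precision(ranking, n_relevant):
    hits, total = 0, 0.0
    for rank, label in enumerate(ranking, start=1):
        if label == "r":
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / n_relevant

ranking = list("nrrrnnnnrnr")
print(round(average_precision(ranking, 5), 2))  # 0.56
```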

Page 62: Natural Language Processing for IR & IR Evaluation

Measures for Ranked Retrieval

so far: measures for single queries only

mean average precision: sum of the average precisions divided by the number of queries

$MAP = \frac{\sum_{i=1}^{u} AP_i}{u}$

example:
for query 1, AP1 = 0.62
for query 2, AP2 = 0.44

$MAP = \frac{AP_1 + AP_2}{2} = 0.53$

MAP is frequently reported in research papers

attention: each query is worth the same!

assumption: the more relevant documents, the better

Page 63: Natural Language Processing for IR & IR Evaluation

Beyond Binary Relevance

not realistic: documents are either relevant or not relevant (0 / 1)

much better:
highly relevant documents are more useful
lower ranks are less useful (likely to be ignored)

Page 64: Natural Language Processing for IR & IR Evaluation

Beyond Binary Relevance

discounted cumulative gain:
graded relevance as a measure of usefulness (gain)
gain is accumulated, starting at the top, and reduced (discounted) at lower ranks

discount rate typically used: 1/log(rank) (with base 2)

relevance judgments: scale of [0, r], with r > 2

Page 65: Natural Language Processing for IR & IR Evaluation

Beyond Binary Relevance

cumulative gain: ratings of the top n ranked documents $r_1, r_2, \dots, r_n$

$CG = r_1 + r_2 + \dots + r_n$

discounted cumulative gain at rank n:

$DCG = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \dots + \frac{r_n}{\log_2 n}$

scores highly depend on the judgments for the queries

normalized discounted cumulative gain:
normalize DCG at rank n by the DCG at n of the ideal ranking
ideal ranking of relevance scores: 3, 3, 3, 2, 2, 1, 1, 1, 0, 0, . . .
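A minimal sketch of DCG and nDCG following the formula above (hypothetical graded judgments):

```python
import math

def dcg(gains):
    # DCG = r1 + r2/log2(2) + r3/log2(3) + ...
    return gains[0] + sum(g / math.log2(i)
                          for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    # Normalize by the DCG of the ideal (descending) ranking.
    return dcg(gains) / dcg(sorted(gains, reverse=True))

gains = [3, 2, 3, 0, 1, 2]  # hypothetical graded judgments (scale 0-3)
print(round(dcg(gains), 3), round(ndcg(gains), 3))
```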

Page 66: Natural Language Processing for IR & IR Evaluation

Beyond Binary Relevance

popular to evaluate Web search:
nDCG
reciprocal rank: $rr = \frac{1}{K}$, with K the rank of the first relevant document
mean reciprocal rank: mean rr over multiple queries
exploiting click data (you need the data to do that . . . )
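A minimal sketch of (mean) reciprocal rank (hypothetical result lists, r = relevant):

```python
def reciprocal_rank(ranking):
    # rr = 1/K, with K the rank of the first relevant document (0 if none).
    for rank, label in enumerate(ranking, start=1):
        if label == "r":
            return 1 / rank
    return 0.0

queries = [list("nrn"), list("rnn")]  # hypothetical rankings for two queries
mrr = sum(reciprocal_rank(q) for q in queries) / len(queries)
print(mrr)  # (1/2 + 1/1) / 2 = 0.75
```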

Page 67: Natural Language Processing for IR & IR Evaluation

Summary

NLP 4 IR:
as text is not fully structured, plain keyword search is not enough
pre-processing documents and queries is important
tokenization, stemming, lemmatization, stop word removal are frequently used

Ambiguities:
language is often ambiguous
there are several levels of ambiguities

NLP tasks:
part-of-speech tagging helps to generalize
named entities are important in IR

Page 68: Natural Language Processing for IR & IR Evaluation

Summary

Evaluation Measures:
precision, recall, f1-score (in NLP)
IR evaluation is different from NLP evaluation

Assignment 1: the slides will help you a lot!

Thank you for your attention!

Page 69: Natural Language Processing for IR & IR Evaluation

Thanks
some slides / examples are taken from / are similar to those of:

Klaus Berberich, Saarland University, previous ATIR lecture

Manning, Raghavan, Schütze: Introduction to Information Retrieval(including slides to the book)

Yannick Versley, Heidelberg University, Introduction to ComputationalLinguistics.
