Advanced Topics in Information Retrieval
Natural Language Processing for IR & IR Evaluation
Vinay Setty Jannik Strötgen
[email protected] [email protected]
ATIR – April 28, 2016
Organizational Things
please register – if you haven't done so
  mail to atir16 (at) mpi-inf.mpg.de with (i) name, (ii) matriculation number, (iii) preferred email address
  even if you do not want to get the ECTS points
  important for announcements about assignments, rooms etc.

assignments
  first assignment today
  remember: we can only open PDFs
  50% of points (not of exercises) with serious, presentable …
Outline
1 Simple Linguistic Preprocessing
2 Linguistics
3 Further Linguistic (Pre-)Processing
4 NLP Pipeline Architectures
5 Evaluation Measures
Why NLP Foundations for IR?
different types of data
  structured data vs. unstructured data (vs. semi-structured data)
structured data
typically refers to information in tables
Employee   Manager   Salary
Johnny     Frank     50000
Jack       Johnny    60000
Jim        Johnny    50000

numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Johnny
Why NLP Foundations for IR?
unstructured data
  typically refers to "free text"
  not just string matching queries

typical distinction
  structured data → "databases"
  unstructured data → "information retrieval"
(NLP foundations important for IR)

actually: semi-structured data
  almost always some structure: title, bullets
  facilitates semi-structured search
  e.g., title contains NLP and bullet contains data
  (not to mention the linguistic structure of text . . . )
Why NLP Foundations for IR?
standard procedure in IR
  starting point: documents and queries
  pre-processing of documents and queries typically includes
  – tokenization (e.g., splitting at white spaces and hyphens)
  – stemming or lemmatization (group variants of the same word)
  – stopword removal (get rid of words with little information)
this results in a bag (or sequence) of indexable terms
many NLP concepts mentioned in previous lecture
today: linguistic / NLP foundations for IR
Why NLP Foundations for IR?
goal of this lecture
  NLP concepts are not just buzz words,
  NLP concepts shall be understood

example: what's the difference between lemmatization and stemming?
Contents
1 Simple Linguistic Preprocessing
   Tokenization
   Lemmatization & Stemming
2 Linguistics
3 Further Linguistic (Pre-)Processing
4 NLP Pipeline Architectures
5 Evaluation Measures
Tokenization
the task
given a character sequence, split it into pieces called tokens
tokens are often loosely referred to as terms/words
last lecture: "splitting at white spaces and hyphens"
seems to be trivial

type vs. token (vs. term)
  token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
  type: class of all tokens containing the same character sequence
  term: (normalized) type included in the IR system's dictionary
Tokenization – Example

type vs. token – example
  a rose is a rose is a rose
  how many tokens? 8
  how many types? 3 ({a, is, rose})

type vs. token – example
  A rose is a rose is a rose
  knowing about normalization is important

set-theoretical view
  tokens → multiset (multiset: bag of words)
  types → set
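A minimal sketch of the token/type distinction in Python (naive whitespace tokenization, an assumption for illustration):

```python
from collections import Counter

sentence = "a rose is a rose is a rose"

tokens = sentence.split()   # multiset view: every occurrence counts
types = set(tokens)         # set view: distinct character sequences

print(len(tokens))          # 8 tokens
print(sorted(types))        # 3 types: ['a', 'is', 'rose']
print(Counter(tokens))      # the bag of words with frequencies
```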
Tokenization – Example
tokenization – example
Mr. O'Neill thinks rumors about Chile's capital aren't amusing.

simple strategies
  split at white spaces and hyphens
  split on all non-alphanumeric characters:
  mr | o | neill | thinks | rumors | about | chile | s | capital | aren | t | amusing

is that good? there are many alternatives
  → o | neill – oneill – neill – o'neill – o' | neill
  → aren | t – arent – are | n't – aren't

even simple (NLP) tasks are not trivial!
most important: queries and documents have to be preprocessed identically!
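A sketch of the two naive strategies (the regex patterns are illustrative assumptions, not the lecture's reference implementation):

```python
import re

text = "Mr. O'Neill thinks rumors about Chile's capital aren't amusing."

# strategy 1: split at white spaces and hyphens
tokens_ws = [t for t in re.split(r"[\s-]+", text) if t]

# strategy 2: split on all non-alphanumeric characters (after lowercasing)
tokens_alnum = [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

print(tokens_ws)
print(tokens_alnum)
# strategy 2 yields: mr | o | neill | thinks | rumors | about |
#                    chile | s | capital | aren | t | amusing
```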
Tokenization
queries and documents have to be preprocessed identically
  tokenization choices determine which (Boolean) queries match
  identical preprocessing guarantees that a sequence of characters in the query matches the same sequence in the text

further issues
  what about hyphens? co-education vs. drag-and-drop
  what about names? San Francisco, Los Angeles
  tokenization is language-specific
  – "this is a sequence of several words"
  – noun compounds are not separated in German: "Lebensversicherungsgesellschaftsangestellter" vs. "life insurance company employee"
  – a compound splitter may improve IR
Lemmatization & Stemming
tokenization is just one step during preprocessing
  lemmatization
  stemming
  stopword removal

lemmatization and stemming
  two tasks, same goal
  → to group variants of the same word

what's the difference?
  stemming vs. lemmatization
  stem vs. lemma
Lemma & Lemmatization
idea
  reduce inflectional forms (all variants of a "word") to base form

examples
  am, are, be, is → be
  car, cars, car's, cars' → car

lemmatization
  proper reduction to dictionary headword form

lemma
  dictionary form of a set of words
Stem & Stemming
idea
  reduce terms to their "roots"

examples
  are → ar
  automate, automates, automatic, automation → automat

stemming
  suggests crude affix chopping

stem
  root form of a set of words (not necessarily a word itself)
Stemming and Lemmatization – Examples
the boy’s cars are different colors
lemmatized
the | boy | car | be | different | color

stemmed
the | boy | car | ar | differ | color
Stemming and Lemmatization – Examples
for example compressed and compression are both accepted as equivalent to compress.

lemmatized
for | example | compress | and | compression | be | both | accept | as | equivalent | to | compress

stemmed
for | exampl | compress | and | compress | ar | both | accept | as | equival | to | compress
Stemming
popular stemmers
  Porter's algorithm (http://tartarus.org/martin/PorterStemmer/)
  Snowball (http://snowballstem.org/demo.html)

what's better for IR? stemming or lemmatization?
try it yourself – e.g., with the sketch below!
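A minimal sketch for trying both, assuming NLTK with the WordNet data downloaded:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# one-time setup: import nltk; nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("automation"))         # 'automat' – crude affix chopping
print(stemmer.stem("are"))                # 'ar'
print(lemmatizer.lemmatize("cars", "n"))  # 'car' – dictionary headword form
print(lemmatizer.lemmatize("are", "v"))   # 'be'  – lemmatization needs POS info
```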
Stop Words
stop words
  have little semantic content
  are extremely frequent: the top 30 words account for about 30% of postings
  occur in almost every document, i.e., are not discriminative
  → high document frequency

example of a stop word list
a, an, and, are, as, at, be, by, for, from, has, he, in,
is, it, its, of, on, that, the, to, was, were, will, with
what types of words are these?
Stop Word Removal
idea
  based on a stop list, remove all stop words, i.e., stop words are not part of the IR system's dictionary
  saves a lot of memory
  makes query processing much faster

trend (in particular in web search): no stop word removal
  there are good compression techniques
  there are good query optimization techniques

stop words are needed – examples
  King of Norway
  let it be
  to be or not to be
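A minimal sketch of stop list filtering, using the example list above:

```python
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
              "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
              "to", "was", "were", "will", "with"}

def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("to be or not to be".split()))
# ['or', 'not'] – the query has lost its meaning
```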
Contents
1 Simple Linguistic Preprocessing
2 Linguistics
   Parts-of-Speech
   Ambiguities
   Semantic Relations
   Named Entities
3 Further Linguistic (Pre-)Processing
4 NLP Pipeline Architectures
5 Evaluation Measures
Parts-of-Speech
alternative distinction between stop words and others
  function words: used to make sentences grammatically correct
  content words: carry the meaning of a sentence

function words: auxiliary verbs, prepositions, conjunctions, determiners, pronouns
content words: nouns, verbs, adjectives, adverbs

how many parts-of-speech are there?
  between 8 and hundreds of different parts-of-speech
  what's useful depends on the application and language
Ambiguities
one word, one part-of-speech?
  can we can fish in a can?
  can: auxiliary, verb, noun
Levels of Ambiguities
speech recognition
  it's hard to recognize speech
  it's hard to wreck a nice beach

prepositional attachment
  the boy saw the man with the telescope

syntax / morphology
  time flies (noun / verb) like (verb / preposition) an arrow

word level ambiguities
  "can": auxiliary, verb, noun

disambiguation: resolution of ambiguities
word level ambiguities are most crucial for IR
Semantic Relations between Words
synonyms → query for one, find documents with either one
  different words, same meaning
  car vs. automobile

homographs → disambiguate or diversify results
  same spelling, different meaning
  bank (river) vs. bank (finance)

homophones → problem with spoken queries
  same pronunciation, different meaning
  there vs. their vs. they're

homonyms
  same spelling, same pronunciation, different meaning
Named Entities
entity
  anything you can refer to with a name
  location, person, organization
  facilities, vehicles, songs, movies, products
  (and domain-dependent ones: genes & proteins, ...)
  sometimes: numbers, dates

relevant in IR
  entities are popular and extremely frequent in queries

names are highly ambiguous
  Washington → place(s), person(s), (government)
  Springfield
Contents
1 Simple Linguistic Preprocessing
2 Linguistics
3 Further Linguistic (Pre-)Processing
   Normalizations
   Part-of-Speech Tagging
   Chunking
   Parsing – Syntactic Analysis
4 NLP Pipeline Architectures
5 Evaluation Measures
Normalizations
indexed terms have to be normalized
  lemmatization
  stemming

some things need to be done before that:
  U.S.A. vs. USA
  anti-discriminatory vs. antidiscriminatory
  usa vs. USA

terms
  normalization results in terms
  a term is a normalized word type, an entry in an IR system's dictionary
Part-of-Speech Tagging
idea
  the number of words in a language is unlimited
  – few frequent words, many infrequent words
  Zipf's law: $P_n \propto 1/n^a$
  the number of parts-of-speech is limited
  – Dionysius Thrax of Alexandria (100 BC): 8 parts-of-speech
  – in NLP: up to hundreds of part-of-speech tags (application- and language-dependent)
  many words are ambiguous

example
  The/DET newspaper/NN published/VBD ten/CD articles/NNS ./.
  Can/AUX we/PRP can/VB fish/NN in/IN a/DET can/NN ./.
Part-of-Speech Tagging
part-of-speech tags
  allow for a higher degree of abstraction to estimate likelihoods

what's the likelihood that:
  "an amazing" is followed by "goalkeeper"
  "an amazing" is followed by "scored"
  "determiner adjective" is followed by "noun"
  "determiner adjective" is followed by "verb"

automatic assignment of part-of-speech tags
  e.g., Penn Treebank tagset: 36 tags (+ 9 punctuation tags)
  ambiguities can be resolved via context
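As a quick off-the-shelf illustration (assuming NLTK and its default tagger models are installed):

```python
import nltk
# one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Can we can fish in a can?")
print(nltk.pos_tag(tokens))
# the three occurrences of 'can' should get different tags from context,
# e.g., modal (MD), verb (VB), and noun (NN)
```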
Part-of-Speech Tagging
way to go:
  input: sequence of (tokenized) words
  output: chain of tokens with their part-of-speech tags
  goal: most likely part-of-speech tags for the sequence
  → ambiguities shall be resolved
  a typical classification problem

is it tough?
  most words in English are not ambiguous
  but most word occurrences in English are ambiguous
  disambiguation is required

today's taggers
  about 97% accuracy (but highly domain-dependent)
Part-of-Speech Tagging
approaches
  rule-based taggers
  probabilistic taggers
  transformation-based taggers

probabilistic taggers
  given: manually annotated training data ("gold standard")
  learn probabilities based on training data
  estimate probabilities of POS tags given a word in a context
  → Hidden Markov Models
Part-of-Speech Tagging
Hidden Markov Models
  based on Bayesian inference
  goal: given a sequence of tokens, assign a sequence of POS tags
  given all possible tag sequences, which one is most likely?

  $\hat{t}_1^n = \operatorname{argmax} P(t_1^n \mid w_1^n)$

  using Bayes' rule, we get

  $\hat{t}_1^n = \operatorname{argmax} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} \;\rightarrow\; \hat{t}_1^n = \operatorname{argmax} P(w_1^n \mid t_1^n)\, P(t_1^n)$

assumptions:
  the probability of a word depends on its own tag only:
  $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
  the probability of a tag depends on the previous tag only:
  $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
combining both:

  $\hat{t}_1^n = \operatorname{argmax} P(w_1^n \mid t_1^n)\, P(t_1^n) \approx \operatorname{argmax} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$

maximum likelihood estimation based on a corpus
  $P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$
  $P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$
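A toy sketch of these maximum likelihood estimates (the tagged corpus below is a hypothetical example):

```python
from collections import Counter

# hypothetical tagged corpus; "<s>" marks the sentence start
corpus = [
    [("<s>", "<s>"), ("the", "DET"), ("can", "NN"), ("is", "VB"), ("red", "ADJ")],
    [("<s>", "<s>"), ("we", "PRP"), ("can", "AUX"), ("fish", "VB")],
]

tag_count = Counter()    # C(t)
tag_bigram = Counter()   # C(t_{i-1}, t_i)
word_tag = Counter()     # C(t, w)

for sentence in corpus:
    tag_count["<s>"] += 1
    for (_, prev_tag), (word, tag) in zip(sentence, sentence[1:]):
        tag_bigram[(prev_tag, tag)] += 1
        word_tag[(tag, word)] += 1
        tag_count[tag] += 1

def p_transition(tag, prev_tag):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return tag_bigram[(prev_tag, tag)] / tag_count[prev_tag]

def p_emission(word, tag):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return word_tag[(tag, word)] / tag_count[tag]

print(p_transition("NN", "DET"))  # 1.0 – DET is always followed by NN here
print(p_emission("can", "NN"))    # 1.0 – the only NN token is 'can'
```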
Part-of-Speech Tagging
in information retrieval
  determine content words in a query based on POS tags
  helpful for named entity recognition → semantic search
Chunking
(simple) grouping of tokens that belong together
  most popular: noun phrase (NP) chunking
  but also: verb phrases

example
[ Paris ]NP [ has been ]VP [ a wonderful stop ]NP during [ my travel ]NP – just as [ New York City ]NP .

why chunking for IR?
  simpler than full syntactic analysis
  already provides some structure
Parsing
goal: syntactic structure of a sentence
two views of linguistic structure
  constituency (phrase) structure
  dependency structure

example (man has the telescope)
The boy saw the man with the telescope

constituency structure:
[ [ The boy ]NP [ [ saw ]VP [ [ the man ]NP [ with [ the telescope ]NP ]PP ]NP ]VP ]S
dependency structure:
(figure: dependency tree of "The boy saw the man with the telescope", with ROOT pointing to "saw" and subj, obj, det edges)

helpful for IR?
  relation extraction for knowledge harvesting
Named Entity Recognition
tasks
  extraction → determine the boundaries
  classification → assign a class (PER, LOC, ORG, . . . )

systems
  rule-based → with gazetteers, context-based rules (Mr.), . . .
  machine learning → features: mixed case (eBay), ends in digit (A9), all caps (BMW), . . .
  several tools available (e.g., Stanford NER)

extraction is good, but normalization is better
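A small sketch using NLTK's built-in NE chunker (Stanford NER would be an alternative); the required NLTK models are assumed to be downloaded:

```python
import nltk
# one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
#                 nltk.download('maxent_ne_chunker'); nltk.download('words')

tokens = nltk.word_tokenize("George Washington was the first president of the United States.")
tree = nltk.ne_chunk(nltk.pos_tag(tokens))   # NER builds on POS tags

for subtree in tree.subtrees():
    if subtree.label() != "S":               # skip the sentence root
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# e.g., PERSON George Washington / GPE United States
```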
Named Entity Normalization
same task, many names
  normalization
  linking
  resolution
  grounding

example: Washington
  /wiki/Washington,_D.C.
  /wiki/Washington_%28state%29
  /wiki/Washington_Irving
  /wiki/Washington_Redskins
  /wiki/George_Washington

tools
  several tools available (AIDA, . . . )
Contents
1 Simple Linguistic Preprocessing
2 Linguistics
3 Further Linguistic (Pre-)Processing
4 NLP Pipeline Architectures
5 Evaluation Measures
NLP Pipeline Architectures
NLP tasks can often be split into multiple sub-tasks
e.g., dependency parsing:
  – sentence splitting
  – tokenization
  – part-of-speech tagging
  – parsing
(several pre-processing components in Elasticsearch)

pre-processing of corpora, e.g., for semantic search
  UIMA https://uima.apache.org/
  GATE https://gate.ac.uk/
  NLTK http://www.nltk.org/
  Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/
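As an illustration of the pipeline principle (plain NLTK rather than one of the frameworks above), chaining the first three sub-tasks listed for dependency parsing:

```python
import nltk
# one-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def pipeline(document):
    """Sentence splitting -> tokenization -> POS tagging."""
    for sentence in nltk.sent_tokenize(document):
        tokens = nltk.word_tokenize(sentence)
        yield nltk.pos_tag(tokens)   # a parser would consume these pairs next

doc = "Mr. O'Neill lives in Chile. He thinks rumors aren't amusing."
for tagged in pipeline(doc):
    print(tagged)
```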
The Pipeline Principle – Why a (UIMA) Pipeline
... postponed to the information extraction lecture
Contents
1 Simple Linguistic Preprocessing
2 Linguistics
3 Further Linguistic (Pre-)Processing
4 NLP Pipeline Architectures
5 Evaluation Measures
   Evaluating NLP Systems
   Evaluating IR Systems
Evaluation Measures
what is “good” / “correct” in information retrieval?
Evaluation Measures in NLP
let’s start with a simple NLP task
example
given a sequence of tokens, mark the nouns

can a red rose be a tree or a fly or just a rose
gold annotations (nouns in brackets):
can a red [rose] be a [tree] or a [fly] or just a [rose]
example system output:
can a red rose be a tree or a fly or just a rose

how good is the system's output?
Evaluation Measures in NLP
frequently used measures
  precision, recall, f-score
  based on evaluating all of the system's decisions
correct decisions: 3 + 8 = 11?
we should count them separately
true positives: 3     false positives: 2
true negatives: 8     false negatives: 1
Evaluation Measures in NLP
confusion matrix

                     ground truth
                     pos    neg
system   pos         TP     FP
         neg         FN     TN

$\text{precision} = \frac{TP}{TP+FP}$   $\text{recall} = \frac{TP}{TP+FN}$   $\text{f1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

or in words
  precision: ratio of instances correctly marked as positive by the system to all instances marked as positive by the system
  recall: ratio of instances correctly marked as positive by the system to all instances marked as positive in the gold standard
  f1-score: balanced harmonic mean of precision and recall
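A minimal sketch of these measures (including the accuracy discussed below):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    """Balanced harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# the noun-tagging example: TP=3, TN=8, FP=2, FN=1
p, r = precision(3, 2), recall(3, 1)
print(p, r, f1_score(p, r))   # 0.6  0.75  0.666...
print(accuracy(3, 8, 2, 1))   # 0.7857...
```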
for the noun-tagging example (TP = 3, TN = 8, FP = 2, FN = 1):
  precision = 3 / (3+2) = 0.6
  recall = 3 / (3+1) = 0.75
  f1-score = (2 × 0.6 × 0.75) / (0.6 + 0.75) = 2/3
Evaluation Measures in NLP
is precision then the accuracy?

$\text{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$

in our example
  precision = 0.6
  accuracy = 0.78

difference
  precision only considers instances marked as positive
  accuracy is about the correctness of all decisions

what makes sense depends on the task
Evaluation Measures in IR
which of the measures make sense to evaluate IR: precision, recall, f1-score, accuracy?

what's the goal of IR systems?
  is the information need satisfied?
  is the user happy?
  happiness is elusive to measure

what's an alternative?
  relevance of search results
  now: how to measure relevance?
Evaluation Measures in IR
measuring relevance with a benchmark
  a set of queries
  a document collection
  relevance judgments
(TREC data sets are popular benchmarks)

there are several issues, which we ignore (for now)

confusion matrix for IR

                          manual judgments
                          relevant    not relevant
system   relevant         TP          FP
         not relevant     FN          TN
Evaluation Measures in IR
we can calculate
  precision
  recall
  f1-score
  accuracy

but are we done?

shortcomings
  only for binary judgments (relevant / not relevant)
  only for unranked results
  how do we get manual judgments for all documents?

→ we need measures for ranked retrieval
Measures for Ranked Retrieval
precision at k
  set a rank threshold k (e.g., 1, 3, 5, 10, 20, 50)
  compute the percentage of relevant documents in the top k
  $\text{precision@}k = \frac{\text{relevant documents in top } k}{k}$
  ignores all documents ranked lower than k

example
  rank:     1 2 3 4 5 6 7 8 9 10 11
  result:   n r r r n n n n r n  r
  precision@1 = 0, precision@3 = 0.667, precision@5 = 0.6, precision@10 = 0.4
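A minimal sketch, representing the ranking as a list of 0/1 relevance flags:

```python
def precision_at_k(relevance, k):
    """Fraction of relevant documents among the top-k results."""
    return sum(relevance[:k]) / k

# the example ranking: n r r r n n n n r n r
ranking = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
for k in (1, 3, 5, 10):
    print(k, precision_at_k(ranking, k))   # 0.0, 0.667, 0.6, 0.4
```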
Measures for Ranked Retrieval
recall at k
  analogous to precision at k, but dividing by the total number of relevant documents
  precision-recall curve (http://nlp.stanford.edu/IR-book/html/htmledition/img532.png)
Measures for Ranked Retrieval
average precision
  precision at all ranks r that hold a relevant document
  compute precision at k for each such r, then average
  (typically with a cut-off, i.e., lower ranks not judged / considered)

example
  rank:     1 2 3 4 5 6 7 8 9 10 11
  result:   n r r r n n n n r n  r
  compute: p@2, p@3, p@4, p@9, p@11
  number of relevant documents: 5
  AP = (1/2 + 2/3 + 3/4 + 4/9 + 5/11) / 5 = 0.56
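The same example as a sketch:

```python
def average_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    precisions, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / hits if hits else 0.0

ranking = [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
print(round(average_precision(ranking), 2))   # 0.56
```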
Measures for Ranked Retrieval
so far
  measures for single queries only

mean average precision
  sum of average precision values divided by the number of queries u
  $\text{MAP} = \frac{\sum_{i=1}^{u} AP_i}{u}$

example
  for query-1, AP1 = 0.62
  for query-2, AP2 = 0.44
  MAP = (AP1 + AP2) / 2 = 0.53

MAP is frequently reported in research papers
attention: each query is worth the same!
assumption: the more relevant documents, the better
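Continuing the sketch, MAP is simply the mean of the per-query AP values:

```python
def mean_average_precision(per_query_ap):
    """Unweighted mean – each query is worth the same."""
    return sum(per_query_ap) / len(per_query_ap)

print(mean_average_precision([0.62, 0.44]))   # 0.53
```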
Beyond Binary Relevance
not realistic
  documents are either relevant or not relevant (0 / 1)

much better
  highly relevant documents are more useful
  lower ranks are less useful (likely to be ignored)
discounted cumulative gain
  graded relevance as a measure of usefulness (gain)
  gain is accumulated, starting at the top, and reduced (discounted) at lower ranks

discount rate
  typically used: 1/log(rank) (with base 2)

relevance judgments
  scale of [0, r], with r > 2
cumulative gain
  ratings of the top n ranked documents: r1, r2, ..., rn
  $CG = r_1 + r_2 + \ldots + r_n$

discounted cumulative gain at rank n
  $DCG = r_1 + \frac{r_2}{\log_2 2} + \frac{r_3}{\log_2 3} + \ldots + \frac{r_n}{\log_2 n}$
  scores highly depend on the judgments for the queries

normalized discounted cumulative gain
  normalize DCG at rank n by the DCG at rank n of the ideal ranking
  ideal ranking of relevance scores: 3, 3, 3, 2, 2, 1, 1, 1, 0, 0, . . .
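A sketch of DCG and nDCG over graded judgments:

```python
from math import log2

def dcg(ratings):
    """DCG = r1 + sum over i >= 2 of r_i / log2(i)."""
    return ratings[0] + sum(r / log2(i) for i, r in enumerate(ratings[1:], start=2))

def ndcg(ratings):
    """Normalize by the DCG of the ideal (sorted) ranking."""
    return dcg(ratings) / dcg(sorted(ratings, reverse=True))

graded = [3, 2, 3, 0, 1, 2]   # graded relevance of the top 6 results
print(round(dcg(graded), 3), round(ndcg(graded), 3))
```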
popular to evaluate Web search
  nDCG
  reciprocal rank: $rr = \frac{1}{K}$, with K the rank of the first relevant document
  mean reciprocal rank: mean rr over multiple queries
  exploiting click data (you need the data to do that . . . )
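A last sketch, mean reciprocal rank over several rankings:

```python
def reciprocal_rank(relevance):
    """1/K for the rank K of the first relevant result; 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

print(mean_reciprocal_rank([[0, 1, 1], [1, 0, 0], [0, 0, 1]]))  # (1/2 + 1 + 1/3) / 3
```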
Summary
NLP 4 IR
  as text is not fully structured, plain keyword search is not enough
  pre-processing documents and queries is important
  tokenization, stemming, lemmatization, stop word removal are frequently used

Ambiguities
  language is often ambiguous
  there are several levels of ambiguities

NLP tasks
  part-of-speech tagging helps to generalize
  named entities are important in IR
Summary
Evaluation Measures
  precision, recall, f1-score (in NLP)
  IR evaluation is different from NLP evaluation

Assignment 1
  the slides will help you a lot!
Thank you for your attention!
Thanks
some slides / examples are taken from / similar to those of:

Klaus Berberich, Saarland University, previous ATIR lecture

Manning, Raghavan, Schütze: Introduction to Information Retrieval (including slides accompanying the book)

Yannick Versley, Heidelberg University, Introduction to Computational Linguistics