Lecture 13: Evaluation: Precision and Recall
University of Washington
Linguistics 473: Computational Linguistics Fundamentals
Thursday, September 1, 2011

Reminder: start the recording
Some material from Will Lewis
Announcements
• Project 5: Naïve Bayesian Classifier
  – Due tonight at 11:45 p.m.
  – Solution will be posted on the course website
• Writing Assignment
  – Due next Tuesday at 11:45 p.m.
• Project 6: Edit Distance
  – Due next Thursday at 11:45 p.m.
Paper selection
• Google for “best paper” AND…
  – ACL
  – COLING
  – NAACL
  – EMNLP
  – HLT
  – IJCNLP
  – ANLP
Today
• A shorter lecture
  – Evaluation
  – Brief mention of some NLP tasks we haven’t mentioned this quarter
• I’ll stay online afterwards to take questions about:
  – today’s material
  – project 5: Naïve Bayesian classifier
  – project 6: Edit distance
Evaluation
• Contemporary research in computational linguistics is unacceptable if it is not accompanied by principled evaluation
• Quantitative measurement of results is what differentiates our field from armchair theorizing
• It is one of the cross-cutting pillars of the CLMA curriculum to emphasize the critical importance of evaluation at all stages of research
• To be published, all research must:
  – establish a baseline, and
  – quantitatively show that it improves on the baseline

“An important recent development in NLP has been the use of much more rigorous standards for the evaluation of NLP systems”
(Manning and Schütze)
Basic Evaluation
• “How well does the system work?”
• Possible domains for evaluation
  – Processing time of the system
  – Space usage of the system
  – Human satisfaction
  – Correctness of results ← today’s topic
Example
• You are building a system which automatically provides short, human-readable summaries of a set of documents on a given topic
• Your system picks sentences from the documents based on word co-occurrences, and presents these sentences as the summary
• We want to evaluate the “quality” of the results
• One choice in such a system is whether you should use stemming when determining the word co-occurrences
• Let’s briefly examine stemming
Stemming
• Morphological suffixes can make our data more sparse
• In content-analysis tasks (IE, IR, summarization), we may only care about ‘stems,’ because they carry the “content” of the lemma

Example:
  He doesn’t like to shop.
  She shops at the mall.
  I went shopping last week.
  Ben shopped until he dropped.
  Bill is quite an avid shopper.
Porter stemmer
• One well-known stemming algorithm for English is the Porter stemmer
  – M. F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.
• It is a heuristic which contains lots of code like this:
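The actual code sample has not survived in this copy of the slides. As an illustrative sketch only (a simplified Python rendering of Porter's step-1a plural rules, not his implementation), the algorithm is built from suffix-rewrite rules like these:

```python
# A simplified rendering of Porter's step-1a suffix rules.
# Each rule is (suffix, replacement); the first matching suffix wins.
RULES = [
    ("sses", "ss"),   # caresses -> caress
    ("ies",  "i"),    # ponies   -> poni
    ("ss",   "ss"),   # caress   -> caress (unchanged)
    ("s",    ""),     # cats     -> cat
]

def step1a(word):
    """Apply the first matching plural-stripping rule to a lowercase word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The full algorithm has several more steps like this, each guarded by conditions on the shape of the remaining stem.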
Porter stemmer

Stemmed output:
  two medic expert testifi wednesdai in the dope trial of a former east german sport doctor said the femal swimmer thei examin show health damag link to performance-enhanc drug includ liver damag and excess facial hair

Original text:
  Two medical experts testifying Wednesday in the doping trial of a former East German sports doctor said the female swimmers they examined showed health damage linked to performance-enhancing drugs, including liver damage and excessive facial hair.
Back to our example
• Looking at the output of the Porter stemmer makes linguists cringe
• For the document summarizer system:

“We elected not to use the Porter stemmer because we were getting good evaluation scores with our system already, and we didn’t like how it turns the linguistic data that we had carefully cleaned into gibberish.”
Evaluate!
(Table: ROUGE-1, ROUGE-2, and ROUGE-SU4 scores, each with min / max / avg columns; the score values are not recoverable from this copy.)
Now who’s cringing?
• Although the crude approach of the Porter stemmer seems intuitively ugly, it improves our evaluation metric
• As a computational linguist, you must remain objective
• If you have a legitimate linguistic intuition in mind, test it!
• Evaluate, evaluate, evaluate
“We elected not to use the Porter stemmer because we were getting good evaluation scores with our system already, and we didn’t like how it turns the linguistic data that we had carefully cleaned into gibberish.”
Now who’s cringing?
• As a professional computational linguist, it is this type of statement that should immediately set off your alarm bells
• Always ask:
  – What is the BASELINE?
  – What is the RESULT?
  – How was it EVALUATED?
“We elected not to use the Porter stemmer because we were getting good evaluation scores with our system already, and we didn’t like how it turns the linguistic data that we had carefully cleaned into gibberish.”
Stemming and Performance
• Does stemming help in IR, IE, and document summarization?
• Harman 1991 indicated that it hurt as much as it helped
  – D. Harman (1991) How effective is suffixing? Journal of the American Society for Information Science, 42(7): 7-15.
• Krovetz 1993 shows that it does help
  – R. Krovetz (1993) Viewing morphology as an inference process. In Proceedings of the 16th ACM SIGIR Conference on Research and Development in Information Retrieval, 191-202.
  – Porter-like algorithms work well with smaller documents
  – Krovetz proposes that stemming loses information
  – Derivational morphemes tell us something that helps identify word senses; stemming them loses this
Evaluating a stemmer
• In the summarization example, the stemmer is a small part of the whole system
• We used an end-to-end measurement (ROUGE scores) to evaluate the impact of using stemming
• How would you evaluate the “performance” of stemming by itself?
“Correct” stemming
• This would be difficult, because there’s no “correct” stemming of a word
• The best stemmer is a hash function that conflates all words with the same linguistic stem: the actual stemmed token doesn’t matter

  Stemmer #7            Porter Stemmer
  eat    → 04xBrLt      eat    → eat
  ate    → 04xBrLt      ate    → at
  eating → 04xBrLt      eating → eat
  eats   → 04xBrLt      eats   → eat

  Stemmer #7 is better than the Porter stemmer here: it conflates all four forms, while Porter sends “ate” to a different stem.
Therefore
• Results are sensitive to the specific application.
• Rule of thumb: when there’s an implementation decision to be made:
  – Evaluate the alternatives
  – Document your results and choice
• Some choices may require going back and re-evaluating earlier decisions
Accuracy
• In order to evaluate your system, you need to know what is correct or desired
• A set of data which is labeled with the correct or desired result is called a gold standard
• If the system you are evaluating is a function that maps one input to one output, then you can evaluate accuracy
  – Correct: matches the gold standard
  – Incorrect: otherwise
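As a minimal sketch (not part of the original slides), accuracy over such a one-input/one-output system can be computed like this; the gold labels in the example run are hypothetical:

```python
def accuracy(predictions, gold):
    """Proportion of predictions that match the gold-standard labels."""
    if len(predictions) != len(gold):
        raise ValueError("prediction and gold lists must align")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)

# Hypothetical language-ID run: the third prediction is wrong.
preds = ["Arabic", "French", "Spanish"]
gold  = ["Arabic", "French", "English"]
print(accuracy(preds, gold))  # 2 of 3 correct
```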
Accuracy: example
• With your Naïve Bayesian classifier for language identification, say we are only interested in the single best language it selects for a given input sentence

  “مساء الخير”     → Arabic
  “Bon soir!”      → French
  “Good evening!”  → Spanish (incorrect)

Accuracy: 2/3 ≈ 67%
Error
• You can also measure error. This is the proportion of items that you got wrong

  “مساء الخير”     → Arabic
  “Bon soir!”      → French
  “Good evening!”  → Spanish (incorrect)

Error: 1/3 ≈ 33%
Evaluating classifiers
• Many NLP problems involve classifying things
• For these problems, there is a more nuanced way to evaluate performance
• Also, note that many NLP problems can be re-stated as classification problems
• Before we talk about evaluating classification results, let’s see how to re-state a problem as a classification problem
Example: Thai sentence-breaking
• In Thai, there is no end-of-sentence punctuation character (such as the period)
• A space character ‘ ’ always appears between sentences
• BUT, the space character is also used in other ways, within a sentence
• The problem: break Thai text into sentences
Sentence-breaking as classification
• Here is some Thai text that we would like to break into one or more sentences
• How can we treat this as a classification problem?
ผมซือ้หนังสอืขําขนัมาทกุเลม่ทีพ่อจะหาซือ้ได ้กระทัง่ฉบบัทีม่รีปูประกอบก็ซือ้ ถอืขึน้มาบนหอ้งในโรงแรมทีผ่มพัก นอกจากนีย้งัซือ้พายเนือ้หมแูละโดนัทอกีหลายสบิอนั ผมกนิพายกบัโดนัทแลว้นั่งทอดหุย่อยูบ่นเตยีง อา่นหนังสอืการต์นูเลม่แลว้เลม่เลา่ ในทีส่ดุผมก็รูส้กึวา่ความงว่งเหงารา้ยกาจคอ่ยๆแอบคบืคลานเขา้มาในตวั ผมจงึเอือ้มหยบิหนังสอืลอนดอนไทมร์ายสปัดาห ์แลว้เปิดหนา้บทบรรณาธกิารคา้งไวต้รงหนา้
Sentence-breaking as classification
For each space character in the text: classify it as either sentence-breaking (sb) or non-sentence-breaking (nsb)
ผมซือ้หนังสอืขําขนัมาทกุเลม่ทีพ่อจะหาซือ้ได▇้กระทั่งฉบบัทีม่รีปูประกอบก็ซือ้▇ถอืขึน้มาบนหอ้งในโรงแรมทีผ่มพัก▇นอกจากนีย้งัซือ้พายเนือ้หมแูละโดนัทอกีหลายสบิอนั▇ผมกนิพายกบัโดนัทแลว้นั่งทอดหุย่อยูบ่นเตยีง▇อา่นหนังสอืการต์นูเลม่แลว้เลม่เลา่▇ในทีส่ดุผมก็รูส้กึวา่ความงว่งเหงารา้ยกาจคอ่ยๆแอบคบืคลานเขา้มาในตวั▇ผมจงึเอือ้มหยบิหนังสอืลอนดอนไทมร์ายสปัดาห▇์แลว้เปิดหนา้บทบรรณาธกิารคา้งไวต้รงหนา้
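The re-statement above can be sketched in Python (helper names are invented for illustration; the course does not prescribe an implementation): every space becomes one classification instance, and the predicted labels drive the split.

```python
def space_instances(text, window=3):
    """Yield (position, context) for each space character; a classifier
    would label each instance 'sb' (sentence-breaking) or 'nsb'."""
    for i, ch in enumerate(text):
        if ch == " ":
            left = text[max(0, i - window):i]    # characters before the space
            right = text[i + 1:i + 1 + window]   # characters after the space
            yield i, (left, right)

def split_sentences(text, classify):
    """Split text at the spaces that the (assumed) classifier labels 'sb'."""
    sentences, start = [], 0
    for i, context in space_instances(text):
        if classify(context) == "sb":
            sentences.append(text[start:i])
            start = i + 1
    sentences.append(text[start:])
    return sentences
```

With a toy classifier that labels every space sentence-breaking, `split_sentences` splits at every space; a real classifier would use the surrounding-character context to decide.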
Reading the literature
• Refereed journal papers will have Results and Evaluation sections
• Evaluation is often shown in a table versus previous work
• If skimming papers, skip to the results and evaluation section for a summary of the work
http://research.microsoft.com/apps/pubs/default.aspx?id=130868
Evaluating Classifiers
• Classifiers divide items into different categories
• Or, they “label” items
• Or, they put them into different sets
• For each label/category/set, you can say:
  – There are items which are selected
  – There are items which are not selected
• The gold standard tells you which items should have been selected, i.e. “correct”
Precision and Recall
• Precision and Recall are set-based measures
• They evaluate the quality of some set membership, based on a reference set membership
Precision and Recall

Precision: what proportion of the selected items are in the reference set?

Recall: how many items from the reference set got selected?

A less frequently used evaluation measure is fallout (or collateral damage): the proportion of items not in the reference set that were selected
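The three definitions above, as a minimal Python sketch over sets (function names are mine, not from the slides; fallout additionally needs the universe of all items):

```python
def precision(selected, reference):
    """What proportion of the selected items are in the reference set?"""
    return len(selected & reference) / len(selected)

def recall(selected, reference):
    """What proportion of the reference set got selected?"""
    return len(selected & reference) / len(reference)

def fallout(selected, reference, universe):
    """What proportion of the non-reference items got selected?"""
    non_reference = universe - reference
    return len(selected & non_reference) / len(non_reference)
```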
Precision / Recall

  Z = Relevant
  Y = Retrieved
  X = Relevant ∩ Retrieved

  Precision = X / Y
  Recall = X / Z
Precision-recall trade-off
• Usually, precision and recall can be traded for each other by changing your system parameters
• The higher the proportion of correct items you require in the selected set (high precision), the fewer of the total correct items you will select (low recall)
• If you can tolerate a higher proportion of incorrect items in the selected set (low precision), you will capture more of the total correct items (high recall)
Precision-recall trade-off
• It’s easy to get recall of 1.0. Why?
• Return all the documents: then the Retrieved set (Y) contains all of the Relevant set (Z), so X = Z and Recall = X / Z = 1.0
Summary

                   Gold standard X                        Gold standard Y
Your Result X      true positive (tp)                     false positive (fp): type I error
Your Result Y      false negative (fn): type II error     true negative (tn)

  Precision = tp / (tp + fp)
  Recall    = tp / (tp + fn)
  Accuracy  = (tp + tn) / (tp + fp + fn + tn)
  Error     = (fp + fn) / (tp + fp + fn + tn)
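The tp/fp/fn/tn counts and the measures built from them can be sketched in Python (a minimal illustration, not course-provided code):

```python
def confusion(predictions, gold, positive):
    """Count tp/fp/fn/tn for one class treated as 'positive'."""
    tp = fp = fn = tn = 0
    for p, g in zip(predictions, gold):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1          # type I error
        elif g == positive:
            fn += 1          # type II error
        else:
            tn += 1
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Precision, recall, accuracy, and error from the four counts."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "accuracy":  (tp + tn) / total,
        "error":     (fp + fn) / total,
    }
```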
Example
• Tokenizing task

(Table: a sentence that appears to begin “After coming close to a partial settlement a year ago, shareholders who filed civil suits against Ivan F. Boesky & Co. L.P. …”, tokenized one token per row by a Baseline and by Tokenizers 1–4; the tokenizers differ in how they split “F.”, “Co.”, “L.P.”, “Drexel’s”, and the trailing “plaintiffs’”. The column alignment is not recoverable from this copy.)

                Precision   Recall   F-Measure
  Tokenizer 1   0.800       0.889    0.844
  Tokenizer 2   0.962       0.926    0.944
  Tokenizer 3   0.929       0.963    0.946
  Tokenizer 4   1.000       1.000    1.000
In-class quiz #1
This is the gold standard for biomedical documents which mention the IL-2R ⍺-promoter. The result of our classifier is shown below.

  What is the accuracy?   .70
  What is the precision?  .80
  What is the recall?     .66
In-class quiz #2
This is the gold standard for biomedical documents which mention the IL-2R ⍺-promoter. The result of our classifier is shown below.

  What is the accuracy?   .60
  What is the precision?  .60
  What is the recall?     1.0
In-class quiz #3
This is the gold standard for biomedical documents which mention the IL-2R ⍺-promoter. The result of our classifier is shown below.

  What is the accuracy?   .60
  What is the precision?  1.0
  What is the recall?     .33
Why do precision/recall matter?
• Take the first page of results
• Precision: how many of these are relevant?
• Recall: how many of the relevant results are included?
Information Retrieval
• A retrieval engine must generally have high recall to be useful
• Better quality retrieval means increasing precision without sacrificing recall
• Good recall but poor precision means relevant hits will be lost in the noise of irrelevant results
• Precision/recall are set-based measures; they don’t take result ranking into account
Chunking
• We just looked at P/R for information retrieval
• We talked about P/R for general classification
• P/R is appropriate for any set-oriented task
• “Chunking” is another shallow NLP task that can be evaluated with P/R
• Chunking: assign some additional structure over POS tagging, without the expense of full parsing
  – Pla, F., Molina, A., and Prieto, N. 2000. Tagging and chunking with bigrams. In Proceedings of the 18th Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany. Association for Computational Linguistics, 614-620.
• P/R evaluation of chunking: see J&M Section 13.5.3
F-measure
• Based on van Rijsbergen (1979):

  F = 1 / (α·(1/P) + (1−α)·(1/R))

• With α = .5, P and R are each weighted equally; this is the harmonic mean of P and R, which simplifies to F = 2PR / (P + R)
• Other versions of F scores use different weights for P versus R
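van Rijsbergen's α-weighted F can be sketched in one line of Python (a minimal illustration; with the default α = 0.5 it reduces to the familiar F1 = 2PR/(P+R)):

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r;
    alpha = 0.5 weights P and R equally (the usual F1)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)
```

For example, a system with perfect recall but precision 0.5 gets F1 = 2/3, not the 0.75 an arithmetic average would give; the harmonic mean punishes the weaker of the two scores.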
Planning for evaluation
• Design evaluation strategy at the start of your research
  – What will be measured?
  – What is the baseline?
  – What is the gold standard (reference)?
  – What is the measurement heuristic?
Measure / metric / indicator
• a measure:
  – a figure, extent, or amount obtained by measuring
• a metric (distance):
  – implies a ‘cline’
  – the degree to which a system possesses a given attribute, or
  – a combination of two or more measures
• an indicator:
  – the amount of deviation from a baseline state
Standardized evaluation
• When evaluating complex systems, it’s helpful to use the same measure that was used to measure comparable (competitive) systems
• Machine translation:
  – BLEU, NIST, METEOR
• Document summarization:
  – ROUGE-1, ROUGE-2, ROUGE-SU4, Pyramid
Next week
• Tuesday:
  – parsing and generation with unification grammars
  – the DELPH-IN research consortium
  – ‘agree’ grammar engineering environment
  – demo
• Thursday:
  – Review of what we’ve learned
  – Course evaluation: https://depts.washington.edu/oeaias/webq/survey.cgi?user=UWDL&survey=1328