© Paul Kantor 2002
A Potpourri of Topics
Paul Kantor
• Project overview and cartoon
• How we did at TREC this year
• Generalized Performance Plots
• Remarks on the formal model of decision
Rutgers DIMACS: Automatic Event Finding in Streams of Messages
[Project cartoon: two analyst-centered workflows over a stream of accumulated documents.]
Retrospective/Supervised/Tracking: 1. Accumulated documents → 2. Unexpected event → 3. Initial profile → 4. Guided retrieval → 5. Clustering → 6. Revision and iteration → 7. Track new documents.
Prospective/Unsupervised/Detection: 1. Accumulated documents → 2. Clustering → 3. Initial profile → 4. Anticipated event → 5. Guided retrieval.
Communication
• The process converges….
• Central limit theorem …
• What???
• Pretty good fit
• Confidence levels
• What???
• And so on
Measures of Performance: Effectiveness
• 1. Batch post-hoc learning. Here there is a large set of already discovered documents, and the system must learn to recognize future instances from the same family.
• 2. Adaptive learning of defined profiles. Here there is a small group of "seed documents," and thereafter the system must learn while it works. Realistic measures must penalize the system for sending the human experts documents that are of no interest to any analyst.
• 3. Discovery of new regions of interest. Here the focus is on unexpected patterns of related documents, which are far enough from established patterns to warrant sending them for human evaluation.
Measures of Performance: Efficiency
• Efficiency is measured in both the time and space resources required to accomplish a given level of effectiveness. Results are best visualized in a set of two- or three-dimensional plots, as suggested on the following slide.
Efficiency-Effectiveness Plots
[Plot: y-axis "Measure of Effectiveness" (up to 100%); x-axis "Measure of Time Required (Best baseline method / Method plotted)" (up to 100%). Quadrants labeled "Strong and slow," "Strong and fast," "Weak but fast," and "Not good enough for government work."]
The process
[Diagram: N incoming documents, of which G are relevant → our system sends n of them to the analyst → the analyst reports that g are relevant.]
Typical Effectiveness Measures
• Basic concepts:
• Precision p = g/n, where g = number of relevant documents flagged by our system and n = number that the analyst must examine.
• Recall R = g/G, where G = total number that "should" be sent to the analyst, that is, the number of relevant documents.
– F-measures: the weighted harmonic mean of precision and recall:
• 1/F = a/p + (1-a)/R = (1/g)(an + (1-a)G), so
• F = g/[an + (1-a)G].
• There is no persuasive argument for using this measure.
• In TREC 2002, a = 0.8: a 4:1 weighting of precision over recall (see the sketch below).
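A minimal Python sketch of the three formulas above, as a sanity check; the counts g, n, G here are made-up illustrative values, not TREC data:

```python
# Precision, recall, and the weighted F-measure as defined above.
# g = relevant documents our system flagged, n = documents the analyst
# must examine, G = total relevant documents; a is the F weight (0.8).

def precision(g: int, n: int) -> float:
    return g / n

def recall(g: int, G: int) -> float:
    return g / G

def f_measure(g: int, n: int, G: int, a: float = 0.8) -> float:
    # F = g / [a*n + (1-a)*G]; equivalently 1/F = a/p + (1-a)/R.
    return g / (a * n + (1 - a) * G)

g, n, G = 30, 50, 60                  # illustrative counts only
print(precision(g, n))                # 0.6
print(recall(g, G))                   # 0.5
print(f_measure(g, n, G))             # 30/52 ~= 0.577
```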
Typical measures used
• Utility-based measures
– Pure measure: U = vg - c(n-g) = -cn + g(v-c)
– Note that sending irrelevant documents drives the score negative. In TREC 2002, v = 2 and c = 1.
– "Training wheels": to protect groups from having to report astronomically negative results, U is replaced by
– T11SU = [max{U/2G, -0.5} + 0.5]/1.5 (see the sketch below)
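A sketch of the two utility measures, assuming the standard TREC-11 scaling in which MaxU = 2G (the utility of a perfect run when v = 2) and the floor is -0.5; counts are again illustrative:

```python
# Linear utility and its "training wheels" rescaling, as defined above.

def linear_utility(g: int, n: int, v: float = 2.0, c: float = 1.0) -> float:
    # U = v*g - c*(n - g): each irrelevant document sent costs c.
    return v * g - c * (n - g)

def t11su(u: float, G: int) -> float:
    # Rescale into [0, 1]; the max() floor caps how negative a run
    # can score, protecting groups from astronomical penalties.
    return (max(u / (2 * G), -0.5) + 0.5) / 1.5

g, n, G = 30, 50, 60
u = linear_utility(g, n)              # 2*30 - 1*(50-30) = 40
print(u, t11su(u, G))                 # 40, (40/120 + 0.5)/1.5 ~= 0.556
```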
How we have done: TREC 2002
• Disclaimers and caveats:
– We report here only on results that were achieved and validated at the TREC 2002 conference. These runs were done primarily to convince ourselves that we can manage the entire data pipeline; they were not selected to represent the best conceptual approaches we can think of.
Disclaimers and caveats (cont).
• The TREC Adaptive rules are quite confusing to newcomers. It appears, from conference and post-conference discussions, that the two top-ranked systems may not have followed the same set of rules as the other competitors. If that is the case, our standing is actually better than reported here.
Using measure T11SU
• Adaptive, Assessor topics: 9th among all 14 teams; 7th among those known to follow the rules.
• Adaptive, Intersection topics: 7th among all 14 teams; 5th among those known to follow the rules.
• Batch: 6th among all 10 groups on Assessor topics; 3rd among all 10 groups on Intersection topics. Scored above the median on almost all topics; tops on 24 of 50.
Fusion of Methods
Paul Kantor and Dmitriy Fradkin (supported in part by ARDA)
Fusion Models
• Each of several systems gives scores to documents; call these s_j(d). Can these be combined so that the resulting score is a more accurate indication of the relevance of the document? The underlying mathematical concept is the conditional score distribution f(s, h) = Prob(document has score s, given relevance h), where the "hypothesis" h = R, N ("Relevant" or "Not"). A fusion sketch follows.
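The slide does not specify a combination rule, so here is one hedged sketch, not the authors' exact method: estimate each system's conditional score distributions f(s | R) and f(s | N) from labeled training scores via histograms, then fuse by summing per-system log-likelihood ratios (a naive-Bayes independence assumption; all function names are mine):

```python
import numpy as np

def fit_score_model(scores, labels, bins=10, rng=(0.0, 1.0)):
    """Histogram estimates of f(s | R) and f(s | N) for one system."""
    scores, labels = np.asarray(scores), np.asarray(labels, dtype=bool)
    rel, _ = np.histogram(scores[labels], bins=bins, range=rng)
    non, _ = np.histogram(scores[~labels], bins=bins, range=rng)
    edges = np.linspace(rng[0], rng[1], bins + 1)
    # Laplace smoothing so no bin has zero estimated probability.
    return (rel + 1) / (rel.sum() + bins), (non + 1) / (non.sum() + bins), edges

def log_lr(score, rel_p, non_p, edges):
    # Locate the histogram bin for this score and return log f(s|R)/f(s|N).
    i = np.clip(np.searchsorted(edges, score) - 1, 0, len(rel_p) - 1)
    return float(np.log(rel_p[i] / non_p[i]))

def fused_score(doc_scores, models):
    # doc_scores[j] = system j's score; models[j] from fit_score_model.
    return sum(log_lr(s, *m) for s, m in zip(doc_scores, models))
```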
Tools
• We have built visualization tools to show these two distributions. It can be shown that all decision making needs to know only the so-called ROC curve, which is invariant under any monotone change of the score variable. We have also built tools that show the ROC.
• The simplest form gives a curve with coordinates (d(t), f(t)), defined on the next slide.
ROC
• d(t) = Prob(score > t | document relevant)
• f(t) = Prob(score > t | document not relevant)
A computational sketch follows.
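A minimal sketch of tracing this curve from a set of scored, judged documents (illustrative code, not the applet itself):

```python
import numpy as np

def roc_points(scores, relevant):
    """d(t), f(t) for each candidate threshold t, as defined above."""
    scores = np.asarray(scores, dtype=float)
    rel = np.asarray(relevant, dtype=bool)
    thresholds = np.unique(scores)
    d = np.array([(scores[rel] > t).mean() for t in thresholds])
    f = np.array([(scores[~rel] > t).mean() for t in thresholds])
    return d, f   # plotting d against f gives the ROC curve

# Invariance check: any monotone rescaling of the scores (e.g. squaring
# positive scores) leaves the curve unchanged, since only the ordering
# of documents by score matters.
```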
Score Distributions
[Figure: the conditional score distributions f(s | R) and f(s | N) shown by the visualization tool.]
ROC Display Applet
[Screenshots of the ROC display applet.]
Formal Models
David Madigan and Paul Kantor
Formal Models
• The BinWorld model
• Some heuristic ideas
BinWorlds: Very Simple Models
• Documents live in some number (L) of bins. Most bins hold only (b) irrelevant (bad) documents; a few also hold (g) relevant (good) documents. Documents are delivered randomly from the world, labeled only by their bin numbers. The game has a horizon H, with a payoff v for good documents sent to be judged and a cost c for bad documents sent to be judged. We consider a hierarchy of models. For example, if only one bin contains good documents, the optimum strategy is either to QUIT, or to continue until seeing one good document and thereafter submit only documents from that bin to be judged.
• The expected value of the game is given by:
• EV = -CostToLearnRightBin + GainThereafter.
• Since the expected time to learn the right bin is 1 + Lb/g,
• EV = -c(1 + Lb/g) + (H - (1 + Lb/g))(vg - cb)/(b + g).
• Increasing the horizon H increases EV, while increasing the number of candidate bins, L, makes the game harder. A numerical sketch follows.
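A direct transcription of the EV formula into Python, with illustrative parameter values (not from the talk):

```python
def binworld_ev(L: int, b: int, g: int, H: int,
                v: float = 2.0, c: float = 1.0) -> float:
    # EV = -c(1 + Lb/g) + (H - (1 + Lb/g)) * (vg - cb)/(b + g)
    learn_time = 1 + L * b / g              # expected steps to find the bin
    gain_rate = (v * g - c * b) / (b + g)   # expected payoff per later draw
    return -c * learn_time + (H - learn_time) * gain_rate

# Illustrative values: 5 bins, 1 bad and 4 good docs in the right bin.
print(binworld_ev(L=5, b=1, g=4, H=200))    # ~274.6
# A longer horizon H raises EV; more bins L lowers it.
```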
The essential math
• However, if we have failed once on a bin, perhaps it is not wise to test it again.
• At any step on the way to the horizon H, the decision maker can know only these things:
• The judgments on submitted documents, and the stage at which they were submitted. Let j_i = j(b, i) be the judgment received when a document from bin b was submitted at time step i.
The challenge
• As a result of these judgments, the decision maker has a present Bayesian estimate of the chance that each bin is the right bin.
• Can we find a simple and effective heuristic based on the available history j_1 … j_i and the time remaining to the horizon, H - i? A sketch of the Bayesian update follows.
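A hedged sketch of that Bayesian bookkeeping (the talk gives no code, so this reconstruction is mine): with a uniform prior over which bin is the right one, and p = g/(b+g) the chance that a draw from the right bin is good, the posterior after a run of failed submissions is:

```python
import numpy as np

def posterior_over_bins(L, failures, p):
    """P(bin k is the right bin | all submissions so far judged bad).

    failures[k] = bad judgments seen on bin k. If k is the right bin,
    each of its failures had probability (1 - p); if k is wrong, its
    failures were certain, so they contribute likelihood 1. (A single
    good judgment on bin k would make k certain instead.)
    """
    like = np.array([(1 - p) ** failures[k] for k in range(L)])
    return like / like.sum()

print(posterior_over_bins(5, [3, 0, 1, 0, 0], p=0.2))
# Bins with more failures get lower posterior mass.
```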
Example Heuristic
[Chart: cumulative utility over time steps 1-190, y-axis from about -120 to 80, comparing two strategies.]
• 5 bins, p = 0.2.
• If the current bin is the one that has the largest number of failures to date, do not send it for judgment (yellow line).
• This gains slowly until the correct bin is discovered.
• The alternative is to always submit (mauve line). A simulation sketch follows.
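An illustrative reconstruction of this experiment, not the original code; any strategy details beyond the slide's description are my assumptions. Five bins, one right bin in which a draw is good with probability p = 0.2, payoff v = 2, cost c = 1:

```python
import random

def simulate(skip_worst: bool, L=5, p=0.2, horizon=190,
             v=2.0, c=1.0, seed=0):
    rng = random.Random(seed)
    right = rng.randrange(L)          # the one bin holding good documents
    failures = [0] * L
    found = False                     # True once a good document is seen
    utility = 0.0
    for _ in range(horizon):
        k = rng.randrange(L)          # a document arrives from bin k
        if found and k != right:
            continue                  # after success, submit only that bin
        if skip_worst and not found and failures[k] == max(failures) > 0:
            continue                  # heuristic: skip the worst bin so far
        if k == right and rng.random() < p:
            utility += v
            found = True
        else:
            utility -= c
            failures[k] += 1
    return utility

# Compare "always submit" against the skip-the-worst-bin heuristic.
print(simulate(skip_worst=False), simulate(skip_worst=True))
```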
Future work
• Such a heuristic should exist, because the decision rule must be of the form: if the current estimate that a bin is the right one is below some critical value, don't submit it.
• Note: this is "obvious but not yet proved."
• In more complex models, the chance of success in the right bin (g), the number of bins L, and even the number of good bins may be unspecified.