Planning for the TREC 2008 Legal Track

Date post: 11-Jan-2016
Page 1: Planning for the TREC 2008 Legal Track

Planning for the TREC 2008 Legal Track

Douglas Oard

Stephen Tomlinson

Jason Baron

Page 2: Planning for the TREC 2008 Legal Track

Agenda

• Track goals
• Deciding on a document collection
• “Beating Boolean”
• Handling nasty OCR
• Making the best use of the metadata
• Ad hoc task design
• Interactive task design
• Relevance feedback task design
• Other issues

Page 3: Planning for the TREC 2008 Legal Track

Track Goals

• Develop a reusable test collection
  – Documents, topics, evaluation measures

• Foster formation of a research community

• Establish baseline results

Page 4: Planning for the TREC 2008 Legal Track

Choosing a Collection

• FERC Enron (w/attachments, full headers)
  – Somewhat larger than CMU
  – Email is the real killer app for E-discovery

• IIT CDIP version 1.0 (same as 2006/07)
  – We have 83 topics. Do we need more?

• State Department Cables
  – Task model would be FOIA, not E-Discovery

Page 5: Planning for the TREC 2008 Legal Track

TREC Topic Number: 1

Title: Marketers or Traders of Electricity on the Financial Market

Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit.

Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset.

Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron.

Query Possibilities:

• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”)

• (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH)
  o This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified.

• (marketer or marketers or EPMI) and (short or long)
  o As in have a long or short position in sales/purchases.

• (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL)
  o The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78)
  o EOL was the forward market trading place. (36, p. 3)

Page 6: Planning for the TREC 2008 Legal Track

Identity Modeling in Enron

[Figure: identity model graph linking email addresses to observed name and nickname variants, e.g. "susan m scott", "susan scott", "scott susan", "m scott", "susan", "sue", "sscott", "sscott5"]

66,715 models

82,084 addr-name

3,151 addr-nickname

19,708 addr-addr

Page 7: Planning for the TREC 2008 Legal Track

Enron Identity Test Collections

Collection      Emails   Identities   Mention   Candidates
                                      Queries   Min.   Avg.   Max.
Sager            1,628          627        51      1      4      11
Shapiro            974          855        49      1      8      21
Enron-subset    54,018       27,340        78      1    152     489
Enron-all      248,451      123,783        78      3    518   1,785

Page 8: Planning for the TREC 2008 Legal Track

Example Document

Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY

Organization Authors: PMUSA, PHILIP MORRIS USA

Person Authors: HALLE, L

Document Date: 19970530

Document Type: MEMO, MEMORANDUM

Bates Number: 2078039376/9377

Page Count: 2

Collection: Philip Morris

Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aaBenffrts Departmext Rieh>pwna, Yfe&iaTa: Dishlbutfon Data aday 90,1997.From: Lisa FisllaSabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsUDuring our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ngartieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was amsiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with mySadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* .I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you onwhetlne you concur with my reeommendatioa…

Scanned OCR Metadata

Page 9: Planning for the TREC 2008 Legal Track

State Department Cables

[Figure: bar chart of Number of Records per year, 1973 to 1975 (y-axis 0 to 400,000), split into Withdrawn, Metadata, and Full Text]

791,857 records – 550,983 of which are full text

Page 10: Planning for the TREC 2008 Legal Track

State Department Cables

Page 11: Planning for the TREC 2008 Legal Track

Handling Nasty OCR

• Index pruning

• Error estimation

• Character n-grams

• Duplicate detection

• Expansion using a cleaner collection
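As a rough illustration of the character n-gram idea, the sketch below compares a clean word with an OCR-damaged form taken from the example document. The function names, the choice of trigrams, and the use of Jaccard overlap are assumptions made for this sketch; the slides do not specify an n-gram length or a matching function.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of a lowercased string."""
    s = text.lower()
    if len(s) < n:
        # Strings shorter than n are kept whole so they can still match.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


clean = "recommendation"
noisy = "reeommendatioa"  # OCR form that appears in the example document

# Whole-word matching fails, but the two forms still share many trigrams.
print(clean == noisy)
print(ngram_overlap(clean, noisy, n=3))
```

Indexing n-grams instead of whole words trades a larger index for partial matches of this kind, which is one reason index size shows up as a possible evaluation concern.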

Page 12: Planning for the TREC 2008 Legal Track

How to “Beat Boolean”

• Work from reference Boolean?
  – Swap out low-ranked-in for high-ranked-out

• Relax Boolean somehow?
  – Cover density, proximity perturbation, …
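One possible reading of "swap out low-ranked-in for high-ranked-out" is sketched below: keep the result set the same size as the reference Boolean set, but replace its k lowest-ranked members with the k highest-ranked documents that the Boolean query missed. The function name, the parameter k, and the document IDs are hypothetical; the slides do not say how many documents to swap or how the ranking is produced.

```python
def swap_boolean(ranking, boolean_set, k):
    """Return a set the same size as boolean_set after swapping k documents.

    ranking     -- list of doc IDs from a ranked retrieval run, best first
    boolean_set -- set of doc IDs matched by the reference Boolean query
    k           -- number of documents to swap (an assumed tuning parameter)
    """
    rank_of = {doc: pos for pos, doc in enumerate(ranking)}
    unranked = len(ranking)  # Boolean hits missing from the ranking sort last

    # Boolean hits ordered from best-ranked to worst-ranked.
    inside = sorted(boolean_set, key=lambda d: rank_of.get(d, unranked))
    # Ranked documents the Boolean query did not match, best first.
    outside = [d for d in ranking if d not in boolean_set]

    k = max(0, min(k, len(inside), len(outside)))
    kept = inside[:len(inside) - k]
    return set(kept) | set(outside[:k])


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]   # hypothetical ranked run
boolean_hits = {"d2", "d5", "d6"}                # hypothetical Boolean result
print(sorted(swap_boolean(ranking, boolean_hits, k=1)))
```

With k = 0 this reproduces the Boolean result exactly, so any gain or loss at larger k can be attributed to the swapped documents alone.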

Page 13: Planning for the TREC 2008 Legal Track

Using Metadata

• Title (term match)

• Author (social network)

• Bates number (sequence)
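To make the "sequence" use of Bates numbers concrete, the sketch below parses a value such as the example document's 2078039376/9377 into a first and last page number and checks whether two documents are consecutive. It assumes the part after the slash replaces the trailing digits of the first number, which is consistent with the example's page count of 2 but is not stated in the slides. The function names and the second Bates number are hypothetical.

```python
def parse_bates_range(bates):
    """Parse 'START/TAIL' into (first_page, last_page) as integers.

    Assumption: TAIL replaces the last len(TAIL) digits of START.
    A value without a slash is treated as a single-page document.
    """
    start, sep, tail = bates.strip().partition("/")
    if not sep:
        return int(start), int(start)
    if len(tail) >= len(start):
        end = tail  # tail is already a full number
    else:
        end = start[:len(start) - len(tail)] + tail
    return int(start), int(end)


def are_adjacent(bates_a, bates_b):
    """True if document b starts on the page right after document a ends."""
    return parse_bates_range(bates_a)[1] + 1 == parse_bates_range(bates_b)[0]


first, last = parse_bates_range("2078039376/9377")  # from the example document
print(first, last, last - first + 1)                # page span of the document
print(are_adjacent("2078039376/9377", "2078039378"))  # hypothetical neighbor
```

Adjacent documents often come from the same file or folder, so a sequence feature like this could be used to propagate evidence between neighbors.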

Page 14: Planning for the TREC 2008 Legal Track

Ad Hoc Task Design

• Evaluation measures
  – R@B?, P@R?, Index size?
  – Error bars / Statistical significance testing
  – Limits on post-hoc use of the collection?
  – What are “meaningful” differences?

• Topic design
  – Negotiation transcript?

• Inter-annotator agreement
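The candidate measures can be sketched as follows, assuming B is the number of documents returned by the reference Boolean query and R is the number of relevant documents for the topic. Neither symbol is defined on the slide, so both readings are assumptions, as are the function names and the toy data. The sketch also assumes complete relevance judgments; with sampled judgments these values would have to be estimated.

```python
def recall_at(ranking, relevant, cutoff):
    """Fraction of all relevant documents found in the top `cutoff` results."""
    relevant = set(relevant)
    if not relevant or cutoff <= 0:
        return 0.0
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / len(relevant)


def precision_at(ranking, relevant, cutoff):
    """Fraction of the top `cutoff` results that are relevant."""
    relevant = set(relevant)
    if cutoff <= 0:
        return 0.0
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / cutoff


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]  # hypothetical ranked run
relevant = {"d1", "d3", "d6", "d9"}             # hypothetical judgments
B = 3               # assumed: size of the reference Boolean result set
R = len(relevant)   # assumed: number of relevant documents

print("R@B =", recall_at(ranking, relevant, B))
print("P@R =", precision_at(ranking, relevant, R))
```

Cutting the ranking at B makes the comparison with the Boolean run direct: both return the same number of documents, so a higher recall at that depth is what "beating Boolean" would mean under this reading.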

Page 15: Planning for the TREC 2008 Legal Track

Interactive Track Design

• Evaluation measure
  – Precision-oriented?
  – Recall-oriented?
  – Effect of assessor disagreement

Page 16: Planning for the TREC 2008 Legal Track

Relevance Feedback Task

• Evaluation measure
  – Residual recall at B_Residual?

• Two-stage feedback?
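A minimal sketch of residual recall follows, under the usual residual-collection reading: documents whose judgments were given to the system as feedback are removed from both the ranking and the relevant set before recall is computed. How B_Residual is derived is not stated on the slide, so it is passed in as a plain parameter here; the function name and the toy data are hypothetical.

```python
def residual_recall(ranking, relevant, judged, b_residual):
    """Recall at depth b_residual over the collection minus feedback documents.

    ranking    -- list of doc IDs from the feedback run, best first
    relevant   -- doc IDs judged relevant for the topic
    judged     -- doc IDs whose judgments were supplied as feedback
    b_residual -- evaluation depth in the residual ranking (assumed given)
    """
    judged = set(judged)
    residual_ranking = [doc for doc in ranking if doc not in judged]
    residual_relevant = set(relevant) - judged
    if not residual_relevant or b_residual <= 0:
        return 0.0
    top = residual_ranking[:b_residual]
    hits = sum(1 for doc in top if doc in residual_relevant)
    return hits / len(residual_relevant)


ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]  # hypothetical feedback run
relevant = {"d1", "d3", "d6"}                   # hypothetical judgments
judged = {"d1", "d2"}                           # documents used as feedback

print(residual_recall(ranking, relevant, judged, b_residual=2))
```

Removing the feedback documents keeps a system from being credited for simply returning documents it was already told were relevant.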

Page 17: Planning for the TREC 2008 Legal Track

Some Open Questions

• Test collection reusability
  – Unbiased estimates? Tight error bars?

• Why can’t we beat Boolean???
  – Different strategies? Detailed failure analysis?

• Can we improve topic formulation?
  – Structured relevance feedback?

• Is OCR masking effects we need to see?
  – Is it time for a new collection?
  – Must it be de-duped? Is metadata needed?

• Does Δscope invalidate the interactive task?

