
TREC-CHEM: The TREC Chemical IR Track

Mihai Lupu¹, John Tait¹, Jimmy Huang², Jianhan Zhu³

¹ Information Retrieval Facility
² York University
³ University College London

Network of excellence co-funded by the 7th Framework Program of the European Commission, grant agreement number 258191

Agenda

• Introduction
• "Prior Art" Task (PA)
• "Technology Survey" Task (TS)
• Conclusions

Motivation

• Increased awareness on behalf of the industry and regulatory authorities
  – Particularly in human-related chemistry (pharma and cosmetics)
  – Particularly in IP-related contexts
• Increased availability of data and meta-data
• Different demands from professional users compared with other evaluation campaigns

Partners

• Collaboration
  – National Institute of Standards and Technology (NIST, US)
  – University College London (UK)
  – York University (Canada)
• Support from
  – Royal Society of Chemistry
  – Open access publishers
  – Experts in the field
• With the participation of
  – Research groups

Aims

• Assess the available Chemical Retrieval tools
• Generate interest among research groups for this domain
• Stimulate participation from industry
• Generate new Chemical Retrieval tools, at the intersection of chemoinformatics and text-mining

Data

• 2 collections
• 2009
  – 1.2 million patent documents
  – 50k scientific articles
  – text only
• 2010
  – 1.3 million patent documents
  – 172k scientific articles
  – text, images, structure information available

2010 Data

• Patent data
  – Addition of WIPO patents
  – Addition of attachments (images, structure data)
• Scientific articles
  – 3-fold increase, with attachments
  – Large mass from PubMed
  – Some directly from open access publishers: IUCr Journals, Oxford Publishers, Hindawi Publishers, MPCI

2010 Data

• Patent data across IPC classes
  – Organic Chemistry
  – Medical or Veterinary science; Hygiene
  – Organic macromolecular compounds
  – Biochemistry
  – Physical or chemical processes or apparatus in general
  – Dyes; Paints; Polishes…
  – Petroleum; Gas..

Tasks

• Technology Survey (TS)
  – Search for all potentially relevant documents, in both patents and scientific articles
  – 30 manually defined and evaluated topics
• Prior Art (PA)
  – Search for patents that may invalidate a given patent
  – 1000 automatically created and evaluated topics (1000 patent files)

PA topics

• Tagline: recreate the citation list created by the patent examiner
• topic = patent application document
• evaluation based on
  – applicant's citations
  – examiner's report
  – opposition citations (if any)
• only patent corpus used
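
As a rough illustration of the evaluation idea above, the sketch below merges the three citation sources into one relevance set for a topic patent and scores a ranked run with average precision. The slides do not say which measures the track reported, so average precision is an assumption here, and the function names and patent identifiers are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def build_qrels(applicant: Iterable[str],
                examiner: Iterable[str],
                opposition: Iterable[str] = ()) -> set[str]:
    """Union of all citation sources for one topic patent."""
    return set(applicant) | set(examiner) | set(opposition)


def average_precision(ranking: Sequence[str], relevant: set[str]) -> float:
    """Average precision of one ranked result list against the relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


# Hypothetical patent identifiers, for illustration only.
qrels = build_qrels(applicant=["EP-1"], examiner=["EP-2", "US-3"])
run = ["EP-2", "US-9", "EP-1", "WO-7"]
ap = average_precision(run, qrels)
```

Dividing by the size of the full relevance set (rather than by the number of hits) penalises runs that miss cited patents entirely, which matches the goal of recreating the whole citation list.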


PA topics


TS topics

• topic = natural language information request
• evaluation done manually by
  – junior evaluators (students, others)
  – senior evaluators (topic creators)
• both patent and scientific articles requested

TS topics - example

<topic>
  <number>TS-23</number>
  <title>Titanium tetrafluoride for improving dental health</title>
  <narrative>
    Titanium tetrafluoride can be used to prevent dental caries or tooth decay along with other fluoride containing compounds. We are specifically looking for the use of Titanium tetrafluoride for improving dental health or preventing decay.
  </narrative>
  <details>
    <chemicals>titanium tetrafluoride</chemicals>
    <condition>tooth decay</condition>
  </details>
  <relevance>
    A document will be considered RELEVANT if it refers to the use of titanium tetrafluoride for improving dental health, including caries or tooth decay
    A document will be considered HIGHLY RELEVANT when it is RELEVANT and it refers to the use of titanium tetrafluoride within a product such as toothpaste or mouthwash.
  </relevance>
</topic>
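
A topic in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on an abridged copy of TS-23 (narrative and relevance text omitted for brevity) and builds a naive keyword query from the title and the `<details>` fields. The `parse_topic` helper and the query construction are illustrative assumptions, not something the track prescribed.

```python
import xml.etree.ElementTree as ET

# Abridged copy of topic TS-23 from the slide.
TOPIC_XML = """
<topic>
  <number>TS-23</number>
  <title>Titanium tetrafluoride for improving dental health</title>
  <details>
    <chemicals>titanium tetrafluoride</chemicals>
    <condition>tooth decay</condition>
  </details>
</topic>
"""


def parse_topic(xml_text: str) -> dict:
    """Extract the main fields of one TS topic into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    details = root.find("details")
    detail_fields = {}
    if details is not None:
        # <details> may be empty, as in the structure-search topic TS-47.
        detail_fields = {child.tag: (child.text or "").strip() for child in details}
    return {
        "number": root.findtext("number", default="").strip(),
        "title": root.findtext("title", default="").strip(),
        # Collapse the line breaks that the narrative usually contains.
        "narrative": " ".join(root.findtext("narrative", default="").split()),
        "details": detail_fields,
    }


topic = parse_topic(TOPIC_XML)
# Naive query: title words plus the values of the detail fields.
query_terms = topic["title"].lower().split()
for value in topic["details"].values():
    query_terms.extend(value.lower().split())
```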


TS topics - example

<topic>
  <number>TS-47</number>
  <title>Structure Search</title>
  <narrative>
    We are looking for patents and papers on use of the chemical described in TS-47.mol and TS-47.png for treating dementia.
  </narrative>
  <details></details>
  <relevance>
    A document will be considered RELEVANT if it refers to the use of chemical X for treating dementia
    There are no HIGHLY RELEVANT documents.
  </relevance>
</topic>

Participants

• 13 participants registered to download the data
• PA
  – 4 groups submitted 10 runs
  – BiTeM Geneva, York University, Fraunhofer SCAI, Iowa University
• TS
  – 2 groups submitted 12 runs
  – BiTeM Geneva, York University

Methods

• Basic Probabilistic Model, Language Model and Vector Space Model
  – Different sections, weights on each section
  – BM25
• Additional filtering/weighting based on IPC codes
• Linguistic processing
  – Emphasis on noun phrases (NP)
• Concept based search
  – Query expansion
  – Using Oscar3, MeSH
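
To make "weights on each section" concrete, here is a minimal sketch of BM25 scored separately per document section and combined as a weighted sum. The slides do not describe how participants actually combined sections, so the combination rule, the section names (`title`, `claims`), the weights and the toy documents are all assumptions; `k1=1.2` and `b=0.75` are conventional defaults rather than values from the track.

```python
import math
from collections import Counter


def bm25_section(query_terms, doc_tokens, corpus_tokens, k1=1.2, b=0.75):
    """BM25 score of one section of a document, using statistics
    from the same section across the whole corpus."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(tokens) for tokens in corpus_tokens) / n_docs
    if avg_len == 0:
        return 0.0
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
    return score


def score_document(query_terms, doc_index, corpus, weights):
    """Weighted sum of per-section BM25 scores (hypothetical combination)."""
    total = 0.0
    for section, weight in weights.items():
        section_corpus = [doc[section] for doc in corpus]
        total += weight * bm25_section(
            query_terms, corpus[doc_index][section], section_corpus)
    return total


# Toy, pre-tokenised documents; contents are made up for illustration.
corpus = [
    {"title": ["titanium", "tetrafluoride", "toothpaste"],
     "claims": ["fluoride", "compound", "for", "dental", "caries"]},
    {"title": ["polymer", "coating"],
     "claims": ["a", "polymer", "coating", "for", "steel"]},
]
weights = {"title": 2.0, "claims": 1.0}
query = ["titanium", "dental"]
scores = [score_document(query, i, corpus, weights) for i in range(len(corpus))]
```

IPC-based filtering, as mentioned on the slide, could be layered on top by discarding or down-weighting documents whose classification codes do not overlap with those of the topic.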


Methods

• The addition of non-text data did not impact the methods
  – only 2 TS topics were purely structure based
• TODO
  – define interesting structure based topics
  – find ways to solve them

Evaluation – PA topics

[Figure: diagram relating the Topic Patent, its family members (siblings F1, F2, F3) and documents D through "cites" links]

Evaluation

• PA topics qrels


Evaluation

• TS topics
  – Due to low participation, the pooling method might have resulted in biased results
  – However, still wanted to provide feedback to the 2 participating groups
  – Evaluated 6 topics: TS-21, TS-23, TS-30, TS-35, TS-36 and TS-43
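
For readers unfamiliar with pooling, the sketch below shows the general idea: the top-ranked documents of every submitted run are merged into one pool, and a sample of that pool is sent to the assessors. The slides do not state the pool depth or how the sample was drawn, so the depth, the uniform random sampling and the document identifiers are assumptions. With only two participating groups the pool reflects few systems, which is the bias risk noted above.

```python
import random


def build_pool(runs, depth):
    """Union of the top-`depth` documents from every submitted run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool


def sample_pool(pool, sample_size, seed=0):
    """Uniform random sample of the pool (assumed sampling scheme)."""
    rng = random.Random(seed)
    ordered = sorted(pool)  # fixed order so the sample is reproducible
    return rng.sample(ordered, min(sample_size, len(ordered)))


# Hypothetical runs for a single topic.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d4", "d5"],
]
pool = build_pool(runs, depth=2)
to_judge = sample_pool(pool, sample_size=2)
```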


Evaluation – TS Interface

• TS topics - interface


Evaluation

• TS topics – qrels

Topic   #pooled  #sampled  #relevant  #highly relevant  #non relevant
TS-21   4500     616       16         2                 597
TS-23   4762     648       2          4                 641
TS-30   3852     525       5          3                 517
TS-35   6036     797       5          3                 789
TS-36   5048     679       62         13                594
TS-43   6005     761       74         15                672
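
The figures in the table can be turned into two simple ratios per topic: the fraction of the pool that was sampled for judging, and the share of judged documents that were found relevant or highly relevant. The sketch below copies the numbers from the table; the choice of these two ratios is an illustration and not a measure reported in the slides.

```python
# (pooled, sampled, relevant, highly_relevant, non_relevant), copied from the table.
QREL_COUNTS = {
    "TS-21": (4500, 616, 16, 2, 597),
    "TS-23": (4762, 648, 2, 4, 641),
    "TS-30": (3852, 525, 5, 3, 517),
    "TS-35": (6036, 797, 5, 3, 789),
    "TS-36": (5048, 679, 62, 13, 594),
    "TS-43": (6005, 761, 74, 15, 672),
}


def summarise(counts):
    """Sampling rate and share of relevant documents among the judged ones."""
    summary = {}
    for topic, (pooled, sampled, rel, high, non) in counts.items():
        judged = rel + high + non
        summary[topic] = {
            "sampling_rate": sampled / pooled,
            "relevant_share": (rel + high) / judged,
        }
    return summary


summary = summarise(QREL_COUNTS)
for topic, row in summary.items():
    print(f"{topic}: sampled {row['sampling_rate']:.1%} of pool, "
          f"{row['relevant_share']:.1%} of judged documents relevant")
```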


Results – Prior Art Task


Results – TS task


Conclusions & Outlook

• This year, more than the last, was a dry-run for the next campaign
• Fixed test collection
• 24 TS topics still to use next year
• Main objective for 2011
  – More collaboration between structure-based search and text-mining

Thank you

Questions
