Page 1: Automatic Keyword Assignment for High Energy Physics Literature

CERN, European Organization for Nuclear Research

Arturo Montejo Ráez

ETT/SI Data Handling Group, CERN, Geneva (Switzerland)

Joint Research Center, Ispra (Italy), 4 March 2002

Page 2: What we are going to see today...

Automatic Keywording for HEP literature, Ispra (Italy), 4 March 2002

• Keyword assignment process
• Why keywords?
• How it is done for High Energy Physics papers
• The HEPindexer project: Data, Algorithm, Experiments, Results
• Future work

Page 3: Keyword assignment process

Diagram labels: Authors, Indexer, Keyworded papers

Page 4: Keyword assignment process

Page 5: Keyword assignment process

The document...

• Full text paper
• Stored in a database
• Simplified representation needed

Page 6: Keyword assignment process

The thesaurus...

• Controlled vocabulary of concepts
• Relationships between keywords
• Categories and subcategories
• Can be domain specific
• Can be translated into multiple languages

Page 7: Keyword assignment process

The thesaurus: a relational model for terms

cheese
  MT 6016 processed agricultural produce
  BT1 milk product
  NT1 blue-veined cheese
  NT1 cow's milk cheese
  NT1 fresh cheese
  NT1 goat's milk cheese
  NT1 hard cheese
  NT1 processed cheese
  NT1 semi-soft cheese
  NT1 sheep's milk cheese
  NT1 soft cheese
  RT cheese factory (6031)
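The relational entry above can be held in a small record type. The following is a minimal Python sketch, not the structure used by any particular thesaurus software: the class name `ThesaurusEntry`, its field names and the helper `is_narrower` are hypothetical, and the relation codes are read in the usual thesaurus sense (MT microthesaurus, BT broader term, NT narrower term, RT related term).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusEntry:
    """One thesaurus term together with its relations (hypothetical layout)."""
    term: str
    microthesaurus: str = ""                           # MT
    broader: List[str] = field(default_factory=list)   # BT
    narrower: List[str] = field(default_factory=list)  # NT
    related: List[str] = field(default_factory=list)   # RT


# The "cheese" entry shown on the slide.
cheese = ThesaurusEntry(
    term="cheese",
    microthesaurus="6016 processed agricultural produce",
    broader=["milk product"],
    narrower=[
        "blue-veined cheese", "cow's milk cheese", "fresh cheese",
        "goat's milk cheese", "hard cheese", "processed cheese",
        "semi-soft cheese", "sheep's milk cheese", "soft cheese",
    ],
    related=["cheese factory (6031)"],
)


def is_narrower(entry: ThesaurusEntry, candidate: str) -> bool:
    """Return True if `candidate` is listed as a narrower term of `entry`."""
    return candidate in entry.narrower
```

Keeping each relation in its own list makes it cheap to walk up (BT) or down (NT) the hierarchy when expanding a search.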

Page 8: Keyword assignment process

The thesaurus: a subject tree

04 POLITICS
  0406 political framework
  0411 political party
  0416 electoral procedure and voting
  0421 parliament
  0426 parliamentary proceedings
  0431 politics and public safety
  0436 executive power and public service
08 INTERNATIONAL RELATIONS
  0806 international affairs
  0811 cooperation policy
  0816 international balance
  0821 defence
10 EUROPEAN COMMUNITIES
  1006 Community institutions and European civil service
  1011 Community law
  1016 European construction
  1021 Community finance
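A subject tree like this one can be rebuilt from the codes alone. The sketch below is illustrative Python and assumes, as the numbering on the slide suggests, that each four-digit code belongs to the two-digit field it starts with; the function name `build_tree` and the `ROWS` sample are hypothetical.

```python
from collections import defaultdict

# A few (code, label) rows taken from the slide.
ROWS = [
    ("04", "POLITICS"),
    ("0406", "political framework"),
    ("0421", "parliament"),
    ("08", "INTERNATIONAL RELATIONS"),
    ("0821", "defence"),
    ("10", "EUROPEAN COMMUNITIES"),
    ("1011", "Community law"),
]


def build_tree(rows):
    """Group four-digit codes under the two-digit field they start with."""
    fields = {code: label for code, label in rows if len(code) == 2}
    children = defaultdict(list)
    for code, label in rows:
        if len(code) == 4:
            children[code[:2]].append((code, label))
    # Map each field code to (field label, list of its subcategories).
    return {code: (fields[code], children[code]) for code in fields}


tree = build_tree(ROWS)
```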

Page 9: Keyword assignment process

The indexer...

• An expert in the domain of the documents
• An expert in the use of the thesaurus
• A heavy task
• Does not always propose the same keywords
• Expensive!

Page 10: Why keywords?

• Allow documents to be indexed in a coherent way
• Can be viewed like the "index" at the end of a book
• Concepts that better represent the content
• Human made (value added)
• Meaningful
• Can establish relations between documents
• Multilingual

Page 11: Why keywords?

Access to documents

But... we already have fulltext indexing!

Page 12: Why keywords?

Classification:
• To store (libraries)
• To access (narrow searches)

Diagram labels: Category 1, Category 2, Category 3

Page 13: Why keywords?

Crosslingual access

Diagram labels: Navaja, Razor, Couteau, Lametta, "Razor?"

Page 14: Why keywords?

Multilingual comparison

Diagram labels: Razor, Lametta, Murder, Frabbica

Page 15: Why keywords?

Advantages over fulltext searches:
• No ambiguity
• Better relevance and precision

More advanced tools for searching and classification are coming!

Page 16: Why keywords?

The BIG problem...

- E X P E N S I V E -

Page 17: Why keywords?

The BIG problem?

E X P E N S I V E ?

Page 19: The CERN

• The world's largest particle physics centre
• Explores what matter is made of, and what forces hold it together
• Employs just under 3000 people
• 6500 scientists come for their research

Page 20: How it is done for High Energy Physics papers

DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)

• DESY thesaurus
• Group of indexers (students, experts...)
• Only High Energy Physics related papers

Page 21: How it is done for High Energy Physics papers

The DESY thesaurus

A
*a4(2040) ('postulated particle, a4(2040)', was delta(2040))
*a6(2450) ('postulated particle, a6(2450)', was delta(2450))
*abelian
*aberration
absorption
-absorptive model (model, absorption)
accelerator
. . .

B
B
B anti-B
B+
B+L number
B*(5320) (excited B)
-B** ('B*2...', similar for B/s, etc.)
*B*2(5732) (postulated particle, B*2(5732))
B-
-B-factory (B, particle source)
B-L number
. . .

Page 22: How it is done for High Energy Physics papers

The DESY thesaurus:

• Few categories, rarely used
• Only two types of keywords: main keywords (1191) and secondary keywords (949)
• No relationships between terms
• Specific terminology

Page 23: How it is done for High Energy Physics papers

The DESY thesaurus: specific terminology

• Energy declarations: 1.5-2.7 GeV-cms
• Resonances: Delta (1232)
• Reaction equations: anti-p p ---> K0 K- pi+
• Combinations: angular distribution, (photon), mass spectrum (pi+ pi- pi0)
• Two-particle initial state: 'anti-p p', 'electron positron'
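Structured keywords such as the reaction equation above need dedicated handling rather than plain word matching. The sketch below is illustrative Python only: it assumes the arrow is always written `--->` and that particles are separated by whitespace, as in the slide's example. The function name `parse_reaction` is hypothetical and the real DESY conventions may be richer.

```python
def parse_reaction(keyword: str):
    """Split a reaction keyword into initial-state and final-state particles."""
    initial, arrow, final = keyword.partition("--->")
    if not arrow:
        raise ValueError(f"no reaction arrow in {keyword!r}")
    return initial.split(), final.split()


# The example from the slide.
initial, final = parse_reaction("anti-p p ---> K0 K- pi+")
```

Here `initial` holds the two-particle initial state and `final` the decay products, so each side can be matched against the thesaurus separately.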

Page 24: How it is done for High Energy Physics papers

The problem

Diagram labels: Physicists, Indexer

More than 500 preprints per week!

Page 25: The HEPindexer project

The solution

Diagram labels: Physicists, Indexer, Keyworded papers

Page 26: The HEPindexer project

• Use of IR techniques
• Objective evaluation
• Real-time answers
• Easily portable
• Fully integrable into CDS
• Possibility of growing
• Usable fully automatically or as an aid tool

Page 27: The HEPindexer project

Diagram labels: Keyworded papers (collection), Keyword, Term

Page 28: The HEPindexer project

Diagram labels: Keyword, Term, DESY keywords, Documents

Page 29: The HEPindexer project: Data

• 3,661 documents
• 19,143 terms
• 1,191 main keywords

• 2441 documents in the training collection
• 1220 documents in the test collection
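The two collections add up to the full set (2441 + 1220 = 3661), roughly a two-to-one split. The slides do not say how documents were assigned to each part; the Python sketch below assumes a seeded random shuffle, and the function name `split_collection` is hypothetical.

```python
import random


def split_collection(doc_ids, n_test, seed=0):
    """Shuffle document ids and hold out `n_test` of them for testing."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    # The first n_test shuffled ids form the test set, the rest the training set.
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_collection(range(3661), n_test=1220)
```

Fixing the seed keeps the split reproducible, so later experiments are evaluated on the same held-out documents.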

Page 30: The HEPindexer project: Algorithm

Page 31: The HEPindexer project: Algorithm

Preprocessing

• Punctuation
• Lower case
• Remove stop words
• Stemming
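The four preprocessing steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HEPindexer code: the stop list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import string

# Tiny illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}
# Crude suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list:
    """Apply the four steps: punctuation, lower case, stop words, stemming."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The decays of B mesons, measured in collisions.")
```

Note that stripping all punctuation also removes hyphens and signs, which matters for physics notation such as `anti-p` or `pi+`; a production pipeline would likely treat those tokens separately.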

Page 32: The HEPindexer project: Algorithm

• Weight term - document
• Weight keyword - document
• Weight keyword - term
• Similarity keyword - document
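The slides name these four quantities but do not give their formulas. The Python sketch below shows one plausible reading, purely as an assumption: TF-IDF for the term-document weight, a 0/1 keyword-document weight (keyword assigned or not), keyword-term weights summed over the documents carrying the keyword, and cosine similarity between a keyword's term vector and a new document. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Term-document weights: term frequency times log inverse document frequency."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]


def keyword_term_weights(doc_weights, doc_keywords):
    """Keyword-term weights: sum term weights over documents carrying the keyword.

    The keyword-document weight is taken to be 1 if the keyword was assigned, else 0.
    """
    kw_terms = defaultdict(Counter)
    for weights, keywords in zip(doc_weights, doc_keywords):
        for kw in keywords:
            kw_terms[kw].update(weights)
    return kw_terms


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(kw_terms, doc_vector):
    """Keyword-document similarity for a new document, best match first."""
    scores = {kw: cosine(vec, doc_vector) for kw, vec in kw_terms.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy training collection (already preprocessed) with made-up keywords.
train_docs = [["quark", "mass", "lattice"], ["quark", "decay"], ["neutrino", "mass"]]
train_keywords = [["lattice field theory"], ["decay"], ["neutrino"]]

kw_terms = keyword_term_weights(tfidf(train_docs), train_keywords)
ranking = rank_keywords(kw_terms, {"neutrino": 1.0, "mass": 1.0})
```

The top of `ranking` would then be proposed as keywords for the new document, with a cut-off chosen on the training collection.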

Page 33: The HEPindexer project: Experiments

Page 34: The HEPindexer project: Experiments

Venn diagram: sets A and B and their overlap AB

A: keywords proposed by DESY
B: keywords proposed by HEPindexer

Keywords in the training collection
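With A the keywords proposed by DESY and B those proposed by HEPindexer, the overlap of the two sets gives the usual evaluation measures. The sketch below assumes the standard information-retrieval definitions (precision = |A ∩ B| / |B|, recall = |A ∩ B| / |A|), which the slides do not spell out; the function name and the example keyword sets are made up for illustration.

```python
def precision_recall(reference, proposed):
    """Precision and recall of `proposed` keywords against `reference` keywords."""
    a, b = set(reference), set(proposed)
    overlap = len(a & b)
    precision = overlap / len(b) if b else 0.0
    recall = overlap / len(a) if a else 0.0
    return precision, recall


# Made-up keyword sets for one paper, for illustration only.
desy = {"quark", "mass", "lattice field theory", "numerical calculations"}
hepindexer = {"quark", "mass", "decay"}
p, r = precision_recall(desy, hepindexer)
```

In this toy case two of the three proposed keywords are correct (precision 2/3) and two of the four reference keywords are found (recall 1/2).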

Page 35: The HEPindexer project: Results

• 52.7 % precision
• 58.5 % recall
• Response in 2 seconds
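As a worked example, the reported precision and recall can be combined into a single F1 score (their harmonic mean), and the response time can be set against the weekly load of about 500 preprints mentioned earlier. Neither derived figure appears in the slides, and the estimate assumes the 2-second response applies to each document.

```python
precision, recall = 0.527, 0.585

# Harmonic mean of precision and recall (F1); not a figure from the slides.
f1 = 2 * precision * recall / (precision + recall)

# Assumes the 2-second response applies to each of 500 weekly preprints.
weekly_minutes = 500 * 2 / 60

print(f"F1 = {f1:.3f}, weekly processing = {weekly_minutes:.1f} min")
```

Under these assumptions the F1 score is about 0.55 and a week's preprints would take roughly a quarter of an hour of processing.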

Page 36: The HEPindexer project: Results

Page 37: The HEPindexer project: Results

Page 38: The HEPindexer project: Software

• C++ / STL
• UNIX
• Command line interface
• Digilib: Web interface (PHP) http://cern.ch/digilib
• Installation on the CERN Document Server http://cds.cern.ch

Page 39: The HEPindexer project: Software

Page 40: The HEPindexer project: Software

Page 41: The HEPindexer project: Software

Page 42: Future Work

• Automatic proposition of secondary keywords
• Improve the algorithm (lemmatizer, multiwords, segmentation...)
• Use of references to link documents based on common concepts
• Specific algorithms for handling energies, particle decays, disintegrations, etc.
• Agents
• OAI
• Apply Semantic Web approaches

