CERN, European Organization for Nuclear Research
Automatic Keyword Assignment for High Energy Physics Literature
Arturo Montejo Ráez
ETT/SI Data Handling Group, CERN, Geneva (Switzerland)
Joint Research Center, Ispra (Italy), 4 March 2002
Automatic Keywording for HEP literature Ispra (Italy) 4 March 2002
What we are going to see today...
• Keyword assignment process
• Why keywords?
• How it is done for High Energy Physics papers
• The HEPindexer project: data, algorithm, experiments, results
• Future work
Keyword assignment process
[Diagram labels: Authors, Indexer, Keyworded papers]
The document...
• Full text paper
• Stored in a database
• Simplified representation needed
The thesaurus...
• Controlled vocabulary of concepts
• Relationships between keywords
• Categories and subcategories
• Can be domain specific
• Can be translated into multiple languages
The thesaurus: a relational model for terms
cheese
  MT 6016 processed agricultural produce
  BT1 milk product
    NT1 blue-veined cheese
    NT1 cow's milk cheese
    NT1 fresh cheese
    NT1 goat's milk cheese
    NT1 hard cheese
    NT1 processed cheese
    NT1 semi-soft cheese
    NT1 sheep's milk cheese
    NT1 soft cheese
  RT cheese factory (6031)
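As a minimal sketch, one entry of such a relational model could be held in a small record type. The class and field names are illustrative, and reading MT, BT, NT and RT as microthesaurus, broader term, narrower term and related term follows common thesaurus usage rather than anything stated on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One descriptor with its relations (hypothetical structure)."""
    term: str
    microthesaurus: str = ""                            # MT
    broader: list[str] = field(default_factory=list)    # BT
    narrower: list[str] = field(default_factory=list)   # NT
    related: list[str] = field(default_factory=list)    # RT


# The "cheese" entry from the slide, transcribed into the record.
cheese = ThesaurusEntry(
    term="cheese",
    microthesaurus="6016 processed agricultural produce",
    broader=["milk product"],
    narrower=[
        "blue-veined cheese",
        "cow's milk cheese",
        "fresh cheese",
        "goat's milk cheese",
        "hard cheese",
        "processed cheese",
        "semi-soft cheese",
        "sheep's milk cheese",
        "soft cheese",
    ],
    related=["cheese factory"],
)


def expand_query(entry: ThesaurusEntry) -> list[str]:
    """Return the term plus its narrower terms, e.g. to widen a search."""
    return [entry.term, *entry.narrower]
```

A search for "cheese" could then be widened to all of its narrower terms with `expand_query(cheese)`.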
The thesaurus: a subject tree
04 POLITICS
  0406 political framework
  0411 political party
  0416 electoral procedure and voting
  0421 parliament
  0426 parliamentary proceedings
  0431 politics and public safety
  0436 executive power and public service
08 INTERNATIONAL RELATIONS
  0806 international affairs
  0811 cooperation policy
  0816 international balance
  0821 defence
10 EUROPEAN COMMUNITIES
  1006 Community institutions and European civil service
  1011 Community law
  1016 European construction
  1021 Community finance
The indexer...
• An expert in the domain of the documents
• An expert in the use of the thesaurus
• A heavy task
• Does not always propose the same keywords
• Expensive!
Why keywords?
• Allow documents to be indexed in a coherent way
• Can be viewed like the "index" at the end of a book
• Concepts that better represent the content
• Human made (value added)
• Meaningful
• Can establish relations between documents
• Multilingual
Access to documents
But... we already have fulltext indexing!
Classification:
• To store (libraries)
• To access (narrow searches)
[Diagram labels: Category 1, Category 2, Category 3]
Crosslingual access
[Diagram labels: Navaja, Razor, Couteau, Lametta; query "Razor?"]
Multilingual comparison
[Diagram labels: Razor, Lametta, Murder, Frabbica]
Advantages over fulltext searches:
• No ambiguity
• Better relevance and precision
More advanced tools for searching and classification are coming!
The BIG problem...
EXPENSIVE
The BIG problem?
EXPENSIVE?
CERN
• The world's largest particle physics centre
• Explores what matter is made of, and what forces hold it together
• Employs just under 3000 people
• 6500 scientists come for their research
How it is done for High Energy Physics papers
DESY: Deutsches Elektronen-Synchrotron (Hamburg, Germany)
• DESY thesaurus
• Group of indexers (students, experts...)
• Only High Energy Physics related papers
The DESY thesaurus
A
  *a4(2040) ('postulated particle, a4(2040)', was delta(2040))
  *a6(2450) ('postulated particle, a6(2450)', was delta(2450))
  *abelian
  *aberration
  absorption
  -absorptive model (model, absorption)
  accelerator
  . . .
B
  B
  B anti-B
  B+
  B+L number
  B*(5320) (excited B)
  -B** ('B*2...', similar for B/s, etc.)
  *B*2(5732) (postulated particle, B*2(5732))
  B-
  -B-factory (B, particle source)
  B-L number
  . . .
The DESY thesaurus:
• Few categories, rarely used
• Only two types of keywords: main keywords (1191) and secondary keywords (949)
• No relationships between terms
• Specific terminology
The DESY thesaurus: specific terminology
• Energy declarations: 1.5-2.7 GeV-cms
• Resonances: Delta (1232)
• Reaction equations: anti-p p ---> K0 K- pi+
• Combinations: angular distribution, (photon), mass spectrum (pi+ pi- pi0)
• Two-particle initial state: 'anti-p p', 'electron positron'
[Diagram labels: Physicists, Indexer]
The problem
More than 500 preprints per week!
The HEPindexer project
The solution
[Diagram labels: Physicists, Indexer, Keyworded papers]
• Use of information retrieval (IR) techniques
• Objective evaluation
• Real-time answers
• Easily portable
• Fully integrable into CDS
• Possibility of growing
• Usable fully automatically or as an aid tool
[Diagram labels: Keyworded papers (collection); Keyword, Term; DESY keywords, Documents]
The HEPindexer project: Data
• 3,661 documents (2,441 in the training collection, 1,220 in the test collection)
• 19,143 terms
• 1,191 main keywords
The HEPindexer project: Algorithm
Preprocessing
• Punctuation removal
• Conversion to lower case
• Stop-word removal
• Stemming
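The four preprocessing steps can be sketched roughly as follows. This is only an illustration: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for whatever stemmer the project actually used, which the slides do not name.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "at", "by", "for", "in",
              "is", "of", "on", "the", "to", "with"}

# Suffixes tried in order; a stand-in for a real stemming algorithm.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lower-case, drop punctuation, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


terms = preprocess("Measurements of the angular distributions in B decays.")
```

Here `terms` holds the reduced representation of the sentence: stop words such as "of" and "the" are gone and plural forms are cut back to a common stem.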
Algorithm: weights and similarity
• Weight term - document
• Weight keyword - document
• Weight keyword - term
• Similarity keyword - document
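The formulas behind these four quantities are not preserved in this text, so the following is only one plausible way to connect them, not the HEPindexer definitions. It assumes tf-idf for the term-document weight, a 0/1 keyword-document weight (keyword assigned or not), a keyword-term weight obtained by summing the vectors of the training documents that carry the keyword, and cosine similarity between a keyword profile and a new document. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Term-document weights as tf * idf (an assumed weighting scheme)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def keyword_profiles(vectors, doc_keywords):
    """Keyword-term weights: sum of the vectors of documents carrying the keyword.

    The keyword-document weight is taken as 1 if assigned, 0 otherwise.
    """
    profiles = defaultdict(lambda: defaultdict(float))
    for vec, keywords in zip(vectors, doc_keywords):
        for kw in keywords:
            for term, weight in vec.items():
                profiles[kw][term] += weight
    return profiles


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(doc, profiles, idf):
    """Keyword-document similarity for a new document, best match first."""
    vec = {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}
    scores = {kw: cosine(profile, vec) for kw, profile in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy training collection (made up): preprocessed terms and assigned keywords.
train_docs = [
    ["quark", "mass", "lattice"],
    ["quark", "decay", "meson"],
    ["neutrino", "oscillation", "mass"],
]
train_keywords = [["quark"], ["quark", "meson"], ["neutrino"]]

vectors, idf = tfidf_vectors(train_docs)
profiles = keyword_profiles(vectors, train_keywords)
ranking = rank_keywords(["neutrino", "oscillation"], profiles, idf)
```

In this toy run the keyword "neutrino" ranks first, because it is the only keyword whose profile shares terms with the new document. A real system would keep only the top-ranked keywords or those above a threshold.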
The HEPindexer project: Experiments
[Venn diagram: sets A and B within the keywords in the training collection]
A: keywords proposed by DESY
B: keywords proposed by HEPindexer
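With A and B defined this way, precision and recall follow from the overlap of the two sets: precision is the share of B that is also in A, and recall is the share of A that B recovers. The sketch below uses the standard definitions; the keyword sets are made up, and how scores were averaged over documents in the actual experiments is not stated here.

```python
def precision_recall(reference: set[str], proposed: set[str]) -> tuple[float, float]:
    """reference = A (indexer keywords), proposed = B (system keywords)."""
    common = reference & proposed
    precision = len(common) / len(proposed) if proposed else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    return precision, recall


# Made-up keyword sets for a single document, for illustration only.
desy = {"quark", "mass", "lattice", "numerical calculations"}
hepindexer = {"quark", "mass", "meson"}

precision, recall = precision_recall(desy, hepindexer)
```

Two of the three proposed keywords are correct, and two of the four reference keywords are found.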
The HEPindexer project: Results
• 52.7% precision
• 58.5% recall
• Response in 2 seconds
The HEPindexer project: Software
• C++ / STL
• UNIX
• Command line interface
• Digilib: Web interface (PHP), http://cern.ch/digilib
• Installation on the CERN Document Server, http://cds.cern.ch
Future Work
• Automatic proposition of secondary keywords
• Improve the algorithm (lemmatizer, multiwords, segmentation...)
• Use of references to link documents based on common concepts
• Specific algorithms for handling energies, particle decays, disintegrations, etc.
• Agents
• OAI
• Apply Semantic Web approaches