+ All Categories
Home > Documents > Multiple Retrieval Models and Regression Models for Prior Art Search

Multiple Retrieval Models and Regression Models for Prior Art Search

Date post: 28-Jan-2016
Category:
Upload: yama
View: 40 times
Download: 0 times
Share this document with a friend
Description:
Participating institution: Humboldt Universität zu Berlin - IDSL. Multiple Retrieval Models and Regression Models for Prior Art Search. Patrice Lopez also at EPO Berlin, Germany. Laurent Romary INRIA Gemo - Saclay, France HUB-IDSL – Berlin, Germany. Plan. - PowerPoint PPT Presentation
Popular Tags:
31
Multiple Retrieval Models and Regression Models for Prior Art Search Participating institution: Humboldt Universität zu Berlin - IDSL Patrice Lopez also at EPO Berlin, Germany Laurent Romary INRIA Gemo - Saclay, France HUB-IDSL – Berlin, Germany
Transcript
Page 1: Multiple Retrieval Models and Regression Models for Prior Art Search

Multiple Retrieval Models and Regression Modelsfor Prior Art Search

Participating institution: Humboldt Universität zu Berlin - IDSL

Patrice Lopez

also at EPOBerlin, Germany

Laurent Romary

INRIA Gemo - Saclay, FranceHUB-IDSL – Berlin, Germany

Page 2: Multiple Retrieval Models and Regression Models for Prior Art Search

Plan

• Searching Scientific and Technical Documents • Issues related to Prior Art Search• Overview of PATATRAS• Patent Document processing• Combining metadata & text in four steps • Results• Future work

Page 3: Multiple Retrieval Models and Regression Models for Prior Art Search

PATATRAS !!

• PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general:

• Scientific and Technical Publications have 5 dimensions:

1. metadata 2. document structure3. textual content4. supporting content5. experimental data

• How is this instantiated in patent publications?

Page 4: Multiple Retrieval Models and Regression Models for Prior Art Search

Patent Publications

1. Metadata encode procedure-related data:– Date, applicant, inventors, language(s)– Classification: hierarchy of technical fields IPC,

ECLA (+ICO) G06F17/30T2P2X – Citations Information retrieval Query expansion

Page 5: Multiple Retrieval Models and Regression Models for Prior Art Search

EPO Citation Statistics

EPO Search Reports produced in the last 5 years (tot. 775.000)

0

100000

200000

300000

400000

500000

600000

US EP WO DE XP JP GB FR AT IT CA CN RU KR

of

Se

arc

h R

ep

ort

s

EC classified Pat: 95%

NPL: 24%

JP: 17%

1%

Page 6: Multiple Retrieval Models and Regression Models for Prior Art Search

Patent Publications

1. Metadata encode procedure-related data:– Date, applicant, inventors, language(s)– Classification: hierarchy of technical fields IPC,

ECLA (+ICO) G06F17/30T2P2X – Citations

2. Patent Document Structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)

– Strong interrelations between these structures– Each of these structures serves different goals

Information retrieval Query expansion

Page 7: Multiple Retrieval Models and Regression Models for Prior Art Search

Patent Publications

3. Textual Content of Patent:• Attornish, multilinguality

4. Supporting content:• tables, mathematical and chemical formulas,

citations, technical drawing, etc.

5. Experimental data: absent

Page 8: Multiple Retrieval Models and Regression Models for Prior Art Search

PATATRAS !!• Scientific and Technical Publications have 5

dimensions:1. metadata 2. document structure3. textual content4. supporting content5. experimental data

What are the known practices in prior art search ?

Page 9: Multiple Retrieval Models and Regression Models for Prior Art Search

Prior art search

Search reportSearch report

Patent application

Patent application

Prior artPrior artPrior artPrior art

Prior artPrior artPrior artPrior art

Prior artPrior art

Topic patents are granted patent publication (richer documents then applications; ECLA classes, extra citations; final claims)

Result set extended to all EPO documents introduced during the prosecution the application

All documents without EPO counterpart (via patent family) are discarded (patent applications never filed at the EPO and non-patent literature omitted)

CLEF-IP biases

Motivation and approach:• Non exhaustivity• Recall-oriented search is a myth (titles and abstracts; lack of elaborate tool)• Usage of classification for search (a priori restriction of the result set)•Usage of meta-data (thickets; patents are continuation of previous applications)

The patent examiner’sreal life

Page 10: Multiple Retrieval Models and Regression Models for Prior Art Search

PATATRAS !!• Scientific and Technical Publications have 5

dimensions:1. metadata 2. document structure3. textual content4. supporting content5. experimental data

• We investigated only 1 and 3 in CLEF IP 2009• However...

• how to combine metadata-based and text-content retrieval?• How to combine results in different languages?• How to combine different retrieval approaches?

Page 11: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Tokenization

POS Tagging

PhraseExtraction

ConceptTagging

FinalRankedResults

IndexLemma

en

IndexLemma

fr

IndexLemma

de

IndexPhrase

en

IndexConcept

Lemur 4.9- KL divergence- Okapi BM25

RankedResults

(10)

Query Lemma en

Query Lemma fr

Query Lemma de

Query Phrase en

Query Concept

InitInitial

Working Set

Post-Ranking

RankedMergedResults

Merging

PatentCollection

PatentTopic

Page 12: Multiple Retrieval Models and Regression Models for Prior Art Search

Lemur 4.9- KL divergence- Okapi BM25

Init

Post-Ranking

Merging

Overview of PATATRAS

Page 13: Multiple Retrieval Models and Regression Models for Prior Art Search

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Overview of PATATRAS

Page 14: Multiple Retrieval Models and Regression Models for Prior Art Search

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Overview of PATATRAS

Page 15: Multiple Retrieval Models and Regression Models for Prior Art Search

Patent Document Processing: Text Indexing

• Sound linguistic processing as groundwork:– No stemming: POS tagging & lemmatisation– No stop words: Only open grammatical categories are

considered (N, V, Adj., Adv., numbers)

• A total of 5 indexes:– One word form* (lemma) index per language (en, fr,

de) – English phrase indexing (Dice coefficient)– Conceptual indexing

*ISO/DIS 24611, Language resource management — Morpho-syntactic annotation framework

Page 16: Multiple Retrieval Models and Regression Models for Prior Art Search

Conceptual indexing

• Creation of a multilingual terminological database base based on a conceptual model* covering scientific & technical fields• Sources: MeSH, UMLS, Gene Ontology, SUMO,

WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de

• Merging on concept based on: – Domain matches (manual mappings between sources)– Term matches

• Represent terms/term variants/synonyms/acronyms and multilingual correspondences

• Term disambiguation based on IPC class

• 2,6 millions terms for en, 190.000 for de, 140.000 for fr

• 1,4 millions concepts (71.000 realized in de, 65.000 in fr)

*ISO 16642:2003, Computer applications in terminology — Terminological markup framework

Page 17: Multiple Retrieval Models and Regression Models for Prior Art Search
Page 18: Multiple Retrieval Models and Regression Models for Prior Art Search

Limitations of text-only retrieval• Queries are based on all the textual content of

the topic patent documents

Model Index Language base with citation text

KL lemma en 0.1068 0.1083

KL lemma fr 0.0611 0.0612

KL lemma de 0.0627 0.0634

KL phrase en 0.0717 0.0720

KL concept all 0.0671 0.0680

Okapi lemma en 0.0806 0.0813

Okapi lemma fr 0.0301 0.0303

Okapi lemma de 0.0598 0.0612

Okapi phrase en 0.0328 0.0330

Okapi concept all 0.0510 0.0516

Page 19: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 20: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 21: Multiple Retrieval Models and Regression Models for Prior Art Search

Patent Document Processing: Metadata

• Additional extraction of cited patents in the descriptions (regular expressions)• 7960 additional cited EP doc. found in XL set

• Metadata representation: basic normalization (author, applicant),

• Storage in a MySQL database (total 2,48 Go for the collection)

Page 22: Multiple Retrieval Models and Regression Models for Prior Art Search

Prior working sets

• Goal: For a given patent topic, create the smallest set of patents containing the relevant documents

• Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies

• Result: micro-recall of 0.7303, approx. 2600 doc. per patent topic (415 results per topic after final cutoff)

• Significant improvement of MAP results:

Model Index Language with cit. text with prior sets

KL lemma en 0.1083 0.1516 (+40%)

KL lemma de 0.0634 0.1145 (+81%)

KL phrase en 0.0720 0.1268 (+76%)

Okapi lemma en 0.0813 0.1365 (+68%)

Page 23: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 24: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 25: Multiple Retrieval Models and Regression Models for Prior Art Search

Merging of results

•Strong complementarities between the results sets•So many examples ! → fully supervised ML•Regression model for estimating for each patent topic the pertinence of a result set•Features: language, query size, init. working set size, max./min. & range of retrieval scores, IPC main & class, average phrase length•Training set: 500 + Addition of 4131 patents of the collection•Linear combination of weights: md

Mmmqdq wcs

Page 26: Multiple Retrieval Models and Regression Models for Prior Art Search

Merging of results

Feat. LeastMedSq MP SMO ν-SVM

f1 0.1681 (+5.8%) 0.1711 (+7.7%) 0.1706 (7.4) 0.1691 (+6.4%)

f1-6 0.1689 (+6.3%) 0.1797 (+13.1%) 0.1807 (+13.7) 0.1976 (+24.3%)

all 0.1786 (+12.4) 0.1898 (+19.4%) 0.2016 (+26.9%) 0.2281 (+43.5%)

f1 language

f2-6 related to the retrieval score

f7-8 IPC (domains)

f9 av. phrase length

Page 27: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 28: Multiple Retrieval Models and Regression Models for Prior Art Search

Overview of PATATRAS

Lemur 4.9- KL divergence- Okapi BM25

Init Post-RankingMerging

Page 29: Multiple Retrieval Models and Regression Models for Prior Art Search

Post-ranking

• Regression model for estimating the pertinence of a patent in the result set for a given patent topic

• Features: citation, # of common IPC & ECLA classes, prob. of citation, same applicant & inventors

• Training set: 500 + Addition of 4131 patents of the collection

Page 30: Multiple Retrieval Models and Regression Models for Prior Art Search

Final Results

Measures S M XL en-XL fr-XL de-XL MAP 0.2714 0.2783 0.2802 0.2358 0.1787 0.2092Prec. at 5 0.2780 0.2766 0.2768 0.2365 0.1855 0.2122Prec. at 10 0.1768 0.1748 0.1776 0.1575 0.1338 0.1467

• In average approx. 43s per topic

• Final runs (10.000 patent topics) for all, en, fr, de took 5 days on 4 machines

Page 31: Multiple Retrieval Models and Regression Models for Prior Art Search

Conclusion

• We have proposed an architecture for retrieving Scientific and Technical Publications

• We have adapted the architecture to patent search practices

• Need– improve terminological representations

– address document structures

– refine query representations

Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/


Recommended