
Post on 05-Jan-2016


“All databases are equal...

…but some are more equal than others.”

Stephen Adams,

Magister Ltd., GB

© Magister Ltd 2004, 2005 2

Topics

• Where database creation goes wrong…

• Why bother to evaluate?
  – A word about ‘quality’

• Quality content
  – missing documents, document kinds and fields

• Quality context
  – search engines

• Conclusion


The basics of information retrieval

[Diagram: Documents → Document representation; Query → Query representation; Matching of the two representations → Hits]

Adapted from Crestani, J. Inf. Sci. 29(2), 87-96 (2003)

Reference interview

“I’m sorry - I don’t understand the question…”

“Are you also interested in…?”

“How much do you already know about this?”

QUALITY RESULTS START WITH US.

Strategy development

“Where has that manual got to…!”

“When did they start using that field?”

“Is that field available for all records?”

Document quality - at source

“I leave the form-filling to the paralegals…”

“I’m sure my secretary never transposes application numbers - she can read my handwriting…”

“Our patent office uses that INID code differently…”

Full text, abstract, indexing...

“I get so much rubbish with full-text…”

“I don’t trust abstracts - especially for a freedom-to-operate search…”

“Their timeliness has improved - but indexing quality is down…”

“800,000 corrections per year”

Hitting the keyboard

“Where on earth did that false drop come from…?”

“We always use the free services - the results are OK so far”

“Why does this host always crash on a Friday?”

Major topics for today

Database content

Database context

Content and context

• The effectiveness of “a database” as a search tool is a function of (at least) two variables:
  – the data content
  – the search engine / command language.

• The ideal answer may be a compromise:
  – (‘average’ database & ‘good’ command language) or (‘good’ database & ‘poor’ search engine).


Why evaluate database content?

• Database evaluation is a basic part of information literacy:
  – “a set of abilities requiring individuals to recognise when information is needed and have the ability to locate, evaluate and use effectively the needed information.”
  – American Library Association 1989, Final Report of the ALA Presidential Committee on Information Literacy

• If we do not evaluate our sources, we cannot serve our customers fully.


The biggest database of all?

Isn’t that enough for anyone?

So why evaluate?


A simple evaluation parameter: language

[Pie chart: Cyber Atlas distribution 2000, by language. Segments: English, Japanese, German, Chinese, French, Spanish, Russian, Other]

Source: CyberAtlas, www.clickz.com/stats/big_picture/demographics/article.php/5901_408521

OCLC figures for 2004 are comparable: 30-35% of the Internet is not in English.

Implication:

• The effectiveness of ‘the Internet’ as a retrieval tool will be skewed according to the nature of our search:
  – “Hermann Hesse” “Das Glasperlenspiel” = 13,600
    • of which “& domain=de” comprises 13,400
  – “Hermann Hesse” “The Glass Bead Game” = 12,500
    • of which “& domain=de” comprises 128
  – “Hermann Hesse” “Magister Ludi” = 5,100
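The skew can be put into numbers by working out what share of each result set falls in the .de domain. A minimal sketch in Python, using only the hit counts quoted above; the name `domain_share` is illustrative and not part of any search tool.

```python
# Hit counts quoted on the slide: (total hits, hits restricted to domain=de).
HITS = {
    "Das Glasperlenspiel": (13_600, 13_400),
    "The Glass Bead Game": (12_500, 128),
}


def domain_share(total: int, in_domain: int) -> float:
    """Return the fraction of `total` hits that fall inside the domain."""
    if total <= 0:
        raise ValueError("total must be positive")
    return in_domain / total


for title, (total, german) in HITS.items():
    print(f"{title}: {domain_share(total, german):.1%} of hits in .de")
```

The German title is found almost entirely (about 98.5%) on German sites, while roughly 99% of the hits for the English title lie outside them.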

The third leg

• Good database evaluation should include not only the 2 factors identified above:
  – Database content i.e. how well it is put together
  – Database context i.e. the command language and search engine

• but also a third factor
  – How well does this database fit my specific enquiry? (one-off need or recurring usage)
  – Note - if the evaluation process includes this factor, it follows that there is no such thing as the ‘ideal’ database for all enquiries

What is quality?

• “Fitness for purpose”
  – content
  – completeness
  – timeliness etc.

• It is difficult to be absolute; it is easier to evaluate quality as a relative quantity
  – benchmarking two sources against one another gives a better practical feel for ‘quality’ than attempts to measure against a mythical standard

Simple example of quality

• We wish to conduct a freedom-to-operate search in respect of Germany
  – one file contains DE-C2, DE-B4 documents
  – a second file contains DE-C2, DE-B4, DE-C1, DE-B3, DE-T2 and DE-U documents

• Which one would you choose?
  – Whichever your answer, it does not imply that the other is ‘poor quality’.
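The comparison between the two files is a set comparison of document kinds. A minimal sketch in Python, using the kind codes listed above; the function name `compare_kinds` is an illustrative assumption.

```python
# Document kinds held by each file, as listed on the slide.
FILE_ONE = {"DE-C2", "DE-B4"}
FILE_TWO = {"DE-C2", "DE-B4", "DE-C1", "DE-B3", "DE-T2", "DE-U"}


def compare_kinds(first: set, second: set) -> dict:
    """Split two kind-code sets into shared and file-specific kinds."""
    return {
        "shared": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


result = compare_kinds(FILE_ONE, FILE_TWO)
for label, kinds in result.items():
    print(f"{label}: {', '.join(kinds) or '(none)'}")
```

The output shows what each file adds, not which is ‘better’: whether the extra kinds in the second file matter depends on the enquiry.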

Measuring quality

• We can measure good content
  – essentially quantitative, binary

• We can measure good database structure/context
  – essentially qualitative, relative, subjective
    • e.g. are there explicit links between individual records (e.g. common indexing scheme)?
    • e.g. do the command language features or field standardisation facilitate virtual links?
    • e.g. what proportion of the time is the system up?

The coelacanth

Location: Greenland
Zone: polar
Habitat: fresh water
Size: 30 cm.
Era: 200 m. years ago
Extinct for 50 m. years

Location: South Africa
Zone: sub-tropical
Habitat: salt water
Size: 1.75 metres
Era: 1938
Alive and breeding

Databases or datadumps?

• Science is not ephemeral - it is cumulative
  – Unless adequate consideration is given to the issue of retrieval at a distance of 10, 20 or 50 years after publication, then the resulting file is not a database at all - it is a datadump

• Much emphasis has been given in recent years to timeliness i.e. adding new records
  – add in haste, repent at leisure?


Robert Maxwell:

Chairman of Pergamon Press

Owner of Pergamon Orbit-InfoLine

Owner of Mirror Group Newspapers

“All the science that’s fit to print”

• Publication or ‘laid open to public inspection’ without consideration of retrieval afterwards means that each record is left isolated from the context of the corpus of science
  – and will be missed in a proportion of the searches to which it is a relevant answer
  – or possibly never found again


Three ‘layers of incompleteness’

• Missing documents

• Missing kinds-of-documents

• Missing fields

Missing documents

• The classical measure of database quality:
  – Is
    • every document of the same kind,
    • published in that period
    • by that publishing authority
  – present in the file?

• Examples:
  – Latipat, USPTO.gov, Patent Abstracts of Japan

Missing documents

• Latipat
  – Newly launched esp@cenet portal, http://lp.espacenet.com

• USPTO.gov
  – Full-text of granted patents

• Patent Abstracts of Japan
  – JAPIO file


Latipat

[Chart: axis running from 0 to 5000 in steps of 500; the plotted series did not survive extraction]

Both Latipat and PlusPat (below) suffer from the same problem - missing records; lots of them!


USPTO.gov

Partial listing of missing patents:

4097518 - 4097928 (411)

4526401 - 4527286 (886)…

= 6,092 missing between 4,000,000 and 4,999,999 (0.6%)

STILL 224 missing between 6,000,000 and 6,101,209 (0.2%)
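The figures above follow from simple counting over inclusive number ranges. A minimal sketch in Python that reproduces them and shows one way to locate gaps in a list of publication numbers actually present in a file; the function names are illustrative, and the approach assumes that publication numbers in the range are issued consecutively.

```python
def run_length(first: int, last: int) -> int:
    """Count the numbers in the inclusive range first..last."""
    return last - first + 1


def find_gaps(present, first: int, last: int):
    """Return inclusive (start, end) runs in first..last that are absent from `present`."""
    gaps = []
    expected = first
    for number in sorted(set(present)):
        if number < first or number > last:
            continue  # ignore numbers outside the range under test
        if number > expected:
            gaps.append((expected, number - 1))
        expected = number + 1
    if expected <= last:
        gaps.append((expected, last))
    return gaps


def missing_share(missing: int, first: int, last: int) -> float:
    """Return the missing count as a fraction of the inclusive range."""
    return missing / run_length(first, last)


print(run_length(4097518, 4097928))                        # 411
print(run_length(4526401, 4527286))                        # 886
print(f"{missing_share(6092, 4_000_000, 4_999_999):.1%}")  # 0.6%
print(f"{missing_share(224, 6_000_000, 6_101_209):.1%}")   # 0.2%
```

Given a list of numbers retrieved from a file, `find_gaps` returns the missing runs directly, for example `find_gaps([1, 2, 5, 6, 9], 1, 10)` gives `[(3, 4), (7, 8), (10, 10)]`.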


PAJ

PAJ fact sheet from Questel-Orbit

What the publicity implies

[Diagram: axes labelled APPLICANTS, DATE and TECHNOLOGY; date axis marked 1976]

First limitation - by applicant

[Diagram: axes labelled APPLICANTS, DATE and TECHNOLOGY; date axis marked 1976 and 1989]

Backfile to 1989 now available - but has every host loaded it?

Prior to 1998, cases not claiming JP priority were not automatically included in PAJ

Second limitation - by technology

[Diagram: axes labelled APPLICANTS, DATE and TECHNOLOGY; date axis marked 1976 and 1989]

Prior to 1989, only 48 out of 118 IPC classes were covered completely (40%)

Complete IPC coverage from 1989 - but no plans to create back-file?

The (messy) truth

[Diagram: axes labelled APPLICANTS, DATE and TECHNOLOGY; date axis marked 1976 and 1989]

How to evaluate?

• “Missing documents” is one of the few parameters which can be measured independently of the database
  – Annual Reports of the office concerned
  – WIPO Industrial Property statistics

• Caution:
  – these may not refer to the appropriate document kinds; check before use.

Caution

• Determining database ‘completeness’ is only meaningful when measured against a quantitative parameter
  – e.g. publication number.

• It has little or no meaning when measured using more qualitative parameters
  – e.g. no. of hits found using the same strategy across several databases
    • the strategy will be sub-optimal for some databases and not for others


Simple source-by-source comparison

BIOSIS -v- Medline

BIOSIS Evolutions vol.9 no.6 © BIOSIS


Science Direct: Comprehensive - provided it’s from Elsevier...

Web of Science: Comprehensive - provided it’s got a high impact factor from ISI...

MDL: PCT and EP from 1976 ?

Take-home message

• There is nothing wrong with publicity
  – provided it is not confused with user documentation.

• Database producers still have a long way to go in informing users of the gaps in their databases
  – it should be much easier to locate this data than it is at present.


Missing Kinds-of-Documents

• Second measure of database quality
  – Is
    • every document of every appropriate kind,
    • published in that period
    • by that publishing authority
  – present in the file?

• Examples:
  – Overlapping year / country coverage
  – EP-A1, -A2, -A3, -A8, -A9
  – US-B1, -B2, -E, -C1, -C2

But they all cover Australia...

• Even given overlapping country and year coverage, different sources can cover different publication stages

• e.g. Australia
  – WPI : AU-A from 1963, AU-B from 1993
  – INPD : AU-A from 1973, AU-B from 1978
  – CAS : AU-B from 1927

• AU-A is included in CAPlus family, even though it will never be selected as CAS basic - see http://www.cas.org/EO/patkind.html
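The Australian start dates above can be turned into a small lookup that answers “which sources hold this publication stage for this year?”. A minimal sketch in Python; the table holds only the start years quoted on the slide, the name `sources_covering` is illustrative, and the code assumes that coverage is continuous from each start year onwards. It does not model the CAPlus family treatment of AU-A.

```python
# Start year of coverage per source and kind code, as quoted on the slide.
AU_COVERAGE = {
    "WPI": {"AU-A": 1963, "AU-B": 1993},
    "INPD": {"AU-A": 1973, "AU-B": 1978},
    "CAS": {"AU-B": 1927},
}


def sources_covering(kind: str, year: int, coverage=AU_COVERAGE) -> list:
    """List sources whose stated coverage of `kind` had started by `year`."""
    return sorted(
        source
        for source, kinds in coverage.items()
        if kind in kinds and kinds[kind] <= year
    )


print(sources_covering("AU-B", 1985))  # ['CAS', 'INPD']
print(sources_covering("AU-A", 1970))  # ['WPI']
```

For a granted Australian patent of 1985, a search limited to WPI would find nothing at the AU-B stage, even though all three sources ‘cover Australia’.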

European correction documents

• ST.50 implemented from 1997
  – how many database producers take the data?
  – how many tell their users whether they take the data?

• Examples:
  – Questel-Orbit EPPATENT file
  – STN Europatfull file

Coverage of correction documents

1/1 EPPATENT - (C) Questel.Orbit- image
CPIM
PN - EP954211 A2 19991103 [EP-954211]
BPN - 1999-44
ET - Supporting apparatus
BRR - 2000-29 (Updated 2000-29)
DREX- 2001-01-18 Request for examination (Updated 2001-13)
DNEX- 2001-08-06 First examination report (Updated 2001-38)
DGR - 2003-07-23 Grant (Updated 2003-30)
BGR - 2003-30 (Updated 2003-30)
NGR - B1 (Updated 2003-30)

EPPATENT MAX format (edited) : all Bulletin announcements

Coverage of correction documents

L1 ANSWER 1 OF 1 EUROPATFULL COPYRIGHT 2004 WILA on STN

PATENT APPLICATION - PATENTANMELDUNG - DEMANDE DE BREVET
AN 954211 EUROPATFULL ED 19991114 EW 199944 FS OS
TIEN Supporting apparatus.
PIT EPA2 EUROPAEISCHE PATENTANMELDUNG

GRANTED PATENT - ERTEILTES PATENT - BREVET DELIVRE
AN 954211 EUROPATFULL UP 20030729 EW 200330 FS PS
TIEN Supporting apparatus.
PIT EPB1 EUROPAEISCHE PATENTSCHRIFT

Europatfull default format (edited) : no record of anything after EP-B1


INID (15) shows that this B8 is a correction to the B1 (grant).

How effective is this?

• The experienced information specialist is tempted to infer legal status information from the presence/absence of a particular publication stage (risky!)
  – e.g. EP-B = assumption of entry into force

• The inexperienced information specialist is not always given the correct links to lead to the right conclusion
  – e.g. US parent, re-issue, re-examination cases

Re-examination mentioned in facsimile version - but not in ASCII text:
Parent case - claims 1-10
Re-exam 1 - new claims 11-112
Re-exam 2 - new claims 113-126

IFI record consolidates all changes into a single record - the novice has a better chance of getting a more accurate answer to a legal status search.


US coverage

Columns: Kind Code | Definition | Earliest date of use | Dialog / 652-654 | IFI Claims | INPADOC (incl. Delphion) | Questel / USAPPS | Questel / USPAT | STN / USPAT2 | STN / USPATFULL | WPI

[Table: empty cells were lost in extraction, so the values in each row are listed in their original order without column alignment]

US-A, old Act grant: 1836 1971 1950 1968 1971 1971 1963
US-A, new Act published application: 2001 2001 2001 2001 2001 2001
US-B, old Act re-examination: Y 1981
US-B, new Act grant: 2001 2001 Y
US-C, new Act re-examination: 2001 Y
US-E, re-issue: 1838 Y 1963 1968 Y 1970
US-H, defensive publication: 1969 Y 1963 1977 1976
US-H1, Statutory Invention Registration: 1985 Y 1963 1985 1985 Y 1968
US-A1, Trial Voluntary Protest Program: 1975 1975
US-S, Design Patent: 1843 Y 1976 2001 1976 Y
US-P, old Act Plant Patent: 1931 Y 1976 1994 1976 Y
US-P1, Plant Patent published application: 2001 2001
US-P2, Plant Patent grant: 2001
US-A0, NTIS invention applications: 1974 1983
US-A9, Correction of new Act published application: 2002 2002 2002
None, Office of the Alien Property Custodian (APC): 1917

• Example analysis of KD coverage
  – e.g. IFI would appear to cover SIRs from 1963, some 22 years before they started (?)
  – e.g. split between USPATFULL/USPAT2 difficult to discern
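The SIR observation is an instance of a general sanity check: a file should not claim coverage of a kind code from a date earlier than the kind code's first use. A minimal sketch in Python; the function name `coverage_anomalies` is illustrative, and only the IFI figure discussed above is used as input.

```python
def coverage_anomalies(earliest_use: int, coverage_start: dict) -> dict:
    """Return {source: years early} where stated coverage precedes first use of the kind code."""
    return {
        source: earliest_use - start
        for source, start in coverage_start.items()
        if start < earliest_use
    }


# US-H1 (Statutory Invention Registration): first used 1985,
# while the IFI column of the table shows 1963.
print(coverage_anomalies(1985, {"IFI Claims": 1963}))  # {'IFI Claims': 22}
```

An anomaly flagged this way is a question for the producer rather than proof of an error: the date may reflect how a predecessor document type was loaded.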


Missing fields

• Third measure of database quality
  – every document of every appropriate kind
  – published in that period
  – by that publishing authority
  – present in the file to the same level of detail

• Evaluation must compare like-with-like
  – Variations in completeness of coverage and/or field population will affect the apparent effectiveness

• Examples:
  – EP and PCT full-text files, Derwent WPI coverage

Non-systematic or missing fields

• New field during database life
  – imposes an implicit time range on your search
    • e.g. IPC editions, WPI coding changes

• Systematic omission of a field
  – biases results against records which do not contain that field
    • e.g. US-A assignees
    • e.g. JP, CN inventors in WPI

European Patents Fulltext covers all European patent applications and granted European patents published since the opening of the European Patent Office (EPO) in 1978…

But…
EP-A specifications from 1986 in only one language
EP-B specifications from 1991 in three languages

PCT full text

• Many files claim to cover ‘full text’ PCTs
  – Few handle the cases published in Japanese, Chinese or Russian
    • but these still have an English abstract

• Abstract searching gives equal weight to all documents

• Full text searching skews results in favour of records containing full text

Derwent WPI countries

• Most countries in WPI are coded using the Manual Code system
  – but not all countries had Manual Codes added from the start of their coverage

• A strategy incorporating Manual Codes imposes an implicit time ranging on some countries, and can distort retrieval
  – MC retrieval of KR-B started 1990, biblio available from 1986
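The implicit time range can be made explicit by listing the years in which records exist but the field does not. A minimal sketch in Python for the KR-B example; the function name is illustrative, and the code assumes both dates are calendar years with the field populated from its start year onwards.

```python
def years_without_field(record_start: int, field_start: int) -> list:
    """List the years that have records but pre-date the introduction of the field."""
    return list(range(record_start, field_start))


KR_B_BIBLIO_START = 1986       # bibliographic records available from
KR_B_MANUAL_CODE_START = 1990  # Manual Code retrieval available from

blind_years = years_without_field(KR_B_BIBLIO_START, KR_B_MANUAL_CODE_START)
print(blind_years)  # [1986, 1987, 1988, 1989]
```

A strategy that relies on Manual Codes alone cannot retrieve KR-B records from these four years, although the records are in the file.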


Quality and the search platform

• A poor search platform / command language can ruin a good quality database, by effectively concealing or distorting the information which is present.

• Typical questions:
  – does the default print format contain the most useful information for my search?
  – do I obtain the same answer irrespective of the route to it?

Default print formats

1/1 PLUSPAT - (C) QUESTEL-ORBIT- image
CPIM (C) Questel-Orbit
PN - EP0954211 A2 19991103 [EP-954211]
TI - (A2) Supporting apparatus
STG - (A2) Pub. Of applic. Without search report

PlusPat BIB format : only shows first publication stage.

1/1 PLUSPAT - (C) QUESTEL-ORBIT- image
CPIM (C) Questel-Orbit
PN - EP0954211 A2 19991103 [EP-954211]
PN2 - EP0954211 A3 20000719 [EP-954211]
PN3 - EP0954211 B1 20030723 [EP-954211]
PN4 - EP0954211 B8 20040414 [EP-954211]
TI - (A2) Supporting apparatus
STG - (A2) Pub. Of applic. Without search report
STG2- (A3) Publi. Of search report
STG3- (B1) Patent
STG4- (B8) Modified first page

PlusPat MAX format

Variation due to search route

• US patent term extension under 35 USC §156 (Hatch/Waxman)
  – issued in the form of a Certificate of Correction

• At least two equivalent routes to view:
  – locate the original document and check for a ‘Correction’ segment in full text view OR
  – go directly to list of term extensions and link to Certificate
    • http://www.uspto.gov/web/offices/pac/dapp/opla/term/156.html

Test question

• Is there an extension in force for
  – US 4540568 ?
  – US 4572909 ?
  – N.B. - This question avoids the use of PAIR (inoperative on day of test) and assumes that the enquirer has already established that US 4540568 has been replaced by US Re 32969
    • but why was PAIR not working?
    • and why should I have to make that assumption?


35 USC 156 listing

Results

• US Re 32,969 (replaced US 4540568)
  – via 156 listing, obtains a PDF of Cert. of Correction showing term extended for 931 days
    • actual extension in listing is recalculated as 897 under 35 USC 156(c)(3)
  – via full text view, there is no record of the Certificate of Correction at all; nor any link from US 4540568

• US 4572909
  – via 156 listing, obtains a PDF showing extension for 1252 days
  – via full text view, an additional ‘Correction’ segment available


Additional document segment is present for US 4572909, but missing for others….

Summary answer

Source               US 4540568 / US Re32969    US 4572909
35 USC 156 listing   Yes                        Yes
Full text            No                         Yes
PAIR                 ?                          ?
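The summary table amounts to a consistency check: for each patent, do all search routes that returned an answer agree? A minimal sketch in Python using the answers from the table; the function name `inconsistent_routes` is illustrative, and "?" is treated as "no answer obtained" and left out of the comparison.

```python
# Answers to "is there an extension in force?" by patent and search route.
ANSWERS = {
    "US 4540568 / US Re32969": {
        "35 USC 156 listing": "Yes",
        "Full text": "No",
        "PAIR": "?",
    },
    "US 4572909": {
        "35 USC 156 listing": "Yes",
        "Full text": "Yes",
        "PAIR": "?",
    },
}


def inconsistent_routes(answers: dict) -> list:
    """Return the patents for which the answered routes disagree."""
    flagged = []
    for patent, by_route in answers.items():
        known = {answer for answer in by_route.values() if answer != "?"}
        if len(known) > 1:
            flagged.append(patent)
    return flagged


print(inconsistent_routes(ANSWERS))  # ['US 4540568 / US Re32969']
```

Only the re-issued patent is flagged: the answer for it depends on the route taken, which is exactly the failure the test question was designed to expose.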


Conclusion

• There is no such thing as “the database for all seasons”

• Evaluation is ongoing, even for established products

• There are many ways in which databases can be ‘incomplete’

• A poor search environment can ruin a good database

• Communication between legal, information and database specialists is the key quality factor

Coming up in part 2...

• Two case studies
  – PCT publication rates
  – Searching for gold


Enjoy your break!