Set-oriented retrieval

Information Processing & Management Vol. 25, No. 5, pp. 465-475, 1989. Printed in Great Britain.

0306-4573/89 $3.00 + .00 Copyright © 1989 Pergamon Press plc

SET-ORIENTED RETRIEVAL

A. BOOKSTEIN* Center for Information Studies, University of Chicago, Chicago, IL 60637, USA

(Received 30 September 1988; accepted in final form 4 January 1989)

Abstract - The broad way in which we look at how an information retrieval system (IRS) functions influences the types of questions we ask about it and the ways we try to improve performance. In the recent past, retrieval methodologies have been based on retrieving documents one at a time. In this paper we introduce a set-oriented view. We observe that this view is quite consistent with the single-document or sequential methods, and define a precise model to capture the set-oriented approach. We then examine a number of consequences of the model, such as the limitations implied by a finite index vocabulary. Finally, we discuss various ways in which the set orientation can influence our thinking about IR.

1. INTRODUCTION

Many formalizations of the information retrieval (IR) process have a dual interpretation: they can be thought of as describing operations to be carried out on individual documents to determine which ones ought to be retrieved, or they can be conceptualized as means for determining sets. This duality is particularly conspicuous in retrieval systems based on Boolean logic. Here each document is indexed by a set of terms taken from an indexing vocabulary and a request is made up of a subset of these terms connected by Boolean operators. In a Boolean system, it is possible to describe the retrieval process as one in which each document is tested to see whether it satisfies the retrieval criterion established by the request, with those satisfying the request being retrieved. But equivalently, one can think of each index term as defining a set (the documents indexed by that term) and the retrieval process as consisting of a sequence of operations on sets of documents, where the sequence of operations is governed by the form of the request (see, e.g., [1]). This duality continues with those modern approaches to IR based on the mathematics of fuzzy sets. (An excellent general introduction to the theory of fuzzy sets can be found in [2]. Bookstein [3] reviews the application of fuzzy sets to information retrieval.)

However, for various practical and theoretical reasons, most recent probability-based attempts to model IR and to develop retrieval algorithms have tended to be single-item oriented: they center around the task of deciding whether or not to retrieve any given document. The single-item orientation is most conspicuous when feedback information is used: here, given judgments about a preceding sequence of retrieved documents, the system’s task is to decide which document to retrieve next, or, in a closed system, whether to terminate the search.† This focus, prompted by the mathematical techniques that underlie these approaches, has caused us to lose sight of the fact that ultimately it is a set of documents that a user sees and that he evaluates the system on this basis. This is true whether the system operates by accepting a request and putting together a set of documents to retrieve (each document retrieval decision being made without consideration of the other retrieved documents), or operates interactively, taking advantage of feedback; ultimately the search

*The research reported here was supported, in part, by the National Science Foundation through Grant 183-8413559. This paper extends the paper presented at the 11th International Conference on Research and Development in Information Retrieval, Grenoble, France, June 13-15, 1988. Edited by Yves Chiaramella, Presses Universitaires de Grenoble.

†Robertson and Sparck-Jones [4], and others following, retrieve documents in stages, where in each stage a set of documents is retrieved based on feedback information generated in earlier stages. Bookstein [5] discusses an extreme version of this model, in which documents are retrieved individually in a sequence and each retrieval takes into account evaluation judgements for all previously retrieved documents.


stops and it is a set of documents that a user receives.* This crucial aspect of document retrieval is lost in much of the recent IR research, and, therefore, the insight that recognition of this characteristic of IR provides is also lost. For example, two documents centrally about a topic but with a great degree of overlap of content may both be appropriate to retrieve within the context of a single-item model, but not within the context of an explicit set-oriented model [6]. Similarly, a set-oriented approach might consider defining a set of documents in which the presence of one document assumes that one of the others is available for reference. Don Swanson has studied some very interesting examples in which documents were related in this manner because of the logical relations of their content [7,8]. For example, suppose the content of document A can be summarized by the proposition that $H_1$ implies $H_2$, while document B asserts that $H_2$ implies $H_3$. Then, given a request for which the proposition $H_1$ implies $H_3$ would be interesting, neither A nor B individually would be relevant, but the two together would be most valuable.

These problems can be summarized most succinctly in the vocabulary of probabilistic retrieval [3]. A central concept of the probabilistic retrieval models is the cost of a retrieval decision, i.e., a numeric quantity assigned in fact or in principle to a retrieval session that measures how well or how poorly the system has performed. Translated into these terms, we argue that costs should be assigned to the retrieval of sets of documents rather than to individual documents; as detailed below, we are asserting that cost is a set function.

The considerations noted above would be very difficult to incorporate into the single-item models that have been proposed thus far. The objectives of this paper are as follows:

• To point out, as above, that not only the Boolean retrieval methods, but also the sequential methods of probabilistic retrieval can be described in set-theoretic terms;
• To use a decision-theoretic framework analogous to that used in conventional probabilistic retrieval to outline a set-oriented model of an information retrieval system (IRS); and
• To explore some interesting paths suggested by adopting this viewpoint; in particular, to study how various retrieval mechanisms implicitly create structures that limit retrieval capability.

Accepting a set-oriented viewpoint might enhance retrieval effectiveness, though perhaps not so much by leading to more sophisticated elaborations of current item-oriented techniques as by causing us to think in terms of different approaches that complement these techniques. The more holistic vision of the collection that a set-oriented approach provides encourages us to search for structure naturally occurring within a collection and for mechanisms that create such structures: these structural relations are the means by which a set can be broken down into a collection of subsets. Many such structuring mechanisms already exist and are familiar in libraries and other document retrieval systems; others can be introduced and exploited. Also, since item-oriented retrieval is a special case of set-oriented retrieval, we expect a set-oriented approach will allow us to understand better the functioning of current models.

2. RETRIEVAL MODEL

Our first task is to define an explicitly set-oriented, decision-theoretic model of an IRS. The theory underlying the now traditional probabilistic models makes heavy use of cost functions when determining how to retrieve a document: costs are, in principle, associated with various retrieval outcomes and documents are retrieved using the rule of retrieving the least costly document next, of course taking uncertainty into account. This provides

*Formally, we define $R$ as the set of legitimate requests, and $E$ as the set of possible user evaluations of documents: $E$ determines the possible feedback sequences that might occur as the retrieval process evolves (that is, each $e \in E$ is a function from the set of documents, $D$, to the set $\{0,1\}$). Then the retrieval process can be defined as a function $T: R \times E \to 2^D$. This formalization makes clear the set character of the current feedback systems. In earlier systems, $T$ does not depend on $E$.


a simple, unifying approach to IR with a great deal of aesthetic appeal. It has been rich in generating retrieval algorithms. It allows us to understand better the performance of the popular, earlier vector retrieval models [9]. Also, it has proven fertile in generating researchable questions, whose investigation has greatly enriched our understanding of the IR process. For these reasons, as we move from a single document to a set orientation, we would like to define retrieval once again as a decision-theoretic activity, driven by an underlying cost function, but now with a set focus. In developing this model, we imagine a user presenting a request to an IRS and the system, perhaps after interacting with the user, defining a set of documents, referred to as the retrieved set. We are not concerned here with whether the system simply presents a set of documents to the user or whether the user is given the documents one at a time, with the stop decision being made either by the user or the system. We are also neglecting the effect of order in the presentation of documents. This simplifies our discussion. Much of what follows could, however, be extended to ordered sets or sequences of documents. In our model, the user then evaluates the set, taking into account the quality, for his purposes, of the documents retrieved (perhaps with some sense of how many desired documents are missing), the number of documents retrieved, redundancy, coherence, and so forth. We shall assume that a user is able to compare two sets of documents and judge which is better for his purposes. We shall model this capability by defining for each request a utility or cost associated with each set of documents. In terms of this cost function, the objective of an IRS is to retrieve the set that has the minimum cost. Typically, not every possible subset is retrievable. A more precise statement of the objectives of an IRS is to retrieve the retrievable set of lowest cost, a retrievable set being a set that can be retrieved given the processes permitted by the system. Within a Boolean IRS, what constitutes a retrievable set is evident. A careful statement of what constitutes the retrievable sets, their relative sizes, and the relations among them in the more complex weighted and feedback systems is not at all obvious and is an interesting research question. We shall comment further on aspects of this question below.
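The objective just stated (retrieve the retrievable set of lowest cost) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three toy documents, the topics they cover, the weights in `cost`, and the choice of which sets count as retrievable are hypothetical and do not come from the paper.

```python
from itertools import combinations

# Hypothetical toy collection: document id -> topics it covers.
DOCS = {
    "d1": {"x", "y"},
    "d2": {"x", "y"},  # same content as d1, so retrieving both is redundant
    "d3": {"z"},
}
WANTED = {"x", "y", "z"}  # topics this (hypothetical) user cares about


def cost(doc_set):
    """Illustrative set cost: missed topics are expensive, every document read costs a little."""
    covered = set().union(*(DOCS[d] for d in doc_set))
    missed = len(WANTED - covered)
    return 10 * missed + len(doc_set)


def best_retrievable(retrievable_sets):
    """Return the retrievable set of lowest cost."""
    return min(retrievable_sets, key=cost)


# A system able to retrieve every subset of the collection.
ALL_SUBSETS = [frozenset(c) for r in range(len(DOCS) + 1)
               for c in combinations(sorted(DOCS), r)]

# A weaker system that cannot tell d1 and d2 apart (assumed identical indexing).
RESTRICTED = [frozenset(), frozenset({"d1", "d2"}), frozenset({"d1", "d2", "d3"})]

print(sorted(best_retrievable(ALL_SUBSETS)))
print(sorted(best_retrievable(RESTRICTED)))
```

Because the cost is assigned to the whole set, the redundant pair d1 and d2 is never the best choice when the system can separate them, while the restricted system is forced to accept a more costly set.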

An important component of an IR model is the user request. Bookstein and Cooper, in their formalization of an IR system [10], simply treated the request space as a set made up of formally structured linguistic entities that the system processed in order to retrieve a set of documents. The more recent feedback mechanisms demand a more complex conceptualization of this component of the system. While a user enters the system with a request of the form described above, this request is modified greatly in the course of the search. To assist in formalizing the concept of a request, it is helpful to remind ourselves how a request is used in the retrieval process: the request is used only as a mechanism for defining, through a weighting or matching process, the set of documents likely to be desired by the patron. In fact, the feedback mechanisms explicitly modify the request in an attempt to create a variant whose retrieved set better matches the needs of the user, and ideally stop when the retrieved set is the best possible given the discrimination capabilities of the system. Thus, the linguistic entity presented by the user is in fact an attempted representation of the documents that he would like to see, expressed in a form convenient for the system to manipulate. It is intrinsically an approximation made by the user with an understanding or sense, very likely vague, of how the documents themselves are represented and how the system will use his request and the document representations to define a set of documents. The system will try to improve the user’s statement in the course of the search, but typically user needs exist for which no request can be formulated that retrieves the ideal set of documents.

In our formalization, we shall directly define the user request to be the set of documents he would like to see, and the presented request as a representation of this in a form acceptable to the system. To distinguish the request from its verbal expression, we shall refer to the latter as a query.

We can now bring these components together: an IRS is made up of a set of documents $D$ and a retrieval function that transforms a query and, in a feedback system, a history of responses into a set of documents. The user brings to the system a request that we formalize in the model as follows: the request component of an IRS is defined as a function $C: 2^D \to \mathbb{R}$, where the real numbers referred to are the costs the user associates with a set of documents. The objective of the system is to define a retrieval function such

that, if $A$ is the set of documents retrieved for a given user, $C(A)$ is as small as possible among the subsets the system is capable of retrieving.

Our definition of a request, while abstract, has the benefit of relating very closely to the ultimate goal of an IR system. It precisely represents the user's need and his intentions when visiting the system. It emphasizes the approximate and system-dependent character of the verbal locution presented to the system and it encourages us to think in terms of how well specific implementations match, or in principle can match, this function. Further, taking this point of view suggests a number of interesting theoretical questions. Clearly, the central component of this model is the cost function. It is interesting to consider what constraints are desirable to permit this function to naturally model human interactions with an IRS; to consider the implications for retrieval of various assumptions one might make about this function; and, in particular, to inquire whether the fact that single document-oriented retrieval has been so influential and effective implies some characteristics about the cost function.

Using a cost or utility function as the underlying concept driving the retrieval model helps shed light on other troublesome concepts in information retrieval. A very interesting example is that of relevance. Among the reasons the notion of relevance has caused difficulty has been the single document orientation of most discussions about relevance. This has led to the troubling sequency paradox: a given document may well be called relevant if it is the only one presented to a user, but not if it is one of 20 documents, e.g., if the content of that document is better covered in other documents or even if the user's sense of what he wants is influenced by the other documents. This conceptual problem is eliminated if we define relevance as pertaining to a set of documents rather than to a single document (though, of course, a single document also constitutes a set).

Another problem is defining the task demanded of a person when assessing relevance. We argued earlier for a utility-based relevance [11], in which relevance is defined in terms of a user's acceptance of a document. We can now extend this idea and simplify, at least in principle, the task demanded of a user in a relevance evaluation. While detailed evaluations of items are often difficult to make, people do seem able to make preference decisions between pairs of items in terms of the problem at hand. We expect it would be easier for a user to decide, given two packages of documents, which he prefers than to decide whether a given document is relevant. Given the preferences of a user for a number of subsets (in principle all), then if the usual structure of preference relations applies, a cost or utility function consistent with these expressed preferences could be defined [12]. Such a preference function would define an ordinal relevance scale.

More interesting is the case when an interval scale can be defined. Also, questions relating to the additivity of the cost function can be probed. For example, if $C(d_1, \ldots, d_n) = g(n) + \sum_i C(d_i)$, then we have a theoretical justification for talking about the relevance or value of a single document. We expect such a structure to be approximately realized, since the single document-oriented approaches in fact work quite well.

The above model leads us to ask a number of questions essential for developing a firm theoretical understanding of IR. As noted above, an IRS can be thought of as a function, say $T$, that takes information available to the system, $I$, and transforms it to a set of documents: $T(I) \subseteq D$. The information used by current IRS includes the request information, indexing information and a history of evaluations. Any IRS must make its decisions on a restricted class of information. Given the information available to an IRS, we would like to ask the following:

• Given two retrieval functions, $T_1$ and $T_2$, from the same class of functions, when can we say that $T_1$ is better than $T_2$ ("$T_1 > T_2$", or more precisely, $E(C(T_1(I))) < E(C(T_2(I)))$, since the information available is a random variable and the expectation over the range of possibilities is the most reasonable way to combine different outcomes)?
• Does an optimal strategy, $T^*$, exist given a class of information that can be assumed available, such that $T^* > T$ for all $T \neq T^*$?


• Suppose $S_o$ is the optimal set for a given user. Can $T^*(I)$ be related to $S_o$?

Such questions lead us to look for structural characteristics associated with particular classes of IRS that may limit their ability to generate or make optimal use of information potentially available to them. Since IR has not been studied from this point of view before, we begin with a particularly simple case, that of Boolean retrieval, and go on to speculate how this discussion may be generalized.

3. INDEXING

The most conspicuous way in which structure is imposed on an IRS is by means of indexing: as is well understood, indexing and sets are closely related, in that indexing divides the collection into sets of documents, each set being “about” the concept denoted by a given index term. Boolean retrieval methods are means of manipulating these sets, exploiting this structure to define retrieval sets. But concentrating on the semantics of index terms makes it easy to overlook some formal constraints imposed on a collection by its depending on indexing as a means of providing a structure for retrieval; the existence of these constraints is clearer when viewing indexing as the generation of sets.

More generally, we are very interested in the problem of how the structures implied by a retrieval approach in themselves influence the flexibility with which an IRS based on that approach is able to define a retrieval set for a user. We shall now study this in depth for the case of Boolean retrieval since here the structure is most evident and the analysis simplest. For example, it is easy to believe that, if a set of documents is properly indexed, it should be possible to retrieve the ideal set for a patron, if only the query could be properly formulated. In fact, the existence of a finite indexing vocabulary creates intrinsic constraints on the number of subsets that can be retrieved, and thus the fineness of discrimination. Specifically, it can be shown that any Boolean request is equivalent to a request of the form $t_1$ OR $t_2$ OR ... OR $t_m$, where the $t$'s are simple conjunctions of index terms or of their negations: $t = \tilde{i}_1$ AND $\tilde{i}_2$ AND ... AND $\tilde{i}_n$, where $\tilde{i}_j$ denotes either “$i_j$” or “NOT $i_j$”, for $i_j$ one of the language’s $n$ index terms. The constraint on the number of possible sets that can be retrieved, which is identical to the number of possible independent Boolean requests, is now easily stated. We first note that, given $n$ index terms, only $2^n$ $t$’s are possible, since each $t$ is determined by deciding for each of the $n$ index terms whether it appears directly in the $t$ or is “NOTed,” and there are only $2^n$ possibilities. Finally, each Boolean expression is determined by selecting a number of such $t$’s. If there are $m$ such $t$’s, there are only $2^m$ ways to select a subset of $t$’s to form a Boolean expression. We conclude that there are only $2^{2^n}$ independent Boolean expressions that can be formed from $n$ index terms [13]. Since a set of $N$ documents can be broken up into $2^N$ subsets, a fundamental constraint for every subset of documents to be retrievable is that $2^{2^n}$ be at least as large as $2^N$, or that $n$ be at least $\log_2 N$. Generally, we see some measure relating the effective expressiveness of an IRS (here $2^{2^n}$) to the effective demand on the system (here $2^N$) as a critical parameter in defining the capability of a retrieval mechanism. We include the qualifier effective since constraints on the system will limit the number of, for example, requests made to the system or the relative probabilities of their occurrences, and this should influence our measure.
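The counting argument can be checked by brute force for a very small vocabulary. The following Python sketch is an illustration only: it assumes queries are given directly in the normal form described above (a set of full conjunctions ORed together), and the function names and the toy collections are hypothetical.

```python
from itertools import combinations, product


def minterms(n):
    """All 2**n full conjunctions over n terms: True = term present, False = NOTed."""
    return list(product([False, True], repeat=n))


def retrieve(query, collection):
    """query: set of full conjunctions ORed together.
    collection: doc id -> indexing pattern (tuple of booleans, one per term)."""
    return frozenset(d for d, pattern in collection.items() if pattern in query)


def distinct_retrievable_sets(n, collection):
    """Run every normal-form query and collect the distinct retrieved sets."""
    ts = minterms(n)
    results = set()
    for r in range(len(ts) + 1):
        for chosen in combinations(ts, r):
            results.add(retrieve(set(chosen), collection))
    return results


n = 2
# Four documents, one per possible indexing pattern: N = 4 = 2**n.
four_docs = {i: pattern for i, pattern in enumerate(minterms(n))}
# A fifth document that repeats an existing pattern: N = 5 > 2**n.
five_docs = dict(four_docs)
five_docs[4] = (True, True)

print(len(distinct_retrievable_sets(n, four_docs)), 2 ** 4)
print(len(distinct_retrievable_sets(n, five_docs)), 2 ** 5)
```

With two index terms there are $2^{2^2} = 16$ normal-form queries. They suffice to retrieve all 16 subsets of four distinctly indexed documents, but cannot reach all 32 subsets once a fifth document is added.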

The above formal constraint sets a lower bound on the number of index terms needed to retrieve every set of documents. It doesn’t in itself say that if $n$ satisfies this constraint, then all subsets are retrievable. It is, however, easy to show that, formally at least, the above condition is adequate in the sense that if $n$ satisfies the above condition, then it is possible to assign index terms in a manner that permits each subset to be retrieved. Suppose we have $N$ documents. Let us order these documents and assign an integer to each that indicates its place in the list. Further, let us represent each such integer by its binary representation. We first note that a string of $\log_2 N$ binary digits (rounded up to an integer) will be adequate to represent each document, and that each document is uniquely represented in this manner. Thus, if we have $\log_2 N$ symbols (or index terms), we can associate one with each place in the binary representation of a document’s placement on the list. We can now assign an index term to a document, provided that term is associated with a binary 1 in that document’s binary representation. Therefore, we have satisfied the demand made above for the following:


1. We use only $\log_2 N$ symbols.
2. Each document is uniquely indexed.
3. Any subset of documents can be retrieved by ORing together the index term descriptions of the individual documents comprising the set.

As an example, consider a set of eight documents, numbered 000 to 111. (Here we are including the possibility that a document may be indexed by no terms.) Since $\log_2 8$ is 3, only three binary digits are needed to represent each document. Let the three “index terms,” or symbols, associated with the binary digits as we go from left to right be a, b, and c, and the documents be numbered 000, 001, 010, 011, ..., 111. Thus, the first document that has any index terms (001) is indexed by c only, the second (010) by b only, the third (011) by b and c, ..., and the last (111) by all three terms. Now any subset of documents can be retrieved by means of a Boolean expression. For example, to retrieve the third document only, the request (NOT a) AND b AND c would be used. Similarly, any other single document (including the first) can be retrieved by itself, and by ORing such single document requests an arbitrary subset of documents can be retrieved.
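The eight-document example can be reproduced with a short Python sketch. The function names below are hypothetical, and a query is represented simply as the set of full conjunctions it ORs together, which is an assumption made to keep the sketch small.

```python
def assign_index_terms(num_docs, terms):
    """Index document k by the terms whose bit is 1 in the binary form of k."""
    width = len(terms)
    if 2 ** width < num_docs:
        raise ValueError("need at least log2(N) index terms")
    return {
        k: frozenset(t for t, bit in zip(terms, format(k, f"0{width}b")) if bit == "1")
        for k in range(num_docs)
    }


def conjunct_for(doc_terms, terms):
    """Render the full conjunction that matches exactly this indexing."""
    return " AND ".join(t if t in doc_terms else f"(NOT {t})" for t in terms)


def retrieve(wanted_ids, index):
    """OR together one full conjunction per wanted document and evaluate the result."""
    query = {index[k] for k in wanted_ids}  # each element stands for one conjunction
    return {k for k, doc_terms in index.items() if doc_terms in query}


TERMS = ["a", "b", "c"]
INDEX = assign_index_terms(8, TERMS)

print(conjunct_for(INDEX[3], TERMS))   # the request for document 011
print(sorted(retrieve({0, 3, 6}, INDEX)))
```

Document 000 carries no index terms, yet it is still retrievable on its own through the conjunction in which all three terms are NOTed.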

Thus, in principle, we have full formal retrieval command of a set of 1024 documents with 10 index terms, and a set of 100 index terms gives us full command of $2^{100}$, or about $10^{30}$, documents. It is clear that the constraint as stated is not a very restrictive one. However, effective restrictions on the indexing of documents, for example, limit the possibilities for retrieval, and these must be taken account of if we are to have a useful measure of system expressiveness as defined above. But in general we see that it is desirable to maximize the number of queries per set of documents, that is, the query density per set of documents. The greater this quantity, the greater the flexibility in formulating requests, the greater the resilience to error, and the better the ability to compensate for unequal densities of requests and documents.

One such restriction on the effectiveness of Boolean retrieval is the limited number of terms assigned to documents; the above argument assumed a document could be indexed by any subset of index terms. Suppose, for example, exactly $s$ index terms are assigned to each document. Then the collection is broken up into $\binom{n}{s}$ retrieval-equivalent components, since no Boolean query could discriminate between documents within a class of documents all of which are identically indexed. Any such equivalence class could be retrieved (with all other classes excluded) by the simple conjunct ($a_1$ AND $a_2$ AND ... AND $a_s$) AND NOT $a_{s+1}$ AND ... AND NOT $a_n$, where $a_1$ through $a_s$ are the index terms by which the desired class is indexed; there is a one-to-one correspondence between such conjuncts and equivalence classes of documents. Each set of such conjuncts ORed together would define a different retrieval class: the union of the equivalence classes associated with each conjunct. There are $2^{\binom{n}{s}}$ such queries; this quantity thus forms a lower bound on the number of nonequivalent Boolean queries.

On the other hand, all Boolean queries are equivalent to queries of this form for a set of documents indexed by exactly $s$ index terms. Consider an arbitrary Boolean query. Transform this into the equivalent normal form described earlier. The conjuncts will be of three types: those with $s$ positive terms, those with fewer than $s$ positive terms, and those with more than $s$ positive terms. The last two groups of conjuncts, however, cannot retrieve any documents. For example, given any document, any conjunct with $s + 1$ positive terms must demand the presence of at least one term not in that document, and similarly any conjunct with, say, $s - 1$ positive terms must demand the absence of at least one term that is in the document. Thus, this request is equivalent within this collection to the first component alone, which is of the form described in the preceding paragraph. We conclude that under the conditions stated, $n$ index terms can distinguish $\binom{n}{s}$ basic sets from which other sets can be composed, and if every set is to be retrievable, $\binom{n}{s}$ must be at least $N$. For example, whereas 100 index terms can in principle distinguish as many as $2^{100}$ documents, if each document is indexed by 10 terms, this quantity drops to $\binom{100}{10}$, or to about $2^{44}$. While still large, the quantity indicates the extent to which realistic restrictions can reduce the effectiveness of a retrieval language.

Another restraint is that not all combinations of $s$ index terms will be applied to documents since concepts cluster. If the $n$ index terms break up into $k$ classes of $m$ terms each, and each document is indexed by $s$ terms, then there are only $k\binom{m}{s}$ retrievable classes of documents, and $2^{k\binom{m}{s}}$ logically distinguishable queries. For every set of documents to be retrievable, we must demand that $k\binom{m}{s} \geq N$. For example, if the 100 index terms serving in the previous example in effect broke up into five clusters of 20 terms each, the upper limit on the number of possible documents that could be controlled is $5\binom{20}{10}$, or about 1 million documents.
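The orders of magnitude quoted in this section follow from elementary binomial arithmetic, which can be reproduced with Python's standard `math.comb`. The two helper names are hypothetical labels for the quantities $\binom{n}{s}$ and $k\binom{m}{s}$.

```python
from math import comb, log2


def classes_fixed_depth(n, s):
    """Distinguishable document classes when every document carries exactly s of n terms."""
    return comb(n, s)


def classes_clustered(k, m, s):
    """Same count when the terms fall into k clusters of m terms and
    each document draws its s terms from a single cluster."""
    return k * comb(m, s)


# Unrestricted indexing with 10 terms commands 2**10 documents.
print(2 ** 10)
# 100 terms, exactly 10 per document: about 2**44 classes.
print(classes_fixed_depth(100, 10), round(log2(classes_fixed_depth(100, 10)), 1))
# Five clusters of 20 terms, 10 terms per document: about one million classes.
print(classes_clustered(5, 20, 10))
```

Each successive restriction removes many orders of magnitude from the number of documents that the same 100-term vocabulary can discriminate.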

But the effectiveness of a retrieval system is also influenced by restrictions on the query space. For example, in a system with 100 index terms, we are never asked to retrieve documents indexed by all 100 terms. If we were to approximate the actual query space by conjuncts composed of three index terms ANDed together without use of the NOT operator, then only $\binom{100}{3}$ queries would be possible;* given restrictions on the indexing of documents, many of these would be equivalent to one another. Specifically, if queries consisted only of the ORing of unique conjuncts, each made up of exactly $s$ index terms ANDed together, only $2^{\binom{n}{s}}$ logically distinct requests are possible. Clearly no more than this quantity of requests are possible, by arguments similar to those given above. But no two such requests are equivalent, if all possible document indexing is permitted. This is because the document indexed by exactly the terms $w_1, w_2, \ldots, w_s$ will be retrieved by, and only by, the conjunct ($w_1$ AND $w_2$ AND ... AND $w_s$). (Of course the conjunct will retrieve many other documents as well.) Call that document the document associated with that conjunct. Consider now an arbitrary query made up of a number of unique conjuncts of $s$ terms. Among the documents retrieved are those associated with the conjuncts making up the query. Further, no other document indexed by $s$ terms can be retrieved by such a query (if a document is to be retrieved by such a request it must be retrieved by at least one of the conjuncts, and no conjunct can retrieve an $s$-term document other than the one associated with it). Now if another request of this form were equivalent to the one being considered, it must, by definition, retrieve the same set of documents. Since it must retrieve the subset of documents associated with the conjuncts of the first request, the new request must contain all the conjuncts of the previous request. Similarly, it cannot retrieve any new $s$-term document, so it cannot introduce any additional conjuncts. Thus, apart from order (or repetition of conjuncts) the two queries are identical. We conclude that indeed each of the $2^{\binom{n}{s}}$ queries is distinct.
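The distinctness argument can be verified exhaustively for a tiny vocabulary. This Python sketch assumes, as the argument does, that every possible indexing occurs in the collection; the function names and the four-term vocabulary are hypothetical.

```python
from itertools import combinations
from math import comb


def all_subsets(items):
    """Every subset of items, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def count_distinct_queries(terms, s):
    """Count distinct retrieved sets over all queries that OR together
    positive conjuncts of exactly s terms (no NOT operator)."""
    documents = all_subsets(terms)  # one document per possible indexing
    conjuncts = [frozenset(c) for c in combinations(terms, s)]
    seen = set()
    for query in all_subsets(conjuncts):
        # A document is retrieved if it contains every term of some conjunct.
        retrieved = frozenset(d for d in documents if any(c <= d for c in query))
        seen.add(retrieved)
    return len(seen)


terms = ["a", "b", "c", "d"]
print(count_distinct_queries(terms, 2), 2 ** comb(4, 2))
```

For four terms and two-term conjuncts, every one of the $2^{\binom{4}{2}} = 64$ queries retrieves a different set, in agreement with the argument above.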

Similarly, again as with document indexing, terms are associated in requests, so that not all groups of $n$ terms will occur. Thus, we find that relationships among index terms that limit how they occur in queries and how they are used to index documents create very strong limits on the number of documents that we can control by means of an IRS driven by a finite index vocabulary. The set of typical requests is much smaller than the set of all possible Boolean requests, and index terms are not applied methodically to maximize the system’s ability to discriminate subsets of documents.† We could continue by seeking other constraints and examining the combined effects of constraints on indexing and request formulation. However, in determining a measure of system capability, some of the most interesting constraints can be stated in terms of the probabilities of various indexing (or request formulation) conditions occurring. This extension also allows us to consider the effectiveness of non-Boolean systems. The following section suggests some means of developing such measures.

4. MEASURE OF STRUCTURE

We estimated above the effective number of sets that can be controlled by a set of index terms based upon several increasingly restrictive but simple assumptions regarding

*To relate this to the Boolean form described above, this simple locution would be expanded into a much more complex expression involving all the other index terms. For example, in a space of two index terms, a and b, the locution “a” would expand to a AND (b OR NOT b) or (a AND b) OR (a AND NOT b). That is, the presented query indicates which index terms are to be in the document, but doesn’t care whether or not the other index terms are present.

†An experimental confirmation of this argument has been pointed out to me by my colleague, Scott Deerwester, who has been studying the consequences of factor analyzing index terms, thereby removing term dependencies by reducing the dimension of the index space. He reports that his preliminary findings, to be published, indicate that his test collections effectively are indexed by only 10 independent index terms.

472 A. BOOKSTEIN

the co-occurrence of index terms in documents and requests. In reality, the restriction of term co-occurrence is not as well defined as we assumed. The arguments given above are useful for establishing orders of magnitude and for illustrating the concepts; to represent the full complexity of an IRS, more flexible methods are required. A useful starting point is to produce a numerical measure that indicates the extent to which terms co-occur in requests.

A beginning point is the observation that when we made the statement that terms occur in subsets, we were in fact expressing in deterministic form statements of a probabilistic character. That is, when we made a statement to the effect that terms a, b, c will occur in the same requests and that such requests will not include terms x, y, and z, we were simplifying the observation that the probability is high that the first set of terms occurs together, and the probability is very low that a request including terms a, b, and c will also include, for example, x. Extending this idea, we conclude that the structure of an index vocabulary as reflected by its actual use is defined at least in part by a probability distribution over the possible nonequivalent requests that can be created using these terms, and that a measure of the co-occurrence structure of the index vocabulary, or the tightness with which terms are bound to one another, is a measure of the degree to which this probability distribution is not uniform. A number of possibilities come to mind. We suggest here, as a simple measure based upon principles with which we have considerable experience, the information theoretic measure,

$$M = -\sum_i p_i \log_2 p_i,$$

where i ranges over all possible equivalence sets of requests. If all requests were equally likely, M would take its largest value, $2^n$. This would be the situation in which the index terms (more precisely, the queries) had no co-occurrence structure. On the other hand, to the extent that index terms do cluster, large numbers of queries will have low or zero probabilities and M will be reduced. Thus M, a measure of the structure on a request space, measures the degree of term co-occurrence. (An expression similar to M could as well be developed to measure the co-occurrence of index terms within documents.)
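As a minimal sketch of how M might be estimated in practice, assume a log of observed requests in which each request has already been reduced to a label for its equivalence set. The function name `request_entropy`, the labels, and the sample logs are hypothetical and are not taken from the paper; relative frequencies stand in for the probabilities p_i, and base-2 logarithms are used.

```python
from collections import Counter
from math import log2


def request_entropy(requests):
    """Estimate M = -sum_i p_i * log2(p_i) from a log of observed requests.

    Each element of `requests` is a hashable label for the equivalence
    set the request belongs to. Relative frequencies stand in for p_i.
    """
    counts = Counter(requests)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical request logs over four equivalence sets.
uniform_log = ["q1", "q2", "q3", "q4"] * 25                        # all equally likely
clustered_log = ["q1"] * 85 + ["q2"] * 5 + ["q3"] * 5 + ["q4"] * 5  # one set dominates

m_uniform = request_entropy(uniform_log)      # log2(4) = 2 bits, the maximum for four sets
m_clustered = request_entropy(clustered_log)  # smaller, since requests cluster
```

In this toy case the uniform log attains the maximum for four equivalence sets, while the clustered log gives a smaller value, which is the reduction in M that the text attributes to term co-occurrence.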

Just as requests and sets of documents are closely associated, there is a measure of document association closely related to that of term association; call it M’. To define M’ we will consider the simplest structure for the cost function described above. We will assume that (1) the cost function takes on only the values 0 or 1 (a set is either useful or not useful), and (2) for any given interaction, only a single set generates the value one. Then, as we go over the possible retrieved sets of documents, we can define, for a given set S, the probability that a request r arrives such that C(r,S) = 1; call that probability p_S. Then a measure of the use structure of a set of documents is given by

$$M' = -\sum_S p_S \log p_S,$$

the sum going over all the subsets, S, of documents. Because of the similarity of the two measures, we surmise that this measure will be related to the indexability of a set of documents by a vocabulary of a given size and a given structure, a topic that is currently being investigated. Also of interest is generalizing this measure to cover the cases of more general cost functions.

We tried earlier to establish a framework for measuring the effectiveness of an IRS. This led us to count the number of document subsets that can be retrieved by an IRS and to compare this quantity to the actual number of subsets in the collection. Should this approach prove profitable, it would be useful to note that quantities such as $2^M$ and $2^{M'}$ provide rough measures of the effective sizes of the request space and document subset space, respectively, taking probabilities into account. A very interesting question is how to extend these ideas to the more realistic situation in which the cost function is more complex.
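As a small worked illustration of the effective-size idea, suppose (hypothetically) that only four document subsets are ever the single useful one, with probabilities 1/2, 1/4, 1/8, and 1/8, and that logarithms are taken to base 2 so that the result can be exponentiated as a power of 2. These numbers are invented for the example and do not come from any collection.

```latex
\begin{align*}
M' &= -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4}
        + 2\cdot\tfrac{1}{8}\log_2\tfrac{1}{8}\right) \\
   &= \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{4}(3) = 1.75, \\
2^{M'} &= 2^{1.75} \approx 3.36 .
\end{align*}
```

Under these assumptions the effective number of subsets is about 3.4 rather than 4; with uniform probabilities the same calculation would give exactly 4.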

5. STRUCTURE

Taking a set-oriented approach makes us more conscious of mechanisms that define sets within an IRS. A typical library or IRS provides, explicitly or implicitly, a wide range of mechanisms whose function is to partition a collection into sets of documents in a manner that is of service to a user population. Since, taken together, these create a structure over the documents, we shall refer to them as the structuring mechanisms of an IRS. In general, by structure we mean a relationship defined on a set of documents, or on the subsets of a set of documents. We are interested in relations that can be used to define subsets of documents that are interesting for retrieval purposes, and possibly to establish useful relationships among such subsets. These relations, even if defined implicitly, are the organizing principles of an IRS, and distinguish a set configured to provide a useful service from a heap of randomly distributed items. We take the following major positions:

1. Set-oriented retrieval encourages us to think in terms of the structures inherent within an IRS.

2. It is useful for planners of and thinkers about IRS’s to become aware of and plan to create and exploit collection structure.

3. It is likely that, once such a point of view is explicitly taken, latent structures currently existing in IRS’s will be recognized and potentially exploited. The number of possibilities will increase enormously as we move to an environment of distributed, tightly networked information resources.

Many structuring mechanisms exist besides the obvious mechanism of indexing and classifying documents. These include co-citation information, implicit in the organization of the documents in the collection or explicit in citation indexes; the shelf organization by which items are stored; and structures implied by bibliographic tools such as the catalog and indexes of a library.

But much structure is also provided by the historical use of the collection: information about how and by whom the collection is used provides a mine of data that can be used to assist retrieval. The kernel of this idea is already used in feedback systems in that the history of an individual search is used to influence subsequent retrievals for that search. It has been suggested that such a procedure can be extended to provide information of use across searches, much in the manner of Salton and Rocchio [14]. In these procedures, the history of acceptances and rejections of proffered documents is summarized by means of index vectors associated with documents and requests, and these are used to establish relationships among documents that facilitate subsequent search.

Another example is that of co-citation structure, whereby documents are related because they cite the same documents or are cited by the same documents. Books in a library and documents in a circulating collection (that is, a collection in which records indicating who took out what are kept) can take advantage of a similar structure. That is, if we take out a book, we might want to know what books were taken out by other patrons who took out the same book(s). Books are thus tied together or associated by their shared circulation experience, and this association has retrieval potential. An IRS that permitted me to obtain a list of all books checked out by, say, at least N other patrons who checked out a book of interest to me (or checked out at least a given fraction of a given set of books) would be providing a service of considerable retrieval value, especially if this could be coordinated with other retrieval mechanisms. This approach is fully compatible with, for example, the probabilistic retrieval techniques, entering as one of a number of features on the basis of which probabilities are estimated.
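The circulation-based service described above can be sketched as follows. This is only an illustration under simple assumptions: loan records are available as (patron, book) pairs, the records are those of other patrons, and the function name `co_circulated` and the sample data are hypothetical.

```python
from collections import defaultdict


def co_circulated(loans, book_of_interest, min_patrons):
    """Return books borrowed by at least `min_patrons` patrons who also
    borrowed `book_of_interest`, mapped to that patron count.

    `loans` is an iterable of (patron_id, book_id) pairs; patron ids may
    be opaque tokens, since only co-borrowing matters.
    """
    # Group the loan records into one set of books per patron.
    books_by_patron = defaultdict(set)
    for patron, book in loans:
        books_by_patron[patron].add(book)

    # Count, for every other book, the patrons who share the book of interest.
    shared = defaultdict(int)
    for books in books_by_patron.values():
        if book_of_interest in books:
            for other in books - {book_of_interest}:
                shared[other] += 1

    return {book: n for book, n in shared.items() if n >= min_patrons}


# Hypothetical circulation records.
loans = [
    ("p1", "A"), ("p1", "B"), ("p1", "C"),
    ("p2", "A"), ("p2", "B"),
    ("p3", "A"), ("p3", "D"),
    ("p4", "B"), ("p4", "D"),
]
related = co_circulated(loans, "A", min_patrons=2)  # {"B": 2}
```

The returned counts could then serve as one feature among several in a probabilistic ranking, as the text suggests, rather than as a retrieval rule on their own.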

Note that what is of interest is not the identification of other individuals but the books associated with a book because other, anonymous patrons used it. Unfortunately, with the intention of preserving privacy, such data are being systematically destroyed in many libraries today, although the information can be used without any infringement of privacy. For example, even if the system files on the basis of which such computations are made were not secure, privacy can be secured by encrypting the patron name, or even by the use of one-way functions.
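One way to read the one-way-function suggestion is sketched below. The choice of HMAC-SHA256 from the Python standard library is an assumption made for this example; the paper names no particular function. The key value and the function name `pseudonymize` are likewise hypothetical. A keyed function is used here because an unkeyed hash of a name could be reversed by simply hashing candidate names.

```python
import hashlib
import hmac


def pseudonymize(patron_name, secret_key):
    """Map a patron name to an opaque token with a keyed one-way function
    (HMAC-SHA256). Equal names give equal tokens, so co-borrowing can still
    be computed, but the name cannot be read back from the token.
    """
    return hmac.new(secret_key, patron_name.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumed to be stored apart from the loan file.
key = b"hypothetical-library-secret"

token_1 = pseudonymize("Jane Doe", key)
token_2 = pseudonymize("Jane Doe", key)   # same patron, same token
token_3 = pseudonymize("John Roe", key)   # different patron, different token
```

Because the same patron always maps to the same token, circulation records keyed by token still support the co-circulation relation discussed above.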

More generally, associating with each document a list of users who requested the document (more effectively, if such information is available, lists of users who liked the document and who did not like the document, as might be made available in an IRS using feedback) permits us to establish a variety of relations among the documents in the collection, and thereby a structure on the collection.

6. SET ORIENTED FORMALIZATIONS OF COMMON CONCEPTS

In the previous discussion, we have relied very little on concepts such as the meaning of a term or what a document is about. In a sense such concepts are defined in terms of how terms are used and how users respond to documents. Pushing this a bit further, we can ask whether some of the fundamental concepts of indexing might have formal, nonsemantic definitions suggested by the set orientation we are taking. Several examples come to mind immediately. For example, for a given collection, C, two terms are C-synonyms if they are applied, within that collection, to exactly the same documents. This definition is motivated by the observation that, for retrieval purposes, what a term is about is based upon the consequences of using it in a search. If two terms are C-synonyms, then they can be used interchangeably in that system. We note that we are not basing synonymy on any intrinsic concept of what the terms mean to users, but rather on how they perform if used. It will be influenced by the size of the collection and how the indexers interpret the text. However, if we can assume “perfect” indexing, as the collection gets larger we can approach a notion of synonymy that should agree with one based on semantics. Specifically, under the ideal situation mentioned, two terms are tautologically synonymous provided they index the same documents regardless of the constitution of the document collection. (Such a concept also permits the notion of degree of synonymy. For example, one could use the limit of |A ∩ B| / |A ∪ B| as the size of the collection gets larger, where A and B are the sets of documents in which the two terms occur.)
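A minimal sketch of these two definitions for a single fixed collection follows. It assumes the indexing is held as a mapping from each term to the set of documents it indexes; the names `c_synonyms` and `synonymy_degree` and the sample index are hypothetical. The ratio is computed for one collection only, so it illustrates the quantity whose limit is mentioned above, not the limit itself.

```python
def c_synonyms(index, term_a, term_b):
    """True if both terms index exactly the same documents in this collection."""
    return index.get(term_a, set()) == index.get(term_b, set())


def synonymy_degree(index, term_a, term_b):
    """|A ∩ B| / |A ∪ B| for the document sets A and B of the two terms."""
    a = index.get(term_a, set())
    b = index.get(term_b, set())
    union = a | b
    # Two terms that index nothing share no evidence; report 0.0 by convention.
    return len(a & b) / len(union) if union else 0.0


# Hypothetical inverted index: term -> set of document ids.
index = {
    "automobile": {1, 2, 3, 4},
    "car": {1, 2, 3, 4},
    "vehicle": {1, 2, 3, 4, 5, 6, 7, 8},
}

same = c_synonyms(index, "automobile", "car")      # True
partial = synonymy_degree(index, "car", "vehicle")  # 4 / 8 = 0.5
```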

Term hierarchy has a similar definition: Within a collection, C, term a is more general than term b if all documents indexed by b are also indexed by a. Our comments about the assumption of perfect indexing and large collection size apply here as above.
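The hierarchy definition reduces to a subset test, sketched here under the same assumption of a term-to-document-set mapping; the function name `is_more_general` and the sample index are hypothetical.

```python
def is_more_general(index, term_a, term_b):
    """True if every document indexed by term_b is also indexed by term_a."""
    return index.get(term_b, set()) <= index.get(term_a, set())


# Hypothetical inverted index: term -> set of document ids.
index = {
    "vehicle": {1, 2, 3, 4, 5},
    "car": {1, 2, 3},
    "bicycle": {4, 6},
}

car_under_vehicle = is_more_general(index, "vehicle", "car")          # True
bicycle_under_vehicle = is_more_general(index, "vehicle", "bicycle")  # False: document 6
```

Read literally, the definition is not strict, so two C-synonyms are each more general than the other.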

7. CONCLUSIONS

We noted above that whereas document retrieval algorithms are single document based, ultimately it is a set of documents that is generated and by which the user evaluates the system. Although it will be very difficult to create a system that optimally retrieves sets of documents, we are arguing that the set perspective can be an illuminating complement to the document-oriented approach. We have, above, explored several conceptual avenues suggested once the set-oriented approach is adopted, and have indicated how, because of its strong emphasis on structure and the relationships that exist among documents, it might influence retrieval procedure. We believe that information scientists should be alert to the existence of such latent structures. Particularly as the concept of a library is extended to include dispersed, perhaps even individually held, collections of information tightly knit by means of electronic and telecommunication technology, opportunities for taking advantage of such structures are multiplied.

REFERENCES

1. Lancaster, F.W. Information retrieval systems. 2nd ed. New York: Wiley-Interscience; 1979.

2. Kaufmann, A. Introduction to the theory of fuzzy subsets. vol. 1. New York: Academic Press; 1975.

3. Bookstein, A. Probability and fuzzy-set applications to information retrieval. Annual Review of Information Science and Technology, 20: 118-150; 1985.

4. Robertson, S.; Sparck-Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146; 1976.

5. Bookstein, A. Information retrieval: A sequential learning process. Journal of the American Society for Information Science, 34(5): 331-342; 1983.

6. Goffman, W. An indirect method of information retrieval. Information Storage & Retrieval, 4(4): 361-373; 1968.

7. Swanson, D. Undiscovered public knowledge. Library Quarterly, 56(2): 103-118; 1986.

8. Swanson, D. Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38(4): 228-233.

9. Salton, G.; McGill, M.J. Introduction to modern information retrieval. New York: McGraw-Hill; 1983.

10. Bookstein, A.; Cooper, W. A general mathematical model for information retrieval systems. Library Quarterly, 46(2): 153-167; 1976.

11. Bookstein, A. Relevance. Journal of the American Society for Information Science, 30(5): 31-40; 1979.

12. Fishburn, P. Utility theory for decision making. New York: Wiley; 1970.

13. Birkhoff, G.; MacLane, S. A survey of modern algebra. New York: MacMillan; 1953.

14. Rocchio, J. In Salton, G., editor. The SMART retrieval system. Englewood Cliffs, N.J.: Prentice-Hall; 1971: 313-323.
