+ All Categories
Home > Documents > A hypertext-based thesaurus as a subject browsing aid for bibliographic databases

A hypertext-based thesaurus as a subject browsing aid for bibliographic databases

Date post: 02-Sep-2016
Category:
Upload: richard-pollard
View: 212 times
Download: 0 times
Share this document with a friend
13
0306.4573/93 $6 00 + .OO CopyrIght 0 1993 Pergamon Press Ltd A HYPERTEXT-BASED THESAURUS AS A SUBJECT BROWSING AID FOR BIBLIOGRAPHIC DATABASES RICHARD POLLARD Graduate School of Library and Information Science, University of Tennessee, Knoxville, TN 37996-4330, U.S.A. Abstract-Browsing is a very common information-seeking strategy, particularly for retrieval of bibliographic information. Conventional information retrieval systems, however, provide little support for this activity. In hypertext systems browsing is the ma- jor, and often the only, way of retrieving information. Unfortunately, hypertext users often become disoriented even when browsing a small information space. If hypertext is to be used to support browsing in bibliographic databases, some form of naviga- tional assistance will be required. This article examines the role of thesauri as navigational aids for the subject domain of a bibliographic database. The design of an experimental hypertext-based browsing interface for a thesaurus is presented and its implementation using a commercially available hypertext program described. As the ultimate goal of the thesaurus browsing interface is to facilitate subject access to bibliographic databases, strategies for linking the thesaurus to a database are examined. 1. INTRODUCTION The importance of browsing as an information-seeking strategy is gradually becoming recognized (Hildreth, 1982; Bates, 1989). Browsing has been defined as semi-directed or semi-structured searching in an area of potential interest (Ellis, 1989). This activity typi- cally involves looking through the contents pages of journals, scanning journals held by a library, or browsing along library shelves. Browsing in conventional bibliographic infor- mation retrieval systems is usually limited to examining a set of citations retrieved by a previous search or viewing a portion of the indexing vocabulary online (Hildreth, 1989). However, several experimental information retrieval systems support other types of brows- ing (Noerr & Noerr, 1985; Cove & Walsh, 1988). Hypertext is becoming a popular means of storing and presenting large bodies of online information (Nielsen, 1990). The increasing use of hypertext to store information has been matched by increased interest in retrieval techniques for hypertext systems. From the standpoint of information retrieval, the distinguishing characteristic of hypertext is that it allows users to “browse” or “navigate” through the information space in a non- sequential manner. In this context “browsing” refers to the process of moving around a network by following links between information-containing nodes. Unfortunately, users often become disoriented while navigating a body of informa- tion represented as a hypertext (Edwards & Hardman, 1989; Nielsen & Lyngbzk, 1990). This phenomenon is commonly referred to as being “lost in hyperspace.” The problem of disorientation is, however, more complex than this phrase would suggest. McKnight et al. (1991, p. 65) note that the problem involves users not knowing “how the information is organized, how to find the information they seek, or even if that information is available.” Several approaches to alleviating the problem of disorientation have been proposed. One approach uses inferencing techniques to enable the system to identify nodes related to the user’s query (Frisse & Cousins, 1989; Croft & Turtle, 1989). The ranked list of can- didate nodes produced by this type of system can be used as the starting point for user- directed navigation. An alternative approach, one that we favor, is to enhance the information space with structural cues that enable relevant nodes to be identified solely by user-directed navigation. Simpson (1989) classifies structural cues as either internal or external. Internal cues such as headings, subheadings, and various typographical conventions are contained within the text. External cues, of more interest here, exist outside the body of the text. 345
Transcript

0306.4573/93 $6 00 + .OO

CopyrIght 0 1993 Pergamon Press Ltd

A HYPERTEXT-BASED THESAURUS AS A SUBJECT BROWSING AID FOR BIBLIOGRAPHIC DATABASES

RICHARD POLLARD Graduate School of Library and Information Science,

University of Tennessee, Knoxville, TN 37996-4330, U.S.A.

Abstract-Browsing is a very common information-seeking strategy, particularly for retrieval of bibliographic information. Conventional information retrieval systems, however, provide little support for this activity. In hypertext systems browsing is the ma- jor, and often the only, way of retrieving information. Unfortunately, hypertext users often become disoriented even when browsing a small information space. If hypertext is to be used to support browsing in bibliographic databases, some form of naviga- tional assistance will be required. This article examines the role of thesauri as navigational aids for the subject domain of a bibliographic database. The design of an experimental hypertext-based browsing interface for a thesaurus is presented and its implementation using a commercially available hypertext program described. As the ultimate goal of the thesaurus browsing interface is to facilitate subject access to bibliographic databases, strategies for linking the thesaurus to a database are examined.

1. INTRODUCTION

The importance of browsing as an information-seeking strategy is gradually becoming recognized (Hildreth, 1982; Bates, 1989). Browsing has been defined as semi-directed or semi-structured searching in an area of potential interest (Ellis, 1989). This activity typi- cally involves looking through the contents pages of journals, scanning journals held by a library, or browsing along library shelves. Browsing in conventional bibliographic infor- mation retrieval systems is usually limited to examining a set of citations retrieved by a previous search or viewing a portion of the indexing vocabulary online (Hildreth, 1989). However, several experimental information retrieval systems support other types of brows- ing (Noerr & Noerr, 1985; Cove & Walsh, 1988).

Hypertext is becoming a popular means of storing and presenting large bodies of online information (Nielsen, 1990). The increasing use of hypertext to store information has been matched by increased interest in retrieval techniques for hypertext systems. From the standpoint of information retrieval, the distinguishing characteristic of hypertext is that it allows users to “browse” or “navigate” through the information space in a non- sequential manner. In this context “browsing” refers to the process of moving around a network by following links between information-containing nodes.

Unfortunately, users often become disoriented while navigating a body of informa- tion represented as a hypertext (Edwards & Hardman, 1989; Nielsen & Lyngbzk, 1990). This phenomenon is commonly referred to as being “lost in hyperspace.” The problem of disorientation is, however, more complex than this phrase would suggest. McKnight et al. (1991, p. 65) note that the problem involves users not knowing “how the information is organized, how to find the information they seek, or even if that information is available.”

Several approaches to alleviating the problem of disorientation have been proposed. One approach uses inferencing techniques to enable the system to identify nodes related to the user’s query (Frisse & Cousins, 1989; Croft & Turtle, 1989). The ranked list of can- didate nodes produced by this type of system can be used as the starting point for user- directed navigation. An alternative approach, one that we favor, is to enhance the information space with structural cues that enable relevant nodes to be identified solely by user-directed navigation.

Simpson (1989) classifies structural cues as either internal or external. Internal cues such as headings, subheadings, and various typographical conventions are contained within the text. External cues, of more interest here, exist outside the body of the text.

345

346 R. POLLARD

One example of an externai structural cue is a graphic browser. A typical graphic browser displays a schematic diagram of the hypertext in which nodes are represented as labelled boxes and links between nodes are represented as lines. Selecting a node from an interactive browser display causes the content of that node to be revealed. McKnight et al. (1991, p. 79) remark that a graphic display is analogous to a map of a physical environ- ment, in that it “shows the user what the overall information space is like, how it is linked together and consequently offers a means of moving from one information node to an- other.” Although graphic displays have intuitive appeal, their utility has been questioned

(Akscyn ef al., 1988; Brown, 1989). Wing and Yankelovich (1989) report that the global maps generated by the Intermedia system were “too large and entangled to be of any use” with webs of even a few hundred documents. One presumes that the graphic representa- tion of a bibliographic database containing several million records would also be of very little use.

Another approach to providing external structural cues is to use a two-level architec- ture in which the document collection is enhanced by an auxiliary data collection. In this model the auxiliary data collection, rather than the document collection itself, is browsed to identify terms of interest. Documents associated with a particular element from the aux- iliary data collection can then be examined for relevance to the user’s information need. Agosti et al. (1991) propose an information retrieval system in which the collection of aux- iliary data is formed from the union of a set of extracted terms, a set of index terms, and a set of user terms.

The assumption underlying this approach to hypertext-based information retrieval is that external structural cues help users form a mental model of the information space which, in turn, facilitates navigation. There is some experimental evidence to support this assumption. Simpson and McKnight (1990), for example, used an alphabetical index and a hierarchical contents list as structural cues for a small (2,500 word) hierarchically struc- tured document. They found that maps of the document structure produced by subjects using the hierarchical contents list were more accurate than those produced by subjects using the alphabetical index. They also found that navigation through the document was more efficient with the hierarchical contents list than with the alphabetical index. Naviga- tional efficiency was positively correlated with the accuracy of maps produced by subjects.

This paper reports on a project that aims to improve subject retrieval by using hyper-

text to provide a browsing interface to a thesaurus. The thesaurus acts as a structural cue to the subject domain of a bibliographic database. Section 2 suggests why thesauri can be used as structural cues to a subject domain and reviews the evidence concerning their effectiveness in this role. Section 3 discusses the reasons why hypertext techniques are suited to the representation of thesauri, and reviews previous work on hypertext-based thesauri. Section 4 describes an experimental hypertext-based browsing interface to an online the- saurus. Section 5 describes the implementation of this design using a commercially avail- able hypertext program. Section 6 discusses strategies for linking the hypertext-based thesaurus to a bibliographic database, and section 7 identifies areas for further research.

2. THESAURI AS STRUCTURAL. CUES TO A SUBJECT DOMAIN

We consider a thesaurus to consist of a set of terms, a set of relationships, and a set of displays showing relationships between terms. The set of terms contains a set of pre- ferred terms and a disjoint set of non-preferred terms. Two broad types of relationships are represented in a thesaurus: (a) relationships of categories of terms to one another and to the subject domain as a whole, and (b) inter-term relationships representing semantic links between individual terms (Aitchison & Gilchrist, 1987, p. 34). The three basic types of inter-term relationship contained in a thesaurus are equivalence, hierarchical, and as- sociative. Relationships between terms are shown using one or more of the following types of display: alphabetical, hierarchical, systematic, and graphic (Aitchison & Gilchrist, 1987, p. 70).

Thesauri are used by indexers to represent the subject content of a document col- lection, and by searchers to represent queries against a collection of documents. A query is often formulated with the aid of a thesaurus when the document collection has been

A hypertext-based thesaurus 347

indexed using that thesaurus. However, a thesaurus may also be used to formulate a query against a document collection that has not been indexed with the aid of a thesaurus.

We are interested in the ability of thesauri to represent the vocabulary and structure of a subject domain. In other words, their ability to serve as external structural cues to a subject domain. There are, we suggest, several reasons why thesauri are suited to this role.

First, thesaurus terms have literary warrant. That is, a term is chosen for inclusion on the basis of its frequency of occurrence in the literature and its usefulness for retrieval pur- poses (Lancaster, 1986, p. 24). A term should also have user warrant, that is, it is included only if it is of interest to the users of the retrieval system (Lancaster, 1986, p. 26).

Second, a rich variety of inter-term relationships can be expressed in a thesaurus. The- saurus standards permit the expression of twa types of equivalence reiat~onship (synonyms and quasi-synonyms), three types of hierarchical relationship (generic, whole-part, and instance), and a wide variety of associative relationships (International Organization for Standardization, 1986). In addition, the standards provide guidelines for testing the valid- ity of assigned relationships.

Third, if several different types of display are used in the thesaurus, as is usually the case, multiple views of the subject domain can be provided. For example, the alphabeti- cal display shows ah terms, preferred and non-preferred, in a single alphabetical sequence. A hierarchical display shows preferred terms in the context of a full hierarchy, allowing users to identify broader and narrower concepts in a Iogicaliy progressive manner. A sys- tematic display “provides an overall structure or macro-classification within which the re- lations between hierarchies and groups of terms otherwise related may be clarified and presented” (Aitchison & Gilchrist, 1987, p. 79). Finally, a graphic display “brings related terms into physical proximity and allows an indexer or searcher to view a complete con- spectus of these associations at a glance” (Lancaster, 1986, p_ 89).

Having suggested why thesauri can be used as structural cues to a subject domain we now review the evidence concerning their effectiveness in this role. We first consider the issue of vocabulary.

A common problem with thesauri is that end-users have difficulty mapping the terms they use to describe topics to indexing terms authorized by the thesaurus. The function of the entry vocabulary (the set of non-preferred terms and associated equivalence relation- ships) is to guide the user towards preferred terms. Bates (1986) argues that end-user (as opposed to indexer) thesauri should contain an entry vocabulary several times Larger than the set of authorized terms. Most thesauri in common use do not have such large entry vo- cabularies. For example, the entry vocabulary of the ~~~~uu~s of ERK Lkscr&_mrs (Houston, 1990) camprises approximately 44% of the total number of terms contained in the thesaurus. Initial entry to a thesaurus may be facilitated by providing a permuted key- word index or a “‘table of contents.” Many thesauri incorporate a system of broad cate- gories which funct.ions as a table of contents for the vocabulary. This “table of contents” offers an access point for users unfamiliar with the terms included in the thesaurus.

One might also question the cognitive accuracy of thesauri. In other words% bow well do thesauri match users’ mental models of a subject domain? Rada and Bicknell(l989) used an algorithm to measure the conceptual distance between documents and queries repre- sented by terms from the MeSH (Medical Subject Headings) thesaurus. They found that the ranking of documents produced by the algorithm compared favorably with that of phy- sicians. From this experiment it appears that the hierarchical structures of MeSH exhibit good cognitive realism. in another experiment Rada et al. (1991) enriched the Excerpta Medica thesaurus (EMTREE) with non-hierarchical relationships. The authors found the presence of these relationships to be beneficiat, but only in certain c~rcurnst~c~. Although the cognitive accuracy of thesauri remains an open question, results from ongoing research are encouraging.

3. HYPERTEXT AND THESAURI

Despite the popularity of “free-text” searching, thesauri are still used to index many Large online bib~iograph~c databases (Ghan & Pollard, 1988). Most thesauri are, however, available only in print format. The relatively few thesauri that are online have been criti-

348 R. POLLARD

cized for being incomplete, confusing, and difficult to use (Weinberg & Cunningham, 1988).

There is reason to suppose that hypertext techniques might be suited to the represen- tation of thesauri. Shneiderman (1989) maintains that hypertext is appropriate in situations where “there is a large body of information organized into numerous fragments, the frag- ments relate to each other, and the user needs only a small fraction at any time.” Thesauri fit this pattern exceptionally well. A typical thesaurus contains many thousands of indi- vidual terms that are explicitly related to one another in a well defined manner. At any par- ticular moment, the user of a thesaurus is interested in identifying only a small subset of the available terms.

As mentioned earlier, a singular characteristic of hypertext is that it allows users to “navigate” through an information space in a non-sequential manner. The structure of a thesaurus is inherently non-sequential. Users of alphabetical thesaurus displays are pain- fully aware that linear ordering schemes, such as those based on alphabetic adjacency, are not particularly effective in collocating terms linked by equivalence, hierarchical, or asso- ciative relationships. Bertrand-Gastaldy and Davidson (1986) suggest that the interface to an online thesaurus should allow the user “to ‘move about’ in the thesaurus and ‘bring into focus’ detailed portions of it depending on what is required at a given time.” Hypertext has the potential to provide this type of access.

Relatively little work on hypertext-based thesauri has been reported in the literature. Marchetti and Miihlhauser (1991) describe Hyperline, a user interface to the European Space Agency Information Retrieval Service (ESA-IRS) that attempts to integrate the ca- pabilities of conventional information retrieval and hypertext. Hyperline supports a con- cept navigation function that appears to use hypertext techniques to represent a thesaurus. However, the character-based interface displays terms in a format similar to that of con- ventional online thesauri.

Pollard (1990) examines the ability of hypertext to support three common types of thesaurus display, and presents a design for a hypertext-based hierarchical display that addresses many of the inadequacies of its printed counterpart.

McMath et a/. (1989) discuss an information retrieval system that supports a hyper- text-like implementation of the ACM Computing Reviews Classification System (Sammet & Ralston, 1982). The system has a graphic display that shows the current node in the center of a window surrounded by labelled circles representing child nodes. The child nodes have their children displayed as smaller circles around them. Users traverse this purely hierarchical structure by clicking on the circle that represents the node of interest. The sys- tem responds by redrawing the display with the newly selected term in a central position surrounded by its children and grandchildren.

McAleese and Duncan (1987) describe a system that uses a graphical browser to dis- play global and local maps of a small (29-term) subset of a thesaurus. Terms selected from a local map are displayed, in textual form, in a window. Items listed in this window are link anchors to nodes that contain the corresponding terms. Users can browse the thesau- rus either by using the graphical display or by clicking on link anchors in the term display.

In the next section we describe an experimental hypertext-based browsing interface to the alphabetical display of an online thesaurus.

4. A HYPERTEXT-BASED BROWSING INTERFACE FOR A THESAURUS

4.1 Thesaurus The Thesaurus of ERIC Descriptors is used to index the ERIC database and several

other databases in the field of education. This thesaurus, which contains almost 10,000 terms and more than 66,000 inter-term relationships, is smaller than some thesauri in com- mon use but larger than many of the documents reported to have been converted to hy- pertext. The printed version of the thesaurus contains an alphabetical display, a permuted (KWIC) index, a two-way hierarchical display, and a broad subject group display. The machine-readable version of the thesaurus corresponds to the alphabetical display in the printed version.

4.2 Data model

A hypertext-based thesaurus 349

The data model used to map the thesaurus to a hypertext assigns each thesaurus term to a node and each inter-term relationship to a link between nodes. All three types of relationship contained in the thesaurus are represented by associative links.

Initial access to the hypertext-based thesaurus is via a keyword index. This index con- tains every word used to form thesaurus terms, whether descriptors or nonusable terms. Keywords that appear in just one thesaurus term are linked directly to the corresponding node; those appearing in more than one term are linked to a term index.

The term index associated with a particular keyword contains an alphabetically ordered list of terms in which that keyword appears. Nonusable terms are listed with a reference to equivalent descriptors. All terms posted in this index, whether nonusable terms or descrip- tors, are linked to the thesaurus database using hypertext links. This combination of key- word index and term index performs a function similar to the permuted index in the printed thesaurus.

The user interface to the hypertext-based thesaurus consists of five buttons (Exit, Notes, Search . . ., Back, Help) and two scrollable windows. The thesaurus is, for the most part, operated by clicking the mouse on buttons or on items displayed in the windows. The operation of the thesaurus can best be described with the aid of an example.

Imagine that the user wishes to find a thesaurus descriptor to represent the concept of “Job Training.” The first step is to search the keyword index. The user clicks the “Search . . .” button and enters the word “job” into a dialogue box causing the system to search the keyword index. If a match or partial match is found, the appropriate section of the index with the matching string highlighted is displayed in the left-hand window (see Fig. 1).

JaZZ Jealousy Jewish Jews

WI Jobholding Jobs Jogging Joint Journal Journalism Journals Journey Judaism Judges Judgment Judgmental Judicial Junior Jure Justice Juvenile

j Exit I ! Notes / j Search... i Back /

Fig. 1. Keyword index search.

350 R. POLLARD

Words 1 Exit Notes, Search... Back Help cl&u, Japanese Javanese Jazz USE Off The Job Training

/

,,!i ,I!

Jealousy Jewish Jews Job Jobholding Jobs Jogging Joint Journal Journalism Journals Journey Judaism Judges Judgment Judgmental Judicial Junior Jure Justice Juvenile

USE Vocational Adjustment

Job Behaviors USE Job Skills

USE Career Change

USE Occupational Clusters Job Conditions

USE Work Environment

USE Occupational Information Job Content Analysis

USE Job Analysis

USE Job Development

USE Occupational Information

Fig. 2. Term index for “Job."

Clicking any word in the keyword index causes either the term index or a single term to be displayed in the right-hand window. In Fig. 2 the user has clicked the word “Job” in the keyword index, causing the term index for this word to be displayed. Note that references from non-preferred to preferred terms are included in the term index.

Terms are located in the term index by scanning and scrolling. In Fig. 3 the term index has been scrolled to reveal the descriptor “Job Training.” Clicking a term in this index causes the index to be replaced by the clicked term. For example, clicking “Job Training” in the term index reveals the display for this descriptor as shown in Fig. 4.

The target audience for the experimental hypertext thesaurus consists of student and practicing information professionals. This audience is likely to be familiar with the term displays of printed thesauri. For this reason, the term displays in the hypertext thesaurus were designed to resemble closely those of the alphabetical display in the printed thesaurus.

The Thesaurus of ERIC Descriptors employs the recommended symbols USE/UF, BT/NT, and RT for identifying inter-term relationships. Terms listed under these symbols in the hypertext thesaurus are anchors for links to the corresponding node. For example, in Fig. 4, the narrower term (NT) “On The Job Training” is a link anchor which, if clicked, causes this term to replace “Job Training” in the right-hand window.

Thus, movement between terms in the thesaurus is achieved simply by clicking the mouse on link anchors. Other operations, such as backtracking, summoning context- sensitive help, and exiting are performed by clicking the buttons at the top right of the screen.

5. IMPLEMENTATION OF THE HYPERTEXT-BASED THESAURUS

5.1 Text to hypertext conversion There are few generally applicable tools for transforming text into hypertext. As a

result, implementors of hypertexts are often obliged to design special purpose conversion

A hypertext-based thesaurus

Jkanese JaZt

Jewish - Jews Job Jobholding Jobs Jogging Joint Journal

Journals Journey Judaism Judges

Judgmental Judicial Junior Jure Justice Juvenile

Job Training Job Vacancies

USE Employment Opportunities Job Vacancy Surveys

USE Occupational Surveys Off The Job Training On The Job Training Student Job Placement

USE Job PlacementAND Student Employment

Words tia,, Japanese Javanese Jazz Jealousy Jewish Jews Job Jobholding Jobs Jogging Joint Journal Journalism Journals Journey Judaism Judges Judgment Judgmental Judicial Junior Jure Justice Juvenile

Fig. 3. Scrolled term index.

.&l l!?&iY SN training

(note: prior to mar80, the instruction “occupational use vocational education” was carried in the

thesaurus) UF

NT

BT RT

Attendant Training (1968 1980) X Employment Preparation Occupational Training Custodian Training Off The Job Training On The Job Training Training Adult Vocational Education Employment Flight Training Industrial Training Job Skills Labor Force D~lopment Oc~pations Office Occupations Retraining Scope Of Bargaining Sheltered Workshops SupervisoryTraining Trainees Trainins Allowances

Fig. 4. Term display for “Job Traming.” IPM 29:3-F

352 R. POLLARD

programs or perform the conversions manually. Furuta et a/. (1989) found that regularly and repetitively structured documents are good candidates for automatic conversion. Their most successful conversions were achieved with documents in which the structure was well defined and explicitly represented in the markup specification. We performed an automatic conversion of the thesaurus to hypertext by taking advantage of the embedded tags used to represent thesaural relationships. Special purpose programs were written to perform the

conversion. The standard ERIC distribution tapes contain data coded in both EBCDIC and bi-

nary form. The thesaurus tape was pre-processed to convert it to an ASCII text file. The text corresponding to the term “Job Training” is shown in Fig. 5. In this figure, tags of the form OXXX identify individual elements of the term’s alphabetical display. The mean- ing of these tags is defined in the documentation for the ERIC tapes (ERIC Processing and Reference Facility Computer Systems Department, 1981). The ASCII version of the the- saurus was further processed by programs that (a) extracted individual terms from the file, (b) converted the all upper-case text to mixed case text, (c) assigned a unique identifica- tion number (used to form a filename) to each term, (d) formatted the text for display, (e) added linking information, and (f) created an individual text file for each thesaurus term. An example of the resulting text file for the term “Job Training” is shown in Fig. 6. The linking information, which is used by the text-to-hypertext conversion program, follows the ‘>’ character.

The hypertext system used in this project was GUIDE from OWL International. This is a commercially available, if somewhat different, version of the GUIDE system devel- oped under the direction of Brown (1987) at the University of Kent at Canterbury. We chose GUIDE 3.0 for Microsoft Windows because of its graphical user interface, modest

hardware requirements, and built-in scripting language. We used the scripting language (LOGiiX) to write the text-to-hypertext conversion pro-

grams. LOGiiX programs are “compiled into an intermediate stack-based machine language which is then interpreted and executed by a pseudo-machine” (LOGiiX high level program- ming language, 1990, pp. l-2). In addition to being an interpreted language, LOGiiX lacks some of the functionality of a complete programming language.

The thesaurus conversion program, which generated one hypertext node for each term and constructed the links between nodes, took almost nine hours to run on a 25 MHz 386- based personal computer with eight megabytes of memory and a cached IDE disk. The ASCII text version of the thesaurus consisted of a single 2.5megabyte file. The hypertext version of the thesaurus consisted of 9,712 files which, because of the limitations of MS- DOS (User’s guide and reference, 1991, p. 102), were stored in a series of directories with a maximum of 150 files per directory. The resulting hypertext database required 18 mega- bytes of disk storage.

0000 JOBTRAINING 0001 66182 0002 85343 0100 JOB TRAINING 0102 (NCTE: PRIOR TO MAR80, THE INSTRUCTION "OCCUPATIONAL TRAINING, USE VOCATIONAL EDUCATION" WAS CARRIED IN THE THESAURUS) 0103 ADULT VOCATIONAL EDUCATION; EMPLOYMENT; FLIGHT TRAINING; INDUSTRIAL TRAINING; JOB SKILLS; LABOR FORCE DEVELOPMENT; OCCUPATIONS; OFFICE OCCUPATIONS: RETRAINING; SCOPE OF BARGAINING; SHELTERED WORKSHOPS; SUPERVISORY TRAINING; TRAINEES; TRAINING ALLOWANCES; VOCATIONAL EDUCATION; VOCATIONAL TRAINING CENTERS; WORK EXPERIENCE PROGRAMS 0104 TRAINING 0105 CUSTODIAN TRAINING; OFF THE JOB TRAINING; ON THE JOB TRAINING 0106 ATTENDANT TRAINING (1968 1980) B; EMPLOYMENT PREPARATION; OCCUPATIONAL TRAINING 0108 630

Fig. 5. Tagged text for "Job Training."

A hypertext-based thesaurus

JOB TRAINING AD Jul. 1966 SN (note: prior to mar80, the instruction “occupational

training, use vocational education’ was carried in the thesaurus)

UF Attendant Training (1968 1980) #>\eric\3\598.gui Employment Preparation>\eric\l8\2832.gui Occupational Training>\eric\40\6034.gui

NT Custodian Training>\eric\l3\2094.gui Off The Job Training>\eric\40\6048.gui On The Job Training>\eric\40\6073.gui

BT Training>\eric\60\9147.gui RT Adult Vocational Education>\eric\l\l88.gui Fmployment>\eric\l8\2818.gui Flight Training>\eric\22\3368.gui Industrial Training>\eric\28\4206.gui Job Skills>\eric\30\4616.gui Labor Force Development>\eric\31\4703.gui Occupations>\eric\40\6035.gui Office Occupations>\eric\40\6054.gui Retraining>\eric\49\7453.gui Scope Of Bargaining>\eric\51\7788.gui Sheltered Workshops>\eric\53\7995.gui Supervisory Training>\eric\57\8647.gui Trainees>\eric\60\9145.gui Training Allowances>\eric\60\9148.gui Vocational Education>\eric\63\9469.gui Vocational Training Centers>\eric\63\949O.gui Work Experience Programs>\eric\64\9628.gui $

353

Fig. 6. Text with linking information.

5.2 Indexes A commercially available text indexing program was used to generate an alphabetically

ordered list of all significant words used to form thesaurus terms. This list also included an indication of the number of terms in which each word occurs. For each keyword used in more than one thesaurus term, a custom program generated a list of all the terms in which that keyword occurs. This program also inserted references from nonusable terms in a term list to equivalent descriptors, and linking information in the keyword and term lists. The output from this program consisted of a keyword list with 4,904 entries and a

set of 2,104 term lists. LOGiiX programs were written to convert these text files into indexes for the hyper-

text thesaurus. Because of a limitation in the implementation of the LOGiiX interpreter (E. Ludwig, personal communication, July 24, 1990), this conversion process was only semi-automatic, requiring manual intervention to recover from “insufficient memory” errors. Had the conversion programs been able to run unattended, they would have taken

at least four hours to generate the thesaurus indexes.

5.3 Help system The thesaurus has a context-sensitive help system. A LOGiiX script attached to the

“Help” button examines the title of the active window and displays the appropriate Help text. The Help text is itself structured as a hypertext, enabling users to navigate freely through the Help database. The process of designing and implementing the Help system proved to be the most labor-intensive part of this project.

5.4 Performance Despite the size of the thesaurus database and the modest hardware (386-based per-

sonal computer) on which it is mounted, the time required to traverse a link between any pair of nodes is two seconds or less. Keyword index search performance is, however, less than satisfactory. Hypertext systems generally have weak searching capabilities, and GUIDE is no exception. The LOGiiX Find function is so slow that the keyword index, which contains less than 5,000 entries, had to be manually divided into several sections to

354 R PCJLLARD

obtain tolerable search times. Even with this arrangement, most keyword searches take at least 5 seconds, and the worst case search time is 29 seconds.

5.5 Recommendations The experience of implementing a hypertext-based thesaurus prompts two major rec-

ommendations for designers of hypertext programs. First, the program should include an indexed search capability to provide rapid access to nodes and individual objects stored in those nodes. Some hypertext programs do include this facility; to our knowledge, GUIDE

does not. Second, a utility to convert documents coded in a standard marhup language such as

HyTime (Newcomb et a/., 1991) to the program’s internal format should be made avail- able. This would relieve developers of the burden of writing custom conversion programs for each (document, hypertext program) pair.

6. LINKING THE THESAURUS TO A DATABASE

The ultimate goal of the thesaurus browsing interface is, of course, to facilitate subject retrieval in bibliographic databases. A basic requirement of a working retrieval system is that the hypertext-based thesaurus be linked to a bibliographic database. In this section we briefly discuss the overall functionality of the retrieval system and strategies for linking the hypertext-based thesaurus to a bibliographic database.

6.1 System functionality We envisage an information retrieval system in which the user would start a search by

accessing the hypertext thesaurus. Initial entry to the thesaurus would be gained via a key- word index or by consulting a “table of contents.” Alternatively, one could take advan- tage of the “free text” terms used in the bibliographic database with which the thesaurus is associated. For example, Marchetti and Belkin (1991) propose a semantic association function that would suggest potentially appropriate entry points into a thesaurus when the user’s terms did not match authorized terms. Agosti et a/. (1991) report that this semantic association function has been incorporated into an experimental user interface to European Space Agency Information Retrieval Service (ESA-IRS).

Having gained initial access to the thesaurus, the user could browse the syndetic structure by clicking on terms displayed in a window. Terms could be displayed in alpha- betical, hierarchical, or graphical formal as appropriate. As each term is selected, the system would display bibliographic records for documents that have been indexed using that particular term. The searcher could examine these document surrogates for relevance to the information need. Based on this examination, the focus of the query could be ad- justed by further navigation in the thesaurus. This process of query refinement would con- tinue until a satisfactory set of documents had been identified.

The document surrogates associated with a particular term could also be used as the basis of further navigation. The system should allow the user to “jump” from any index- ing term assigned to a particular document to the set of records associated with that term (Marchetti & Belkin. 1991; Agosti et al., 1991). A more general version of this function, which allows users to navigate from any element in a bibliographic record to related terms or documents, is described by Hildreth (1989, pp. 89-90).

6.2 Use of hypertext links The first strategy for linking the hypertext-based thesaurus to a database assumes that

the database is stored as a hypertext. A hypertext link would be created between every thesaurus descriptor and every database node to which that descriptor has been assigned. Any particular thesaurus descriptor may, of course, have been assigned to thousands of nodes. Creating and maintaining these links is potentially very time-consuming. Moreover, hypertext systems generally do not support such one-to-many links.

A similar problem was encountered during creation of the keyword index for the hypertext-based thesaurus. A keyword may appear in many terms. The word “Education,”

A hypertext-based thesaurus 355

for example, occurs in 300 terms. This problem was overcome by placing intermediate nodes (i.e., term indexes) between the keyword index and the thesaurus database. The intermediate nodes contain many terms, each of which is an anchor for a one-to-one link to a thesaurus node. This method of simulating a i :N link is workable only if Nis not too large.

An alternative would be to create a one-to-one hypertext link between a thesaurus descriptor and the set of nodes to which that descriptor had been assigned. Unfortunately, the basic hypertext model lacks a composition mechanism, that is, a way of representing sets of nodes and links as unique entities separate from their components (Halasz, 1987). A data access model based on transient hypergraphs (Watters & Shepherd, 1990) addresses this shortcoming. In this model the system generates transient hypergraphs in response to user queries through dynamic composition of nodes and dynamic instantiation of the links connecting these nodes.

In the absence of support for one-to-many links and a composition mechanism, other strategies will be explored.

6.3 Use of a retrieval system The second strategy for linking the hypertext-based thesaurus to a database assumes

that the bibliographic database is stored in a conventional retrieval system rather than a hypertext. Linkage would be effected by passing queries and result sets between the hyper- text-based thesaurus and the retrieval system. This approach has been adopted by Agosti et al. (1991) in their work with ESA-IRS. The advantage of this strategy is that it is rela- tively easy to implement in a windowing environment that supports inter-process commu- nication. The disadvantage is that conventional retrieval systems are notoriously difficult to use.

Consider the example of a system in which the database is stored in an exact match retrieval system with Boolean capabilities. The user enters the hypertext-based thesaurus with the term “Job Training.” This term is passed to the retrieval engine which, in the case of the ERIC database, would return a set of more than 3,000 bibliographic citations. The user can either save this set for later combination with another set, or scan it for individual items of interest. Neither option is particularly satisfactory. The first option assumes rhat the user knows how to formulate a query using Boolean operators. The sec- ond option requires the user to scan a large unranked list for items of potential interest.

An alternative would be to store the bibliographic database in a best match retrieval system based on statistical or probabilistic techniques. The combination of a knowledge- based approach to term selection (i.e., use of a thesaurus) and a statistical approach to term-document matching might be superior to either approach taken on its own. Further discussion of such a system is, however, outside the scope of this paper.

7. CONCLUSION

In this paper we proposed that a hypertext-based thesaurus be used as a navigational aid to the subject domain of a bibliographic database. We described the implementation of a hypertext version of the Thesaurus of ERIC Descriptors and discussed strategies for linking the thesaurus to a database. The discussion of thesaurus to database linkage, though included for the sake of completeness, is somewhat premature.

A behavioral study of professional searchers indicates that they rely heavily on thesauri (Fidel, 1991). Nevertheless, we know very little about how searchers actually use these tools. For example, which of the various displays do searchers use? With what frequency do users folIow the various thesaural relationships between terms? Do users have an accu- rate mental model of the differences between hierarchical and associative relationships? These questions are difficult to study using printed thesauri. The hypertext-based thesau- rus allows us to study patterns of use by recording users’ interactions with the system.

In addition, much work remains to be done in designing user interfaces to online thesauri and in testing their usability. What features should be included in the interface to an online thesaurus? How should the various inter-term relationships be represented at the

356 R. POLLARD

user interface? Do users prefer online thesauri to printed versions? We hope that future research will address these and other unanswered questions.

Acknowledgemenl-The Chancellor of the University of Tennessee and the Umversity Computing Center pro- vided computing facilities and services. Songqlan Lu provided programming assistance.

REFERENCES

Agosti, M., Grademgo, G., & Marchetti. P.G. (lY91) Architecture and functions for a conceptual interface to very large online bibliographic collections. In A. Llchnerowicz (Ed.), intelltgent lexf and image handfing, Proceedings RIAO ‘91 (pp. 2-24). Amsterdam: Eisevier.

Aitchison, A., & Gilchrist, A. (1987). Thesaurus constructron: A practlcul manuul. London: Aslib. Akscyn, R.M., McCracken, D.L., & Yoder, E.A. (1988). KMS: A distributed hypermedia system for managing

knowledge in organizations. Cummunicat~ons of the ACM. 3/(7), 820-835. Bates, M.J. (1986). Subject access tn online catalogs: A design model. Journal of the Amerrcan Soaety for

Information Science, 37(6), 357-376. Bates, M.J. (1989). The design of brow\ing and berrypiclmg techmques for the online search interface. Onhne

Review, f3(5), 407-424. Bertrand-Gastaidy, S., & Davidson, C.H. (1986). Improved design of graphic thesauri through technoIogy and

ergonomics. Journal of Documentation, 42(4), 225-251. Brown, P.J. (1987). Turmng Ideas into product\: The GUIDE system. Hypertext ‘87 (pp. 33-40). Chapel Hill,

NC: Umversity of North Carolina. Brown, P.J. (1989). Do we need maps to navigate round hypertext documents? Electronrc Publishmg- Orlgmation,

Disseminatron and Design, Z(2), Y I - 100. Chan, L.M., & Pollard, R. (1988). Thesaurr used rn onlrne databuses: An anal_vtrcal gurde. Westport, CT:

Greenwood. Cove, J.F., & Walsh, B.C. (1988). Onhne text retrieral ?!a browsing. Informatzon Processmg & Management,

24(i), 31-37. Croft, W.B. & Turtle, H. (1989). A rctrleval modei for illcorp~~ratlng hypertext hnks. Proceedtngs H.vpertext ‘89

(pp. 213-224). New York: Aasoaatton for Computing Machinery. Edwards, D.M., & Hardman, L. (1989). ‘Lost m hyperspace’ Cognittve mapping and navigation in a hypertext

environment. In R. McAIeese (Ed.), Hyperteur: Theocv rnto pructrce (pp. 105-125). Norwood, NJ: AbIex. Ellis, D. (1989). A behavioural approach to information retrieval rystem design. Journal ofDocumentation, 45(3),

171-212. ERIC Processing and Reference Facility Computer Sybtemc Department (1981). Data base muster fdes tape

documentatron. Rockville, MD. Fidel, R. (1991). Searchers’ selection of search keys: 11. Controlled vocabuiary or free-text searchmg. Journal

of the Amerrcan Soctety for Informutron Science, 42(7}, Sol-5 14. Frisse, M.E., & Cousins, S.B. (1989). Information retrieval from hypertext: Update on the dynamic medicat hand-

book project. Proceedmgs Hypwte.ut ‘89 (pp. 199-212). New York: Association for Computing Machtnery. Furuta, R., Plaisant, C., 81 Shneiderman, B. (1989). A spectrum of automatic hypertext constructions. Hyper-

media, I(2), 179-195. Halasz, F.G. (1987). Reflections on Notecards: Se\en issues for the next generation of hypermedia systems.

Hypertext ‘87 (pp. 345-365). Chapel H111, NC: University of North Carohna. Hildreth, C.R. (1982). The concept and mechanics of brossmg in an onltne hbrary catalog. Proceedrngs of the

3rd National Onlrne Meeting, pp. 181-196. Hildreth, C.R. (1989). Intelligent interfaces and retneval methods for subject searchmg in biblrogruphic retrieval

sysiems. Washington, DC: Library of Congress Cataloging Distribution Service. Houston, J.E. (Ed.) (1990). Thesaurus of ERIC descrrptors (12th ed.). Phoenix, AZ: Oryx. lnternat~onal Organization for Standardization. (1986). Do~untentatlon-GuldeIines,~or the estabfishment and

development ofmonolmgual thesaurr. Geneva: international Organization for Standardization (IS0 2788). Lancaster. EW. (1986). Vocabulary control&r rnformatron rerrreval. Arlington, VA: Information Resources Press. LOCiiX high level programmrng language for GUIDE hypermedta sofiware: user manual. (1990). BeIIevue, WA:

OWL International, Inc. Marchetti, G.P., & Belkin, N.J. (1991). Interactive onlme search formulation support. Proceedings of the 12th

National Onlrne Meetmg, pp. 237-243. Marchetti, C.P., & Muhlhauser. G (1991, May). ‘Hyperline’. The information browser. ESA Bdfetin, pp. 115-118. McAleese, R., & Duncan, E.B. (1987). The graphical representation of ‘terrain’ and ‘street’ knowledge in an

Interface to a database system. Proreedmgs of the ilth Infernational On/me information Meeting, pp. 443-456.

MeKnight, C., Dillon, A., & Richardson, J. (1991). H_vperre.yt f,l context. Cambridge: Cambridge Umverslty Press. McMath, C.F.. Tamaru, R.S., & Rada, R. (1989). A graphlcal thesaurus-based Information retrieval system.

International Journal of Man-Machme Studies, 31, 12 I- 147 Nielsen, J. (1990). The art of navigating through hypertext. Communrcatrons of the ACM, 33(3), 296-310. Nielsen, J., & Lyngbzk, U. (1990). Two field studies of hypermedia usability. In R. McAleese & C. Green (Eds.),

Hypertext: State of the art (pp. 64-72). Norwood, NJ: Ablex. Newcomb, S.R., Kipp, N.A., & Newcomb, V.T. (1991). The “HyTime” hypermedia/time-based document struc-

turing fanguage. Communrcatrons of the ACM, 34( 11). 67-83. Noerr, P.L., & Noerr, K.T.B. (1985). Browse and navigate: An advance m database access methods. Znformat~on

Processing L Manageme~tf, 2If3). 205-213 Pollard, R. (1990). Hypertext presentation of thesauri used in online searching. Electronic Publ~shjng-OrJgi~atton,

Dissemmation and Design, 3(3), 155- 172.

A hypertext-based thesaurus 357

Rada, R., Barlow, J., Potharst, J., Zanstra, P., & Bijstra, D. (1991). Document ranking using an enriched thesaurus. Journal of Documentation, 47(3), 240-253.

Rada, R., & Bicknell, E. (1989). Ranking documents with a thesaurus. Journal of the Amencan Society for information Science, 40(5), 304-3 10.

Sammet, J.E., & Ralston, A. (1982). The new (1982) computing reviews classification system-final version. Com- munications of the ACM, 25(l), 13-25.

Shneiderman, B. (1989). Reflections on authoring, editing, and managing hypertext. In E. Barrett (Ed.), Thesociety oftext (pp. 115-131). Cambridge, MA: MIT Press.

Simpson, A. (1989). Navigation in hypertext: Design issues. Proceedings of the 13th International Online Information Meeting, 241-255.

Simpson, A., & McKnight, C. (1990). Navigation in hypertext: structural cues and mental maps. In R. McAleese & C. Green (Eds.), Hypertext: State of the art (pp. 73-83). Norwood, NJ: Ablex.

User’s guide and reference for the MS-DOS operatrng system, version 5.0. (1991). Redmond, WA: Microsoft Corporation.

Utting, K., & Yankelovich, N. (1989). Context and orientation in hypermedia networks. ACM Transactions on InformatIon Systems, 7(l), 58-84.

Watters, C., & Shepherd, M.A. (1990). A transient hypergraph-based model for data access. ACM Transactrons on Information Systems, 8(2), 77-102.

Weinberg, B.H., &Cunningham, J.A. (1988). The design of online thesauri. Proceedings of the 9th National Online Meeting, pp. 411-419.


Recommended