
The Use of Classification in Information Retrieval

Barbara H. Kwaśnik
School of Information Studies
Syracuse University

ASIST Annual Conference
Charlotte, NC
November 2, 2005

The Process of Classification

Classification is the partitioning of experience into meaningful clusters.

Two necessary processes that work in parallel:

Clustering: Finding similar attributes along some meaningful dimensions in order to group things together into classes; and

Discrimination: Determining rules for distinctions among things, so that we can create boundaries for classes.
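
The two processes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the items, the attribute names (`kind`, `contains_nuts`), and the function names are hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical items; the attributes are illustrative only.
ITEMS = [
    {"name": "macaroon", "kind": "cookie", "contains_nuts": True},
    {"name": "shortbread", "kind": "cookie", "contains_nuts": False},
    {"name": "baguette", "kind": "bread", "contains_nuts": False},
]


def cluster(items, dimension):
    """Clustering: group items that share a value along one dimension."""
    classes = defaultdict(list)
    for item in items:
        classes[item[dimension]].append(item["name"])
    return dict(classes)


def is_nut_cookie(item):
    """Discrimination: a rule that draws the boundary of one class."""
    return item["kind"] == "cookie" and item["contains_nuts"]


by_kind = cluster(ITEMS, "kind")
nut_cookies = [item["name"] for item in ITEMS if is_nut_cookie(item)]
```

Choosing a different dimension for `cluster`, or a different rule in place of `is_nut_cookie`, yields a different classification of the same items.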

What’s the Point of Classifying?

Retrieval and Re-Finding. If we’re trying to retrieve something we know is there, classification provides a shortcut. By clustering like things together, it helps us find again things that were stored in the past.

Browsing and Exploration. By the same token, a classified collection can be searched or browsed for something even if we only have reason to suspect it’s there but don’t know for sure, as in stores or libraries.

What’s the Point of Classifying?

Communication. Classification creates labels and definitions for class inclusion. This enables communication about disparate phenomena by establishing a common ground.

Knowledge representation. Classifications are knowledge structures and thus visualize and reflect what we know about things. Such representations can help us to understand things better, identify gaps, recognize patterns, predict future trends, etc.

The Information-Retrieval Problem

Borrowing the notion from Bob Oddy, an information seeking and retrieval event can be construed as a dialogue in which the user “reveals” himself or herself to the system, and the system, in turn, “reveals” itself to the user.

The Information-Retrieval Problem Space

[Diagram. Labels: User and the User’s Context; Articulates and represents a request using some strategy; Query; Collection, Documents, Information; Representation of the Collection/Documents/Information; Matching/Intermediation; Query; Output.]

Challenges on the Query-Formulation Side

It may be difficult to articulate a request
  Request may be vague, unformed
  Request may be difficult to translate/express

User may not know what is available, or what there is to choose from
  Users like to ask for what they think they might reasonably find; often don’t like to shoot blind

Strategy for level of precision may not be obvious
  User may not be aware of the precision the system can accommodate
  Query may thus be too broad or too specific

Challenges on the System Side

Expressiveness available to the system may be inadequate, and thus the resulting representation may be incomplete
  In a plant catalog, no vocabulary for expressing the hardiness of the plant

The available descriptive dimensions may not be the most useful ones
  Original MARC record for archival collections
  First author only in Web of Science

Challenges on the System Side

Inadequate or inconsistent levels of precision or granularity
  In Dewey Decimal, all “other” religions share one-tenth of the space.
  LCSH policy of indexing only the main topic of a book.

Inter-indexer and intra-indexer inconsistency or system “error” might affect the reliability of the representations.
  Indexers’ tendency to choose a limited range of terms, even when others are more appropriate.

Challenges in the Matching

User’s and system’s knowledge structures may not be the same

User’s and system’s vocabulary may differ

The mechanism by which the user is “revealed” to the system and vice versa may be inadequate for achieving a good match
  E.g., in many OPACs, the search option “subject” is ambiguous and does not indicate clearly that it means LCSH subject headings, not “keyword.”

The navigational affordance and control of the system may cause confusion, inconsistencies, and miscommunication between user and system.

Challenges in the Output

Undifferentiated output is difficult to navigate and interpret

The Role of Classification in IR

Classification can play a role in remedying many of the challenges in an IR event by the use of:
  Classified collection representations
  Classified output: clusters, classes
  Thesauri and taxonomies for aid in question formulation and document representation
  Classification for navigation and browsing

Possible Scenarios

[Diagram. Labels: Query Classified; Query Unclassified; Collection and/or Output Classified; Collection and/or Output Unclassified.]

Examples

A classified cookbook index yields a term that points to a classified section of the cookbook.
  “Cookies – nut cookies – macaroons” points to the cookie section of the cookbook.

A keyword entered in a search box yields a classified menu of options leading to an unclassified list of products.
  Orvis clothing online

How Classification Helps

What “Threats to Successful IR” can be helped by classification?

At the Query-Formulation End

Ability to choose from a classified display or to use a classification to support query-formulation strategies
  Reduces the need to come up with a term on your own (recognition is easier than production)
  Supports browsing and precise “known-item” searching
  Allows you to learn something about the domain (i.e., learn the knowledge structure of the system)

More at the Query-Formulation End

Orients you in the information space. Gives you a sense of the “whole” – what’s available, what’s included

Terms are disambiguated because they are in context

Use of a taxonomy allows you to expand or narrow the query or suggests other related and possibly useful terms
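
A minimal sketch of how a taxonomy can broaden, narrow, or suggest related query terms. The taxonomy below is hypothetical (loosely modeled on the cookbook example), and the function names `broaden`, `narrow`, and `related` are assumptions made for illustration.

```python
# Hypothetical taxonomy: each term maps to its narrower terms.
NARROWER = {
    "desserts": ["cookies", "cakes"],
    "cookies": ["nut cookies", "bar cookies"],
    "nut cookies": ["macaroons"],
}

# Invert the mapping so each term also knows its broader term.
BROADER = {
    child: parent
    for parent, children in NARROWER.items()
    for child in children
}


def broaden(term):
    """Expand the query: return the next broader term, or None at the top."""
    return BROADER.get(term)


def narrow(term):
    """Narrow the query: return the narrower terms, if any."""
    return list(NARROWER.get(term, []))


def related(term):
    """Suggest sibling terms that share the same broader term."""
    parent = BROADER.get(term)
    if parent is None:
        return []
    return [t for t in NARROWER[parent] if t != term]
```

For example, a user who starts with "nut cookies" could be offered "cookies" as a broader query, "macaroons" as a narrower one, and "bar cookies" as a related one.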

At the System End

Taxonomies, thesauri, and ontologies help ensure consistent representation at all stages

Help identify gaps and inconsistencies in the coverage

Provide guidance for depth and exhaustivity of representation

Help provide guidance in navigation
  E.g., navigational index and content indexes on a website
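
One way to picture the first two points is a small thesaurus that maps variant terms to preferred terms at indexing time and reports terms it cannot place. This is a sketch only: the vocabulary, the `USE` mapping, and the function names are hypothetical.

```python
# Hypothetical thesaurus: variant (entry) terms point to a preferred term.
USE = {
    "automobiles": "cars",
    "autos": "cars",
    "motorcars": "cars",
}

# Hypothetical set of preferred terms the system accepts.
PREFERRED = {"cars", "bicycles"}


def normalize(term):
    """Map a raw term to its preferred term, or None if it is not covered."""
    cleaned = term.strip().lower()
    cleaned = USE.get(cleaned, cleaned)
    return cleaned if cleaned in PREFERRED else None


def index_document(raw_terms):
    """Return (sorted preferred terms, raw terms the vocabulary does not cover)."""
    preferred, gaps = set(), []
    for raw in raw_terms:
        term = normalize(raw)
        if term is None:
            gaps.append(raw)  # a gap in coverage worth reviewing
        else:
            preferred.add(term)
    return sorted(preferred), gaps
```

Two indexers who write "Autos" and "cars" end up with the same representation, and a term such as "scooters" surfaces as a gap instead of silently entering the index.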

At the Matching and Output Stages

The matching in a classified collection allows for adjustments to granularity to help minimize the probability of “no hits.”

Classified output is browsable.
  Classified output eases the navigation and cognitive processing of large result sets.
  Can help users orient themselves in the information space.
  May provide suggestions for the next round of querying.
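
The granularity adjustment and the classified output can be sketched as follows. This is an illustration under assumptions, not a description of any particular system: the collection, the class assignments, and the names `search_with_backoff` and `group_by_class` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy: each term points to its broader term.
BROADER = {
    "macaroons": "nut cookies",
    "nut cookies": "cookies",
    "cookies": "desserts",
    "cakes": "desserts",
}

# Hypothetical collection: each document is assigned to one class.
COLLECTION = {
    "Almond biscotti": "nut cookies",
    "Pecan sandies": "nut cookies",
    "Shortbread": "cookies",
    "Pound cake": "cakes",
}


def search_with_backoff(term):
    """Match at the requested class; if nothing is found, retry one level broader."""
    current = term
    while current is not None:
        hits = [title for title, cls in COLLECTION.items() if cls == current]
        if hits:
            return current, hits
        current = BROADER.get(current)
    return None, []


def group_by_class(titles):
    """Present output as classes rather than as one undifferentiated list."""
    groups = defaultdict(list)
    for title in titles:
        groups[COLLECTION[title]].append(title)
    return dict(groups)
```

A query for "macaroons" finds nothing at that level, so the sketch backs off to "nut cookies" and returns the two documents classed there. The class at which the match succeeded can also be shown to the user as a suggestion for the next query.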

The Power of Classification for IR

When a good classification is deployed in a retrieval system it has the ability to add a big jolt of information power.
  A robust and valid classification expresses a great deal about the nature of the entities being classified, and also the relationship of these entities to each other.
  On the other hand, we need much less knowledge of a domain in order to “read” a classification that is presented to us.

But, There Are Constraints…

No classification is ever a complete representation.

All classifications are created in a cultural context.

Some classifications are good at description, some good at visualization, some at reflecting a body of knowledge.

With very few exceptions, most can’t do all at once.

Says a Lot, but not Everything…

Early stages of breast cancer
  Stage 0: Cancer cells are present … but they haven’t spread…
  Stage I: Cancer has spread … 2 cm or less …
  Stage II: Cancer has spread … <5 cm … sometimes the lymph nodes may be involved.

Advanced stages of breast cancer
  Stage III: … >5 cm … spread to lymph nodes …
  Stage IV: …metastatic… spread to other parts of the body, such as bone, liver, lung, or brain.

Selective Representation

No classification can represent all perspectives at the same time. While appealing as a visualization, some things get privileged, and some get masked.

All Depends on What Theory You Adopt

Genealogical Linguistic Classification
  For any two languages: Do they have a common origin?

Typological Linguistic Classification
  For any two languages: What similarities of form do they have?

What a Good Classification Should Be

Elegant and parsimonious – expresses the domain being classified as succinctly and efficiently as possible.

Expressive and complete – sufficient to accommodate all entities within it with just enough specificity to be useful.

Memorable and usable – affords devices to help the user learn it and relearn it.

Flexible and hospitable – allows new entities to be added gracefully and coherently.

There are very few classifications that can be all of these.

Conclusion

Classification has an important role to play in information retrieval
  For enhanced communication between system and user
  For precision, flexibility, completeness, and consistency of document and query representation
  For expressiveness at the knowledge-structure level but also the navigational level
  For browsing to reduce cognitive load, increase fun.

But care must be taken to ensure that the classification is appropriate to the various functions of information seeking and retrieval.
