The Use of Classification in Information Retrieval
Barbara H. Kwaśnik
School of Information Studies
Syracuse University
ASIST Annual Conference
Charlotte, NC
November 2, 2005
The Process of Classification
Classification is the partitioning of experience into meaningful clusters.
Two necessary processes that work in parallel:
  Clustering: Finding similar attributes along some meaningful dimensions in order to group things together into classes; and
  Discrimination: Determining rules for distinctions among things, so that we can create boundaries for classes.
What’s the Point of Classifying?
Retrieval and Re-Finding. If we’re trying to retrieve something we know is there, classification provides a shortcut. By clustering like things together, it helps us find again things that were stored in the past.
Browsing and Exploration. By the same token, a classified collection can be searched or browsed for something even if we only have reason to suspect it’s there but don’t know for sure, as in stores or libraries.
What’s the Point of Classifying?
Communication. Classification creates labels and definitions for class inclusion. This enables communication about disparate phenomena by establishing a common ground.
Knowledge representation. Classifications are knowledge structures and thus visualize and reflect what we know about things. Such representations can help us to understand things better, identify gaps, recognize patterns, predict future trends, etc.
The Information-Retrieval Problem
Borrowing the notion from Bob Oddy, an information seeking and retrieval event can be construed as a dialogue in which the user “reveals” him or herself to the system, and the system, in turn, “reveals” itself to the user.
The Information-Retrieval Problem Space
[Diagram. Labels on the user side: The user and the user’s context; Articulates and represents a request using some strategy; Query. Labels on the system side: Collection / Documents / Information; Representation of the collection / documents / information. Where the two meet: Matching / Intermediation; Output.]
Challenges on the Query-Formulation Side
It may be difficult to articulate a request
  Request may be vague, unformed
  Request may be difficult to translate/express
User may not know what is available, or what there is to choose from
  Users like to ask for what they think they might reasonably find; often don’t like to shoot blind
Strategy for level of precision may not be obvious
  User may not be aware of the precision the system can accommodate
  Query may thus be too broad or too specific
Challenges on the System Side
Expressiveness available to the system may be inadequate and thus the resulting representation may be incomplete
  In a plant catalog, no vocabulary for expressing the hardiness of the plant
The available descriptive dimensions may not be the most useful ones
  Original MARC record for archival collections
  First author only in Web of Science
Challenges on the System Side
Inadequate or inconsistent levels of precision or granularity
  In Dewey Decimal all “other” religions share one-tenth of the space.
  LCSH policy for indexing only the main topic of a book.
Inter-indexer and intra-indexer inconsistency or system “error” might affect reliability of the representations.
  Indexers’ tendency to choose a limited range of terms, even when others are more appropriate.
Challenges in the Matching
User’s and system’s knowledge structures may not be the same
User’s and system’s vocabulary may differ
The mechanism by which the user is “revealed” to the system and vice versa may be inadequate for achieving a good match
  E.g., in many OPACs, the search option “subject” is ambiguous and does not indicate clearly that it means LCSH subject headings, and not “keyword.”
The navigational affordances and controls of the system may cause confusion, inconsistencies, and miscommunication between user and system.
Challenges in the Output
Undifferentiated output is difficult to navigate and interpret
The Role of Classification in IR
Classification can play a role in remedying many of the challenges in an IR event by the use of:
  Classified collection representations
  Classified output: clusters, classes
  Thesauri and taxonomies for aid in question formulation and document representation
  Classification for navigation and browsing
Possible Scenarios
[2 × 2 matrix. One axis: Collection and/or Output Classified; Collection and/or Output Unclassified. Other axis: Query Classified; Query Unclassified.]
Examples
A classified cookbook index yields a term that points to a classified section of the cookbook.
  “Cookies – nut cookies – macaroons” points to the cookie section of the cookbook.
A keyword entered in a search box yields a classified menu of options leading to an unclassified list of products.
  Orvis clothing online
How Classification Helps
What “Threats to Successful IR” can be helped by classification?
At the Query-Formulation End
Ability to choose from a classified display or to use a classification to support query-formulation strategies
  Reduces the need to come up with a term on your own (recognition is easier than production)
  Supports browsing, and precise “known-item” searching.
  Allows you to learn something about the domain (i.e., learn the knowledge structure of the system)
More at the Query-Formulation End
Orients you in the information space. Gives you a sense of the “whole” – what’s available, what’s included
Terms are disambiguated because they are in context
Use of a taxonomy allows you to expand or narrow the query or suggests other related and possibly useful terms
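The expand, narrow, and related-term moves described above can be sketched in a few lines. This is a minimal illustration only, assuming a toy taxonomy stored as a child-to-parent map. The "Cookies – nut cookies – macaroons" chain comes from the cookbook example; "baked goods", "cakes", and "drop cookies", along with all function names, are hypothetical additions made for the sketch.

```python
# Hypothetical toy taxonomy: each term maps to its broader term (None at the top).
# Only cookies / nut cookies / macaroons come from the cookbook example;
# the other terms are invented for illustration.
BROADER = {
    "baked goods": None,
    "cookies": "baked goods",
    "cakes": "baked goods",
    "nut cookies": "cookies",
    "drop cookies": "cookies",
    "macaroons": "nut cookies",
}


def broaden(term):
    """Return the broader (parent) term, or None at the top of the taxonomy."""
    return BROADER[term]


def narrow(term):
    """Return the narrower (child) terms, sorted for stable output."""
    return sorted(t for t, parent in BROADER.items() if parent == term)


def related(term):
    """Return sibling terms: other terms that share the same broader term."""
    parent = BROADER[term]
    if parent is None:
        return []
    return sorted(t for t in narrow(parent) if t != term)


def path(term):
    """Return the chain from the top of the taxonomy down to the term,
    which is what places a term in context."""
    chain = []
    while term is not None:
        chain.append(term)
        term = BROADER[term]
    return list(reversed(chain))


if __name__ == "__main__":
    print(" – ".join(path("macaroons")))
    print("broader than 'macaroons':", broaden("macaroons"))
    print("narrower than 'cookies':", narrow("cookies"))
    print("related to 'nut cookies':", related("nut cookies"))
```

A real thesaurus would normally also carry explicit related-term and synonym links; siblings stand in for "related" here only to keep the sketch short.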
At the System End
Taxonomies, thesauri, and ontologies help ensure consistent representation at all stages
Help identify gaps and inconsistencies in the coverage
Provide guidance for depth and exhaustivity of representation
Help provide guidance in navigation
  E.g., navigational index and content indexes on a website
At the Matching and Output Stages
The matching in a classified collection allows for adjustments to granularity to help minimize the probability of “no hits.”
Classified output is browsable.
Classified output eases the navigation and cognitive processing of large result sets.
  Can help users orient themselves in the information space.
  May provide suggestions for the next round of querying.
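One way to picture the granularity adjustment and the classified output described above is the sketch below. It assumes a toy collection in which each document is assigned to exactly one class and a query is itself a class label. The document titles, the data layout, and the function names are all hypothetical, and matching is exact on the class label, which is far cruder than a real retrieval system.

```python
from collections import defaultdict

# Hypothetical toy data: a class hierarchy (child -> broader class)
# and a collection in which each document is assigned to one class.
BROADER = {"cookies": None, "nut cookies": "cookies", "macaroons": "nut cookies"}
DOCS = {
    "Almond crescents": "nut cookies",
    "Pecan sandies": "nut cookies",
    "Sugar cookies": "cookies",
}


def search_with_backoff(term):
    """Match documents assigned to the class `term`; if there are no hits,
    retry with the next broader class until something is found.

    Returns (class_that_matched, sorted_titles), or (None, []) if even the
    broadest class has no documents.
    """
    while term is not None:
        hits = sorted(title for title, cls in DOCS.items() if cls == term)
        if hits:
            return term, hits
        term = BROADER[term]
    return None, []


def group_by_class(titles):
    """Turn a flat result list into classified output: class -> sorted titles."""
    groups = defaultdict(list)
    for title in titles:
        groups[DOCS[title]].append(title)
    return {cls: sorted(members) for cls, members in groups.items()}


if __name__ == "__main__":
    # Nothing is filed under "macaroons", so the search backs off one level.
    print(search_with_backoff("macaroons"))
    print(group_by_class(DOCS))
```

The class that finally matched is returned alongside the hits so that an interface could tell the user the query was broadened, and the grouped output gives the user labeled clusters to browse and reuse in the next query.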
The Power of Classification for IR
When a good classification is deployed in a retrieval system it has the ability to add a big jolt of information power.
  A robust and valid classification expresses a great deal about the nature of the entities being classified, and also the relationship of these entities to each other.
  On the other hand, we need much less knowledge of a domain in order to “read” a classification that is presented to us.
But, There Are Constraints…
No classification is ever a complete representation.
All classifications are created in a cultural context.
Some classifications are good at description, some good at visualization, some at reflecting a body of knowledge.
With very few exceptions, most can’t do all at once.
Says a Lot, but not Everything…
Early stages of breast cancer
  Stage 0 Cancer cells are present … but they haven’t spread…
  Stage I Cancer has spread … 2 cm or less …
  Stage II Cancer has spread … <5 cm … sometimes the lymph nodes may be involved.
Advanced stages of breast cancer
  Stage III … >5 cm … spread to lymph nodes …
  Stage IV …metastatic… spread to other parts of the body, such as bone, liver, lung, or brain.
Selective Representation
No classification can represent all perspectives at the same time. While a classification may be appealing as a visualization, some things get privileged, and some get masked.
All Depends on What Theory You Adopt
Genealogical Linguistic Classification
  For any two languages: Do they have a common origin?
Typological Linguistic Classification
  For any two languages: What similarities of form do they have?
What a Good Classification Should Be
Elegant and parsimonious – expresses the domain being classified as succinctly and efficiently as possible.
Expressive and complete – sufficient to accommodate all entities within it with just enough specificity to be useful.
Memorable and usable – affords devices to help the user learn it and relearn it.
Flexible and hospitable – allows new entities to be added gracefully and coherently.
There are very few classifications that can be all of these.
Conclusion
Classification has an important role to play in information retrieval
  For enhanced communication between system and user
  For precision, flexibility, completeness, and consistency of document and query representation
  For expressiveness at the knowledge-structure level but also the navigational level
  For browsing to reduce cognitive load, increase fun.
But, care must be taken to ensure that the classification is appropriate to the various functions of information seeking and retrieval.