Taxonomies, Lexicons and Organizing Knowledge
Wendi Pohs, IBM Software Group
IBM Software Group
Agenda
•Benefits, business and technical
•A few definitions
•Planning
•Issues
•Measuring value
•Futures
•Q&A
The Mantra
Knowledge is in the eye of the beholder, but reflecting end user needs is as critical as representing texts ... and it takes work!
Business Benefits
•Mergers and acquisitions
•Research and development
•Industries: Consulting, Pharmaceuticals, Financial services, Legal
If only I could find information to help me do my job better ...
Technical Benefits
•Site creation
•Navigation/search
•Personalization
•Defining areas of expertise
Definitions: Taxonomy
•“The science, laws or principles of classification” (From the Greek: rules of arrangement)
•Biology (Linnaeus)
•Education (Bloom)
•A hierarchical collection of categories and documents
•Structure and content
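As a rough illustration of "a hierarchical collection of categories and documents", the sketch below models a taxonomy as a tree of category nodes, each holding child categories (structure) and assigned documents (content). The class name, category labels and document name are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Category:
    """One taxonomy node: a label, child categories, and assigned documents."""
    label: str
    children: List["Category"] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)

    def add_child(self, label: str) -> "Category":
        """Create a narrower category under this one and return it."""
        child = Category(label)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Category"]]:
        """Yield (depth, category) pairs, parents before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Hypothetical categories and document, for illustration only.
root = Category("Industries")
pharma = root.add_child("Pharmaceuticals")
legal = root.add_child("Legal")
pharma.documents.append("clinical-trial-summary.txt")

# Indented outline of the hierarchy, one line per category.
outline = [("  " * depth) + cat.label for depth, cat in root.walk()]
```

Keeping documents on the nodes, rather than in a separate index, is only one possible design; it makes the "structure and content" pairing easy to see.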
Definitions: Directory
•More general than taxonomy
•Natural structure
•Wide vs deep
•Category structure less controlled
•File system
•Yahoo (http://www.yahoo.com)
•Yellow Pages
•Corporate Web sites (http://www.ibm.com)
Definitions: Thesaurus
•Controlled vocabulary
•Subject headings, labels
•Synonyms (U, UF)
•Relation types (TT, BT, NT, SN, HN, RT, SA)
•Examples: http://www.loc.gov/flicc/wg/taxonomy.html
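To make the relation codes concrete, here is a minimal sketch of a thesaurus store covering four of them: BT (broader term), NT (narrower term), RT (related term) and UF (used for, with the reverse "use" lookup). The class, its method names and the sample terms are assumptions for illustration; TT, SN, HN and SA are left out to keep it short.

```python
from collections import defaultdict


class Thesaurus:
    """Toy controlled vocabulary with reciprocal BT/NT, RT and UF relations."""

    def __init__(self):
        # term -> relation code -> set of target terms
        self.relations = defaultdict(lambda: defaultdict(set))
        # non-preferred term -> preferred term (the "U" / USE direction)
        self.use = {}

    def add_broader(self, term, broader):
        """Record BT on the term and the reciprocal NT on the broader term."""
        self.relations[term]["BT"].add(broader)
        self.relations[broader]["NT"].add(term)

    def add_related(self, a, b):
        """RT is symmetric, so record it in both directions."""
        self.relations[a]["RT"].add(b)
        self.relations[b]["RT"].add(a)

    def add_synonym(self, preferred, variant):
        """Record UF on the preferred term and USE on the variant."""
        self.relations[preferred]["UF"].add(variant)
        self.use[variant] = preferred

    def preferred(self, term):
        """Resolve a synonym to its preferred label; other terms pass through."""
        return self.use.get(term, term)


# Hypothetical terms, for illustration only.
thesaurus = Thesaurus()
thesaurus.add_broader("Clinical trials", "Pharmaceuticals")
thesaurus.add_synonym("Pharmaceuticals", "Drugs")
thesaurus.add_related("Clinical trials", "Regulatory approval")
```

Resolving a user's term through `preferred()` before searching is one way the synonym entries pay off at query time.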
Definitions: Meta-data and tagging
•Meta-data
•Properties, attributes: information describing types of data [Crandall]
•The ‘energy’ required to keep things organized [Earley]
•Tagging
•<META>, <Source>
•Document Properties
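As one possible illustration of tagging, the sketch below collects name/content pairs from HTML <META> tags using Python's standard `html.parser` module. The sample page, and the choice to treat "Source" as a META name, are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <META> tags into a dictionary."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.properties[name.lower()] = content


# Hypothetical page, for illustration only.
sample_page = """<html><head>
<META name="Source" content="Support database">
<META name="Keywords" content="taxonomy, thesaurus">
</head><body>...</body></html>"""

collector = MetaTagCollector()
collector.feed(sample_page)
```

How clean the resulting properties are depends entirely on how consistently authors filled in the tags.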
Definitions: Classification
•Analyzing documents and assigning them to predefined categories
•Rule-based vs natural
•Classification schemes
•Dewey
•Library of Congress
•Industry-specific
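A minimal sketch of the rule-based side of classification follows: each predefined category carries a set of keywords, and a document is assigned to every category whose keywords it mentions. The rules, category names and the `min_hits` parameter are hypothetical; real rule sets are usually far richer.

```python
import re

# Hypothetical keyword rules, for illustration only.
RULES = {
    "Pharmaceuticals": {"drug", "clinical", "trial"},
    "Legal": {"contract", "patent", "litigation"},
    "Financial services": {"loan", "equity", "portfolio"},
}


def classify(text, rules=RULES, min_hits=1):
    """Return the categories whose keywords appear at least min_hits times."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & keywords) for cat, keywords in rules.items()}
    return sorted(cat for cat, hits in scores.items() if hits >= min_hits)


labels = classify("The patent covers a new drug.")
unmatched = classify("Quarterly picnic schedule")
```

A document that matches no rule comes back with an empty list, which is a useful signal that the rules or the categories need editing.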
Definitions: Clustering
•Clustering
•Automatically generating groups of similar documents based on distance or proximity measures
•"Bags of words"
•Vector analysis determines boundaries
•Adaptive, but not abstract
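The sketch below illustrates the idea under simple assumptions: each document becomes a bag of words (a word-count vector), cosine similarity serves as the proximity measure, and a single pass groups each document with the first cluster whose seed document is similar enough. The threshold, the single-pass strategy and the sample texts are illustrative choices, not a description of any particular product.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Turn text into a word-count vector, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.5):
    """Single-pass clustering: join the first cluster whose seed is close enough."""
    clusters = []  # list of (seed bag, member indices)
    for index, text in enumerate(texts):
        bag = bag_of_words(text)
        for seed, members in clusters:
            if cosine_similarity(seed, bag) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((bag, [index]))
    return [members for _, members in clusters]


# Hypothetical documents, for illustration only.
docs = [
    "server disk failure error",
    "disk failure on server",
    "quarterly sales forecast",
    "sales forecast revision",
]
groups = cluster(docs)
```

Note that the groups come back as lists of document indices with no labels: the method adapts to whatever content it is given but does not name or abstract the groups it finds.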
Develop a Plan
•Determine user information needs
•Information audit, Content audit
•Select appropriate sources
•Create initial taxonomy
•Edit categories
•Categorize new documents
•Test the UI
•Train the taxonomy
Plan: Information audit
•What is the objective of the system?
•Who owns the project?
•What do users need?
•What do content creators need?
•What do system managers need?
Plan: Content audit
•Is there an existing taxonomy?
•How clean is the meta-data?
•Is the content suited to automatic classification techniques?
•Good example: Notes discussion databases
•Not-so-good example: Web site with little text, lots of links
•Is a subset of a source better than the whole?
Plan: Select sources
•Which sources?
•Who owns them?
•Which sources do users access most often?
•How do users access these sources?
•What is the lifecycle of the content?
•Who identifies the most current content?
Plan: Maintenance
•Resources
•Centralized or department-level
•Who decides when new content is added?
•Term approval process
•How do new concepts get into the taxonomy?
Identify issues
•Getting user involvement and buy-in
•Maintenance resources
•Directory versus taxonomy
•Meta-data
•Globalization and regionalization
•Hidden vs published taxonomies
Understand the BIG issues
•Organizational “perfection complex” [Chait]
•Multiple taxonomies
•Automated versus manual categorization
Multiple taxonomies
•Many editors
•Term approval process, synonyms
•Standard tools across the enterprise
•Federated taxonomies
•Taxonomy links, “cross-connections,” facets, views
•Taxonomy mapping
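One simple way to picture taxonomy mapping is a lookup table from category paths in one taxonomy to paths in another, applied by longest matching prefix so that subcategories follow their parent. The two taxonomies, the path syntax and the function name below are hypothetical.

```python
# Hypothetical mapping from a department taxonomy to an enterprise taxonomy.
CATEGORY_MAP = {
    "R&D/Drug trials": "Pharmaceuticals/Clinical research",
    "R&D/Patents": "Legal/Intellectual property",
}


def map_category(path, mapping=CATEGORY_MAP):
    """Map a category path by its longest mapped prefix; None if unmapped."""
    parts = path.split("/")
    for end in range(len(parts), 0, -1):
        prefix = "/".join(parts[:end])
        if prefix in mapping:
            # Keep any deeper subcategories under the mapped parent.
            return "/".join([mapping[prefix]] + parts[end:])
    return None


mapped = map_category("R&D/Patents/2002 filings")
unmapped = map_category("HR/Benefits")
```

Unmapped paths return `None` here so that an editor can review them, rather than being forced into a category silently.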
Measuring value
•NCR Corporation - Support Organization
•Needed to convince organization of the value of captured content
•Managers resisted diverting resources to maintaining content
•Current measure: Time per incident
•How could the value of a knowledge classification system be demonstrated?
Measuring value
•NCR developed a new parameter:
•Knowledge helpful (the answer was in the support database and was used to solve the problem)
•Knowledge not effective (the answer sent them in the wrong direction, did not help to address the issue)
•Knowledge not available (nothing available to assist in solving the problem)
•Knowledge not required (problem solved without the use of the knowledge base)
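A parameter like this lends itself to a simple tally. The sketch below counts how often each of the four outcomes was recorded and reports each as a share of all incidents; the outcome codes, function name and sample data are hypothetical and are not NCR's actual figures or tooling.

```python
from collections import Counter

# The four outcomes from the slide, as short hypothetical codes.
OUTCOMES = ("helpful", "not effective", "not available", "not required")


def summarize(incident_outcomes):
    """Return the share of incidents recorded under each outcome."""
    counts = Counter(incident_outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unrecognized outcome codes: {sorted(unknown)}")
    total = sum(counts.values())
    return {o: (counts[o] / total if total else 0.0) for o in OUTCOMES}


# Made-up sample data, for illustration only.
sample = ["helpful", "helpful", "not available", "not required"]
shares = summarize(sample)
```

Tracked over time, the "not available" and "not effective" shares point at content gaps, while the "helpful" share gives managers a figure to set beside time per incident.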
Futures
•Methods: Feature extraction, statistical analysis, rules-based, label generation
•Starter taxonomies, imports
•Taxonomy mapping
•Interfaces: Visualization, better training tools
Q&A
•?