Taxonomies and Meta Data for Business Impact
April 13, 2005
Theresa Regli, Molecular, Inc.
Ron Daniel, Jr., Taxonomy Strategies LLC
Copyright © 2005 | 2|
Agenda
1:30 Welcome and Introductions
1:40 Taxonomy Definitions and Examples
2:10 Business Case and Motivations
2:30 Case Study: NASA
2:45 Tagging and Tools
3:00 Break
3:15 Running a Taxonomy Project
3:45 Taxonomy Maintenance and Governance
4:15 Case Study: PC Connection
4:30 Summary and Discussion
5:00 Adjourn
Who we are: Ron Daniel, Jr.
• Over 15 years in the business of metadata & automatic classification
  • Principal, Taxonomy Strategies
  • Standards Architect, Interwoven
  • Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
  • Technical Staff Member, Los Alamos National Laboratory
• Metadata and taxonomies community leadership
  • Chair, PRISM (Publishing Requirements for Industry Standard Metadata) working group
  • Acting chair, XML Linking working group
  • Member, RDF working groups
  • Co-editor, PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports
Recent & current projects
• Government
  • Commodity Futures Trading Commission
  • Defense Intelligence Agency
  • ERIC
  • Federal Aviation Administration
  • Federal Reserve Bank Atlanta
  • Forest Service
  • Goddard Space Flight Center
  • Head Start
  • Infocomm Development Authority of Singapore
  • NASA (nasataxonomy.jpl.nasa.gov)
  • Small Business Administration
  • Social Security Administration
  • U.S.D.A. Economic Research Service
  • U.S.D.A. e-Government Program (www.usda.gov)
  • U.S.G.S.A. Office of Citizen Services (www.firstgov.gov)
• Commercial
  • Allstate Insurance
  • Blue Shield of California
  • Halliburton
  • Hewlett Packard
  • Motorola
  • PeopleSoft
  • PricewaterhouseCoopers
  • Sprint
  • Time Inc.
• Commercial subcontracts
  • Critical Mass - Fortune 50 retailer
  • Deloitte Consulting - Top credit card issuer
  • Gistics - Direct selling giant
• NGOs
  • CEN
  • IDEAlliance
  • OCLC
Who we are: Theresa Regli
• Over a decade of experience in cross-media publishing and content management
  • 7 years of consulting
  • 4 years in “traditional” media: newspapers, publishing
  • Brought many New England newspapers online in the mid-90s
• Principal Consultant, CM and User Experience, Molecular
  • Focus on users / customers and how they interact with and use information, industry education and conferences
  • Background in linguistics
  • Named as “one to watch” in 2005 by CMS Watch
  • Passion for how people, cultures – and businesses – use words and language
About Molecular
• Offerings designed to help organizations leverage technology to increase revenues and decrease costs
• 10+ years of Internet professional services expertise
• 120+ consultant professionals
• Integrated service offerings
  - Digital strategy
  - User experience design/redesign
  - Development & implementation
  - Multi-site integration
  - Multi-channel integration
Setting the stage: some definitions…
What is Knowledge Management?
• The process through which firms generate value from their intellectual assets
• The efficient sharing of knowledge across the enterprise: not focused on presentation
• Often incorrectly used synonymously with CM
What is Document Management?
• The effective storage and retrieval of documents
• Traditionally not about the creation aspect of new content/documents
• Often incorrectly used synonymously with CM – some of the tools have evolved towards CM
Some more definitions…
What is Content Management?
• The integration of various technologies and processes to manage content, from conception through deployment
• The management of the content lifecycle: create, approve, tag, publish
What is Enterprise Content Management?
• Vendor/analyst term to include all content across the firm (web, catalog, digital, etc.)
• Integration of various systems to create one unified, “virtualized” system (CRM, financial, marketing, etc.)
• Typically thought of as a strategy and not an implementation
Putting it all together: golf anyone?
[Diagram: a golf-course example relating the terms defined so far]
Labels shown: Knowledge Sharing; Knowledge Management; Content Management; KM to ECM; Document Management
• Caddy provides advice
• Caddy tells other caddies
• Other caddies provide advice
• Caddy master collects advice and creates tip booklet for all caddies
• Owner implements at 10 courses: ‘Online Caddy’ system and Personal Cart system
• Course Tip Sheet
• Golfdigest.com
• Course Yardage Books
What makes DM, CM and ECM possible?
Taxonomy
• Framework for organizing information based on user needs
• Law for categorizing information
Meta Data
• Information about content: "data about the data"
• The categories, sub-categories and terms that make up a taxonomy are often employed as meta data
• Meta data is leveraged by a CMS to find and display content easily and consistently
• Enables more precise search results and personalization
Foundations for ECM Success: Key Terms
• Facets: Allow for a more complex classification structure, where the categories are applied to the information like keywords. Thus, information about a subject can be “approached” and found in different ways. For example…
  • Hypertension
    • Publications / Medical / Journal of Hypertension
    • Diseases / Cardiovascular / Hypertension
    • Associations / Medical / American Society of Hypertension
  • Red Rock Crab
    • Animals / Invertebrates / Crustaceans
    • World / Seas / Pacific
    • World / Land / Australasia
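One way to picture how facets let the same subject be "approached" from several directions is a small index from every facet path prefix to the subjects filed under it. The sketch below uses the two examples from the slide; the data structure and the `build_index` and `browse` helpers are illustrative assumptions, not part of any particular product.

```python
from collections import defaultdict

# Facet paths copied from the slide's two examples. The dictionary layout is
# an assumption made for illustration only.
FACET_PATHS = {
    "Hypertension": [
        "Publications / Medical / Journal of Hypertension",
        "Diseases / Cardiovascular / Hypertension",
        "Associations / Medical / American Society of Hypertension",
    ],
    "Red Rock Crab": [
        "Animals / Invertebrates / Crustaceans",
        "World / Seas / Pacific",
        "World / Land / Australasia",
    ],
}


def build_index(facet_paths):
    """Map every path prefix (e.g. 'World', 'World / Seas') to its subjects."""
    index = defaultdict(set)
    for subject, paths in facet_paths.items():
        for path in paths:
            parts = [part.strip() for part in path.split("/")]
            for depth in range(1, len(parts) + 1):
                index[" / ".join(parts[:depth])].add(subject)
    return index


INDEX = build_index(FACET_PATHS)


def browse(prefix):
    """Return the subjects reachable by drilling down to the given facet path."""
    return sorted(INDEX.get(prefix, set()))
```

With this index, `browse("World")` and `browse("Animals / Invertebrates")` both lead to the Red Rock Crab, which is the point of a faceted scheme: no single hierarchy has to be the "right" one.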
Foundations for ECM Success: Key Terms
• Synonym Ring: A set of words/phrases that can be used interchangeably for searching. (Hypertension, high blood pressure)
• Thesaurus: A tool that controls synonyms and identifies the relationships among terms
• Controlled Vocabulary: A list of preferred and variant terms, with relationships (hierarchical and associative) defined. A taxonomy is a type of controlled vocabulary.
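A synonym ring can be sketched as a set of interchangeable terms that a search front end expands before querying. In the example below only the hypertension ring comes from the slide; the second ring and the `expand_query` helper are hypothetical.

```python
# Each ring is a set of terms treated as interchangeable for searching.
# Only the first ring appears on the slide; the second is a made-up example.
SYNONYM_RINGS = [
    {"hypertension", "high blood pressure"},
    {"heart attack", "myocardial infarction"},
]


def expand_query(term):
    """Return every term that is interchangeable with `term` for searching."""
    normalized = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return sorted(ring)
    # Terms outside every ring are searched as typed.
    return [normalized]
```

A thesaurus or controlled vocabulary would add a preferred term and typed relationships on top of this; the ring alone only says the terms are equivalent for retrieval.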
Sample Taxonomies
The Library of Congress
A) General Works
B) Philosophy, Psychology, Religion
C) History: Auxiliary Sciences
D) History: General and Old World
E) History: United States
F) History: Western Hemisphere
G) Geography, Anthropology, Recreation
H) Social Science
J) Political Science
K) Law
L) Education
M) Music
N) Fine Arts
P) Literature & Languages
Q) Science
R) Medicine
S) Agriculture
T) Technology
U) Military Science
V) Naval Science
Z) Bibliography & Library Science
While both taxonomies are used in libraries, note how the differences in classification are specifically accommodating:
• Audience
• Subject matter
[Screenshot callouts: Category; Facets; Meta data (rheumatoid is a type of arthritis)]
Enables user-intuitive presentation of information
Taxonomy as Multi-Faceted Browsing Tool
Epicurious, First Facet
Browse > Picnics
Epicurious, Second Facet
Browse > Picnics > Poultry
Business Case and Motivations for Taxonomies
• We divide taxonomy projects into three problems: the ROI Problem, the Tagging Problem, and the Taxonomy Problem
• The ROI Problem: How are we going to use content, metadata, and taxonomies in applications to obtain business benefits?
What technology analysts have said
“Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better.”
“Enriching content with structured metadata is critical for supporting search and personalized content delivery.”
“Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching.”
“Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access.”
Metadata specification – a recipe example
Element          | Data Type | Length   | Req. / Repeat | Source                      | Purpose
Asset Metadata
Unique ID        | Integer   | Fixed    | 1             | System supplied             | Basic accountability
Recipe Title     | String    | Variable | 1             | Licensed Content            | Text search & results display
Recipe summary   | String    | Variable | 1             | Licensed Content            | Content
Main Ingredients | List      | Variable | ?             | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list
Subject Metadata
Meal Types       | List      | Variable | *             | Meal Types vocab            | Browse or group recipes & filter search results
Cuisines         | List      | Variable | *             | Cuisines                    |
Courses          | List      | Variable | *             | Courses vocab               |
Cooking Method   | Flag      | Fixed    | *             | Cooking vocab               |
Link Metadata
Recipe Image     | Pointer   | Variable | ?             | Product Group               | Merchandise products
Use Metadata
Rating           | String    | Variable | 1             | Licensed Content            | Filter, rank, & evaluate recipes
Release Date     | Date      | Fixed    | 1             | Product Group               | Publish & feature new recipes
Legend: ? – 1 or more; * – 0 or more
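A specification like the recipe example can be checked mechanically once the "Req. / Repeat" column is written down as data. The sketch below encodes a few of the fields and validates a record against them, reading the repeat codes as the slide's legend defines them ("1" taken to mean exactly one value). The `Field` class, the `SPEC` subset, and the `validate` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    # Repeat codes as read from the slide: "1" = exactly one value (assumed),
    # "?" = 1 or more, "*" = 0 or more.
    repeat: str


# A subset of the recipe specification, for illustration.
SPEC = [
    Field("Unique ID", "1"),
    Field("Recipe Title", "1"),
    Field("Main Ingredients", "?"),
    Field("Meal Types", "*"),
    Field("Cuisines", "*"),
]


def validate(record, spec=SPEC):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field in spec:
        values = record.get(field.name, [])
        if not isinstance(values, list):
            values = [values]  # treat a scalar as a single value
        count = len(values)
        if field.repeat == "1" and count != 1:
            problems.append(f"{field.name}: expected exactly 1 value, got {count}")
        elif field.repeat == "?" and count < 1:
            problems.append(f"{field.name}: expected at least 1 value, got {count}")
    return problems
```

Running a check like this at tagging time is one way to keep required fields from being silently skipped, which matters because every field in the specification exists to support a named purpose.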
Fundamentals of taxonomy ROI
• Building and maintaining a taxonomy, and tagging content with it, are costs. They are not benefits
• There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues
• Putting a new taxonomy into operation requires UI changes and/or backend system changes, as well as data changes
• Every metadata field costs money, time, and goodwill
• You need to determine those changes, and their costs, as part of the taxonomy ROI
Common taxonomy ROI scenarios
• Catalog site - ROI based on increased sales through improved:
  • Product findability
  • Product cross-sells and up-sells
  • Customer loyalty
• Call center - ROI based on cutting costs through:
  • Fewer customer calls due to improved website self-service
  • Faster, more accurate CSR responses through better information access
• Compliance – ROI based on:
  • Avoiding penalties for breaching regulations
  • Following required procedures (e.g. medical claims)
• Knowledge worker productivity - ROI based on cutting costs through:
  • Less time searching for things
  • Less time recreating existing materials, with knock-on benefits of less confusion and reduced storage and backup costs
• Executive mandate
  • No ROI at the start, just someone with a vision and the budget to make it happen
Taxonomy Justification | Knowledge Worker Productivity
Huge cost to the user & organization
• Finding information (time, frustration, precision)
  • “15%-30% of an employee’s time is spent looking for information, and they find it only 50% of the time”
    • IDC Research, on the business drivers for building a taxonomy
  • Sun’s usability experts calculated that 21,000 employees were wasting an average of six minutes per day due to inconsistent intranet navigation structures. When lost time was multiplied by staff salaries, the estimated productivity loss exceeded $10M per year
    • Web Design and Development, Jakob Nielsen
  • Managers spend 17% of their time (6 weeks a year) searching for information
    • Information Ecology, Thomas Davenport & Laurence Prusak
• Lost Learning Value
  • Related products, services, projects, people
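The arithmetic behind a productivity estimate of this kind is simple enough to write down. In the sketch below the headcount and minutes lost per day are the Sun figures quoted on the slide; the 250 workdays per year and the $20 loaded hourly cost are illustrative assumptions, not the values Sun used.

```python
def annual_search_cost(employees, minutes_lost_per_day, workdays_per_year, hourly_cost):
    """Yearly cost of time lost to poor findability, in the units of hourly_cost."""
    hours_lost_per_day = employees * minutes_lost_per_day / 60
    return hours_lost_per_day * workdays_per_year * hourly_cost


# 21,000 employees and 6 minutes/day are from the slide.
# 250 workdays and $20/hour are assumptions chosen only for illustration.
estimate = annual_search_cost(21_000, 6, 250, 20)
print(f"${estimate:,.0f} per year")  # $10,500,000 per year
```

Under these assumed inputs the result lands in the same range as the figure quoted above; the more useful exercise is to substitute an organization's own headcount and cost numbers.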
Challenges of organizing content on enterprise portals (1)
• Multiple subject domains across the enterprise
  • Vocabularies vary
  • Granularity varies
  • Unstructured information represents about 80%
• Information is stored in complex ways
  • Multiple physical locations
  • Many different formats
• Tagging is time-consuming and requires SME involvement
• Portal doesn’t solve content access problem
  • Knowledge is power syndrome
  • Incentives to share knowledge don’t exist
  • Free flow of information TO the portal might be inhibited
• Content silo mentality changes slowly
  • What content has changed?
  • What exists?
  • What has been discontinued?
  • Lack of awareness of other initiatives
Challenges of organizing content on enterprise portals (2)
• Lack of content standardization and consistency
  • Content messages vary among departments
  • How do users know which message is correct?
• Re-usability low to non-existent
• Costs of content creation, management and delivery may not change when portal is implemented:
  • Similar subjects, BUT
    • Diverse media
    • Diverse tools
    • Different users
• How will personalization be implemented?
• How will existing site taxonomies be leveraged?
• Taxonomy creation may surface “holes” in content
FAQ – How do you sell it?
• Don’t sell the taxonomy, sell the vision of what you want to be able to do
• Clearly understand what the problem is and what the opportunities are
• Do the calculus (costs and benefits)
• Design the taxonomy (in terms of level of effort, LOE) in relation to the value at hand
NASA Taxonomy Project Goal: Enable Knowledge Discovery
• Make it easy for various audiences to find relevant information from NASA programs quickly
• Provide easy access for NASA resources found on the Web
• Share knowledge by enabling users to easily find links to databases and tools
• Provide search results targeted to user interests
• Enable the ability to move content through the enterprise to where it is needed most
• Comply with E-Government Act of 2002
• Be a leading participant in Federal XML projects
NASA Taxonomy Project Goal: Develop Best Practices
• Design process that:
  • Incorporates existing federal and industry terminology standards like NASA AFS, NASA CMS, FEA BRM, NAICS, and IEEE LOM
  • Provides a product for the NASA XML namespace registry
  • Complies with metadata standards like Z39.19, ISO 2709, and Dublin Core
• Practices believed to increase interoperability and extensibility
Development Process: Interviews
• More than 70 interviews conducted across the NASA complex
• Categorized by type; 52% were with Projects, Engineering & Science
[Pie chart: interviews by type]
Funders         4%
Public          18%
Projects        13%
Scientists      4%
Administrators  26%
Researchers     13%
Engineers       22%
Scale of NASA Taxonomy
Facet              | # Terms | Source
Audiences          | 62      | Custom
Business Purpose   | 96      | Existing
Competencies       | 169     | Existing
Content Types      | 96      | Custom
Industries         | 22      | Existing
Instruments        | 56      | Semi
Locations          | 106     | Custom
Missions/Projects  | 648     | Semi
Organizations      | 323     | Existing
Subject Categories | 78      | Existing
Total              | 1656    |
Facets combine, so millions of documents can be finely categorized with a relatively small number of values.
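The claim that facets combine can be made concrete with the term counts from the table. The sketch below sums the terms and then counts distinct combinations under a simplifying assumption, stated in the comments, that a document takes one value from each facet; real tagging allows several values or none per facet, so this is only an order-of-magnitude illustration.

```python
from math import prod

# Term counts per facet, copied from the table above.
FACET_TERMS = {
    "Audiences": 62,
    "Business Purpose": 96,
    "Competencies": 169,
    "Content Types": 96,
    "Industries": 22,
    "Instruments": 56,
    "Locations": 106,
    "Missions/Projects": 648,
    "Organizations": 323,
    "Subject Categories": 78,
}

total_terms = sum(FACET_TERMS.values())  # 1656, matching the table's total

# Assumption for illustration: exactly one value chosen from each facet.
two_facet_combinations = FACET_TERMS["Audiences"] * FACET_TERMS["Content Types"]
all_facet_combinations = prod(FACET_TERMS.values())

print(total_terms, two_facet_combinations, all_facet_combinations > 10**6)
```

Even two small facets together distinguish several thousand cases, which is why about 1,650 maintained terms can describe a very large collection.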
NASA Taxonomy Web Site
http://nasataxonomy.jpl.nasa.gov
[Screenshot callouts]
• Link to XML DTDs and Schema
• Background and training materials
• Links to Controlled Vocabularies
• Link to Metadata Specification
Benefits of Approach
• Facets and use of standards made it possible to respond to three unexpected needs during and after the project:
  • Search demo
  • Semantic search demo
  • Integration with detailed vocabularies
Example | NASA Taxonomy Search Prototype
[Screenshot callouts: Facets, Values, and Counts; Current Search State]
Example 1 | NASA Taxonomy Search Prototype
• ¾ of the way through the project, request was made to see a demo of the taxonomy in action
• Taxonomy was represented in RDF
• Metadata was scraped from a few repositories around NASA (~220k records), converted to RDF
• Some metadata automatically created with simple keyword matches
• RDF loaded into Seamark search tool
• Time: approx 2 man-weeks
• Additional cost: $0
• Result: Useful demo that illustrated new facts
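"Simple keyword matches" of the kind mentioned for the prototype can be as little as a lookup table from taxonomy terms to trigger words. The sketch below is a generic illustration: the term labels, keyword lists, and `auto_tag` helper are hypothetical, since the slides do not give the project's actual rules.

```python
import re

# Hypothetical keyword lists per taxonomy term ("Facet: Value").
KEYWORDS = {
    "Missions/Projects: Cassini": ["cassini"],
    "Instruments: Spectrometer": ["spectrometer", "spectrometers"],
    "Locations: Jet Propulsion Laboratory": ["jpl", "jet propulsion laboratory"],
}


def auto_tag(text):
    """Return taxonomy terms whose keywords occur as whole words or phrases."""
    lowered = text.lower()
    tags = []
    for term, words in KEYWORDS.items():
        # \b anchors prevent matches inside longer words.
        if any(re.search(r"\b" + re.escape(word) + r"\b", lowered) for word in words):
            tags.append(term)
    return sorted(tags)
```

Matching like this is cheap and predictable, which suits a two-week demo; it misses paraphrases and ambiguous words, so it is a starting point for tagging rather than a replacement for review.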
Example 2 | Semantic Search
• After project was over, another project was doing ‘semantic search’
• They heard about NASA Taxonomy
• They downloaded the RDF file for the Missions & Projects vocabulary, mapped to their RDF/OWL tool, and used it to answer questions about different types of missions
• They did not have to ask any questions or request any data changes
Courtesy Dean Allemang, TopQuadrant, and Robert Brummett, NASA HORM
Example 3 | Local Extension
• After project, JPL wanted to incorporate content from additional repositories
• Existing metadata was easily mapped as extensions to NASA taxonomy
• RDF mapping allowed Search tool to make immediate use of the metadata.
The Tagging Problem
• How are we going to populate metadata elements with complete and consistent values?
• What can we expect to get from automatic classifiers?
Tagging
• Province of authors (SMEs) or editors?
• Taxonomy often highly granular to meet task and re-use needs
• Vocabulary dependent on originating department
• The more tags there are (and the more values for each tag), the more hooks to the content
• If there are too many, authors will resist and use “general” tags (if available).
• Automatic classification tools exist, and are valuable, but results are not as good as humans can do.
  • “Semi-automated” is best
  • Degree of human involvement is a cost/benefit tradeoff
Automatic categorization vendors | Analyst viewpoint
[Chart: vendors positioned by Accuracy Level (low to high) and Content Volumes (low to high)]
Considerations in Automatic Classifier Performance
• Classification performance is measured by “inter-cataloger agreement”
  • Trained librarians agree less than 80% of the time
  • Errors are subtle differences in judgment, or big goofs
• Automatic classification struggles to match human performance
  • Exception: Entity recognition can exceed human performance
• Classifier performance limited by algorithms available, which is limited by development effort
• Very wide variance in one vendor’s performance depending on who does the implementation, and how much time they have to do it
1) 80/20 tradeoff where 20% of effort gives 80% of performance.
2) Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
[Chart: Accuracy versus Development Effort / Licensing Expense; labels: Regexps, Trained Librarians, potential performance gain]
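Inter-cataloger agreement in its simplest form is the share of documents to which two catalogers (or a cataloger and a classifier) assign the same category. The sketch below computes that raw percentage; the sample labels are made up, and chance-corrected measures such as Cohen's kappa are deliberately left out.

```python
def percent_agreement(tags_a, tags_b):
    """Share of documents to which two catalogers assigned the same category."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both catalogers must tag the same documents")
    if not tags_a:
        raise ValueError("need at least one document")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical labels for five documents.
librarian = ["News", "Policy", "Manual", "News", "Policy"]
classifier = ["News", "Policy", "News", "News", "Manual"]
print(percent_agreement(librarian, classifier))  # 0.6
```

Because trained librarians themselves agree less than 80% of the time, a classifier's score against one human's tags should be read against that ceiling rather than against 100%.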
Tagging tool example | Interwoven MetaTagger
Manual form fill-in w/ check boxes, pull-down lists, etc.
Auto keyword & summarization
Tagging tool example | Interwoven MetaTagger
Auto-categorization
Parse & lookup (recognize names)
Rules & pattern matching
Metadata tagging workflows
[Flowchart: sample of ‘author-generated’ metadata workflow]
Roles shown: Analyst; Editor; Copywriter; Tagging Tool; Sys Admin
Steps shown: Compose in Template; Submit to CMS; Automatically fill-in metadata; Approve/Edit metadata; Review content (Problem? Y/N); Copy Edit content (Problem? Y/N); Hard Copy; Web site
• Even ‘purely’ automatic meta-tagging systems need a manual error correction procedure.
  • Should add a QA sampling mechanism
• Tagging models:
  • Author-generated
  • Central librarians
  • Hybrid – central auto-tagging service, distributed manual review and correction
Automatic categorization vendors | Pragmatic viewpoint
[Chart: vendors positioned by Accuracy Level (low to high) and Content Volumes (low to high)]
Seven practical rules for taxonomies
1. Incremental, extensible process that identifies and enables users, and engages stakeholders
2. Quick implementation that provides measurable results as quickly as possible
3. Not monolithic—has separately maintainable facets
4. Re-uses existing IP as much as possible
5. A means to an end, and not the end in itself
6. Not perfect, but it does the job it is supposed to do—such as improving search and navigation
7. Improved over time, and maintained
Typical Project Timeline*
[Gantt chart across Weeks 1-2, 3-4, 5-6, 7-8, 9-10, and 11-12]
Phases shown: Kick-off and Prep; Interviews; Content Analysis; Document Requirements; Develop & Validate Taxonomy; Implementation
* Caveat: this will vary greatly based on the complexity of the content and the organization
Seven phases of taxonomy and metadata design
1 Identify Objectives: Conduct interviews
2 Inventory Content: ID sources, spider assets & extract metadata
3 Specify Metadata: Define fields & purpose
4 Model Content: Define content chunks & XML DTDs
5 Specify Vocabularies: Compile controlled vocabularies
6 Specify Procedures: Develop workflow, rules & procedures
7 Train Staff: Develop materials & train staff
Seven phases of taxonomy and metadata design
[Matrix: the seven phases across four project stages]
Stages and participants:
• Plan & Prototype: Project Team
• Alpha Dev & Test: Stakeholders and SMEs
• Beta D&T: Friendly Users
• Final D&T: Audiences
Plan & Prototype:
1 Identify Objectives: Interview core team and stakeholders
2 Inventory Content: ID sources, spider assets & extract metadata
3 Specify Metadata: Define fields & purpose
4 Model Content: Define content chunks & XML DTDs
5 Specify Vocabularies: Compile controlled vocabularies
6 Specify Procedures: Start with UI sketches, off-the-shelf rules.
7 Train Staff: Manually tag small sample
Activities in the later stages, in slide order:
• Review tagged samples, default procedures
• Gather additional sources, if any
• Revise if needed, bake into alpha CMS
• Revise if needed, bake into alpha CMS
• Revise, use in alpha CMS
• Alpha workflows in CMS
• Use alpha CMS to tag larger sample
• Interview alpha users
• Modify CMS for beta
• Modify CMS for beta
• Revise, use in beta CMS
• Modify & extend workflows
• Finalize training materials & train staff
• Gather additional sources, if any
• Tailor the default materials
• Use beta CMS to tag larger sample
• Interview beta users
• Modify for 1.0
• Modify for 1.0
• Revise using team procedure
• Finalize procedure materials
Project Prep | Key Considerations
• What is the level of knowledge about taxonomy in the company as a whole?
• What are the most important priorities for the taxonomy?
• How much do I know about the subject matter? How much ramp up do I need?
• How many types of content will I need to consider?
• How much content is there (quantity-wise)?
• How many stakeholders and subject matter experts (SMEs) are there? How are they organized? (e.g. one “owner/SME” per product line?)
• What types of politics or challenges exist today between groups of owners/subject matter experts? Will they debate and/or argue over terminology or what should be classified where?
Project Prep | Key Considerations
• Does any of the terminology need to be created from scratch or re-written?
• What kind of data store will the taxonomy be used in? (Database? XML repository?)
• Has any user feedback been received so far (internal or external, formal or informal), as to what they like and don’t like about finding the company’s information?
• Is there a product database of any sort in existence today? What product characteristics are accounted for? (name, description, number, etc.)
• If there is a web site, how is it organized today? (e.g. products, solutions, roles, etc.)
• How will users tag content using this taxonomy? Do they have that software/interface in place today?
• Will we need to train users to tag content?
Content Analysis | Steps and Approaches
• Conduct stakeholder interviews to determine project goals and success metrics
  • Be sure to be prepared with your own!
• Conduct industry competitive analysis if appropriate
• Review content and create a high-level inventory
• Determine the terms the business uses to categorize information (top-down approach)
• Determine the terms the employees use when seeking information (bottom-up approach)
• Gather all terms / categories / content types
• Check vis-à-vis original content inventory to ensure everything is accounted for
Example | Document Topic Inventory
Example | Product Topic Inventory
Taxonomy creation process | Steps and Approaches
• SME analysis of content to determine categories and/or tags
• Workshops with SMEs and stakeholders to gain additional understanding of content
• Card sorting exercises with business users or end customers to determine intuitive clustering and category names
• Auto-generation of “rough” taxonomy via software tool
  • Refine with SMEs and taxonomy experts
• Iterative taxonomy creation over a period of several weeks depending on size and scope of the effort
• Validate taxonomy via user testing
Taxonomy creation process | Best practices
• Be aware of the competition: how they name and categorize products
• Involve engineers early: ensure that the taxonomy you’re creating can be used with the technology
• Be aware of key parties’ viewpoints
• After determining the high-level categories, have a midpoint check-in with stakeholders to ensure you’re on the right track and build ongoing consensus
• For the purposes of web design, leverage sample page layouts to show how categorization and tagging will affect page layout and content
• Remember taxonomies must evolve and progress as your business changes
Taxonomy Business Processes
• Taxonomies must change, gradually, over time if they are to remain relevant
• Maintenance processes need to be specified so that the changes are based on rational cost/benefit decisions
• A team will need to maintain the taxonomy on a part-time basis
• Taxonomy team reports into CM governance or steering committee
Taxonomy governance | Change process overview
[Diagram: Taxonomy Governance Environment]
1: External controlled vocabularies (CVs) change on their own schedule
2: Taxonomy Team decides when to update CV snapshots
3: Team adds value via definitions, synonyms, classification rules, training materials, etc.
4: Updated versions of CVs published to consumers
CV Sources: Subject Codes; Expertise; Other Internal; External Standard; Internally Created
Taxonomy Tool: working copies of CVs (Taxonomy Facets)
CV Consumers: Site Search Tool; Portal; Working Papers; Web CMS; DAM; Tagging Tool; Search UI
NASA instance of the same diagram: sources are internally created CVs, codes, NASA competencies, CVs from other NASA sources, and external standard vocabularies; consumers are the site search tool, portal, project archives, DMS, metatagging tool, and search UI
Taxonomy governance | Generic team charter
• Taxonomy Team is responsible for maintaining:
  • The Taxonomy, a multi-faceted classification scheme
  • Associated taxonomy materials, such as:
    • Editorial Style Guide
    • Taxonomy Training Materials
    • Metadata Standard
  • Team rules and procedures (subject to CIO review)
• Committee will consider costs and benefits of suggested changes
• Taxonomy Team will:
  • Manage relationship between providers of source vocabularies and consumers of the Taxonomy
  • Identify new opportunities for use of the Taxonomy across the Enterprise to improve information management practices
  • Promote awareness and use of the Taxonomy
Taxonomy governance team | Generic roles
• Executive Sponsor
  • Advocate for the taxonomy team
• Business Lead
  • Keeps committee on track with larger business objectives
  • Balances cost/benefit issues to decide appropriate levels of effort
    • Specialists help in estimating costs
  • Obtains needed resources if those in committee can’t accomplish a particular task
• Technical Specialist
  • Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
  • Helps obtain data from various systems
• Content Specialist
  • Committee’s liaison to content creators
  • Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
• Taxonomy Specialist
  • Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
  • Makes edits to taxonomy, installs into system with aid of IT specialist
• Content Owner
  • Reality check on process change suggestions
Taxonomy governance | Where changes come from
[Diagram: change requests flowing to the Taxonomy Editor and Steering Committee]
Components shown: End User; Application UI; Application Logic; Content; Taxonomy; Tagging Logic; Tagging UI; Tagging Staff; Taxonomy Editor; Steering Committee; Firewall
Sources of change requests:
• End user experience
• Staff notes on ‘missing’ concepts
• Query log analysis
• Requests from other parts of the organization
Recommendations by Editor
1. Small taxonomy changes (labels, synonyms)
2. Large taxonomy changes (retagging, application changes)
3. New “best bets” content
Committee considerations
1. Business goals
2. Changes in user experience
3. Retagging cost
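Query log analysis, one of the change sources named on this slide, can start as a count of frequent searches that match nothing in the vocabulary. The sketch below is a minimal illustration; the known terms, the threshold, and the `missing_concepts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary: preferred terms plus their synonyms, lowercased.
KNOWN_TERMS = {"hypertension", "high blood pressure", "diabetes"}


def missing_concepts(queries, known_terms=KNOWN_TERMS, min_count=2):
    """Return (query, count) pairs for frequent queries that match no known term."""
    counts = Counter(query.strip().lower() for query in queries)
    return [
        (query, count)
        for query, count in counts.most_common()
        if count >= min_count and query not in known_terms
    ]
```

The output is a candidate list for the taxonomy editor, not an automatic change: each entry still has to be judged as a new term, a synonym for an existing one, or noise.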
Taxonomy governance | Taxonomy maintenance workflow
[Flowchart]
Roles shown: Analyst; Editor; Copywriter; Sys Admin
Steps shown: Suggest new name/category; Review new name (Problem? Yes/No); Copy edit new name (Problem? Yes/No); Add to enterprise Taxonomy (Taxonomy Tool)
Sample Taxonomy Editor: Data Harmony
Hierarchy Browser
Standard Term Info
Taxonomy editing tools vendors
[Quadrant chart: Ability to Execute (low to high) versus Completeness of Vision; quadrants labeled Niche Players and Visionaries]
• Widely used, cheap, single-user
• High functionality, high cost ($100k!)
• Most popular taxonomy editor? MS Excel
• Immature industry – no vendors in upper-right quadrant!
Measuring Metadata and Taxonomy Quality
• Taxonomy development is an iterative process
• Elicit feedback via walk-throughs, tagging samples, and card sorting exercises
• Use both qualitative and quantitative methods, and remain flexible throughout
Taxonomy testing | Qualitative methods
Method            | Process                          | Validation
Walk-throughs     | Show and explain                 | Approach; consistency to rules; appropriateness to task
Usability Testing | Contextual analysis              | Tasks are completed successfully; time to complete task is reduced
User Satisfaction | Survey                           | Reaction to new interface; reaction to search results
Tagging samples   | Tag sample content with taxonomy | Content ‘fit’; fills out content inventory; training materials for people & algorithms; basis for quantitative methods
Quantitative Method | How evenly does it divide the content?
• Background:
  • Documents do not distribute uniformly across categories
  • Zipf (1/x) distribution is expected behavior
  • 80/20 rule in action (actually 70/20 rule)
• Methodology:
  • Part of alpha test of ‘content type’ for corporate intranet
  • 115 URLs selected at random from search index were manually categorized. Inaccessible files and ‘junk’ were removed
• Results:
  • Results were slightly more uniform than the Zipf distribution, which is better than expected
[Chart: Measured and Expected Distribution of Content Types in an Intranet. X axis: Content Type (People, Groups & Places; News & Events; Manuals & Learning Materials; Operations & Internal Communications; Marketing & Sales; Regulations, Policies, Procedures &; Papers & Presentations; Other & Unclassified; Programs, Proposals, Plans & Schedules). Y axis: # Documents (0 to 25). Series: Measured, Expected.]
[Chart: Measured and Expected Distribution of Top 10 Content Types in Library of Congress Database. X axis: Top 10 Content Types (Congresses, Biography, Periodicals, Maps, Fiction, Exhibitions, Juvenile literature, Bibliography, Statistics). Y axis: Number of Records (0 to 350,000). Series: Series1, Series2.]
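The “expected” series in a comparison like this can be derived from a Zipf (1/x) curve. The sketch below is a hedged illustration only: the measured counts are invented (the per-category numbers from the charts are not recoverable here), and `zipf_expected` and `top_share` are hypothetical helper names. It scales 1/rank weights so they sum to the sample size, then compares each measured count with its expectation.

```python
def zipf_expected(total, n_categories):
    """Expected count per rank under a 1/rank (Zipf) distribution."""
    weights = [1.0 / rank for rank in range(1, n_categories + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]

def top_share(counts, k):
    """Fraction of all items that fall in the k largest categories."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Invented measured counts for 9 content types (not the slide's data).
measured = sorted([20, 16, 13, 12, 10, 9, 8, 7, 5], reverse=True)
expected = zipf_expected(sum(measured), len(measured))

for rank, (m, e) in enumerate(zip(measured, expected), start=1):
    print(f"rank {rank}: measured={m:3d} expected={e:5.1f}")

# A smaller top-k share than Zipf predicts means a more uniform split.
print(f"top-2 share measured: {top_share(measured, 2):.2f}")
print(f"top-2 share expected: {top_share(expected, 2):.2f}")
```

In this invented sample the two largest categories hold a smaller share than the Zipf curve predicts, which is the sense in which a measured distribution can be “more uniform” than expected.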
Quantitative Method | How intuitive (repeatable) are the categorizations?
• Methodology: Closed Card Sort
  • For alpha test of a grocery site
  • 15 testers put each of 100 best-selling products into one of 10 pre-defined categories
  • Products that fewer than 14 of 15 testers put into the same category were flagged
• Results:
Testers agreeing | Cumulative % of products
15/15 | 54%
14/15 | 70%
13/15 | 77%
12/15 | 83%
11/15 | 85%
<11/15 | 100%
In the trade, “Corn Tortillas” are a Dairy item!
“Cocoa Drinks – Powder” is best categorized in both “Beverages” and “Grocery”.
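An agreement table like the one on this slide can be computed directly from raw card-sort results. The following is a minimal sketch under stated assumptions: the products, categories, and votes are invented, five testers stand in for the fifteen in the study, and `agreement_table` and `flagged` are hypothetical helpers. For each product it finds the largest group of testers who chose the same category, then reports the cumulative share of products at each agreement level.

```python
from collections import Counter

def max_agreement(choices):
    """Size of the largest group of testers choosing the same category."""
    return Counter(choices).most_common(1)[0][1]

def agreement_table(sorts, n_testers):
    """Cumulative fraction of products with at least k testers agreeing."""
    levels = [max_agreement(choices) for choices in sorts.values()]
    table = {}
    for k in range(n_testers, 0, -1):
        table[k] = sum(1 for lvl in levels if lvl >= k) / len(levels)
    return table

def flagged(sorts, threshold):
    """Products where fewer than `threshold` testers agreed."""
    return sorted(p for p, c in sorts.items() if max_agreement(c) < threshold)

# Invented results: product -> category chosen by each of 5 testers.
SORTS = {
    "Whole Milk": ["Dairy"] * 5,
    "Orange Juice": ["Beverages"] * 5,
    "Corn Tortillas": ["Bakery", "Bakery", "Grocery", "Dairy", "Bakery"],
    "Cocoa Drink Powder": ["Beverages", "Grocery", "Beverages",
                           "Grocery", "Beverages"],
}

table = agreement_table(SORTS, n_testers=5)
for k, share in table.items():
    print(f"{k}/5 testers agree: {share:.0%} of products (cumulative)")
print("flagged:", flagged(SORTS, threshold=4))
```

Flagged products are the ones worth a second look: either the category labels are unclear, or the product legitimately belongs in more than one category.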
Quantitative Method | How does taxonomy “shape” match that of content?
Term Group | % Terms | % Docs
Administrators | 7.8 | 15.8
Community Groups | 2.8 | 1.8
Counselors | 3.4 | 1.4
Federal Funds Recipients and Applicants | 9.5 | 34.4
Librarians | 2.8 | 1.1
News Media | 0.6 | 3.1
Other | 7.3 | 2.0
Parents and Families | 2.8 | 6.0
Policymakers | 4.5 | 11.5
Researchers | 2.2 | 3.6
School Support Staff | 2.2 | 0.2
Student Financial Aid Providers | 1.7 | 0.7
Students | 27.4 | 7.0
Teachers | 25.1 | 11.4
Source: Courtesy Keith Stubbs, U.S. Dept. of Education
• Background:
  • Hierarchical taxonomies allow comparison of “fit” between content and taxonomy areas
• Methodology:
  • 25,380 resources tagged with taxonomy of 179 terms. (Avg. of 2 terms per resource)
  • Counts of terms and documents summed within taxonomy hierarchy
• Results:
  • Roughly Zipf distributed (top 20 terms: 79%; top 30 terms: 87%)
  • Mismatches between term% and document% flagged
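The mismatch check in the last bullet can be sketched as a ratio test over the term group table on this slide. The figures below are copied from that table; the factor-of-three threshold, the “docs-heavy” and “terms-heavy” labels, and the function name `flag_mismatches` are assumptions made for illustration, since the slide does not say how mismatches were identified.

```python
# (term group, % of terms, % of docs), copied from the slide's table.
SHAPE = [
    ("Administrators", 7.8, 15.8),
    ("Community Groups", 2.8, 1.8),
    ("Counselors", 3.4, 1.4),
    ("Federal Funds Recipients and Applicants", 9.5, 34.4),
    ("Librarians", 2.8, 1.1),
    ("News Media", 0.6, 3.1),
    ("Other", 7.3, 2.0),
    ("Parents and Families", 2.8, 6.0),
    ("Policymakers", 4.5, 11.5),
    ("Researchers", 2.2, 3.6),
    ("School Support Staff", 2.2, 0.2),
    ("Student Financial Aid Providers", 1.7, 0.7),
    ("Students", 27.4, 7.0),
    ("Teachers", 25.1, 11.4),
]

def flag_mismatches(rows, factor=3.0):
    """Flag groups whose document share and term share differ by >= factor.

    The threshold is an assumed value, not one given in the source.
    """
    flagged = []
    for group, pct_terms, pct_docs in rows:
        ratio = pct_docs / pct_terms
        if ratio >= factor:
            # Many documents, few terms: the area may need more detail.
            flagged.append((group, "docs-heavy"))
        elif ratio <= 1.0 / factor:
            # Many terms, few documents: the area may be over-specified.
            flagged.append((group, "terms-heavy"))
    return flagged

mismatches = flag_mismatches(SHAPE)
for group, kind in mismatches:
    print(f"{group}: {kind}")
```

With this threshold, “Students” comes out terms-heavy (27.4% of terms, 7.0% of documents) and “Federal Funds Recipients and Applicants” docs-heavy (9.5% of terms, 34.4% of documents).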
Metadata Maturity Model
• Taxonomy governance processes must fit the organization
• As consultants, we notice different levels of maturity in the business processes around Content Management, Taxonomy, and Metadata
• Honestly assess your organization’s metadata maturity in order to design appropriate governance processes
• We are starting to define a maturity model, similar to the Capability Maturity Model (CMM) in the software world:
  • Initial – Ad hoc; each project begins from scratch.
  • Repeatable – Procedures are defined and used, but are not standardized across the organization or are misapplied to projects.
  • Defined – Standard processes are tailored for project needs. Strategic training for long-range goals is in place.
  • Managed – Projects are managed using quantitative quality measures. The process itself is measured and controlled.
  • Optimizing – Continual process improvement. Extremely accurate project estimation.
Purpose of Maturity Model
• Estimating the maturity of an organization’s information management processes tells us:
  • How involved the taxonomy development and maintenance process should be
    • Overly sophisticated processes will fail
  • What to recommend as first steps
• Maturity is not a goal, it is a characterization of an organization’s methods for achieving particular goals
• Mature processes have expenses which must be justified by consequent cost savings or revenue gains
• IT Maturity may not be core to your business
Metadata Maturity Scorecard
Initial | Repeatable | Defined | Managed | Optimizing
Organizational Structure
Executive Sponsorship *
Budgeting *
Hiring & Training *
Quality Assurance
Manual Processes * 1
Automated Processes *
Project Management
Estimating & Scheduling *
Cost Control *
Project Methodology * 2
Design and Execution
Planning *
Design Excellence *
Development Maturity *
1 – X is starting to examine search query logs, which is an important first step in improving search. But this is only an isolated example.
2 – IT has a project methodology they are trying to use across all projects. But not all business units have project methodologies.
Metadata Maturity Quick Quiz
1) What process is in place to examine query logs?
2) Is there a process for adding directories and content to the repository, or do people just do what they want?
3) Is there an organization-wide metadata standard, such as an extension of the Dublin Core, used by search tools, multiple repositories, etc.?
4) Are system features and metadata fields added based on cost/benefit analysis, rather than things that are easy to do with the current tools?
5) Who is breathing down my neck to improve search on our intranet?
6) Is there an ongoing data cleansing procedure to look for ROT (Redundant, Obsolete, Trivial content)?
7) Is there an established QA procedure for ensuring metadata accuracy and conformance? Are there established qualitative and quantitative measures of metadata quality?
8) Is there a centralized metadata group with tools and services offered around the organization?
9) Are there hiring and training practices especially for metadata and taxonomy positions?
10) Have features been removed from the metadata standard?
Example: PC Connection
Leveraging Technology: Endeca
• Drop-down menus
  • More visible
  • More traditional to match user expectations
• Challenge: Give the “results” more real estate but keep the filters prominent
PC Connection: The Solution
• DHTML “slider” applied for second-level navigation, exposing all product attributes for easy filtering
• Validated solution
  • Powerful comparisons between first and second usability tests (e.g., 5 out of 8 participants used filters on first test, 10 out of 10 used filters on second test)
PC Connection: Results
• All product categories consistently accessible
• Drop-down menus with product attributes facilitate ease of filtering
• Easier to use different facets of taxonomy to find desired products
• Customers use, rather than struggle with, navigation
Lessons Learned: Taxonomies for Business Impact
• Content is no longer king: the user is
• Understand how your users/customers want to interact with information before designing your taxonomy and the user interface
• Carry those user needs through to the back-end data structure and front-end user interface
• Empower the user with the categories and content attributes they need to filter and find what they want
• Leverage UE design best practices like usability testing to determine needs and validate taxonomy and interface design
• Remember that taxonomy is a “snapshot in time”: keep it up to date, let it evolve
Summary
• What is the problem you are trying to solve?
  • Improve search (or findability)
  • Browse for content on an enterprise-wide portal
  • Enable business users to syndicate content
  • Otherwise provide the basis for content re-use
  • Comply with regulations
• What data and metadata do you need to solve it?
• Where will you get the data and metadata?
• How will you control the cost of creating and maintaining the data and metadata needed to solve these problems?
  • CMS with metadata tagging products
  • Semi-automated classification
  • Taxonomy editing tools
  • Appropriate governance process