Bringing Order to the Web: Automatically Categorizing Search Results
Hao Chen, CS Division, UC Berkeley
Susan Dumais, Microsoft Research
ACM:CHI April 4, 2000
Organizing Search Results
• List Organization
• Category Organization (SWISH)
Query: jaguar
Outline
• Background
  • Using category structure to organize information
• SWISH System (Searching With Information Structured Hierarchically)
  • Text classification
  • User interface
• User Study
• Future Work
Using Category Structure
To Organize Information
• Superbook, Cat-a-Cone, etc.
To Help Web Search
• Yahoo!, Northern Light
What’s New in SWISH?
• Automatic categorization of new documents
• User interface that tightly couples hierarchical category structure with search results
• User study for the new user interface
SWISH System
Combines the Advantages of
• Manually crafted & easily understood directory structure
• Broad coverage from search engines
System Components
• Text classification models
• User interface
Text Classification
• Assign documents to one or more of a predefined set of categories
• E.g., news feeds, email (spam/no-spam), Web data
• Manually vs. automatically
Inductive Learning for Classification
• Training set: a set of manually classified documents
• Learning: learn classification models
• Classification: use the model to automatically classify new documents
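The three inductive-learning steps above can be sketched in a few lines of Python. This is an illustration only: the four training documents are invented, and the word-frequency profile used here is a deliberately simple stand-in for a real learner, not the classifier described in this talk.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def learn(training_set):
    """Learning: build one relative word-frequency profile per category."""
    counts = defaultdict(Counter)
    for text, category in training_set:
        counts[category].update(tokenize(text))
    models = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        models[category] = {word: n / total for word, n in counter.items()}
    return models


def classify(models, text):
    """Classification: return the category whose profile best matches text."""
    words = tokenize(text)
    scores = {
        category: sum(profile.get(word, 0.0) for word in words)
        for category, profile in models.items()
    }
    return max(scores, key=scores.get)


# Training set: manually classified documents (invented examples).
training_set = [
    ("car auto parts dealer", "Automotive"),
    ("motorcycle vehicle repair", "Automotive"),
    ("software downloads for windows pc", "Computers & Internet"),
    ("web hosting provider", "Computers & Internet"),
]

models = learn(training_set)
print(classify(models, "cheap car parts"))
print(classify(models, "windows software"))
```

This sketch returns a single best category. Assigning a document to more than one category, as the first bullet allows, would mean returning every category whose score clears some threshold.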
Training Set: LookSmart Web Directory
Category Structure (spring 99)
• 13 top-level categories
• 150 second-level categories
Training Set
• ~50k web pages; chosen randomly from all categories
Top-level Categories
• Automotive
• Business & Finance
• Computers & Internet
• Entertainment & Media
• Health & Fitness
• Hobbies & Interests
• Home & Family
• People & Chat
• Reference & Education
• Shopping & Services
• Society & Politics
• Sports & Recreation
• Travel & Vacations
Learning & Classification
Support Vector Machine (SVM)
• Accurate and efficient for text classification (Dumais et al., Joachims)
• Model = weighted vector of words
  • “Automobile” = motorcycle, vehicle, parts, automobile, harley, car, auto, honda, porsche …
  • “Computers & Internet” = rfc, software, provider, windows, user, users, pc, hosting, os, downloads ...
Hierarchical Models
• 1 model for N top level categories
• N models for second level categories
• Very useful in conjunction w/ user interaction
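A minimal sketch of how weighted word vectors and the two-level hierarchy could fit together. Everything numeric here is an assumption: the weights are invented, the second-level category names are hypothetical, and a trained SVM would also carry a bias term and a decision threshold that this sketch leaves out.

```python
def linear_score(weights, words):
    """Sum the weights of the distinct words that appear in the document."""
    return sum(weights.get(word, 0.0) for word in set(words))


def best_category(models, words):
    """Pick the category whose weighted word vector scores highest."""
    return max(models, key=lambda name: linear_score(models[name], words))


# Top level: one weight vector per top-level category (invented weights).
TOP_LEVEL = {
    "Automotive": {"motorcycle": 1.0, "vehicle": 0.9, "parts": 0.8,
                   "car": 0.7, "harley": 0.6},
    "Computers & Internet": {"software": 1.0, "windows": 0.8,
                             "pc": 0.7, "hosting": 0.6},
}

# Second level: one group of models per top-level category
# (hypothetical subcategory names, invented weights).
SECOND_LEVEL = {
    "Automotive": {
        "Motorcycles": {"motorcycle": 1.0, "harley": 0.9},
        "Cars": {"car": 1.0, "porsche": 0.8},
    },
    "Computers & Internet": {
        "Software": {"software": 1.0, "downloads": 0.7},
        "Hosting": {"hosting": 1.0, "provider": 0.6},
    },
}


def classify_hierarchically(text):
    """Choose a top-level category, then a second-level one beneath it."""
    words = text.lower().split()
    top = best_category(TOP_LEVEL, words)
    second = best_category(SECOND_LEVEL[top], words)
    return top, second


print(classify_hierarchically("harley motorcycle parts"))
print(classify_hierarchically("windows hosting provider"))
```

Because the second-level models are only consulted for the chosen top-level category, second-level classification can be deferred until a user actually expands that category.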
SWISH Architecture
[Diagram components: manually classified web pages; Train (offline); SVM model; web search results; local search results; ...; Classify (online)]
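The online half of this architecture, classifying each search result and grouping it under its category, might look roughly like the sketch below. The result titles and summaries for the query "jaguar" are invented, and `keyword_classifier` is a hypothetical stand-in for the model trained offline.

```python
def group_by_category(results, classify):
    """Classify each result and bucket it, keeping the original rank order."""
    groups = {}
    for rank, result in enumerate(results, start=1):
        category = classify(result["summary"])
        groups.setdefault(category, []).append((rank, result["title"]))
    return groups


def keyword_classifier(summary):
    """Hypothetical stand-in for the trained model: simple keyword rules."""
    words = set(summary.lower().split())
    if words & {"car", "dealer"}:
        return "Automotive"
    if words & {"football", "team"}:
        return "Sports & Recreation"
    return "Reference & Education"


# Invented search results for the query "jaguar", in ranked order.
results = [
    {"title": "Jaguar Cars", "summary": "Official site for the luxury car maker"},
    {"title": "Jacksonville Jaguars", "summary": "News about the football team"},
    {"title": "Jaguar dealer locator", "summary": "Find a dealer near you"},
    {"title": "Big cats: the jaguar", "summary": "Photos of wild jaguars in the rainforest"},
]

groups = group_by_category(results, keyword_classifier)
for category, pages in groups.items():
    print(category, pages)
```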
Interface Characteristics
Problems
• Large amount of information to display
  • Search results
  • Category structure
• Limited screen real estate
Solutions
• Information overlay
• Distilled information display
Information Overlay
Use tooltips to show
• Summaries of web pages
• Category hierarchy
Expansion of Category Structure
Expansion of Web Page List
User Study - Conditions
• Category Interface
• List Interface
User Study
Participants:
• 18 intermediate Web users
Tasks
• 30 search tasks, e.g., “Find home page for Seattle Art Museum”
• Search terms are fixed for each task
Experimental Design
• Category/List: within subjects, 15 search tasks with each interface
• Order (Category/List First): counterbalanced between subjects
Both Subjective and Objective Measures
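The design above can be written out as an assignment plan. This is a sketch under assumptions: the slides do not say how the 30 tasks were split between the two interfaces or how participants were assigned to an order, so the fixed 15/15 split and the odd/even alternation below are illustrative choices.

```python
def build_plan(n_participants=18, n_tasks=30):
    """Give every participant both interfaces, alternating which comes first."""
    tasks = list(range(1, n_tasks + 1))
    half = n_tasks // 2
    plan = []
    for participant in range(1, n_participants + 1):
        # Assumed rule: odd-numbered participants start with Category.
        if participant % 2 == 1:
            order = ["Category", "List"]
        else:
            order = ["List", "Category"]
        plan.append({
            "participant": participant,
            "blocks": [(order[0], tasks[:half]), (order[1], tasks[half:])],
        })
    return plan


plan = build_plan()
category_first = sum(1 for p in plan if p["blocks"][0][0] == "Category")
print(category_first, len(plan) - category_first)
```

With 18 participants the alternation yields nine people per order, and each person completes 15 tasks with each interface.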
Subjective Results
7-point rating scale (1=disagree; 7=agree)
Questions:

Question                                                        Category  List  Significance
It was easy to use this software.                               6.4       3.9   p<.001
I liked using this software                                     6.7       4.3   p<.001
I prefer this to my usual Web Search engine                     6.4       4.3   p<.001
It was easy to get a good sense of the range of alternatives.   6.4       4.2   p<.001
I was confident that I could find information if it was there.  6.3       4.4   p<.001
The "More" button was useful                                    6.5       6.1   n.s.
The display of summaries was useful                             6.5       6.4   n.s.
Use of Interface Features
Average Number of Uses of Feature per Task

Interface Feature                  Category  List  Significance
Expanding / Collapsing Structure   0.78      0.48  p<.003
Viewing Summaries in Tooltips      2.99      4.60  p<.001
Viewing Web Pages                  1.23      1.41  p<.053
Search Time
Category: 56 secs
List: 85 secs
p < .002
50% faster with Category interface
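The "50% faster" figure can be checked against the two reported times. It matches most closely when the gap is expressed relative to the Category time; which baseline the authors had in mind is an assumption here.

```python
category_secs = 56
list_secs = 85

difference = list_secs - category_secs        # seconds saved per task
list_extra = difference / category_secs       # gap relative to Category time
category_saving = difference / list_secs      # gap relative to List time

print(f"List takes {list_extra:.0%} longer than Category")
print(f"Category takes {category_saving:.0%} less time than List")
```

So the List interface took roughly half again as long, which is the same as saying the Category interface cut search time by about a third.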
[Figure: RT for Category vs. List. x-axis: Interface Condition (Category, List); y-axis: Average Median RT, scale 0 to 100]
Search Time by Query Difficulty
Top20: 57 secs
NotTop20: 98 secs
• No reliable interaction between query difficulty and interface condition
• Category interface is helpful for both easy and difficult queries
[Figure: RT by Interface and Query Difficulty. x-axis: Interface Condition (Category, List); y-axis: Average Median RT, scale 0 to 160; series: Easy (Top20), Hard (NotTop20)]
Summary
Text Classification
• Organize search results
• Use hierarchical category models
• Classify new web pages on-the-fly
User Interface
• Tightly couple search results with category structure
• Allow manipulation of presentation of category structure
User Study
• Suggests strong preference and performance advantages for categorically organized presentation of search results
Open Issues
• Improve Accuracy of Classification Algorithms
• Enhance User Interface
  • Heuristics for selecting categories and pages to display
    • Query_Match: rank of page, and sometimes match score
    • Categ_Match: p(category for each page)
  • Integration with non-content information
• Conduct End-to-end User Study
More info: http://research.microsoft.com/~sdumais
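The talk leaves the display heuristics as an open issue. Purely as an illustration, the sketch below shows one hypothetical way the Query_Match and Categ_Match signals could be combined. The cutoffs (`min_prob`, `max_categories`, `pages_per_category`), the field names, and the page data are all invented and are not taken from SWISH.

```python
def select_for_display(pages, max_categories=2, pages_per_category=2,
                       min_prob=0.5):
    """Pick which categories and pages to show (one hypothetical heuristic)."""
    by_category = {}
    for page in pages:
        # Categ_Match: drop pages the classifier is unsure about.
        if page["categ_match"] >= min_prob:
            by_category.setdefault(page["category"], []).append(page)

    # Query_Match: order categories by their best (lowest) search rank.
    ranked = sorted(by_category.items(),
                    key=lambda item: min(p["rank"] for p in item[1]))

    display = []
    for category, members in ranked[:max_categories]:
        members = sorted(members, key=lambda p: p["rank"])
        display.append((category,
                        [p["rank"] for p in members[:pages_per_category]]))
    return display


# Invented pages: search rank plus the classifier's category probability.
pages = [
    {"rank": 1, "category": "Automotive", "categ_match": 0.90},
    {"rank": 2, "category": "Sports & Recreation", "categ_match": 0.80},
    {"rank": 3, "category": "Automotive", "categ_match": 0.40},
    {"rank": 4, "category": "Automotive", "categ_match": 0.70},
    {"rank": 5, "category": "Reference & Education", "categ_match": 0.60},
    {"rank": 6, "category": "Automotive", "categ_match": 0.95},
]

selected = select_for_display(pages)
print(selected)
```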
Searching With Information Structured Hierarchically
SWISH