Post on 11-Jan-2016
Automatic Classification of Text Databases Through Query Probing
Panagiotis G. Ipeirotis, Luis Gravano
Columbia University
Mehran Sahami
E.piphany Inc.
Search-only Text Databases
Sources of valuable information
Hidden behind search interfaces
Non-crawlable
Example: Microsoft Support KB
Interacting With Searchable Text Databases
1. Searching: Metasearchers
2. Browsing: Use Yahoo-like directories
3. Browse & search: “Category-enabled” metasearchers
Searching Text Databases: Metasearchers
Select the good databases for a query
Evaluate the query at these databases
Combine the query results from the databases
Examples: MetaCrawler, SavvySearch, Profusion
Browsing Through Text Databases
Yahoo-like web directories:
InvisibleWeb.com
SearchEngineGuide.com
TheBigHub.com
Example from InvisibleWeb.com: Computers > Publications > ACM DL
Category-enabled metasearchers: user-defined category (e.g. Recipes)
Problem With Current Classification Approach
Classification of databases is done manually
This requires a lot of human effort!
How to Classify Text Databases Automatically: Outline
Definition of classification
Strategies for classifying searchable databases through query probing
Initial experiments
Database Classification: Two Definitions
Coverage-based classification:
The database contains many documents about the category (e.g. Basketball)
Coverage: number of documents about the category
Specificity-based classification:
The database contains mainly documents about the category
Specificity: (number of documents about the category) / |DB|
Database Classification: An Example
Category: Basketball
Coverage-based classification: ESPN.com, NBA.com
Specificity-based classification: NBA.com, but not ESPN.com
Categorizing a Text Database: Two Problems
Find the category of a given document
Find the category of all the documents inside the database
Categorizing Documents
Several text classifiers available
RIPPER (AT&T Research, William Cohen, 1995)
Input: a set of pre-classified, labeled documents
Output: a set of classification rules
Categorizing Documents: RIPPER
Training set: preclassified documents
"Linux as a web server": Computers
"Linux vs. Windows: …": Computers
"Jordan was the leader of the Chicago Bulls": Sports
"Smoking causes lung cancer": Health
Output: rule-based classifier
IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health
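A minimal sketch of how such rules classify a document — this is a hypothetical illustration of applying the rule set above, not RIPPER's actual implementation:

```python
# Each rule: (set of words that must all appear, category).
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document):
    """Return the category of the first rule whose words all occur in the document."""
    words = set(document.lower().split())
    for required, category in RULES:
        if required <= words:  # all rule words present
            return category
    return None  # no rule fired

print(classify("Jordan was the leader of the Chicago Bulls"))  # Sports
```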
Precision and Recall of Document Classifier
During the training phase:
100 documents about computers
The "Computers" rules matched 50 docs
Of these 50 docs, 40 were about computers
Precision = 40/50 = 0.8Recall = 40/100 = 0.4
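The same computation in code, using the counts from the training example above:

```python
# Precision/recall for the "Computers" rules from the training phase.
matched = 50    # docs the rules matched
correct = 40    # of those, actually about computers
relevant = 100  # training docs about computers

precision = correct / matched   # 0.8
recall = correct / relevant     # 0.4
print(precision, recall)  # 0.8 0.4
```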
From Document to Database Classification
If we know the categories of all the documents, we are done!
But databases do not export such data!
How can we extract this information?
Our Approach: Query Probing
Design a small set of queries to probe the databases
Categorize the database based on the probing results
Designing and Implementing Query Probes
The probes should extract information about the categories of the documents in the database
Start with a document classifier (RIPPER)
Transform each rule into a query:
IF lung AND cancer THEN Health → +lung +cancer
IF linux THEN Computers → +linux
Get the number of matches for each query
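The rule-to-query transformation can be sketched as follows (a hypothetical helper, assuming simple conjunctive rules):

```python
# Turn a rule's antecedent words into a boolean probe query,
# e.g. IF lung AND cancer THEN Health  ->  "+lung +cancer"
def rule_to_query(antecedent_words):
    return " ".join("+" + w for w in antecedent_words)

print(rule_to_query(["lung", "cancer"]))  # +lung +cancer
print(rule_to_query(["linux"]))           # +linux
```

Each probe is sent to the database's search interface, and only the reported number of matches is recorded; no documents need to be retrieved.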
Three Categories and Three Databases

Probe queries and their categories:
lung AND cancer → Health
jordan AND bulls → Sports
linux → Computers

Number of matches per database:

         ACM DL   NBA.com   PubMed
comp        336         0       16
sports        0      6674        0
health       18       103    81164
Using the Results for Classification
Coverage (COV):

         ACM    NBA    PubM
comp     336      0      16
sports     0   6674       0
health    18    103   81164

Specificity (SPEC):

          ACM     NBA    PubM
comp     0.95       0       0
sports      0   0.985       0
health   0.05   0.015     1.0

[Bar chart: specificity of comp/sports/health for ACM, NBA, PubMed]

We use the results to estimate coverage and specificity values
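The specificity estimates follow from the coverage matrix by normalizing each database's column by its total number of matches — a sketch using the numbers from the tables above:

```python
# Coverage matrix: matches per (database, category) from the probes.
coverage = {
    "ACM":  {"comp": 336, "sports": 0,    "health": 18},
    "NBA":  {"comp": 0,   "sports": 6674, "health": 103},
    "PubM": {"comp": 16,  "sports": 0,    "health": 81164},
}

# Specificity: each database's counts divided by its total matches.
specificity = {
    db: {cat: n / sum(counts.values()) for cat, n in counts.items()}
    for db, counts in coverage.items()
}

print(round(specificity["ACM"]["comp"], 2))    # 0.95
print(round(specificity["NBA"]["sports"], 3))  # 0.985
```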
Adjusting Query Results
Classifiers are not perfect!
Queries do not "retrieve" all the documents that belong to a category
Queries for one category "match" documents that do not belong to it
From the training phase of the classifier we use precision and recall
Precision & Recall Adjustment
Computers category:
Rule "linux": precision = 0.7
Rule "cpu": precision = 0.9
Recall (over all rules) = 0.4
Probing with queries for "Computers":
Query +linux: X1 matches → 0.7·X1 correct matches
Query +cpu: X2 matches → 0.9·X2 correct matches
From the X1 + X2 documents found:
Expect 0.7·X1 + 0.9·X2 to be correct
Expect (0.7·X1 + 0.9·X2) / 0.4 total Computers docs
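The adjustment above as a formula in code — the match counts X1 = 100 and X2 = 50 are hypothetical, chosen only to illustrate the arithmetic:

```python
# Scale each query's match count by its rule's precision, then divide
# by the category-level recall to estimate the true number of docs.
def adjusted_total(matches_with_precision, recall):
    correct = sum(x * p for x, p in matches_with_precision)
    return correct / recall

# Hypothetical probe results: +linux matched 100 docs, +cpu matched 50.
est = adjusted_total([(100, 0.7), (50, 0.9)], recall=0.4)
print(est)  # (0.7*100 + 0.9*50) / 0.4 = 287.5
```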
Initial Experiments
Used a collection of 20,000 newsgroup articles
Formed 5 categories:
Computers (comp.*)
Science (sci.*)
Hobbies (rec.*)
Society (soc.* + alt.atheism)
Misc (misc.sale)
RIPPER trained with 10,000 newsgroup articles
Classifier: 29 rules, 32 words used
IF windows AND pc THEN Computers (precision ≈ 0.75)
IF satellite AND space THEN Science (precision ≈ 0.9)
Web Databases Probed
Using the newsgroup classifier we probed four web databases:
Cora (www.cora.jprc.com): CS papers archive (Computers)
American Scientist (www.amsci.org): science and technology magazine (Science)
All Outdoors (www.alloutdoors.com): articles about outdoor activities (Hobbies)
Religion Today (www.religiontoday.com): news and discussion about religions (Society)
Results
[Table and bar chart: number of matches and specificity per category (Computers, Science, Hobbies, Society, Misc) for Cora, American Scientist, All Outdoors, Religion Today]
Only 29 queries per web site
No need for document retrieval!
Conclusions
Easy classification using only a small number of queries
No need for document retrieval
Only need a result like "X matches found"
Not limited to search-only databases
Every searchable database can be classified this way
Not limited to topical classification
Current Issues
Comprehensive classification scheme
Representative training data
Future Work
Use a hierarchical classification scheme
Test different search interfaces:
Boolean model
Vector-space model
Different capabilities
Compare with document sampling (Callan et al.'s work, SIGMOD 1999, adapted for the classification task)
Study classification efficiency when documents are accessible
Related Work
Gauch (JUCS 1996)
Etzioni et al. (JIIS 1997)
Hawking & Thistlewaite (TOIS 1999)
Callan et al. (SIGMOD 1999)
Meng et al. (CoopIS 1999)