Post on 11-Jan-2016
Automatic Classification of Text Databases Through Query Probing
Panagiotis G. Ipeirotis, Luis Gravano
Columbia University
Mehran Sahami
E.piphany Inc.
Search-only Text Databases
Sources of valuable information
Hidden behind search interfaces
Non-crawlable
Example: Microsoft Support KB
Interacting With Searchable Text Databases
1. Searching: Metasearchers
2. Browsing: Use Yahoo-like directories
3. Browse & search: “Category-enabled” metasearchers
Searching Text Databases: Metasearchers
Select the good databases for a query
Evaluate the query at these databases
Combine the query results from the databases
Examples: MetaCrawler, SavvySearch, Profusion
Browsing Through Text Databases
Yahoo-like web directories:
InvisibleWeb.com
SearchEngineGuide.com
TheBigHub.com
Example from InvisibleWeb.com: Computers > Publications > ACM DL
Category-enabled metasearchers: user-defined category (e.g. Recipes)
Problem With Current Classification Approach
Classification of databases is done manually
This requires a lot of human effort!
How to Classify Text Databases Automatically: Outline
Definition of classification
Strategies for classifying searchable databases through query probing
Initial experiments
Database Classification: Two Definitions
Coverage-based classification:
The database contains many documents about the category (e.g. Basketball)
Coverage: number of documents about the category
Specificity-based classification:
The database contains mainly documents about the category
Specificity: (number of documents about the category) / |DB|
Database Classification: An Example
Category: Basketball
Coverage-based classification: ESPN.com, NBA.com
Specificity-based classification: NBA.com, but not ESPN.com
Categorizing a Text Database: Two Problems
Find the category of a given document
Find the category of all the documents inside the database
Categorizing Documents
Several text classifiers available
RIPPER (AT&T Research, William Cohen, 1995)
Input: a set of pre-classified, labeled documents
Output: a set of classification rules
Categorizing Documents: RIPPER
Training set: preclassified documents
"Linux as a web server": Computers
"Linux vs. Windows: …": Computers
"Jordan was the leader of the Chicago Bulls": Sports
"Smoking causes lung cancer": Health
Output: rule-based classifier
IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health
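A minimal sketch of how such rules classify a document — this is a hypothetical illustration of applying the rule set above, not RIPPER's actual implementation:

```python
# Each rule: (set of words that must all appear, category).
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document):
    """Return the category of the first rule whose words all occur in the document."""
    words = set(document.lower().split())
    for required, category in RULES:
        if required <= words:  # all rule words present
            return category
    return None  # no rule fired

print(classify("Jordan was the leader of the Chicago Bulls"))  # Sports
```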
Precision and Recall of Document Classifier
During the training phase:
100 documents about computers
The "Computers" rules matched 50 docs
Of these 50 docs, 40 were about computers
Precision = 40/50 = 0.8Recall = 40/100 = 0.4
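The same computation in code, using the counts from the training example above:

```python
# Precision/recall for the "Computers" rules from the training phase.
matched = 50    # docs the rules matched
correct = 40    # of those, actually about computers
relevant = 100  # training docs about computers

precision = correct / matched   # 0.8
recall = correct / relevant     # 0.4
print(precision, recall)  # 0.8 0.4
```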
From Document to Database Classification
If we know the categories of all the documents, we are done!
But databases do not export such data!
How can we extract this information?
Our Approach: Query Probing
Design a small set of queries to probe the databases
Categorize the database based on the probing results
Designing and Implementing Query Probes
The probes should extract information about the categories of the documents in the database
Start with a document classifier (RIPPER)
Transform each rule into a query:
IF lung AND cancer THEN Health → +lung +cancer
IF linux THEN Computers → +linux
Get the number of matches for each query
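The rule-to-query transformation can be sketched as follows (a hypothetical helper, assuming simple conjunctive rules):

```python
# Turn a rule's antecedent words into a boolean probe query,
# e.g. IF lung AND cancer THEN Health  ->  "+lung +cancer"
def rule_to_query(antecedent_words):
    return " ".join("+" + w for w in antecedent_words)

print(rule_to_query(["lung", "cancer"]))  # +lung +cancer
print(rule_to_query(["linux"]))           # +linux
```

Each probe is sent to the database's search interface, and only the reported number of matches is recorded; no documents need to be retrieved.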
Three Categories and Three Databases

Probe queries and their categories:
lung AND cancer → Health
jordan AND bulls → Sports
linux → Computers

Number of matches per database:

         ACM DL   NBA.com   PubMed
comp        336         0       16
sports        0      6674        0
health       18       103    81164
Using the Results for Classification
Coverage (COV):

         ACM    NBA    PubM
comp     336      0      16
sports     0   6674       0
health    18    103   81164

Specificity (SPEC):

          ACM     NBA    PubM
comp     0.95       0       0
sports      0   0.985       0
health   0.05   0.015     1.0

[Bar chart: specificity of comp/sports/health for ACM, NBA, PubMed]

We use the results to estimate coverage and specificity values
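The specificity estimates follow from the coverage matrix by normalizing each database's column by its total number of matches — a sketch using the numbers from the tables above:

```python
# Coverage matrix: matches per (database, category) from the probes.
coverage = {
    "ACM":  {"comp": 336, "sports": 0,    "health": 18},
    "NBA":  {"comp": 0,   "sports": 6674, "health": 103},
    "PubM": {"comp": 16,  "sports": 0,    "health": 81164},
}

# Specificity: each database's counts divided by its total matches.
specificity = {
    db: {cat: n / sum(counts.values()) for cat, n in counts.items()}
    for db, counts in coverage.items()
}

print(round(specificity["ACM"]["comp"], 2))    # 0.95
print(round(specificity["NBA"]["sports"], 3))  # 0.985
```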
Adjusting Query Results
Classifiers are not perfect!
Queries do not "retrieve" all the documents that belong to a category
Queries for one category "match" documents that do not belong to it
From the training phase of the classifier we use precision and recall
Precision & Recall Adjustment
Computers category:
Rule "linux": precision = 0.7
Rule "cpu": precision = 0.9
Recall (over all rules) = 0.4
Probing with queries for "Computers":
Query +linux: X1 matches → 0.7·X1 correct matches
Query +cpu: X2 matches → 0.9·X2 correct matches
From the X1 + X2 documents found:
Expect 0.7·X1 + 0.9·X2 to be correct
Expect (0.7·X1 + 0.9·X2) / 0.4 total Computers docs
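The adjustment above as a formula in code — the match counts X1 = 100 and X2 = 50 are hypothetical, chosen only to illustrate the arithmetic:

```python
# Scale each query's match count by its rule's precision, then divide
# by the category-level recall to estimate the true number of docs.
def adjusted_total(matches_with_precision, recall):
    correct = sum(x * p for x, p in matches_with_precision)
    return correct / recall

# Hypothetical probe results: +linux matched 100 docs, +cpu matched 50.
est = adjusted_total([(100, 0.7), (50, 0.9)], recall=0.4)
print(est)  # (0.7*100 + 0.9*50) / 0.4 = 287.5
```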
Initial Experiments
Used a collection of 20,000 newsgroup articles
Formed 5 categories:
Computers (comp.*)
Science (sci.*)
Hobbies (rec.*)
Society (soc.* + alt.atheism)
Misc (misc.sale)
RIPPER trained with 10,000 newsgroup articles
Classifier: 29 rules, 32 words used
IF windows AND pc THEN Computers (precision ≈ 0.75)
IF satellite AND space THEN Science (precision ≈ 0.9)
Web Databases Probed
Using the newsgroup classifier we probed four web databases:
Cora (www.cora.jprc.com): CS papers archive (Computers)
American Scientist (www.amsci.org): science and technology magazine (Science)
All Outdoors (www.alloutdoors.com): articles about outdoor activities (Hobbies)
Religion Today (www.religiontoday.com): news and discussion about religions (Society)
Results
[Table and bar chart: number of matches and specificity per category (Computers, Science, Hobbies, Society, Misc) for Cora, American Scientist, All Outdoors, Religion Today]
Only 29 queries per web site
No need for document retrieval!
Conclusions
Easy classification using only a small number of queries
No need for document retrieval
Only need a result like "X matches found"
Not limited to search-only databases
Every searchable database can be classified this way
Not limited to topical classification
Current Issues
Comprehensive classification scheme
Representative training data
Future Work
Use a hierarchical classification scheme
Test different search interfaces:
Boolean model
Vector-space model
Different capabilities
Compare with document sampling (Callan et al.'s work, SIGMOD 1999, adapted for the classification task)
Study classification efficiency when documents are accessible
Related Work
Gauch (JUCS 1996)
Etzioni et al. (JIIS 1997)
Hawking & Thistlewaite (TOIS 1999)
Callan et al. (SIGMOD 1999)
Meng et al. (CoopIS 1999)