Department of Electrical Engineering and Computer Science

Text Classification Combining Clustering and Hierarchical Approaches

Shankar Ranganathan

MS Thesis Defense

May 3rd, 2004

Committee
- Dr. Susan Gauch (Chair)
- Dr. Perry Alexander
- Dr. David Andrews

Presentation Outline
- Search Engines Today
- Contributions
- Related Work
- Text Classification – Our Approach
- Experiments and Evaluation
- Conclusions
- Future Work

Search Engines Today
- Return results based on simple keyword matches.
- No regard for conceptual information.
- For example, if the query is “SALSA”, is it……

KeyConcept Architecture

Contributions

- Novel approach to text classification that combines clustering within the concepts with hierarchical text classification
- Effect of clustering on flat classification versus hierarchical classification
- Effect of ignoring versus using concept-wise distinctions lower down the hierarchy

Related Work I: Text Classification
- Yang, Sebastiani: comparison of text classification methods (k-nearest neighbors, linear least squares fit, naïve Bayes, support vector machines, decision trees)
- Hierarchical classification: proposed by Koller; further work by Sun, Labrou, Sasaki, Dumais, Wang

Related Work II
- Chaffee, YAHOO, Open Directory Project: ontology
- Manning, Dubes, Kaufman: document clustering
- Agglomerative (Guha, Karypis) vs. divisive (Zhao)
- Many packages available on the net: Cluto, Chameleon, Rock, Cure, DocCluster, Siftware, etc.
- Perkowitz: cluster mining

Text Classification
- Two-step process: training the classifier and classifying new documents
- Training phase:
  - The classifier is fed documents that have been classified manually
  - It learns the features (vocabulary) of the various categories into which new documents can be classified

Text Classification contd…

Classification phase: the classifier assigns one or more categories to each new document, based on the similarity between the features of the input document and those of the categories learned during training.

Text Classification – Our Approach
- Vector space model (tf-idf)
- Training data are documents manually assigned to categories in the Open Directory Project’s Standard Tree, which is our reference ontology
- The classifier creates a vector of vocabulary terms and associated weights, stored in an inverted file
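A minimal sketch of how tf-idf weights might be stored in an inverted file. It is not the thesis implementation: merging each concept's documents into one bag of words, the plain `tf * log(N / df)` weighting, the function name `build_inverted_index`, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(concept_docs):
    """Map each term to {concept: tf-idf weight}.

    concept_docs: dict of concept name -> list of training documents (strings).
    Assumption: each concept's documents are merged into one bag of words,
    so a concept plays the role a document plays in ordinary tf-idf.
    """
    term_counts = {
        concept: Counter(word for doc in docs for word in doc.lower().split())
        for concept, docs in concept_docs.items()
    }
    n_concepts = len(term_counts)
    # Number of concepts whose training text contains each term.
    concept_freq = Counter(term for counts in term_counts.values() for term in counts)

    index = defaultdict(dict)
    for concept, counts in term_counts.items():
        for term, tf in counts.items():
            idf = math.log(n_concepts / concept_freq[term])
            if idf > 0:  # a term found in every concept gets weight 0 and is skipped
                index[term][concept] = tf * idf
    return dict(index)

# Hypothetical toy training data, not taken from the ODP.
training = {
    "Arts/Music": ["salsa dance music band", "music concert band"],
    "Home/Cooking": ["salsa recipe tomato", "tomato sauce recipe"],
}
index = build_inverted_index(training)
```

In this toy data "salsa" occurs in both concepts, so it receives no weight, while "music" is indexed under the music concept only.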

Standard Tree

Text Classification – Our Approach contd…
- Feature selection during training (selecting training documents) plays a primary role in improving classification accuracy.
- Hierarchical classification
- Use of clustering

Flat Classification vs. Hierarchical Classification

[Figure: node diagrams contrasting hierarchical classification and flat classification over nodes numbered 1 to 6, with an example hierarchy: Top; Arts, Business; Music, TV, Employment]

Top-down, level-based hierarchical classification
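A minimal sketch of top-down, level-based classification: one decision is made at each level, and only the children of the chosen concept are considered at the next. The parent/child assignments in `tree` are assumed from the example hierarchy, and the keyword sets and the `overlap` scorer are hypothetical stand-ins for the trained concept vectors and their similarity measure.

```python
def classify_top_down(doc_terms, tree, score, root="Top"):
    """Walk the concept tree from the root, picking the best child at each level.

    tree:  dict of concept -> list of child concepts (leaves map to []).
    score: function (doc_terms, concept) -> similarity, higher is better.
    Returns the path of chosen concepts, excluding the root.
    """
    path = []
    node = root
    while tree.get(node):  # stop once a leaf (no children) is reached
        node = max(tree[node], key=lambda child: score(doc_terms, child))
        path.append(node)
    return path

# Assumed parent/child layout for the example hierarchy.
tree = {
    "Top": ["Arts", "Business"],
    "Arts": ["Music", "TV"],
    "Business": ["Employment"],
    "Music": [], "TV": [], "Employment": [],
}

# Hypothetical concept vocabularies, for illustration only.
keywords = {
    "Arts": {"music", "band", "show", "episode"},
    "Business": {"job", "salary", "company"},
    "Music": {"music", "band"},
    "TV": {"show", "episode"},
    "Employment": {"job", "salary"},
}

def overlap(doc_terms, concept):
    """Toy similarity: number of document terms in the concept vocabulary."""
    return len(set(doc_terms) & keywords[concept])

path = classify_top_down(["salsa", "band", "music"], tree, overlap)
print(path)  # ['Arts', 'Music']
```

An error at the first level cannot be undone further down the tree in this scheme, since the lower-level decisions only see the children of the concept already chosen.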

Role of Clustering
- Improve feature selection
- Eliminate documents that tend to confuse the classifier
- Identify within-category clusters and extract the representative pages of the cluster(s)
- Document mining within the framework of cluster mining
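One way to picture "representative pages" is to keep the documents nearest to a concept's centroid. The sketch below assumes a single cluster per concept and cosine similarity. The function name `closest_to_centroid` and the toy vectors are hypothetical, and a clustering package would do this work in practice.

```python
import math

def closest_to_centroid(vectors, k):
    """Return indices of the k vectors most cosine-similar to their centroid.

    vectors: list of equal-length lists of term weights, one per document.
    Assumption: the whole concept is treated as one cluster.
    """
    dims = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    return ranked[:k]

# Hypothetical 2-term document vectors; the last one is an outlier.
docs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
print(closest_to_centroid(docs, 2))  # [1, 2]
```

Reversing the sort order would instead select the documents farthest from the centroid, which are the ones most likely to confuse the classifier.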

Text Classification – Our Approach contd…

- During the classification phase, a vector is created for the input document
- Similarity between this vector and the vector of each concept built during training is computed using the dot product
- The new document is assigned to the categories with the best matches
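The classification phase described above can be sketched as a dot product between a document vector and each concept vector, followed by ranking. Raw term frequencies for the document, the function name `rank_concepts`, and the weights in `concept_vectors` are assumptions for illustration.

```python
from collections import Counter

def rank_concepts(doc_text, concept_vectors, top_n=10):
    """Rank concepts by the dot product between document and concept vectors.

    concept_vectors: dict of concept -> {term: weight}, as built in training.
    Returns up to top_n (concept, score) pairs, best match first.
    """
    doc_vector = Counter(doc_text.lower().split())  # raw term frequencies
    scores = {
        concept: sum(tf * weights.get(term, 0.0) for term, tf in doc_vector.items())
        for concept, weights in concept_vectors.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical concept vectors with made-up weights.
concept_vectors = {
    "Arts/Music": {"band": 1.4, "music": 1.4, "dance": 0.7},
    "Home/Cooking": {"recipe": 1.4, "tomato": 1.4},
}
results = rank_concepts("salsa dance band music", concept_vectors)
```

Only terms shared by the document and a concept contribute to that concept's score, which is why an inverted file keyed by term makes the lookup cheap.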

Classifier Output

Experimental Set-up
- Source of training data: Open Directory Project (dmoz.org); the ODP ontology contains hierarchical information
- Test data: randomly selected level 3 documents
- Clustering package: CLUTO
  - Clustering method: partitional clustering
  - Similarity function: cosine function
  - Program used: vcluster -zscores

Experimental Setup contd…
Baseline: Random Selection
- All concepts from levels 1, 2 and 3 with at least 32 documents (total 1484)
- 2 documents from each concept were randomly withheld for testing (total 2978)
- Trained with 30 randomly selected documents from each concept (around 44500)
- Accuracy = 46.6%
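A minimal sketch of the per-concept baseline split: withholding 2 test documents and training on 30 others needs at least 32 documents, which matches the threshold above. The function name `split_concept`, the fixed seed, and the integer stand-ins for documents are assumptions for illustration.

```python
import random

def split_concept(docs, n_test=2, n_train=30, seed=0):
    """Randomly withhold n_test documents, then pick n_train others for training."""
    if len(docs) < n_test + n_train:
        raise ValueError("concept has too few documents for this split")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(docs)
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    train = shuffled[n_test:n_test + n_train]
    return test, train

# Hypothetical concept with 40 documents, represented by integer ids.
test_docs, train_docs = split_concept(list(range(40)))
```

Drawing test and training documents from one shuffled list guarantees that no withheld document is also used for training.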

[Figure: Performance of the flat classifier. x-axis: top n (1 to 10); y-axis: % of documents in top n; series: Baseline - Random Selection]
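Reading the axis label "% of documents in top n" as top-n accuracy, the measure could be computed roughly as follows. This interpretation, the function name `top_n_accuracy`, and the toy rankings are assumptions, not details confirmed by the slides.

```python
def top_n_accuracy(ranked_predictions, true_labels, n):
    """Percentage of documents whose true concept is among the first n predictions."""
    hits = sum(
        1 for ranking, truth in zip(ranked_predictions, true_labels)
        if truth in ranking[:n]
    )
    return 100.0 * hits / len(true_labels)

# Hypothetical rankings for three test documents whose true concept is "A".
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]
for n in (1, 2, 3):
    print(n, round(top_n_accuracy(rankings, truth, n), 1))
```

By construction the value can only stay the same or rise as n grows, since a hit in the top n is also a hit in the top n + 1.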

Evaluation
- Does selecting the documents closest to the centroid for training improve accuracy?
- For hierarchical classification, how far down the hierarchy should we go in each step?
- How many documents should be used to train the classifier to get the best results?
- ‘Ignore’ or ‘consider’ the tree structure among children?

Experiment 1 : Effect of clustering on Flat Classification

[Figure: four training-set selection strategies compared over x-axis values 1 to 10; series: Random, Closest to centroid, Farthest from centroid, Farthest from each other]

- Best observed accuracy: selecting documents closest to the centroid (49.5%)
- Poor performance: selecting documents farthest from the centroid (29.5%)
- Selecting documents farthest from each other: 48.6%

Experiment 2: Effect of clustering on training set selection for hierarchical classification
- 1 classifier at level 1, 15 at level 2, 358 at level 3
- Documents from the parent and its children (and grandchildren) are put in the same pool for selection
- Parameters we tune: depth, random selection vs. clustering, number of documents

[Figure: three example trees with nodes numbered 0 to 10]

Experiment 2a – Study of Level 1 Decision

[Figure: Using Level I documents. x-axis: number of documents (10 to 40); y-axis: % correct match for level I decision; series: Random, Closest to centroid]

- Maximum observed accuracy 15.8%: very poor
- Very few documents at level 1
- So, go deeper…

Experiment 2a – Study of Level 1 Decision contd…

[Figure: Level I Decision Using Level I and II. x-axis: number of documents (10 to 90); y-axis: % correct matches for decision at level 1; series: Random, Closest to centroid]

[Figure: Level I Decision Using Level I, II and III. x-axis: number of documents (10 to 90); y-axis: % correct matches for level one decision; series: Random, Closest to centroid]

Maximum accuracy of 81.6% for the level 1 decision when documents from levels 1, 2 and 3 are used

Expt 2.b: Study of level 2 decision

[Figure: Level II Decision Using Just Level II Documents. x-axis: number of documents (10 to 60); y-axis: % correct matches; series: Random, Closest to centroid]

[Figure: Level I Decision Using Level I, II and III. x-axis: number of documents (10 to 90); y-axis: % correct matches for level one decision; series: Random, Closest to centroid]

Maximum accuracy of 71.3% for the level 2 decision when documents from levels 1, 2 and 3 are used, with 40 training documents per concept.

Expt 2.c: Study of Level 3 Decision

[Figure: Level III Decision Using Level III Documents. x-axis: number of documents (10 to 60); y-axis: % correct matches; series: Random, Closest to centroid]

- Maximum accuracy for random selection = 55.2%
- Maximum accuracy by selecting documents closest to the centroid = 65.4%
- 40.3% relative improvement over baseline
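The 40.3% figure is consistent with measuring the 65.4% result against the 46.6% flat-classifier baseline reported earlier. A short check, where the helper name `relative_improvement` is chosen here for illustration:

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a percentage of the baseline."""
    return 100.0 * (new - baseline) / baseline

# 65.4% (closest to centroid, level 3) against the 46.6% flat baseline.
print(round(relative_improvement(65.4, 46.6), 1))  # 40.3
```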

Expt 3: Effect of clustering on hierarchical classification, distributing training set across sub-concepts

- Documents selected from each sub-concept
- Parameters we plan to tune: depth, number of documents, random vs. closest to the centroid

Experiment 3.a: Level 1 Decision

[Figure: Level One Decision Using Documents Closest to Centroid. x-axis: number of documents (1 to 4); y-axis: % accuracy; series: using documents from level one, up to level 2, up to level 3, up to level 4]

- Including level 4 gives almost the same results as level 3
- 91.2% accuracy with 2 documents closest to the centroid from each concept, down to level 3
- Poor results when using just level 1 or levels 1 and 2

Experiment 3.b: Level 2 Decision
- Using documents from levels 2 and 3 or from levels 2, 3 and 4 yields almost identical results
- We go down to level 3 only, to limit computational time and complexity
- Best observed accuracy: 84.4%, with 2 documents per concept closest to the centroid

[Figure: Level Two Decision Using Documents Closest to Centroid. x-axis: number of documents (1 to 4); y-axis: % accuracy; series: using documents at level 2 only, at levels 2 and 3, at levels 2, 3 and 4]

Experiment 3.c: Level 3 Decision
- Overall best accuracy of 79.1% at level 3, using the one document from each concept that is closest to the centroid

[Figure: Level Three Decision Using Documents Closest to Centroid. x-axis: number of documents (1 to 4); y-axis: % accuracy (73 to 80); series: using documents at level 3 only, at levels 3 and 4]

Training Strategy
- 2 training documents from each concept, down to level 3
- These documents are the ones closest to the centroid in each concept
- Accuracy of 77.9% when we use clustering, compared to 71.8% when we select random documents

[Figure: Two documents per category closest to the centroid. x-axis: decision at level (1 to 3); y-axis: % accuracy; series: Random, Closest to centroid]

Validation Testing

[Figure: Validation Testing. x-axis: level (1 to 3); y-axis: % accuracy; series: Random, Documents closest to the centroid]

- Different test data
- Clustering improves accuracy from 79.7% to 89% at level 1 and final accuracy from 69.8% to 76.2%
- Statistically significant improvement (t-test value = 3.23E-05)

Conclusions
- Maximum accuracy of 77.9% when we use:
  - Hierarchical classification
  - 2 documents closest to the centroid from each concept, down to level 3, to train the classifier

Summary (% accuracy by experiment)
- Flat classification: 46.6
- Flat classification with clustering: 49.5
- Hierarchical classification (putting sub-concepts in the same pool): 65.4
- Hierarchical classification, randomly selecting 2 documents from each sub-concept: 69.8
- Hierarchical classification, selecting the 2 documents from each concept that are closest to the centroid: 77.9

Future Work
- Use of other classifiers, such as SVM
- How to deal with the dynamic web?
- Trials on other data sets
- Recovery mechanism when an error is made at the parent level
- Further ‘divide and conquer’: binary decisions

????’s or !!!!’s

Thank You