Document Classification Techniques using LSI

transcript

Barry BrittUniversity of Tennessee

PILOT PresentationSpring 2007

Introduction

Automatic Classification of Documents

The problem: Brought to Dr. Berry by InRAD, LLC Develop a system that can automatically score

proposals (NSF, SBIR) Proposals are scored by the authors. The current system grades proposals based on the

writing skill of the author The solution:

An automatic system for classifying readiness levels.

LSI and GTP reduced rank vector space model Queries

are reduced rank approximations Produce contextually relevant results Semantically linked terms and documents are grouped

together GTP is an implementation of LSI

Local and global weighting and document frequencies Document normalization Much more…

Document Sets - Composition Consist of Technology Readiness Reports

Proposals Subjective score from 1 (low) to 9 (high) No structure to documents

Example (from DS1):windshield windshield windshield windshield windshield windshield windshield triphammer triphammer night night ithaca ithaca fleet fleet airline airline worthy window william warn visual visible variant snow severe retrofit reflect red prior primarily popular owner outside oem look imc gages day dangerous cue certification brook analog accumulation accumulate accretion

Document Sets - Composition Document Set 1 (DS1)

4000 documents 85.4 terms per document

Document Set 2 (DS2) 4000 documents 49.2 terms per document

Document Set 3 (DS3) 455 documents 37.1 terms per document

2 Classes - “Phase1” and “Phase2”

Document Sets - Labels

# Class 1 # Class 2 % Class 1 % Class 2 DS1 3244 756 81.10% 19.90% DS2 2955 1045 73.88% 26.13% DS3 166 289 36.48% 63.52%

Figure 1-b: Document Set Class information

Class labels for the individual documents are determined by the authors of the proposals…

Document Classification

Classification - Majority Rules

3 Steps to classification: Choose a document and make it the query

document Retrieve the x closest documents; documents

with the highest cosine similarity values Class = max[n1,n2]

Majority Rules - Results Phase 1 Phase 2

Phase 1 3239 5 Phase 2 752 4

Figure 2-a: "Majority Rules" confusion matrix for DS1

Phase 1 Phase 2 Phase 1 2875 80 Phase 2 833 212

Figure 2-b: "Majority Rules" confusion matrix for DS2

Phase 1 Phase 2 Phase 1 98 68 Phase 2 53 236

Figure 2-c: "Majority Rules" confusion matrix for DS3

Actual

Predicted

Actual

Predicted

Majority Rules - Results

Why are these results not good? Good representation from Class 1 Very poor representation from Class 2

How can we improve results in the underrepresented class?

Class 1 Class 2 Overall DS1 99.85% 0.53% 81.08% DS2 97.29% 20.29% 77.18% DS3 59.04% 81.66% 73.41%

Figure 2-d: Precision calculations for DS1, DS2, and DS3

Classification - Class Weighting

Add a “weight” to our classification. Steps:

Choose a document and make it the query document

Retrieve the x closest documents; documents with the highest cosine similarity values

Class = max[weight1 * n1,weight2 * n2] Each class has its own separate weight

Weighted Classifier - Results

Actual

Predicted

Actual

Predicted

Actual

Predicted

Phase 1 Phase 2 WeightsPhase 1 2355 889 1Phase 2 323 433 4

(Figure 1-a: Weighted classification confusion matrix for DS1)

(Figure 2-a: Weighted classification confusion matrix for DS1)

(Figure 2-b: Weighted classification confusion matrix for DS2)

Weighted Classifier - Results

Better classifier Still a good representation from majority class Better representation from minority class

We can still improve on these results for the minority class.

“Weight - Document Size” (WS) Classifier

Problem: Minority class still underrepresented

Hypothesis: Documents in the same class will have similar

“sizes”, or total number of relevant terms. Solution:

Account for document size in results for the Weighted Classifier

Only consider documents within x total words of the query document

Steps: Choose a document and make it the query document Retrieve the x closest documents within n number of

words of the query document Class = max[weight1 * n1,weight2 * n2]

Each class, like the regular weighted classifier, has its own weight value

Actual

Predicted

Actual

Predicted

Actual

Predicted

(Figure 2-a: Weight/Size (size=3) classification confusion matrix for DS3)

Best classifier so far Good representation from both classes Best representation so far from the minority

class Can we improve this further?

Term Classifier Rather than classifying based on similar

documents, classify based on similar terms. Steps:

Analyze the terms in each document, and the class of those documents.

Choose a document and make it the query document Retrieve the x closest documents (note: we are not

accounting for document size) Class = max[weight1 * n1,weight2 * n2]

Again, each class has its own weight.

Term Classifier

Class 1 and Class 2words

Class 1 words Class 2 words

In one of our document sets, the list of exclusive wordswas less than 3% of the total words.

Term Classifier

Take the exclusive words list. If a document clusters near a “Phase1” exclusive word, classify it as “Phase1”, and vice versa

We can use this information to produce an alternate classification.

Term Classifier

Actual

Predicted

Actual

Predicted

Actual

Predicted

(Figure 5-a: Term classification confusion matrix for DS1)

(Figure 5-b: Term classification confusion matrix for DS2)

(Figure 5-c: Term classification confusion matrix for DS3)

Term Classifier

Comparable to the WS Classifier Better for DS3, probably because it is a much smaller

set Not good for Class 2 in the other sets.

The real value lies in reclassification.

Document Reclassification

The Term Classifier correctly identifies some documents missed by the WS Classifier.

Confidence Value: If a classification of the WS classifier does

not have a high confidence value, then check to see what the Term classifier says.

c = vi / vjj =0

Document Reclassification

Technique is good for checking small numbers of documents.

Technique is not good for completely reclassifying an entire set.

Classification Class Scheme

Weighted Class 1

Weighted Class 2

Confidence Predicted Actual

WS 3997 5 4 0.556 0 1 Term 3997 8 16 0.667 1 1

Figure 6 - An example of reclassification of a document from DS1

Related Work

Java GUI Front End

Developed in Spring 2007 Helps by providing a stable interface

through which to run GTP and classify documents.

Can “save state”, saves LSI model and all internal data structures for later use.

All tables used in this document were generated by this program.

Windows GTP

Direct port of GTP from UNIX to Windows Developed on Windows XP, SP2 Completely self-contained, doesn’t require

external programs or shared libraries Sorting parsed words:

Original GTP uses UNIX sort command… Windows GTP uses an external merge sort…

Acknowledgements

These people and groups assisted by providing their knowledge and experiences to the project Dr. Michael Berry Murray Browne Mary Ann Merrell Nathan Fisher The InRAD staff

References “Using Linear Algebra for Intelligent Information

Retrieval.” Michael W. Berry, Susan T. Dumais, and Gavin W. O’Brien, December 1994. Published in SIAM Review 37:4 (1995), pp. 573-595.

Understanding Search Engines: Mathematical Modeling and Text Retrieval. M. Berry and M. Browne, SIAM Book Series: Software, Environments, and Tools. (2005), ISBN 0-89871-581-4.

“GTP (General Text Parser) Software for Text Mining.” J. T. Giles, L. Wo, and M. W. Berry, Statistical Data Mining and Knowledge Discovery. H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.