Text Classification and Images

by Carl Sable

Overview

• Text Classification.

– Involves assigning text documents to one or more groups (classes).

– Techniques can be applied to image captions to classify corresponding images.

• Various methods, evaluation techniques, and related issues will be discussed.

• Some discussion of other research involving image captions.

Text Classification Tasks

• Text Categorization (TC) - Assign text documents to existing, well-defined categories.

• Information Retrieval (IR) - Retrieve text documents which match a user query.

• Clustering - Group text documents into clusters of similar documents.

• Text Filtering - Retrieve documents which match a user profile.

Text Categorization

• Classify each test document by assigning category labels.

– M-ary categorization assumes M labels per document.

– Binary categorization requires yes/no decision for every document/category pair.

• Most techniques require training.

– Parametric vs non-parametric.

– Batch vs on-line.

Early Work

• The Federalist papers.

– Published anonymously in 1787-1788.

– Authorship of 12 papers in dispute (either Hamilton or Madison).

• Mosteller and Wallace, 1963.

– Compared rate per thousand words of high frequency words.

– Collected very strong evidence in favor of Madison.

Rocchio

• All documents and categories represented by word vectors.

• TF*IDF weights for words.

– Term frequency is number of times word appears in document or category.

– Inverse document frequency relates to scarcity of word over entire training collection.

• Similarity computed for all document/category pairs.
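
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the exact formulation from the talk: the toy corpus, whitespace tokenization, idf = log(N / df), summed category vectors, and cosine similarity are all assumptions, and the function names are made up for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf(w) = log(N / df(w)) over the training collection (one common variant)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    return {w: math.log(n / c) for w, c in df.items()}

def tfidf(text, idf):
    """TF*IDF vector for a text; words unseen in training are ignored."""
    tf = Counter(text.split())
    return {w: c * idf[w] for w, c in tf.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(labeled_docs):
    """Build one word vector per category by summing its documents' vectors."""
    idf = idf_table([text for text, _ in labeled_docs])
    cats = {}
    for text, cat in labeled_docs:
        vec = cats.setdefault(cat, Counter())
        for w, x in tfidf(text, idf).items():
            vec[w] += x
    return idf, cats

def classify(text, idf, cats):
    """Return the category whose vector is most similar to the document."""
    doc = tfidf(text, idf)
    return max(cats, key=lambda c: cosine(doc, cats[c]))

# Hypothetical training data, for illustration only.
TRAIN = [
    ("stocks fell on wall street", "business"),
    ("bank profits rose sharply", "business"),
    ("team wins final match", "sports"),
    ("coach praises team effort", "sports"),
]
idf, cats = train(TRAIN)
print(classify("bank stocks rose", idf, cats))
```

For M-ary categorization the single best category is returned, as here; for binary categorization one would instead threshold each document/category similarity.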

Naïve Bayes

• Estimates probabilities of categories given a document.

• Uses joint probabilities of words and categories (Bayes’ rule).

• Assumes words are independent of each other.

• Can incorporate a priori probabilities of categories.
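
A minimal sketch of a Naïve Bayes categorizer along these lines. It assumes a multinomial word model with add-one smoothing and whitespace tokenization, none of which the slide specifies; the training data and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Collect word counts per category and document counts per category."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for text, cat in labeled_docs:
        words = text.split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def log_scores(text, model):
    """Unnormalized log P(cat | doc) = log P(cat) + sum of log P(word | cat).

    Summing per-word terms is exactly the word-independence assumption.
    """
    word_counts, cat_counts, vocab = model
    n_docs = sum(cat_counts.values())
    scores = {}
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / n_docs)  # a priori probability
        for w in text.split():
            if w in vocab:  # skip words never seen in training
                # Add-one smoothing (an assumption) avoids zero probabilities.
                score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        scores[cat] = score
    return scores

def classify_nb(text, model):
    """Return the category with the highest score."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)

# Hypothetical training data, for illustration only.
TRAIN = [
    ("stocks fell on wall street", "business"),
    ("bank profits rose sharply", "business"),
    ("team wins final match", "sports"),
    ("coach praises team effort", "sports"),
]
model = train_nb(TRAIN)
print(classify_nb("team wins", model))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied.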

Other Common Methods

• K-Nearest Neighbor (kNN) - Use k closest training documents to predict category.

• Decision Trees (DTree) - Construct classification trees based on training data.

• Neural Networks (NNet) - Learn non-linear mapping from input words to categories.

• Expert Systems - Use manually constructed, domain-specific, application-specific rules.
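
Of the methods above, kNN is the simplest to illustrate. The sketch below assumes raw word-count vectors, cosine similarity, and an unweighted majority vote among the k closest training documents; these choices and the toy data are assumptions, and weighted voting is a common alternative.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector with raw counts."""
    return Counter(text.split())

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(text, labeled_docs, k=3):
    """Predict the majority category among the k most similar training documents."""
    doc = bow(text)
    ranked = sorted(labeled_docs,
                    key=lambda pair: cosine(doc, bow(pair[0])),
                    reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data, for illustration only.
TRAIN = [
    ("stocks fell on wall street", "business"),
    ("bank profits rose sharply", "business"),
    ("team wins final match", "sports"),
    ("coach praises team effort", "sports"),
]
print(knn_classify("team wins match", TRAIN, k=3))
```

Note that kNN has no separate training phase: all of the work happens when a test document is classified.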

Advanced Techniques

• Support Vector Machines (SVMs).

– Use Structural Risk Minimization principle.

– Find hypothesis which minimizes “true error”.

• Widrow-Hoff and EG - Update weight vector based on each training example.

• Maximum Entropy - Derive constraints expressing characteristics of training data.

• Boosting - Combine weak hypotheses to produce highly accurate classification rule.
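
To make the "update weight vector based on each training example" idea concrete, here is a sketch of one on-line step for Widrow-Hoff (additive) and EG (exponentiated gradient, multiplicative). It assumes the squared-loss forms commonly given for these learners; the learning rate eta and the example values are assumptions.

```python
import math

def dot(w, x):
    """Inner product of weight vector and example."""
    return sum(wi * xi for wi, xi in zip(w, x))

def widrow_hoff_step(w, x, y, eta=0.1):
    """Additive update: w_i <- w_i - 2*eta*(w.x - y)*x_i."""
    err = dot(w, x) - y
    return [wi - 2 * eta * err * xi for wi, xi in zip(w, x)]

def eg_step(w, x, y, eta=0.1):
    """Multiplicative update, renormalized so the weights sum to 1."""
    err = dot(w, x) - y
    raw = [wi * math.exp(-2 * eta * err * xi) for wi, xi in zip(w, x)]
    z = sum(raw)
    return [r / z for r in raw]

# One training example: feature vector x with target label y.
x, y = [1.0, 0.0], 1.0
wh = widrow_hoff_step([0.0, 0.0], x, y)  # starts from the zero vector
eg = eg_step([0.5, 0.5], x, y)           # starts from uniform weights
print(wh, eg)
```

Both steps move the prediction w.x toward y; they differ in that Widrow-Hoff adds a correction to each weight while EG rescales each weight and keeps the vector normalized.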

Common Test Corpora

• Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories.

• TREC-AP - Newswire stories from 1988 to 1990, labeled with categories.

• OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned.

• UseNet newsgroups.

• WebKB - Web pages gathered from university CS departments.

Other Issues to Consider

• Which words to use (feature selection).

• Normalization.

• Use of lexical databases.

– Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA).

– May cause problems due to lexical ambiguity.

• High cost of manual labels.

Categorizing Images

• Some previous research on content-based image categorization, very little on text-based image categorization!

• WebSEEk.

– Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names.

– Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.

Evaluation Metrics

• Per Category Measures:

– Simple accuracy or error measures can be misleading.

– Precision, recall, and fallout.

– F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs Micro-averaging.

• Should choose metric ahead of time (maybe)!

Contingency table:

                 Yes is correct    No is correct
Assigned YES           a                 b
Assigned NO            c                 d

n = a + b + c + d

Precision: p = a / (a + b)

Recall: r = a / (a + c)

Fallout: f = b / (b + d)

Accuracy: Acc = (a + d) / n

Error: Err = (b + c) / n
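
The per-category measures and the two averaging schemes can be sketched as follows. Here a, b, c, d are the contingency counts (a: assigned YES and YES is correct; b: assigned YES but NO is correct; c: assigned NO but YES is correct; d: assigned NO and NO is correct). The F-measure shown is the balanced F1, which is an assumption, and the counts are invented for illustration.

```python
def measures(a, b, c, d):
    """Per-category measures from the four contingency counts."""
    n = a + b + c + d
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return {
        "precision": p,
        "recall": r,
        "fallout": b / (b + d) if b + d else 0.0,
        "f1": 2 * p * r / (p + r) if p + r else 0.0,
        "accuracy": (a + d) / n,
        "error": (b + c) / n,
    }

def macro_average(tables, key):
    """Average the per-category scores, so every category counts equally."""
    return sum(measures(*t)[key] for t in tables) / len(tables)

def micro_average(tables, key):
    """Sum the counts over all categories first, then compute the score once."""
    return measures(*(sum(col) for col in zip(*tables)))[key]

# Hypothetical (a, b, c, d) counts for a frequent and a rare category.
TABLES = [(90, 10, 10, 890), (1, 0, 9, 990)]
rare = measures(*TABLES[1])
print(rare["accuracy"], rare["recall"])
print(macro_average(TABLES, "recall"), micro_average(TABLES, "recall"))
```

The rare category shows why accuracy can mislead: a system that finds only 1 of its 10 documents still scores above 99% accuracy. Macro-averaging lets that category pull recall down sharply, while micro-averaging is dominated by the frequent category.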

Some Results and Analysis

• Comparisons.

– SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance.

– Naïve Bayes and Rocchio tended to show relatively poor performance.

• Rocchio possibly could have done better.

– Should be using probabilistic Rocchio.

– Works best if categories are mutually exclusive.

– May perform at its best when there are only 2 categories.

Information Retrieval

• User inputs query, system should retrieve all relevant documents.

• Simple technique: keyword search.

• Other techniques rely on word vectors.

– TF*IDF commonly used for weights.

– Can compute similarity between query vector and document vectors.

• Evaluation - Similar to text categorization, treat relevant documents as single category.
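
A minimal sketch of vector-based retrieval as outlined above: documents and the query become TF*IDF vectors, and documents are ranked by cosine similarity to the query. The tiny collection, the idf = log(N / df) variant, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical document collection, for illustration only.
DOCS = [
    "president signs trade bill",
    "team wins trade dispute",
    "president visits flood victims",
]

def vectorize(text, idf):
    """TF*IDF vector; words outside the collection vocabulary are ignored."""
    return {w: c * idf[w] for w, c in Counter(text.split()).items() if w in idf}

def build_index(docs):
    """Compute idf over the collection and one vector per document."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    idf = {w: math.log(n / c) for w, c in df.items()}
    return idf, [vectorize(d, idf) for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, idf, vectors):
    """Return indices of documents with nonzero similarity, best match first."""
    q = vectorize(query, idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for s, i in sorted(scores, reverse=True) if s > 0]

idf, vectors = build_index(DOCS)
print(search("president flood", idf, vectors))
```

Unlike plain keyword search, the ranking rewards documents that match the rarer query word ("flood") more than those matching only the common one ("president").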

Relevance Feedback

• After initial retrieval, user makes relevance judgements for retrieved documents.

• New round of retrieval based on feedback.

• Similar to text categorization with two categories: relevant vs non-relevant.

• Rocchio algorithm originally created for this task.

• Naïve Bayes very successful.
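
The Rocchio feedback step is usually written as q' = alpha*q + beta*(mean of relevant vectors) - gamma*(mean of non-relevant vectors). The sketch below follows that textbook form; the weights alpha, beta, gamma, the clipping of non-positive weights, and the example vectors are assumptions rather than values from the talk.

```python
from collections import Counter

def centroid(vectors):
    """Mean of a non-empty list of sparse word vectors."""
    total = Counter()
    for v in vectors:
        for w, x in v.items():
            total[w] += x
    return {w: x / len(vectors) for w, x in total.items()}

def rocchio_feedback(query, relevant, nonrelevant,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    new_q = Counter()
    for w, x in query.items():
        new_q[w] += alpha * x
    if relevant:
        for w, x in centroid(relevant).items():
            new_q[w] += beta * x
    if nonrelevant:
        for w, x in centroid(nonrelevant).items():
            new_q[w] -= gamma * x
    # Words whose weight is not positive are dropped from the new query.
    return {w: x for w, x in new_q.items() if x > 0}

# Hypothetical query and user judgements, for illustration only.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "jungle": 1.0}]
nonrelevant = [{"jaguar": 1.0, "car": 1.0}]
print(rocchio_feedback(query, relevant, nonrelevant))
```

The revised query picks up "cat" and "jungle" from the documents judged relevant, and it is this expanded vector that drives the new round of retrieval.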

Possible Improvements

• Lexical databases sometimes used for query expansion.

• Word sense disambiguation.

– Expand query with correct senses.

– Used on documents to prevent retrieval based on false matches.

• Notion of semantic similarity.

Retrieval of Captioned Images

• Typical properties of image captions:

– Shorter than documents in typical IR tasks.

– Subject noun phrase usually denotes most significant object in picture.

– In news domain, first sentence generally describes image, rest is background.

• Different types of queries.

• Many techniques from general IR not applicable.

Related Research

• Smeaton.

– Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links.

– Computed semantic similarity between nouns.

– Some success improving image retrieval.

• Guglielmo and Rowe.

– Used logical form records to capture meaning of queries and captions for comparison.

– System significantly beat keyword search.

Other Text Classification Tasks

• Clustering documents.

– Create groups with similar attributes.

– Various methods and algorithms exist.

– Hierarchical vs non-hierarchical.

– Each group has centroid.

– Can aid in Information Retrieval.

• Text Filtering.

– Filter articles of potential interest for a user.

– Uses many of the same methods as TC and IR.

Processing Image Captions

• The Correspondence Problem - How to correlate visual information with words.

– Visual semantics.

– Symbolic representation of visual data.

• Srihari.

– Piction - System that automatically identifies human faces in captioned newspaper photos.

– Integrates NLP module which parses captions with IU module that detects objects.

Final Observations

• Previous Work.

– General text categorization studied extensively.

– Some research on text-based image retrieval.

– Very little research involving text-based image categorization.

• Image captions contain information unlikely to be extracted from just images.

• High potential exists for significant research involving text-based image categorization.
