Text Classification and Images

Post on 16-Jan-2016

by Carl Sable

Overview

• Text Classification.

– Involves assigning text documents to one or more groups (classes).

– Techniques can be applied to image captions to classify corresponding images.

• Various methods, evaluation techniques, and related issues will be discussed.

• Some discussion of other research involving image captions.

Text Classification Tasks

• Text Categorization (TC) - Assign text documents to existing, well-defined categories.

• Information Retrieval (IR) - Retrieve text documents which match user query.

• Clustering - Group text documents into clusters of similar documents.

• Text Filtering - Retrieve documents which match a user profile.

Text Categorization

• Classify each test document by assigning category labels.

– M-ary categorization assumes M labels per document.

– Binary categorization requires a yes/no decision for every document/category pair.

• Most techniques require training.

– Parametric vs non-parametric.

– Batch vs on-line.

Early Work

• The Federalist papers.

– Published anonymously in 1787 and 1788.

– Authorship of 12 papers in dispute (either Hamilton or Madison).

• Mosteller and Wallace, 1963.

– Compared rate per thousand words of high-frequency words.

– Collected very strong evidence in favor of Madison.
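The rate-per-thousand-words statistic can be sketched as follows. This is only an illustration of the measurement, not a reconstruction of the original study: the function name, the tokenizer, and the choice of words to count are assumptions.

```python
import re
from collections import Counter


def rate_per_thousand(text, words):
    """Return occurrences per 1000 tokens for each word in `words`.

    The lowercase regex tokenizer is an assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: 1000.0 * counts[w] / total for w in words}


# Hypothetical usage: compare the same words across two candidate authors.
sample = "The cat and the dog"
print(rate_per_thousand(sample, ["the", "upon"]))
```

Comparing these rates between texts of known authorship and a disputed text gives a simple stylistic fingerprint.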

Rocchio

• All documents and categories represented by word vectors.

• TF*IDF weights for words.

– Term frequency is the number of times a word appears in a document or category.

– Inverse document frequency relates to the scarcity of a word over the entire training collection.

• Similarity computed for all document/category pairs.
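A minimal sketch of this style of classifier, under stated assumptions: IDF is taken as log(N / document frequency), each category vector is the mean of its training document vectors, and similarity is the cosine. The slides do not specify these details, and all function names here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (list of TF*IDF dicts, idf dict)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    idf = {w: math.log(n / df[w]) for w in df}
    vectors = [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
               for tokens in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def category_centroids(vectors, labels):
    """Average the document vectors of each category (an assumed choice)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for w, x in vec.items():
            sums[label][w] += x
    return {label: {w: x / counts[label] for w, x in s.items()}
            for label, s in sums.items()}


def classify(tokens, idf, centroids):
    """Assign the category whose vector is most similar to the document."""
    vec = {w: tf * idf[w] for w, tf in Counter(tokens).items() if w in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Words unseen in training are ignored at classification time, which is one simple convention among several.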

Naïve Bayes

• Estimates probabilities of categories given a document.

• Uses joint probabilities of words and categories (Bayes’ rule).

• Assumes words are independent of each other.

• Can incorporate a priori probabilities of categories.
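The points above can be sketched as a small multinomial Naïve Bayes classifier. The add-one (Laplace) smoothing and the function names are assumptions for this sketch; the slides do not say which variant or smoothing was used.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_docs = Counter(labels)         # used for the a priori probabilities
    word_counts = defaultdict(Counter)   # word counts per category
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(tokens, model):
    """Return the category with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label, n_label in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are skipped
                # Independence assumption: per-word log likelihoods add up.
                # Add-one smoothing is an assumption of this sketch.
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when documents contain many words.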

Other Common Methods

• K-Nearest Neighbor (kNN) - Use k closest training documents to predict category.

• Decision Trees (DTree) - Construct classification trees based on training data.

• Neural Networks (NNet) - Learn non-linear mapping from input words to categories.

• Expert Systems - Use manually constructed, domain-specific, application-specific rules.
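As an illustration of the first method in this list, here is a minimal kNN sketch. It assumes raw word counts as features, cosine similarity as the closeness measure, and an unweighted majority vote; none of these choices come from the slides.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(x * v.get(w, 0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def knn_predict(tokens, train_docs, train_labels, k=3):
    """Vote among the k training documents most similar to `tokens`."""
    query = Counter(tokens)
    scored = sorted(
        ((cosine(query, Counter(doc)), label)
         for doc, label in zip(train_docs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

kNN needs no training phase beyond storing the documents, but every prediction compares the query against the whole training set.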

Advanced Techniques

• Support Vector Machines (SVMs).

– Use Structural Risk Minimization principle.

– Find hypothesis which minimizes “true error”.

• Widrow-Hoff and EG - Update weight vector based on each training example.

• Maximum Entropy - Derive constraints expressing characteristics of training data.

• Boosting - Combine weak hypotheses to produce highly accurate classification rule.
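The per-example updates of Widrow-Hoff and EG can be sketched as below, using their standard textbook forms: Widrow-Hoff takes an additive gradient step on squared error, while EG (exponentiated gradient) applies a multiplicative update and renormalizes. The learning rate `eta` and the function names are assumptions for illustration.

```python
import math


def widrow_hoff_step(w, x, y, eta=0.1):
    """Additive update: w <- w - 2*eta*(w.x - y)*x."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - 2 * eta * err * xi for wi, xi in zip(w, x)]


def eg_step(w, x, y, eta=0.1):
    """Multiplicative update, then renormalize so the weights sum to 1."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    unnorm = [wi * math.exp(-2 * eta * err * xi) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# One training example: feature vector x with target value y.
print(widrow_hoff_step([0, 0], [1, 0], 1, eta=0.25))  # [0.5, 0.0]
```

Both are on-line learners: the weight vector is revised after every training example rather than after a full pass over the data.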

Common Test Corpora

• Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories.

• TREC-AP - Newswire stories from 1988 to 1990, labeled with categories.

• OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned.

• UseNet newsgroups.

• WebKB - Web pages gathered from university CS departments.

Other Issues to Consider

• Which words to use (feature selection).

• Normalization.

• Use of lexical databases.

– Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA).

– May cause problems due to lexical ambiguity.

• High cost of manual labels.

Categorizing Images

• Some previous research on content-based image categorization, very little on text-based image categorization!

• WebSEEk.

– Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names.

– Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.
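The general idea of mapping key-terms to subjects can be sketched as follows. This is not WebSEEk's actual implementation: the dictionary contents, the subject names, the tokenization rule, and the function names are all hypothetical.

```python
import re

# Hypothetical key-term dictionary; a real one would be built semi-automatically.
KEY_TERM_SUBJECTS = {
    "dog": "animals/dogs",
    "shuttle": "science/space",
    "sunset": "nature/sunsets",
}


def key_terms(url, alt_text=""):
    """Split a URL and alt text into lowercase alphabetic terms."""
    return [t for t in re.split(r"[^a-z]+", (url + " " + alt_text).lower()) if t]


def subjects(url, alt_text=""):
    """Look up each key-term and return the matching subjects, sorted."""
    found = {KEY_TERM_SUBJECTS[t] for t in key_terms(url, alt_text)
             if t in KEY_TERM_SUBJECTS}
    return sorted(found)


print(subjects("http://example.com/animals/dog01.jpg", "A dog"))
```

Because the lookup uses only text surrounding the image, no image content is analyzed at all.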

Evaluation Metrics

• Per Category Measures:

– Simple accuracy or error measures can be misleading.

– Precision, recall, and fallout.

– F-measure, average precision, and break-even point (BEP) combine precision and recall.

• Macro-averaging vs Micro-averaging.

• Should choose metric ahead of time (maybe)!

Contingency table:

                 Yes is correct    No is correct
Assigned YES           a                 b
Assigned NO            c                 d

p = a / (a + b)    (precision)

r = a / (a + c)    (recall)

f = b / (b + d)    (fallout)

Acc = (a + d) / n

Err = (b + c) / n

where n = a + b + c + d.
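The contingency-table measures, and the difference between macro- and micro-averaging, can be sketched as below. The balanced F-measure (F1) is an assumed choice, since the slides do not say which F-measure variant is meant. The function names and the zero-division convention are also assumptions.

```python
def category_metrics(a, b, c, d):
    """Per-category measures from contingency counts a, b, c, d."""
    n = a + b + c + d
    p = a / (a + b) if a + b else 0.0   # precision
    r = a / (a + c) if a + c else 0.0   # recall
    f = b / (b + d) if b + d else 0.0   # fallout
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "fallout": f, "f1": f1,
            "accuracy": (a + d) / n, "error": (b + c) / n}


def macro_micro_precision(tables):
    """tables: list of (a, b, c, d) tuples, one per category."""
    # Macro: average the per-category scores (each category counts equally).
    macro = sum(category_metrics(*t)["precision"] for t in tables) / len(tables)
    # Micro: pool the counts first (each decision counts equally).
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    micro = a / (a + b) if a + b else 0.0
    return macro, micro
```

For a rare category with counts (8, 2, 4, 86), accuracy is 0.94 while recall is only about 0.67, which shows why simple accuracy can be misleading.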

Some Results and Analysis

• Comparisons.

– SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance.

– Naïve Bayes and Rocchio tended to show relatively poor performance.

• Rocchio possibly could have done better.

– Should be using probabilistic Rocchio.

– Works best if categories are mutually exclusive.

– May perform at its best when only 2 categories.

Information Retrieval

• User inputs query, system should retrieve all relevant documents.

• Simple technique: keyword search.

• Other techniques rely on word vectors.

– TF*IDF commonly used for weights.

– Can compute similarity between query vector and document vectors.

• Evaluation - Similar to text categorization, treat relevant documents as single category.

Relevance Feedback

• After initial retrieval, user makes relevance judgements for retrieved documents.

• New round of retrieval based on feedback.

• Similar to text categorization with two categories: relevant vs non-relevant.

• Rocchio algorithm originally created for this task.

• Naïve Bayes very successful.
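A minimal sketch of Rocchio-style relevance feedback in its common textbook form: the new query is the old query moved toward the mean of the relevant documents and away from the mean of the non-relevant ones. The default values of `alpha`, `beta`, and `gamma`, the dropping of non-positive weights, and the function name are assumptions for illustration.

```python
def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query vector (dict of term -> weight).

    query: dict of term weights; relevant / nonrelevant: lists of such dicts
    built from the user's relevance judgements.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    new_query = {}
    for t in terms:
        pos = (sum(d.get(t, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
               if nonrelevant else 0.0)
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # terms with non-positive weight are dropped
            new_query[t] = weight
    return new_query
```

The updated query is then used for the next round of retrieval, so terms from documents judged relevant start to influence the ranking.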

Possible Improvements

• Lexical databases sometimes used for query expansion.

• Word sense disambiguation.

– Expand query with correct senses.

– Used on documents to prevent retrieval based on false matches.

• Notion of semantic similarity.

Retrieval of Captioned Images

• Typical properties of image captions:

– Shorter than documents in typical IR tasks.

– Subject noun phrase usually denotes most significant object in picture.

– In news domain, first sentence generally describes image, rest is background.

• Different types of queries.

• Many techniques from general IR not applicable.

Related Research

• Smeaton.

– Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links.

– Computed semantic similarity between nouns.

– Some success improving image retrieval.

• Guglielmo and Rowe.

– Used logical form records to capture meaning of queries and captions for comparison.

– System significantly beat keyword search.

Other Text Classification Tasks

• Clustering documents.

– Create groups with similar attributes.

– Various methods and algorithms exist.

– Hierarchical vs non-hierarchical.

– Each group has centroid.

– Can aid in Information Retrieval.

• Text Filtering.

– Filter articles of potential interest for a user.

– Uses many of the same methods as TC and IR.
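One non-hierarchical, centroid-based clustering method is k-means, sketched below. The slides do not name a specific algorithm, so choosing k-means is an assumption. So are the squared Euclidean distance and the deterministic initialization from the first k points.

```python
def kmeans(points, k, iters=20):
    """Cluster numeric vectors; returns (assignments, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simple deterministic start
    assignments = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        assignments = []
        for p in points:
            # Assign each point to its nearest centroid.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
            assignments.append(j)
        # Recompute each centroid as the mean of its cluster members.
        updated = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                   else centroids[j]
                   for j, cl in enumerate(clusters)]
        if updated == centroids:  # converged
            break
        centroids = updated
    return assignments, centroids
```

For documents, the points would typically be word-weight vectors such as TF*IDF, and the resulting centroids can then be compared with a query to narrow retrieval to the most similar cluster.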

Processing Image Captions

• The Correspondence Problem - How to correlate visual information with words.

– Visual semantics.

– Symbolic representation of visual data.

• Srihari.

– Piction - System that automatically identifies human faces in captioned newspaper photos.

– Integrates NLP module which parses captions with IU module that detects objects.

Final Observations

• Previous Work.

– General text categorization studied extensively.

– Some research on text-based image retrieval.

– Very little research involving text-based image categorization.

• Image captions contain information unlikely to be extracted from just images.

• High potential exists for significant research involving text-based image categorization.