Text Categorization and Images Thesis Defense for Carl Sable Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang

Text Categorization

• Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.

• Idea: TC techniques can be applied to image captions or articles to label the corresponding images.

Clues for Indoor versus Outdoor: Text (as opposed to visual image features)

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.

The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.

Two Paradigms of Research

• Machine learning (ML) techniques
  – Common in the literature
  – Usually involve the exploration of new algorithms applied to bag of words representations of documents
• Novel representation
  – Rare in the literature
  – Usually more specific, but often interesting and can lead to substantial improvement
  – Important for certain tasks involving images!

Contributions

• General:
  – An in-depth exploration of the categorization of images based on associated text
  – Incorporating research into Newsblaster
• Novel machine learning (ML) techniques:
  – The creation of two novel TC approaches
  – The combination of high-precision/low-recall rules with other systems
• Novel representation:
  – The integration of NLP and IR
  – The use of low-level image features

Framework

• Collection of Experiments
  – Various tasks
  – Multiple techniques
  – No clear winner for all tasks
  – Characteristics of tasks often dictate which techniques work best
• “No Free Lunch”

Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work

Corpus

• Raw data:
  – Postings from news-related Usenet newsgroups
  – Over 2000 include embedded captioned images
• Data sets:
  – Multiple sets of categories representing various levels of abstraction
  – Mutually exclusive and exhaustive categories

[Example images for the two categories: Outdoor, Indoor]

Events Categories

[Example images for each category: Politics, Struggle, Disaster, Crime, Other]

Subcategories for Disaster Images

Events categories: Politics, Struggle, Disaster, Crime, Other

Category    F1
Politics    89%
Struggle    88%
Disaster    97%
Crime       90%
Other       59%

Disaster subcategories: Affected People, Workers Responding, Wreckage, Other

Disaster Image Categories

[Example images for each category: Affected People, Workers Responding, Wreckage, Other]

Subcategories for Politics Images

Events categories: Politics, Struggle, Disaster, Crime, Other

Category    F1
Politics    89%
Struggle    88%
Disaster    97%
Crime       90%
Other       59%

Politics subcategories: Meeting, Announcement, Politician Photographed, Civilians, Military, Other

Politics Image Categories

[Example images for each category: Meeting, Announcement, Politician Photographed, Civilians, Military, Other]


Collect Labels to Train Systems

Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work

Two Novel ML Approaches

• Density estimation
  – Applied to the results of some other system
  – Often improves performance
  – Always provides probabilistic confidence measures for predictions
• BINS
  – Uses binning to estimate accurate term weights for words with scarce evidence
  – Extremely competitive for two data sets in my corpus

Density Estimation

• First apply a standard system:
  – For each document, compute a similarity or score for every category.
  – Apply to training documents as well as test documents.
• For each test document:
  – Find all documents from training set with similar category scores.
  – Use categories of close training documents to predict categories of test documents.

Density Estimation Example

Order of categories in each score vector: Struggle, Politics, Disaster, Crime, Other

Category score vector for test document: 100, 40, 30, 90, 10

Category score vectors for training documents, with distances to the test document and actual categories:

Score vector           Distance   Actual category
85, 35, 25, 95, 20     20.0       (Crime)
100, 75, 20, 30, 5     92.5       (Struggle)
60, 95, 20, 30, 5      91.4       (Politics)
90, 25, 50, 110, 25    36.7       (Crime)
40, 30, 80, 25, 40     106.4      (Disaster)
80, 45, 20, 75, 10     27.4       (Struggle)

Predictions:
Rocchio/TF*IDF: Struggle
DE: Crime (Probability .679)

$$\frac{\frac{1}{20.0} + \frac{1}{36.7}}{\frac{1}{20.0} + \frac{1}{27.4} + \frac{1}{36.7}} = 0.679$$
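The arithmetic in this example can be checked with a short Python sketch. It assumes Euclidean distance between score vectors, the three nearest training documents, and inverse-distance weighting, which is what the numbers on the slide suggest rather than a confirmed description of the system. The function name `density_estimate`, the parameter `k`, and the pairing of actual categories with vectors are illustrative assumptions.

```python
import math

# Category order in every score vector: Struggle, Politics, Disaster, Crime, Other.
# The pairing of actual categories with score vectors is an assumption.
TRAINING = [
    ((85, 35, 25, 95, 20), "Crime"),
    ((100, 75, 20, 30, 5), "Struggle"),
    ((60, 95, 20, 30, 5), "Politics"),
    ((90, 25, 50, 110, 25), "Crime"),
    ((40, 30, 80, 25, 40), "Disaster"),
    ((80, 45, 20, 75, 10), "Struggle"),
]
TEST_VECTOR = (100, 40, 30, 90, 10)


def density_estimate(test_vec, training, k=3):
    """Return {category: probability} from the k nearest training documents.

    Each neighbor votes for its actual category with weight 1/distance
    (assumed weighting); the votes are normalized to sum to 1.
    """
    neighbors = sorted(
        (math.dist(test_vec, vec), label) for vec, label in training
    )[:k]
    # Guard against a zero distance (identical score vectors).
    weights = [(1.0 / max(d, 1e-9), label) for d, label in neighbors]
    total = sum(w for w, _ in weights)
    probs = {}
    for w, label in weights:
        probs[label] = probs.get(label, 0.0) + w / total
    return probs


probs = density_estimate(TEST_VECTOR, TRAINING)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Under these assumptions the three nearest documents lie at distances of about 20.0, 27.4, and 36.7, and the two Crime neighbors outweigh the single Struggle neighbor.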

Density Estimation Significantly Improves Performance for the Indoor versus Outdoor Data Set

[Bar chart: Overall Accuracy, Indoor F1, and Outdoor F1 for Density Estimation and Rocchio/TF*IDF; vertical axis from 65.0% to 95.0%]

Density Estimation Slightly Degrades Performance for the Events Data Set

[Bar chart: Overall Accuracy, Struggle F1, Politics F1, Disaster F1, Crime F1, and Other F1 for Density Estimation and Rocchio/TF*IDF; vertical axis from 30.0% to 100.0%]

Density Estimation Sometimes Improves Performance, Always Provides Confidence Measures

[Two bar charts comparing Density Estimation and Rocchio/TF*IDF, one for Indoor versus Outdoor and one for Events (Politics, Struggle, Disaster, Crime, Other); vertical axes from 70.0% to 95.0%]

Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set:

Confidence Range          # of Images      Overall Accuracy %
High (P ≥ 0.9)            285              92.6
Medium (0.9 > P ≥ 0.7)    98               75.5
Low (0.7 > P ≥ 0.5)       62               72.6

Results of Density Estimation Experiments for the Events Data Set:

Confidence Range          # of Documents   Overall Accuracy %
High (P ≥ 0.9)            301              94.4
Medium (0.9 > P ≥ 0.7)    68               79.4
Low (0.7 > P ≥ 0.5)       60               53.3
Very Low (0.5 > P)        14               42.9

BINS System: Naïve Bayes + Smoothing

• Binning: based on smoothing in the speech recognition literature
  – Not enough training data to estimate term weights for words with scarce evidence
  – Words with similar statistical features are grouped into a common “bin”
• Estimate a single weight for each bin
  – This weight is assigned to all words in the bin
  – Credible estimates even for small (or zero) counts

Binning Uses Statistical Features of Words

Intuition         Word         Indoor Category Count   Outdoor Category Count   Quantized IDF
Clearly Indoor    conference   14                      1                        4
Clearly Indoor    bed          1                       0                        8
Clearly Outdoor   plane        0                       9                        5
Clearly Outdoor   earthquake   0                       4                        6
Unclear           speech       2                       2                        6
Unclear           ceremony     3                       8                        5

“plane”

• Sparse data
  – “plane” does not occur in any Indoor training documents
  – Infinitely more likely to be Outdoor ???
• Assign “plane” to bins of words with similar features (e.g. IDF, category counts)
• In first half of training set, “plane” appears in:
  – 9 Outdoor documents
  – 0 Indoor documents

Lambdas: Weights

• First half of training set: Assign words to bins
• Second half of training set: Estimate term weights

$$P(\text{obs} \mid \text{bin}) = \frac{1}{|\text{bin}|} \sum_{\text{word} \in \text{bin}} \frac{DF(\text{word})}{|\text{docs}|}$$

$$\lambda_{\text{bin}} = \log_2 P(\text{obs} \mid \text{bin})$$

Lambdas for “plane”: 4.03 times more likely in an Outdoor document

$$P(\text{obs} \mid \text{bin}_{\text{Indoor}}) = 5.31 \times 10^{-3}$$

$$P(\text{obs} \mid \text{bin}_{\text{Outdoor}}) = 2.13 \times 10^{-2}$$

$$\lambda_{\text{Indoor}} - \lambda_{\text{Outdoor}} = \log_2 \frac{P(\text{obs} \mid \text{bin}_{\text{Indoor}})}{P(\text{obs} \mid \text{bin}_{\text{Outdoor}})} = -2.01$$
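A minimal Python sketch of the lambda estimate, assuming a bin's weight is the base-2 log of the average fraction of second-half documents that contain a word from the bin. The function name `bin_lambda` and the toy document frequencies are hypothetical; only the two probabilities for “plane” come from the slide.

```python
import math


def bin_lambda(doc_freqs, n_docs):
    """Weight shared by every word in a bin.

    doc_freqs: document frequency of each word in the bin, counted in the
               second half of the training set for one category.
    n_docs:    number of documents of that category in the second half.
    Raises ValueError if no word of the bin occurs at all (log of zero).
    """
    p_obs = sum(df / n_docs for df in doc_freqs) / len(doc_freqs)
    return math.log2(p_obs)


# Hypothetical bin: four words with document frequencies 2, 0, 1, 1 out of 200.
toy_lambda = bin_lambda([2, 0, 1, 1], 200)

# Values reported on the slide for the bins that contain "plane".
lambda_diff = math.log2(5.31e-3) - math.log2(2.13e-2)
outdoor_ratio = 2 ** 2.01

print(round(toy_lambda, 2), round(lambda_diff, 2), round(outdoor_ratio, 2))
```

The difference computed from the rounded probabilities comes out near -2.00 rather than exactly -2.01, presumably because the slide's probabilities are themselves rounded.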

Binning Credible Log Likelihood Ratios

Intuition         Word         Indoor minus Outdoor   Indoor Category Count   Outdoor Category Count   Quantized IDF
Clearly Indoor    conference   4.84                   14                      1                        4
Clearly Indoor    bed          1.35                   1                       0                        8
Clearly Outdoor   plane        -2.01                  0                       9                        5
Clearly Outdoor   earthquake   -1.00                  0                       4                        6
Unclear           speech       0.84                   2                       2                        6
Unclear           ceremony     -0.50                  3                       8                        5

Lambdas Decrease with IDF

[Line chart “Disaster lambdas”: lambda (vertical axis, -11 to -4) versus IDF (horizontal axis, 1 to 8), with one series for count=0 and one for count=1]

Methodology of BINS

• Divide training set into two halves:
  – First half used to determine bins for words
  – Second half used to determine lambdas for bins
• For each test document:
  – Map every word to a bin for each category
  – Add lambdas, obtaining a score for each category
• Switch halves of training and repeat
• Combine results and assign each document to category with highest score
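The scoring and combination steps could be sketched as follows in Python. This is an illustration under assumptions: the bin names, the lambda values, skipping words that have no bin, and combining the two runs by adding their scores are all hypothetical choices that the slide does not specify.

```python
def score_document(words, bin_of, lambdas):
    """Sum the lambda of each word's bin, separately for every category.

    bin_of[category][word]  -> bin id (determined on one half of the training set)
    lambdas[category][bin]  -> weight (estimated on the other half)
    Words with no bin are skipped (assumption).
    """
    return {
        category: sum(
            lambdas[category][word_bins[w]] for w in words if w in word_bins
        )
        for category, word_bins in bin_of.items()
    }


def combine_and_predict(scores_a, scores_b):
    """Add the scores from the two runs (halves switched) and take the best."""
    combined = {c: scores_a[c] + scores_b[c] for c in scores_a}
    return max(combined, key=combined.get)


# Hypothetical bins and lambdas for one run; the second run reuses them here.
BIN_OF = {
    "Indoor": {"conference": "high", "plane": "zero"},
    "Outdoor": {"conference": "low", "plane": "high"},
}
LAMBDAS = {
    "Indoor": {"high": -5.0, "zero": -9.0},
    "Outdoor": {"low": -8.0, "high": -6.0},
}

run_a = score_document(["plane", "crashed"], BIN_OF, LAMBDAS)
run_b = score_document(["plane", "crashed"], BIN_OF, LAMBDAS)
prediction = combine_and_predict(run_a, run_b)
print(run_a, prediction)
```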

Binning Improves Performance for the Indoor versus Outdoor Data Set

[Bar chart: Overall Accuracy, Indoor F1, and Outdoor F1 for BINS and Naïve Bayes; vertical axis from 70.0% to 95.0%]

Binning Improves Performance for the Events Data Set

[Bar chart: Overall Accuracy, Struggle F1, Politics F1, Disaster F1, Crime F1, and Other F1 for BINS and Naïve Bayes; vertical axis from 20.0% to 100.0%]

BINS: Robust Version of Naïve Bayes

[Two bar charts comparing BINS and Naïve Bayes, one for Indoor versus Outdoor and one for Events (Politics, Struggle, Disaster, Crime, Other); vertical axes from 70.0% to 95.0%, with “baseline” and “humans” levels marked]

Combining Bin Weights and Naïve Bayes Weights

• Idea:
  – It might be better to use the Naïve Bayes weight when there is enough evidence for a word
  – Back off to the bin weight otherwise
• BINS allows combinations of weights to be used based on the level of evidence
• How can we automatically determine when to use which weights???
  – Entropy
  – Minimum Squared Error (MSE)

Can Provide File to BINS that Specifies How to Combine Weights

Based on Entropy: 0, 0.5, 1
  – Use only bin weight for evidence of 0
  – Average bin weight and NB weight for evidence of 1
  – Use only NB weight for evidence of 2 or more

Based on MSE: 0, 0.25, 0.5, 0.75, 1
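One way to read such a file is as the share given to the Naïve Bayes weight at each evidence level, with the last entry reused for all higher levels. The Python sketch below follows that reading; the function name `combined_weight` and the example weights are hypothetical, and the interpretation of the MSE list by analogy with the entropy list is an assumption.

```python
ENTROPY_FILE = [0, 0.5, 1]
MSE_FILE = [0, 0.25, 0.5, 0.75, 1]


def combined_weight(bin_weight, nb_weight, evidence, nb_fractions):
    """Interpolate between the bin weight and the Naive Bayes weight.

    nb_fractions[i] is the share given to the NB weight when a word has
    evidence i; the last entry covers every higher evidence level.
    """
    fraction = nb_fractions[min(evidence, len(nb_fractions) - 1)]
    return (1 - fraction) * bin_weight + fraction * nb_weight


# Hypothetical weights for one word: bin weight -2.0, NB weight -4.0.
for evidence in range(4):
    print(evidence, combined_weight(-2.0, -4.0, evidence, ENTROPY_FILE))
```

With the entropy-based list this yields the bin weight alone at evidence 0, the average at evidence 1, and the NB weight alone from evidence 2 upward.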

Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet

[Two bar charts comparing BINS (Combo #2), BINS (Combo #1), BINS, and Naïve Bayes, one for Indoor versus Outdoor and one for Events (Politics, Struggle, Disaster, Crime, Other); vertical axes from 70.0% to 95.0%]

BINS Performs the Best of All Systems Tested

[Two bar charts, one for Indoor versus Outdoor and one for Events (Politics, Struggle, Disaster, Crime, Other), comparing Density Estimation, Rocchio/TF*IDF, BINS (Combo #2), BINS (Combo #1), BINS, Naïve Bayes, K-Nearest Neighbor, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; vertical axes from 70.0% to 95.0%, with the BINS and SVMs bars labeled in each chart]

How Can We Improve Results?

• One idea: Label more documents!
  – Usually works
  – Boring
• Another idea: Use unlabeled documents!
  – Easily obtainable
  – But can this really work???
  – Maybe it can…

Binning Using Unlabeled Documents

• Apply system to unlabeled documents
• Choose documents with “confident” predictions
  – Each word has new feature: # of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts)
  – Probably less important than regular category counts
  – Binning provides a natural mechanism for weighting the new feature appropriately

Determining Confident Predictions

• BINS computes a score for each category
  – BINS predicts category with highest score
  – Confidence for predicted category is score of that category minus score of second place category
  – Confidence for non-predicted category is score of that category minus score of chosen category
• Cross validation experiments can be used to determine a confidence cutoff for each category
  – Maximize Fβ for category
  – Beta of 1 gives precision and recall equal weight, lower beta weights precision higher
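The confidence margins and the F measure can be written down in a few lines of Python. The margin definitions follow the bullets above; the F measure uses the standard definition F = (1 + β²)·P·R / (β²·P + R). The function names and the example scores are hypothetical.

```python
def confidences(scores):
    """Confidence per category from raw category scores (needs >= 2 categories).

    Predicted category: its score minus the second-place score (>= 0).
    Any other category: its score minus the predicted category's score (<= 0).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    result = {c: scores[c] - scores[best] for c in scores}
    result[best] = scores[best] - scores[runner_up]
    return result


def f_beta(precision, recall, beta=1.0):
    """Standard F measure; beta below 1 weights precision more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical category scores for one document.
example = confidences({"Struggle": 120.0, "Politics": 90.0, "Crime": 40.0})
print(example)
print(f_beta(0.9, 0.6, beta=1.0), f_beta(0.9, 0.6, beta=1 / 3))
```

A cutoff for a category could then be picked by sweeping candidate thresholds over these confidences on held-out folds and keeping the one with the highest F value.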

Use F to Optimize Confidence Cutoffs (example for a single category)

[Line chart “Results for Struggle Category”: Precision, Recall, F1, and F(1/3) (vertical axis “Value of Metrics”, 0 to 1) versus Confidence Cutoff (horizontal axis, -600 to 300)]

Use F to Optimize Confidence Cutoffs (important region of graph highlighted)

[Line chart “Results for Struggle Category”: Precision, Recall, F1, and F(1/3) (vertical axis “Value of Metrics”, 0.5 to 1) versus Confidence Cutoff (horizontal axis, 0 to 100)]

Should the New Feature Matter?

[Line chart “zero count lambdas (IDF=8, beta=0.5)”: lambda (vertical axis, -12 to -5) versus unlabeled category count (horizontal axis, 0 to 9), with series Dis, Str, Pol, Cri]

[Line chart “zero count lambdas (category=Disaster, beta=1.0)”: lambda (vertical axis, -12 to -5) versus unlabeled category count (horizontal axis, 0 to 10), with series IDF=5, IDF=6, IDF=7, IDF=8]

Does the New Feature Help?

• No
• Why???
  – New features add info but make bins smaller
  – Perhaps more data isn’t needed in the first place
• Should more data matter?
  – Hard to accumulate more labeled data
  – Easy to try out less labeled data!

Does Size Matter?

[Line chart “Effect of Training Data”: Performance (vertical axis, 55.0% to 95.0%) versus Percentage Used (horizontal axis, 10% to 100%), with series IN/OUT and EVENTS]

Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work

Disaster Image Categories

[Example images for each category: Affected People, Workers Responding, Wreckage, Other]

Performance of Standard Systems Not Very Satisfying

[Bar chart comparing Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; vertical axis from 52.00% to 66.00%]

Ambiguity for Disaster Images: Workers Responding vs. Affected People

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.

Workers Responding Affected People

Summary of Observations About Task

• Need to distinguish foreground from background, determine focus of image
• Not all words are important; some are misleading
• Hypothesis: the main subject and verb are particularly useful for this task
  – Problematic for bag of words approaches
  – Need linguistic analysis to determine predicate argument relationships

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

Hypothesis: Subject and Verb are Useful Clues

Subject      Verb       Category             Guessable?
Truck        makes      Wreckage             No
couple       mourn      Affected People      Yes
blocks       suffered   Wreckage             Yes
NAME         gather     Affected People      No
child        sleeps     Affected People      Yes
inspectors   search     Workers Responding   Yes
NAME         observes   Workers Responding   No
workers      confer     Workers Responding   Yes
child        covers     Affected People      Yes
chimney      stands     Wreckage             Yes

Experiments with Human Subjects: 4 Conditions

Test Hypothesis: Subject and Verb are Useful Clues

SENT: First sentence of caption

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

RAND: All words from first sentence in random order

At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire

IDF: Top two TF*IDF words

disco rescuers

S-V: Subject and verb

subject = “rescuers”, verb = “carry”

Experiments with Human Subjects: Results
Hypothesis: Subject and Verb are Useful Clues

• More words are better than fewer words
  – SENT, RAND > S-V, IDF
• Syntax is important
  – SENT > RAND; S-V > IDF

[Bar chart: results for the SENT, RAND, IDF, and S-V conditions; vertical axis from 50.0% to 95.0%]

Condition   Average Time (in seconds)
RAND        68
SENT        34
IDF         22
S-V         20

RAND is Very Slow!

• Perhaps human subjects unscrambled words, regaining syntactic information

Condition   Average Time (in seconds)
RAND        68
SENT        34
IDF         22
S-V         20

Using Just Two Words (S-V) Almost as Good as All the Words (Bag of Words)

[Bar chart with groups SENT and S-V, comparing Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; vertical axis from 52.00% to 66.00%]

Operational NLP Based System

• Extract subjects and verbs from all documents in training set
• For each test document:
  – Extract subject and verb
  – Compare to those from training set using some method of word-to-word similarity
  – Based on similarities, generate a score for every category

[Diagram of the extraction components: Sentence, POS tagger, CASS shallow parser, Perl script, WordNet, Output; Subjects 83.9%, Verbs 80.6%]

Word Similarity

• Examine large “extended corpus” to generate many subject/verb pairs
• Use to compute similarities:

$$\text{Sim}(Sub_1, Sub_2) = \frac{\#\ \text{verbs in common}}{\#\ \text{total verbs}}$$

$$\text{Sim}(Verb_1, Verb_2) = \frac{\#\ \text{subjects in common}}{\#\ \text{total subjects}}$$

$$\text{Sim}(Sub, Verb) = \frac{2 \times \#\ \text{appearances together}}{\#\ \text{total appearances}}$$
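A rough Python sketch of these three similarities, computed from a list of (subject, verb) pairs. Several readings are assumptions: “total verbs” is taken as the number of distinct verbs seen with either subject, “total subjects” likewise, and “total appearances” as the subject's count plus the verb's count. The helper names and the toy pairs are hypothetical and stand in for the extended corpus.

```python
from collections import Counter, defaultdict


def build_tables(pairs):
    """Index a list of (subject, verb) pairs for the similarity functions."""
    tables = {
        "pair": Counter(pairs),
        "sub": Counter(s for s, _ in pairs),
        "verb": Counter(v for _, v in pairs),
        "verbs_of": defaultdict(set),
        "subs_of": defaultdict(set),
    }
    for s, v in pairs:
        tables["verbs_of"][s].add(v)
        tables["subs_of"][v].add(s)
    return tables


def _overlap(a, b):
    """Items in common divided by items seen with either word."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_sub_sub(t, s1, s2):
    return _overlap(t["verbs_of"].get(s1, set()), t["verbs_of"].get(s2, set()))


def sim_verb_verb(t, v1, v2):
    return _overlap(t["subs_of"].get(v1, set()), t["subs_of"].get(v2, set()))


def sim_sub_verb(t, s, v):
    total = t["sub"][s] + t["verb"][v]
    return 2 * t["pair"][(s, v)] / total if total else 0.0


# Toy pairs standing in for the extended corpus.
PAIRS = [
    ("rescuers", "carry"), ("rescuers", "search"),
    ("workers", "search"), ("workers", "carry"), ("workers", "confer"),
    ("couple", "mourn"),
]
T = build_tables(PAIRS)
print(sim_sub_sub(T, "rescuers", "workers"))
print(sim_verb_verb(T, "carry", "search"))
print(sim_sub_verb(T, "rescuers", "carry"))
```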

Choosing a Category

• For given test document d, calculate total score for every category c:

$$\text{Score}(c \mid d) = \sum_{s_c \in S_c} \left[ \text{Sim}(s_c, s_d) + \text{Sim}(s_c, v_d) \right] + \sum_{v_c \in V_c} \left[ \text{Sim}(v_c, s_d) + \text{Sim}(v_c, v_d) \right]$$

• Choose category with highest score
• If subject is NAME, a bit more complicated

The NLP Based System Beats All Others by a Considerable Margin

[Bar chart with groups SENT and S-V, comparing NLP Based System, Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; vertical axis from 52.0% to 68.0%]

Politics Image Categories

[Example images for each category: Meeting, Announcement, Politician Photographed, Civilians, Military, Other]

The NLP Based System is in the Middle of the Pack for the Politics Image Data Set

[Bar chart with groups SENT and S-V, comparing NLP Based System, Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; vertical axis from 35.0% to 65.0%]

Why is the Performance for the NLP Based System not as Strong for the Politics Image Data Set?

• A much wider range of performance scores
  – Range for Politics images is 36% to 64.7%
  – Range for Disaster images is 54% to 59.7%
  – The top systems are harder to beat
• Too many proper names as subjects
  – 60% of test instances for Politics images
  – Only 13% of test instances for Disaster images
  – For 60% of test documents, only one word (the main verb) is being used to determine the prediction

Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work

The Original Premise

• For the Disaster image data set, the performance of the NLP based system still leaves room for improvement
  – NLP based system achieves 65% overall accuracy for the Disaster image data set
  – Humans viewing all words in random order achieve about 75%
  – Humans viewing full first sentence achieve over 90%
• Main subject and verb are particularly important, but sometimes other words might offer good clues


Higinio Guereca carries family photos he retrieved from his mobile home which was destroyed as a tornado moved through the Central Florida community, early December 27.

Choosing Indicative Words

• Let x be the number of training documents containing a word w
• Let p be the proportion of these documents that belong to category c
• If x > X and p > P then w is indicative of c
• X and P can be varied to generate lists of indicative words
• Lists can be pruned manually
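The selection rule above can be sketched in Python as follows. The function name `indicative_words`, the parameter names `min_count` (X) and `min_proportion` (P), and the toy documents are hypothetical; for each word the sketch only considers the category it occurs with most often.

```python
from collections import Counter, defaultdict


def indicative_words(documents, min_count, min_proportion):
    """Return {word: (category, x, p)} for words passing both thresholds.

    documents:      iterable of (words, category) pairs from the training set
    min_count:      X, the threshold on x (documents containing the word)
    min_proportion: P, the threshold on p (share of those in one category)
    """
    doc_count = Counter()
    category_count = defaultdict(Counter)
    for words, category in documents:
        for w in set(words):  # count each word once per document
            doc_count[w] += 1
            category_count[w][category] += 1

    selected = {}
    for w, x in doc_count.items():
        category, n = category_count[w].most_common(1)[0]
        p = n / x
        if x > min_count and p > min_proportion:
            selected[w] = (category, x, p)
    return selected


# Toy training documents (hypothetical).
DOCS = [
    ({"rescue", "fire"}, "Workers Responding"),
    ({"rescue", "workers"}, "Workers Responding"),
    ({"rescue", "fire"}, "Workers Responding"),
    ({"family", "fire"}, "Affected People"),
]
selected = indicative_words(DOCS, min_count=2, min_proportion=0.8)
print(selected)
```

Because both comparisons are strict, P has to stay below 1.0 for words with p = 1.0 to be selected.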

Selected Indicative Words for the Disaster Image Data Set

Word        Indicated Category    Total Count (x)   Proportion (p)
her         Affected People       7                 1.0
his         Affected People       7                 0.86
family      Affected People       6                 0.83
relatives   Affected People       6                 1.0
rescue      Workers Responding    15                1.0
search      Workers Responding    9                 1.0
similar     Other                 2                 1.0
soldiers    Workers Responding    6                 1.0
workers     Workers Responding    12                1.0

Selected Indicative Words for the Politics Image Data Set

Word           Indicated Category        Total Count (x)   Proportion (p)
hands          Meeting                   10                0.90
journalists    Announcement              4                 1.0
local          Civilians                 4                 1.0
media          Announcement              3                 1.0
presidential   Politician Photographed   9                 0.78
press          Announcement              7                 0.71
reporters      Announcement              8                 0.88
meeting        Meeting                   15                0.73
session        Meeting                   6                 0.83
victory        Politician Photographed   6                 0.83
waves          Politician Photographed   4                 1.0
wife           Politician Photographed   6                 1.0

Page 70: Text Categorization and Images Thesis Defense for Carl Sable Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu.

High-Precision/Low-Recall Rules

• If a word w that indicates category c occurs in a document d, then assign d to c

• Every selected indicative word has an associated “rule” of the above form
  – Each rule is very accurate but rarely applicable
  – If only rules are used:
    • most predictions will be correct (hence, high precision)
    • most instances of most categories will remain unlabeled (hence, low recall)
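A minimal sketch of applying such rules on their own, and of why they give high precision but low recall, might look as follows. The helper names and the toy documents are hypothetical. The slides do not say what happens when several indicative words point to different categories, so taking the first match in alphabetical order is an assumption of this sketch.

```python
def apply_rules(words, indicative):
    """Return the category indicated by a word in the document, or None.

    indicative maps word -> category. Ties between conflicting rules are
    broken alphabetically by word (an assumption, not from the slides).
    """
    for w in sorted(words):
        if w in indicative:
            return indicative[w]
    return None  # no rule fires: the document stays unlabeled

def rule_precision_recall(docs, indicative):
    """docs: list of (set_of_words, gold_category), one label per document."""
    labeled = correct = 0
    for words, gold in docs:
        predicted = apply_rules(words, indicative)
        if predicted is not None:
            labeled += 1
            correct += predicted == gold
    precision = correct / labeled if labeled else 0.0
    recall = correct / len(docs) if docs else 0.0
    return precision, recall

# Hypothetical rules and test documents.
rules = {"rescue": "Workers Responding", "relatives": "Affected People"}
test_docs = [
    ({"rescue", "team"}, "Workers Responding"),   # rule fires, correct
    ({"relatives", "wait"}, "Affected People"),   # rule fires, correct
    ({"fire", "crews"}, "Workers Responding"),    # no rule fires
    ({"damaged", "homes"}, "Other"),              # no rule fires
]
precision, recall = rule_precision_recall(test_docs, rules)
```

In this toy case every prediction made is right, but half of the documents receive no label at all.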


Combining the High-Precision/Low-Recall Rules with Other Systems

• Two-pass approach:
  – Conduct a first pass using the indicative words and the high-precision/low-recall rules
  – For documents that are still unlabeled, fall back to some other system

• Compared to the fall back system:
  – If the rules are more accurate for the documents to which they apply, overall accuracy will improve!
  – Intended to improve the NLP based system, but easy to test with other systems as well
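The two-pass idea, and the reason overall accuracy improves when the rules beat the fall back system on the documents they cover, can be sketched as below. All names and all numbers here are hypothetical and chosen only for illustration; the fall back classifier is a stand-in for any of the systems compared in the talk.

```python
def two_pass_classify(words, indicative, fallback):
    """First pass: indicative-word rules. Second pass: the fall back system.

    indicative maps word -> category; fallback is any callable that takes
    the document's words and returns a category.
    """
    for w in sorted(words):          # alphabetical tie-break is an assumption
        if w in indicative:
            return indicative[w], "rule"
    return fallback(words), "fallback"

def overall_accuracy(coverage, accuracy_on_covered, accuracy_on_rest):
    """Accuracy of a system as a weighted average over the two document groups.

    coverage is the fraction of documents on which at least one rule fires.
    """
    return coverage * accuracy_on_covered + (1 - coverage) * accuracy_on_rest

# Hypothetical stand-in for a trained classifier.
def always_other(words):
    return "Other"

rules = {"rescue": "Workers Responding"}
first = two_pass_classify({"rescue", "team"}, rules, always_other)
second = two_pass_classify({"damaged", "homes"}, rules, always_other)

# Made-up numbers: rules cover 20% of documents and are 95% accurate there,
# while the fall back system is 60% accurate on both groups.
without_rules = overall_accuracy(0.2, 0.60, 0.60)
with_rules = overall_accuracy(0.2, 0.95, 0.60)
```

Because the fall back predictions on the uncovered documents are unchanged, the only difference between the two totals comes from the covered documents.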


The Rules Improve Every Fall Back System for the Disaster Image Data Set

[Chart: accuracy (axis from 52.0% to 68.0%) without rules versus with rules, for each fall back system: NLP Based System, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], Maximum Entropy [R]]


The Rules Improve 7 of 8 Fall Back Systems for the Politics Image Data Set

[Chart: accuracy (axis from 35.0% to 65.0%) without rules versus with rules, for each fall back system: NLP Based System, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], Maximum Entropy [R]]


Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work


Low-Level Image Features

• Collaboration with Paek and Benitez
  – They have provided me with information, pointers to resources, and code
  – I have reimplemented some of their code

• Color histograms
  – Based on entire images or image regions
  – Can be used as input to machine learning approaches (e.g. kNN, SVMs)


Color

• Three components to color
  – Red, green, blue (RGB)
  – Hue, saturation, value (HSV)

• Can convert from RGB to HSV
  – Can quantize HSV triples
  – 18 hues * 3 saturations * 3 values + 4 grays = 166 slots
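One way to picture the quantization is the sketch below, which uses Python's standard colorsys module for the RGB-to-HSV conversion. The slides give only the slot counts (18 * 3 * 3 + 4 = 166), so the equal-width bins, the saturation threshold used to decide that a pixel is gray, and the slot ordering are all assumptions of this example rather than the scheme actually used.

```python
import colorsys

HUE_BINS, SAT_BINS, VAL_BINS, GRAY_BINS = 18, 3, 3, 4
COLOR_SLOTS = HUE_BINS * SAT_BINS * VAL_BINS   # 162 chromatic slots
NUM_BINS = COLOR_SLOTS + GRAY_BINS             # 162 + 4 = 166
GRAY_SATURATION = 0.1                          # assumed threshold, not from the slides

def quantize_rgb(r, g, b):
    """Map an 8-bit RGB pixel to a slot index in range(166)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s < GRAY_SATURATION:
        # Nearly colorless pixel: pick one of the 4 gray slots by brightness.
        return COLOR_SLOTS + min(int(v * GRAY_BINS), GRAY_BINS - 1)
    hue = min(int(h * HUE_BINS), HUE_BINS - 1)
    sat = min(int(s * SAT_BINS), SAT_BINS - 1)
    val = min(int(v * VAL_BINS), VAL_BINS - 1)
    return (hue * SAT_BINS + sat) * VAL_BINS + val

red_slot = quantize_rgb(255, 0, 0)
gray_slot = quantize_rgb(128, 128, 128)
```

Under these assumptions, chromatic pixels land in slots 0 to 161 and near-gray pixels in slots 162 to 165.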


Color Histograms

• For each pixel of image, compute its quantized HSV triple

• Color histogram of image is vector such that:
  – There are 166 dimensions
  – Each dimension represents one possible HSV triple
  – Value of dimension is proportion of pixels with associated HSV triple

• Can be computed for image regions and concatenated together

• Can be input for machine learning techniques
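The histogram definition above amounts to counting slot indices and normalizing by the number of pixels. The sketch below assumes a quantizer is supplied as a function; the toy quantizer and pixel list are hypothetical and stand in for a real HSV quantization.

```python
from collections import Counter

NUM_BINS = 166

def color_histogram(pixels, quantize, num_bins=NUM_BINS):
    """Return a list of num_bins proportions, one per quantized color slot.

    pixels is any iterable of pixel values; quantize maps a pixel to a slot
    index in range(num_bins).
    """
    counts = Counter(quantize(p) for p in pixels)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_bins
    return [counts.get(slot, 0) / total for slot in range(num_bins)]

# Hypothetical quantizer for the example: red goes to slot 8, anything else to 165.
def toy_quantize(pixel):
    return 8 if pixel == (255, 0, 0) else 165

toy_pixels = [(255, 0, 0)] * 3 + [(255, 255, 255)]
histogram = color_histogram(toy_pixels, toy_quantize)
```

The resulting vector always sums to 1 for a non-empty image, so images of different sizes are directly comparable.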


Images Divided into 8 x 8 Rectangular Regions of Equal Size
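As a rough sketch of the region-based variant, the image can be cut into an 8 x 8 grid, with one 166-slot histogram per cell, and the 64 histograms concatenated into a single vector of 64 * 166 = 10,624 dimensions. The input format (a 2D list of already quantized slot indices), the row-by-row region order, and the handling of image sizes not divisible by 8 are assumptions of this example.

```python
def region_histograms(slot_image, grid=8, num_bins=166):
    """Concatenate per-region color histograms for a grid x grid split.

    slot_image is a 2D list (rows of pixels) of quantized slot indices.
    Regions are visited row by row, left to right.
    """
    height, width = len(slot_image), len(slot_image[0])
    features = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            counts = [0] * num_bins
            for y in range(y0, y1):
                for x in range(x0, x1):
                    counts[slot_image[y][x]] += 1
            n = (y1 - y0) * (x1 - x0)
            # Empty regions (image smaller than the grid) get an all-zero histogram.
            features.extend(c / n if n else 0.0 for c in counts)
    return features

# Hypothetical 16 x 16 image: slot 5 everywhere except a 2 x 2 patch of slot 7
# in the top-left corner, which fills exactly one grid cell.
toy_image = [[5] * 16 for _ in range(16)]
for y in range(2):
    for x in range(2):
        toy_image[y][x] = 7
region_features = region_histograms(toy_image)
```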


Using Color Histograms to Predict Labels for the Indoor versus Outdoor Data Set

[Chart: accuracy (axis from 70.0% to 80.0%) using color histograms of whole images versus image regions, for K-Nearest Neighbor and Support Vector Machines]


Combining Text and Image Features

• Combining systems has had mixed results in the TC literature, but:
  – Most attempts have involved systems that use the same features (bag of words)
  – There is little reason to believe that indicative text is correlated with indicative low-level image features

• Most text based systems are beating the image based systems, but:
  – Distance from optimal hyperplane can be used as a confidence measure for support vector machine
  – Predictions with high confidence may be more accurate than text systems
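One simple way to act on this idea is to trust the image-based support vector machine only when its distance from the hyperplane is large, and to keep the text system's label otherwise. The slides do not spell out the combination method, so the sketch below is an assumption; the function name, the cutoff value, and the choice of which class lies on the positive side of the hyperplane are all hypothetical.

```python
def combined_label(text_label, svm_distance, cutoff,
                   positive="Outdoor", negative="Indoor"):
    """Use the image SVM's prediction only when it is confident enough.

    svm_distance is the signed distance of the image from the separating
    hyperplane; its sign gives the SVM's class and its magnitude is treated
    as confidence. Which class is "positive" is an assumption of this sketch.
    """
    if abs(svm_distance) >= cutoff:
        return positive if svm_distance > 0 else negative
    return text_label

# Hypothetical cases with an assumed cutoff of 1.5.
confident_image = combined_label("Indoor", svm_distance=2.3, cutoff=1.5)
unsure_image = combined_label("Indoor", svm_distance=0.4, cutoff=1.5)
confident_negative = combined_label("Outdoor", svm_distance=-1.9, cutoff=1.5)
```

The cutoff trades off how often the image system overrides the text system against how reliable those overrides are.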


Accuracy of Support Vector Machine Approach Tends to be Higher when Confidence is Greater

| Distance Cutoff | Overall Accuracy % | % of Images Above Cutoff |
|---|---|---|
| 3.5 | --- | 0.0 |
| 3.0 | 100.0 | 0.4 |
| 2.5 | 87.5 | 1.8 |
| 2.0 | 92.3 | 5.8 |
| 1.5 | 94.4 | 16.0 |
| 1.0 | 91.0 | 34.1 |
| 0.5 | 84.6 | 70.1 |
| 0.0 | 78.0 | 100.0 |
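A table of this shape can be produced from per-image results with a short routine like the one below. It is a sketch under stated assumptions: each result is a pair of the absolute hyperplane distance and whether the prediction was correct, an image counts as above the cutoff when its distance is greater than or equal to it, and the four toy results are invented for the example.

```python
def accuracy_by_cutoff(results, cutoffs):
    """results: list of (absolute_distance, is_correct) pairs, one per image.

    Returns one (cutoff, accuracy_percent, percent_above_cutoff) row per
    cutoff. Accuracy is None when no image reaches the cutoff.
    """
    rows = []
    for cutoff in cutoffs:
        above = [ok for distance, ok in results if distance >= cutoff]
        percent_above = 100.0 * len(above) / len(results)
        accuracy = 100.0 * sum(above) / len(above) if above else None
        rows.append((cutoff, accuracy, percent_above))
    return rows

# Hypothetical results for four images.
toy_results = [(2.1, True), (1.2, True), (0.7, False), (0.3, True)]
rows = accuracy_by_cutoff(toy_results, cutoffs=[3.0, 2.0, 1.0, 0.0])
```

As in the slide, the row for cutoff 0.0 covers every image and so reports the SVM's accuracy on the whole set.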


The Combination of Text and Image Beats Text Alone: Most systems show small gains, one has major improvement

[Chart: accuracy (axis from 77.0% to 89.0%) for text only versus text and image, for each system: BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], Maximum Entropy [R]]


Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work


Newsblaster Categories

Entertainment Science/Technology Sports

U.S. News World News Finance


Newsblaster

• A pragmatic showcase for NLP

• My contributions:
  – Extraction of images and captions from web pages
  – Image browsing interface
  – Categorization of stories (clusters) and images
  – Scripts that allow users to suggest labels for articles with incorrect predictions


Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work


Conclusions

• TC techniques can be used to categorize images

• Many methods exist
  – No clear winner for all tasks
  – BINS is very competitive
  – NLP can lead to substantial improvement, at least for certain tasks
  – High-precision/low-recall rules are likely to improve performance for tough tasks
  – Image features show promise

• Newsblaster demonstrates pragmatic benefits of my work


Future Work

• BINS
  – Explore additional binning features
  – Explore use of unlabeled data

• NLP and TC
  – Improve current system
  – Explore additional categories

• Image features
  – Explore additional low-level image features
  – Explore better methods of combining text and image


And Now the Questions…

