
Text Categorization and Images
Thesis Defense for Carl Sable

Committee: Kathleen McKeown, Vasileios Hatzivassiloglou, Shree Nayar, Kenneth W. Church, Shih-Fu Chang

http://www.cs.columbia.edu/~sable/research/photos/cnt15187.jpg











Text Categorization

• Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.

• Idea: TC techniques can be applied to image captions or articles to label the corresponding images.

Clues for Indoor versus Outdoor: Text (as opposed to visual image features)

Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.

The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.

Two Paradigms of Research

• Machine learning (ML) techniques
    – Common in the literature
    – Usually involve the exploration of new algorithms applied to bag of words representations of documents
• Novel representation
    – Rare in the literature
    – Usually more specific, but often interesting and can lead to substantial improvement
    – Important for certain tasks involving images!

Contributions

• General:
    – An in-depth exploration of the categorization of images based on associated text
    – Incorporating research into Newsblaster
• Novel machine learning (ML) techniques:
    – The creation of two novel TC approaches
    – The combination of high-precision/low-recall rules with other systems
• Novel representation:
    – The integration of NLP and IR
    – The use of low-level image features

Framework

• Collection of Experiments
    – Various tasks
    – Multiple techniques
    – No clear winner for all tasks
    – Characteristics of tasks often dictate which techniques work best
• “No Free Lunch”

Overview

I. The Main Idea
II. Description of Corpus
III. Novel ML Systems
IV. NLP Based System
V. High-Precision/Low-Recall Rules
VI. Image Features
VII. Newsblaster
VIII. Conclusions and Future Work

Corpus

• Raw data:
    – Postings from news-related Usenet newsgroups
    – Over 2000 include embedded captioned images
• Data sets:
    – Multiple sets of categories representing various levels of abstraction
    – Mutually exclusive and exhaustive categories

Outdoor Indoor

Events Categories

Politics, Struggle, Disaster, Crime, Other

Subcategories for Disaster Images

Category    F1
Politics    89%
Struggle    88%
Disaster    97%
Crime       90%
Other       59%

Disaster subcategories: Affected People, Workers Responding, Wreckage, Other

Disaster Image Categories

• Affected People
• Workers Responding
• Wreckage
• Other

Subcategories for Politics Images

Politics subcategories: Meeting, Announcement, Politician Photographed, Civilians, Military, Other

Politics Image Categories

• Meeting
• Announcement
• Politician Photographed
• Civilians
• Military
• Other

Collect Labels to Train Systems



Two Novel ML Approaches

• Density estimation
    – Applied to the results of some other system
    – Often improves performance
    – Always provides probabilistic confidence measures for predictions
• BINS
    – Uses binning to estimate accurate term weights for words with scarce evidence
    – Extremely competitive for two data sets in my corpus

Density Estimation

• First apply a standard system:
    – For each document, compute a similarity or score for every category.
    – Apply to training documents as well as test documents.
• For each test document:
    – Find all documents from training set with similar category scores.
    – Use categories of close training documents to predict categories of test documents.

Density Estimation Example

Order of category scores in each vector: Struggle, Politics, Disaster, Crime, Other

Category score vectors for training documents, with distances to the test document:

Score vector             Distance
85, 35, 25, 95, 20       20.0
100, 75, 20, 30, 5       92.5
60, 95, 20, 30, 5        91.4
90, 25, 50, 110, 25      36.7
40, 30, 80, 25, 40       106.4
80, 45, 20, 75, 10       27.4

Actual categories of the training documents (as listed on the slide): Crime, Struggle, Disaster, Struggle, Politics, Crime

Category score vector for test document:

100, 40, 30, 90, 10

Predictions:
• Rocchio/TF*IDF: Struggle
• DE: Crime (Probability .679)

$$P(\mathrm{Crime}) = \frac{\frac{1}{20.0} + \frac{1}{36.7}}{\frac{1}{20.0} + \frac{1}{27.4} + \frac{1}{36.7}} = .679$$
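A minimal sketch of the density estimation step, using the score vectors from the example. Several things here are assumptions rather than details stated on the slides: the choice of Euclidean distance, the use of the three nearest neighbors, the inverse-distance weighting (both read off the worked probability of .679), and the pairing of category labels with individual training vectors. The function name `density_estimate` is hypothetical.

```python
import math

# Training score vectors in the order Struggle, Politics, Disaster, Crime, Other.
# ASSUMPTION: the pairing of each label with a vector is a guess for illustration.
TRAINING = [
    ((85, 35, 25, 95, 20), "Crime"),
    ((100, 75, 20, 30, 5), "Struggle"),
    ((60, 95, 20, 30, 5), "Politics"),
    ((90, 25, 50, 110, 25), "Crime"),
    ((40, 30, 80, 25, 40), "Disaster"),
    ((80, 45, 20, 75, 10), "Struggle"),
]
TEST_VECTOR = (100, 40, 30, 90, 10)


def density_estimate(test_vector, training, k=3):
    """Return {category: probability} from the k nearest training vectors.

    ASSUMPTION: neighbors are weighted by 1 / Euclidean distance.
    """
    neighbors = sorted(
        (math.dist(test_vector, vector), label) for vector, label in training
    )[:k]
    weights = {}
    for distance, label in neighbors:
        # Guard against a zero distance (identical score vectors).
        weights[label] = weights.get(label, 0.0) + 1.0 / max(distance, 1e-9)
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


probabilities = density_estimate(TEST_VECTOR, TRAINING)
prediction = max(probabilities, key=probabilities.get)
```

Under these assumptions the test document is assigned to Crime even though its highest raw score is for Struggle, and the probability doubles as a confidence measure.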

Density Estimation Significantly Improves Performance for the Indoor versus Outdoor Data Set

[Figure: bar chart comparing Density Estimation and Rocchio/TF*IDF on Overall Accuracy, Indoor F1, and Outdoor F1; vertical axis from 65% to 95%]

Density Estimation Slightly Degrades Performance for the Events Data Set

[Figure: bar chart comparing Density Estimation and Rocchio/TF*IDF on Overall Accuracy, Struggle F1, Politics F1, Disaster F1, Crime F1, and Other F1; vertical axis from 30% to 100%]

Density Estimation Sometimes Improves Performance, Always Provides Confidence Measures

[Figure: two bar charts comparing Density Estimation and Rocchio/TF*IDF, one for Indoor versus Outdoor and one for Events (Politics, Struggle, Disaster, Crime, Other); vertical axes from 70% to 95%]

Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set:

Confidence Range         # of Images    Overall Accuracy %
High (P ≥ 0.9)           285            92.6
Medium (0.9 > P ≥ 0.7)   98             75.5
Low (0.7 > P ≥ 0.5)      62             72.6

Results of Density Estimation Experiments for the Events Data Set:

Confidence Range         # of Documents    Overall Accuracy %
High (P ≥ 0.9)           301               94.4
Medium (0.9 > P ≥ 0.7)   68                79.4
Low (0.7 > P ≥ 0.5)      60                53.3
Very Low (0.5 > P)       14                42.9

BINS System: Naïve Bayes + Smoothing

• Binning: based on smoothing in the speech recognition literature
    – Not enough training data to estimate term weights for words with scarce evidence
    – Words with similar statistical features are grouped into a common “bin”
• Estimate a single weight for each bin
    – This weight is assigned to all words in the bin
    – Credible estimates even for small (or zero) counts

Binning Uses Statistical Features of Words

Intuition         Word         Indoor Category Count   Outdoor Category Count   Quantized IDF
Clearly Indoor    conference   14                      1                        4
Clearly Indoor    bed          1                       0                        8
Clearly Outdoor   plane        0                       9                        5
Clearly Outdoor   earthquake   0                       4                        6
Unclear           speech       2                       2                        6
Unclear           ceremony     3                       8                        5

“plane”

• Sparse data
    – “plane” does not occur in any Indoor training documents
    – Infinitely more likely to be Outdoor ???
• Assign “plane” to bins of words with similar features (e.g. IDF, category counts)
• In first half of training set, “plane” appears in:
    – 9 Outdoor documents
    – 0 Indoor documents

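Lambdas: Weights

• First half of training set: Assign words to bins
• Second half of training set: Estimate term weights

$$P(\mathrm{obs} \mid \mathrm{bin}) = \frac{1}{|\mathrm{bin}|} \sum_{\mathrm{word} \in \mathrm{bin}} \frac{DF(\mathrm{word})}{|\mathrm{docs}|}$$

$$\lambda_{\mathrm{bin}} = \log_2 P(\mathrm{obs} \mid \mathrm{bin})$$

Lambdas for “plane”: 4.03 times more likely in an Outdoor document

$$P(\mathrm{obs} \mid \mathrm{bin}_{\mathrm{Indoor}}) = 5.31 \times 10^{-3}$$

$$P(\mathrm{obs} \mid \mathrm{bin}_{\mathrm{Outdoor}}) = 2.13 \times 10^{-2}$$

$$\lambda_1 - \lambda_2 = \log_2 \frac{P(\mathrm{obs} \mid \mathrm{bin}_{\mathrm{Indoor}})}{P(\mathrm{obs} \mid \mathrm{bin}_{\mathrm{Outdoor}})} = -2.01$$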
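The lambda computation can be sketched in a few lines. The function names `bin_probability` and `bin_lambda` are hypothetical, the toy document frequencies are invented for illustration, and only the two probabilities for “plane” are taken from the slide.

```python
import math


def bin_probability(document_frequencies, num_docs):
    """P(obs | bin): average over the words in a bin of DF(word) / |docs|."""
    return sum(df / num_docs for df in document_frequencies) / len(document_frequencies)


def bin_lambda(probability):
    """Lambda for a bin: log base 2 of P(obs | bin)."""
    return math.log2(probability)


# Toy bin of four words seen in 1, 0, 2 and 1 of 100 documents (invented numbers).
toy_probability = bin_probability([1, 0, 2, 1], 100)

# Probabilities quoted on the slide for the bins that "plane" falls into.
p_indoor = 5.31e-3
p_outdoor = 2.13e-2
lambda_difference = bin_lambda(p_indoor) - bin_lambda(p_outdoor)
```

With the rounded probabilities above, the difference comes out at roughly -2.00, which agrees with the slide's -2.01 up to rounding of the inputs.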

Binning Credible Log Likelihood Ratios

Intuition         Word         Lambda (Indoor minus Outdoor)   Indoor Category Count   Outdoor Category Count   Quantized IDF
Clearly Indoor    conference   4.84                            14                      1                        4
Clearly Indoor    bed          1.35                            1                       0                        8
Clearly Outdoor   plane        -2.01                           0                       9                        5
Clearly Outdoor   earthquake   -1.00                           0                       4                        6
Unclear           speech       0.84                            2                       2                        6
Unclear           ceremony     -0.50                           3                       8                        5

Lambdas Decrease with IDF

[Figure: “Disaster lambdas” line chart; x-axis: IDF (1 to 8); y-axis: lambda (-11 to -4); series: count=0 and count=1]

Methodology of BINS

• Divide training set into two halves:
    – First half used to determine bins for words
    – Second half used to determine lambdas for bins
• For each test document:
    – Map every word to a bin for each category
    – Add lambdas, obtaining a score for each category
• Switch halves of training and repeat
• Combine results and assign each document to category with highest score
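The methodology can be sketched end to end on toy data. This is a simplified illustration, not the BINS implementation: bins are keyed on the category count alone (the quantized IDF feature is left out), a small floor stands in for whatever handling an empty bin receives, and all function names and documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def document_frequency(docs):
    """Count, for each word, the documents that contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df


def train_fold(bin_half, weight_half, vocabulary, floor=1e-6):
    """Assign words to bins on one half; estimate one lambda per bin on the other.

    Each half maps category -> list of documents (lists of words).
    ASSUMPTION: the bin key is the category count only.
    """
    word_bin, lambdas = {}, {}
    for category in bin_half:
        bin_df = document_frequency(bin_half[category])
        weight_df = document_frequency(weight_half[category])
        num_docs = len(weight_half[category])
        members = defaultdict(list)
        for word in vocabulary:
            members[bin_df[word]].append(word)
        word_bin[category] = {w: key for key, ws in members.items() for w in ws}
        lambdas[category] = {
            # Average DF / |docs| over the bin; the floor avoids log2(0).
            key: math.log2(max(sum(weight_df[w] / num_docs for w in ws) / len(ws), floor))
            for key, ws in members.items()
        }
    return word_bin, lambdas


def score(words, word_bin, lambdas):
    """Add the lambdas of the document's words, giving one score per category."""
    return {
        c: sum(lambdas[c][word_bin[c][w]] for w in set(words) if w in word_bin[c])
        for c in lambdas
    }


def classify(words, half_a, half_b, vocabulary):
    """Run both orderings of the halves, add the scores, pick the best category."""
    totals = Counter()
    for bin_half, weight_half in ((half_a, half_b), (half_b, half_a)):
        fold_scores = score(words, *train_fold(bin_half, weight_half, vocabulary))
        for category, value in fold_scores.items():
            totals[category] += value
    return max(totals, key=totals.get), dict(totals)


# Toy training halves (invented captions reduced to a few words each).
half_a = {"Indoor": [["conference", "room"], ["conference"]],
          "Outdoor": [["plane", "field"], ["plane"]]}
half_b = {"Indoor": [["conference", "bed"], ["room"]],
          "Outdoor": [["plane"], ["field", "earthquake"]]}
vocabulary = sorted({w for half in (half_a, half_b)
                     for docs in half.values() for doc in docs for w in doc})
prediction, totals = classify(["plane", "field"], half_a, half_b, vocabulary)
```

Because every word in a bin shares one lambda, a word with a zero count still receives a finite weight as long as some other word in its bin was observed.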

Binning Improves Performance for the Indoor versus Outdoor Data Set

[Figure: bar chart comparing BINS and Naïve Bayes on Overall Accuracy, Indoor F1, and Outdoor F1; vertical axis from 70% to 95%]

Binning Improves Performance for the Events Data Set

[Figure: bar chart comparing BINS and Naïve Bayes on Overall Accuracy, Struggle F1, Politics F1, Disaster F1, Crime F1, and Other F1; vertical axis from 20% to 100%]

BINS: Robust Version of Naïve Bayes

[Figure: two bar charts comparing BINS and Naïve Bayes, with reference marks labeled “baseline” and “humans”; vertical axes from 70% to 95%]

Combining Bin Weights and Naïve Bayes Weights

• Idea:
    – It might be better to use the Naïve Bayes weight when there is enough evidence for a word
    – Back off to the bin weight otherwise
• BINS allows combinations of weights to be used based on the level of evidence
• How can we automatically determine when to use which weights???
    – Entropy
    – Minimum Squared Error (MSE)

Can Provide File to BINS that Specifies How to Combine Weights

[Figure: two charts, titled “Based on Entropy:” and “Based on MSE:”, each with an axis running from 0 to 1]

• Use only bin weight for evidence of 0
• Average bin weight and NB weight for evidence of 1
• Use only NB weight for evidence of 2 or more
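The three-step schedule can be written as a small interpolation function. This is a sketch of that one schedule only; the function name `combined_weight` is hypothetical and the format of the file that BINS reads is not shown on the slide, so it is not modeled here.

```python
def combined_weight(bin_weight, nb_weight, evidence):
    """Blend the bin weight and the Naive Bayes weight by evidence count.

    Schedule from the example: evidence 0 -> bin weight only,
    evidence 1 -> average of the two, evidence 2 or more -> NB weight only.
    """
    nb_share = {0: 0.0, 1: 0.5}.get(evidence, 1.0)
    return (1.0 - nb_share) * bin_weight + nb_share * nb_weight


# Example with a bin weight of -3.0 and a Naive Bayes weight of -1.0.
weights_by_evidence = [combined_weight(-3.0, -1.0, n) for n in (0, 1, 2, 5)]
```

Other schedules, such as ones derived from entropy or MSE, would only change the table of shares.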

Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet

[Figure: two bar charts comparing BINS (Combo #2), BINS (Combo #1), BINS, and Naïve Bayes; vertical axes from 70% to 95%]

BINS Performs the Best of All Systems Tested

[Figure: two bar charts, vertical axes from 70% to 95%, comparing Density Estimation, Rocchio/TF*IDF, BINS (Combo #2), BINS (Combo #1), BINS, Naïve Bayes, K-Nearest Neighbor, Naïve Bayes [R], Rocchio/TF*IDF [R], K-Nearest Neighbor [R], Probabilistic Indexing [R], Support Vector Machines [R], and Maximum Entropy [R]; the bars for BINS and SVMs are called out in each chart]

How Can We Improve Results?

• One idea: Label more documents!
    – Usually works
    – Boring
• Another idea: Use unlabeled documents!
    – Easily obtainable
    – But can this really work???
    – Maybe it can…

Binning Using Unlabeled Documents

• Apply system to unlabeled documents
• Choose documents with “confident” predictions
    – Each word has new feature: # of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts)
    – Probably less important than regular category counts
    – Binning provides a natural mechanism for weighting the new feature appropriately

Determining Confident Predictions

• BINS computes a score for each category
    – BINS predicts category with highest score
    – Confidence for predicted category is score of that category minus score of second place category
    – Confidence for non-predicted category is score of that category minus score of chosen category
• Cross validation experiments can be used to determine a confidence cutoff for each category
    – Maximize F-beta for the category
    – Beta of 1 gives precision and recall equal weight, lower beta weights precision higher
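The confidence definition and the cutoff search can be sketched as follows. The F-beta formula is the standard one; the function names, the candidate cutoffs, and the toy examples are all hypothetical, and a real run would use cross-validation folds rather than a single list.

```python
def confidences(scores):
    """Confidence per category from a {category: score} mapping.

    Predicted category: its score minus the runner-up's score.
    Any other category: its score minus the predicted category's score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    result = {c: scores[c] - scores[best] for c in scores}
    result[best] = scores[best] - scores[runner_up]
    return result


def f_beta(precision, recall, beta=1.0):
    """Standard F-beta; beta below 1 weights precision more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_cutoff(examples, cutoffs, beta=1.0):
    """Pick the cutoff that maximizes F-beta for one category.

    examples: list of (confidence, is_member) pairs for that category.
    """
    total_members = sum(1 for _, is_member in examples if is_member)

    def f_at(cutoff):
        selected = [is_member for conf, is_member in examples if conf >= cutoff]
        hits = sum(selected)
        precision = hits / len(selected) if selected else 0.0
        recall = hits / total_members if total_members else 0.0
        return f_beta(precision, recall, beta)

    return max(cutoffs, key=f_at)


example_confidences = confidences({"Struggle": 10.0, "Politics": 4.0, "Crime": 1.0})
toy_examples = [(50, True), (30, True), (10, False), (-20, True), (-40, False)]
cutoff_f1 = best_cutoff(toy_examples, [-50, 0, 40], beta=1.0)
cutoff_precise = best_cutoff(toy_examples, [-50, 0, 40], beta=1 / 3)
```

On the toy data, F1 prefers the lowest cutoff while F(1/3) prefers the highest, which is the trade-off the lower beta is meant to produce.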

Use F to Optimize Confidence Cutoffs (example for a single category)

[Figure: “Results for Struggle Category” line chart; x-axis: Confidence Cutoff (-600 to 300); y-axis: Value of Metrics (0 to 1); series: Precision, Recall, F1, F(1/3)]

Use F to Optimize Confidence Cutoffs (important region of graph highlighted)

[Figure: “Results for Struggle Category” line chart; x-axis: Confidence Cutoff (0 to 100); y-axis: Value of Metrics (0.5 to 1); series: Precision, Recall, F1, F(1/3)]

Should the New Feature Matter?

[Figure: “zero count lambdas (IDF=8, beta=0.5)” line chart; x-axis: unlabeled category count (0 to 9); y-axis: lambda (-12 to -5); series: Dis, Str, Pol, Cri]

[Figure: “zero count lambdas (category=Disaster, beta=1.0)” line chart; x-axis: unlabeled category count (0 to 10); y-axis: lambda (-12 to -5); series: IDF=5, IDF=6, IDF=7, IDF=8]

Does the New Feature Help?

• No
• Why???
    – New features add info but make bins smaller
    – Perhaps more data isn’t needed in the first place
• Should more data matter?
    – Hard to accumulate more labeled data
    – Easy to try out less labeled data!

Does Size Matter?

[Figure: “Effect of Training Data” line chart; x-axis: Percentage Used (10% to 100%); y-axis: Performance (55% to 95%); series: IN/OUT and EVENTS]


Performance of Standard Systems Not Very Satisfying

[Figure: bar chart of system performance on the Disaster image data set, vertical axis from 52% to 66%; legend entries that survive include Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]

Ambiguity for Disaster Images: Workers Responding vs. Affected People

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.

Workers Responding Affected People

Summary of Observations About Task

• Need to distinguish foreground from background, determine focus of image
• Not all words are important; some are misleading
• Hypothesis: the main subject and verb are particularly useful for this task
    – Problematic for bag of words approaches
    – Need linguistic analysis to determine predicate argument relationships


Hypothesis: Subject and Verb are Useful Clues

Subject      Verb       Category             Guessable?
Truck        makes      Wreckage             No
couple       mourn      Affected People      Yes
blocks       suffered   Wreckage             Yes
NAME         gather     Affected People      No
child        sleeps     Affected People      Yes
inspectors   search     Workers Responding   Yes
NAME         observes   Workers Responding   No
workers      confer     Workers Responding   Yes
child        covers     Affected People      Yes
chimney      stands     Wreckage             Yes

Experiments with Human Subjects: 4 Conditions

Test Hypothesis: Subject and Verb are Useful Clues

SENT: First sentence of caption

Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.

RAND: All words from first sentence in random order

At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire

IDF: Top two TF*IDF words

disco rescuers

S-V: Subject and verb

subject = “rescuers”, verb = “carry”

• More words are better than fewer words
    – SENT, RAND > S-V, IDF
• Syntax is important
    – SENT > RAND; S-V > IDF

Experiments with Human Subjects: Results

Hypothesis: Subject and Verb are Useful Clues

[Figure: bar chart of accuracy for the SENT, RAND, IDF, and S-V conditions; vertical axis from 50% to 95%]

Condition   Average Time (in seconds)
RAND        68
SENT        34
IDF         22
S-V         20

RAND is Very Slow!

• Perhaps human subjects unscrambled words, regaining syntactic information

Condition   Average Time (in seconds)
RAND        68
SENT        34
IDF         22
S-V         20

Using Just Two Words (S-V) Almost as Good as All the Words (Bag of Words)

[Figure: bar chart with groups SENT and S-V, vertical axis from 52% to 66%; legend entries that survive include Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]

Operational NLP Based System

• For each test document:
    – Extract subject and verb
    – Compare to those from training set using some method of word-to-word similarity
    – Based on similarities, generate a score for every category

[Diagram: Sentence, POS tagger, CASS shallow parser, Perl script, WordNet, Output]

Subjects   83.9%
Verbs      80.6%

• Extract subjects and verbs from all documents in training set

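Word Similarity

• Examine large “extended corpus” to generate many subject/verb pairs
• Use to compute similarities:

$$\mathrm{Sim}_{Sub_1, Sub_2} = \frac{\#\,\text{verbs in common}}{\#\,\text{verbs total}}$$

$$\mathrm{Sim}_{Verb_1, Verb_2} = \frac{\#\,\text{subjects in common}}{\#\,\text{subjects total}}$$

$$\mathrm{Sim}_{Sub, Verb} = \frac{2 \cdot \#\,\text{appearances together}}{\#\,\text{appearances total}}$$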
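A minimal sketch of the three similarity measures over a list of subject/verb pairs. Two readings are assumptions on my part: “total” in the first two measures is taken to be the union of the two words' partner sets, and “appearances total” in the third is taken to be the subject's appearances plus the verb's appearances. The pairs and function names are invented for illustration.

```python
def build_tables(pairs):
    """Index subject/verb pairs: verbs seen with each subject, subjects with each verb."""
    verbs_of, subjects_of = {}, {}
    for subject, verb in pairs:
        verbs_of.setdefault(subject, set()).add(verb)
        subjects_of.setdefault(verb, set()).add(subject)
    return verbs_of, subjects_of


def overlap(a, b):
    """Items in common divided by items in total (ASSUMPTION: total = union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_subjects(s1, s2, verbs_of):
    return overlap(verbs_of.get(s1, set()), verbs_of.get(s2, set()))


def sim_verbs(v1, v2, subjects_of):
    return overlap(subjects_of.get(v1, set()), subjects_of.get(v2, set()))


def sim_subject_verb(subject, verb, pairs):
    """2 * joint appearances / (subject appearances + verb appearances)."""
    together = sum(1 for s, v in pairs if s == subject and v == verb)
    total = sum(1 for s, _ in pairs if s == subject) + sum(1 for _, v in pairs if v == verb)
    return 2 * together / total if total else 0.0


# Invented pairs standing in for the extended corpus.
pairs = [("rescuers", "carry"), ("rescuers", "search"), ("workers", "search"),
         ("workers", "carry"), ("workers", "confer"), ("child", "sleeps")]
verbs_of, subjects_of = build_tables(pairs)
subject_similarity = sim_subjects("rescuers", "workers", verbs_of)
verb_similarity = sim_verbs("carry", "search", subjects_of)
pair_similarity = sim_subject_verb("rescuers", "carry", pairs)
```

Similarities of this kind let a test subject such as “rescuers” borrow evidence from training subjects that take the same verbs.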

Choosing a Category

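• For given test document d, calculate total score for every category c:

$$\mathrm{Score}(c \mid d) = \sum_{s_c \in S_c} \mathrm{Sim}_{s_d, s_c}\,\mathrm{Sim}_{v_d, s_c} + \sum_{v_c \in V_c} \mathrm{Sim}_{s_d, v_c}\,\mathrm{Sim}_{v_d, v_c}$$

• Choose category with highest score
• If subject is NAME, a bit more complicated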

The NLP Based System Beats All Others by a Considerable Margin

[Figure: bar chart with groups SENT and S-V, vertical axis from 52% to 68%; legend entries that survive include NLP Based System, Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]


The NLP Based System is in the Middle of the Pack for the Politics Image Data Set

[Figure: bar chart with groups SENT and S-V, vertical axis from 35% to 65%; legend entries that survive include NLP Based System, Density Estimation, Rocchio/TF*IDF, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]

Why is the Performance for the NLP Based System not as Strong for the Politics Image Data Set?

• A much wider range of performance scores
    – Range for Politics images is 36% to 64.7%
    – Range for Disaster images is 54% to 59.7%
    – The top systems are harder to beat
• Too many proper names as subjects
    – 60% of test instances for Politics images
    – Only 13% of test instances for Disaster images
    – For 60% of test documents, only one word (the main verb) is being used to determine the prediction



The Original Premise

• For the Disaster image data set, the performance of the NLP based system still leaves room for improvement
    – NLP based system achieves 65% overall accuracy for the Disaster image data set
    – Humans viewing all words in random order achieve about 75%
    – Humans viewing full first sentence achieve over 90%
• Main subject and verb are particularly important, but sometimes other words might offer good clues

Higinio Guereca carries family photos he retrieved from his mobile home which was destroyed as a tornado moved through the Central Florida community, early December 27.

Choosing Indicative Words

• Let x be the number of training documents containing a word w
• Let p be the proportion of these documents that belong to category c
• If x > X and p > P then w is indicative of c
• X and P can be varied to generate lists of indicative words
• Lists can be pruned manually
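The selection rule can be sketched directly from the definitions of x and p. The function name `indicative_words`, the thresholds used in the example, and the toy documents are hypothetical; manual pruning is not modeled.

```python
from collections import Counter, defaultdict


def indicative_words(documents, min_count, min_proportion):
    """Return {word: (category, x, p)} for words with x > X and p > P.

    documents: list of (words, category) pairs from the training set.
    x counts training documents containing the word; p is the share of those
    documents that belong to the word's most frequent category.
    """
    doc_count = Counter()
    by_category = defaultdict(Counter)
    for words, category in documents:
        for word in set(words):
            doc_count[word] += 1
            by_category[word][category] += 1
    result = {}
    for word, x in doc_count.items():
        category, hits = by_category[word].most_common(1)[0]
        p = hits / x
        if x > min_count and p > min_proportion:
            result[word] = (category, x, p)
    return result


# Toy training captions (invented), each reduced to a few words.
training = [
    (["rescue", "workers"], "Workers Responding"),
    (["rescue", "fire"], "Workers Responding"),
    (["rescue", "team"], "Workers Responding"),
    (["fire", "family"], "Affected People"),
]
selected = indicative_words(training, min_count=2, min_proportion=0.9)
```

Raising X or P shortens the list and trades coverage for precision.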

Selected Indicative Words for the Disaster Image Data Set

Word        Indicated Category    Total Count (x)   Proportion (p)
her         Affected People       7                 1.0
his         Affected People       7                 0.86
family      Affected People       6                 0.83
relatives   Affected People       6                 1.0
rescue      Workers Responding    15                1.0
search      Workers Responding    9                 1.0
similar     Other                 2                 1.0
soldiers    Workers Responding    6                 1.0
workers     Workers Responding    12                1.0

Selected Indicative Words for the Politics Image Data Set

Word           Indicated Category        Total Count (x)   Proportion (p)
hands          Meeting                   10                0.90
journalists    Announcement              4                 1.0
local          Civilians                 4                 1.0
media          Announcement              3                 1.0
presidential   Politician Photographed   9                 0.78
press          Announcement              7                 0.71
reporters      Announcement              8                 0.88
meeting        Meeting                   15                0.73
session        Meeting                   6                 0.83
victory        Politician Photographed   6                 0.83
waves          Politician Photographed   4                 1.0
wife           Politician Photographed   6                 1.0

High-Precision/Low-Recall Rules

• If a word w that indicates category c occurs in a document d, then assign d to c
• Every selected indicative word has an associated “rule” of the above form
    – Each rule is very accurate but rarely applicable
    – If only rules are used:
        • most predictions will be correct (hence, high precision)
        • most instances of most categories will remain unlabeled (hence, low recall)

Combining the High-Precision/Low-Recall Rules with Other Systems

• Two-pass approach:
    – Conduct a first pass using the indicative words and the high-precision/low-recall rules
    – For documents that are still unlabeled, fall back to some other system
• Compared to the fall back system:
    – If the rules are more accurate for the documents to which they apply, overall accuracy will improve!
    – Intended to improve the NLP based system, but easy to test with other systems as well
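The two-pass approach can be sketched as a thin wrapper around any fall back classifier. How conflicts between rules are resolved is not stated on the slide, so letting the first matching word win is an assumption; the function name, the rule table, and the stand-in fall back system are hypothetical.

```python
def classify_two_pass(words, rules, fallback):
    """First pass: apply high-precision rules. Second pass: fall back.

    rules: {indicative word: category}
    fallback: callable taking the word list and returning a category
    ASSUMPTION: the first indicative word found in the document decides.
    """
    for word in words:
        if word in rules:
            return rules[word]
    return fallback(words)


# A few rules in the spirit of the indicative word lists.
rules = {"rescue": "Workers Responding", "relatives": "Affected People"}


def always_other(words):
    """Stand-in for a real fall back system such as BINS."""
    return "Other"


by_rule = classify_two_pass(["relatives", "wait", "outside"], rules, always_other)
by_fallback = classify_two_pass(["chimney", "stands"], rules, always_other)
```

Because the rules only fire on a small share of documents, the fall back system still decides most predictions.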

The Rules Improve Every Fall Back System for the Disaster Image Data Set

[Figure: bar chart with groups Without Rules and With Rules, vertical axis from 52% to 68%; legend entries that survive include NLP Based System, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]

The Rules Improve 7 of 8 Fall Back Systems for the Politics Image Data Set

[Figure: bar chart with groups Without Rules and With Rules, vertical axis from 35% to 65%; legend entries that survive include NLP Based System, BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]



Low-Level Image Features

• Collaboration with Paek and Benitez
    – They have provided me with information, pointers to resources, and code
    – I have reimplemented some of their code
• Color histograms
    – Based on entire images or image regions
    – Can be used as input to machine learning approaches (e.g. kNN, SVMs)

Color

• Three components to color
    – Red, green, blue (RGB)
    – Hue, saturation, value (HSV)
• Can convert from RGB to HSV
    – Can quantize HSV triples
    – 18 hues * 3 saturations * 3 values + 4 grays = 166 slots

Color Histograms

• For each pixel of image, compute its quantized HSV triple
• Color histogram of image is vector such that:
    – There are 166 dimensions
    – Each dimension represents one possible HSV triple
    – Value of dimension is proportion of pixels with associated HSV triple
• Can be computed for image regions and concatenated together
• Can be input for machine learning techniques
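A minimal sketch of a 166-slot color histogram using the standard library's `colorsys` module. The slides give only the slot counts, so the quantization details are assumptions: the saturation threshold below which a pixel counts as gray, the uniform bin boundaries, and the slot ordering. The function names are hypothetical, and this is not the code by Paek and Benitez.

```python
import colorsys

NUM_SLOTS = 18 * 3 * 3 + 4  # 162 color slots plus 4 grays = 166
GRAY_SATURATION = 0.1       # ASSUMPTION: below this saturation a pixel is gray


def slot(r, g, b):
    """Map an RGB pixel (0 to 255 per channel) to one of the 166 slots."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < GRAY_SATURATION:
        return 162 + min(int(v * 4), 3)       # four gray levels by value
    hue_bin = min(int(h * 18), 17)
    sat_bin = min(int(s * 3), 2)
    val_bin = min(int(v * 3), 2)
    return (hue_bin * 3 + sat_bin) * 3 + val_bin


def color_histogram(pixels):
    """Proportion of pixels falling in each slot."""
    counts = [0] * NUM_SLOTS
    for r, g, b in pixels:
        counts[slot(r, g, b)] += 1
    total = len(pixels)
    return [count / total if total else 0.0 for count in counts]


def region_histograms(image, rows=8, cols=8):
    """Concatenate histograms of a rows x cols grid of equal-sized regions.

    image: list of pixel rows, each a list of (r, g, b) tuples.
    """
    height, width = len(image), len(image[0])
    features = []
    for i in range(rows):
        for j in range(cols):
            region = [
                image[y][x]
                for y in range(i * height // rows, (i + 1) * height // rows)
                for x in range(j * width // cols, (j + 1) * width // cols)
            ]
            features.extend(color_histogram(region))
    return features


histogram = color_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 0), (255, 255, 255)])
red_image = [[(255, 0, 0)] * 8 for _ in range(8)]
region_features = region_histograms(red_image)
```

The resulting vectors can be passed as feature vectors to a learner such as kNN or an SVM.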

Images Divided into 8 x 8 Rectangular Regions of Equal Size

Using Color Histograms to Predict Labels for the Indoor versus Outdoor Data Set

[Figure: bar chart comparing K-Nearest Neighbor and Support Vector Machines on whole images and on image regions; vertical axis from 70% to 80%]

Combining Text and Image Features

• Combining systems has had mixed results in the TC literature, but:
    – Most attempts have involved systems that use the same features (bag of words)
    – There is little reason to believe that indicative text is correlated with indicative low-level image features
• Most text based systems are beating the image based systems, but:
    – Distance from optimal hyperplane can be used as a confidence measure for support vector machine
    – Predictions with high confidence may be more accurate than text systems

Accuracy of Support Vector Machine Approach Tends to be Higher when Confidence is Greater

Distance Cutoff   Overall Accuracy %   % of Images Above Cutoff
3.5               ---                  0.0
3.0               100.0                0.4
2.5               87.5                 1.8
2.0               92.3                 5.8
1.5               94.4                 16.0
1.0               91.0                 34.1
0.5               84.6                 70.1
0.0               78.0                 100.0
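One way to act on these confidences is to let the image-based SVM override the text system only when its distance from the hyperplane is large. The slides do not spell out the combination rule, so this is an assumed scheme: the function name, the sign convention (positive means Outdoor), and the cutoff of 1.5 are all hypothetical.

```python
def combined_prediction(text_label, svm_distance, cutoff=1.5):
    """Use the image SVM when it is confident, otherwise keep the text label.

    svm_distance: signed distance from the separating hyperplane.
    ASSUMPTION: positive distances mean Outdoor, negative mean Indoor.
    """
    if abs(svm_distance) >= cutoff:
        return "Outdoor" if svm_distance > 0 else "Indoor"
    return text_label


confident_image = combined_prediction("Indoor", 2.0)
unsure_image = combined_prediction("Indoor", 0.3)
confident_indoor = combined_prediction("Outdoor", -1.7)
```

A cutoff would normally be tuned on held-out data, since a higher cutoff is more accurate but applies to fewer images.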

The Combination of Text and Image Beats Text Alone: Most systems show small gains, one has major improvement

[Figure: bar chart with groups Text Only and Text and Image, vertical axis from 77% to 89%; legend entries that survive include BINS, Naïve Bayes [R], Rocchio/TF*IDF [R], and Maximum Entropy [R]]



Newsblaster Categories

Entertainment Science/Technology Sports

U.S. News World News Finance

Newsblaster

• A pragmatic showcase for NLP
• My contributions:
    – Extraction of images and captions from web pages
    – Image browsing interface
    – Categorization of stories (clusters) and images
    – Scripts that allow users to suggest labels for articles with incorrect predictions



Conclusions

• TC techniques can be used to categorize images
• Many methods exist
    – No clear winner for all tasks
    – BINS is very competitive
    – NLP can lead to substantial improvement, at least for certain tasks
    – High-precision/low-recall rules are likely to improve performance for tough tasks
    – Image features show promise
• Newsblaster demonstrates pragmatic benefits of my work

Future Work

• BINS
    – Explore additional binning features
    – Explore use of unlabeled data
• NLP and TC
    – Improve current system
    – Explore additional categories
• Image features
    – Explore additional low-level image features
    – Explore better methods of combining text and image

And Now the Questions…
