Combining Image and Language to Predict and Understand the...

Date post: 12-Jun-2020
Transcript
Page 1: Combining Image and Language to Predict and Understand the Usefulness of Yelp Reviews · cs231n.stanford.edu/reports/2017/posters/816.pdf

Background

Results

• Cast the prediction problem as both regression and classification

• Predict usefulness by combining two neural networks:

  • A CNN on an image associated with the review

  • An attention-based convolutional RNN (ConvRNN) on the text of the review

• Analyze the attention weights to gain insight into what makes a review useful

Using 10.2K reviews

- Class 0: reviews with 0 useful votes

- Class 1: reviews with >9 useful votes

- A subset of images from the 200,000 business pictures provided by Yelp
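
The class definitions above can be sketched as a small labeling function. This is a minimal illustration, not the author's code: the function name `usefulness_class` is hypothetical, and dropping reviews with 1 to 9 useful votes from the classification set is an assumption, since the poster defines only the two classes.

```python
from typing import Optional


def usefulness_class(useful_votes: int) -> Optional[int]:
    """Map a useful-vote count to the binary label described on the poster.

    Returns 0 for reviews with no useful votes, 1 for reviews with more
    than 9 useful votes, and None for everything in between.
    """
    if useful_votes < 0:
        raise ValueError("vote count cannot be negative")
    if useful_votes == 0:
        return 0
    if useful_votes > 9:
        return 1
    # Assumption: reviews with 1..9 votes belong to neither class
    # and are left out of the classification data.
    return None


# Hypothetical vote counts, for illustration only.
votes = [0, 3, 10, 27, 0, 9]
labels = [usefulness_class(v) for v in votes]
kept = [(v, c) for v, c in zip(votes, labels) if c is not None]
```

For the regression task, the raw vote count would presumably serve as the target directly, so no thresholding is needed there.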

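[Figure: Number of ConvRNN Layers vs. Classification Accuracy. x-axis: number of ConvRNN layers (0 to 5); y-axis: accuracy (71% to 80%).]

[Figure: Number of ConvRNN Layers vs. Regression Accuracy (RMS). x-axis: number of ConvRNN layers (0 to 4); y-axis: RMS error (2.79 to 2.85).]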

[Figure: Classification Results (All Models). Bar chart of accuracy (axis 50% to 80%) for Image+ConvRNN, ConvRNN, RNN+Attention, Bi-RNN, and SVM.]

[Figure: Regression Results (All Models). Bar chart of RMSE (axis 2.6 to 3.2) for Image+ConvRNN, ConvRNN, RNN+Attention, Bi-RNN, and Linear Regression.]
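
The two metrics reported in the result charts, classification accuracy and RMSE, can be computed as in the following sketch. The function names and the sample values are illustrative assumptions; none of the numbers below come from the poster.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted vote counts."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up labels and vote counts, for illustration only.
acc = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
err = rmse([0, 10, 3], [1, 8, 3])
```

Higher is better for accuracy and lower is better for RMSE, which is why the two charts should be read in opposite directions.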

Combining Image and Language to Predict and Understand the Usefulness of Yelp Reviews

David Z Liu ([email protected])

Challenge

• Predicting a review's usefulness is important and challenging

• By knowing a review's usefulness in advance, a business can recommend high-quality, fresh reviews to customers and gain business insight

• Among reviews of the same length, no obvious features directly indicate the usefulness of a document

• Only a small fraction of the data has a significant number of useful votes

Approach

Data Set

Link Image With Text

- Link a business photo to a review written about that business by matching words in the image caption with words in the review text

- Pick the image with the highest Jaccard similarity
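
The caption-matching step can be sketched as follows. This is a minimal illustration under stated assumptions: the tokenizer, the helper names (`tokenize`, `jaccard`, `best_photo`), and the choice to return no photo when nothing overlaps are all hypothetical, since the poster specifies only that Jaccard similarity between caption words and review words is used.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_photo(review_text, photos):
    """Pick the photo whose caption best matches the review.

    photos: list of (photo_id, caption) pairs for the same business.
    Returns the best photo_id, or None if no caption shares a word
    with the review (an assumption, not stated on the poster).
    """
    review_words = tokenize(review_text)
    scored = [(jaccard(tokenize(caption), review_words), photo_id)
              for photo_id, caption in photos]
    if not scored:
        return None
    best_score, best_id = max(scored, key=lambda pair: pair[0])
    return best_id if best_score > 0 else None


# Hypothetical photos and review, for illustration only.
photos = [("p1", "chicken tacos"), ("p2", "outdoor patio seating")]
chosen = best_photo("The chicken tacos were amazing", photos)
```

Because Jaccard similarity divides by the size of the union, a long review is penalized against a short caption; a real pipeline might remove stop words first, but that is beyond what the poster describes.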

Model

• Input to the model:

  • Written review to the RNN. Text is converted to 300-dimensional GloVe vectors

  • Associated image to the CNN, converted to 64x64x3

• Output from the model:

  • Classification or regression prediction

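[Figure: Model architecture. Image branch (CNN) feature-map sizes: 64x64x3, 32x32x8, 16x16x16, 8x8x64, 4x4x64. Text branch (ConvRNN), with example input "I Love food and Images": 300x300x1, 200x200x16, 100x100x16, 100.]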

(0 = bidirectional RNN without attention)
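
The image-branch sizes in the model diagram halve spatially at every stage (64, 32, 16, 8, 4). The sketch below reproduces that progression with the standard output-size formula for a convolution or pooling layer. The 2x2 window with stride 2 is an assumption chosen because it yields the halving; the poster does not state the actual kernel sizes, strides, or padding. Only the channel counts are taken from the diagram.

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# Channel counts per stage, as printed on the poster's CNN diagram.
channels = [3, 8, 16, 64, 64]

size = 64  # input images are 64x64x3
shapes = [(size, size, channels[0])]
for c in channels[1:]:
    # Assumed 2x2 window with stride 2, which halves each spatial dimension.
    size = conv_output_size(size, kernel=2, stride=2)
    shapes.append((size, size, c))

for shape in shapes:
    print("x".join(str(d) for d in shape))
```

Other configurations, such as a 3x3 convolution with padding 1 followed by 2x2 pooling, give the same sizes, so the diagram alone does not pin down the layer design.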

Visualizing Attention Weight

A typical review with the top 25 words highlighted in green based on attention intensity. Size corresponds to the attention weight.

[Figure: attention visualizations with the corresponding prediction for four models: Image + ConvRNN, ConvRNN, Bidirectional RNN With Attention, and Bidirectional RNN Without Attention.]
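
Selecting the most attended words for such a visualization can be sketched as below. This assumes the attention weights come from a softmax over per-word scores, which is the common formulation but is not stated on the poster; the function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended_words(words, weights, k=25):
    """Return the k (word, weight) pairs with the largest attention weight."""
    if len(words) != len(weights):
        raise ValueError("words and weights must have the same length")
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up per-word scores, for illustration only.
words = ["the", "tacos", "were", "great"]
weights = softmax([0.1, 3.0, 0.2, 2.0])
highlighted = top_attended_words(words, weights, k=2)
```

The returned weights could then drive both the highlight color and the rendered word size.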
