Deep Learning Applications
Fall 2018
Outline
► Problem Definition
► Background & Related Works
► Proposed Method
► Experimental Results
► Conclusion & Future Works
Problem Definition
Introduction
► What is Multimodal Data?
► Multiple channels of input
► Multiple views of the same concept
Example: "A Red Bird in a jungle." The same concept, "Red Bird," rendered in Persian (پرنده‌ی قرمز), Arabic (الطائر الأحمر), Chinese (红鸟), and Russian (Красная Птица).
Applications
► Helps inter-modal retrieval.
► Helps intra-modal retrieval.
► Helps classification or clustering.
Example: when you type "مطمئن" (Persian for "confident"), the Google search engine retrieves this image. When you search for images similar to the left image, the Google search engine retrieves the right image. (Figure labels: Sport, Delicious)
Challenges
► Distinct modality-specific properties.
► High correlation between modalities.
► Higher intra-modality than inter-modality correlation.
Figure examples: "Man eating apes," "Man-eating apes," "Red apple on the book" (illustrating more vs. less correlation).
Problem Formulation
► Inputs:
► Two modalities, such as X and Z
► Goals:
► Extracting the most informative representation from X and Z
► The ability to generate the missing modality from the one that is present
Background
Deep Neural Networks
► Traditional neural networks scaled up with:
► More training data
► Deeper architectures
► Better optimization algorithms
► Popular deep neural networks:
► Stacked denoising auto-encoders
► Recurrent neural networks
► Generative adversarial networks
De-noising Auto-encoders
► Corrupt the input data with noise.
► Find a representation of the corrupted data whose reconstruction retains the most information about the clean input.
Figure: the clean input 𝑿 is corrupted to 𝑿෩, encoded to 𝒀, and reconstructed as 𝒁 so as to maximize 𝑰(𝑿, 𝒁).
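As a toy illustration of this idea, the sketch below trains a one-layer denoising auto-encoder that corrupts its input with Gaussian noise but is penalized for reconstruction error against the clean input (sizes, noise level, and learning rate are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

class DenoisingAutoencoder:
    """One-layer denoising auto-encoder with tied weights, trained by
    plain gradient descent on the squared reconstruction error against
    the CLEAN input (all sizes and hyperparameters are illustrative)."""

    def __init__(self, n_in, n_hidden, lr=0.05):
        self.W = 0.1 * rng.standard_normal((n_in, n_hidden))
        self.b = np.zeros(n_hidden)
        self.c = np.zeros(n_in)
        self.lr = lr

    def encode(self, x):
        return np.tanh(x @ self.W + self.b)

    def decode(self, y):
        return y @ self.W.T + self.c  # tied weights

    def train_step(self, x_clean, noise_std=0.1):
        x_tilde = x_clean + noise_std * rng.standard_normal(x_clean.shape)
        y = self.encode(x_tilde)      # representation of the corrupted input
        x_hat = self.decode(y)
        err = x_hat - x_clean         # compare with the clean input
        dy = (err @ self.W) * (1 - y ** 2)
        self.W -= self.lr * (np.outer(x_tilde, dy) + np.outer(err, y))
        self.b -= self.lr * dy
        self.c -= self.lr * err
        return 0.5 * float(err @ err)
```

Repeated calls to `train_step` on the same clean vector drive the reconstruction error down toward the noise floor.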
Stacking Auto-encoders (SAE)
► Extract high-level representations by stacking auto-encoders in a deep manner.
Figure: stacked auto-encoders encode 𝑿 → 𝒀 → 𝒁, with reconstructions 𝒀′ and 𝑿′.
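A minimal sketch of greedy layer-wise stacking, assuming each layer is a tied-weight tanh auto-encoder trained on the output of the layer below (all sizes and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, lr=0.05, epochs=200):
    """Train one tied-weight tanh auto-encoder layer on data matrix X
    and return its encoder weights (toy batch gradient descent)."""
    n_in = X.shape[1]
    W = 0.1 * rng.standard_normal((n_in, n_hidden))
    for _ in range(epochs):
        Y = np.tanh(X @ W)
        err = Y @ W.T - X
        dY = (err @ W) * (1 - Y ** 2)
        W -= lr * (X.T @ dY + err.T @ Y) / len(X)
    return W

def stack_autoencoders(X, layer_sizes):
    """Greedy layer-wise stacking: each new auto-encoder is trained on
    the representation produced by the layers below it."""
    weights, H = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(H, n_hidden)
        weights.append(W)
        H = np.tanh(H @ W)  # this representation feeds the next layer
    return weights, H
```

After greedy pretraining, the whole stack is typically fine-tuned end to end.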
Recurrent Neural Networks (RNNs)
► Feedforward networks with additional recurrent edges
► Powerful for sequential data such as sentences
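A minimal vanilla-RNN forward pass showing how the recurrent edge carries state across a sequence; the word vectors and dimensions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_forward(xs, Wxh, Whh, bh):
    """Vanilla RNN: each hidden state mixes the current input with the
    previous state through the recurrent weight matrix Whh."""
    h = np.zeros(Whh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        states.append(h)
    return states

# Process a toy "sentence" of 5 word vectors of dimension 4.
d_in, d_h = 4, 3
Wxh = 0.1 * rng.standard_normal((d_h, d_in))
Whh = 0.1 * rng.standard_normal((d_h, d_h))
bh = np.zeros(d_h)
sentence = [rng.standard_normal(d_in) for _ in range(5)]
states = rnn_forward(sentence, Wxh, Whh, bh)
```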
Generative Adversarial Networks (GANs) [3]
[3] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
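For reference, [3] trains a generator G against a discriminator D with the two-player minimax objective:

```latex
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```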
Related Works
Multimodal Deep Learning [1]
► Uses two modality-specific auto-encoders with a joint layer on top of them.
► Trains the network to reconstruct each modality from both the other modality and itself.
[1] Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
MDL-CW: A multimodal deep learning framework with cross weights [2]
[2] Rastegar, Sarah, et al. "Mdl-cw: A multimodal deep learning framework with cross weights." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Generative Adversarial Text to Image Synthesis [4]
[4] Reed, Scott, et al. "Generative adversarial text to image synthesis." arXiv preprint arXiv:1605.05396 (2016).
Prior Works
Approach | Pros | Cons
SAE (Ngiam 2011, Sohn 2014) | Simple implementation | Discards low-level interactions
MDL-CW (Rastegar 2016) | Considers lower-level interactions | Non-generative
RNN (Socher 2013, Karpathy 2014, Karpathy 2015) | Considers sentence structure | Convergence problems
GAN (Reed 2016) | Generative | Memorization
Proposed Method
Shadow Networks
► Train a network to detect when a certain class is absent
Relativeness
► Two relative data items are similar in one particular sense
► Binary relativeness
► Fuzzy relativeness
► Relativeness is a function of the representation level
Representation Binding by Degree K
► For each of the K final layers, the representations of two relative data items are the same
► Relativeness is a function of level, so two data items that are relative at one level can be irrelative at other levels
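One way to express binding by degree K as a training penalty, assuming each network exposes its per-layer representations as a list ordered from first to last layer (an illustrative sketch, not the exact loss from the slides):

```python
import numpy as np

def binding_loss(reps_a, reps_b, k):
    """Binding by degree K: penalize the squared distance between the
    representations of two relative inputs at the K final layers only;
    earlier layers are left free to differ. Lists are assumed ordered
    from first to last layer."""
    return sum(float(np.sum((ra - rb) ** 2))
               for ra, rb in zip(reps_a[-k:], reps_b[-k:]))
```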
Binding Representations for Both Networks
► Main network:
► For each level, choose the nearest neighbors among relatives from the higher layer
► Bind the representations in this layer for these relatives
Layer | Relatives
Final | Horses
Before final | Dark Horses
Two before final | Dark Arabic Horses
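The neighbor-selection step could be sketched as below; the distance metric and candidate list are assumptions. The same helper with `farthest=True` matches the shadow network's rule:

```python
import numpy as np

def pick_relatives(rep, candidate_reps, n, farthest=False):
    """Select n relatives by distance in the higher layer's representation
    space: the main network binds to the NEAREST candidates, while the
    shadow network binds to the FARTHEST ones (illustrative sketch)."""
    d = np.array([np.linalg.norm(rep - c) for c in candidate_reps])
    order = np.argsort(d)
    if farthest:
        order = order[::-1]
    return [int(i) for i in order[:n]]
```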
Binding Representations for Both Networks
► Shadow network:
► For each level, choose the farthest neighbors among relatives from the higher layer
► Bind the representations in this layer for these relatives
Layer | Relatives
Final | Non-Horses
Before final | Dog, Plane, Table, …
Cross Edges
► Learn cross-edge weights between the shadow and main networks
Representation Gating
► Three representations are available from the lower layer
► Modality presence signals are used to deduce the final representation
Figure: a gate combines the same-modality, cross-modality, and cross-modality-shadow representations, conditioned on the modality presence signals, to produce the higher-level same-modality and cross-modality representations.
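A toy sketch of the gating step, assuming the gate coefficients have already been derived from the modality presence signals (the actual gate in this work is learned):

```python
import numpy as np

def gated_representation(reps, gate_weights):
    """Combine the candidate representations (same-modality,
    cross-modality, cross-modality-shadow) with gate coefficients
    derived from the modality presence signals; weights are
    renormalized so the output stays on the same scale."""
    w = np.asarray(gate_weights, dtype=float)
    w = w / w.sum()
    return sum(wi * r for wi, r in zip(w, reps))
```

For example, zeroing the coefficient of the shadow path keeps the final representation from being corrupted by a weaker candidate.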
Experimental Results
Experimental Results
► We used the PASCAL-Sentence dataset for this section:
► Each image is annotated with 5 sentences
► 500 training and 500 test images
► 1408 textual features
► 260 visual features
PASCAL-Sentence Dataset Experiments
Figure: retrieval results for text to whole (image and text) and image to whole (image and text).
Compared methods: [1], [6], [7], [9], [12], [10].
Qualitative Results
Figure: image to whole (image & text) and text to whole (image & text).
Conclusion
Conclusions
► Using shadow networks allows us to detect the non-existence of topics
► Using representation binding leads to better generalization
► Gating representations preserves the informative representation and does not corrupt it with weaker representations
Future Works
Creation and Deception
Figure: the main network is paired with a main generator (creation), and the shadow network with a shadow generator (deception).
Creation
► The main generator generates synthetic data that carries the desired label
Deception
► The shadow generator generates data that deceives the main network into assigning a wrong label
Future Works
►Neuron augmentation
►Using RNNs to distinguish between creation and deception
►Implementing brain cognitive functions
►Implementing social interactions between networks
Thank You!
References
1. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.
2. S. Rastegar, M. Soleymani, H. R. Rabiee, and S. M. Shojaee, “MDL-CW: A multimodal deep learning framework with cross weights,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2601–2609.
3. N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in neural information processing systems, 2012, pp. 2222–2230.
4. R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, “Grounded compositional semantics for finding and describing images with sentences,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 207–218, 2014.
5. K. Sohn, W. Shang, and H. Lee, “Improved multimodal deep learning with variation of information,” in Advances in Neural Information Processing Systems, 2014, pp. 2141–2149.
6. A. Karpathy, A. Joulin, and L. Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Advances in neural information processing systems, 2014, pp. 1889–1897.
7. R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes and natural language with recursive neural networks,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 129–136.
8. M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dual-view hashing,” in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1328–1336.
9. B. Ozdemir and L. S. Davis, “A probabilistic framework for multimodal retrieval using integrative indian buffet process,” in Advances in Neural Information Processing Systems, 2014, pp. 2384–2392.
10. P. L. Lai and C. Fyfe, “Kernel and nonlinear canonical correlation analysis,” International Journal of Neural Systems, vol. 10, no. 05, pp. 365–377, 2000.
11. Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
References
11. Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 817–824.
12. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
13. A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, 1999, pp. 518–529.
14. G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski, “An extensive experimental comparison of methods for multilabel learning,” Pattern Recognition, vol. 45, no. 9, pp. 3084–3104, 2012.
Multimodal Deep Boltzmann Machine [1]
MDL-CL: A Multimodal Deep Learning Framework with Cross Layers