Cross modality: Interaction between image, video and language - A Trinity and personal perspective

Transcript
Page 1: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

1

Cross modality: Interaction between image, video and language - A Trinity and personal perspective

Khurshid Ahmad
School of Computer Science and Statistics, Trinity College, Dublin

A seminar presentation

Page 2: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

2

Preamble

One key message in modern neuroscience is cross-modality & multi-sensory integration:

• Uni-modal areas in the brain, such as vision, process complex data received in a single mode, e.g. images.

• These areas interact with each other so that animals can deal with a world of multi-modal data.

• Uni-modal areas interact with hetero-modal areas (areas that are activated by two or more input modalities) to converge the outputs of the uni-modal systems and produce 'higher cognitive' behaviour: quantifying (enumeration and counting), retrieving images given linguistic cues, and vice versa.

Page 3: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

3

Neural Correlates of Behaviour: Modality and Neuronal Correlation

Neural underpinnings of multisensory integration:

M. Alex Meredith (2002). 'On the neuronal basis for multisensory convergence: a brief overview'. Cognitive Brain Research, Vol. 14, pp 31-40.

Page 4: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

4

Preamble

One key message in modern neuroscience is cross-modality: 'Sensory information undergoes extensive associative elaboration and attentional modulation as it becomes incorporated in the texture of cognition.'

Cognitive processes are supposed to arise 'from analogous associative transformations of similar sets of sensory inputs'; differences in the resultant cognitive operations 'are determined by the anatomical and physiological properties of the transmodal node that acts as the critical gateway for the dominant transformation'.

[Figure: the core synaptic hierarchy: primary sensory, upstream and downstream unimodal, and transmodal (heteromodal, paralimbic and limbic) zones of the cerebral cortex. Thin arrows: monosynaptic connections; thick arrows: 'massive connections'; broken arrows: motor output pathways.]

Mesulam, M.-Marsel (1998). 'From sensation to cognition'. Brain, Vol. 121, pp 1013-1052.

Page 5: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

5

Neural Correlates of Behaviour: Modality and Neuronal Correlation

Neural underpinnings of multisensory motion integration:

'In addition to [...] modality-specific motion-processing areas, there are a number of brain areas that appear to be responsive to motion signals in more than one sensory modality [...] the IPS, [..] precentral gyrus can be activated by auditory, visual or tactile motion signals.'

Soto-Faraco, S. et al. (2004). 'Moving Multisensory Research Along: Motion Perception Across Sensory Modalities'. Current Directions in Psychological Science, Vol. 13(1), pp 29-32.

Page 6: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

6

Sensation and Cognition

The highest synaptic levels of sensory-fugal processing are occupied by heteromodal, paralimbic and limbic cortices, collectively known as transmodal areas.

Key anatomically distinct brain networks with communicating epicentres:

Network                           | Epicentre 1                    | Epicentre 2
Spatial awareness                 | Posterior parietal cortex      | Frontal eye fields
Language                          | Wernicke's area                | Broca's area
Explicit memory/emotion           | Hippocampal-entorhinal complex | The amygdala
Face-object recognition           | Mid-temporal cortex            | Temporo-polar cortex
Working memory-executive function | Lateral prefrontal cortex      | Posterior parietal cortex (?)

Mesulam, M.-Marsel (1998). 'From sensation to cognition'. Brain, Vol. 121, pp 1013-1052.

Page 7: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

7

Uni- and Cross Modality @ Trinity

Indexing (Rapporteur: Declan O'Sullivan)
• Anil Kokaram: Retrieval in Context: A holistic view of intelligent multimedia access
• Frank Boland: Audio information acquisition and source localisation
• Niall Rea and Rozenn Dahyot: Detection of Illicit Content in Video Streams

Retrieval (Rapporteur: John Dingliana)
• Simon Wilson: Bayesian content-based image retrieval
• Niall Rooney: Search strategies for cluster-based document indexing and retrieval
• Anton Zamolotskikh: A machine learning approach for ontology construction within Collaborative Media Tagging Environments

Simulation & Visualisation (Rapporteur: Carl Vogel)
• Carol O'Sullivan: Perception of dynamic events and implications for real-time Computer Graphics
• Gerard Lacey: Efficient rigid body motion tracking with applications to human psychomotor performance assessment

Page 8: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

8

Uni- and Cross Modality @ Trinity

Other friends and colleagues:

• Trinity Centre for Neurosciences (Fiona Newell, Shane O'Mara, Hugh Garavan & Ian Robertson: cross modality and fMRI imaging)
• Linguistics and Phonetics (Ailbhe Ní Chasaide)
• Centre for Health Informatics (Jane Grimson)

Page 9: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

9

Uni/Cross Modality & Ontology @ Trinity

The key problem for the evolving semantic web and the creation of large data repositories (memories for life in health care, infotainment) is the indexation and efficient retrieval of images, both still and moving, and the identification of key objects and events in the images.

The visual features under-constrain an image, and supplemental, collateral, contextual knowledge is required to index the images: linguistic descriptions and motion features are among the candidates.

Above all, any indexing scheme must have a conceptual basis for it to be robust against changes in the subject domain and changes in the user perspective.

Page 10: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

10

Uni/Cross Modality & Ontology @ Trinity

The key term in distributed and soft computing for a conceptual basis is ontology: a consensus amongst a group of people (system developers, domain experts and end-users) about what there is.

We have had a seminar where we discussed the philosophical, formal, linguistic, computational and inter-operability issues related to ontology systems.

A work programme is evolving under the co-ordination of Declan O'Sullivan. The intention is to see how the work of the ontology consortium fits with that of colleagues in video annotation.

Page 11: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

11

Uni/Cross Modality & Ontology @ Trinity

The intention is to see how the work of the ontology consortium fits with that of colleagues in video annotation: a system that works in a distributed environment, interacts with users on a variety of devices, and allows access to, and update of, large repositories of life- and mission-critical data.

We have tremendous opportunities: (a) major government initiatives in health care: an integrated system for text and images related to patients, accessible to authorised users on a range of mobile devices; (b) major opportunities in animation and surveillance; (c) key applications in mini-robotics systems; (d) the opening up of TCIN in clinical care; (e) ageing initiatives.

Page 12: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

12

Uni/Cross Modality & Ontology @ Trinity

There are key groups in the College that can contribute to the knowledge in computing and to the advancement of key disciplines such as health care and the neurosciences. This is a win-win opportunity for all.

1. Communications and Value Chain Centre
2. Intelligent Systems Cluster in CS (Ontology, Linguistics, Graphics, Vision)
3. Theory and Architecture Cluster (Formal Methods)
4. Distributed Systems Cluster (Ubiquitous Systems)
5. Vision and Speech Groups in EE
6. Statistics Cluster (Bayesian Reasoning)

Page 13: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

13

Uni/Cross Modality & Ontology @ Trinity

The key message here is this:

• Trinity is good at good science;
• Trinity has substantial expertise and potential in building novel computing systems;
• Trinity has demonstrable ability to deal with real-world audio/video systems;
• All the key players involved have a peer-reviewed track record.

We have the critical mass, or the desire to create one!

Page 14: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

14

Preamble

Neural computing systems are trained on the principle that if a network can compute then it will learn to compute.

Most neural computing systems are single-net, cellular systems.

Lesson from biology: no network is an island; the bell tolls across networks.

Page 15: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

15

Preamble

Neural computing systems are trained on the principle that if a network can compute then it will learn to compute.

Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously or sequentially, then the multi-net will learn to compute.

Page 16: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

16

Preamble

Multi-net neural computing systems can be traced back to the hierarchical mixture-of-experts systems originally reported by Jordan, Jacobs and Barto. In turn, these systems relate to a broader family of systems: the mixtures of 'X'.

Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). 'Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks'. Cognitive Science, Vol. 15, pp. 219-250.

Page 17: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

17

Preamble

One key message in modern neuroscience is multi-modality. My work has been in the multi-net simulation of:

• language development;
• aphasia;
• numerosity;
• cross-modal retrieval;
• attention and automatic video annotation.

Page 18: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

18

Learning to Compute: Cross-Modal Interaction and Spatial Attention

The key to spatial attention is that different stimuli, visual and auditory, help to identify the spatial location of the object generating the stimuli.

One argument is that there may be a neuronal correlate of such crossmodal interaction between two stimuli.

Information related to the location of the stimulus (where) and to the identity of the stimulus (what) appears to have correlates at the neuronal level in the so-called dorsal and ventral streams of the brain.

Page 19: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

19

Learning to Compute: Numerosity, Number Sense and 'Numerons'

A number of other animal species appear to have the 'faculty' of visual enumeration, or subitisation. The areas identified have 'homologs' in the human brain.

Author              | % Numerical Neurons in Macaque Parietal Cortex | % Numerical Neurons in Macaque Prefrontal Cortex
Sawamura et al 2002 | c. 30                                          | c. 15
Nieder et al 2002   | c. 15                                          | c. 30

Measurements are a tad problematic in neurobiology.

Page 20: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

20

Learning to Compute: Numerosity, Number Sense and 'Numerons'

The 'Edge' Effect

'Monkeys watched two displays (first sample, then test) separated by a 1-s delay. [The displays varied in shape, size, texture and so on.] They were trained to release a lever if the displays contained the same number of items. Average performance of both monkeys was significantly better than chance for all tested quantities, with a decline when tested for higher quantities similar to that seen in humans performing comparable tasks.'

Andreas Nieder, David J. Freedman, Earl K. Miller (2002). 'Representation of the Quantity of Visual Items in the Primate Prefrontal Cortex'. Science, Vol. 297, pp 1709-11.

Page 21: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

21

Computing to Learn

Neural computing systems are trained on the principle that if a network can compute then it will learn to compute.

Multi-net neural computing systems are trained on the principle that if two or more networks learn to compute simultaneously, then the multi-net will learn to compute.

Page 22: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

22

Computing to Learn: Unsupervised Self Organisation

Combining multiple modes of information using unsupervised neural classifiers:

• Two SOMs linked by Hebbian connections
• One SOM learns to classify a primary modality of information
• One SOM learns to classify a collateral modality of information
• Hebbian connections associate patterns of activity in each SOM
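A minimal Python sketch of this arrangement, not the project's code: two small SOMs with a Gaussian neighbourhood, linked by a Hebbian matrix trained on the outer product of their activity patterns. The 67-dimensional image vector and 50-dimensional text vector mirror a configuration table later in the talk; everything else here is an illustrative assumption.

import numpy as np

class SOM:
    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.random((rows * cols, dim))               # codebook vectors
        self.coords = np.indices((rows, cols)).reshape(2, -1).T.astype(float)

    def winner(self, x):
        return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

    def activity(self, x, sigma=1.5):
        # Gaussian pattern of activity centred on the winning node.
        d = np.linalg.norm(self.coords - self.coords[self.winner(x)], axis=1)
        return np.exp(-d ** 2 / (2 * sigma ** 2))

    def update(self, x, lr=0.1, sigma=1.5):
        self.w += lr * self.activity(x, sigma)[:, None] * (x - self.w)

primary = SOM(10, 10, dim=67, seed=1)      # e.g. image features
collateral = SOM(10, 10, dim=50, seed=2)   # e.g. keyword features
hebb = np.zeros((100, 100))                # bidirectional Hebbian links

def train_step(x_img, x_txt, eta=0.01):
    global hebb
    primary.update(x_img)
    collateral.update(x_txt)
    # Hebbian strengthening of links between co-active node pairs.
    hebb += eta * np.outer(primary.activity(x_img), collateral.activity(x_txt))

rng = np.random.default_rng(3)
for _ in range(100):                       # toy training loop on random data
    train_step(rng.random(67), rng.random(50))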

Page 23: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

23

Computing to Learn: Unsupervised Self Organisation

Sequential multinet neural computing systems: SOMs and Hebbian connections trained synchronously.

[Diagram: a primary vector feeds a primary SOM, a collateral vector feeds a collateral SOM, and the two maps are linked by a bidirectional Hebbian network.]

Page 24: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

24

Computing to Learn: Unsupervised Self Organisation

Work under my supervision at Surrey includes the development of multi-net neural computing architectures for:

• language development
• language degradation
• collateral images and texts
• numerosity development

In the case of the latter two, the connections between modules are learnt too: cross-modal interaction via Hebbian connections.

Page 25: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

25

Computing to Learn: Unsupervised Self Organisation

• Hebbian connections associate neighbourhoods of activity, not just a one-to-one linear association
• Each SOM's output is formed by a pattern of activity centred on the winning neuron for the primary and collateral input
• Training is deemed complete when both SOM classifiers have learned to classify their respective inputs

Page 26: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

26

Computing to Learn: The Development of Numerosity

An unsupervised multinet alternative

Hebbian connections run from the winning node of the magnitude-representation SOFM to all nodes of the verbal SOFM (a), and vice versa (b). During training, those connections are strengthened based on the activations of the node pairs.

Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.
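A hedged sketch of this winner-to-all rule (function and argument names are invented; the learning rate is an assumption):

import numpy as np

def hebbian_winner_update(hebb, act_mag, act_verb, eta=0.05):
    # act_mag, act_verb: activity patterns over the magnitude and verbal
    # maps, each centred on that map's winning node.
    i = int(np.argmax(act_mag))       # winning magnitude node
    j = int(np.argmax(act_verb))      # winning verbal node
    hebb[i, :] += eta * act_verb      # (a) winning magnitude node -> all verbal nodes
    hebb[:, j] += eta * act_mag       # (b) winning verbal node -> all magnitude nodes
    return hebb

Because each increment is scaled by the activations of the node pairs, links between strongly co-active nodes grow fastest.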

Page 27: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

27

Computing to Learn: The Development of Numerosity

An unsupervised multinet alternative

[Diagram: a magnitude SOFM and a verbal SOFM (fed with number words) linked by Hebbian connections.]

Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

Page 28: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

28

Computing to Learn: The Development of Numerosity

An unsupervised multinet alternative: Simulating Fechner's Law

[Figure: average activation of Kohonen-layer nodes 1-18, with labelled peaks for the number words 'one' to 'six'.]

Ahmad K., Casey, M. & Bale, T. (2002). 'Connectionist Simulation of Quantification Skills'. Connection Science, Vol. 14(3), pp. 165-201.

Page 29: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

29

Computing to Learn: The Development of Numerosity

A 'Hebbian-like learning rule' that 'resembles [..] Kohonen learning rule': a confirmation of the results of Nieder & Miller.

Verguts, Tom, & Fias, Wim (2004). 'Representation of Numbers in Animals and Humans: A Neural Model'. Journal of Cognitive Neuroscience, Vol. 16(9), pp 1493-1504.

Page 30: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

30

Computing to Learn: Image and Collateral Texts

Images have traditionally been indexed with short texts describing the objects within the image. The accompanying text is sometimes described as collateral to the image.

The ability to use collateral texts for building computer-based image retrieval systems will help in dealing with the image collections that can now be stored digitally.

Theoretically, the manner in which we grasp the relationship between the 'features' of the image and the 'features' of the collateral text relates back to cross-modality.

Page 31: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

31

Computing to Learn: Image and Collateral Texts

• The approximate locations of [lateral] regions where information about object form, motion and object-use-associated motor patterns may be stored.
• Information from an increasing number of sources may be integrated in the temporal lobes, with specificity increasing along the posterior-to-anterior axis.
• Specific regions of the left inferior parietal cortex and the polar region of the temporal lobes may be involved differentially in retrieving, monitoring, selecting and maintaining semantic information.

Alex Martin and Linda L. Chao (2001). 'Semantic memory and the brain: structure and processes'. Current Opinion in Neurobiology, Vol. 11, pp 194-201.

Page 32: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

32

Computing to Learn: Image and Collateral Texts

• Activation of the fusiform gyrus when subjects retrieve color word associates has recently been replicated in two additional studies.
• Activation in a similar region has been reported during the spontaneous generation of color imagery in auditory color-word synaesthetes.

Alex Martin and Linda L. Chao (2001). 'Semantic memory and the brain: structure and processes'. Current Opinion in Neurobiology, Vol. 11, pp 194-201.

Page 33: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

33

Computing to Learn: Image and Collateral Texts

In principle, image collections can be indexed by the visual features of the content alone (colour, texture, shapes, edges). Content-based image retrieval, however, has not been a resounding success:

[Figure: the same collection grouped by visual similarity (similar colours) versus conceptual similarity (balls / fruits).]

K. Ahmad, B. Vrusias, and M. Zhu (2005). 'Visualising an Image Collection?' In (Eds.) Ebad Banisi et al. Proceedings of the 9th International Conference on Information Visualisation (London, 6-8 July 2005). Los Alamitos: IEEE Computer Society Press. pp 268-274.

Page 34: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

34

Computing to Learn: Image and Collateral Texts

We have developed a multi-net system that learns to classify images within an image collection, where each image has a collateral text, based on the common visual features and the verbal features of the collateral text.

The multi-net can also learn to correlate images and their collateral texts using Hebbian links; this means that one image may be associated with more than one collateral text, and vice versa.

Ahmad, K., Casey, M., Vrusias, B., & Saragiotis, P. (2003). 'Combining Multiple Modes of Information using Unsupervised Neural Classifiers'. In (Eds.) Terry Windeatt and Fabio Roli. Proc. 4th Int. Workshop, MCS 2003. LNCS 2709. Heidelberg: Springer-Verlag. pp 236-245.

Page 35: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

35

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

Hebbian connections run from the winning node of the text SOFM to all nodes of the image SOFM (a), and vice versa (b). During training, those connections are strengthened based on the activations of the node pairs.

Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

Page 36: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

36

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

[Diagram: an image SOFM and a text SOFM (fed with keywords) linked by Hebbian connections.]

Vrusias B. L. (2004). Combining Unsupervised Classifiers: A Multimodal Case Study. Unpublished PhD Dissertation, University of Surrey.

Page 37: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

37

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

Different SOFM configurations used in simulations:

Input Layer: Text  | 30      | 50      | 195
Input Layer: Image | 67      |         |
Output Layer       | 10 x 10 | 15 x 15 | 50 x 50
Hebbian Links      | 10,000  | 50,625  | 6,250,000
Training Cycles    | 1,000   | 10,000  |

Optimum SOFM configuration:

Output Layer             | 15 x 15
Input Text Vector Length | 50
Hebbian Links            | 50,625
Training Cycles          | 1,000
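As a check on the table above: two fully connected n-node maps give n² Hebbian links, so a 10 × 10 map yields 100² = 10,000 links, a 15 × 15 map 225² = 50,625, and a 50 × 50 map 2500² = 6,250,000, exactly the figures listed.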

Page 38: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

38

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

Modality | Method                                                        | Components
Text     | Vector construction: texts represented through their keywords | Frequency and patterns of usage: most-used and least-used terms
Image    | Vector selection: standard physical features                  | Colour; texture; shape; brightness

Page 39: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

39

Computing to Learn: Image and Collateral Texts

The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

The performance of the two networks was compared using a ratio of precision (p) and recall (r) statistics, the effectiveness measure F:

F = 1 / (α(1/p) + (1 - α)(1/r))

We use α = 0.5, for which F reduces to the harmonic mean 2pr/(p + r).
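As a worked check against the text-based categorisation figures reported later in the talk (10 x 10 topology: p = 0.60, r = 0.83):

F = 2pr / (p + r) = (2 × 0.60 × 0.83) / (0.60 + 0.83) ≈ 0.70

which matches the tabulated F.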

Page 40: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

40

Computing to Learn: Image and Collateral Texts

• The Hemera 'PhotoObjects' collection was used as the primary dataset for our experiments.
• The collection contains about 50,000 photo objects (single-object images with no background), and has been used extensively for image analysis.
• Each image (object) in the collection has associated keywords attached, and is characterised by a general category type.

Page 41: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

41

Computing to Learn: Image and Collateral Texts

Hemera collection: training subset used; 1,151 images randomly selected from 50,000 objects.

CATEGORY            | AVERAGE No TERMS | TOTAL No TERMS | No OF OBJECTS
BALLS               | 9   | 915   | 97
BUTTERFLIES & MOTHS | 8   | 993   | 129
CARS                | 10  | 1217  | 118
DRINKS              | 10  | 664   | 65
FLOWERS             | 9   | 1099  | 117
FRUIT               | 4   | 561   | 131
MONEY               | 8   | 909   | 120
SEATING             | 8   | 862   | 107
TRAINS & PLANES     | 11  | 1542  | 139
WEAPONS             | 10  | 1256  | 128
AVERAGE             | 8.7 | 1,002 | 115

Page 42: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

42

Computing to Learn: Image and Collateral Texts

[Figure: two 15 x 15 grids of category labels per map node, visualising the clusters formed by the image-based SOFM and by the text-based SOFM.]

Page 43: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

43

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

• An SOFM cannot, by itself, output the categories inherent in the training data. Recently, a sequential clustering scheme has been suggested: produce the initial categorisation using an SOFM, then cluster the output using conventional clustering algorithms such as k-means, hierarchical clustering, fuzzy c-means and so on.
• We have obtained the best results with SOFM + k-means clustering.
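A minimal sketch of the SOFM + k-means scheme, under assumed sizes (a trained 15 × 15 map stood in for by random codebook vectors, and 10 clusters to mirror the 10 Hemera categories), with scikit-learn's k-means as the conventional algorithm:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
codebook = rng.random((225, 50))     # stand-in for trained 15 x 15 SOM weights

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(codebook)
node_category = km.labels_           # one discrete category per map node

def categorise(x, codebook, node_category):
    # Assign an input to the category of its best-matching SOM node.
    bmu = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return int(node_category[bmu])

print(categorise(rng.random(50), codebook, node_category))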

Page 44: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

44

Computing to Learn: Image and Collateral Texts

• The visual features proved too generic to be useful for classification.
• Precision and recall figures were persistently below 0.5 for both metrics.
• The results, however, were good for visually well-defined objects like coins.
• This perhaps explains the poor performance of some computer vision systems.

Page 45: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

45

Computing to Learn: Image and Collateral Texts

• Textual descriptors are much better for categorisation, with precision and recall both quite high.

Text-based categorisation:

Topology | F    | Precision | Recall
10 x 10  | 0.70 | 0.60      | 0.83
15 x 15  | 0.72 | 0.63      | 0.84
50 x 50  | 0.80 | 0.70      | 0.92

Image-based categorisation:

Topology | F    | Precision | Recall
10 x 10  | 0.25 | 0.24      | 0.27
15 x 15  | 0.26 | 0.25      | 0.27
50 x 50  | 0.29 | 0.28      | 0.29

Page 46: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

46

Computing to Learn: Image and Collateral Texts

The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

Neural Network Architecture                | F (α = 0.5)
Multinet system: Simple Collateral Mapping | 0.76
Monolithic single-net system               | 0.50

Page 47: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

47

Computing to Learn: Image and Collateral Texts

Hemera Data Set: Single Objects + No Background

The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

System                      | Input Vector     | Output Vector    | F-Measure
SingleNet                   | Monolithic       | Monolithic       | 0.36
MultiNet: Auto Annotation   | Visual Features  | Keyword Features | 0.38
MultiNet: Auto Illustration | Keyword Features | Visual Features  | 0.48

Page 48: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

48

Computing to Learn: Image and Collateral Texts

Corel Data Set: Multiple Objects + Background

The performance of the multinet system was compared with a 'monolithic' single-net SOFM. The monolith was trained with a conjoined text-image vector on a single SOFM.

System                      | Input Vector     | Output Vector    | F-Measure
SingleNet                   | Monolithic       | Monolithic       | 0.37
MultiNet: Auto Annotation   | Visual Features  | Keyword Features | 0.25
MultiNet: Auto Illustration | Keyword Features | Visual Features  | 0.43

Page 49: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

49

Computing to Learn: Image and Collateral Texts

Automatic image illustration through Hebbian cross-modal linkage:

[Figure: a text query is matched to a stored collateral text, and the linked image is retrieved.]

Page 50: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

50

Computing to Learn: Image and Collateral Texts

Automatic image annotation through Hebbian cross-modal linkage:

[Figure: a query image is matched to a stored image, and the linked text is retrieved.]
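A hedged sketch of both look-up directions through the trained Hebbian links; every name and shape here is illustrative rather than the project's code:

import numpy as np

def annotate(image_vec, image_codebook, hebb, text_node_keywords):
    # Image -> text (automatic annotation): find the query's winning image
    # node, follow its strongest Hebbian link, return that text node's words.
    bmu = int(np.argmin(np.linalg.norm(image_codebook - image_vec, axis=1)))
    text_node = int(np.argmax(hebb[bmu, :]))
    return text_node_keywords[text_node]

def illustrate(text_vec, text_codebook, hebb, image_node_contents):
    # Text -> image (automatic illustration): the links followed in reverse.
    bmu = int(np.argmin(np.linalg.norm(text_codebook - text_vec, axis=1)))
    image_node = int(np.argmax(hebb[:, bmu]))
    return image_node_contents[image_node]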

Page 51: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

51

Computing to Learn: Image and Collateral Texts

Automatic Image Annotation and Illustration

Performance (accuracy, %) of the Hebbian network in learning to identify the link between an image-text pair:

        | TEXT -> IMAGE | IMAGE -> TEXT
Best    | 75            | 76
Worst   | 70            | 71
Average | 73            | 74

Page 52: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

52

TODAY'S VISION TECHNOLOGY

How to retrieve the stored video (sequences)?

Unusual events? Unusual behaviour? Unusual objects?

Page 53: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

53

TODAY'S VISION TECHNOLOGY

How to retrieve the stored video (sequences)? By keywords: experts annotate video sequences by hand!

• Between 5 and 40 minutes per still image
• Inter-indexer variability

Page 54: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

54

VISUAL THESAURUS & VIDEO SUMMARISATION – The Written Word in Closely and Broadly Collateral Texts

[Diagram: a crime scene image is linked to closely collateral texts (caption, crime scene report) and to broadly collateral texts (newspaper article, dictionary definition).]

Page 55: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

55

TOMORROW'S VISION TECHNOLOGY?

New concepts generate new keywords and kill off old ones.

New concepts are inevitably written up and published in research papers, magazines and newspapers.

Automatic extraction from text?

Earprints? Telephone chatter? Suicide bombers? Grassprints? Crowd dynamics?

Page 56: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

56

TOMORROW'S VISION TECHNOLOGY?

New concepts are inevitably written up and published in research papers, magazines and newspapers.

Automatic extraction from text? An earprint thesaurus was created at Surrey automatically!

Page 57: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

57

VISUAL THESAURUS & VIDEO SUMMARISATION

The source of new concepts and terms: expert observations, research reports, novel devices, new methods.

[Diagram: these sources feed an information extraction system, which outputs new terminology: earprint, pyrolysis, bitemarks.]

Page 58: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

58

VISUAL THESAURUS & VIDEO SUMMARISATION

Development of a Visual Evidence Thesaurus

A visual thesaurus is an arrangement of the words and phrases of a language not in alphabetical order but according to the images that the words and phrases express.

A visual thesaurus is not a pictorial dictionary:

• A pictorial dictionary explains the meanings of words and phrases associated with an image.
• A visual thesaurus suggests a range of words and phrases associated with an image.

Page 59: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

59

VISUAL THESAURUS & VIDEO SUMMARISATION

Development of a Visual Evidence Thesaurus

The challenge of the REVEAL Project is to create a visual evidence thesaurus in arbitrary domains by using a set of systematically collected moving images and associated texts.

A systematic collection of texts is called a CORPUS. A corpus comprises the evidence of how a language is being used, at various levels of description:

• at the level of word usage (lexical),
• at the level of phrases and sentences (grammatical),
• at the level of meaning (semantic), and
• at the level of intentions (pragmatic).

Page 60: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

60

VISUAL THESAURUS & VIDEO SUMMARISATION – The Spoken Word

An expert's description is rich in vocabulary and can subsequently be used to annotate a picture gallery. [Video: Nick Mitchell (SOCO, Surrey Police) describing a mock murder scene.]

Page 61: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

61

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS) (an earlier EPSRC-funded project)

This EPSRC-sponsored project, involving the Universities of Surrey and Sheffield, developed methods and techniques for automatically indexing images with the descriptions provided by Scene of Crime Officers.

[Typical scene-of-crime images: a 9 mm Browning high-power pistol; a footwear impression in blood; a body on the floor showing an adjacent table; fingerprints showing ridges.]

Page 62: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

62

BUILDING A VISUAL THESAURUS

Use of words in the Surrey Forensic Science Corpus (0.58 million words), compared and contrasted with a good sample of texts of everyday usage: the British National Corpus (100 million words), which contains major works of fiction, science and technology texts, newspapers and magazines.

Page 63: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

63

BUILDING A VISUAL THESAURUS

Word | SFSC: Relative Frequency | BNC: Relative Frequency | SFSC/BNC: Weirdness
the  | 6.8% | 6.2% | 1.1
of   | 3.7% | 2.9% | 1.2
and  | 2.7% | 2.7% | 1.0
to   | 2.5% | 2.6% | 1.0
a    | 2.4% | 2.1% | 1.1

British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words.

These five words have about the same distribution in the two corpora. They are the so-called closed-class, or grammatical, words, and one may expect them with much the same frequency in any two English-language corpora. There is no weirdness in the use of these words in the forensic science corpus.
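In code, weirdness is just a ratio of relative frequencies. A minimal sketch in Python, using the corpus sizes quoted on the slide; the treatment of words absent from the BNC is an assumption (the slides leave those weirdness cells blank):

SFSC_TOKENS = 580_000        # Surrey Forensic Science Corpus, c. 0.58M words
BNC_TOKENS = 100_000_000     # British National Corpus, 100M words

def weirdness(sfsc_count: int, bnc_count: int) -> float:
    # Relative frequency in the special corpus over relative frequency
    # in the general corpus.
    if bnc_count == 0:
        return float('inf')  # neologism: the word is absent from the BNC
    return (sfsc_count / SFSC_TOKENS) / (bnc_count / BNC_TOKENS)

# 'the': 6.8% of the SFSC vs 6.2% of the BNC -> weirdness of about 1.1,
# matching the table above.
print(round(weirdness(int(0.068 * SFSC_TOKENS), int(0.062 * BNC_TOKENS)), 1))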

Page 64: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

64

BUILDING A VISUAL THESAURUS

Word     | SFSC: Relative Frequency | BNC: Relative Frequency | SFSC/BNC: Weirdness
evidence | 0.47% | 0.021% | 22
crime    | 0.40% | 0.007% | 57
scene    | 0.27% | 0.007% | 40
forensic | 0.25% | 0.001% | 473
police   | 0.25% | 0.028% | 9

British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words.

These five words do not have the same distribution in the two corpora. They are the so-called open-class, or lexical, words. Relative to corpus size, for every 22 occurrences of 'evidence' in the Surrey corpus there is only one in the BNC. And 'forensic' is the most weird: in relative terms, 473 occurrences in the Surrey corpus for every one in the BNC.

Page 65: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

65

BUILDING A VISUAL THESAURUS

Word       | SFSC: Relative Frequency | BNC: Relative Frequency | SFSC/BNC: Weirdness
bitemark   | 0.0187% | 0%       | -
earprint   | 0.0137% | 0%       | -
accelerant | 0.0115% | 0%       | -
pyrolysis  | 0.0139% | 0.00001% | 1263
ballistics | 0.0146% | 0.00002% | 634

British National Corpus (BNC) = 100 million words; Surrey Forensic Science Corpus (SFSC) = 0.58 million words.

The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words. 'Pyrolysis' and 'ballistics' are also rarely used words in the BNC.

Page 66: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

66

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS)

The SOCIS system, available freely, automatically indexes images taken at a crime scene with the descriptions provided by scene-of-crime officers. These descriptions were supplemented by a visual thesaurus constructed from a forensic science corpus.

Page 67: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

67

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS)

[SOCO 1, spontaneous free text:] 'Close up view of exhibit ABC/3 red and silver knife handle on alleyway floor adjacent to building and metal gate.'

The description decomposes into:

IDENTIFICATION: [1] Close up view of exhibit ABC/3. [2] Red and silver knife handle.
LOCATION: On alleyway floor.
ELABORATION: Adjacent to building and metal gate.

Indexer variability: given that the image descriptions are in free text, perhaps each SOCO gives a different description of the image?

Page 68: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

68

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS)

Indexer variability: given that the image descriptions are in free text, perhaps each SOCO gives a different description of the image?

• Novices attempted to describe everything in the image:

"I now see the bedroom or one of the bedrooms it looks like a child's bedroom single bed and a cot to the right of the bed there's a a bedside cabinet four drawers the bottom two drawers are completely open with some items hanging out the third drawer up is slightly open with an item hanging out and on top of that there's some different toys and ornaments the bed doesn't look like it's been disturbed it's just cuddly toys over the pillow cot the cot is open and the doors the side door is down with a cuddly toy in it and a bike propped up against the wall or a scooter looks like an old scooter propped up against the wall" (124 words)

Page 69: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

69

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS)

Indexer variability: given that the image descriptions are in free text, perhaps each SOCO gives a different description of the image?

• Novices were quite succinct in their descriptions after training (c. 6 weeks):

"now a view of the picture from child's bedroom, single bed, with a cot, to the right of the bed a bedside cabinet with two drawers open a four drawer cabinet a bedside cabinet, the bottom two are open, the third one up is slightly open with some clothing hanging out." (51 words)

Page 70: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

70

BUILDING A VISUAL THESAURUS

Scene of Crime Information System (SOCIS)

Indexer variability: given that the image descriptions are in free text, perhaps each SOCO gives a different description of the image?

• Novices were quite succinct in their descriptions after training (c. 6 weeks), but were nothing like an expert:

"View of ground floor bedroom towards bed and cot (as viewed from door)." (13 words)

Page 71: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

71

BUILDING A VISUAL THESAURUS

Key to building a visual thesaurus:

1. Terminology;
2. Conceptual structures, or ontology;
3. Methods for updating terminology & ontology;
4. Access to experts, to exemplar images, and to collateral descriptions of images.
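As a hypothetical illustration of the four ingredients above, the records a visual thesaurus must tie together might look like this; all field names are invented for this sketch:

from dataclasses import dataclass

@dataclass
class ThesaurusEntry:
    image_id: str
    terms: list[str]        # 1. terminology attached to the image
    concepts: list[str]     # 2. ontology nodes the terms map onto
    collateral: str = ""    # 4. a collateral description, if any

def suggest_terms(entries, concept):
    # A visual thesaurus suggests the words and phrases associated with
    # imagery that expresses a given concept (point 3 is the pipeline
    # that keeps 'terms' and 'concepts' up to date).
    return sorted({t for e in entries for t in e.terms if concept in e.concepts})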

Page 72: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

72

Learning to Compute: Visual Attention: An Early Processing Model

Cortical Area | Tasks | Function
'dorsal stream' (PPC) | spatial localisation; directing attention and gaze towards objects of interest in the scene | Deploy attention
'ventral stream' (infero-temporal cortex; IT) | recognition and identification of visual stimuli | Receive attentional feedback modulation; represent attended locations and objects

L. Itti & C. Koch (2001). 'Computational Modeling of Visual Attention'. Nature Reviews Neuroscience, Vol. 2, No. 3, pp. 194-203.

Page 73: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

73

Learning to Compute: Visual Attention: The Itti and Koch Model

Inputs are decomposed into multiscale analysis channels sensitive to low-level visual features (two color contrasts, temporal flicker, intensity contrast, four orientations, and four directional motion energies). After strong non-linear competition for saliency, all channels are combined into a unique saliency map. This map either directly modulates encoding priority (higher priority for more salient pixels), or guides several virtual foveas towards the most salient locations (highest priority given to fovea centers).

L. Itti (2004). 'Automatic Foveation for Video Compression Using a Neurobiological Model of Visual Attention'. IEEE Transactions on Image Processing, Vol. 13, No. 10, pp. 1304-1318.
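A toy sketch in the spirit of this pipeline, not Itti's implementation: a single intensity channel, centre-surround contrast taken as differences of Gaussian-blurred copies at a few scales, and the rectified maps summed into one saliency map. All parameters are illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

def saliency(intensity, centre_sigmas=(1, 2), surround_factor=4):
    sal = np.zeros_like(intensity, dtype=float)
    for s in centre_sigmas:
        centre = gaussian_filter(intensity, sigma=s)
        surround = gaussian_filter(intensity, sigma=s * surround_factor)
        sal += np.abs(centre - surround)   # rectified centre-surround contrast
    return sal / sal.max() if sal.max() > 0 else sal

img = np.zeros((64, 64)); img[28:36, 28:36] = 1.0     # one bright blob
focus = np.unravel_index(np.argmax(saliency(img)), img.shape)
print(focus)   # most salient location, near the blob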


Page 75: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

75

Summary

Preliminary results show that:

• Modular, co-operative multi-net systems using unsupervised learning techniques can improve classification with multiple modalities.

Future work:

• Evaluate against larger sets of data
• Further understanding of clustering and classification in SOMs
• Further explore the linkage of neighbourhoods (more than just a one-to-one mapping) and the theory underlying the model

Page 76: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

76

Afterword

It is important that research and development of neural computing systems continues to be informed by, and inspired by, the latest results from neuroscience; e.g. insights into multimodal abilities suggest research into modular multi-net architectures.

One may claim that the multi-net systems reported in this talk have a heteromodal region, in which the connections between uni-modal networks are learnt.

Page 77: Cross modality: Interaction between image, video and language - A Trinity and personal perspective

77

Afterword

Here is a movie of my latest project

