Linking Vision, Brain and Language
Jinyong Lee, Computer Science Department, University of Southern California

Transcript
Page 1

Jinyong Lee

Computer Science Department

University of Southern California

Linking Vision, Brain and Language


Page 2

Part I: Hearing Gazes – eye movements and visual scene recognition

An unexpected visitor (I.E. Repin, 1884-1888)

Page 3

SemRep (Semantic Representation) is a hierarchically organized, graph-like representation for encoding semantics and concepts that may be anchored to a visual scene.


Original image from: “Invisible Man Jangsu Choi”, Korean Broadcasting System (KBS)

[Figure: an example SemRep with nodes MAN, WOMAN, HITTING ACTION and BLUE CLOTHES; another whole SemRep structure is embedded in the WOMAN node]

Objects or actions are encoded as nodes; edges specify the relations between nodes (a minimal sketch of this structure follows).
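To make this concrete, here is a minimal Python sketch of such a graph, assuming hypothetical names (Node, Edge, SemRep and their fields are illustrative, not the model's actual implementation):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """An object or action concept, possibly anchored to a scene region."""
    concept: str                          # e.g. "WOMAN", "HITTING ACTION"
    region: Optional[tuple] = None        # coarse (x, y, w, h) anchor in the scene
    subgraph: Optional["SemRep"] = None   # a whole SemRep embedded in a node

@dataclass
class Edge:
    """A labeled relation between two nodes."""
    label: str          # e.g. "agent", "patient", "attribute"
    source: Node
    target: Node

@dataclass
class SemRep:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# The hitting scene above: the WOMAN node embeds its own SemRep for BLUE CLOTHES.
woman = Node("WOMAN", subgraph=SemRep(nodes=[Node("BLUE CLOTHES")]))
man, hit = Node("MAN"), Node("HITTING ACTION")
scene = SemRep(nodes=[woman, man, hit],
               edges=[Edge("agent", hit, woman), Edge("patient", hit, man)])
```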

Page 4

SemRep is proposed as a ‘bridge’ between the vision system and the language system.


“A woman in blue hits a man”

“A woman hits a man and she wears a blue dress”

“A man is hit by a woman who wears a blue dress”

“A pretty woman is hitting a man”

SemRep

Vision System ↔ Language System

• Dynamically changing (even for static scenes)

• Only cognitively important entities are encoded (differing from case to case)

• Temporarily stored in WM

Page 5

What’s happening?

http://icanhascheezburger.files.wordpress.com/2008/10/funny-pictures-fashion-sense-cat-springs-into-action.jpg

Page 6

A single SemRep structure usually encodes a cognitive event that corresponds to a ‘minimal subscene’ (Itti & Arbib, 2006), which can later be extended into an ‘anchored subscene’.



1. First, the man’s face captures attention (human faces are cognitively salient); minimal subscene

2. Then the high contrast of the cat’s paw captures attention; action recognition is performed, and it biases the attention system toward the agent (the cat)

3. The cat on the roof is found as the agent of the paw action; incremental building

4. This completes a bigger subscene, building the Agent-Action-Patient propositional structure; anchored subscene (a sketch of this incremental process follows)
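The following Python sketch illustrates this incremental building loop under stated assumptions: the attention queue, the recognize function and the fixed Agent-Action-Patient roles are hypothetical simplifications, not the model’s actual machinery.

```python
def build_anchored_subscene(attention_queue, recognize):
    """Grow a minimal subscene into an anchored one, one salient element at a time.

    attention_queue: scene elements in the order attention captures them
    recognize: maps an element to a (role, concept) pair, e.g. ("action", "SWAT")
    """
    subscene = {"agent": None, "action": None, "patient": None}
    for element in attention_queue:
        role, concept = recognize(element)
        if subscene.get(role) is None:
            subscene[role] = concept          # incremental building
        # In a fuller model, recognizing an action would re-prioritize the
        # queue, biasing attention toward the missing agent/patient.
        if all(subscene.values()):
            break  # Agent-Action-Patient complete: an anchored subscene
    return subscene
```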

Page 7

The entities encoded in SemRep are NOT logical but perceptually grounded symbolic representations.


Theories and findings on conceptual deficits (Damasio, 1989; Warrington & Shallice, 1984; Caramazza & Shelton, 1998; Tyler & Moss, 2001) suggest modality-specific representations distributed categorically over the brain

Perceptual Symbol System (PSS) (Barsalou, 1999): stimulus inputs are captured in the modality-specific feature maps, then integrated in an association area

representing the encoded concepts as perceptual symbols (SemRep)

learning (encoding) the concept CAR

re-enactment (simulation) of the learned concept CAR

Figure 2. Barsalou et al., 2003

Page 8

The schematic structure (relations, hierarchies, etc.) of SemRep can be viewed as a cross-modal association between very high-level sensorimotor integration areas in the brain.


Conceptual Topography Theory (CTT) (Simmons & Barsalou, 2003): hierarchically organized convergence zones; the higher the zone, the more symbolic and amodal the representation


Figure 3. Simmons & Barsalou, 2003

Page 9

Since SemRep is still at a hypothetical stage, an eye-tracking study (which captures only ‘overt’ attention) was designed and conducted to evaluate its validity.



Human eye movements and verbal descriptions are captured simultaneously while images/videos are shown, in order to investigate the temporal relationship between the eye movements and the verbal description (a sketch of this alignment analysis follows).
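A minimal sketch of the intended alignment analysis, assuming simple (start, end, label) tuples for fixations and utterances and a hypothetical match predicate (the real logs and matching criteria differ):

```python
def gaze_speech_lags(fixations, utterances, match):
    """For each utterance, find the latest matching fixation that began before
    the utterance onset, and report the onset-to-onset lag in seconds."""
    lags = []
    for u_start, u_end, phrase in utterances:
        candidates = [f for f in fixations
                      if f[0] <= u_start and match(f[2], phrase)]
        if candidates:
            f_start = max(c[0] for c in candidates)
            lags.append((phrase, u_start - f_start))
    return lags

# Illustrative call (invented numbers, not experiment data):
# gaze_speech_lags([(3.5, 4.0, "ball")], [(4.1, 5.2, "he has the ball")],
#                  match=lambda obj, phrase: obj in phrase)
```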

Page 10

Four main hypotheses are proposed, and two types of tasks are suggested in order to verify them.


Hypothesis I – Mental Preparation

Hypothesis II – Structural Message

Hypothesis III – Dynamic Interaction

Hypothesis IV – Spatial Anchor

Task I – “describe what you are seeing as quickly as possible”

Task II – “describe what you have seen during the scene display”

Table 1. Possible experiment configurations between hypotheses and tasks/visual scenes

Page 11

Task I investigates eye movements during the scene display, in order to estimate the SemRep structures being built and their connection to the spoken description.


Instruction

“Describe what you are seeing as quickly as possible”

Real-time description (with speech recording) during the visual scene display

Expectations

1) There would be rich dynamic interaction between the eye movements and the utterance

2) Having the subjects speak as quickly as possible would reveal the internal structure that they are using for speaking

NOTE: So far, only experiments for this task with static images have been conducted

Page 12

Task II investigates eye movements after the scene display, in order to evaluate the validity of the spatial component of SemRep.


Instruction

“Describe what you have seen during the previous scene display”

Post-event description (with speech recording) after the visual scene display

Expectations

1) The eye movements will faithfully reflect the remembered object locations being described

2) The sentence structure of the description would be better formed, and the order of description and the order of fixations would be less congruent

Page 13

The Mental Preparation Hypothesis (Hypothesis I) asserts that gazes are not meaningless gestures but actually reflect attention and mental effort during verbal description production.


1) Speakers plan (building an internal representation) what to say while gazing, and only after it is relatively complete do they begin speaking (Griffin, 2005)

2) There is a tight correlation between gazing and description; gazing usually precedes the actual description by 100~300 ms (Griffin, 2001)

Figure 2. Griffin, 2001

An overlapping utterance pattern for two consecutive nouns A and B

Page 14

The experimental results suggest a very strong correlation between the eye movements and the verbal description, such that gazes come before the utterances, indicating what is to be spoken next.


Eye gazes ‘right before’ verbal descriptions

01_soldiers [fairman]
4.08~5.2: “one soldier smiles at him” → gaze on a soldier’s face
7.33~8.79: “he’s got sandals and shoes on” → gaze on the person’s shoes

05_basketball [fairman]
2.82~3.52: “he has the ball” → gaze on the ball

06_conversation [fairman]
4.08~5.01: “reading books” → gazes on books
14.86~16.62: “the woman on the right wears boots” → gazes on boots and a woman
…and a lot more!

Others
1) The delay between gazes and descriptions seems somewhat longer than suggested by Griffin (2001; 2005), about 500~1,500 ms; this might be because Griffin’s findings are at the word level whereas ours are at the sentential/clausal level
2) There seems to be parallelism between the vision system (gaze) and the language system (description), since subjects had already moved their gazes to other locations while speaking about an object/event

Page 15

The Structural Message Hypothesis (Hypothesis II) asserts that the internal representation for speakers is an organized structure that can be directly translated into a proposition or clause.


1) van der Meulen (2003): speakers outline messages (some semantic representation similar to SemRep, used for producing utterances) approximately one proposition or clause at a time

2) Itti & Arbib (2006): for whatever reason, an agent, action or object may attract the viewer’s attention; attention is then directed to placing this ‘anchor’ in context within the scene, completing the ‘minimal subscene’ incrementally; a minimal subscene can be extended by adding details to its elements, resulting in an ‘anchored subscene’; each subscene corresponds to a propositional structure that in turn corresponds to a clause (or sentence)

3) There must be a difference between the eye movement patterns for “a woman hits a man and she wears a hat” and “a woman wearing a hat hits a man”, although the lexical items used are about the same

Page 16

The experimental results indicate that relatively complex eye movements are followed by utterances with either more complex sentence structures or more thematic information.


Eye gazes with a complex movement pattern

03_athletic [fairman]
2.81~3.93: “one team just lost” → after inspecting both groups
3.93~5.52: “they are looking at the team that won” → this involves inspecting both of the groups (whereas [karl] “they’re really skinny… got a baton” focuses only on the left group)

11_pool [fairman]
6.26~7.57: “sort of talking to some girls…” → after inspecting the guy and the girls at the bar
7.57~9.62: “but he's actually watching these two guys play pool..” → but right afterwards the gaze switched between the guy and the pool table (hole)

06_conversation [fairman]
9.26~11.07: “...looks like he's wondering what they shared” → looking at both groups and at the guy’s face (probably to figure out his gaze direction) and body

Eye gazes with a simple pattern
Conversely, eye movements involving simple sentence structures (e.g. “there is a man...”, “He’s wearing sandals…”, “…got something”) are quite simple; they are more like a serial movement from one item to another, whereas complex ones involve visiting several locations

Page 17

The Dynamic Interaction Hypothesis (Hypothesis III) asserts that the vision and language systems constantly interact, so that visual attention guides verbal expression and verbal expression, in turn, guides visual attention.


1) Fixations preceding the production of complex nouns are longer than those preceding simple nouns (Meyer, 2005)

2) Both elemental (low-level attentional attraction) and structural (higher-level/top-down structure) considerations are involved in producing sentences about a given scene (Bock et al., 2005)

A relatively complex structure lengthens the fixation duration

Passive sentence → simple causal schemas might have been involved

Active sentence → occurs more frequently and is easier to produce

Figure 1. Bock et al., 2005
Figure 1. Meyer, 2005

Page 18

The results show a tendency toward longer fixations for complex expressions; there were also several cases where the description being produced seemed to bias the gaze.


Constructions requiring more information

01_soldiers [fairman]
5.21~5.72 → 5.72~7.33: “His hat…” → “he’s wearing a military hat…”
10.18~11.45 → 11.45~12.70: “…and he’s got a bag” → “it’s like a crumpled up bag”
Longer gazes follow right after short gazes; probably the subjects thought the construction was too simple, or was missing crucial information, to be produced as a sentence

More dynamic interaction between gaze and utterance

08_disabled [fairman]
4.5~5.68 → 5.68~7.42: “...someone who just fell...” → “..black woman just fell...number seven”; probably the construction required more detailed information to fill in, instead of just ‘someone’

05_basketball [fairman]
10.35~12.28 → 13.41~14.39: “...being guarded by another guy in the black who...” → the sentence produced first seemed to require more information (beyond the gender and the color of the uniform) about the guy blocking the player; “...looks kinda out of it...” later turned out to be irrelevant

Page 19

The Spatial Anchor Hypothesis (Hypothesis IV) asserts that attentional focus serves to associate spatial pointers with perceived elements; accessing them later in working memory (WM) requires covert/overt attention shifts.


1) SemRep (Arbib & Lee, 2008): an internal structure that encodes semantic information associated with the elements of the scene by coarsely coded spatial coordinates

2) Hollywood Squares experiment (Richardson & Spivey, 2000)

3) The tropical fish experiment (Laeng & Teodorescu, 2002)

More likely to see here!

Figure 2. Richardson & Spivey, 2000

NOTE: This experiment has not been conducted yet (a sketch of the spatial-pointer idea follows)
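A minimal sketch of the spatial-pointer idea, assuming hypothetical names and normalized coordinates: recalling a WM element re-engages its coarsely coded spatial pointer and drives an attention shift back toward the remembered location.

```python
# Each SemRep element in working memory keeps a coarsely coded spatial pointer.
working_memory = {
    "WOMAN": {"concept": "WOMAN", "pointer": (0.25, 0.40)},  # normalized (x, y)
    "HAT":   {"concept": "HAT",   "pointer": (0.27, 0.18)},
}

def shift_attention_to(pointer):
    # Stand-in for biasing covert/overt attention toward remembered coordinates.
    print(f"attention shifted toward {pointer}")

def recall(label):
    """Accessing a WM element re-engages its spatial pointer (Hypothesis IV)."""
    entry = working_memory[label]
    shift_attention_to(entry["pointer"])
    return entry["concept"]

recall("HAT")  # looks back toward where the hat used to be, as in Hollywood Squares
```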

Page 20

The current study can be extended with: (1) an account of gist (covert attention), and (2) a more thorough account of the language bias on attentional shifts.


1) In almost all experimental results, ‘gist’ (with no particular gazes involved) plays a role at least in the initial phase of scene recognition; the current study focuses only on overt attention, but covert attention would be important as well

05_basketball [fairman]
1.04~1.79: “basketball” → after only a short gaze on a player

06_conversation [fairman]
1.48~2.53: “home situation” → no specific gaze found

This usually happens at the beginning of the scene display; the description later moves on to more specific objects/events in the scene

2) van der Meulen et al. (2001): even if an object is repeatedly mentioned, subjects still look back at the object, although less frequently; a good example of the influence of the language system on the vision system (the Dynamic Interaction Hypothesis). First mention: 82%; second mention: 67% (with a noun) / 50% (with a pronoun). A more thorough way of accounting for this hypothesis needs to be devised

Page 21

Part II: A New Approach – Construction Grammar and Template Construction Grammar (TCG)

Sidney Harris’s Science Cartoons (S. Harris, 2003)

Page 22

Constructions are basically ‘form-meaning pairings’ that serve as the basic building blocks of grammatical structure, thus blurring the distinction between semantics and syntax.


Goldberg (2003): Constructions are stored pairings of form and function, including morphemes, words, idioms, partially lexically filled and fully general linguistic patterns

Table 1. Examples of constructions, varying in size and complexity; form and function are specified if not readily transparent (Goldberg, 2003)

They are not grammatical components, and they might differ among individuals

Page 23

Seriously, there is virtually NO difference between semantic and grammatical components: semantic constraints and grammatical constraints are interchangeable.


Examples where the distinction between semantic and syntactic constraints blurs

(a) Bill hit Bob’s arm.
(b) Bill broke Bob’s arm.

(a) Bill hit Bob on the arm.
(b) *Bill broke Bob on the arm.

[X V Y on Z] → V has to be a ‘contact verb’ (e.g. caress, kiss, lick, pat, stroke), not a ‘break verb’ (e.g. break, rip, smash, splinter, tear) (Kemmerer, 2005)

(a) John sent the package to Jane.
(b) John sent Jane the package.

(a) John sent the package to the dormitory.
(b) *John sent the dormitory the package.

[X transfer-verb Y Z] → X and (especially) Y have to be ‘animate’ entities

The semantics of the verb, not its syntactic category, is what decides the syntactic frame of the sentence (see the sketch below).
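As a sketch of how such semantic constraints can license or block syntactic frames, the checks below encode the two examples above (the verb sets follow the lists quoted from Kemmerer, 2005; the function names and the simplified transfer-verb test are illustrative):

```python
CONTACT_VERBS = {"hit", "caress", "kiss", "lick", "pat", "stroke"}
BREAK_VERBS = {"break", "rip", "smash", "splinter", "tear"}

def licenses_body_part_frame(verb):
    """[X V Y on Z] requires a contact verb:
    'Bill hit Bob on the arm' but *'Bill broke Bob on the arm'."""
    return verb in CONTACT_VERBS

def licenses_ditransitive(verb, recipient_is_animate):
    """[X transfer-verb Y Z] requires an animate recipient:
    'John sent Jane the package' but *'John sent the dormitory the package'."""
    return verb == "sent" and recipient_is_animate  # crude transfer-verb check

assert licenses_body_part_frame("hit") and not licenses_body_part_frame("break")
assert licenses_ditransitive("sent", True) and not licenses_ditransitive("sent", False)
```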

Page 24

The construction grammar paradigm is also supported (1) from the neurophysiological perspective and (2) by the ontogenetic account.


1) Bergen & Wheeler (2005): not only the content words but also (or even more) the function words contribute to the context of a sentence

2) Each individual’s vocabulary expands/shrinks over time → grammatical categories are a theoretical, not necessarily cognitive, convenience

(a) John is closing the drawer. → motor activation (strong ACE found)
(b) John has closed the drawer. → visual activation (less or no ACE found)

ACE: Action-sentence Compatibility Effect

Although the same content words are used, the mental interpretations vary

“want milk” (holophrase) → “want X” (general construction) (Hill, 1983)

“I’m-sorry” (holophrase) → “I[NP] am[VP] sorry[ADJ]” (general construction)

Page 25

Constructions defined in Template Construction Grammar (TCG) represent the ‘form-meaning’ pairings as ‘templates’.


construction name

class → defines the syntactic hierarchy level as well as the ‘competition and cooperation’ style

Template → a SemRep fragment as the semantic meaning, and a lexical sequence (with slots) as the syntactic form (a sketch follows)
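A minimal Python sketch of such a template, with hypothetical field names (TCG’s actual construction format is richer):

```python
from dataclasses import dataclass

@dataclass
class Construction:
    """A construction as a 'form-meaning' template."""
    name: str           # construction name, e.g. "SVO" or "WOMAN"
    cls: str            # class: syntactic hierarchy level; also governs the
                        # competition-and-cooperation style
    sem_template: dict  # the SemRep fragment this construction means
    form: list          # lexical sequence; strings are words, dicts are slots

svo = Construction(
    name="SVO", cls="sentence",
    sem_template={"nodes": ["AGENT", "ACTION", "PATIENT"]},
    form=[{"slot": "AGENT"}, {"slot": "ACTION"}, {"slot": "PATIENT"}])

woman = Construction(
    name="WOMAN", cls="noun",
    sem_template={"nodes": ["WOMAN"]},
    form=["woman"])
```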

Page 26

There are broadly two types of constructions; one encodes rather simple lexical information while the other encodes a more abstract syntactic structure.


Complex one → encodes a higher-level abstract structure; slots define syntactic connections while red lines indicate the expression order

Simple one → encodes only lexical information; it corresponds to a word

Page 27

In the production mode of TCG, constructions try to ‘cover’ the SemRep generated by the vision system; the matching is done based on similarity.


More than one construction may be attached at a given moment; this is sorted out later (see the sketch below).
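A crude sketch of this ‘cover’ step, reusing the hypothetical Construction class above; real TCG matching is a graph- and class-based similarity, for which plain label containment stands in here (so only lexical templates match):

```python
def attach_constructions(semrep_concepts, lexicon):
    """Attach every construction whose template fragment is found in the
    SemRep currently held in working memory."""
    attached = []
    for cxn in lexicon:
        wanted = set(cxn.sem_template["nodes"])
        if wanted <= semrep_concepts:   # template fully covered by the SemRep
            attached.append(cxn)
    return attached

scene_concepts = {"WOMAN", "MAN", "HITTING ACTION", "BLUE CLOTHES"}
print(attach_constructions(scene_concepts, [woman]))
# Several constructions may attach to the same node at once;
# the surplus is sorted out later by competition.
```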

Page 28

Constructions interact with each other under the ‘competition and cooperation’ paradigm; they compete when there are spatial conflicts (among constructions of the same class).



Page 29

Constructions interact with each other under the ‘competition and cooperation’ paradigm; they cooperate (conjoin) when a grammatical connection is possible.


COOPERATION (all over the place)

Conjoined constructions form a hierarchical structure that resembles a parse tree (a sketch of both interactions follows)
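A sketch of both interaction styles, again reusing the hypothetical Construction class; the conflict key and the slot-filling readout are simplified stand-ins:

```python
def compete(attached):
    """Same-class constructions claiming the same SemRep nodes conflict;
    keep one winner per conflict (here simply the first; a fuller model
    would score the competitors)."""
    winners, claimed = [], set()
    for cxn in attached:
        key = (cxn.cls, frozenset(cxn.sem_template["nodes"]))
        if key not in claimed:
            claimed.add(key)
            winners.append(cxn)
    return winners

def cooperate(parent, bindings):
    """Conjoin: fill each slot of a higher construction with a cooperating
    lower one; reading the leaves in form order yields the utterance."""
    out = []
    for item in parent.form:
        out.append(bindings[item["slot"]].form[0] if isinstance(item, dict) else item)
    return " ".join(out)

# With hypothetical 'hits' and 'man' lexical constructions:
# cooperate(svo, {"AGENT": woman, "ACTION": hits, "PATIENT": man}) -> "woman hits man"
```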

Page 30

The proposed system consists of three main parts, the vision system, the language system and the visuo-linguistic working memory (VLWM), with attention on top of them.


building SemRep from visual scenes

TCG works here

Three parts communicate via attention

SemRep

Page 31

Concurrency is a virtue of the system: it is designed as a real-time multi-schema network, which allows the system a great deal of flexibility.

1) The results of the eye-tracking experiment strongly implicate parallelism

2) Constructions are basically schemas that run in parallel (the competition and cooperation paradigm)



The vision system provides parts of SemRep produced as the interpretation of the scene

VLWM holds SemRep (with constructions attached) changing dynamically

The language system attaches constructions and reads off the (partially) complete sentence at each moment

Page 32

Attention is proposed as a computational resource, rather than a procedural unit, which acts across the whole system; thus it can serve as a communication medium.

Zoom Lens model of attention (Eriksen & Yeh, 1985): attention is a kind of multi-resolution focus; the larger the attended region, the lower the level of detail

Low resolution → used for a large region, encompassing more objects with fewer details; perceiving groups of entities as a coherent whole

High resolution → used for a small region, with fewer objects and more details; perceiving individual entities

[Diagram: Attention mediates between the Language System, the Vision System and VLWM; total computational resource required]

More attention is required to retrieve more concrete details; this corresponds to the PSS account (Barsalou, 1999)

There is a forgetting effect that discards SemRep components or constructions; retaining them requires a certain amount of computational resource over a certain time duration

requesting more details on a certain part of SemRep by biasing attention distribution

There is an attentional map that covers the regions of the visual input via SemRep; this affects the operation of the vision system and VLWM

(total computational resource required = attention at each time unit × time duration; a resource-allocation sketch follows)
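A minimal sketch of this resource view, with an assumed budget and assumed decay constants (none of these numbers come from the model): subsystems bid for attention, granted attention refreshes SemRep elements, and unrefreshed elements decay until forgotten.

```python
def attention_step(elements, bids, budget=1.0, decay=0.2, floor=0.05):
    """elements: {name: activation}; bids: {name: requested attention}.
    Attention is granted proportionally to the bids within the total budget;
    granted attention refreshes activation, everything else decays, and
    elements falling below the floor are forgotten (discarded)."""
    total = sum(bids.values()) or 1.0
    for name in list(elements):
        grant = budget * bids.get(name, 0.0) / total
        elements[name] = elements[name] * (1 - decay) + grant
        if elements[name] < floor:   # the forgetting effect
            del elements[name]
    return elements

wm = {"WOMAN": 1.0, "HAT": 0.3}
for _ in range(10):                  # keep attending to WOMAN only
    wm = attention_step(wm, {"WOMAN": 0.8})
print(wm)                            # HAT has decayed away; WOMAN persists
```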

Page 33

Constructions are distributed, centered around the perisylvian regions, which include the classical Broca’s (BA 45/44) and Wernicke’s (BA 22) areas and the left superior temporal lobe.


a phonological rehearsal device as well as a working memory circuit for complex syntactic verbal processes (Aboitiz & Garcia, 1997)

a ‘functional web’ linking phonological information related to the articulatory and acoustic pattern of a word form is developed (Pulvermuller, 2001)

Kaan & Swaab, 2002; Just et al., 1996; Keller et al., 2001; Stowe et al., 2002

Page 34

Concrete words, i.e. the lexical constructions, are topographically distributed across brain areas associated with processing the corresponding categorical properties.


It is also supported by the category-specific deficit accounts (Warrington & McCarthy, 1987; Warrington & Shallice, 1984; Caramazza & Shelton, 1998; Tyler & Moss, 2001).

concepts of animals → mostly associated with the temporal areas (visual properties)

concepts of tools or actions → correlated with the motor and parietal areas (action and tool use)

Page 35

The higher-level constructions that encode grammatical information might be distributed across the areas stretching from the inferior to the medial part of the left temporal cortex.


the left inferior temporal and fusiform gyri are activated during the processing of pragmatic, semantic and syntactic linguistic information (Kuperberg et al., 2000)

The perirhinal cortex is involved in object identification and the formation of object representations by integrating multi-modal attributes (Murray & Richmond, 2001)

Page 36

The syntactic manipulation of constructions (competition and cooperation) mainly happens in Broca’s area, which is believed to be activated when handling complex symbolic structures.


for retrieving semantic or linguistic components during the matching and ordering process

for handling the arrangement of lexical items

Broca’s area is activated more when handling sentences of complex syntactic structure than sentences of simple structure (Stromswold et al., 1996)

BA 45 is activated by both speech and signing during the production of language narratives done by bilingual subjects whereas BA 44 is activated by the generation of complex articulatory movements of oral/laryngeal or limb musculature (Horwitz et al., 2003)

Page 37

VLWM is built around the left dorsolateral prefrontal cortex (DLPFC; BA 9/46), stretching to the left posterior parietal cortex (BA 40).


the executive component for processing of the contents of working memory (Smith et al., 1998)

the storage buffer for verbal working memory circuit (Smith et al., 1998)

The monkey prefrontal cortex has been found to be involved in sustaining memory for object identity and location (Rainer et al., 1998)

Page 38

References

Aboitiz, F; Garcia, R (1997) The evolutionary origin of the language areas in the human brain. A neuroanatomical perspective, BRAIN RESEARCH REVIEWS 25 (3):381-396

Arbib, MA; Lee, J (2008) Describing visual scenes: towards a neurolinguistics based on Construction Grammar, BRAIN RESEARCH 1225:146-162

Barsalou, LW (1999) Perceptual symbol systems, BEHAVIORAL AND BRAIN SCIENCES 22 (4):577-660

Price, C; Thierry, G; Griffiths, T (2005) Speech-specific auditory processing: where is it?, TRENDS IN COGNITIVE SCIENCES 9 (6):271-276

Barsalou, LW; Simmons, WK; Barbey, AK; Wilson, CD (2003) Grounding conceptual knowledge in modality-specific systems, TRENDS IN COGNITIVE SCIENCES 7 (2):84-91

Bergen, BK; Wheeler, KB (2005) Sentence understanding engages motor processes, PROCEEDINGS OF THE TWENTY-SEVENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY

Bock, JK; Irwin, DE; Davidson, DJ (2005) Putting first things first, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:249-278 (Chapter 8)

Caramazza, A; Shelton, JR (1998) Domain-specific knowledge systems in the brain: The animate-inanimate distinction, JOURNAL OF COGNITIVE NEUROSCIENCE 10 (1):1-34


Page 39

References

Damasio, AR (1989) Time-locked multiregional retroactivation - A systems-level proposal for the neural substrates of recall and recognition, COGNITION 33 (1-2):25-62

Eriksen, CW; Yeh, YY (1985) Allocation of attention in the visual-field, JOURNAL OF EXPERIMENTAL PSYCHOLOGY-HUMAN PERCEPTION AND PERFORMANCE 11 (5):583-597

Goldberg, AE (2003) Constructions: a new theoretical approach to language, TRENDS IN COGNITIVE SCIENCES 7 (5):219-224

Griffin, ZM (2001) Gaze durations during speech reflect word selection and phonological encoding, COGNITION 82 (1):B1-B14

Griffin, ZM (2005) Why look? Reasons for eye movements related to language production, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:213-248 (Chapter 7)

Horowitz, TS; Wolfe, JM (1998) Visual search has no memory, NATURE 394 (6):575-577

Itti, L; Arbib, MA (2006) Attention and the minimal subscene, Action to Language: via the Mirror Neuron System (Arbib, MA ed), Cambridge University Press:289-346 (Chapter 9)

Just, MA; Carpenter, PA; Keller, TA; Eddy, WF; Thulborn, KR (1996) Brain activation modulated by sentence comprehension, SCIENCE 274 (5284):114-116


Page 40

References

Kaan, E; Swaab, TY (2002) The brain circuitry of syntactic comprehension, TRENDS IN COGNITIVE SCIENCES 6 (8):350-356

Keller, TA; Carpenter, PA; Just, MA (2001) The neural bases of sentence comprehension: a fMRI examination of syntactic and lexical processing, CEREBRAL CORTEX 11 (3):223-237

Kuperberg, GR; McGuire, PK; Bullmore, ET; Brammer, MJ; Rabe-Hesketh, S; Wright, IC; Lythgoe, DJ; Williams, SCR; David, AS (2000) Common and distinct neural substrates for pragmatic, semantic, and syntactic processing of spoken sentences: an fMRI study, JOURNAL OF COGNITIVE NEUROSCIENCE 12 (2):321-341

Laeng, B; Teodorescu, D (2002) Eye scanpaths during visual imagery reenact those of perception of the same visual scene, COGNITIVE SCIENCE 26 (2):207-231

Meyer, AS (2005) The use of eye tracking in studies of sentence generation, The Interface of Language, Vision, and Action (Henderson, JM; Ferreira F ed), Psychology Press:191-211 (Chapter 6)

Murray, EA; Richmond, BJ (2001) Role of perirhinal cortex in object perception, memory, and associations, CURRENT OPINION IN NEUROBIOLOGY 11 (2):188-193

Pulvermuller, F (2001) Brain reflections of words and their meaning, TRENDS IN COGNITIVE SCIENCES 5 (12):517-524


Page 41

References

Rainer, G; Asaad, WF; Miller, EK (1998) Memory fields of neurons in the primate prefrontal cortex, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (25):15008-15013

Richardson, DC; Spivey, MJ (2000) Representation, space and Hollywood Squares: looking at things that aren’t there anymore, COGNITION 76 (3):269-295

Simmons, WK; Barsalou, LW (2003) The similarity-in-topography principle: Reconciling theories of conceptual deficits, COGNITIVE NEUROPSYCHOLOGY 20 (3-6):451-486

Smith, EE; Jonides, J; Marshuetz, C; Koeppe, RA (1998) Components of verbal working memory: Evidence from neuroimaging, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 95 (3):876-882

Stowe, L; Withaar, R; Wijers, A; Broere, C; Paans, A (2002) Encoding and storage in working memory during sentence comprehension. Sentence Processing and the Lexicon: Formal, Computational and Experimental Perspectives (Merlo, P; Stevenson, S eds), John Benjamins Publishing Company:181-206

Stromswold, K; Caplan, D; Alpert, N; Rauch, S (1996) Localization of syntactic comprehension by positron emission tomography, BRAIN AND LANGUAGE 52 (3):452-473

Tyler, LK; Moss, HE (2001) Towards a distributed account of conceptual knowledge, TRENDS IN COGNITIVE SCIENCES 5 (6):244-252


Page 42

References

van der Meulen, FF; Meyer, AS; Levelt, WJ (2001) Eye movements during the production of nouns and pronouns, MEMORY AND COGNITION 29 (3):512-521

van der Meulen, FF (2003) Coordination of eye gaze and speech in sentence production, Mediating

Warrington, EK; Shallice, T (1984) Category specific semantic impairments, BRAIN 107 (SEP):829-854

Warrington, EK; McCarthy, RA (1987) Categories of knowledge: Further fractionations and an attempted integration, BRAIN 110 (5):1273-1296


