Going after object recognition performance to discover how...

James DiCarlo MD, PhD

Professor of Neuroscience
Head, Department of Brain and Cognitive Sciences
Investigator, The McGovern Institute for Brain Research
Massachusetts Institute of Technology, Cambridge MA, USA

Going after object recognition performance to discover how the ventral stream works.

“Invariance” is the crux problem

hierarchical, working system

Ventral visual stream

Systems neuroscience: the non human primate model



Powerful set of visual features




Understanding the brain and discovering game-changing information processing technology are two sides of the same coin.

How the brain works

When biological brains perform better than computers

computer science

neuroscience

psychophysics

The convergence of three fields

How the brain works

When computers perform as well as or better than biological brains

Falsifiable hypotheses

Attempt to test/falsify those hypotheses

New ideas, algorithm parameters
New phenomena

Common physical source (object) leads to many images

Poggio, Ullman, Grossberg, Edelman, Biederman, etc.
DiCarlo and Cox, TICS (2007); Pinto, Cox, and DiCarlo, PLoS Comp Bio (2008)

“identity preserving image variation”

View: position, size, pose, illumination
Clutter, occlusion, illumination

Intraclass

Deformation, articulation


Examples:
• Hubel & Wiesel (1962)
• Fukushima (1980)
• Perrett & Oram (1993)
• Wallis & Rolls (1997)
• LeCun et al. (1998)
• Riesenhuber & Poggio (1999)
• Serre, Kouh, et al. (2005)

Brain-inspired computer algorithms

1. Selectivity (“AND”)
2. Tolerance (“OR”)
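A minimal sketch of the two operations named above, assuming that "AND"-like selectivity is modeled as a normalized template match and "OR"-like tolerance as a max over positions. The function names, the 1-D signals, and the template are hypothetical and chosen only for illustration; they are not taken from any of the cited models.

```python
import numpy as np

def and_like_selectivity(patch, template):
    """Normalized template match: high only when the template's parts occur together."""
    patch = patch / (np.linalg.norm(patch) + 1e-9)
    template = template / (np.linalg.norm(template) + 1e-9)
    return float(patch @ template)

def or_like_tolerance(responses):
    """Max pooling: high when the feature appears at any of the pooled positions."""
    return float(np.max(responses))

def feature_response(signal, template):
    """Slide the template over a 1-D signal (selectivity), then pool (tolerance)."""
    w = len(template)
    local = [and_like_selectivity(signal[i:i + w], template)
             for i in range(len(signal) - w + 1)]
    return or_like_tolerance(local)

template = np.array([1.0, 2.0, 1.0])
signal_a = np.array([0, 0, 1, 2, 1, 0, 0, 0], dtype=float)  # feature on the left
signal_b = np.array([0, 0, 0, 0, 1, 2, 1, 0], dtype=float)  # same feature, shifted
signal_c = np.array([0, 0, 2, 0, 0, 0, 0, 0], dtype=float)  # only part of the feature

resp_a = feature_response(signal_a, template)
resp_b = feature_response(signal_b, template)
resp_c = feature_response(signal_c, template)
```

In this toy case the response is the same for the shifted feature (tolerance to position) and lower for the partial feature (selectivity for the conjunction).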

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

FROM BIOLOGY:
• Hierarchy
• Spatially local filters
• Convolution
• Normalization
• Threshold NL
• Unsupervised learning
• ...




e.g. HMAX

Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio 2005

HMAX successes (~2005)

Serre, Oliva & Poggio 2007

(under limited human viewing conditions)

HMAX successes (~2007)

[Figure, circa 2007: performance axis with markers for pixels, HMAX, IT population, and human level]

~2008: But HMAX and other models failed to explain neurons

HMAX model

Representational similarity analysis

Kriegeskorte, Frontiers in Neuroscience (2009)

Biological ventral stream Models of ventral stream

What went wrong?




Stringency of these “Brains vs. Machines” tests was far too weak

“V1-like” models

One problem was insufficient variation in the test sets.

~2008: Tests of performance were not stringent enough.

Pinto, Cox, and DiCarlo, PLoS Comp Bio (2008)

[Figure labels: Caltech 101 benchmark; SLF (~HMAX)]

[Figure: Animal vs. Non-animal task. Performance (%), axis from 50 to 100, for the head, close-body, medium-body, and far-body conditions; labels for Humans, “HMAX 2.0” (Serre et al. PNAS 2007), and V1-like]

Pinto, Majaj, Barhomi, Salomon, Cox, DiCarlo COSYNE 2010

[Figure: performance axis with markers for pixels, V1-like, HMAX, IT population, and human level]

Example object recognition task: “car detection”

Pinto, Cox & DiCarlo, PLoS Comp Biol (2008); Pinto, DiCarlo and Cox, ECCV (2008); Pinto, Doukan, DiCarlo & Cox, PLoS Comp Biol (2009)


2009: More stringent, but compact tests of “object recognition”


Image generation strategy:

- Parametric control of task demand (esp. invariance)
- Few images needed to bring computer vision features to their knees

no variation more variation lots of variation

“car” not “car”

...... n>100 n>700

Basic car task, variation level: 3

2009: Toward more stringent tests of “object recognition”

Δ

Data merged here: 48 basic-level tasks (8 labels x 6 levels of variation)

Machines lose to humans

2010: Machines vs. human brains

Machines beat humans!

Increasing Composite Variation

position (x-axis)   position (y-axis)   scale   in-plane rotation   in-depth rotation
0%                  0%                  0%      0°                  0°
10%                 20%                 10%     15°                 15°
20%                 40%                 20%     30°                 30°
30%                 60%                 30%     45°                 45°
40%                 80%                 40%     60°                 60°
50%                 100%                50%     75°                 75°
60%                 120%                60%     90°                 90°
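The variation ranges listed above can be read as a schedule for a parametric image generator. The sketch below is an illustration only: it assumes each view parameter is drawn uniformly within plus or minus the range for its level, which the slides do not state, and the names `VARIATION_LEVELS` and `sample_view` are hypothetical. Rendering an object with these parameters is left out.

```python
import numpy as np

# Ranges per composite variation level, following the values listed above:
# x position steps by 10%, y position by 20%, scale by 10%, both rotations by 15 degrees.
VARIATION_LEVELS = [
    {"pos_x_pct": 10 * k, "pos_y_pct": 20 * k, "scale_pct": 10 * k,
     "rot_in_plane_deg": 15 * k, "rot_in_depth_deg": 15 * k}
    for k in range(7)
]

def sample_view(level, rng):
    """Draw one set of view parameters for a given variation level.

    Assumption: each parameter is uniform in [-limit, +limit].
    """
    limits = VARIATION_LEVELS[level]
    return {name: float(rng.uniform(-limit, limit)) for name, limit in limits.items()}

rng = np.random.default_rng(0)
no_variation = sample_view(0, rng)                  # every parameter is 0
views = [sample_view(3, rng) for _ in range(5)]     # five draws at level 3
```

Raising `level` widens every range at once, which is one way to get the parametric control of task demand described in the image generation strategy.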

[Figure: a) “cars vs. planes” task: performance (%), axis from 50 to 100 with chance marked, against composite variation level (0 to 6). b) controls (new draw, more training, other objects, multi-class): performance relative to Pixels (%). Features labeled: Pixels, V1-like, SIFT, SLF (~HMAX), Geometric Blur, PHOG, PHOW]

Pinto, Barhomi, Cox & DiCarlo, WACV (2010)

[Figure, built up across slides: performance axis (simple decode) with markers for pixels, V1-like, HMAX, V4 population, IT population, and human level; later builds add HMO, SuperVision, and Zeiler & Fergus (marked “?”)]

Neural population similarity of images along the ventral stream

[Figure: representational dissimilarity matrices over eight categories (Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, Tables; 8 objects each) for Pixels, V1-like model, V2-like model, HMO model, V4 neuronal units, and IT neuronal units. Bar plots of population similarity to IT (RDM correlation, Spearman correlation coefficient, axis from 0.0 to 0.9) for Pixels, V1-like, SIFT, HMAX, V2-like, HMO, V4 units, and IT units split-half, under image generalization, object generalization, and category generalization. Annotations mark the HMAX model and other models.]

Explanatory power of HMO model

Current maximum expected explanatory power *

Yamins, Hong, Solomon, Seibert and DiCarlo (under review)

Inspired by N. Kriegeskorte et al. (2008, 2009)
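The RDM correlation used above as "population similarity to IT" can be sketched as follows. This is a generic illustration of representational similarity analysis, not the analysis code behind the slides: it assumes dissimilarity is 1 minus the Pearson correlation between population response vectors, compares RDMs with a Spearman correlation over the upper triangle, and uses synthetic data in place of neural recordings. All names are hypothetical.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix.

    responses: array of shape (n_images, n_units).
    Entry (i, j) is 1 - Pearson correlation between the population
    response vectors for images i and j.
    """
    return 1.0 - np.corrcoef(responses)

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks.

    Assumes continuous values with no ties.
    """
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def rdm_similarity(resp_a, resp_b):
    """Spearman correlation between the off-diagonal entries of two RDMs."""
    upper = np.triu_indices(resp_a.shape[0], k=1)
    return spearman(rdm(resp_a)[upper], rdm(resp_b)[upper])

# Synthetic stand-ins: 20 images described by 5 hidden dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 5))
neural = latent @ rng.normal(size=(5, 30))       # hypothetical "IT" population
good_model = latent @ rng.normal(size=(5, 40))   # shares the hidden structure
random_model = rng.normal(size=(20, 40))         # unrelated features

sim_good = rdm_similarity(good_model, neural)
sim_random = rdm_similarity(random_model, neural)
sim_self = rdm_similarity(neural, neural)
```

Because the comparison is made on dissimilarity matrices, the model and the neural population do not need the same number of units.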

Goodness of fit to individual IT unit’s response (r²)

[Figure: goodness of fit to IT response (r²). Example units across Animals, Boats, Cars, Chairs, Faces, Fruits, Planes, and Tables: Unit 1, r² = 0.48; Unit 2, r² = 0.55; Unit 3, r² = 0.34. Histogram of the number of units by r² (n = 147). Bar plots (image generalization) for Pixels, V1-like, SIFT, HMAX, V2-like, HMO, HMO (M1 IT only), HMO (M2 IT only), and IT split-half.]

Yamins, Hong, Solomon, Seibert and DiCarlo (under review)

Ability to predict IT responses to new images and new objects is dramatically better than previous models.
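One common way to score how well model features predict a neural site on new images is to fit a linear map on some images and measure r² on held-out ones. The sketch below assumes ridge regression and a squared Pearson correlation as the score; the slides do not specify the fitting procedure, and the data here are synthetic. The names `fit_ridge`, `predict`, and `r_squared` are hypothetical.

```python
import numpy as np

def fit_ridge(features, response, alpha=1.0):
    """Closed-form ridge regression weights.

    An intercept column is appended; in this simple version the
    intercept is penalized along with the other weights.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ response)

def predict(features, weights):
    """Apply fitted weights (including the intercept) to new features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted responses."""
    return float(np.corrcoef(measured, predicted)[0, 1] ** 2)

# Synthetic stand-ins for model features and one neural site.
rng = np.random.default_rng(1)
n_images, n_features = 200, 10
features = rng.normal(size=(n_images, n_features))
true_weights = rng.normal(size=n_features)
response = features @ true_weights + rng.normal(scale=0.5, size=n_images)

# Fit on the first 150 images, score on the 50 held-out images.
train, test = np.arange(0, 150), np.arange(150, 200)
weights = fit_ridge(features[train], response[train])
r2_heldout = r_squared(response[test], predict(features[test], weights))
```

Scoring only on held-out images is what makes the number a measure of prediction rather than of fit.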

Predictions of single site IT responses from current best model

Response of neural site

Prediction of HMO model


Neural-like basic operations

[Diagram: basic operations Filter, Threshold & Saturate, Pool, and Normalize, with per-layer parameters for filter, thr, sat, pool, and norm, applied over a bank of filters 1, 2, ..., k. Layers L1, L2, L3 are formed by hierarchical stacking.]

Basic bio-constrained model component inside HMO
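A minimal sketch of one such component, following the order of operations named above (filter, threshold and saturate, pool, normalize) and stacking two layers. The details are assumptions made for illustration: random filters, max pooling, and divisive normalization across feature maps. It is not the HMO implementation.

```python
import numpy as np

def filter_op(x, kernels):
    """Valid cross-correlation. x: (c, h, w); kernels: (k, c, kh, kw) -> (k, h', w')."""
    k, c, kh, kw = kernels.shape
    _, h, w = x.shape
    out = np.empty((k, h - kh + 1, w - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out

def threshold_saturate(x, thr=0.0, sat=1.0):
    """Clip responses below the threshold and above the saturation level."""
    return np.clip(x, thr, sat)

def pool(x, size=2):
    """Max pooling over non-overlapping size x size windows (assumed pooling type)."""
    k, h, w = x.shape
    h2, w2 = h // size, w // size
    x = x[:, :h2 * size, :w2 * size].reshape(k, h2, size, w2, size)
    return x.max(axis=(2, 4))

def normalize(x, eps=1e-6):
    """Divide each location by the L2 norm across feature maps (assumed scheme)."""
    norm = np.sqrt((x ** 2).sum(axis=0, keepdims=True))
    return x / (norm + eps)

def layer(x, kernels, thr=0.0, sat=1.0, pool_size=2):
    """One component: filter -> threshold & saturate -> pool -> normalize."""
    return normalize(pool(threshold_saturate(filter_op(x, kernels), thr, sat), pool_size))

rng = np.random.default_rng(0)
image = rng.random((1, 16, 16))                 # one-channel input
kernels_l1 = rng.normal(size=(4, 1, 3, 3))      # random filters, for illustration only
kernels_l2 = rng.normal(size=(8, 4, 3, 3))

l1 = layer(image, kernels_l1)                   # shape (4, 7, 7)
l2 = layer(l1, kernels_l2)                      # shape (8, 2, 2): stacked on L1
```

Each layer's output has the same (channels, height, width) layout as its input, so layers can be stacked without any glue code.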

Hubel & Wiesel (1962), Fukushima (1980); Perrett & Oram (1993); Wallis & Rolls (1997); LeCun et al. (1998); Riesenhuber & Poggio (1999); Serre, Kouh, et al. (2005), etc....

Pinto, Doukan, DiCarlo & Cox, PLoS Comp Biol (2009)

“Output” is thousands of visual features

[Figure: exploration of the basic model class. x-axis: performance of artificial visual features (% correct). y-axis: ability of artificial visual features to predict IT responses (% variance of IT neurons explained, 0% to 50%).]

We are optimizing this way

The better a model performs, the better it explains IT responses.

(2013)

[Figure: performance axis (simple decode) with markers for pixels, V1-like, HMAX, V4 population, SuperVision, Zeiler & Fergus, and HMO, plus a “??” marker]

Today:

computer science

neuroscience

psychophysics

Follow the performance trail...

How the brain works




Stringency of these tests is crucial.

Must include “invariance”.

The power of stringent tests to elucidate biological brains

1)
• Discover IT neuronal codes that can explain behavior
• Demonstrate that other possible codes CANNOT
• Demonstrate which computer vision features CANNOT

2)
• Driving discovery (“learning?”) of new CV features
• These are becoming more and more capable of explaining what the brain is doing

Dan Yamins, Ha Hong, Charles Cadieu, Dave Cox, Nicolas Pinto

Dan Yamins, Ha Hong, Ethan Solomon
