
A High-Throughput Screening Approach to Discovering Good Forms of Biologically-Inspired Visual Representation

Nicolas Pinto (1), David Cox (2), and James DiCarlo (1)

1. The McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
2. The Rowland Institute at Harvard

Take-home message:

GPU metaprogramming dramatically accelerates the discovery of bio-inspired vision models that beat state-of-the-art computer vision systems.

1. Abstract

While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model - e.g. the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large, and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea, or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training.

Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphics cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art vision systems from the literature.

As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinnings of biological vision.


3. Metaprogramming

Recipe:
1) Use scripting (e.g. Python) and templates.
2) Instrument your code.
3) Let the computer auto-tune it!

5. High-throughput screening for good models

Pipeline: generate random models → unsupervised learning (Law & Order “world”) → test with a “screening” object recognition task → choose the best models → validate on other tasks.

4. GPU performance

[Figure: processing performance in GFLOPS (log scale, 10^-1 to 10^3), theoretical vs. observed, for the CPU and GPU implementations listed below.]

Performance / Cost (relative speedup and relative GFLOPS/$ are with respect to the single-core MATLAB implementation):

| Hardware (manufacturer, model) | Year | Software (languages; extensions / libraries) | # cores used | Full system cost (approx.) | GFLOPS observed (theoretical max) | $ / GFLOPS | Relative speedup | Relative GFLOPS / $ |
|---|---|---|---|---|---|---|---|---|
| Intel Q9450 | 2008 | MATLAB / C; MEX | 1 | $1,500** | 0.5 (5.3) | 3000 | 1x | 1x |
| Intel Q9450 | 2008 | MATLAB / C; MEX | 4 | $2,700** | 2 (21) | 1350 | 4x | 2x |
| Intel Q9450 | 2008 | C; SSE2 / pthread | 4 | $1,000 | 40 (85) | 25 | 80x | 120x |
| NVIDIA 7900 GTX | 2006 | Python / C++; OpenGL / Cg | 96 | $1,500-$3,000* | 68-272* (187) | 22-11* | 136-544x* | 136-272x* |
| Sony, IBM, Toshiba PlayStation 3 (Cell) | 2007 | Python / C; Cell SDK | 8 (2+6) | $400 | 111 max, 76 mean, 9 min (154) | | 222x | 833x |
| NVIDIA 8800 GTX | 2007 | Python / C; PyCUDA | 128 | $1,500-$3,000* | 193-772* (518) | 8-4* | 386-1544x* | 386-772x* |
| NVIDIA GTX 280 | 2008 | Python / C; PyCUDA | 240 | $1,500-$3,000* | 339-1356* (933) | 4-2* | 678-2712x* | 678-1356x* |

Screening distribution:

[Figure: histogram of the number of models (0-250) vs. percent correct (50-100%) on the screening task, for N = 2500 randomly generated models; the top 5 models are highlighted.]

• bio-inspired architecture
• 52 free parameters
• >10^25 possible models
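As a back-of-the-envelope check of that last figure (our illustration; the poster does not give per-parameter value counts), even a handful of allowed values per parameter makes the space astronomically large:

# with 52 free parameters, a few discrete values each is already enough
# to exceed 10^25 distinct models
print(3 ** 52)   # about 6.5e24
print(4 ** 52)   # about 2.0e31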

Best models

[Figure: percent correct (50-100%) on Cars vs. Planes for the top 5 models from the screen (models 5, 4, 3, 2, 1, and their blend), re-run with different initial weights and different videos.]

7. Analysis: Why are these models good?

• Why are the best models better?
• How can we adjust the parameter space to get farther?

[Figure: percent correct (50-90%) as a function of individual parameter values, for example parameters such as Preprocessing / Norm / Zero-mean (False / True), Layer 2 / Linear Filtering / RF size, and Layer 3 / Learning / Neighborhood size.]

[Figure: Parameters similarity (top-five vs. rest): histogram of the median of Hamming distances (from 5-choose-2 binarized parameter vectors); N = 100,000, p = 0.136. Representation similarity (top-five vs. rest): histogram of the median of Euclidean distances (from 5-choose-2 similarity matrices); N = 100,000, p = 0.082.]
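A sketch of the kind of comparison summarized above, assuming the statistic is the median pairwise Hamming distance among the five binarized parameter vectors and that the null distribution comes from random groups of five screened models (the binarization scheme and tail direction are our assumptions):

import numpy as np
from itertools import combinations

def median_pairwise_hamming(vectors):
    # vectors: (5, n_params) binary array; 5-choose-2 = 10 pairs
    return np.median([np.sum(a != b) for a, b in combinations(vectors, 2)])

def similarity_pvalue(binary_params, top5_idx, n_perm=100000, seed=0):
    rng = np.random.RandomState(seed)
    observed = median_pairwise_hamming(binary_params[top5_idx])
    null = np.array([
        median_pairwise_hamming(
            binary_params[rng.choice(len(binary_params), 5, replace=False)])
        for _ in range(n_perm)])
    # a small p-value means the top-five parameter vectors are unusually similar
    return np.mean(null <= observed)

# toy usage: 2500 screened models, 52 binarized parameters, random data
X = (np.random.RandomState(1).rand(2500, 52) > 0.5).astype(int)
print(similarity_pvalue(X, top5_idx=np.arange(5), n_perm=1000))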

[Figure: Parameter significance: histogram of -ln(p) per parameter. Parameters by category (Filtering, Activation, Normalization, Pooling, Learning): fraction of parameters at low, medium, and high significance, compared with the expected fraction.]
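One way such a per-parameter significance score could be computed (an assumption for illustration; the poster does not name the test) is to ask whether screening accuracy differs across the values of each parameter, e.g. with a Kruskal-Wallis test, and to report -ln(p):

import numpy as np
from scipy import stats

def parameter_significance(param_values, accuracy):
    # param_values: (n_models,) discrete setting of one parameter
    # accuracy: (n_models,) screening percent correct
    groups = [accuracy[param_values == v] for v in np.unique(param_values)]
    _, p = stats.kruskal(*groups)
    return -np.log(p)

# toy usage: 2500 models, one 4-valued parameter, random accuracies
rng = np.random.RandomState(0)
values = rng.randint(0, 4, size=2500)
accuracy = 50 + 10 * rng.rand(2500)
print(parameter_significance(values, accuracy))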

conv_kernel_template.cu:

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

#for j in xrange($FILTER_H)

__global__ void convolve_beta_j${j}(float4 *input, float4 *output)
{

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
  __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
  if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
    shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
    shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
    shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
    ...

Generated machine code for two template-parameter choices (version B runs 2x faster):

version A:
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B (2x faster!):
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

Cheetah templating expands this template into concrete kernels, e.g. conv_kernel_4x4x4.cu (20 kB) and conv_kernel_8x8x4.cu (64 kB).

conv_kernel_4x4x4.cu (generated, excerpt):

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

__global__ void convolve_beta_j0(float4 *input, float4 *output)
{

  __shared__ float shared_in[131][4+1];

  // -- input/output offsets
  const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
  float4 input_v4;

  // -- load input to shared memory
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
    shared_in[threadIdx.x+128*0][0] = input_v4.x;
    shared_in[threadIdx.x+128*0][1] = input_v4.y;
    shared_in[threadIdx.x+128*0][2] = input_v4.z;
    shared_in[threadIdx.x+128*0][3] = input_v4.w;
  }
  if((threadIdx.x+128*1)<131)
  {
    input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
    shared_in[threadIdx.x+128*1][0] = input_v4.x;
    shared_in[threadIdx.x+128*1][1] = input_v4.y;
    shared_in[threadIdx.x+128*1][2] = input_v4.z;
    shared_in[threadIdx.x+128*1][3] = input_v4.w;
  }
  __syncthreads();

  // -- compute dot products
  float v, w;
  float sum0 = 0;
  float sum1 = 0;
  float sum2 = 0;
  float sum3 = 0;

  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0]; sum0 += v*w;
  w = constant[0][0][1]; sum1 += v*w;
  w = constant[0][0][2]; sum2 += v*w;
  w = constant[0][0][3]; sum3 += v*w;

  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0]; sum0 += v*w;
  w = constant[0][1][1]; sum1 += v*w;
  w = constant[0][1][2]; sum2 += v*w;
  w = constant[0][1][3]; sum3 += v*w;

  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0]; sum0 += v*w;
  ...
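A minimal sketch of the driver side of the recipe (use scripting and templates, instrument, auto-tune), written with Cheetah and PyCUDA around a toy kernel; the template, parameter names, and candidate values here are illustrative, not the poster's convolution code:

import numpy as np
from Cheetah.Template import Template
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

KERNEL_TEMPLATE = r"""
__global__ void scale(float *x, int n)
{
    const int i = blockIdx.x * $BLOCK_W + threadIdx.x;
#for u in range($UNROLL)
    if (i + $u * gridDim.x * $BLOCK_W < n)
        x[i + $u * gridDim.x * $BLOCK_W] *= 2.0f;
#end for
}
"""

def build(params):
    # expand the $-placeholders and compile the resulting CUDA source
    src = str(Template(KERNEL_TEMPLATE, searchList=[params]))
    return SourceModule(src).get_function("scale")

def benchmark(params, x_gpu, n, n_iter=20):
    fn = build(params)
    n_blocks = (n // params["UNROLL"] + params["BLOCK_W"] - 1) // params["BLOCK_W"]
    start, end = cuda.Event(), cuda.Event()
    start.record()
    for _ in range(n_iter):
        fn(x_gpu.gpudata, np.int32(n),
           block=(params["BLOCK_W"], 1, 1), grid=(n_blocks, 1))
    end.record()
    end.synchronize()
    return start.time_till(end) / n_iter   # milliseconds per launch

# auto-tuning loop: time every candidate parameter set and keep the fastest
n = 1 << 20
x_gpu = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
candidates = [dict(BLOCK_W=bw, UNROLL=u) for bw in (64, 128, 256) for u in (1, 2, 4)]
best = min(candidates, key=lambda p: benchmark(p, x_gpu, n))
print("fastest variant:", best)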

6. Results / Validation

[Figure: validation results. Percent correct (50-100%) on four tasks: Cars vs. Planes (validation), Boats vs. Animals, Synthetic Faces, and MultiPIE Hybrid. Each panel compares a V1-like control, state-of-the-art systems from the literature (SIFT, GB, PHOG, PHOW, SLF), and the top 5 models from the high-throughput search (models 5-1 and their blend).]

We discovered models that beat state-of-the-art computer vision (object and face recognition).

2. A vast space of models to explore

[Diagram: model architecture. input → L1 → L2 → L3 → read-out (standard “back-end” classifier). Each layer is parameterized by its linear filtering stage (kernel size, number of filters), its activation and normalization stages (threshold/saturation, norm strength), and its learning stage (Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...).]
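A minimal NumPy sketch of the kind of per-layer computation the diagram implies: filter with a kernel bank, clip the activation between a threshold and a saturation level, then apply divisive normalization over a local neighborhood. The exact operations, pooling, and learning rules explored on the poster are richer than this, and the parameter values below are placeholders:

import numpy as np
from scipy.signal import convolve2d

def layer_forward(x, filters, thresh=0.0, sat=1.0, norm_strength=1.0, norm_size=3):
    # x: (H, W) input map; filters: (n_filters, k, k) kernel bank
    # -- linear filtering: one output map per kernel
    maps = np.stack([convolve2d(x, f, mode='valid') for f in filters])
    # -- activation: clip between threshold and saturation
    maps = np.clip(maps, thresh, sat)
    # -- divisive normalization over a local spatial neighborhood, across filters
    box = np.ones((norm_size, norm_size)) / norm_size ** 2
    energy = np.stack([convolve2d(m ** 2, box, mode='same') for m in maps]).sum(0)
    return maps / (1.0 + norm_strength * np.sqrt(energy))

# toy usage: random input and a random bank of 16 kernels of size 5x5
rng = np.random.RandomState(0)
out = layer_forward(rng.rand(64, 64), rng.randn(16, 5, 5))
print(out.shape)   # (16, 60, 60)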
