Algorithm Optimization Case Study

Edge Detection

David Caliga

SRC Computers, Inc.

RSSI 2008

©2008 SRC Computers, Inc. ALL RIGHTS RESERVED

www.srccomputers.com


Edge Detect Algorithm

Median Filter to remove noise
– 3 x 3 stencil

Prewitt or Sobel Edge Detect
– Uses 3 x 3 stencil for X and Y templates to calculate gradient values

• Prewitt templates

X Template     Y Template     Pixel data
-1  0  1        1  1  1       a00 a01 a02
-1  0  1        0  0  0       a10 a11 a12
-1  0  1       -1 -1 -1       a20 a21 a22

• Prewitt gradient
– SQRT (X*X + Y*Y) / 4

• X and Y sums (note: the weights of 2 on the middle row and column are the Sobel weighting; with the Prewitt templates every nonzero weight is 1)

X = -1*a00 + 1*a02 - 2*a10 + 2*a12 - 1*a20 + 1*a22

Y = 1*a00 + 2*a01 + 1*a02 - 1*a20 - 2*a21 - 1*a22


Sample Code

Median Filter Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 9 input values in 3x3 stencil
    a00 = REF (AL, i, j);
    a01 = REF (AL, i, j+1);
    a02 = REF (AL, i, j+2);
    a10 = REF (AL, i+1, j);
    a11 = REF (AL, i+1, j+1);
    a12 = REF (AL, i+1, j+2);
    a20 = REF (AL, i+2, j);
    a21 = REF (AL, i+2, j+1);
    a22 = REF (AL, i+2, j+2);
    // compute median filter (pix is the output pixel; px is the image width)
    median_9 (a00, a01, a02,
              a10, a11, a12,
              a20, a21, a22, &pix);
    // write output value to memory
    REF (BL, i, j) = pix; }

Edge Detect Code

for (i=0; i<py-2; i++)
  for (j=0; j<px-2; j++) {
    // get 8 input values in 3x3 stencil
    b00 = REF (BL, i, j);
    b01 = REF (BL, i, j+1);
    b02 = REF (BL, i, j+2);
    b10 = REF (BL, i+1, j);
    b12 = REF (BL, i+1, j+2);
    b20 = REF (BL, i+2, j);
    b21 = REF (BL, i+2, j+1);
    b22 = REF (BL, i+2, j+2);
    // apply template
    hz = (b00 + b01 + b02) -
         (b20 + b21 + b22);
    vt = (b00 + b10 + b20) -
         (b02 + b12 + b22);
    // compute gradient
    pix = sqrtf (hz*hz+vt*vt);
    // write output value to memory
    if ((i>=2) & (j>=2))
      REF (CL, i-2, j-2) = pix>255?255:pix; }


Optimization Process

Iterative process

– Maximize computation per loop iteration

– Flatten nested loops

– Do things in parallel

– Overlap DMAs and compute

– Maximize use of communication bandwidth

Keep going until you run out of a resource

– Off-chip Memory (OBM) accesses per clock

– Computational logic

– Internal memory blocks

– Multipliers

– Etc.


Optimization

Loop Firing Rate

Goal: Make computational loops fire every clock

Things that can prevent getting to the goal

– Multiple accesses per clock to memories

• Example of median code loop

– Loop carried scalar problems

• E.g.: sxloc1++; if (sxloc1 == px-1) sxloc1 = 0;

################## INNER LOOP SUMMARY ####################

loop on line 41:

clocks per iteration: 9

multiple reads of 'OBM bank A' required 8 additional clocks

################## INNER LOOP SUMMARY ####################

loop on line 52:

clocks per iteration: 3

loop-carried var 'sxloc1' required 2 additional clocks

######################################################################


Optimization

Flatten Nested Loops

Nested Compute time

– Time = Outer_cnt-1 * ((Inner_cnt-1 + pipeline depth) + outer_work)

– Not optimal if the pipeline depth is “large” relative to inner trip count

– Not optimal if “outer_work” is large

Flattened Compute time

– Time = (Outer_cnt * Inner_cnt) – 1 + “new pipeline depth”


Optimization

Eliminate Loop Carried Scalars

Carte™ supplies many functions that users can call in their code

– Accumulators

– Counters

– Bitwise operations

– Min, Max

– Etc.


Optimization

Flattened Loops

Median Filter Code

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  a00 = REF (AL, i, j);
  a01 = REF (AL, i, j+1);
  a02 = REF (AL, i, j+2);
  a10 = REF (AL, i+1, j);
  a11 = REF (AL, i+1, j+1);
  a12 = REF (AL, i+1, j+2);
  a20 = REF (AL, i+2, j);
  a21 = REF (AL, i+2, j+1);
  a22 = REF (AL, i+2, j+2);
  // pix is the output pixel; px is the image width
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &pix);
  REF (BL, i, j) = pix; }

Edge Detect Code

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);
  b00 = REF (BL, i, j);
  b01 = REF (BL, i, j+1);
  b02 = REF (BL, i, j+2);
  b10 = REF (BL, i+1, j);
  b12 = REF (BL, i+1, j+2);
  b20 = REF (BL, i+2, j);
  b21 = REF (BL, i+2, j+1);
  b22 = REF (BL, i+2, j+2);
  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  pix = sqrtf (hz*hz+vt*vt);
  if ((i>=2) & (j>=2))
    REF (CL, i-2, j-2) = pix > 255 ? 255 : pix; }


Performance Gains

640 x 480 Image

[Bar chart, vertical axis 50 to 400]

Original             1.0
Opt: Flatten Loops   1.02


Optimization

Remove Multiple Reads to OBM

The major loop slowdown noted by the compiler occurs because the compute loops read 9 (median filter) or 8 (edge detect) array values from a single memory per loop iteration

Solution: Use the delay queue feature in the Carte™ compiler

Pipelining Stencil Code

Compute Process

Move a window through the image

Data access input(i) -> Compute f(x1,x2,..x9) -> Data Storage f(x)

[Animation over six slides: a 3 x 3 window of points x1..x9 moves through the image]

– 9 points in stencil.

– 6 have been seen before.

– 8 have been seen before.

– The leading point should be the only data access.

Stencil Data Flow

Data access input(x) -> 9 Scalars (x1..x9) -> Compute f(x1,x2,..x9) -> Data Storage f(x)

(16-unit Shift Register, remembers previous row)

delayq(in,&out);

[Animation over 17 slides: inputs i0, i1, i2, ... arrive one per step. Each new input enters the newest row of the nine scalars and a 16-entry delay queue; the output of that queue feeds the middle row and a second 16-entry delay queue, whose output feeds the oldest row. Example frame: when i34 arrives, the nine scalars hold i0 i1 i2 / i16 i17 i18 / i32 i33 i34. Final frame: when i47 arrives, they hold i13 i14 i15 / i29 i30 i31 / i45 i46 i47.]


Optimization

Delay Queues

Median Filter Code and Edge Detect Code (one loop)

for (n=0; n<(px-2)*(py-2); n++) {
  cg_count_ceil_32 (1, 1, n==0, px-2, &i);
  cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);

  // median filter: one memory read per iteration
  mvalue = REF (AL, i, j);
  a20 = a21;
  a21 = a22;
  a22 = mvalue;
  a10 = a11;
  a11 = a12;
  delay_queue_8_var (a22,1,n==0,px, &a12);
  a00 = a01;
  a01 = a02;
  delay_queue_8_var (a12,1,n==0,px, &a02);
  // pix is the output pixel; px is the image width
  median_9 (a00, a01, a02,
            a10, a11, a12,
            a20, a21, a22,
            &pix);

  // edge detect: consumes the median output directly
  b20 = b21;
  b21 = b22;
  b22 = pix;
  b10 = b11;
  b11 = b12;
  delay_queue_8_var (b22,1,n==0,px, &b12);
  b00 = b01;
  b01 = b02;
  delay_queue_8_var (b12,1,n==0,px, &b02);
  hz = (b00 + b01 + b02) -
       (b20 + b21 + b22);
  vt = (b00 + b10 + b20) -
       (b02 + b12 + b22);
  pix = sqrtf (hz*hz+vt*vt);
  if ((i>=2) & (j>=2))
    REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
}


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original             1.0
Opt: Flatten Loops   1.02
Opt: Delay Queues    9.04


Optimization

Overlap DMA and Compute

Use “streams” feature in Carte™

A stream is an efficient communication mechanism between producer and consumer loops

Producer and consumer loops are executing in “parallel”

– Consumer loop will use a value from the producer loop as soon as it is generated

– Loops that are inherently sequential can execute in parallel

DMAs can produce streams that are consumed in a compute loop


Optimization

Use of Streams

Median Filter Code and Edge Detect Code (one loop)

#pragma src parallel sections {
  #pragma src section {
    streamed_dma_cpu (&S0, PORT_TO_STREAM,
                      PATH_0, image_in, 1, nbytes); }

  #pragma src section {
    for (n=0; n<(px-2)*(py-2); n++) {
      cg_count_ceil_32 (1, 1, n==0, px-2, &i);
      cg_count_ceil_32 (i==1, 1, n==0, 0xffffffff, &j);

      // median filter: input now comes from the DMA stream
      get_stream (&S0, &mvalue);
      a20 = a21; a21 = a22; a22 = mvalue;
      a10 = a11; a11 = a12;
      delay_queue_8_var (a22,1,n==0,px, &a12);
      a00 = a01; a01 = a02;
      delay_queue_8_var (a12,1,n==0,px, &a02);
      // pix is the output pixel; px is the image width
      median_9 (a00, a01, a02,
                a10, a11, a12,
                a20, a21, a22, &pix);

      // edge detect
      b20 = b21;
      b21 = b22;
      b22 = pix;
      b10 = b11;
      b11 = b12;
      delay_queue_8_var (b22,1,n==0,px, &b12);
      b00 = b01;
      b01 = b02;
      delay_queue_8_var (b12,1,n==0,px, &b02);
      hz = (b00 + b01 + b02) -
           (b20 + b21 + b22);
      vt = (b00 + b10 + b20) -
           (b02 + b12 + b22);
      pix = sqrtf (hz*hz+vt*vt);
      if ((i>=2) & (j>=2))
        REF (BL, i-2, j-2) = pix > 255 ? 255 : pix;
    }
  } } // end parallel section and region


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original              1.0
Opt: Flatten Loops    1.02
Opt: Delay Queues     9.04
Opt: Streaming DMAs   12.05


Optimization

Loop unrolling

Do more work during a loop iteration

Take advantage of the fact that the pixels are 8-bit data packed into 64-bit values

Unrolling by 8 will reduce the compute time by 8x


Optimization

Loop Unrolling by 8

Median Filter Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1, 1, n==0, px8, &i);
    cg_count_ceil_32 (i==1, 1, n==0, py, &j);

    get_stream (&S0, &w1);

    /*  |   row
        ||  word
        ||| byte
        vvv
       v111 */

    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1,1,n==0,px8, &w2);
    split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a118=v0; a119=v1;

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2,1,n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;


Optimization

Loop Unrolling by 8

Median Filter Code

median_8_9 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

median_8_9 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);

median_8_9 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);

median_8_9 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);

median_8_9 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);

median_8_9 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);

median_8_9 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);

median_8_9 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

put_stream (&S1, b1, 1);

} // end for loop

} // end parallel section

// continued into edge detect


Optimization

Loop Unrolling by 8

Edge Detect Code

#pragma src section {
  px8 = px/8;
  for (n=0; n<((px8)*(py-2)); n++) {
    cg_count_ceil_32 (1,0,n==0,px8,&j);
    cg_count_ceil_32 (j==0,0,n==0,py,&i);

    get_stream (&S1, &w1);

    /*  |   row
        ||  word
        ||| byte
        vvv
       v111 */

    a210=v0; a211=v1; a212=v2; a213=v3; a214=v4; a215=v5; a216=v6; a217=v7;
    split_64to8 (w1, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a218=v0; a219=v1;

    a110=v0; a111=v1; a112=v2; a113=v3; a114=v4; a115=v5; a116=v6; a117=v7;
    delay_queue_64_var (w1, 1, n==0, px8, &w2);
    split_64to8 (w2, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a118=v0; a119=v1;

    a010=v0; a011=v1; a012=v2; a013=v3; a014=v4; a015=v5; a016=v6; a017=v7;
    delay_queue_64_var (w2, 1, n==0, px8, &w3);
    split_64to8 (w3, &v7, &v6, &v5, &v4, &v3, &v2, &v1, &v0);
    a018=v0; a019=v1;


Optimization

Used Inlining of edge_detect_8

Edge Detect Code

edge_detect_8 (a010, a011, a012, a110, a111, a112, a210, a211, a212, &p0);

edge_detect_8 (a011, a012, a013, a111, a112, a113, a211, a212, a213, &p1);

edge_detect_8 (a012, a013, a014, a112, a113, a114, a212, a213, a214, &p2);

edge_detect_8 (a013, a014, a015, a113, a114, a115, a213, a214, a215, &p3);

edge_detect_8 (a014, a015, a016, a114, a115, a116, a214, a215, a216, &p4);

edge_detect_8 (a015, a016, a017, a115, a116, a117, a215, a216, a217, &p5);

edge_detect_8 (a016, a017, a018, a116, a117, a118, a216, a217, a218, &p6);

edge_detect_8 (a017, a018, a019, a117, a118, a119, a217, a218, a219, &p7);

comb_8to64 (p7,p6,p5,p4,p3,p2,p1,p0,&b1);

ix = (i-4)*px8 + j-2;

if ((i>=4) & (j>=2)) DL[ix] = b1;

} // end for loop

} // end parallel section

} // end parallel region


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original              1.0
Opt: Flatten Loops    1.02
Opt: Delay Queues     9.04
Opt: Streaming DMAs   12.05
Opt: Unroll by 8      92.9


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in two 64-bit values every clock

Unroll by 16

Two output DMA examples

– DMA to microprocessor after compute

– Streaming DMA to Global Common Memory


Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections {
  #pragma src section {
    streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                          image_in, 1, nbytes); }

  #pragma src section
  {
    px16 = px/16;
    for (n=0; n<((px16)*(py-2)); n++) {
      cg_count_ceil_32 (1, 1, n==0, px16, &i);
      cg_count_ceil_32 (i==1, 1, n==0, py, &j);
      get_stream_128 (&S0, &w1, &w2);
      // 16 median_8_9 calls
      put_stream_128 (&S1, b1, b2, 1);
    } // end for loop
  } // end parallel section

Edge Detect Code

  #pragma src section
  {
    px16 = px/16;
    for (n=0; n<((px16)*(py-2)); n++) {
      cg_count_ceil_32 (1, 0, n==0, px16, &j);
      cg_count_ceil_32 (j==0, 0, n==0, py, &i);
      get_stream_128 (&S1, &w1, &w2);
      // 16 edge_detect calls
      ix = (i-4)*px16 + j-2;
      if ((i>=4) & (j>=2)) {
        CL[ix] = b11;
        DL[ix] = b12;
      }
    } // end for loop
  } // end parallel section
} // end parallel region


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original                                         1.0
Opt: Flatten Loops                               1.02
Opt: Delay Queues                                9.04
Opt: Streaming DMAs                              12.05
Opt: Unroll by 8                                 92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out   158


Optimization

Stream Input / Stream Output

Median Filter Code

#pragma src parallel sections {
  #pragma src section {
    streamed_dma_gcm_128 (&S0, PORT_TO_STREAM, PATH_0,
                          image_in, 1, nbytes); }

  #pragma src section
  {
    px16 = px/16;
    for (n=0; n<((px16)*(py-2)); n++) {
      cg_count_ceil_32 (1, 1, n==0, px16, &i);
      cg_count_ceil_32 (i==1, 1, n==0, py, &j);
      get_stream_128 (&S0, &w1, &w2);
      // 16 median_8_9 calls
      put_stream_128 (&S1, b1, b2, 1);
    } // end for loop
  } // end parallel section

Edge Detect Code

  #pragma src section
  {
    px16 = px/16;
    for (n=0; n<((px16)*(py-2)); n++) {
      cg_count_ceil_32 (1, 0, n==0, px16, &j);
      cg_count_ceil_32 (j==0, 0, n==0, py, &i);
      get_stream_128 (&S1, &w1, &w2);
      // 16 edge_detect calls
      iput = ((i>=4) & (j>=2)) ? 1 : 0 ;
      put_stream_128 (&S2, b11, b12, iput);
    } // end for loop
  } // end parallel section

  #pragma src section {
    // output DMA drains S2, the stream written by the edge detect loop
    streamed_dma_gcm_128 (&S2, STREAM_TO_PORT,
                          PATH_0, image_out, 1, nbytes); }
} // end parallel region


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original                                           1.0
Opt: Flatten Loops                                 1.02
Opt: Delay Queues                                  9.04
Opt: Streaming DMAs                                12.05
Opt: Unroll by 8                                   92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out     158
Opt: Unroll by 16, Stream DMA in, Stream DMA out   277


Optimization

Maximize Input DMA Bandwidth

Use streaming DMA to bring in roughly four 64-bit values every clock

Unroll by 32


Optimization

Stream Input / Bulk DMA Output

Median Filter Code

#pragma src parallel sections {
  #pragma src section {
    streamed_dma_gcm_256 (&S0, PORT_TO_STREAM, PATH_0,
                          image_in, 1, nbytes); }

  #pragma src section
  {
    px32 = px/32;
    for (n=0; n<((px32)*(py-2)); n++) {
      cg_count_ceil_32 (1, 1, n==0, px32, &i);
      cg_count_ceil_32 (i==1, 1, n==0, py, &j);
      get_stream_256 (&S0, &w1, &w2, &w3, &w4);
      // 32 median_8_9 calls
      put_stream_256 (&S1, b1, b2, b3, b4, 1);
    } // end for loop
  } // end parallel section

Edge Detect Code

  #pragma src section
  {
    px32 = px/32;
    for (n=0; n<((px32)*(py-2)); n++) {
      cg_count_ceil_32 (1, 0, n==0, px32, &j);
      cg_count_ceil_32 (j==0, 0, n==0, py, &i);
      get_stream_256 (&S1, &w1, &w2, &w3, &w4);
      // 32 edge_detect calls
      ix = (i-4)*px32 + j-2;
      if ((i>=4) & (j>=2)) {
        CL[ix] = b11;
        DL[ix] = b12;
        EL[ix] = b13;
        FL[ix] = b14;
      }
    } // end for loop
  } // end parallel section
} // end parallel region


Performance Gains

[Bar chart, vertical axis 50 to 400]

Original                                           1.0
Opt: Flatten Loops                                 1.02
Opt: Delay Queues                                  9.04
Opt: Streaming DMAs                                12.05
Opt: Unroll by 8                                   92.9
Opt: Unroll by 16, Stream DMA in, Bulk DMA out     158
Opt: Unroll by 16, Stream DMA in, Stream DMA out   277
Opt: Unroll by 32                                  345


How does Microprocessor Performance* Compare?

* Intel IPPLIB 5.1 running on 3GHz Xeon

Optimization           MAP Speedup   MAP Speedup
                       640 x 480     1024 x 1024
Original               .56           .69
Opt: Flattened Loops   .56           .70
Opt: Delay Queues      5.0           6.2
Opt: Streaming DMAs    6.7           8.3
Opt: Unroll by 8       51.6          56.7
Opt: Unroll by 16      154           195
Opt: Unroll by 32      191           256


Logic Utilization

Optimization Level      Percentage Logic Utilization
Original                16
Opt: Flattened Loops    16
Opt: Delay Queues       16
Opt: Streaming DMAs     16
Opt: Unrolling by 8     37
Opt: Unrolling by 16    47
Opt: Unrolling by 32    75


Performance is for the taking

Standard code optimization techniques work

Getting massive compute parallelism is straightforward

Easy to “dial” the amount of DMA bandwidth to match compute parallelism requirements

