
Final Presentation

Parallel Covariance Matrix Creation

Supervisor: Oded Green

Ami Galperin, Lior David

Table of Contents - Overview

• Introduction
• Building the covariance matrix
    • The naïve algorithm
    • Our algorithm
        • Terminology
        • The Algorithm
        • Optimizations
        • Results
• MVM on Plurality
    • The MVM algorithm
    • Plurality Platform
    • Results
• Future Projects
• Conclusions

April 18, 2010




Project’s Goals

• Developing a parallel algorithm for the creation of a covariance matrix
• Compatibility with Plurality’s HAL platform
• Maximized parallelization and core utilization
• Integrating the algorithm into Elta’s MVM (Minimum Variance Method) algorithm implementation


MVM Algorithm

MVM is a modern 2-D spectral estimation algorithm used by Elta’s Synthetic Aperture Radar (SAR).

Compared to what is possible with conventional Fourier transform SAR imaging techniques, the MVM algorithm:

• Improves resolution
• Removes side lobe artifacts (noise)
• Reduces speckle

One of MVM’s main building blocks is the creation of a covariance matrix.

Plurality Platform

Plurality’s HyperCore Architecture Line (HAL) family of massively parallel manycore processors features:

• Unique task-oriented programming model
• Near-serial programmability
• High performance at low cost per watt per square millimeter
• Unique shared memory architecture - 2 MB cache size


Implementing the Naïve Algorithm

Motivation: implementing the naïve algorithm will give us a greater understanding of the parallelization problem.


The Naïve Algorithm

C1,1 C1,2 C1,3 C1,4 C1,5 … C1,M
C2,1 C2,2 C2,3 C2,4 C2,5 … C2,M
C3,1 C3,2 C3,3 C3,4 C3,5 … C3,M
C4,1 C4,2 C4,3 C4,4 C4,5 … C4,M
C5,1 C5,2 C5,3 C5,4 C5,5 … C5,M
… … … … … … …
CN,1 CN,2 CN,3 CN,4 CN,5 … CN,M

Chip [NxM]

[Figure: a Sub-aperture [N1xM1] window highlighted on the chip.]

[Figure: the cells of a 3x3 Sub-aperture are stacked into the column vector (C2,2, C3,2, C4,2, C2,3, C3,3, C4,3, C2,4, C3,4, C4,4), which is multiplied by its conjugate row vector (C2,2*, C3,2*, C4,2*, C2,3*, C3,3*, C4,3*, C2,4*, C3,4*, C4,4*).]

R1,1 R1,2 R1,3 R1,4 R1,5 … R1,M∙N
R2,1 R2,2 R2,3 R2,4 R2,5 … R2,M∙N
R3,1 R3,2 R3,3 R3,4 R3,5 … R3,M∙N
R4,1 R4,2 R4,3 R4,4 R4,5 … R4,M∙N
R5,1 R5,2 R5,3 R5,4 R5,5 … R5,M∙N
… … … … … … …
RM∙N,1 RM∙N,2 RM∙N,3 RM∙N,4 RM∙N,5 … RM∙N,M∙N

Every Sub-aperture holds its covariance matrix Cov.

The covariance matrix Cov is the sum of the Cov matrices of all Sub-apertures:

$$\mathrm{Cov}_{xx} \sim \sum_{p=0}^{N-N_1} \sum_{q=0}^{M-M_1} V_{pq} V_{pq}^{*}$$

Shortcomings

• Each multiplication is executed many times. For a 32x32 chip, the total number of multiplications is 11.4M, when the optimal number of multiplications is 208K (x28!)
• The naïve algorithm is difficult to parallelize. Two main difficulties:
    • Simultaneous writing to the same Rcells – requires mutexes
    • Memory cost of holding a Cov matrix for every permutation (each is 250 KB) is too expensive
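The two multiplication counts quoted above can be reproduced with a short counting sketch. It assumes a 13x13 Sub-aperture on the 32x32 chip (the Sub-aperture size used elsewhere in these slides) and assumes that a product and its complex conjugate count as one multiplication. The function names are ours and purely illustrative.

```python
def naive_multiplications(n, m, n1, m1):
    """Multiplications when every Sub-aperture builds its full outer product."""
    windows = (n - n1 + 1) * (m - m1 + 1)   # Sub-aperture positions on the chip
    vector_len = n1 * m1                    # length of the stacked vector
    return windows * vector_len * vector_len


def unique_multiplications(n, m, n1, m1):
    """Distinct products C[a] * conj(C[b]) of cells that can share a window.

    A product and its conjugate twin are counted once (assumption).
    """
    rows = sum(n - abs(dr) for dr in range(-(n1 - 1), n1))
    cols = sum(m - abs(dc) for dc in range(-(m1 - 1), m1))
    ordered_pairs = rows * cols   # every ordered pair of nearby cells
    self_pairs = n * m            # a cell multiplied by its own conjugate
    return (ordered_pairs + self_pairs) // 2


print(naive_multiplications(32, 32, 13, 13))   # 11424400, about 11.4M
print(unique_multiplications(32, 32, 13, 13))  # 207880, about 208K
```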



Plurality Platform

Disadvantages:

• Mutexes - add complexity
• Memory space - cache size is only 2 MB

The problem requires a different solution!

A Whole Different Ball Game!

Our Algorithm


But first …

Before presenting the algorithm, we need to establish a common language for the terms we have introduced.


Terminology

Permutation

[Figure: the chip matrix with two cells, M1 and M2, marked as the multipliers of a permutation.]

Examples
• Permutation [1,0]
• Permutation [1,1]



Block

[Figure: the chip matrix with the multipliers M1 and M2 and the Block marked.]


BNW

[Figure: the chip matrix with a BNW marked.]


Shifting

[Figure: the window containing M1, M2 and the Block is shifted over the chip matrix.]

• Shift only upwards and leftwards
• The block is always inside the shifted window
• A shift of (0,0) is named the Zero iteration


Cov - the covariance matrix [M∙N, M∙N]

Rcell

[Figure: the Cov matrix with a single cell, an Rcell, marked.]


Our Algorithm – Key Features

• Parallel
• Each multiplication is executed once (208K for a 32x32 chip)
• Memory efficient
• Generic

Concept: each Rcell in Cov is calculated by one specific permutation. This enables different permutations to work simultaneously.


Our Algorithm (simplified)

1. For each permutation (1:313)
    1.1. For each legal BNW
        1.1.1. Multiply the two multipliers
        1.1.2. For each legal shift (including the zero iteration)
            1.1.2.1. Add the multiplication product to the matching Rcell in Cov
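The loop structure above can be sketched in plain Python and checked against the naïve sum of outer products. This is an illustrative reading of the slides, not the presented implementation: a permutation is modeled as a relative displacement (dr, dc) between the two multipliers, the loop over placements of the pair stands in for "each legal BNW", and the loop over windows stands in for "each legal shift". Column-by-column stacking of the Sub-aperture and all function names are assumptions.

```python
import random


def naive_cov(chip, n1, m1):
    """Reference: sum of outer products v * v^H over all Sub-apertures."""
    n, m = len(chip), len(chip[0])
    size = n1 * m1
    cov = [[0j] * size for _ in range(size)]
    for p in range(n - n1 + 1):
        for q in range(m - m1 + 1):
            # Stack the Sub-aperture column by column.
            v = [chip[p + i][q + j] for j in range(m1) for i in range(n1)]
            for a in range(size):
                for b in range(size):
                    cov[a][b] += v[a] * v[b].conjugate()
    return cov


def permutation_cov(chip, n1, m1):
    """Compute each product once, then add it to every matching Rcell."""
    n, m = len(chip), len(chip[0])
    size = n1 * m1
    cov = [[0j] * size for _ in range(size)]
    for dr in range(0, n1):                      # one pass per permutation
        for dc in range(-(m1 - 1), m1):
            if dr == 0 and dc < 0:
                continue  # (-dr, -dc) is the conjugate twin of (dr, dc)
            # Every placement of the two multipliers on the chip.
            for r1 in range(0, n - dr):
                for c1 in range(max(0, -dc), min(m, m - dc)):
                    r2, c2 = r1 + dr, c1 + dc
                    res = chip[r1][c1] * chip[r2][c2].conjugate()
                    # Every window (p, q) that contains both multipliers.
                    q_lo = max(0, max(c1, c2) - m1 + 1)
                    q_hi = min(m - m1, min(c1, c2))
                    for p in range(max(0, r2 - n1 + 1), min(n - n1, r1) + 1):
                        for q in range(q_lo, q_hi + 1):
                            a = (c1 - q) * n1 + (r1 - p)
                            b = (c2 - q) * n1 + (r2 - p)
                            cov[a][b] += res
                            if a != b:
                                # Hermitian twin of the same product.
                                cov[b][a] += res.conjugate()
    return cov


random.seed(0)
chip = [[complex(random.random(), random.random()) for _ in range(6)]
        for _ in range(5)]
ref = naive_cov(chip, 3, 2)
out = permutation_cov(chip, 3, 2)
print(max(abs(ref[a][b] - out[a][b]) for a in range(6) for b in range(6)) < 1e-9)
```

Because each (dr, dc) pass writes only to its own set of Rcells, the outer two loops are the natural unit to hand to separate cores.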


Our algorithm (simplified): finding all unique permutations

Iterative algorithm:

1. Initialize the Delta(x,y) set D and the Permutation(x,y) set P
2. For each pair of cells (M1,M2) in an N1xM1 matrix
    2.1. If |M1-M2| is not in D
        2.1.1. Add |M1-M2| to D
        2.1.2. Add (M1,M2) to P

• The unique permutation count is 313 (for a Sub-aperture of [13x13])
• Executed off-line
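A minimal sketch of this off-line search follows. It assumes that |M1-M2| means the displacement between the two cells with (dr, dc) and (-dr, -dc) treated as the same delta; under that reading the count for a 13x13 Sub-aperture comes out at 313, and the 6x6 and 8x8 cases give the 61 and 113 permutations quoted in the simulator results. The function name is hypothetical.

```python
def unique_permutations(n1, m1):
    """Return one representative (M1, M2) cell pair per distinct displacement."""
    deltas = set()        # D: displacements already seen
    permutations = []     # P: one cell pair per displacement
    cells = [(r, c) for r in range(n1) for c in range(m1)]
    for first in cells:
        for second in cells:
            dr, dc = second[0] - first[0], second[1] - first[1]
            # Treat (dr, dc) and (-dr, -dc) as the same displacement.
            if dr < 0 or (dr == 0 and dc < 0):
                dr, dc = -dr, -dc
            if (dr, dc) not in deltas:
                deltas.add((dr, dc))
                permutations.append((first, second))
    return permutations


print(len(unique_permutations(13, 13)))  # 313
```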


Our algorithm (simplified)

[Figure sequence: the Chip [NxM] shown next to Cov, the covariance matrix [M∙N, M∙N].]

• For a given Permutation [1,1], the multipliers M1 and M2 are marked on the chip.
• There’s a Block for this permutation.
• The legal BNWs for this Block are marked.
• For a given BNW, the product RES = M1 ∙ M2* is computed.

The multipliers numbering

1 4 7
2 5 8
3 6 9

The Zero Iteration

[Figure: the multipliers sit at window positions 1 and 5. RES is written to Rcell (1,5), which lies on Diag(5-1). The Main Diag is marked for reference.]

Shifting

[Figure: after a shift, the multipliers sit at window positions 2 and 6. RES is also written to Rcell (2,6), on the same Diag(5-1).]

[Figure: after a further shift, the multipliers sit at window positions 5 and 9. RES is written to one more Rcell on Diag(5-1).]

We came across a regularity in the offset of the Rcell coordinates when shifting:

• Leftwards: (+Sub-ap size, +Sub-ap size)
• Upwards: (+1,+1)

[Figure: the Cov matrix [M∙N, M∙N].]

[Figure: the Cov matrix with its Rcells colored. Each color represents a different permutation.]


Our Algorithm (simplified)

Summary

For a given permutation:

• RES is always written into the same group of Rcells
    • All on the same diagonal
    • Not necessarily all diagonal cells
• There is no overlapping between Rcells of different permutations. This is the basis for parallelism!
• Each shift writes to one unique Rcell. This theoretically enables parallelism of Rcell granularity (an instance per Rcell)


Permutations Execution Times

Different permutations have different workloads; therefore, changing the order in which the permutations are executed may improve core utilization.
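The effect of execution order can be illustrated with a toy model. The sketch below is not Plurality's scheduler: it assumes each permutation, in the given order, is handed to whichever core is least loaded, and it compares an arbitrary order with a longest-job-first order (a standard heuristic we swap in for illustration). The workloads are made-up numbers.

```python
import heapq


def makespan(workloads, cores):
    """Finish time when each job, in order, goes to the least-loaded core."""
    loads = [0] * cores
    heapq.heapify(loads)
    for work in workloads:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + work)
    return max(loads)


workloads = [1, 1, 1, 1, 4]   # hypothetical per-permutation costs
print(makespan(workloads, 2))                        # 6
print(makespan(sorted(workloads, reverse=True), 2))  # 4
```

In this toy case, running the heaviest permutation first keeps both cores busy until the end, while running it last leaves one core idle for most of the run.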


Parallelization Opportunities

• Different permutations work simultaneously
• Different chips can work simultaneously
• Finer grain parallelism of Rcell granularity (an instance per Rcell)


Platform Comparison: Plurality vs. Distributed Systems

• Our algorithm is optimal for shared memory platforms, since Cov is shared by all cores
• Working on distributed memory platforms will damage its efficiency as a result of communication overhead
• Plurality provides much higher performance-power utilization than Elta's grid computing


Look-up Tables

Concept: execute many data-independent calculations off-line and store the results in memory-efficient static look-up tables.

Advantages:

• Reduces calculation at run time by 50%
• Same tables used for all chips


Look-up tables: Permutations Table

Holds relevant permutation info:

• Multipliers’ indexes
• Block borders
• Zero iteration coordinates

Optimal table size: (4∙6 + 8∙2) bit ∙ 313 = 1.5 KBytes


Look-up tables: Offsets Table

Maps each shift to an Rcell.

Concept: uses the regularity in the offset of the Rcell coordinates when shifting upwards (+1,+1) or leftwards (+Sub-ap size, +Sub-ap size).

Optimal table size: 2 ∙ (13∙13∙8) bit ∙ 313 = 106 KBytes
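The two table sizes can be checked with a few lines of arithmetic. The interpretation of the factors is our assumption: six 4-bit fields plus two 8-bit fields per permutation for the Permutations Table, and two 8-bit entries per shift in a 13x13 grid for the Offsets Table. KBytes are taken as 1000 bytes.

```python
PERMUTATIONS = 313
SUB_AP = 13


def kbytes(bits):
    """Convert a size in bits to KBytes (1 KByte = 1000 bytes)."""
    return bits / 8 / 1000


permutations_table_bits = (4 * 6 + 8 * 2) * PERMUTATIONS
offsets_table_bits = 2 * (SUB_AP * SUB_AP * 8) * PERMUTATIONS

print(kbytes(permutations_table_bits))  # 1.565
print(kbytes(offsets_table_bits))       # 105.794
```

Both tables together stay far below the 2 MB of shared memory.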


Using Matrix Characteristics

Concept: using matrix characteristics to reduce calculations.

Important observation: Cov is a Hermitian matrix:

$$R = R^{\dagger}, \qquad R_{i,j} = R_{j,i}^{*}$$

Highlight: build Cov’s upper triangle only and, if necessary, generate the lower triangle inexpensively.

Advantages:

• Reduces calculations by half
• Requires less space for storing the Cov matrix
• Most eigendecomposition algorithms require the upper triangle only
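Generating the lower triangle from the upper one takes a single conjugate copy per cell. A minimal sketch, assuming Cov is stored as a list of rows of Python complex numbers (the storage layout and function name are ours):

```python
def fill_lower_triangle(cov):
    """Complete a Hermitian matrix in place from its upper triangle."""
    size = len(cov)
    for i in range(size):
        for j in range(i + 1, size):
            cov[j][i] = cov[i][j].conjugate()
    return cov


upper = [
    [2 + 0j, 1 + 2j, 3 - 1j],
    [0j,     5 + 0j, 4 + 4j],
    [0j,     0j,     7 + 0j],
]
full = fill_lower_triangle(upper)
print(full[1][0], full[2][1])  # (1-2j) (4-4j)
```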



Results (x86): Different Chip Sizes

[Figure: run time in seconds (0 to 0.12) versus chip size (6, 8, 16, 26, 32, 33, 36) for the naïve algorithm and ours. Not optimized for x86.]

Results (x86): Different Sub-aperture Sizes

[Figure: run time in seconds (0 to 0.08) versus Sub-aperture size (3, 4, 6, 8, 11, 13, 15) for the naïve algorithm and ours. Not optimized for x86.]


Elta’s MVM Algorithm: Preliminary Algorithm

Elta’s algorithm consists of the following stages:

1. 2D FFT
2. Fragmentation into 32x32 chips
3. 2D IFFT of each 32x32 chip
4. MVM applied to each 32x32 chip
5. Attachment of the 32x32 MVM chips

The original SAR image is segmented into chips (32x32 each). The chips overlap. MVM is applied to each chip. The chips are then attached to each other to form a full-size MVM image.

[Figure: input image and output image. The chips overlap.]


Task Map - MVM

[Figure: task map with the tasks INIT, FFT, Segmentation, 2D-IFFT, Covariance, Eigenvalues, Attachment and FINISH. The main effort is marked (1, 2).]


Plurality Platform

Plurality’s HyperCore Architecture Line (HAL) family of massively parallel manycore processors includes:

• 16 to 256 32-bit RISC cores
• 4-64 co-processors that include a Floating Point unit and a Multiplier/Divider. Each co-processor is shared by four RISC processors
• Shared memory architecture - 2 MB size. No level one cache
• Hardware-based scheduler that supports a task-oriented programming model
• A cycle-accurate simulator that runs on an x86 platform
• Integrated into the Eclipse IDE
• An emulator supporting Linux and Windows native environments


Plurality’s Platform: Emulator

The emulator mimics the behavior of HAL's hardware scheduler while still running on an x86 processor and working in Linux/Windows-based environments.

Advantages:

• No need to change to new hardware and a new programming model
• The emulator is written in ANSI C (almost all compilers can compile it)
• It comes with a prebuilt Makefile and a Visual Studio solution
• The emulator calls each task with all its required information: its right task instance, right timing, and right core ID

However, it is not cycle-accurate!

Simulator

A cycle-accurate hardware simulator that simulates the exact behavior of real HAL hardware. The simulator is integrated into the Eclipse IDE, but is very hard to debug with.

Advantages:

• Cycle-accurate simulation
• Uses GNU’s well-known binutils and GDB debugger
• Integrated into the Eclipse IDE
• Eases the transition to hardware

Implementations

• Compilation of the whole MVM algorithm using Plurality's emulator
• Compilation of our covariance matrix creation program using Plurality's simulator
• Using the Eclipse development environment to measure the cycle-accurate performance


Added Features: N of M pre-compiler

Motivation:

• Overcoming a feature that Plurality has not implemented
• Allowing manual scheduling in order to preserve processing time

For a given task, the pre-compiler limits the number of concurrent instances out of its defined quota. It is implemented in Perl.

The naïve algorithm


Task Map - MVM on Plurality

[Figure: task map with the tasks INIT, 2D-FFT, Segmentation, 2D-IFFT, Covariance, Eigenvalues, Attachment and FINISH, each labeled with the platform it runs on: X86, Emulator or Simulator.]


Results (complete MVM on emulator)

• 2D (I)FFT and eigendecomposition use Intel’s MKL as a black box on the x86
• Compiled to native x86 code, but not fully optimized

[Figure: placeholder chart; no results are shown.]


Results (building covariance on the simulator)

[Figure: Speedup for 61 Permutations. Cycles speedup (0 to 11) versus number of cores (2, 4, 8, 16, 32, 64, 128, 256). Chip size: 15x15. Sub-aperture size: 6x6.]


Results (building covariance on the simulator)

[Figure: Speedup for 113 Permutations. Cycles speedup (0 to 19) versus number of cores (2, 4, 8, 16, 32, 64, 128, 256). Chip size: 20x20. Sub-aperture size: 8x8.]


Future Projects

Completing MVM on Plurality:

• Implement a parallel algorithm for finding the eigenvalues and eigenvectors of a dense Hermitian matrix
• 2D (I)FFT on Plurality using Plurality’s 1-D library
• Task map optimizations


Solving the Eigenvalue Problem

• High complexity
• Many algorithms: QR, SVD, D&C, Jacobi, etc.
• Many OTS solutions: Intel, AMD, IBM, GNU, LAPACK, NAG, FEAST, etc.
• Shared memory ∩ Parallel ∩ Open source C = ∅


MRRR (Multiple Relatively Robust Representations)

Main features:

• Fast – O(n²) for an nxn matrix
• Parallel
• Memory efficient – O(n²) for an nxn matrix
• Complex data structures
• Implementation unavailable

Optimal for Plurality’s platform


Opportunities

• Our algorithm is unique: no parallel solution has been available to date. This solution may be applied to other signal processing problems
• Implementation of MRRR is possible, which would enable the complete MVM algorithm to work on Plurality's platform
• Using our solution on Plurality's platform may be very appealing, since Plurality provides higher performance-power utilization than grid computing and a faster run time


Practical Implications

Plurality’s low power platform may enable integrating SAR:

• On satellites
• On Unmanned Aerial Vehicles (UAVs)
• More implications …


Thank you


Back-up Slides


Elta’s MVM Algorithm

Assembling a SAR radar picture consists of 2 phases:

1. Conventional SAR
    [Figure: block diagram with the blocks: incoming radar echoes; data manipulation (RMC, Adaptive Pre-Sum, MOCOMP, Autofocus, Polar to Rectangular Interpolation); filtering and 2D IFFT; SAR image.]

2. MVM method in SAR
    [Figure: block diagram with the blocks: identify a target of interest upon a SAR image; obtain virtual SAR raw data corresponding to the selected target; 2D FFT; MVM process; MVM SAR image of the selected target.]


Our algorithm (Optimizations): Look-up tables

Concept: execute many calculations in advance and save them in memory-efficient static look-up tables.

Advantages:

• Greatly reduces calculation at run time
• Same table used for all chips


Our algorithm (Optimizations): Look-up tables, Permutation Table

Holds relevant info for each permutation:

      M1x  M1y  M2x  M2y  Bx  By  REFx  REFy
1     …    …    …    …    …   …   …     …
…     …    …    …    …    …   …   …     …
…     …    …    …    …    …   …   …     …
313   …    …    …    …    …   …   …     …


Our algorithm (Optimizations): Look-up tables, Permutation Table

• [M1x, M1y] are the coordinates of the first multiplier
• [M2x, M2y] are the coordinates of the second multiplier
• Bx is the number of rows of the permutation block
• By is the number of columns of the permutation block
• [REFx, REFy] are the coordinates of the pixel in the REF matrix (at the Zero iteration)

Optimal table size: (4*6+8*2) bit * 313 = 1.565 KBytes


Our algorithm (Optimizations): Look-up tables, Offsets Table

Maps each shift to a pixel.

Concept: we came across a regularity in the offset of the pixel coordinates when shifting upwards (+1,+1) or leftwards (+Sub-ap size, +Sub-ap size).

First, we create a general matrix containing all possible pixel offsets:

Matrix[i,j] - the offset when shifting i steps upwards and j steps leftwards
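The general offsets matrix follows directly from the two shift rules. The sketch below assumes each entry is a single number that is added to both pixel coordinates, which is how we read the (+1,+1) and (+Sub-ap size, +Sub-ap size) notation; the function name is hypothetical.

```python
def offsets_matrix(sub_ap):
    """matrix[i][j]: offset after i shifts upwards and j shifts leftwards."""
    # Each upward shift adds 1, each leftward shift adds the Sub-aperture size.
    return [[i + j * sub_ap for j in range(sub_ap)] for i in range(sub_ap)]


table = offsets_matrix(13)
print(table[0][:3])   # [0, 13, 26]
print(table[12][12])  # 168
```

For a 13x13 Sub-aperture the entries run from 0 to 168, so each fits in 8 bits. The slide draws the columns from right to left, with offset 0 in the top-right corner.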

Table’s construction

156 143 130 117 104  91  78  65  52  39  26  13   0
157 144 131 118 105  92  79  66  53  40  27  14   1
158 145 132 119 106  93  80  67  54  41  28  15   2
159 146 133 120 107  94  81  68  55  42  29  16   3
160 147 134 121 108  95  82  69  56  43  30  17   4
161 148 135 122 109  96  83  70  57  44  31  18   5
162 149 136 123 110  97  84  71  58  45  32  19   6
163 150 137 124 111  98  85  72  59  46  33  20   7
164 151 138 125 112  99  86  73  60  47  34  21   8
165 152 139 126 113 100  87  74  61  48  35  22   9
166 153 140 127 114 101  88  75  62  49  36  23  10
167 154 141 128 115 102  89  76  63  50  37  24  11
168 155 142 129 116 103  90  77  64  51  38  25  12



Then, we add each permutation’s Zero iteration coordinates (x,y) to the matrix to form that permutation’s offsets table:

Coords zero-iteration (x,y) + [the general offsets matrix, 0 to 168, shown above]


Our algorithm (Optimizations): Look-up tables, Offsets Table

Table’s construction

Coords zero-iteration (x,y) + [the general offsets matrix, 0 to 168], repeated for each of the 313 permutations (313X).

Optimal table size: (13*13*8) bit * 313 = 52.9 KBytes


Our algorithm (Optimizations): Using Matrix Characteristics

Concept: using matrix characteristics to reduce calculations.

Important observation: REF is a Hermitian matrix:

$$R = R^{\dagger}, \qquad R_{i,j} = R_{j,i}^{*}$$
