Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
Fisnik Kraja, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München
[email protected], [email protected], [email protected]
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
The main points will be:
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architectures
• Parallelization techniques and results on
 – shared-memory and
 – distributed-memory architectures
• Specific optimizations for distributed memory environments
• Summary and conclusions
February 24, 2012
Motivation
• Current and future space applications with onboard high-performance requirements
 – Observation satellites with increased
  • image resolutions
  • data sets
  • computational requirements
• Novel and interesting research based on many-cores for space (Dependable Multiprocessor and Maestro)
• The tendency to fly COTS products to space
• The performance/power ratio depends directly on the scalability of applications.
SAR 2DFMFI Application
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors produce the raw data.
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation to the raw data.

Problem sizes:
SCALE    mc      n       m       nx
10       1600    3290    3808    2474
20       3200    6460    7616    4926
30       4800    9630    11422   7380
60       9600    19140   22844   14738
SAR Sensor Processing Profiling
SSP Processing Step | Computation Type | Execution Time in % | Size & Layout
1. Filter the echoed signal | 1d_Fw_FFT | 1.1 | [mc x n]
2. Transposition | | 0.3 | [n x mc]
3. Signal compression along slow-time | CEXP, MAC | 1.1 | [n x mc]
4. Narrow-bandwidth polar format reconstruction along slow-time | 1d_Fw_FFT | 0.5 | [n x mc]
5. Zero-pad the spatial frequency domain's compressed signal | | 0.4 | [n x mc]
6. Transform back the zero-padded spatial spectrum | 1d_Bw_FFT | 5.2 | [n x m]
7. Slow-time decompression | CEXP, MAC | 2.3 | [n x m]
8. Digitally-spotlighted SAR signal spectrum | 1d_Fw_FFT | 5.2 | [n x m]
9. Generate the Doppler domain representation with the reference signal's complex conjugate | CEXP, MAC | 3.4 | [n x m]
10. Circumvent edge processing effects | 2D FFT_shift | 0.4 | [n x m]
11. 2D interpolation from a wedge to a rectangular area: input [n x m] -> output [nx x m] | MAC, sin, cos | 69 | [nx x m]
12. Transform the Doppler domain image into a spatial domain image: IFFT[nx x m] -> transpose -> FFT[m x nx] | 1d_Bw_FFT | 10 | [m x nx]
13. Transform into a viewable image | CABS | 1.1 | [m x nx]
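Step 11 dominates the profile because every output sample of the wedge-to-rectangle resampling costs a window of multiply-accumulates plus sin/cos evaluations. The paper does not give the exact kernel; as an illustration only, a truncated-sinc 1D resampler (the `half_width` parameter is hypothetical, not from the paper) shows where that per-sample cost comes from:

```python
import math

def sinc_interp(samples, x, half_width=8):
    """Truncated-sinc interpolation of uniformly spaced samples at
    fractional position x (in sample units). Each output point costs
    about 2*half_width multiply-accumulates and sin() calls, which is
    why an interpolation step like step 11 can dominate the runtime."""
    lo = max(0, int(math.floor(x)) - half_width + 1)
    hi = min(len(samples), int(math.floor(x)) + half_width + 1)
    acc = 0.0
    for k in range(lo, hi):
        t = x - k
        w = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        acc += samples[k] * w
    return acc
```

In the 2D case this window is applied per output sample of the [nx x m] grid, so the work scales with nx * m * half_width.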
The benchmarked ccNUMA machine (distributed shared memory)

The ccNUMA machine consists of:
• 2 Nehalem CPUs: Intel(R) Xeon(R) CPU X5670
 – 2.93 GHz
 – 12 MB L3 Smart Cache
 – 6 Cores/CPU
 – TDP = 95 Watt
 – 6.4 GigaTransfers/s QPI (25.6 GB/s)
 – DDR3-1066 memory interfacing
• 36 Gigabytes of RAM (18 GB per memory controller)
[Diagram: two 6-core CPUs connected by QPI, each attached to three 6 GB memory banks and an I/O controller]
Parallelization techniques on the ccNUMA machine
[Figure-only slide]
Results on the ccNUMA machine
[Chart: speedup vs. number of cores (1 to 12) for Scale=10 and Scale=60]
The benchmarked distributed memory architecture

Nehalem cluster @ HLRS.de:
Peak performance:        62 TFlops
Number of nodes:         700, dual-socket quad-core
Processor:               Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB cache
Memory/node:             12 GB
Disk:                    80 TB shared scratch (Lustre)
Node-node interconnect:  InfiniBand, Gigabit Ethernet
MPI Master-Worker Model
• In MPI: row-by-row send-and-receive
• In MPI2: send and receive chunks of rows
• No more than 4 processes/node (8 cores) because of memory overhead
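The difference between the row-by-row (MPI) and chunked (MPI2) variants is essentially the number of master-worker messages needed to distribute the same matrix. This is not the authors' code; a minimal Python model of the scatter pattern (the helper `scatter_plan` is invented for illustration) makes the trade-off concrete:

```python
def scatter_plan(total_rows, workers, chunk_rows):
    """Round-robin assignment of contiguous row chunks to workers.
    chunk_rows=1 models the row-by-row MPI version; larger chunks model
    the MPI2 version, which sends fewer but bigger messages.
    Returns (worker, first_row, row_count) triples, one per message."""
    plan, row, i = [], 0, 0
    while row < total_rows:
        n = min(chunk_rows, total_rows - row)
        plan.append((i % workers, row, n))
        row += n
        i += 1
    return plan
```

For a 3808-row image (SCALE=10), row-by-row distribution costs 3808 messages, while 476-row chunks cost only 8, at the price of larger per-message buffers.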
[Chart: speedup vs. number of nodes (1 to 16, 8 cores/node) for MPI, MPI2, MPI(2Proc/Node), MPI2(2Proc/Node), MPI(4Proc/Node), and MPI2(4Proc/Node)]
MPI Memory Overhead
• This overhead comes from the data replication and reduction needed in the Interpolation Loop.
• To improve the scalability without increasing memory consumption, a hybrid (MPI+OpenMP) version is implemented.
[Chart: memory consumption in GigaBytes (Worker_mem, Master_mem, Total_mem) vs. number of processes (1 to 8); total consumption grows steadily with the process count, up to 27.6 GB at 8 processes]
Hybrid (MPI+OpenMP) Versions
• Hyb1: 1 process (8 OpenMP threads) per node.
• Hyb2: OpenMP FFTW + HyperThreading.
• Hyb3: non-computationally intensive work is done only by the Master process.
• Hyb4: send and receive chunks of rows.
[Chart: speedup vs. number of nodes (1 to 16, 8 cores/node) for Hyb1, Hyb2, Hyb3, Hyb4, Hyb4(2Pr/8Thr), and Hyb4(4Pr/4Thr)]
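The hybrid versions layer two decompositions: MPI row blocks across nodes, then OpenMP threads within each block. A sketch of that two-level partition (helper names invented, not from the paper):

```python
def split(n, parts):
    # Nearly even contiguous split of n rows into `parts` block sizes.
    base, rem = divmod(n, parts)
    return [base + (1 if i < rem else 0) for i in range(parts)]

def hybrid_partition(total_rows, nodes, threads_per_node):
    """Two-level decomposition as in the hybrid versions: one MPI
    process per node owns a row block, and OpenMP-style threads
    subdivide that block. Returns per-node lists of per-thread sizes."""
    return [split(block, threads_per_node)
            for block in split(total_rows, nodes)]
```

Replicated buffers then scale with the number of MPI processes (nodes) rather than with the total core count, which is the memory argument for the hybrid scheme.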
Master-Worker Bottlenecks
• In some steps of SSP, the data is collected by the Master process and then distributed again to the Workers after the respective step.
• Such steps are:
 – the 2D FFT_SHIFT
 – transposition operations
 – the reduction operation after the Interpolation Loop
Inter-process Communication in the FFT_SHIFT
[Diagram: notional depiction of the fftshift operation: quadrants (A B / C D) become (D C / B A); row blocks A1 B1, A2 B2, C1 D1, C2 D2 on PIDs 0-3 end up as D1 C1, D2 C2, B1 A1, B2 A2]
• New communication pattern:
 – Nodes communicate in couples.
 – Nodes that have the data of the first and second quadrant send and receive data only to and from nodes with the third and fourth quadrant, respectively.
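The pairwise pattern can be checked against a reference fftshift. The following Python sketch (a model of the pattern, not the paper's MPI code; it assumes an even process count and evenly divisible rows) pairs each PID with the one holding the opposite half of the rows, exchanges whole row blocks, then swaps the left/right halves locally:

```python
def fftshift2d(a):
    # Reference 2-D fftshift on a nested list: (A B / C D) -> (D C / B A).
    m, n = len(a), len(a[0])
    return [[a[(i + m // 2) % m][(j + n // 2) % n] for j in range(n)]
            for i in range(m)]

def shift_partner(pid, nprocs):
    # Pairwise pattern from the slide: a PID in the top half of the rows
    # exchanges only with its counterpart in the bottom half
    # (PID 0 <-> 2 and PID 1 <-> 3 when nprocs = 4).
    return (pid + nprocs // 2) % nprocs

def distributed_fftshift(a, nprocs=4):
    m, half = len(a), len(a[0]) // 2
    rows = m // nprocs
    blocks = [a[p * rows:(p + 1) * rows] for p in range(nprocs)]
    # Step 1: each PID swaps its whole row block with its partner.
    received = [blocks[shift_partner(p, nprocs)] for p in range(nprocs)]
    # Step 2: each PID swaps the left and right halves of its rows.
    return [row[half:] + row[:half] for blk in received for row in blk]
```

Each PID thus talks to exactly one partner instead of funneling the whole array through the Master.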
Inter-Process Transposition
Data partitioning (tiling) and buffering:
[Diagram: PID p initially holds row block Dp, split into tiles Dp0 Dp1 Dp2 Dp3; after the transposition, PID p holds the tiles D0p, D1p, D2p, D3p]
The resulting communication pattern: each pair of processes exchanges exactly one tile (PID i sends tile Dij to PID j and receives tile Dji).
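The tile exchange can be modeled without MPI: each PID's tile Dij is delivered to PID j, which stores its local transpose. A sketch under the assumption that the matrix divides evenly among processes (helper name invented):

```python
def tiled_transpose(blocks):
    """blocks[p] is the row block owned by PID p (a list of rows).
    Tile Dij (PID i's j-th column tile) is sent to PID j, which keeps
    its local transpose; the result is the row-block distribution of
    the transposed matrix."""
    nprocs = len(blocks)
    rows = len(blocks[0])             # rows per block
    tw = len(blocks[0][0]) // nprocs  # tile width
    out = []
    for j in range(nprocs):           # receiving PID j
        block_j = []
        for c in range(tw):           # local row index after transpose
            row = []
            for i in range(nprocs):   # tile Dij arrives from PID i
                row.extend(blocks[i][r][j * tw + c] for r in range(rows))
            block_j.append(row)
        out.append(block_j)
    return out
```

Buffering tiles this way turns the transposition into an all-to-all of fixed-size messages rather than a gather at the Master.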
Reduction in the Interpolation Loop
• To avoid a collective reduction, a local reduction is applied between neighbor processes.
• This reduces only the overlapped regions.
• Reduction is scheduled in an ordered way:
 – the first process sends the data to the second process, which accumulates the new values with the old ones and sends the results back to the first process.
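As an illustration of the ordered pairwise scheme (a model, not the original code), assume each process owns a segment of partial sums whose last `overlap` entries cover the same global indices as the next process's first `overlap` entries:

```python
def neighbor_reduce(parts, overlap):
    """Ordered pairwise reduction of overlapped regions only: process p
    sends its trailing overlap to p+1, which accumulates the new values
    with its own and sends the result back, so both hold the sum."""
    for p in range(len(parts) - 1):
        a, b = parts[p], parts[p + 1]
        for k in range(overlap):
            s = a[len(a) - overlap + k] + b[k]
            a[len(a) - overlap + k] = s  # result returned to process p
            b[k] = s                     # process p+1 keeps the sum too
    return parts
```

Only the overlap travels between neighbors, instead of reducing the whole output array collectively.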
Pipelining the SSP Steps
• Each node processes a single image:
 – less inter-process communication
• It takes longer to reconstruct the first image,
 – but less time for the other images.
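A back-of-the-envelope latency model (equal-cost stages assumed; the numbers are illustrative, not measurements from the paper) shows why the first image is slow while later images arrive at the stage rate:

```python
def completion_times(stage_time, stages, images):
    # Classic pipeline fill behavior: image 0 finishes only after all
    # `stages` steps; each later image finishes one stage_time later.
    return [(stages + k) * stage_time for k in range(images)]
```

With 4 stages of 1 s each, three images complete at 4 s, 5 s, and 6 s: a full pipeline latency for the first image, then one stage per additional image.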
Speedup and Execution Time
[Chart: speedup vs. number of cores (1 to 128, 8 cores per node) for Hyb4, Hyb5, and Pipelined]

Elapsed time in seconds:
Cores       8       16      32      64      96      128
Hyb4        92.49   62.6    44.5    34.44   34.14   34.12
Hyb5        92.49   50.56   28.84   18.41   15.13   13.97
Pipelined   92.49   46.43   24.8    13.88   10.325  8.42
Summary and Conclusions
• In shared memory systems, the application can be efficiently parallelized, but the performance will always be limited by hardware resources.
• In distributed memory systems, hardware resources on non-local nodes become available at the cost of communication overhead.
• Performance improves with the number of resources,
 – but efficiency does not improve at the same rate.
• The duty of each designer is to find the right compromise between performance and other factors such as
 – power consumption
 – size
 – heat dissipation
Thank You!
Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
[email protected]