Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm
Fisnik Kraja, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München
[email protected], [email protected], [email protected]
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
The main points will be:
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architectures
• Parallelization techniques and results on
 – shared-memory and
 – distributed-memory architectures
• Specific optimizations for distributed memory environments
• Summary and conclusions
February 24, 2012
Motivation
• Current and future space applications with onboard high-performance requirements
 – Observation satellites with increased
  • image resolutions
  • data sets
  • computational requirements
• Novel and interesting research based on many-cores for space (Dependable Multiprocessor and Maestro)
• The tendency to fly COTS products to space
• The performance/power ratio depends directly on the scalability of applications.
SAR 2DFMFI Application
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors produce the raw data.
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation to the raw data.

Problem sizes:
SCALE    mc      n       m       nx
10       1600    3290    3808    2474
20       3200    6460    7616    4926
30       4800    9630    11422   7380
60       9600    19140   22844   14738
SAR Sensor Processing Profiling
SSP Processing Step | Computation Type | Execution Time in % | Size & Layout
1. Filter the echoed signal | 1d_Fw_FFT | 1.1 | [mc x n]
2. Transposition | | 0.3 | [n x mc]
3. Signal compression along slow-time | CEXP, MAC | 1.1 | [n x mc]
4. Narrow-bandwidth polar format reconstruction along slow-time | 1d_Fw_FFT | 0.5 | [n x mc]
5. Zero-pad the spatial frequency domain's compressed signal | | 0.4 | [n x mc]
6. Transform back the zero-padded spatial spectrum | 1d_Bw_FFT | 5.2 | [n x m]
7. Slow-time decompression | CEXP, MAC | 2.3 | [n x m]
8. Digitally-spotlighted SAR signal spectrum | 1d_Fw_FFT | 5.2 | [n x m]
9. Generate the Doppler domain representation with the reference signal's complex conjugate | CEXP, MAC | 3.4 | [n x m]
10. Circumvent edge processing effects | 2D FFT_shift | 0.4 | [n x m]
11. 2D interpolation from a wedge to a rectangular area: input [n x m] -> output [nx x m] | MAC, sin, cos | 69 | [nx x m]
12. Transform the Doppler domain image into a spatial domain image: IFFT[nx x m] -> transpose -> FFT[m x nx] | 1d_Bw_FFT | 10 | [m x nx]
13. Transform into a viewable image | CABS | 1.1 | [m x nx]
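Step 11 dominates the profile because every output sample of the wedge-to-rectangle resampling costs a window of multiply-accumulates plus sin/cos evaluations. The paper does not give the exact kernel; as an illustration only, a truncated-sinc 1D resampler (the `half_width` parameter is hypothetical, not from the paper) shows where that per-sample cost comes from:

```python
import math

def sinc_interp(samples, x, half_width=8):
    """Truncated-sinc interpolation of uniformly spaced samples at
    fractional position x (in sample units). Each output point costs
    about 2*half_width multiply-accumulates and sin() calls, which is
    why an interpolation step like step 11 can dominate the runtime."""
    lo = max(0, int(math.floor(x)) - half_width + 1)
    hi = min(len(samples), int(math.floor(x)) + half_width + 1)
    acc = 0.0
    for k in range(lo, hi):
        t = x - k
        w = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        acc += samples[k] * w
    return acc
```

In the 2D case this window is applied per output sample of the [nx x m] grid, so the work scales with nx * m * half_width.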
The benchmarked ccNUMA machine (distributed shared memory)

The ccNUMA machine consists of:
• 2 Nehalem CPUs: Intel(R) Xeon(R) CPU X5670
 – 2.93 GHz
 – 12 MB L3 Smart Cache
 – 6 Cores/CPU
 – TDP = 95 Watt
 – 6.4 GigaTransfers/s QPI (25.6 GB/s)
 – DDR3-1066 memory interfacing
• 36 Gigabytes of RAM (18 GB per memory controller)
[Diagram: two 6-core CPUs connected by QPI, each attached to three 6 GB memory banks and an I/O controller]
Parallelization techniques on the ccNUMA machine
[Figure-only slide]
Results on the ccNUMA machine
[Chart: speedup vs. number of cores (1 to 12) for Scale=10 and Scale=60]
The benchmarked distributed memory architecture

Nehalem cluster @ HLRS.de:
Peak performance:        62 TFlops
Number of nodes:         700, dual-socket quad-core
Processor:               Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB cache
Memory/node:             12 GB
Disk:                    80 TB shared scratch (Lustre)
Node-node interconnect:  InfiniBand, Gigabit Ethernet
MPI Master-Worker Model
• In MPI: row-by-row send-and-receive
• In MPI2: send and receive chunks of rows
• No more than 4 processes/node (8 cores) because of memory overhead
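The difference between the row-by-row (MPI) and chunked (MPI2) variants is essentially the number of master-worker messages needed to distribute the same matrix. This is not the authors' code; a minimal Python model of the scatter pattern (the helper `scatter_plan` is invented for illustration) makes the trade-off concrete:

```python
def scatter_plan(total_rows, workers, chunk_rows):
    """Round-robin assignment of contiguous row chunks to workers.
    chunk_rows=1 models the row-by-row MPI version; larger chunks model
    the MPI2 version, which sends fewer but bigger messages.
    Returns (worker, first_row, row_count) triples, one per message."""
    plan, row, i = [], 0, 0
    while row < total_rows:
        n = min(chunk_rows, total_rows - row)
        plan.append((i % workers, row, n))
        row += n
        i += 1
    return plan
```

For a 3808-row image (SCALE=10), row-by-row distribution costs 3808 messages, while 476-row chunks cost only 8, at the price of larger per-message buffers.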
[Chart: speedup vs. number of nodes (1 to 16, 8 cores/node) for MPI, MPI2, MPI(2Proc/Node), MPI2(2Proc/Node), MPI(4Proc/Node), and MPI2(4Proc/Node)]
MPI Memory Overhead
• This overhead comes from the data replication and reduction needed in the Interpolation Loop.
• To improve the scalability without increasing memory consumption, a hybrid (MPI+OpenMP) version is implemented.
[Chart: memory consumption in GigaBytes (Worker_mem, Master_mem, Total_mem) vs. number of processes (1 to 8); total consumption grows steadily with the process count, up to 27.6 GB at 8 processes]
Hybrid (MPI+OpenMP) Versions
• Hyb1: 1 process (8 OpenMP threads) per node.
• Hyb2: OpenMP FFTW + HyperThreading.
• Hyb3: non-computationally intensive work is done only by the Master process.
• Hyb4: send and receive chunks of rows.
[Chart: speedup vs. number of nodes (1 to 16, 8 cores/node) for Hyb1, Hyb2, Hyb3, Hyb4, Hyb4(2Pr/8Thr), and Hyb4(4Pr/4Thr)]
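The hybrid versions layer two decompositions: MPI row blocks across nodes, then OpenMP threads within each block. A sketch of that two-level partition (helper names invented, not from the paper):

```python
def split(n, parts):
    # Nearly even contiguous split of n rows into `parts` block sizes.
    base, rem = divmod(n, parts)
    return [base + (1 if i < rem else 0) for i in range(parts)]

def hybrid_partition(total_rows, nodes, threads_per_node):
    """Two-level decomposition as in the hybrid versions: one MPI
    process per node owns a row block, and OpenMP-style threads
    subdivide that block. Returns per-node lists of per-thread sizes."""
    return [split(block, threads_per_node)
            for block in split(total_rows, nodes)]
```

Replicated buffers then scale with the number of MPI processes (nodes) rather than with the total core count, which is the memory argument for the hybrid scheme.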
Master-Worker Bottlenecks
• In some steps of SSP, the data is collected by the Master process and then distributed again to the Workers after the respective step.
• Such steps are:
 – the 2D FFT_SHIFT
 – transposition operations
 – the reduction operation after the Interpolation Loop
Inter-process Communication in the FFT_SHIFT
[Diagram: notional depiction of the fftshift operation: quadrants (A B / C D) become (D C / B A); row blocks A1 B1, A2 B2, C1 D1, C2 D2 on PIDs 0-3 end up as D1 C1, D2 C2, B1 A1, B2 A2]
• New communication pattern:
 – Nodes communicate in couples.
 – Nodes that have the data of the first and second quadrant send and receive data only to and from nodes with the third and fourth quadrant, respectively.
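The pairwise pattern can be checked against a reference fftshift. The following Python sketch (a model of the pattern, not the paper's MPI code; it assumes an even process count and evenly divisible rows) pairs each PID with the one holding the opposite half of the rows, exchanges whole row blocks, then swaps the left/right halves locally:

```python
def fftshift2d(a):
    # Reference 2-D fftshift on a nested list: (A B / C D) -> (D C / B A).
    m, n = len(a), len(a[0])
    return [[a[(i + m // 2) % m][(j + n // 2) % n] for j in range(n)]
            for i in range(m)]

def shift_partner(pid, nprocs):
    # Pairwise pattern from the slide: a PID in the top half of the rows
    # exchanges only with its counterpart in the bottom half
    # (PID 0 <-> 2 and PID 1 <-> 3 when nprocs = 4).
    return (pid + nprocs // 2) % nprocs

def distributed_fftshift(a, nprocs=4):
    m, half = len(a), len(a[0]) // 2
    rows = m // nprocs
    blocks = [a[p * rows:(p + 1) * rows] for p in range(nprocs)]
    # Step 1: each PID swaps its whole row block with its partner.
    received = [blocks[shift_partner(p, nprocs)] for p in range(nprocs)]
    # Step 2: each PID swaps the left and right halves of its rows.
    return [row[half:] + row[:half] for blk in received for row in blk]
```

Each PID thus talks to exactly one partner instead of funneling the whole array through the Master.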
Inter-Process Transposition
Data partitioning (tiling) and buffering:
[Diagram: PID p initially holds row block Dp, split into tiles Dp0 Dp1 Dp2 Dp3; after the transposition, PID p holds the tiles D0p, D1p, D2p, D3p]
The resulting communication pattern: each pair of processes exchanges exactly one tile (PID i sends tile Dij to PID j and receives tile Dji).
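The tile exchange can be modeled without MPI: each PID's tile Dij is delivered to PID j, which stores its local transpose. A sketch under the assumption that the matrix divides evenly among processes (helper name invented):

```python
def tiled_transpose(blocks):
    """blocks[p] is the row block owned by PID p (a list of rows).
    Tile Dij (PID i's j-th column tile) is sent to PID j, which keeps
    its local transpose; the result is the row-block distribution of
    the transposed matrix."""
    nprocs = len(blocks)
    rows = len(blocks[0])             # rows per block
    tw = len(blocks[0][0]) // nprocs  # tile width
    out = []
    for j in range(nprocs):           # receiving PID j
        block_j = []
        for c in range(tw):           # local row index after transpose
            row = []
            for i in range(nprocs):   # tile Dij arrives from PID i
                row.extend(blocks[i][r][j * tw + c] for r in range(rows))
            block_j.append(row)
        out.append(block_j)
    return out
```

Buffering tiles this way turns the transposition into an all-to-all of fixed-size messages rather than a gather at the Master.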
Reduction in the Interpolation Loop
• To avoid a collective reduction, a local reduction is applied between neighbor processes.
• This reduces only the overlapped regions.
• Reduction is scheduled in an ordered way:
 – the first process sends the data to the second process, which accumulates the new values with the old ones and sends the results back to the first process.
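As an illustration of the ordered pairwise scheme (a model, not the original code), assume each process owns a segment of partial sums whose last `overlap` entries cover the same global indices as the next process's first `overlap` entries:

```python
def neighbor_reduce(parts, overlap):
    """Ordered pairwise reduction of overlapped regions only: process p
    sends its trailing overlap to p+1, which accumulates the new values
    with its own and sends the result back, so both hold the sum."""
    for p in range(len(parts) - 1):
        a, b = parts[p], parts[p + 1]
        for k in range(overlap):
            s = a[len(a) - overlap + k] + b[k]
            a[len(a) - overlap + k] = s  # result returned to process p
            b[k] = s                     # process p+1 keeps the sum too
    return parts
```

Only the overlap travels between neighbors, instead of reducing the whole output array collectively.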
Pipelining the SSP Steps
• Each node processes a single image:
 – less inter-process communication
• It takes longer to reconstruct the first image,
 – but less time for the other images.
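A back-of-the-envelope latency model (equal-cost stages assumed; the numbers are illustrative, not measurements from the paper) shows why the first image is slow while later images arrive at the stage rate:

```python
def completion_times(stage_time, stages, images):
    # Classic pipeline fill behavior: image 0 finishes only after all
    # `stages` steps; each later image finishes one stage_time later.
    return [(stages + k) * stage_time for k in range(images)]
```

With 4 stages of 1 s each, three images complete at 4 s, 5 s, and 6 s: a full pipeline latency for the first image, then one stage per additional image.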
Speedup and Execution Time
[Chart: speedup vs. number of cores (1 to 128, 8 cores per node) for Hyb4, Hyb5, and Pipelined]

Elapsed time in seconds:
Cores       8       16      32      64      96      128
Hyb4        92.49   62.6    44.5    34.44   34.14   34.12
Hyb5        92.49   50.56   28.84   18.41   15.13   13.97
Pipelined   92.49   46.43   24.8    13.88   10.325  8.42
Summary and Conclusions
• In shared memory systems, the application can be efficiently parallelized, but the performance will always be limited by hardware resources.
• In distributed memory systems, hardware resources on non-local nodes become available at the cost of communication overhead.
• Performance improves with the number of resources,
 – but efficiency does not improve at the same rate.
• The duty of each designer is to find the right compromise between performance and other factors such as
 – power consumption
 – size
 – heat dissipation
Thank You!
Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
[email protected]