Performance Analysis of AMD Multi-core Processor and Graphical Processing Units
Mohammad Ashraf Bhuiyan, Melissa C. Smith, Vivek K. Pallipuram
June 2011
This work supported in part by NSF Grant No. CCF-0916387
Motivation
The recent trend of computing:
- Multi-core and many-core processors
- Many-core GPUs
Various types of accelerators available, differing in:
- Number of cores and threads
- Memory hierarchy
- Programming models
- Code optimization techniques
Parallel program development requires knowledge of:
- Acceleration techniques and optimizations
- Application characteristics
This calls for:
- Performance analysis of accelerators for applications
- Understanding the match between accelerators and applications
Outline
Experimental System
- AMD 8-core and 32-core CPUs
- AMD 1600-core GPU
Spiking Neural Network
- Biological models
- Network design
Preliminary Results
- Effect of problem size
- Effect of optimizations
- Effect of threads/cores
Future Work
Experimental Systems
Utilizing several leading architectures:
- AMD 8-core CPU (Opteron 2356)
- AMD 32-core CPU (Opteron 6134)
- AMD 1600-core GPU (Radeon 5870)
Case Study: Neuron Models & Network
Two Layer Network:
SNN Model        FLOPs per Neuron Update   Memory Access per Neuron (Bytes)   FLOP/Byte Ratio
Izhikevich       13                        20                                 0.65
Wilson           38                        44                                 0.86
Morris-Lecar     132                       28                                 4.71
Hodgkin-Huxley   246                       44                                 6.02
[Figure: two-layer network: an input image feeds Level 1 neurons, which feed Level 2 neurons]
Network (Problem Size) Scaling
Image Size   Level 1 Neurons   Level 2 Neurons   Total Neurons
96×96        9,216             48                9,264
192×192      36,864            48                36,912
240×240      57,600            48                57,648
…            …                 …                 …
2400×2400    5,760,000         48                5,760,048
3120×3120    9,734,400         48                9,734,448
Preliminary Results
Accelerator performance study varying:
- Problem size
- Optimization techniques
- Accelerator configuration
  - Number of threads for the CPU
  - Local work-group size for the GPU
Problem Size Variation
[Figure: speedup vs. problem size for the Izhikevich (left) and Wilson (right) models]
Speedup over a serial implementation on an Intel Core 2 Quad, 2.66 GHz, built with all compiler optimizations
Problem Size Variation Cont.
[Figure: speedup vs. problem size for the Morris-Lecar (left) and Hodgkin-Huxley (right) models]
Optimization Techniques Used
AMD Multi-core:
1. pth: POSIX threads
2. SSE: Streaming SIMD Extensions 3
3. SP: software prefetching

AMD Radeon GPU:
1. MT: multithreading
2. SP: software prefetching
3. LM: local memory
4. MW: memory write
5. MAT: unsafe math and native math
6. RCS: reducing conditional statements
7. VEC: vector calculation
Optimization: AMD 8 core
[Figure: speedup with successive optimizations for the Izhikevich (left) and Wilson (right) models]
pth: POSIX threads, SSE: Streaming SIMD Extensions 3, SP: software prefetching
Optimization: AMD 8 core Cont.
[Figure: speedup with successive optimizations for the Morris-Lecar (left) and Hodgkin-Huxley (right) models]
Optimization: AMD 32 core
[Figure: speedup with successive optimizations for the Izhikevich (left) and Wilson (right) models]
Optimization: AMD 32 core Cont.
[Figure: speedup with successive optimizations for the Morris-Lecar (left) and Hodgkin-Huxley (right) models]
Optimization: AMD 1600 core GPU
[Figure: speedup with successive optimizations for the Izhikevich (left) and Wilson (right) models]
MT: multithreading, SP: software prefetching, LM: local memory, MW: memory write, RCS: reducing conditional statements, MAT: unsafe and native math, VEC: vector calculation
Optimization: AMD 1600 core GPU Cont.
[Figure: speedup with successive optimizations for the Morris-Lecar (left) and Hodgkin-Huxley (right) models]
Thread Effect: AMD 8 core
Thread Effect: AMD 32 core
Thread Effect: AMD 1600 core GPU
Performance Observations
Problem Size Effect
- Generally, performance improves with problem size
- Izhikevich model on the AMD 8-core CPU: 9x speedup for 9,000 neurons; 16x for 9.7 million neurons
- HH model on the AMD 1600-core GPU: 11x speedup for 9,000 neurons; 603x for 9.7 million neurons

FLOP:Byte Ratio Effect
- A higher ratio gives better performance
- Izhikevich (0.65): 12x; HH (6.02): 603x
Performance Observations Cont.
Architecture-Specific Optimizations
- Generally, performance improves with optimizations
- Also depends on problem size and FLOP:byte ratio

Threading Effect
- Generally, performance improves with more threads
- Also depends on problem size and the overhead of intra-processor communication
Future Work
Extend the experiments:
- Heterogeneous architectures (multi-core + GPU)
- Multi-node accelerators (supercomputers)
- Accelerators from other vendors
- Other application kernels, such as bioinformatics, molecular dynamics, and optimization problems (e.g., simulated annealing)
Related Publications
Journal
- Mohammad Bhuiyan, Melissa C. Smith, Vivek K. Pallipuram, “Performance, Optimization and Fitness: Connecting Applications to Architectures,” Journal of Concurrency and Computation: Practice and Experience, Wiley, December 2010, DOI: 10.1002/cpe.1688
- Vivek K. Pallipuram, Mohammad Bhuiyan, and Melissa C. Smith, “A Comparative Study of GPU Programming Models and Architectures,” Journal of Supercomputing, Springer, May 2011, DOI: 10.1007/s11227-011-0631-3

Conference
- Mohammad Bhuiyan, Ananth Nallamuthu, Melissa C. Smith, and Vivek K. Pallipuram, “Optimization and Performance Study of Large-scale Biological Networks for Reconfigurable Computing,” in Proceedings of HPRCTA, SC’10, New Orleans, April 2010
- Mohammad Bhuiyan, Vivek K. Pallipuram, and Melissa C. Smith, “Acceleration of Spiking Neural Networks in Emerging Multi-core and GPU Architectures,” in IEEE Proceedings of HiCOMB, IPDPS, Atlanta, GA, April 2010
- Kenneth Rice, Mohammad Bhuiyan, Tarek M. Taha, Christopher N. Vutsinas, Melissa C. Smith, “FPGA Implementation of Izhikevich Spiking Neural Networks for Character Recognition,” in Proceedings of ReConFig’09, pp. 451–456, December 2009
Thank you
Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.