NOT APPROVED FOR PUBLIC RELEASE
Michael Wolf, MIT Lincoln Laboratory
11 September 2012
LLMORE: Mapping and Optimization Framework
This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012).
Michael Wolf - 2 MMW 09/11/2012
Overview of Mapping and Optimization Challenges
Challenges: • Realistic simulations of applications • Support for diverse languages/numerical libraries • Support for diverse devices and architectures
Architectures: Supercomputers, Data warehouses, Small hybrid systems, Clusters, FPGA, Chips
Applications: SAR, Secure Communication, Cyber, Database operations
Mapping and Optimization
• LLMORE is MIT Lincoln Laboratory’s Mapping and Optimization Runtime Environment
• Parallel framework/environment for – Optimizing data to processor mapping for parallel applications – Simulating and optimizing new (and existing) architectures – Generating performance data (runtime, power, etc.) – Code generation and execution for target architectures
LLMORE: multiple language support, sparse/dense operations, architecture model, executor, parallel, robust software
pMapper: Matlab, dense operations, simple architecture model, executor, serial
SMaRT/MORE: Matlab, sparse operations (limited data size), architecture model, no executor, serial
LLMORE SMaRT/MORE pMapper
2012 2004 2006 2011
Three generations of mapping and optimization
pMapper patent issued: “Method and apparatus performing automatic mapping for multiprocessor system”
Key Features of LLMORE
• Support for multiple languages and numerical libraries • Ability to solve large problems
– Written in C++ and runs in parallel – Fit larger problems into memory, reduces time to solution
• Support for dense and sparse linear algebra operations • Production quality research software
– Easy to use interfaces – Designed to support future algorithms/packages/languages
• LLMORE is NOT an autoparallelizing compiler – Will not generate optimized parallel code for any set of (serial or
parallel) instructions – Data layouts optimized in context of maps
• Motivation/Overview • Design and Usage
– Usage 1: Map Optimization – Usage 2: Performance Evaluation
• LLMORE and POEM • Preliminary POEM Results: 2D FFT • Next Steps and Summary
Outline
LLMORE Framework Overview
LLMORE input: Architectural Model, LLMORE Parameters, User Code
LLMORE output (one or more):
1. Set of optimized maps
2. Performance data
3. Optimized architecture(s)
4. Generated code
5. Results from run on target architecture
Key/Novel Features
• Application code to simulator
• Data mapping optimization for user code
• Support for multiple languages and libraries
• Ability to solve large problems
• Production quality software
LLMORE Design Overview
Core Functionality
Language Interface
Runtime Engine
Code Generator
Machine Interface
LLMORE Input
LLMORE Output
LLMORE Output
Analyzer and Optimizer
Map Converter Parser
Parse Manager
Map Manager
Map Builder
AST Builder
AST = abstract syntax tree
LLMORE Usage 1: Map Optimization
LLMORE optimizes data mapping to improve parallel performance of key computational kernels
• LLMORE produces set of optimized maps for parallel variables specified in user code
• Matrix-vector product example – LLMORE computes map for dense matrix – LLMORE computes map for two vectors
[Figure: y = Ax before and after LLMORE map optimization]
Map Optimization: Input
y=Ax
= A x y
Input from application: user code for dense matrix-vector product
Dense Matrix-Vector Product
LLMORE
Optimized Maps
Analyzer and Optimizer
Mapper
Mapped AST
Parser User code
User Code:
AST
Map Optimization: AST Representation
Parser converts user code into abstract syntax tree (AST), which is input language/numerical library neutral
SB
MV
PVar: A PVar: x PVar: y
AST
y=Ax
LLMORE
Optimized Maps
Analyzer and Optimizer
Mapper
Mapped AST
Parser
AST User code
Dense Matrix-Vector Product
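The AST on this slide can be pictured with a small tree structure. The sketch below is illustrative only: the `Node` class and helper names are hypothetical and are not LLMORE's C++ interfaces, and the slide does not expand the SB and MV labels (MV presumably denotes the matrix-vector product node).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node; kind is "SB", "MV", or "PVar" as on the slide."""
    kind: str
    name: str = ""  # variable name, used only by PVar nodes
    children: List["Node"] = field(default_factory=list)


def build_matvec_ast() -> Node:
    """Build the tree shown for y = Ax: SB -> MV -> PVar A, PVar x, PVar y."""
    mv = Node("MV", children=[Node("PVar", "A"), Node("PVar", "x"), Node("PVar", "y")])
    return Node("SB", children=[mv])


def parallel_variables(node: Node) -> List[str]:
    """Collect the names of all PVar nodes in depth-first order."""
    found = [node.name] if node.kind == "PVar" else []
    for child in node.children:
        found.extend(parallel_variables(child))
    return found


ast = build_matvec_ast()
pvars = parallel_variables(ast)
print(pvars)
```

Because the tree only records operations and parallel variables, the same structure could be produced from any input language, which is the neutrality the slide describes.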
Map Optimization: Mapping
LLMORE computes map for each parallel variable in AST
SB
MV
PVar: A PVar: x PVar: y
Vector Map P0: {0,1} P1: {2,3}
Vector Map P0: {0,1} P1: {2,3}
Matrix Map P0: {0,1} P1: {2,3}
LLMORE
Optimized Maps
Analyzer and Optimizer
Mapper
Mapped AST
Parser
AST User code
Dense Matrix-Vector Product
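The maps on this slide assign index sets to processors (P0: {0,1}, P1: {2,3}). A minimal sketch of such a map, assuming a simple contiguous block distribution; `block_map` and `owner` are hypothetical helper names, and LLMORE's optimizer may choose other layouts.

```python
from typing import Dict, List


def block_map(n: int, num_procs: int) -> Dict[int, List[int]]:
    """Assign indices 0..n-1 to processors in contiguous blocks."""
    base, extra = divmod(n, num_procs)
    mapping: Dict[int, List[int]] = {}
    start = 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # spread any remainder over the first processors
        mapping[p] = list(range(start, start + size))
        start += size
    return mapping


def owner(mapping: Dict[int, List[int]], index: int) -> int:
    """Return the processor that owns a given index."""
    for proc, indices in mapping.items():
        if index in indices:
            return proc
    raise KeyError(index)


vector_map = block_map(4, 2)  # same layout as the slide's Vector Map
print(vector_map)
```

The same structure can describe the matrix map if the indices are read as row numbers.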
Map Optimization: Output
• LLMORE output: optimized maps for parallel variables
• New maps used to redistribute vector and matrix data
• Optimized matrix-vector product calculated with new data distributions
[Figure: data layouts of A, y, and x]
LLMORE
Optimized Maps
Analyzer and Optimizer
Mapper
Mapped AST
Parser
AST User code
Optimized y=Ax
= A y x
Dense Matrix-Vector Product
LLMORE Usage 2: Performance Evaluation
LLMORE simulates user code on specified architecture to produce performance evaluation metrics
[Figure: user code (y = Ax) and an architecture model (P1, P2, M1, M2) are input to LLMORE, which outputs performance data: a plot of GFLOPs/W for the Nehalem model (dense) against memory static power (W) and processor static power (W)]
Performance Evaluation: Input
y=Ax
= A x y
LLMORE
Performance Data
Analyzer and Optimizer
Mapper
Mapped AST
User Code, Arch. Model
Parser AST
Performance Evaluator
MI Code Generator
MI code
Simulator
MI = Machine Independent M2 P1 P2 M1
Dense Matrix-Vector Product
Input from application: user code for dense matrix-vector product, architecture model
User Code:
Architecture Model:
Performance Evaluation: AST Representation
y=Ax
LLMORE
Performance Data
Analyzer and Optimizer
Mapper
Mapped AST
User Code, Arch. Model
Parser AST
Performance Evaluator
MI Code Generator
MI code
Simulator
MI = Machine Independent
SB
MV
PVar: A PVar: x PVar: y
AST
Dense Matrix-Vector Product
Parser converts user code into abstract syntax tree (AST), which is input language/numerical library neutral
Performance Evaluation: Mapping
LLMORE computes map for each parallel variable in AST
SB
MV
PVar: A PVar: x PVar: y
Vector Map P0: {0,1} P1: {2,3}
Vector Map P0: {0,1} P1: {2,3}
Matrix Map P0: {0,1} P1: {2,3}
LLMORE
Performance Data
Analyzer and Optimizer
Mapper
Mapped AST
User Code, Arch. Model
Parser AST
Performance Evaluator
MI Code Generator
MI code
Simulator
MI = Machine Independent
Dense Matrix-Vector Product
Performance Evaluation: Machine Independent Code
Mapped AST and architecture model used to generate machine independent code
LLMORE
Output
Analyzer and Optimizer
Mapper
Mapped AST
Input Parser AST
Performance Evaluator
MI Code Generator
MI code
Simulator
Mapped AST
SB
MV
PVar: A, Matrix Map P0: {0,1} P1: {2,3}
PVar: x, Vector Map P0: {0,1} P1: {2,3}
PVar: y, Vector Map P0: {0,1} P1: {2,3}
MI Code
P0: read A0,0; read x0; read y0; y0 = A0,0 x0; read A0,1; y0 += A0,1 x1; write y0
P1: read A1,1; read x1; read y1; y1 = A1,1 x1; read A1,0; y1 += A1,0 x0; write y1
Communication: send x0, send x1
MI = Machine Independent
M2 P1 P2 M1
Architecture Model
MI Code Generator
Machine Independent Code
Machine independent code for matrix-vector product (# rows = 2, # processors = 2)
P0: read A0,0; read x0; read y0; y0 = A0,0 x0; read A0,1; y0 += A0,1 x1; write y0
P1: read A1,1; read x1; read y1; y1 = A1,1 x1; read A1,0; y1 += A1,0 x0; write y1
Communication: send x0, send x1
Operation types: computation, memory access, communication
Flow graph of operations
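The operation lists above can be derived mechanically from a row map. A minimal sketch, assuming each processor owns whole rows of A together with the matching entries of x and y; `generate_mi_code` and the operation tags are hypothetical, and the position of each send within a processor's list is a choice made here, not something the slide specifies.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operation type, text); types: "mem", "comp", "comm"


def generate_mi_code(n: int, row_owner: List[int]) -> Dict[int, List[Op]]:
    """Emit per-processor operations for dense y = Ax under a row map."""
    procs = sorted(set(row_owner))
    code: Dict[int, List[Op]] = {p: [] for p in procs}
    for p in procs:
        rows = [i for i in range(n) if row_owner[i] == p]
        # Owned x entries are read locally and sent to every other processor,
        # since a dense row needs every entry of x.
        for j in rows:
            code[p].append(("mem", f"read x{j}"))
            for q in procs:
                if q != p:
                    code[p].append(("comm", f"send x{j} -> P{q}"))
        for i in rows:
            code[p].append(("mem", f"read y{i}"))
            # Use locally owned columns first, then columns whose x entry is remote.
            cols = rows + [j for j in range(n) if row_owner[j] != p]
            for position, j in enumerate(cols):
                code[p].append(("mem", f"read A{i},{j}"))
                assign = "=" if position == 0 else "+="
                code[p].append(("comp", f"y{i} {assign} A{i},{j}x{j}"))
            code[p].append(("mem", f"write y{i}"))
    return code


mi_code = generate_mi_code(2, [0, 1])  # 2 rows, 2 processors, as on the slide
for proc, ops in mi_code.items():
    print(f"P{proc}:", "; ".join(text for _, text in ops))
```

For the 2 x 2 case this yields the same set of reads, updates, writes, and sends as the slide, which makes it a convenient input for a simple simulator.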
Performance Evaluation: Output
Simulation of user code on target architecture
LLMORE
Performance Data
Analyzer and Optimizer
Mapper
Mapped AST
Input Parser AST
Performance Evaluator
MI Code Generator
MI code
Simulator
[Figure: the simulator takes the MI code for y = Ax (reads, updates, writes, and sends on P0 and P1) and produces performance data: a plot of GFLOPs/W for the Nehalem model (dense) against memory static power (W) and processor static power (W)]
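One way to picture the simulation step is a cost model that charges each MI operation a fixed cost by type. The sketch below is not LLMORE's simulator: the per-operation costs are made-up placeholders, the assignment of each send to the processor holding that x entry is an assumption, and dependencies and network contention are ignored.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operation type, text)

# Hypothetical costs in cycles per operation type; not taken from any real architecture model.
COST = {"mem": 4.0, "comp": 1.0, "comm": 10.0}


def simulate(code: Dict[int, List[Op]], cost: Dict[str, float] = COST) -> Dict[str, object]:
    """Sum operation costs per processor; the slowest processor sets the runtime."""
    per_proc = {p: sum(cost[kind] for kind, _ in ops) for p, ops in code.items()}
    return {"per_processor": per_proc, "makespan": max(per_proc.values())}


# MI code for y = Ax with 2 rows on 2 processors, transcribed from the slide.
mi_code = {
    0: [("mem", "read A0,0"), ("mem", "read x0"), ("mem", "read y0"),
        ("comp", "y0 = A0,0x0"), ("comm", "send x0"), ("mem", "read A0,1"),
        ("comp", "y0 += A0,1x1"), ("mem", "write y0")],
    1: [("mem", "read A1,1"), ("mem", "read x1"), ("mem", "read y1"),
        ("comp", "y1 = A1,1x1"), ("comm", "send x1"), ("mem", "read A1,0"),
        ("comp", "y1 += A1,0x0"), ("mem", "write y1")],
}

result = simulate(mi_code)
print(result)
```

Changing the cost table is the analogue of swapping the architecture model: the MI code stays the same while the predicted runtime changes.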
POEM
Architecture Design Application Analysis
POEM will bridge the gap between innovations in chip-scale photonics and fielded military-critical systems. It will offer a complete architecture design
and analysis for numerous real-world military-critical applications.
Photonic Innovation
Fielded Systems
Photonically Optimized Embedded Microprocessor (POEM)
Handheld DNA Sequencing Device UAV Surveillance
Ring Resonator Technology
Photonics Layered on Silicon
Photonic Switching Technology
Handheld PDA Field Devices
Proposed Photonic Architectures
Processing Chains for Military-Critical Applications
Simulation, Mapping, and Optimization
Framework
Architecture Parameter Characterization
LLMORE’s Role in POEM Program
Architecture Design
LLMORE used to study chip-scale photonics and its impact on applications
Photonic Innovation
Fielded Systems
Handheld DNA Sequencing Device UAV Surveillance
Ring Resonator Technology
Photonics Layered on Silicon
Photonic Switching Technology
Handheld PDA Field Devices
Proposed Photonic Architectures
Architecture Parameter Characterization
Application Analysis
Photonically Optimized Embedded Microprocessor (POEM)
Processing Chains for Military-Critical Applications
Simulation, Mapping, and Optimization
Framework
LLMORE and POEM: Applications
• LLMORE provides framework for analyzing POEM applications
– LLMORE supports many key numerical kernels (FFT, sparse matrix-vector product, vector updates, etc.)
– Applications supported through composition of these kernels
– Easy to extend to analyze new applications
• Initially analyzing synthetic aperture radar (SAR) application
LLMORE enables the analysis of many applications on many different architectures (existing and proposed)
SAR Secure Communication Cyber Database operations
LLMORE
LLMORE and POEM: Architectures
• LLMORE supports simulation of applications on POEM architectures (e.g., electronic mesh and photonic bus)
• Framework for simulating user code
– LLMORE simulator for understanding big picture trends
– Interface to third party simulators (e.g., PhoenixSim) for higher fidelity performance data
LLMORE enables the analysis of many applications on many different architectures (existing and proposed)
LLMORE
Electronic Mesh
Photonic Bus
Application
LLMORE and POEM: Optimization of Maps
Optimization of maps
– Good application data to processor mapping crucial to achieving peak parallel performance on target machines
– Not difficult for SAR applications (simple maps sufficient)
– Challenging for applications with irregular communication (DNA sequence analysis and sparse matrix computations)
LLMORE provides automatic map optimization
[Figure: y = Ax before and after LLMORE map optimization]
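Why the map matters for irregular problems can be seen by counting the remote vector entries a sparse matrix-vector product needs under two different maps. This is a toy sketch under stated assumptions: rows of A and entries of x and y share one owner array, `comm_volume` is a hypothetical helper, and the sparsity pattern is invented for illustration.

```python
from typing import List, Set, Tuple


def comm_volume(nonzeros: List[Tuple[int, int]], owner: List[int]) -> int:
    """Count distinct (processor, x index) pairs where x[j] must come from another processor."""
    needed: Set[Tuple[int, int]] = set()
    for i, j in nonzeros:
        if owner[i] != owner[j]:  # row i is local, but x[j] lives elsewhere
            needed.add((owner[i], j))
    return len(needed)


# Invented 4 x 4 pattern: even rows touch even columns, odd rows touch odd columns.
nonzeros = [(0, 0), (0, 2), (1, 1), (1, 3), (2, 0), (2, 2), (3, 1), (3, 3)]

block_owner = [0, 0, 1, 1]   # simple block map: P0 owns {0,1}, P1 owns {2,3}
cyclic_owner = [0, 1, 0, 1]  # alternative map: P0 owns {0,2}, P1 owns {1,3}

print(comm_volume(nonzeros, block_owner), comm_volume(nonzeros, cyclic_owner))
```

For this pattern the block map forces four remote fetches while the alternative map needs none, which is the kind of difference an automatic map optimizer searches for.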
LLMORE and Performance Evaluation
LLMORE
LLMORE output
Analyzer and Optimizer
LLMORE input
Mapper
Mapped AST
Performance Evaluator
MI Code Generator
MI code
Simulator Performance data: time (s), GFLOPS
Architectural Model
User Code
LLMORE used to produce performance evaluation data for 2D FFT, an important kernel in SAR processing chain
Parser
AST
POEM Architecture Models
LLMORE framework allows direct comparison of electronic mesh and photonic bus architectures
[Figure: Electronic Mesh and Photonic Bus architecture models, each with 16 processors (P1 to P16) with local memory (LM) and four shared memories (M1 to M4)]
• Two architectures modeled (electronic mesh and photonic bus)
• Processors same, shared memory same
• Network parameters (latency, bandwidth) set to allow for apples-to-apples comparison between networks
P=processor/core M=Shared memory LM=local memory
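An architecture model of this kind can be thought of as a small record of processor, memory, and network parameters. The sketch below is an assumption-laden illustration: the `ArchModel` fields, the latency-plus-bandwidth transfer formula, and all numeric values are hypothetical and do not come from LLMORE or the POEM models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchModel:
    """Hypothetical architecture description with a simple network cost model."""
    name: str
    num_processors: int
    num_shared_memories: int
    latency_s: float              # network latency per message, seconds
    bandwidth_bytes_per_s: float  # network bandwidth, bytes per second

    def transfer_time(self, num_bytes: int) -> float:
        """Estimated time to move num_bytes: latency plus size over bandwidth."""
        return self.latency_s + num_bytes / self.bandwidth_bytes_per_s


# Placeholder values, set equal so that only the network organization differs.
mesh = ArchModel("electronic mesh", 16, 4, latency_s=1e-7, bandwidth_bytes_per_s=1e10)
bus = ArchModel("photonic bus", 16, 4, latency_s=1e-7, bandwidth_bytes_per_s=1e10)

print(mesh.transfer_time(4096), bus.transfer_time(4096))
```

Equal latency and bandwidth values mirror the slide's intent: any performance difference then reflects how each network moves data, not its raw parameters.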
Preliminary POEM Simulations
2D FFT: FFT, CT, FFT (architecture simulated: EM)
– Read, FFT, Write: 1D FFT for each row of dense matrix
– Read, Transpose, Write: corner turn (CT), for data locality of FFT
– Read, FFT, Write: 1D FFT for each column of dense matrix
2D FFT: FFT, SCA, FFT (architecture simulated: PB)
– Read, FFT, SCA Write
– Read, FFT, Write
EM = Electronic mesh, PB = Photonic bus, SCA = Synchronous coalesced access
• Initially, simulated 2D FFT kernel • Important kernel for SAR applications
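The FFT, CT, FFT decomposition can be checked on a small matrix. In the sketch below a naive O(n^2) DFT stands in for the FFT so that only the standard library is needed; the function names are hypothetical, and the final corner turn is added only to restore the original orientation for comparison (the pipeline on the slide has a single CT).

```python
import cmath
from typing import List

Matrix = List[List[complex]]


def dft(seq: List[complex]) -> List[complex]:
    """Naive 1D discrete Fourier transform (stand-in for a 1D FFT)."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def corner_turn(m: Matrix) -> Matrix:
    """Transpose the matrix so that columns become rows."""
    return [list(col) for col in zip(*m)]


def fft2_by_corner_turn(m: Matrix) -> Matrix:
    """Row transforms, corner turn, row transforms again, then turn back."""
    rows_done = [dft(row) for row in m]
    turned = corner_turn(rows_done)
    cols_done = [dft(row) for row in turned]
    return corner_turn(cols_done)


def dft2_direct(m: Matrix) -> Matrix:
    """Direct 2D DFT from the definition, used only as a reference."""
    n_rows, n_cols = len(m), len(m[0])
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                 for r in range(n_rows) for c in range(n_cols))
             for v in range(n_cols)]
            for u in range(n_rows)]


matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = fft2_by_corner_turn(matrix)
reference = dft2_direct(matrix)
max_error = max(abs(result[u][v] - reference[u][v]) for u in range(4) for v in range(4))
print(max_error)
```

The corner turn exists purely so the second pass can again work on contiguous rows, which is the data-locality point made on the slide.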
Photonic Synchronous Coalesced Access Network (PSCAN)
• Utilizes distance independence to rapidly re-organize spatially separate data
• Large gains in efficiency even when bandwidth is equalized
• SCA write
– Synthesizes matrix row-to-column transpose/write into a single transaction by interleaving data from spatially separate data producers
– Novel ISA construct enabled by a highly synchronous photonic waveguide
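The idea behind the SCA write, interleaving data from separate producers so that the combined stream is already transposed, can be mimicked in software. This is only an analogy under stated assumptions: each producer is taken to hold one matrix row, `sca_interleave` is a hypothetical name, and none of the photonic timing is modeled.

```python
from typing import List


def sca_interleave(rows: List[List[int]]) -> List[int]:
    """Round-robin over producers: each time slot takes one element from every row."""
    n_cols = len(rows[0])
    stream: List[int] = []
    for c in range(n_cols):        # one group of time slots per column
        for producer_row in rows:  # each producer drives its element in turn
            stream.append(producer_row[c])
    return stream


rows = [[0, 1, 2], [10, 11, 12]]  # two producers, one matrix row each
stream = sca_interleave(rows)

# Row-major layout of the transposed matrix, built the conventional way for comparison.
transposed_row_major = [value for column in zip(*rows) for value in column]
print(stream)
```

The interleaved stream equals the row-major layout of the transposed matrix, so a single coalesced write delivers the data in column order without a separate transpose pass.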
[Figure: PSCAN waveguide layout (nodes labeled R, with MI and M1) and timing diagram of DRAM P-CLK and DRIVE0 to DRIVE3 over time, annotated "Address info driven by proc 3" and "Data is not driven"]
PSCAN leverages the differences between electronic and photonic interconnect to achieve large efficiency gains in critical operations
Credit: Dave Whelihan, MITLL
Preliminary LLMORE Results
Experiment:
• Number of memory controllers: 4
• Matrix size: 4096 x 4096
• Equalize bandwidth to memory in photonic and electronic models
• Run traditional 2D FFT with block transpose for electronic mesh
• Run 2D FFT with SCA for PSCAN
• Scale number of processors
Examine performance of photonics and electronics in an increasingly work-starved environment
PSCAN architecture with SCA yields significant speed-up over electronic mesh architectures using block transpose
[Figure: Performance of 2D FFT Operation. GOPS (log scale, 10^9 to 10^11) versus number of cores (4 to 4096) for Electronic, PSCAN, and Ideal; annotated "43% of Ideal"]
Data Reorganization Bottleneck in Modern Manycore Systems
As the number of cores in modern CMPs grows, the data reorganization becomes an
increasing performance bottleneck. Performance is limited by the inability to
keep processors supplied with data.
Block Transpose (FFT, CT, FFT): Read FFT Write, Read Transpose Write, Read FFT Write
SCA (FFT, SCA, FFT): Read FFT SCA, Read FFT Write
SCA operation performs data reorganization step of 2D FFT more efficiently as number of processors grows, alleviating traditional block transpose bottleneck
[Figure: Time Spent Reorganizing Data for 2D FFT. Percentage of time spent in data reorganization (0 to 1) versus number of cores (4 to 4096) for Block Transpose and SCA]
LLMORE Next Steps
• Extend architecture model, LLMORE simulator
– Support memory hierarchy
– Model network contention more accurately
• Support for external simulators
– E.g., PhoenixSim (Columbia), SST (Sandia)
– Needed for higher fidelity simulations
• Better power modeling (e.g., dynamic power)
• Additional parallel numerical library/language support
– Additional languages: Matlab, Python
– Additional libraries: e.g., VSIPL++/PVTOL
– Additional kernels: e.g., sparse matrix operations
• Code generator and runtime engine for execution/emulation on target architectures
Summary
• LLMORE: Parallel framework/environment for
– Optimizing data to processor mapping for parallel applications
– Simulating and optimizing new (and existing) architectures
• LLMORE used to compare photonic and electronic architectures of interest to POEM project
– Support for many applications (through composition of kernels)
– Support for different architectures
• Preliminary results for 2D FFT kernel
– Support thesis that photonics can improve performance for these applications
– SCA instruction mitigates performance impact of corner turn in 2D FFT operation
LLMORE allows for exploration of architecture design space in context of real application constraints
Acknowledgements
• LLMORE
– Michelle Beard
– Anna Klein
– Sanjeev Mohindra
– Julie Mullen
– Eric Robinson
– Nadya Bliss (Manager)
– Minna Song (MIT student intern)
• POEM
– Dave Whelihan
– Jeff Hughes
– Scott Sawyer