Date post: | 30-Mar-2015 |
Upload: | david-gitt |
WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS
ODES-9
ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
64-bit datapath
64-bit addressing and high-precision computing
[Figure: one 64-bit adder with 64-bit data paths]
[Figure: four 16-bit adders with 64-bit data paths]
16-bit integer datapath
64-bit addressing and high-precision computing
40% of computations need only a 16-bit datapath
Caveat: one 64-bit computation becomes 8 * 16-bit computations (DBT: dynamic binary translation)
[Figure: four 16-bit adders]
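The caveat above can be illustrated with a minimal sketch of how one 64-bit addition might be composed from 16-bit additions with carry propagation. The name `add64_via_16bit` and the loop structure are illustrative assumptions, not the translation scheme from the talk. In particular, the sketch does not try to reproduce the slide's count of eight narrow computations.

```python
MASK16 = 0xFFFF  # mask for one 16-bit chunk

def add64_via_16bit(a, b):
    """Add two 64-bit values using four 16-bit additions (hypothetical helper)."""
    result = 0
    carry = 0
    for i in range(4):                       # chunks, least significant first
        shift = 16 * i
        chunk_a = (a >> shift) & MASK16
        chunk_b = (b >> shift) & MASK16
        partial = chunk_a + chunk_b + carry  # at most 17 bits wide
        result |= (partial & MASK16) << shift
        carry = partial >> 16                # carry into the next chunk
    return result                            # final carry dropped: wraps modulo 2**64

print(hex(add64_via_16bit(0x1, 0x25)))
```

Each narrow addition depends on the carry of the previous one, which is one reason a wide operation may cost more than a simple count of chunks suggests.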
What does non-productive mean?
  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
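The slide text poses the question without stating a definition. One possible reading, offered here as an assumption, is that when a 64-bit value is handled as four 16-bit chunks, operations on chunks that carry no information (all zero in this example) do no useful work. The helper names below are hypothetical.

```python
def split_chunks(value, width=16, total=64):
    """Split a value into chunks of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, total, width)]

def zero_chunks(value):
    """Count the 16-bit chunks of a 64-bit value that are entirely zero."""
    return sum(1 for chunk in split_chunks(value) if chunk == 0)

# The two operands and the sum from the slide
for operand in (0x1, 0x25, 0x26):
    print(hex(operand), split_chunks(operand), zero_chunks(operand))
```

For all three values on the slide, only the lowest chunk is nonzero, so three of the four chunks hold zeros.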
Contributions and conclusions
1. Narrow ISA offers more opportunities to remove non-productive memory operations
2. 50% of dynamic narrow operations are non-productive
3. Memory Productiveness Pruning: profile-guided, dynamic optimization
ENERGY EFFICIENT CODE GENERATION FOR PROCESSORS WITH EXPOSED DATAPATH
DONGRUI SHE, YIFAN HE, BART MESMAN, HENK CORPORAAL (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction
REGISTER REUSE SCHEDULING
GERGÖ BARANY
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
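The method of adding extra arcs to the DDG can be sketched on a toy example. The four-instruction basic block, the helper `lifetimes_overlap`, and the chosen arc are hypothetical illustrations, not the selection algorithm from the talk. The sketch only shows that one added arc can rule out schedules in which two values are live at the same time, so that both values could share a register.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical basic block: b uses the value defined by a, d uses the value
# defined by c. The two chains are independent in the original DDG.
ddg = {"a": set(), "b": {"a"}, "c": set(), "d": {"c"}}  # node -> predecessors

def lifetimes_overlap(order, def1, last_use1, def2, last_use2):
    """True if the ranges [def, last use] of two values overlap in a schedule."""
    pos = {instr: i for i, instr in enumerate(order)}
    return pos[def1] < pos[last_use2] and pos[def2] < pos[last_use1]

# One legal schedule of the original DDG interleaves the chains.
interleaved = ["a", "c", "b", "d"]
print(lifetimes_overlap(interleaved, "a", "b", "c", "d"))

# Extra arc b -> c: the second value may only be defined after the last use
# of the first one, so every legal schedule keeps the lifetimes disjoint.
constrained = {node: set(preds) for node, preds in ddg.items()}
constrained["c"].add("b")
forced = list(TopologicalSorter(constrained).static_order())
print(forced, lifetimes_overlap(forced, "a", "b", "c", "d"))
```

The scheduler itself is left unchanged in this sketch; the decision is encoded purely as a graph constraint, which matches the "implicitly enforce" wording of the slide.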
Register Allocation and spilling
[Figure: virtual registers, physical registers, memory]
Register Allocation with reuse candidates
[Figure: basic block, data dependence graph, and interference graph; legend: definitely overlap, definitely NO overlap, possible overlap]
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
MOUNIRA BACHIR, SID-AHMED-ALI TOUATI, ALBERT COHEN
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is related to memory requirements and I-cache performance
Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones
Results: “Good” if there are enough functional units to perform the additional move operations, with acceptable execution time
Periodic Register Allocation
• Rotating Register File
• Move operations: d-1 MOVs/iteration (d: iteration span of variables)
• Loop unrolling: 3 * code size
• Modulo Variable Expansion: a[i], b[i], c[i], a[i+1], b[i+1], c[i+1], a[i+2], b[i+2], c[i+2], using 9 registers instead of 8 (MAXLIVE = 8)
• Meeting Graph: lifetime in cycles; the lifetime interval of c ends when the interval of b begins
Meeting Graph
a[i], b[i], c[i], a[i+1], b[i+1], c[i+1], a[i+2], b[i+2], c[i+2], a[i+3], b[i+3], c[i+3], a[i+4], b[i+4], c[i+4], a[i+5], b[i+5], c[i+5], a[i+6], b[i+6], c[i+6], a[i+7], b[i+7], c[i+7]
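The trade-off between register count and unrolling can be sketched numerically. The lifetimes below (3, 3, and 2 cycles with II = 1) are hypothetical values chosen to agree with the numbers on the slides; the slides do not state them. The sketch assumes the usual formulations: modulo variable expansion unrolls by the largest number of simultaneously live copies and gives each variable the smallest divisor of that factor that covers its copies, while the meeting graph approach unrolls by the least common multiple of the circuit weights.

```python
from functools import reduce
from math import ceil, gcd

def lcm(x, y):
    return x * y // gcd(x, y)

def mve(lifetimes, ii):
    """Modulo variable expansion: return (unroll factor, registers used)."""
    copies = [ceil(lt / ii) for lt in lifetimes]  # live copies per variable
    unroll = max(copies)
    # Each variable gets the smallest divisor of the unroll factor >= its copies.
    regs = sum(min(d for d in range(q, unroll + 1) if unroll % d == 0)
               for q in copies)
    return unroll, regs

def meeting_graph_unroll(circuit_weights):
    """Unroll factor as the lcm of the circuit weights (assumed formulation)."""
    return reduce(lcm, circuit_weights, 1)

II = 1                                    # hypothetical initiation interval
lifetimes = {"a": 3, "b": 3, "c": 2}      # hypothetical lifetimes in cycles
maxlive = sum(lifetimes.values())         # valid shortcut because II = 1

print(maxlive)                            # registers needed at minimum
print(mve(list(lifetimes.values()), II))  # smaller unroll factor, one extra register
print(meeting_graph_unroll([maxlive]))    # single circuit through a, b, c
print(meeting_graph_unroll([4, 4]))       # hypothetical split into two circuits
```

Under these assumptions, a single circuit of weight 8 uses the minimum of 8 registers but forces an unroll factor of 8, which is why splitting circuits into smaller ones is attractive for code size.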
Circuit Decomposition
2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION
Main Conference
MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER
ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)
Micro-architectural: not always documented
Proprietary compilers at advantage!
[Figure: a SPEC2000 int loop, shown without and with an inserted NOP]
+1 NOP instruction
-7% execution time
Micro-architectural: not always documented
Example: instruction decoding in Core 2 in chunks of 16 bytes
[Figure: the SPEC2000 int loop relative to 16-byte alignment boundaries, without and with the NOP]
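The alignment effect can be sketched with simple address arithmetic. The addresses and loop size below are hypothetical, and the helpers are not part of MAO; the sketch only counts how many 16-byte decode chunks a loop body touches before and after a one-byte NOP shifts it.

```python
CHUNK = 16  # decode chunk size in bytes, as stated on the slide

def chunks_spanned(start, size):
    """Number of 16-byte chunks touched by `size` bytes of code at `start`."""
    first = start // CHUNK
    last = (start + size - 1) // CHUNK
    return last - first + 1

def padding_to_align(start):
    """Bytes of padding needed to move `start` to the next 16-byte boundary."""
    return -start % CHUNK

loop_start, loop_size = 15, 16           # hypothetical: loop begins one byte early
print(chunks_spanned(loop_start, loop_size))
pad = padding_to_align(loop_start)       # a single one-byte NOP in this case
print(pad, chunks_spanned(loop_start + pad, loop_size))
```

In this hypothetical layout, one NOP moves the loop from two decode chunks into one.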
Contributions and conclusions
1. Extensible assembly to assembly optimizer
2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level
3. Discover micro-architectural details semi-automatically through generation of micro-benchmarks
DYNAMIC REGISTER PROMOTION OF STACK VARIABLES
JIANJUN LI, CHENGGANG WU, WEI-CHUNG HSU
Use DBT to let x86 binaries use the extra registers on x86-64
• recompiling is not always an option (legacy binaries)
• compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)
LANGUAGE AND COMPILER SUPPORT FOR AUTO-TUNING VARIABLE-ACCURACY ALGORITHMS
JASON ANSEL, YEE LOK WONG, CY CHAN, MAREK OLSZEWSKI, ALAN EDELMAN, SAMAN AMARASINGHE (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
1. New programming language, toolchain and run-time environment
2. Technique for mapping variable accuracy code to enable efficient auto-tuning
PRACTICAL MEMORY CHECKING WITH DR. MEMORY
DEREK BRUENING (GOOGLE), QIN ZHAO (MIT)
x86
Existing memory checking tools (e.g. Valgrind)
• slow
• many false positives
A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER
HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)
Extend the compilation scope from methods to traces
• Traces span multiple method invocations
• More powerful than method inlining
Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting
PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS
TYLER SONDAG AND HRIDESH RAJAN
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are used to map the whole cluster
Results: 36% average process speedup with negligible overheads
Phase-based tuning
VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE
DORIT NUZMAN, SERGEI DYSHEL, ERVEN ROHOU, IRA ROZEN, ALBERT COHEN, AYAL ZAKS
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix-and-match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers
Vectorizing for different platforms
Split vectorization scheme
Interoperable compilation flows
This is not a bullet slide.