Structure Layout Optimizations in the
Open64 Compiler: Design, Implementation
and Measurements
Gautam Chakrabarti
and
Fred Chow
PathScale, LLC.
Open64 Workshop 2008 2
Outline
Motivation
Types of structure layout optimizations
Criteria for structure layout optimizations
Implementation details
Performance results
Future work
Conclusion
Motivation
Poor data locality in many applications
High data cache miss rates
Growing gap between processor and memory speeds
Our Approach
Change layout of data structures
Requires whole-program optimization
Use Inter-Procedural Analysis and Optimizations (IPA)
Our Aim
Make applications more cache-friendly
IPA
Summarization
Analysis
Optimization
Types of Structure Layout Optimizations
Structure splitting
struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A * next;
};
Structure peeling
struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};
Structure Splitting Example
Before:
struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A * next;
};
After:
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A * next;
    struct cold_sub_struct_A * p;
};
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};
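The effect of splitting on source-level accesses can be sketched as follows. This is a minimal illustration, not the compiler's generated code: the two structure definitions come from the slide, while the helper names alloc_node, sum_hot and read_cold_f are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Cold fields moved out of the hot structure (definition from the slide). */
struct cold_sub_struct_A { double d2; float f; char c; };

/* Hot fields plus a link to the cold part (definition from the slide). */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical helper: allocate one zeroed node together with its cold part. */
static struct new_struct_A *alloc_node(void)
{
    struct new_struct_A *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->p = calloc(1, sizeof *n->p);
    if (n->p == NULL) {
        free(n);
        return NULL;
    }
    return n;
}

/* Hot traversal: reads only fields that stayed in the hot structure. */
static long long sum_hot(const struct new_struct_A *head)
{
    long long total = 0;
    for (const struct new_struct_A *n = head; n != NULL; n = n->next)
        total += n->l + n->i;
    return total;
}

/* A cold access that was a->f before splitting becomes a->p->f. */
static float read_cold_f(const struct new_struct_A *a)
{
    assert(a != NULL && a->p != NULL);
    return a->p->f;
}
```

The hot traversal never dereferences p, so the cold fields are not pulled into the cache while walking the list.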
Structure Peeling Example
Before:
struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};
After:
struct new_struct_A {
    double d1;
    int i;
    long long l;
};
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};
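Because the peeled structure has no pointer to itself, no link field is needed between the two parts. One way to picture the result, assuming the original data was an array of struct_A, is a pair of parallel arrays indexed by the same element number. The helpers sum_d1 and read_c below are hypothetical and only illustrate the access pattern.

```c
#include <assert.h>
#include <stddef.h>

/* The two peeled structures (definitions from the slide). */
struct new_struct_A { double d1; int i; long long l; };
struct cold_sub_struct_A { double d2; float f; char c; };

/* Assumption: element k of the original array is now hot[k] plus cold[k]. */

/* Hot loop: streams through the densely packed hot array only. */
static double sum_d1(const struct new_struct_A *hot, size_t n)
{
    assert(hot != NULL || n == 0);
    double total = 0.0;
    for (size_t k = 0; k < n; k++)
        total += hot[k].d1;
    return total;
}

/* A cold access that was a[k].c before peeling becomes cold[k].c. */
static char read_c(const struct cold_sub_struct_A *cold, size_t k)
{
    assert(cold != NULL);
    return cold[k].c;
}
```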
Criteria for structure layout optimizations
Legality Analysis
    Type cast
    Address of a field is taken
    Escaped types
    Parameter types
    Full visibility to IPA
    Alignment restrictions
Profitability Analysis
    Hotness
    Affinity
    Field accesses at loop level
    Size
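Two of the legality criteria can be illustrated with small source patterns. The sketch below is an interpretation, not taken from the slides: struct rec and both function names are hypothetical, and the comments give one plausible reason why each pattern would make a layout change unsafe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate type. */
struct rec { double d1; double d2; int i; };

/* Type cast: the object is viewed as raw bytes, so the consumer of dst
 * depends on the exact original layout of struct rec. */
static void copy_raw(unsigned char *dst, const struct rec *r)
{
    assert(dst != NULL && r != NULL);
    memcpy(dst, (const unsigned char *)r, sizeof *r);
}

/* Address of a field is taken: the returned pointer may outlive the
 * compiler's view of the access, and pointer arithmetic on it could
 * assume that d2 follows d1 in memory. */
static double *addr_of_d1(struct rec *r)
{
    assert(r != NULL);
    return &r->d1;
}
```

A conservative analysis would mark struct rec as not transformable when it sees either pattern.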
Implementation Details
Step 1: Type information summarization (IPL)
Step 2: Symbol table merging (IPA)
Step 3: Legality and profitability analysis (IPA analysis)
Step 4: Transforming the program (IPA optimization)
Implementation Details: Type information summarization
Information summarization in IPL
Framework for computing static profiles using heuristics
New TY flag TY_NO_SPLIT
SUMMARY_TY_INFO
SUMMARY_LOOP
    For each DO_LOOP, WHILE_DO, DO_WHILE
    Bit-vector to track field accesses of up to N structures for each loop
    Considers field accesses immediately inside loop
        These fields are considered affine to each other
    Execution count of statements immediately inside loop
        From statically estimated profiles or from runtime feedback
Implementation Details: IPA Analysis
Inter-procedurally update statically estimated execution count of PUs
Update statically estimated loop frequencies in SUMMARY_LOOP
Consider SUMMARY_LOOP from the hottest P PUs
Determine candidates for structure-layout transformation
Determine new layout of structures
Implementation Details: IPA Analysis Example
Loop   F4   F3   F2   F1   BV
L1          22        22   0101
L2               14        0010
L3          12        12   0101
L4     8    8              1100
L5          6         6    0101

Group  F4   F3   F2   F1
AG1         40        40
AG2              14
AG3    8    8

Li: Loops
Fj: Fields in a struct
AGk: Affinity groups
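The grouping in this example is consistent with merging loops that have identical bit vectors and summing their execution counts. The sketch below implements only that reading. The slides do not spell out the actual algorithm or how a field shared by two groups (F3 here) is resolved, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inputs and outputs for the grouping step. */
struct loop_info { uint32_t bv; uint64_t count; };
struct affinity_group { uint32_t bv; uint64_t weight; };

/* Merge loops with identical bit vectors into one group and accumulate
 * their counts. Writes at most cap groups to out and returns how many
 * groups were produced, in order of first appearance. */
static size_t build_affinity_groups(const struct loop_info *loops, size_t n,
                                    struct affinity_group *out, size_t cap)
{
    size_t n_groups = 0;
    for (size_t i = 0; i < n; i++) {
        size_t g = 0;
        while (g < n_groups && out[g].bv != loops[i].bv)
            g++;
        if (g == n_groups) {           /* first loop with this bit vector */
            assert(n_groups < cap);
            out[g].bv = loops[i].bv;
            out[g].weight = 0;
            n_groups++;
        }
        out[g].weight += loops[i].count;
    }
    return n_groups;
}
```

Fed with the five loops of the example (bit 0 as F1), this produces three groups with weights 40, 14 and 8, matching AG1, AG2 and AG3.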
Implementation Details: Transforming the program
New type definitions
Field table update
Field access statements
New symbols
Assignment statements
Example:
Before:
struct S {
    // N fields
    struct T * p;
    // M fields
};
struct T {
    // AG1 fields
    // AG2 fields
};
// peel T
After:
struct S {
    // N fields
    struct T1 * p1;
    struct T2 * p2;
    // M fields
};
struct T1 {
    // AG1 fields
};
struct T2 {
    // AG2 fields
};
Implementation Details: Transforming the program (continued)
Function calls to memory management routines
Example:
p = (T *) malloc (N * sizeof (T));
if (p == NULL)
    exit (1);
Detect memory management routine calls involving transformed type T
Replicate call, assignment statements
Update size of memory being allocated
Handle comparisons involving pointer p
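The steps listed above can be sketched for the malloc example, assuming T has been peeled into two parts. The types T1 and T2 stand in for the peeled halves, their fields are invented for illustration, and alloc_peeled and free_peeled are hypothetical helpers. The sketch returns an error code instead of calling exit so that it can be exercised in isolation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical peeled halves of the original type T. */
struct T1 { int hot_a; int hot_b; };
struct T2 { double cold_x; };

/* Original source:
 *     p = (T *) malloc (N * sizeof (T));
 *     if (p == NULL) exit (1);
 * Sketch after peeling: the call and the assignment are replicated once
 * per new type, each with its own element size, and the NULL comparison
 * on p is rewritten to cover both new pointers.
 * Returns 0 on success and -1 on allocation failure. */
static int alloc_peeled(size_t n, struct T1 **p1, struct T2 **p2)
{
    assert(p1 != NULL && p2 != NULL);
    *p1 = (struct T1 *) malloc(n * sizeof(struct T1));
    *p2 = (struct T2 *) malloc(n * sizeof(struct T2));
    if (*p1 == NULL || *p2 == NULL) {
        free(*p1);   /* free(NULL) is a no-op */
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return -1;
    }
    return 0;
}

/* A call free(p) in the original is likewise replicated. */
static void free_peeled(struct T1 *p1, struct T2 *p2)
{
    free(p1);
    free(p2);
}
```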
Performance Results
Compilation options: -Ofast at 32-bit ABI
Speedup due to structure layout optimizations
Benchmarks       AMD Opteron™   AMD Barcelona   Intel® EM64T   Intel® Core™   SiCortex MIPS®   Geometric
                 (2.8GHz, 4GB,  (2.0GHz, 8GB,   (3.4GHz, 4GB,  (3.0GHz, 4GB,  (500MHz, 4GB,    Mean
                 1MB)           512KB)          1MB)           4MB)           256KB)
179.art          134%           66%             56%            47%            41%              62.5%
181.mcf          24%            23%             23%            31%            13%              22.0%
462.libquantum   32%            17%             40%            72%            62%              39.6%
Geometric Mean   46.9%          29.6%           37.2%          47.2%          32.1%            37.9%
Performance Results (continued)
Compilation options: -Ofast at 64-bit ABI
Speedup due to structure layout optimizations
Benchmarks       AMD Opteron™   AMD Barcelona   Intel® EM64T   Intel® Core™   SiCortex MIPS®   Geometric
                 (2.8GHz, 4GB,  (2.0GHz, 8GB,   (3.4GHz, 4GB,  (3.0GHz, 4GB,  (500MHz, 4GB,    Mean
                 1MB)           512KB)          1MB)           4MB)           256KB)
179.art          169%           66%             53%            60%            45%              69.3%
181.mcf          25%            35%             12%            30%            7%               18.6%
462.libquantum   82%            51%             75%            70%            69%              68.6%
Geometric Mean   70.2%          49.0%           36.3%          50.1%          27.9%            44.6%
Performance Results (continued)
Compilation options: -Ofast at 64-bit ABI
Multiple copies of 462.libquantum running on multi-core chip
Platform: Quad-core AMD Barcelona (2.0 GHz, 8GB, 512KB, 2MB)
3rd level cache shared among 4 cores
Speedup from structure layout optimizations
Benchmark        1 copy   2 copies   4 copies
462.libquantum   51%      69%        123%
Future Work
Tune static profile estimation
Fewer restrictions
Integrate with field-reordering
Conclusion
A framework for performing structure layout transformations
is now available in the Open64 compiler.
The superior infrastructure in the Open64 compiler helped us
implement the optimizations cleanly and with relatively little
effort.
Substantial speedups are possible on some of the SPEC CPU2000
and CPU2006 benchmarks.
Structure layout optimization is a required feature for a
compiler to remain competitive.