Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements

Gautam Chakrabarti and Fred Chow

PathScale, LLC.

Open64 Workshop 2008

Outline

Motivation

Types of structure layout optimizations

Criteria for structure layout optimizations

Implementation details

Performance results

Future work

Conclusion


Motivation

Poor data locality in many applications

High data cache miss rates

Growing gap between processor and memory speeds

Our Approach

Change layout of data structures

Requires whole-program optimization

Use Inter-Procedural Analysis and Optimizations (IPA)

Our Aim

Make applications more cache-friendly


IPA

Summarization

Analysis

Optimization


Types of Structure Layout Optimizations

Structure splitting

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A *next;
};

Structure peeling

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};


Structure Splitting Example

Original:

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
    struct struct_A *next;
};

After splitting:

struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};
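The effect of splitting on a hot traversal can be sketched in C as follows. This is only an illustration using the type names from the slide: the helpers `new_node`, `sum_d1`, `cold_d2` and `free_list` are hypothetical, and the compiler performs the equivalent rewrite on its intermediate representation rather than on source code.

```c
#include <assert.h>
#include <stdlib.h>

/* Cold fields moved out of the hot structure (names follow the slide). */
struct cold_sub_struct_A { double d2; float f; char c; };

/* Hot fields stay together; p links to the rarely used cold part. */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical helper: allocate a node plus its cold part and push it
   onto the front of the list. Returns NULL on allocation failure. */
struct new_struct_A *new_node(double d1, struct new_struct_A *next)
{
    struct new_struct_A *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->p = malloc(sizeof *n->p);
    if (n->p == NULL) {
        free(n);
        return NULL;
    }
    n->d1 = d1;
    n->i = 0;
    n->l = 0;
    n->next = next;
    n->p->d2 = 0.0;
    n->p->f = 0.0f;
    n->p->c = '\0';
    return n;
}

/* Hot traversal: reads only d1 and next, so the cold fields are not touched. */
double sum_d1(const struct new_struct_A *head)
{
    double sum = 0.0;
    for (; head != NULL; head = head->next)
        sum += head->d1;
    return sum;
}

/* Cold access: pays one extra pointer dereference through p. */
double cold_d2(const struct new_struct_A *n)
{
    assert(n != NULL && n->p != NULL);
    return n->p->d2;
}

/* Release every node together with its cold part. */
void free_list(struct new_struct_A *head)
{
    while (head != NULL) {
        struct new_struct_A *next = head->next;
        free(head->p);
        free(head);
        head = next;
    }
}
```

The trade-off is that hot loops see a smaller node, while the rare accesses to `d2`, `f` or `c` go through the extra pointer `p`.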


Structure Peeling Example

Original:

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};

After peeling:

struct new_struct_A {
    double d1;
    int i;
    long long l;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};
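As a rough illustration of peeling, an array of `struct_A` can be thought of as becoming two parallel arrays, one per peeled type, indexed the same way. The sketch below assumes that view; `sum_hot` is a hypothetical helper, not part of the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Original type, kept here only for size comparison. */
struct struct_A { double d1; double d2; int i; float f; long long l; char c; };

/* Peeled types (names follow the slide). */
struct new_struct_A { double d1; int i; long long l; };
struct cold_sub_struct_A { double d2; float f; char c; };

/* A hot loop over the peeled data walks only the hot array, so each
   cache line holds more of the elements the loop actually uses. */
double sum_hot(const struct new_struct_A *hot, size_t n)
{
    double sum = 0.0;
    assert(hot != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        sum += hot[k].d1 + (double)hot[k].i;
    return sum;
}
```

Unlike splitting, no link pointer is needed: element `k` of the hot array and element `k` of the cold array together represent the original element `k`.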


Criteria for structure layout optimizations

Legality Analysis
  Type cast
  Address of a field is taken
  Escaped types
  Parameter types
  Full visibility to IPA
  Alignment restrictions

Profitability Analysis
  Hotness
  Affinity
  Field accesses at loop level
  Size
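Two of the legality criteria can be illustrated with small C functions. This is a sketch of the kind of code that ties a program to a particular layout, not a description of the checks Open64 performs; `struct S` and both function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct S { double d; int a; int b; };

/* Type cast: the structure is viewed as raw bytes and field b is read
   through a byte offset. Once that offset is a constant in the program,
   moving b into another structure would make this read the wrong memory. */
int read_b_by_offset(const struct S *s)
{
    const char *base;
    assert(s != NULL);
    base = (const char *)s;
    return *(const int *)(base + offsetof(struct S, b));
}

/* Address of a field is taken: the returned pointer can flow to code the
   compiler may not be able to see or rewrite. */
int *address_of_a(struct S *s)
{
    assert(s != NULL);
    return &s->a;
}
```

A type used in either way is a plausible candidate for being excluded from the transformation.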


Implementation Details

Step 1: Type information summarization (IPL)

Step 2: Symbol table merging (IPA)

Step 3: Legality and profitability analysis (IPA analysis)

Step 4: Transforming the program (IPA optimization)


Implementation Details: Type information summarization

Information summarization in IPL
Framework for computing static profiles using heuristics
New TY flag TY_NO_SPLIT
SUMMARY_TY_INFO
SUMMARY_LOOP
  For each DO_LOOP, WHILE_DO, DO_WHILE
  Bit-vector to track field accesses of up to N structures for each loop
  Considers field accesses immediately inside loop
    These fields are considered affine to each other
  Execution count of statements immediately inside loop
    From statically estimated profiles or from runtime feedback


Implementation Details: IPA Analysis

Inter-procedurally update statically estimated execution count of PUs

Update statically estimated loop frequencies in SUMMARY_LOOP

Consider SUMMARY_LOOP from the hottest P PUs

Determine candidates for structure-layout transformation

Determine new layout of structures


Implementation Details: IPA Analysis Example

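Loop   F4   F3   F2   F1   BV
L1     -    22   -    22   0101
L2     -    -    14   -    0010
L3     -    12   -    12   0101
L4     8    8    -    -    1100
L5     -    6    -    6    0101

Group  F4   F3   F2   F1
AG1    -    40   -    40
AG2    -    -    14   -
AG3    8    8    -    -

Li: loops
Fj: fields in a struct
AGk: affinity groups
BV: bit-vector of fields accessed in the loop (bit order F4 F3 F2 F1)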
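The numbers in this example are consistent with a simple rule: loops whose bit-vectors are identical are merged into one affinity group, and their execution counts are added. The sketch below implements that rule as an assumption; the actual heuristic in Open64 may differ, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 16

/* Hypothetical per-loop summary: bit j-1 of bv is set if field Fj is
   accessed immediately inside the loop. */
struct loop_summary { unsigned bv; unsigned long count; };

/* One affinity group: a set of fields plus the accumulated count. */
struct affinity_group { unsigned bv; unsigned long weight; };

/* Loops L1..L5 from the example (0101 = 0x5, 0010 = 0x2, 1100 = 0xC). */
static const struct loop_summary slide_loops[5] = {
    {0x5, 22}, {0x2, 14}, {0x5, 12}, {0xC, 8}, {0x5, 6}
};

/* Merge loops with identical bit-vectors and sum their counts.
   Returns the number of groups written to groups[]. */
size_t build_affinity_groups(const struct loop_summary *loops, size_t n_loops,
                             struct affinity_group *groups, size_t max_groups)
{
    size_t n_groups = 0;
    assert(loops != NULL && groups != NULL);
    for (size_t i = 0; i < n_loops; i++) {
        size_t g = 0;
        while (g < n_groups && groups[g].bv != loops[i].bv)
            g++;
        if (g == n_groups) {
            if (n_groups == max_groups)
                continue;               /* no free slot: skip this loop */
            groups[g].bv = loops[i].bv;
            groups[g].weight = 0;
            n_groups++;
        }
        groups[g].weight += loops[i].count;
    }
    return n_groups;
}
```

Applied to `slide_loops`, this yields three groups: fields {F3, F1} with weight 40, {F2} with weight 14, and {F4, F3} with weight 8, matching AG1 to AG3.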


Implementation Details: Transforming the program

New type definitions
Field table update
Field access statements
New symbols
Assignment statements

Example:

struct S {
    // N fields
    struct T *p;
    // M fields
};

struct T {
    // AG1 fields
    // AG2 fields
};  // peel T

After peeling T:

struct T1 {
    // AG1 fields
};

struct T2 {
    // AG2 fields
};

struct S {
    // N fields
    struct T1 *p1;
    struct T2 *p2;
    // M fields
};


Implementation Details: Transforming the program (continued)

Function calls to memory management routines

Example:

    p = (T *) malloc (N * sizeof (T));
    if (p == NULL)
        exit (1);

Detect memory management routine calls involving transformed type T

Replicate call, assignment statements
Update size of memory being allocated
Handle comparisons involving pointer p
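One plausible shape for the rewritten allocation, when T has been peeled into T1 and T2, is sketched below. The field contents of `T1` and `T2`, the wrapper `struct peeled_T` and the functions `alloc_T` and `free_T` are assumptions made for the example; in particular, how the compiler rewrites the NULL comparison is not stated on the slide.

```c
#include <assert.h>
#include <stdlib.h>

struct T1 { double x; };   /* stands in for the AG1 fields */
struct T2 { int y; };      /* stands in for the AG2 fields */

/* Hypothetical wrapper: the single pointer p of the original program
   becomes one pointer per peeled type. */
struct peeled_T { struct T1 *p1; struct T2 *p2; };

struct peeled_T alloc_T(size_t n)
{
    struct peeled_T p;
    assert(n > 0);
    /* The malloc call and the assignment are replicated, and each call
       uses the size of its own peeled type instead of sizeof (T). */
    p.p1 = (struct T1 *) malloc(n * sizeof(struct T1));
    p.p2 = (struct T2 *) malloc(n * sizeof(struct T2));
    /* The comparison on p is rewritten here to cover both new pointers. */
    if (p.p1 == NULL || p.p2 == NULL)
        exit(1);
    return p;
}

void free_T(struct peeled_T p)
{
    free(p.p1);
    free(p.p2);
}
```

An access such as `p[k].x` in the original program would then become `p1[k].x`, and an access to an AG2 field would go through `p2[k]`.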


Performance Results

Compilation options: -Ofast at 32-bit ABI

Speedup due to structure layout optimizations

Benchmarks      AMD Opteron™    AMD Barcelona   Intel® EM64T    Intel® Core™    SiCortex MIPS®  Geometric
                (2.8GHz,        (2.0GHz,        (3.4GHz,        (3.0GHz,        (500MHz,        Mean
                4GB, 1MB)       8GB, 512KB)     4GB, 1MB)       4GB, 4MB)       4GB, 256KB)
179.art         134%            66%             56%             47%             41%             62.5%
181.mcf         24%             23%             23%             31%             13%             22.0%
462.libquantum  32%             17%             40%             72%             62%             39.6%
Geometric Mean  46.9%           29.6%           37.2%           47.2%           32.1%           37.9%
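The Geometric Mean entries above appear to be computed directly over the percentage values: for example, 134, 24 and 32 give about 46.9, and 134, 66, 56, 47 and 41 give about 62.5. The sketch below reproduces that arithmetic. The function names are hypothetical, and the n-th root is found by bisection only so that the example does not depend on the math library.

```c
#include <assert.h>
#include <stddef.h>

/* n-th root of v (v >= 0, n >= 1) by bisection on [0, max(v, 1)]. */
static double nth_root(double v, size_t n)
{
    double lo = 0.0;
    double hi = v > 1.0 ? v : 1.0;
    for (int iter = 0; iter < 200; iter++) {
        double mid = 0.5 * (lo + hi);
        double p = 1.0;
        for (size_t k = 0; k < n; k++)
            p *= mid;
        if (p < v)
            lo = mid;
        else
            hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Geometric mean of n positive values: n-th root of their product. */
double geometric_mean(const double *x, size_t n)
{
    double product = 1.0;
    assert(x != NULL && n > 0);
    for (size_t k = 0; k < n; k++)
        product *= x[k];
    return nth_root(product, n);
}
```

Passing the three Opteron figures, or the five 179.art figures, to `geometric_mean` gives values that round to the table entries.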


Performance Results (continued)


Speedup due to structure layout optimizations

Benchmarks      AMD Opteron™    AMD Barcelona   Intel® EM64T    Intel® Core™    SiCortex MIPS®  Geometric
                (2.8GHz,        (2.0GHz,        (3.4GHz,        (3.0GHz,        (500MHz,        Mean
                4GB, 1MB)       8GB, 512KB)     4GB, 1MB)       4GB, 4MB)       4GB, 256KB)
179.art         169%            66%             53%             60%             45%             69.3%
181.mcf         25%             35%             12%             30%             7%              18.6%
462.libquantum  82%             51%             75%             70%             69%             68.6%
Geometric Mean  70.2%           49.0%           36.3%           50.1%           27.9%           44.6%


Performance Results (continued)


Multiple copies of 462.libquantum running on multi-core chip

Platform: Quad-core AMD Barcelona (2.0 GHz, 8GB, 512KB, 2MB)

3rd level cache shared among 4 cores

Speedup from structure layout optimizations

Benchmark       1 copy    2 copies  4 copies
462.libquantum  51%       69%       123%


Future Work

Tune static profile estimation

Fewer restrictions

Integrate with field-reordering


Conclusion

A framework for performing structure layout transformations is now available in the Open64 compiler.

The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.

Substantial speedups are possible on some of the SPEC CPU2000 and CPU2006 benchmarks.

Structure layout optimization is a required feature for a compiler to remain competitive.