LLVM
Tobias Grosser
ENS - INRIA
July 23, 2012
1 / 132
Tobias Grosser
I PhD student with Prof. Albert Cohen at INRIA / Ecole Normale Superieure, France
I Interests: High-Level Compiler Optimizations, SIMDization, Accelerators, Polyhedral Model
I Open Source
  I LLVM: 2 1/2 years, Polly Code Owner
  I GCC: 5 years, Graphite Reviewer
  I Others: CLooG, isl, clang_complete, ppcg, ...
I Worked at: AMD, ARM, Ohio State University
I Google Europe Fellowship in Efficient Computing
I http://www.grosser.es
2 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
3 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
4 / 132
LLVM
I A Compiler Infrastructure Project
I Sub-Projects:
  I LLVM-Core: Optimizer and Target Code Generator
  I clang: C/C++/Objective-C Front-End
  I libc++: C++ Standard Library
  I dragonegg: GCC based Front-End
  I Polly: High Level Optimizer
  I ...
I BSD-like License
I Developed as a Set of Modular Libraries
I Modern C++ Code Base
5 / 132
LLVM Users and Contributors
I Industry
  Adobe, AMD, Apple, ARM, Google, IBM, Intel, Mozilla, Qualcomm, Samsung, Xilinx, ...
I Research
  3440 publications on Google Scholar
I Open Source Community
  Many (I did not count)
6 / 132
Classical Compilers
I clang
  I Modern C/C++/Objective-C Compiler
  I Default for Apple's OS-X and iOS
I dragonegg
  I C/C++/FORTRAN/ADA/Go/D/... Compiler
  I GCC as Front-end
  I LLVM as Optimizer and Back-end
7 / 132
Emscripten - An exotic Back-end
I LLVM-IR to JavaScript Compiler
I Developed by Mozilla
I Translate C/C++ as well as the needed Run-time Libraries
I Translate an Interpreter written in C/C++
I Applications:
  I SQLite
  I h264 decoder
  I Sauerbraten Ego Shooter
  I Python/Lua/Ruby Shell
I http://www.emscripten.org
8 / 132
GHC - Functional Languages
I LLVM Back-end for the Glasgow Haskell Compiler
I Simplified Back-end Implementation
I Performance for Computation Intensive Code
Graphics taken from David Terei
9 / 132
The Python Experience
I unladen-swallow: A C-Python LLVM Back-end
I Goal: A faster Python
I Difficulties:
  I LLVM JIT never used for this kind of language
  I Expressing a high-level language yields long compile times
"A simple `def add(x, y): return x + y` was close to 100 basic blocks due to all the implicit method calls and fallback paths to the interpreter"
Pymothoa:
I Explicit types provided by user
I Optimizations on type-instantiated code
I Non-core code still uses normal python JIT
10 / 132
Accelerators
I OpenCL
  I OpenCL Compilers from AMD, NVIDIA, Intel, RapidMind and ARM are based on LLVM
  I clang OpenCL support is largely Open Source
  I Open Source Back-ends for NVIDIA and AMD GPUs
I Google RenderScript
  I clang as a Front-end
  I LLVM-IR as Exchange Format
  I RenderScript Compiler is Open Source
11 / 132
High Level Synthesis
LLVM-IR to Hardware Description Language
I LegUp
  developed by University of Toronto and Altera Inc.
I C-To-Verilog
  developed by IBM Research
I AutoESL
  developed by UCLA, bought by Xilinx
I Trident
  Imperial College London and two US National Labs
12 / 132
Tobi’s Personal Observations
I LLVM is part of many different compilers
I Optimizations for LLVM have a large impact
I LLVM runs on (high performance) embedded devices
I Having CPU, GPU and FPGA back-ends in a single compiler is great when targeting heterogeneous architectures.
13 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
14 / 132
A static compile flow
main.cpp / moduleA.f90 / moduleB.c
  | Frontends (clang, llvm-gcc, llvm-gfortran)
  v
main.ll / moduleA.ll / moduleB.ll
  | Optimizer (llvm-opt)
  v
main.opt.ll / moduleA.opt.ll / moduleB.opt.ll
  | Linker (llvm-link)
  v
program.ll
  | Optimizer (llvm-opt)
  v
program.opt.ll
  | Target Code Generator (llvm-llc)
  v
program.s
  | Assembler (llvm-mc)
  v
program.o
  | System Linker (ld)
  v
program.exe
15 / 132
LLVM-IR
I Base of all analyses and optimization passes
I Input/Output of most tools
I Target independent (but front-ends can make it target dependent)
16 / 132
LLVM-IR - Three equivalent representations
I In Memory: C++ data structures
I Bitcode: Binary file (.bc)
I Human Readable: Text file (.ll)
Translate one representation to another:
> opt -S program.bc -o program.ll
> opt program.ll -o program.bc
17 / 132
LLVM-IR - Generation from C code
> clang -S -emit-llvm main.c -o main.ll
main.c
int main() {
return 42;
}
main.ll
; ModuleID = ’main.ll’
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-[...]"
target triple = "x86_64-unknown-linux-gnu"
define i32 @main() nounwind {
ret i32 42
}
18 / 132
LLVM-IR - Generate target code
> llc main.ll -o main.s && gcc main.s -o main.exe
> ./main.exe
> echo $?
42
> lli main.ll
> echo $?
42
I llc - LLVM to Assembly compiler
I lli - Just in time compiler
19 / 132
Our first function
Create a function that calculates n2
define i32 @pow(i32 %number) {
%sqrt = mul i32 %number, %number
ret i32 %sqrt
}
define i32 @main() nounwind {
%result = call i32 @pow(i32 7)
ret i32 %result
}
20 / 132
Optimize our first function
> opt -O3 pow.ll -o pow.opt.ll -S
Optimized function
define i32 @pow(i32 %number) nounwind readnone {
%sqrt = mul i32 %number, %number
ret i32 %sqrt
}
define i32 @main() nounwind readnone {
ret i32 49
}
21 / 132
Multiple modules
Combining multiple modules to enable inter-module optimizations:
> llvm-link main.ll pow.ll -o combined.ll -S
22 / 132
bugpoint
I Automatic test case reduction tool
I Works for crashes in opt and llc
I Can also extract miscompiles
> llvm-link main.ll pow.ll -o combined.ll -S
> opt -O3 combined.ll
!! CRASH !!
> bugpoint -O3 combined.ll -S
[...]
Reduced test case created.
You can reproduce the bug with
’opt -basicaa -indvars bugpoint-reduced-simplified.ll’
23 / 132
LLVM-IR - Overview
I Low level assembly like language
I Register machine, infinite number of (named) registers
I Each instruction defines a new (named) register
I Load/Store Architecture
I Defined at http://llvm.org/docs/LangRef.html
24 / 132
LLVM-IR - Types
I LLVM-IR is strongly typed
I Each register/pointer/function has an associated type
I No implicit type casts
I A program without casts is type-safe in the absence of memory access errors (e.g. array overflow)
25 / 132
LLVM-IR - Type classes
I Primitive
  integer, floating point, label, metadata, void, x86mmx
I Derived
  array, function, pointer, structure, packed structure, vector, opaque
First class types - Non first class types
26 / 132
LLVM-IR - Integer types
I Fixed bitwidth
I Any bit width from 1 bit to 2^23 − 1
I Larger types as function parameters → backend dependent
I Signedness not defined
i1 ; Boolean type
i8 ; char
i32 ; 32 bit integer
i64 ; 64 bit integer
i121212 ; Very large integer
27 / 132
LLVM-IR - Floating point types
float ; 32 bit
double ; 64 bit
fp128 ; 128 bit (112-bit mantissa)
x86_fp80 ; 80 bit (X87)
ppc_fp128 ; 128 bit (two 64-bits)
28 / 132
LLVM-IR - Label type
I A (named) reference for a basic block
define i1 @foo() {
start:
br label %next
next:
br label %return
return:
ret i1 0
}
29 / 132
LLVM-IR - Values
I Created through an instruction
I Global values
I Constants
I Undefined
%result = add i32 5, 10
i1 0
i32 15
float undef
30 / 132
LLVM-IR - Constants
i1 true ; Boolean constants
i1 false
i32 -1 ; equal to 2^32 − 1
float 123.421 ; Exact decimal notation
              ; (!) 1.3 has an infinite
              ; binary representation
float 1.23421e+2 ; Exponential notation
double 0x432ff973cafa8000 ; Hexadecimal notation
<type> zeroinitializer ; <type> can be any type
31 / 132
LLVM-IR - Instructions
I Calculations
I Vector/Structure management
I Type conversion
I Memory management
I Control flow instructions
32 / 132
LLVM-IR - Computational instructions
I Side effect free
I Take values as input
I Create a new register value
%sum = add i32 %a, %b
%product = fmul float %a, %b
%unsigned_div = udiv i32 %a, %b
%signed_div = sdiv i32 %a, %b
%division = fdiv float %a, %b
33 / 132
LLVM-IR - Computational instructions - Comparisons
%equal = icmp eq i32 %a, %b
%not_equal = icmp ne i5 %c, %d
%signed_less_than = icmp slt i3 %a, %b
%unsigned_less_than = icmp ult i5 %a, %b
34 / 132
LLVM-IR - Control flow instructions
I (Un)Conditional branch
I Switch
I Return
I Indirect branch, Invoke, Unwind
start:
br i1 true, label %left, label %right
left:
br label %join
right:
br label %join
join:
ret i32 %joinedValue
35 / 132
LLVM-IR - PHI instruction
I Implements the Φ SSA instruction
start:
br i1 true, label %left, label %right
left:
%plusOne = add i32 0, 1
br label %join
right:
br label %join
join:
%joinedValue = phi i32 [ %plusOne, %left],
[ -1, %right]
ret i32 %joinedValue
36 / 132
LLVM-IR - Call instruction
I Calls a function
I Saves the return value in a new register
%result = call i32 @pow(i32 7)
37 / 132
LLVM-IR - Type classes
I Primitiveinteger, floating point, label, metadata, void, x86mmx,
I Derivedarray, function, pointer, structure, packed structure, vector,opaque
First class types - Non first class types
38 / 132
LLVM-IR - Array type
I Set of elements arranged sequentially in memory.
I Takes a type and a constant size
I Only fixed-size multi-dimensional arrays.
I No indexing restrictions by type system.
[20 x i1] ; Array of 20 boolean elements
[100 x float] ; Array of 100 float elements
[20 x [100 x i32]] ; Array of 20 arrays of
; 100 i32 elements
[0 x float] ; Zero element array. Can be used
; to implement variable sized arrays
39 / 132
LLVM-IR - Struct Type
I Collection of data elements in memory.
I Padding matches the ABI of the underlying processor.
I Use a packed structure to remove padding.
{float, i64}
{float, {double, i3}}
{float, [2 x i3]}
<{float, [2 x i3]}> ; Packed structure.
; Removes padding
40 / 132
LLVM-IR - Vector type
I Vector of elements
I Used to apply a single instruction on various elements
I Arbitrary width
<4 x float>
<2 x double>
<123 x i3> ; Probably generates inefficient code
41 / 132
LLVM-IR - Pointer type
I Gives a location in memory
I void pointers or pointers to labels are not permitted; use i8* instead.
I Optional address space qualifier
float* ; Pointer to a float
[5 x float]* ; Pointer to an array
<2 x float>* ; Pointer to a vector
float addrspace(5)* ; Pointer to a float in
; address space 5
42 / 132
LLVM-IR - Named Type
I Types can be named
I Names are aliases for types
I Names are not part of the types
%intv4 = type <4 x i32>
%intv8 = type <8 x i32>
%floatptr = type float*
%mytype = type { %mytype*, i32 }
43 / 132
LLVM-IR - Constants
[i1 true, i1 false] ; Constant array
<i3 5, i3 10> ; Constant vector
{i1 true, float 15} ; Constant structure
<2 x i1> zeroinitializer ; Zero vector
44 / 132
LLVM-IR - Instructions
I Computational instructions
I Vector/structure management
I Type conversion
I Memory management
I Control flow instructions
45 / 132
LLVM-IR - Computational instructions
I Applied element-wise on vector types
%sum = add <2 x i32> %a, %b
%product = fmul <4 x float> %a, %b
%equal = icmp eq <2 x i32> %a, %b
%not_equal = icmp ne <3 x i5> %c, %d
46 / 132
LLVM-IR - Vector management
I Get and set an element
I Shuffle elements by a constant shuffle mask
extractelement <4 x float> %vec, i32 0
; yields float
insertelement <4 x float> %vec, float 1, i32 0
; yields <4 x float>
shufflevector <4 x float> %v1, <4 x float> %v2,
<4 x i32> <i32 0, i32 4, i32 1, i32 5>
; yields <4 x float>
47 / 132
LLVM-IR - Array/Structure management
I Extract an element from a structure/array
I Indices need to be in bounds
I Indices are constants
extractvalue {i32, float} %agg, 0
; yields i32
extractvalue {i32, {float, double}} %agg, 0, 1
; yields double
extractvalue [2 x i32] %array, 0
; yields i32
48 / 132
LLVM-IR - Array/Structure management II
I Insert an element into a structure/array.
%agg1 = insertvalue {i32, float} undef, i32 1, 0
; yields {i32 1, float undef}
%agg2 = insertvalue {i32, float} %agg1, float %val, 1
; yields {i32 1, float %val}
%aggA = insertvalue {i32, float} zeroinitializer,
i32 1, 0
; yields {i32 1, float 0}
49 / 132
LLVM-IR - Allocate memory
I alloca - Allocate memory on the stack
I malloc - Use C stdlib memory allocator
%ptr = alloca i32
%ptr = alloca i32, i32 4
%ptr = alloca i32, i32 4, align 1024
%ptr = alloca i32, align 1024
; All yield i32*
%mallocP = call i8* @malloc(i32 %objectsize)
; yields i8* (void pointer)
50 / 132
LLVM-IR - Load/Store memory
I The only operations that can access memory
%ptr = alloca i32
store i32 3, i32* %ptr
%val = load i32* %ptr
51 / 132
LLVM-IR - Select operation
I Select one value depending on a condition
I a = condition ? valueOne : valueTwo
I No branch (mis)prediction necessary
%X = select i1 true, i8 17, i8 42
; yields i8:17
52 / 132
LLVM-IR - Type conversion
I Size conversion int ↔ int
I Size conversion float ↔ float
I float ↔ int
I int ↔ ptr
I Bitcast - Do not change bit representation
trunc i32 257 to i8 ; yields i8:1
zext i32 257 to i64 ; yields i64:257
sext i8 -1 to i16 ; yields i16:-1 (bit pattern 0xFFFF)
bitcast <2 x i32> %V to i64 ; yields i64: same bits as %V
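To sanity-check these conversion rules, here is a small sketch (an illustration, not from the slides) that models trunc/zext/sext on plain Python integers; the asserted values match the examples above:

```python
def trunc(v, to_bits):
    """Truncation keeps only the lowest to_bits bits."""
    return v & ((1 << to_bits) - 1)

def zext(v, from_bits):
    """Zero-extension never changes the unsigned value."""
    return v & ((1 << from_bits) - 1)

def sext(v, from_bits, to_bits):
    """Sign-extension replicates the sign bit into the new upper bits."""
    v &= (1 << from_bits) - 1
    if v >> (from_bits - 1):  # sign bit set
        v |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return v

assert trunc(257, 8) == 1                 # trunc i32 257 to i8
assert zext(257, 32) == 257               # zext i32 257 to i64
assert sext(-1 & 0xFF, 8, 16) == 0xFFFF   # sext i8 -1 to i16 (i16:-1)
```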
53 / 132
LLVM Passes
I Analysis Passes
  (-domtree, -regions, -basicaa)
I Transformation Passes
  I Canonicalization Passes
    (-reg2mem, -indvars, -loop-simplify, -mergereturn)
  I Optimization Passes
    (-mem2reg, -tailcallelim, -constprop, -gvn, -instcombine, -instsimplify)
I Other Passes
  (-view-cfg, -view-cfg-only, -view-dom, -instnamer, -verify)
Get a complete list with "opt -help"
54 / 132
LLVM Pass Types
I Module Pass
I CallGraphSCC Pass
I Function Pass
I Region Pass / Loop Pass
I Basic Block Pass
I Machine Function Pass
55 / 132
The LLVM Pass Philosophy
I Analysis passes provide high-level abstractions
I Canonicalization passes create a canonical representation
I Transformation passes only work on the canonical representation
56 / 132
-instsimplify
I Remove redundant instructions
I Cannot create new instructions
(X + (−1)) + 1 → X
define i32 @add1(i32 %x) {
%l = add i32 %x, -1
%r = add i32 %l, 1
ret i32 %r
}
to
define i32 @add1(i32 %x) {
ret i32 %x
}
57 / 132
-instcombine
I Combine redundant instructions
I Can create new instructions
(x & z) ^ (y & z) → (x ^ y) & z
define i32 @test1(i32 %x, i32 %y, i32 %z) {
%tmp1 = and i32 %z, %x
%tmp2 = and i32 %z, %y
%tmp3 = xor i32 %tmp1, %tmp2
ret i32 %tmp3
}
to
define i32 @test1(i32 %x, i32 %y, i32 %z) {
%tmp1 = xor i32 %x, %y
%tmp2 = and i32 %tmp1, %z
ret i32 %tmp2
}
58 / 132
Regression tests
I Run with make check
I Stored in src-dir/test
I Each transformation has its own directory
59 / 132
lit.py / llvm-lit
I LLVM integrated tester
I Runs the LLVM and Clang test suite
I Used to run individual tests
I bin/llvm-lit in the build directory
~/llvm_build/bin/llvm-lit ~/llvm_git/test/Analysis/Dominators/
-- Testing: 4 tests, 4 threads --
PASS: LLVM :: Analysis/Dominators/2006-10-02-BreakCritEdges.ll
FAIL: LLVM :: Analysis/Dominators/2007-07-12-SplitBlock.ll
XPASS: LLVM :: Analysis/Dominators/2007-07-11-SplitBlock.ll
XFAIL: LLVM :: Analysis/Dominators/2007-01-14-BreakCritEdges.ll
UNSUPPORTED: LLVM :: Analysis/Dominators/other.ll
60 / 132
A single test file
I Run line specifies the test command
I %s is replaced with the test file itself
I Test fails if the command has a non-zero return value
; RUN: opt -mypass %s
define i32 @test1(i32 %A) {
%B = xor i32 %A, 12345
ret i32 %B
}
61 / 132
FileCheck / not
I Use ’FileCheck’ to check for expected transformations
; RUN: opt < %s -instcombine -S | FileCheck %s
define i1 @test1(i32 %A) {
; CHECK: @test1
; CHECK-NEXT: %C = icmp slt i32 %A, 0
%B = xor i32 %A, 12345
%C = icmp slt i32 %B, 0
ret i1 %C
}
I Use ’not’ to switch return codes
; RUN: not opt %s -S
define i1 @test1(i32 %A) {
; Expected failure because of type mismatch
ret i1 %A
}
62 / 132
Instcombine
I LLVM-IR level peephole optimization
I Run with opt -instcombine
I Source code in lib/Transforms/InstCombine
I FunctionPass
63 / 132
Instcombine - Architecture
while (optimizations found && not timeout) {
  I Search the source code for known patterns
  I Create a simplification for them
  I Replace the original set of instructions with the simplification
}
64 / 132
Instcombine - Instruction matching
(A & B) ^ (A | B) → A ^ B

%0 = and i32 %A, %B
%1 = or i32 %A, %B
%2 = xor i32 %0, %1

[Figure: the expression tree XOR(AND(%A, %B), OR(%A, %B)) is rewritten to XOR(%A, %B)]
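As a quick numeric cross-check (an illustration, not part of the slides), the rewrite (A & B) ^ (A | B) → A ^ B can be verified over random 32-bit values:

```python
import random

# The identity instcombine exploits: for any bit, (a&b)^(a|b) == a^b.
random.seed(0)
for _ in range(1000):
    A = random.getrandbits(32)
    B = random.getrandbits(32)
    assert (A & B) ^ (A | B) == A ^ B
```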
65 / 132
Support/PatternMatch.h
I Match a tree of LLVM instructions
I Capture parts of the instruction tree
Value *Exp = ...
Value *X; ConstantInt *C1;
// Exp == (X | C1)
if (match(Exp, m_Or(m_Value(X), m_ConstantInt(C1)))) {
  ... Pattern is matched and variables are bound ...
}
66 / 132
Support/PatternMatch.h
I Match and ignore
  I m_Value() - Any value
  I m_ConstantInt() - Any integer constant
  I m_Undef() - Any undefined value
  I m_Zero() - Any zeroinitializer
  I m_One() - Integer or vector with all elements = 1
  I m_AllOnes() - Integer or vector with all bits = 1
  I m_SignBit() - Integer or vector with only the sign bit set
  I m_Power2() - Integer or vector where all elements are a power of two
67 / 132
Support/PatternMatch.h
I Match and capture
  I m_Value(Value *&V) - Any value
  I m_ConstantInt(ConstantInt *&I) - Any integer constant
  I m_Constant(Constant *&C) - Any constant
I Match and compare
  I m_Specific(Value *V) - Exactly the value V
68 / 132
Support/PatternMatch.h
I Match binary operators
  I m_Add(LHS, RHS) - Match an add instruction
  I m_Sub(LHS, RHS) - Match a sub instruction
  I ...
I Match unary operators
  I m_SExt(Operand) - Match a sext instruction
  I m_Neg(Operand) - Match an integer negate
  I ...
I Matchers for control flow
69 / 132
How to create Instructions?
I include/llvm/InstrTypes.h defines functions to create instructions
I Three ways to create an instruction:
  I Just create the instruction
  I Create the instruction and insert it before another instruction
  I Create the instruction and insert it at the end of a basic block
Instruction *I1 = BinaryOperator::CreateOr(A, B);
Instruction *I2 = BinaryOperator::CreateAnd(A, B);
70 / 132
IRBuilder
I include/llvm/Support/IRBuilder.h
I A helper to automatically
  I Create and insert instructions
  I Get common types
  I Get common constants
Value *V = Builder->CreateOr(A, B);
Type *T = Builder->getInt32Ty();
ConstantInt *I = Builder->getInt32(512);
71 / 132
Loop Optimizations within LLVM
Analyses
I Loop Detection
I Scalar Evolution
Canonicalization
I Loop Simplification
I Induction Variable Simplification
Transformations
I Loop Rotation
I Loop Idiom Recognition (memcpy, . . . )
I Loop Deletion
I Loop Unrolling (also partial unrolling)
I Intra Basic Block Vectorization (in testing)
72 / 132
What is Scalar Evolution?
I A compiler analysis
I Calculates a closed form expression for the values of scalars at different loop iterations.
I Used for loop trip counts, instruction combination, strength reduction, loop canonicalization, ..., Polly.
Example
scalar = A;
for (int j = 0; j < N; j++)
scalar = scalar + B;
scalar = A + j*B = {A,+,B}<j>
73 / 132
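The closed form can be checked against the loop directly; the following sketch (an illustration, not from the slides) compares {A,+,B}<j>, i.e. A + j*B, with the value the loop computes:

```python
def scalar_at(A, B, j):
    """Closed form {A,+,B}<j>: value of scalar at the start of iteration j."""
    return A + j * B

def scalar_by_loop(A, B, j):
    """Run the actual loop: scalar = A; then j times scalar = scalar + B."""
    scalar = A
    for _ in range(j):
        scalar = scalar + B
    return scalar

for j in range(10):
    assert scalar_at(3, 7, j) == scalar_by_loop(3, 7, j)
```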
History
Research
I Bachmann 1994: Chains of recurrences - A method to expedite the evaluation of closed-form functions
I Van Engelen 2000: Chains of recurrences for loop optimization
I Pop 2003: Analysis of induction variables using chains of recurrences
Compilers
I GCC: first commit 20 June 2004
I LLVM: first commit 2 April 2004
74 / 132
Chain of Recurrence / SCEV
Building blocks
I Operations: +, *, /, sext, zext, trunc, smax, umax
I Constant, Sizeof, Alignof
I Unknown Value, Parameter
I Add Recurrences: {expr,+,expr}<Loop>
75 / 132
A simple example¹
define i64 @foo(i64 %a, i64 %b, i64 %c) {
%t0 = add i64 %b, %a
%t1 = add i64 %t0, 7
%t2 = add i64 %t1, %c
ret i64 %t2
}
; SCEV: (7 + %a + %b + %c)
¹ Taken from Dan Gohman's LLVM Developers' Meeting 2009 presentation
76 / 132
Two dimensional array - Without any loops²
double *bar(double a[10][10], long b, long c) {
return &a[b * 3 + 7][c + 5];
}
define double* @bar([10 x double]* %a, i64 %b, i64 %c) {
%bx3 = mul i64 %b, 3
%bx3a7 = add i64 %bx3, 7
%ca5 = add i64 %c, 5
%z = getelementptr [10 x double]* %a,
i64 %bx3a7, i64 %ca5
ret double* %z
}
; SCEV: (((75 + %c + (30 * %b)) * sizeof(double)) + %a)
; SCEV: (600 + (8 * %c) + (240 * %b) + %a)
² Taken from Dan Gohman's LLVM Developers' Meeting 2009 presentation
77 / 132
Add Recurrences
General form: {base,+,stride}<loop>
void foo(long n, double *p) {
for (long i = 0; i < n; ++i)
p[i] = 0.0;
}
As a SCEV: {%p,+,8}<%for.body>
Optionally, without TargetData:
{%p,+,sizeof(double)}<%for.body>
78 / 132
Pointer Loop - CFG
void pointer_loop () {
int *B = A;
while (B < &A[1024]) {
*B = 1;
++B;
}
}
CFG for 'pointer_loop' function:

bb.nph:
  br label %while.body

while.body:
  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %while.body ]
  %B.02 = getelementptr [1024 x i32]* @A, i64 0, i64 %indvar
  store i32 1, i32* %B.02, align 4
  %tmp = shl i64 %indvar, 2
  %ptrincdec.idx = add i64 %tmp, 4
  %cmp = icmp slt i64 %ptrincdec.idx, 4096
  %indvar.next = add i64 %indvar, 1
  br i1 %cmp, label %while.body, label %while.end

while.end:
  ret void
79 / 132
Pointer Loop - Scalar Evolution
%indvar:       {0,+,1}<while.body>   Exits: 1023
%B.02:         {@A,+,4}<while.body>  Exits: (4092 + @A)
%indvar.next:  {1,+,1}<while.body>   Exits: 1024
backedge-taken count: 1023

[Same CFG for 'pointer_loop' as on the previous slide]
80 / 132
Pointer Loop II - CFG
void c(long* p, long n) {
long i;
for (i = 0; i < n; i++)
*(p+i) = i;
}
CFG for 'compute' function:

entry:
  %cmp1 = icmp sgt i64 %n, 0
  br i1 %cmp1, label %for.body, label %for.end

for.body:
  %0 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %add.ptr = getelementptr i64* %p, i64 %0
  store i64 %0, i64* %add.ptr, align 8
  %inc = add nsw i64 %0, 1
  %exitcond = icmp eq i64 %inc, %n
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  ret void
81 / 132
Pointer Loop II - Scalar Evolution
%0:        {0,+,1}<for.body>  Exits: (-1 + n)
%add.ptr:  {p,+,8}<for.body>  Exits: (-8 + (8 * n) + p)
%inc:      {1,+,1}<for.body>  Exits: n
backedge-taken count: (-1 + n)

[Same CFG for 'compute' as on the previous slide]
82 / 132
Linearized Multidimensional Array - CFG
void c(long* p, long n_row,
long n_col) {
long *ptr_row;
long *ptr_col;
long row, col;
ptr_row = p;
for (row=0; row<n_row; row++) {
ptr_col = ptr_row;
for (col=0; col<n_col; col++)
S: *(ptr_col++) = row+col;
ptr_row += n_col;
}
}

CFG for 'pointer_loop_linearized_multidim' function:

entry:
  %cmp4 = icmp sgt i64 %n_row, 0
  %cmp71 = icmp sgt i64 %n_col, 0
  %or.cond = and i1 %cmp4, %cmp71
  br i1 %or.cond, label %bb.nph.us, label %for.end18

bb.nph.us:
  %row.06.us = phi i64 [ %inc17.us, %for.end.us ], [ 0, %entry ]
  %tmp10 = mul i64 %row.06.us, %n_col
  br label %for.body8.us

for.body8.us:
  %col.03.us = phi i64 [ 0, %bb.nph.us ], [ %inc.us, %for.body8.us ]
  %tmp11 = add i64 %tmp10, %col.03.us
  %ptr_col.02.us = getelementptr i64* %p, i64 %tmp11
  %add.us = add i64 %row.06.us, %col.03.us
  store i64 %add.us, i64* %ptr_col.02.us, align 8
  %inc.us = add nsw i64 %col.03.us, 1
  %exitcond = icmp eq i64 %inc.us, %n_col
  br i1 %exitcond, label %for.end.us, label %for.body8.us

for.end.us:
  %inc17.us = add nsw i64 %row.06.us, 1
  %exitcond9 = icmp eq i64 %inc17.us, %n_row
  br i1 %exitcond9, label %for.end18, label %bb.nph.us

for.end18:
  ret void
83 / 132
Linearized Multidimensional Array - Scalar Evolution
%ptr_col.02.us:
  {{p,+,(8 * n_col)}<bb.nph.us>,+,8}<for.body8.us>
  Exits: {(-8 + (8 * n_col) + p),+,(8 * n_col)}<bb.nph.us>
%col.03.us:
  {0,+,1}<for.body8.us>
  Exits: (-1 + n_col)
backedge-taken count "for.body8.us": (-1 + n_col)
backedge-taken count "bb.nph.us": (-1 + n_row)
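The nested add recurrence above can be evaluated by hand; this sketch (an illustration, not from the slides, treating `p` as a plain byte address) checks that it reproduces the linearized index p[row * n_col + col]:

```python
def address(p, n_col, row, col):
    """Evaluate {{p,+,8*n_col}<outer>,+,8}<inner> at (row, col).

    Element size is 8 bytes (i64)."""
    base_of_row = p + 8 * n_col * row  # outer recurrence {p,+,8*n_col}
    return base_of_row + 8 * col       # inner recurrence {.,+,8}

# Must agree with the linearized access *(p + row*n_col + col):
p, n_col = 1000, 16
for row in range(4):
    for col in range(n_col):
        assert address(p, n_col, row, col) == p + 8 * (row * n_col + col)
```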
84 / 132
Use scalar evolution analysis - The opt binary
# Generate LLVM-IR
> clang -S -emit-llvm your-prog.c -o your-prog.ll
# Canonicalize the code. Most probably not all of -O3
# is needed, but -mem2reg is sufficient.
> opt -O3 your-prog.ll > your-prog.preopt.ll
# Run the analysis
> opt -scalar-evolution -analyze your-prog.preopt.ll
85 / 132
Use Scalar Evolution Analysis
void YourAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
AU.setPreservesAll();
AU.addRequired<ScalarEvolution>();
}
bool YourAnalysis::runOnFunction(Function &F) {
  ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
  // Get the SCEV for the first instruction of the function.
  Instruction *FirstInstruction = F.begin()->begin();
  const SCEV *Evolution = SE.getSCEV(FirstInstruction);
  if (isa<SCEVConstant>(Evolution))
    errs() << "The first instruction is a constant SCEV";
  return false;
}
86 / 132
SCEVVisitor, SCEVExpander
SCEVVisitor
A visitor that walks over a SCEV and allows analysing or modifying it.

SCEVExpander
Recreates LLVM-IR from a (possibly modified) SCEV. This is currently only well tested within the existing scalar loop optimizers.
87 / 132
Vectorization status
88 / 132
Conclusion - Loop Optimizations in LLVM
I Scalar evolution provides induction variable analysis
I Good set of scalar loop optimizations
I Almost no optimizations which change the loop structure
I First draft of basic block vectorization
89 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
90 / 132
The idea of Polly?
We want:
I Fast and power-efficient code
We have:
I SIMD, Caches, Multi-Core, Accelerators
But:
I Optimized code is needed
I Manual Optimization is complex and not performance portable
I Architectures are too diverse to optimize ahead of time
Goal:
I Automatic high-level optimizations for heterogeneousarchitectures
91 / 132
Get Polly
I Install Pollyhttp://polly.grosser.es/get_started.html
I Load Polly automatically
alias clang clang -Xclang -load -Xclang LLVMPolly.so
alias opt opt -load LLVMPolly.so
I Default behaviour preserved
I clang/opt now provide options to enable Polly
92 / 132
Optimize a program with Polly
gemm.c [1024 x 1024 (static size), double]
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++) {
C[i][j] = 0;
for (int k = 0; k < K; k++)
C[i][j] += A[i][k] * B[k][j];
}
$ clang -O3 gemm.c -o gemm.clang
$ time ./gemm.clang
real 0m15.336s
$ clang -O3 gemm.c -o gemm.polly -mllvm -polly
$ time ./gemm.polly
real 0m2.144s
93 / 132
The Architecture
LLVM IR LLVM IRPSCoP
SCoP Detection
Code Generation
JSCoP
* Loop transformations* Data layout optimizations* Expose parallelism
Transformations
Manual Optimization / PoCC+Pluto
DependencyAnalysis
Export Import
SIMD
OpenMP OpenCL
94 / 132
Can Polly analyze our code?
$ clang -O3 gemm.c \
-mllvm -polly-show-only \
-mllvm -polly-detect-only=gemm
I Highlight the detected Scops
I Only check in function ’gemm’
[Scop graph for 'gemm': entry → entry.split → for.cond1.preheader → for.body3 → for.body8 → for.inc22 → for.inc25 → for.end27, with the detected SCoP highlighted]
95 / 132
Some code can not be analyzed
$ clang -O3 gemm.c \
-mllvm -polly-show-only \
-mllvm -polly-detect-only=gemm
gemm (possible aliasing)
void gemm(double A[N][K],
double B[K][M],
double C[N][M]) {
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++) {
C[i][j] = 0;
for (int k = 0; k < K; k++)
C[i][j] += A[i][k] * B[k][j];
}
}
[Scop graph for 'gemm': same CFG as before, but no SCoP is highlighted]
%B may possibly alias
%A may possibly alias
%A may possibly alias
96 / 132
How to fix it?
Add ’restrict’
void gemm(double A[restrict N][K],
double B[restrict K][M],
double C[restrict N][M]);
Other options:
I Inlining
I Improved alias analysis
I Run time checks
[Scop graph for 'gemm': with restrict, the whole loop nest is detected as a SCoP]
97 / 132
Extract polyhedral representation
gemm
for (int i = 0; i < 512; i++)
  for (int j = 0; j < 512; j++) {
    C[i][j] = 0;                      // Stmt1
    for (int k = 0; k < 512; k++)
      C[i][j] += A[i][k] * B[k][j];   // Stmt2
  }
$ clang -O3 gemm.c \
-mllvm -polly-run-export-jscop \
-mllvm -polly-detect-only=gemm
Writing JScop ’for.cond1.preheader => for.end27’ in function ’gemm’ to
’./gemm___%for.cond1.preheader---%for.end27.jscop’.
Domain   = {Stmt1[i,j] : 0 <= i,j < 512; Stmt2[i,j,k] : 0 <= i,j,k < 512}
Schedule = {Stmt1[i,j] → [i,j,0]; Stmt2[i,j,k] → [i,j,1,k]}
Writes   = {Stmt1[i,j] → C[i,j]; Stmt2[i,j,k] → C[i,j]}
Reads    = {Stmt2[i,j,k] → A[i,k]; Stmt2[i,j,k] → B[k,j]}
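The schedule assigns each statement instance a time vector, and instances execute in lexicographic order of these vectors; a small sketch (an illustration, not from the slides, with Stmt1's vector padded to equal length) makes the ordering explicit:

```python
def time_stmt1(i, j):
    """Stmt1[i,j] -> [i,j,0], padded with a trailing 0 for comparison."""
    return (i, j, 0, 0)

def time_stmt2(i, j, k):
    """Stmt2[i,j,k] -> [i,j,1,k]."""
    return (i, j, 1, k)

# For fixed (i,j), the initialization runs before every update of C[i][j],
assert time_stmt1(3, 5) < time_stmt2(3, 5, 0)
# and the whole k-loop finishes before the next (i,j) iteration starts.
assert time_stmt2(3, 5, 511) < time_stmt1(3, 6)
```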
98 / 132
The SCoP - Classical Definition
for i = 1 to (5n + 3)
for j = n to (4i + 3n + 4)
A[i-j] = A[i]
if i < (n - 20)
A[i+20] = j
I Structured control flow
  I Regular for loops
  I Conditions
I Affine expressions in:
  I Loop bounds, conditions, access functions
I Side effect free
99 / 132
AST based frameworks
What about:
I Goto-based loops
I C++ iterators
I C++0x foreach loop
Common restrictions
I Limited to subset of C/C++
I Require explicit annotations
I Only canonical code
I Correct? (Integer overflow, Operator overloading, ...)
100 / 132
Semantic SCoP
Thanks to LLVM Analysis and Optimization Passes:
SCoP - The Polly way
I Structured control flow
  I Regular for loops → anything that acts like a regular for loop
  I Conditions
I Affine expressions → expressions that calculate an affine result
I Side effect free → side effects known
I Memory accesses through arrays → arrays + pointers
101 / 132
Valid SCoPs
do..while loop
i = 0;
do {
int b = 2 * i;
int c = b * 3 + 5 * i;
A[c] = i;
i += 2;
} while (i < N);
pointer loop
int A[1024];
void pointer_loop () {
int *B = A;
while (B < &A[1024]) {
*B = 1;
++B;
}
}
102 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S
for (i = 0; i < 32; i++)
for (j = 0; j < 1000; j++)
A[j][i] += 1;
103 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S ◦ T_Interchange
for (j = 0; j < 1000; j++)
for (i = 0; i < 32; i++)
A[j][i] += 1;
104 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S ◦ T_Interchange ◦ T_StripMine
for (j = 0; j < 1000; j++)
for (ii = 0; ii < 32; ii+=4)
for (i = ii; i < ii+4; i++)
A[j][i] += 1;
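Both transformations only reorder the iteration space; a sketch (an illustration, not from the slides) checks that the transformed loop nest enumerates exactly the original (i, j) pairs, just in a different order:

```python
# Original nest: for i in [0,32), for j in [0,1000).
original = [(i, j) for i in range(32) for j in range(1000)]

# Interchanged and strip-mined nest, as in the generated code above.
transformed = [(i, j)
               for j in range(1000)        # interchanged j loop
               for ii in range(0, 32, 4)   # strip-mined i loop
               for i in range(ii, ii + 4)]

assert sorted(original) == sorted(transformed)  # same iteration space
assert original != transformed                  # but a different order
```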
105 / 132
Polly takes advantage of available parallelism
It creates automatically:
I OpenMP callsfor loops that are not surrounded by any other parallel loops
I SIMD instructionsfor innermost loops with a constant number of iterations
→ Optimizing code becomes the problem of finding the rightschedule.
106 / 132
Optimizing of Matrix Multiply
[Bar chart: speedup (0x-9x) of a 32x32 double transposed matrix multiply, C[i][j] += A[k][i] * B[j][k], on an Intel Core i5 @ 2.40GHz. Compared: clang -O3, gcc -ffast-math -O3, icc -fast, Polly with only LLVM -O3, then cumulatively adding strip mining, vectorization, hoisting, and unrolling.]
107 / 132
Automatic optimization with the Pluto algorithm
Polly provides two automatic optimizers
PoCC
I -polly-optimizer=pocc
I Original implementation
I We call the pocc binary
I More mature
I Integrated with a large set of research tools

ISL
I -polly-optimizer=isl (default)
I Reimplementation
I ISL is already linked into Polly, no additional library needed
I Still untuned heuristics
I Will be used for production
108 / 132
Polly on Polybench - Sequential execution times
[Two bar charts: speedup relative to "clang -O3" on the Polybench kernels (2mm, 3mm, adi, atax, bicg, cholesky, correlation, covariance, doitgen, durbin, dynprog, fdtd-2d, fdtd-apml, gauss-filter, gemm, gemver, gesummv, gramschmidt, jacobi-1d-imper, jacobi-2d-imper, lu, ludcmp, mvt, reg_detect, seidel, symm, syr2k, syrk, trisolv, trmm) for clang -O3, pollycc -ftile, and pollycc -ftile -fvector. Speedups reach up to 5x in the first chart and up to 16x in the second.]
109 / 132
Polly on Polybench - Parallel execution times
[Two bar charts: speedup relative to "clang -O3" on the same Polybench kernels for clang -O3, pollycc -ftile -fparallel, and pollycc -ftile -fparallel -fvector. Speedups reach up to 25x in the first chart and up to 120x in the second.]
110 / 132
Current Status
[Architecture diagram: LLVM-IR → SCoP Detection → PSCoP → Dependency Analysis → Transformations (loop transformations, data layout optimizations, expose parallelism) → Code Generation (SIMD, OpenMP, OpenCL) → LLVM-IR. A JSCoP export/import path allows manual optimization or PoCC+Pluto. Legend: usable for experiments, under construction, planned.]
111 / 132
How to proceed? Where can we copy?
I Short Vector Instructions → Vectorizing compiler
I Data Locality → Optimizing compilers, Pluto
I Thread Level Parallelism → Optimizing compilers, Pluto
I Vector Accelerators → Par4All, C-to-CUDA, ppcg
The overall problem:
112 / 132
Polly
Idea: Integrated vectorization
I Target the overall problem
I Re-use existing concepts and libraries
113 / 132
Next Steps
My agenda:
I Data-locality optimizations for larger programs (production quality)
I Expose SIMDization opportunities with the core optimizers
I Offload computations to vector accelerators
Your ideas?
I Use Polly to drive instruction scheduling for VLIW architectures
I . . .
114 / 132
Conclusion
Polly
I Language Independent
I Optimizations for Data-Locality & Parallelism
I SIMD & OpenMP code generation support
I Planned: OpenCL Generation
http://polly.grosser.es
115 / 132
Make Polly Production Quality
I Derive width of new induction variables
I Model integer wrapping correctly
I Support variable sized multi-dimensional arrays
I Bound compile-time
I Ensure there are no run-time regressions
I Testing and bug fixing
116 / 132
The size of induction variables
I D = {Stmt[i] : 0 <= i < 32}
I S = {Stmt[i] → [i]}
I TScale = {[i] → [32i]}
I S′ = S
for (int_6 i = 0; i < 32; i++)
A[i] += 1;
117 / 132
The size of induction variables
I D = {Stmt[i] : 0 <= i < 32}
I S = {Stmt[i] → [i]}
I TScale = {[i] → [32i]}
I S′ = S ◦ TScale
for (int_11 i = 0; i < 1024; i+=32)
A[i/32] += 1;
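TScale stretches the values of the induction variable from [0, 31] to {0, 32, ..., 992}, so a variable that fit in 6 signed bits now needs 11. The two loops remain semantically equivalent, which a sketch with ordinary C types (names are illustrative) can check:

```c
#include <assert.h>
#include <string.h>

enum { N = 32 };

void narrow_iv(int *A) {
  for (int i = 0; i < N; i++)           /* values 0..31: fits int_6 */
    A[i] += 1;
}

void scaled_iv(int *A) {
  for (int i = 0; i < 32 * N; i += 32)  /* values 0..992: needs int_11 */
    A[i / 32] += 1;
}
```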
118 / 132
Correctly Model Integer Wrapping
I LLVM-IR instructions of integer types have modulo semantics
I They can be flagged with nsw or nuw (No signed wrap, no unsigned wrap)
I In the absence of these flags we have three choices:
I Add a run-time check that proves absence of wrapping
I Model wrapping in our polyhedral representation
I Fix propagation of these flags (if possible)
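The first choice can be sketched as loop versioning: run a transformed loop only when a cheap up-front check proves the induction variable cannot wrap (the function name and the concrete check are illustrative, not Polly's actual code):

```c
#include <assert.h>
#include <limits.h>

/* Sum A[0..n-1], preferring a strided "optimized" version whose
 * index grows as step * k.  The guard proves the index never
 * exceeds INT_MAX, i.e. the induction variable cannot wrap. */
long long sum_with_guard(const int *A, int n, int step) {
  if (n > 0 && step > 0 && (long long)step * n <= INT_MAX) {
    long long s = 0;                  /* wrap-free by the guard */
    for (int i = 0; i < step * n; i += step)
      s += A[i / step];
    return s;
  }
  long long s = 0;                    /* conservative fallback */
  for (int k = 0; k < n; k++)
    s += A[k];
  return s;
}
```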
119 / 132
Multi dimensional arrays
#define N 1024
void foo(int n, float A[][N], float **B, float C[][n]) {
  A[5][5] = 1;
  *(B + 5 * n + 5) = 1;
  C[5][5] = 1;
}
I A - Constant Size → already linear
I B - Self-made multi dimensional arrays
I Guess & Prove
I Guess & Runtime Check
I C - C99 Variable Length Arrays / Fortran Arrays
I Guess & Prove
I Guess & Runtime Check
I Pass information from Clang / GFORTRAN
120 / 132
Outline
The LLVM Project
LLVM Core Libraries and ToolsCore ToolsLLVM-IRLLVM PassesStatus Loop optimizations in LLVMScalar Evolution
Polly
ContributingSubmitting PatchesPolly - Open ProjectsGoogle Summer of Code
121 / 132
LLVM Development Model
I Development only on trunk
I No stable branches
I No stable API
I Small, incremental changes
I 6 month timed release cycle
→ trunk is normally well tested!
→ Private branches are hard to maintain!
122 / 132
Preparing a Patch
I As small as possible
I Unrelated changes → Separate patches
I No unrelated style changes
I Include a test case
I Follow style of the surrounding code
I Run make check
Goal: Make patch review trivial
123 / 132
Contributing a Patch
I Post patch to [email protected]
I Wait for patch review
I Address possible comments
I Wait for OK to commit
I Commit (or ask for commit)
Trivial changes can be committed directly and reviewed post-commit.
124 / 132
Getting Patch Reviews
This is a game: Know how to play it!
I Establish and maintain a track record
I Continuously contribute simple changes
I Perform code reviews regularly
I Give fast feedback to code reviews
I Play review ping-pong! If you review patches of people working in your area, they are more likely to review yours.
125 / 132
Polly - Open Projects
I Run-Time Check for Absence of Aliasing
I SPEC 2000/2006 Analysis
I Register Tiling
I Loop Interchange Heuristic
I OpenSCoP Import/Export
I Connect Pluto with Polly
126 / 132
GSoC - Google Summer of Code
I Your Project in an Open Source Organization
I 3 Month (June/July/August)
I Earn 5000 US $ (≈ 270.000 Rs)
I 180 Organizations (including LLVM)
I http://code.google.com/soc
127 / 132
GSoC - Is it Research?
I No, but it is strongly related
I Transform research prototypes into production code
I Evaluate your research ideas in practice
I Have your ideas used by millions of users (Your code running on every iPhone?)
I Improve the infrastructure/tools you use for research
128 / 132
GSoC - Projects
I 7 GSoC Students have been working on Graphite
I Tobias Grosser - Transform GIMPLE to GRAPHITE (2008)
I Li Feng - Automatic parallelization in Graphite (2009)
I 4 GSoC Students have been working on Polly
I Hongbin Zheng - SCoP Detection (2010)
I Ragesh Aloor - Memory Access Transformations (2011)
I Yabin Hu - GPGPU Code Generation Infrastructure (2012)
I Junqi Deng - A Data Prefetching Transformation (2012)
I Other interesting students
I Justin Holewinski - LLVM PTX Backend (2011)
129 / 132
GSoC - Opportunities
I 7 GSoC Students have been working on Graphite
I Tobias Grosser - PhD, Ecole Normale Superieure, France
I Li Feng - PhD, Ecole Normale Superieure, France
I 4 GSoC Students have been working on Polly
I Hongbin Zheng - Internship at University of Illinois
I Ragesh Aloor - Invited to CGO and IMPACT 2011 in France
I Yabin Hu - in progress
I Junqi Deng - in progress
I Other interesting students
I Justin Holewinski - Internship at NVIDIA
130 / 132
GSoC - Preparing an Application
I Get in touch early (November, December)
I Discuss your ideas and get feedback from core developers
I Get your hands dirty, prototype some ideas
I Start contributing small patches
→ A good application has a high chance of being accepted, but it also requires proper preparation.
131 / 132
Conclusion
I LLVM an interesting platform for High-Level Optimizations
I Good Scalar Optimizations and Infrastructure
I Polly provides modern, uniform infrastructure for Loop Optimizations
I Many Possibilities to Contribute
132 / 132