LLVM
Tobias Grosser
ENS - INRIA
July 23, 2012
1 / 132
Tobias Grosser
I PhD student with Prof. Albert Cohen at INRIA / Ecole Normale Superieure, France
I Interests: High-Level Compiler Optimizations, SIMDization, Accelerators, Polyhedral Model
I Open Source
  I LLVM: 2 1/2 years, Polly Code Owner
  I GCC: 5 years, Graphite Reviewer
  I Others: CLooG, isl, clang_complete, ppcg, ...
I Worked at: AMD, ARM, Ohio State University
I Google Europe Fellowship in Efficient Computing
I http://www.grosser.es
2 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
3 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
4 / 132
LLVM
I A Compiler Infrastructure Project
I Sub-Projects:
  I LLVM-Core: Optimizer and Target Code Generator
  I clang: C/C++/Objective-C Front-End
  I libc++: C++ Standard Library
  I dragonegg: GCC based Front-End
  I Polly: High Level Optimizer
  I ...
I BSD-like License
I Developed as a Set of Modular Libraries
I Modern C++ Code Base
5 / 132
LLVM Users and Contributors
I Industry
  Adobe, AMD, Apple, ARM, Google, IBM, Intel, Mozilla, Qualcomm, Samsung, Xilinx, ...
I Research
  3440 publications on Google Scholar
I Open Source Community
  Many (I did not count)
6 / 132
Classical Compilers
I clang
  I Modern C/C++/Objective-C Compiler
  I Default for Apple's OS-X and iOS
I dragonegg
  I C/C++/FORTRAN/ADA/Go/D/... Compiler
  I GCC as Front-end
  I LLVM as Optimizer and Back-end
7 / 132
Emscripten - An exotic Back-end
I LLVM-IR to JavaScript Compiler
I Developed by Mozilla
I Translate C/C++ as well as the needed Run-time Libraries
I Translate an Interpreter written in C/C++
I Applications:
  I SQLite
  I h264 decoder
  I Sauerbraten Ego Shooter
  I Python/Lua/Ruby Shell
I http://www.emscripten.org
8 / 132
GHC - Functional Languages
I LLVM Back-end for the Glasgow Haskell Compiler
I Simplified Back-end Implementation
I Performance for Computation Intensive Code
Graphics taken from David Terei
9 / 132
The Python Experience
I unladen-swallow: A C-Python LLVM Back-end
I Goal: A faster Python
I Difficulties:
  I LLVM JIT never used for this kind of language
  I Expressing a high-level language yields long compile times
"A simple `def add(x, y): return x + y` was close to 100 basic blocks due to all the implicit method calls and fallback paths to the interpreter"
Pymothoa:
I Explicit types provided by user
I Optimizations on type-instantiated code
I Non-core code still uses normal python JIT
10 / 132
Accelerators
I OpenCL
  I OpenCL Compilers from AMD, NVIDIA, Intel, RapidMind and ARM are based on LLVM
  I clang OpenCL support is largely Open Source
  I Open Source Back-ends for NVIDIA and AMD GPUs
I Google RenderScript
  I clang as a Front-end
  I LLVM-IR as Exchange Format
  I RenderScript Compiler is Open Source
11 / 132
High Level Synthesis
LLVM-IR to Hardware Description Language
I LegUp
  developed by University of Toronto and Altera Inc.
I C-To-Verilog
  developed by IBM Research
I AutoESL
  developed by UCLA, bought by Xilinx
I Trident
  Imperial College London and two US National Labs
12 / 132
Tobi’s Personal Observations
I LLVM is part of many different compilers
I Optimizations for LLVM have a large impact
I LLVM runs on (high performance) embedded devices
I Having CPU, GPU and FPGA back-ends in a single compiler is great when targeting heterogeneous architectures.
13 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
14 / 132
A static compile flow
main.cpp / moduleA.f90 / moduleB.c
  | Frontends (clang, llvm-gcc, llvm-gfortran)
  v
main.ll / moduleA.ll / moduleB.ll
  | Optimizer (llvm-opt)
  v
main.opt.ll / moduleA.opt.ll / moduleB.opt.ll
  | Linker (llvm-link)
  v
program.ll
  | Optimizer (llvm-opt)
  v
program.opt.ll
  | Target Code Generator (llvm-llc)
  v
program.s
  | Assembler (llvm-mc)
  v
program.o
  | System Linker (ld)
  v
program.exe
15 / 132
LLVM-IR
I Base of all analyses and optimization passes
I Input/Output of most tools
I Target independent (but front-ends can make it target dependent)
16 / 132
LLVM-IR - Three equivalent representations
I In Memory: C++ data structures
I Bitcode: Binary file (.bc)
I Human Readable: Text file (.ll)
Translate one representation to another:
> opt -S program.bc -o program.ll
> opt program.ll -o program.bc
17 / 132
LLVM-IR - Generation from C code
> clang -S -emit-llvm main.c -o main.ll
main.c
int main() {
return 42;
}
main.ll
; ModuleID = ’main.ll’
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-[...]"
target triple = "x86_64-unknown-linux-gnu"
define i32 @main() nounwind {
ret i32 42
}
18 / 132
LLVM-IR - Generate target code
> llc main.ll -o main.s && gcc main.s -o main.exe
> ./main.exe
> echo $?
42
> lli main.ll
> echo $?
42
I llc - LLVM to Assembly compiler
I lli - Just in time compiler
19 / 132
Our first function
Create a function that calculates n2
define i32 @pow(i32 %number) {
%sqrt = mul i32 %number, %number
ret i32 %sqrt
}
define i32 @main() nounwind {
%result = call i32 @pow(i32 7)
ret i32 %result
}
20 / 132
Optimize our first function
> opt -O3 pow.ll -o pow.opt.ll -S
Optimized function
define i32 @pow(i32 %number) nounwind readnone {
%sqrt = mul i32 %number, %number
ret i32 %sqrt
}
define i32 @main() nounwind readnone {
ret i32 49
}
21 / 132
Multiple modules
Combining multiple modules to enable inter-module optimizations:
> llvm-link main.ll pow.ll -o combined.ll -S
22 / 132
bugpoint
I Automatic test case reduction tool
I Works for crashes in opt and llc
I Can also extract miscompiles
> llvm-link main.ll pow.ll -o combined.ll -S
> opt -O3 combined.ll
!! CRASH !!
> bugpoint -O3 combined.ll -S
[...]
Reduced test case created.
You can reproduce the bug with
’opt -basicaa -indvars bugpoint-reduced-simplified.ll’
23 / 132
LLVM-IR - Overview
I Low level assembly like language
I Register machine, infinite number of (named) registers
I Each instruction defines a new (named) register
I Load/Store Architecture
I Defined at http://llvm.org/docs/LangRef.html
24 / 132
LLVM-IR - Types
I LLVM-IR is strongly typed
I Each register/pointer/function has an associated type
I No implicit type casts
I A program without casts is type-safe in the absence of memory access errors (e.g. array overflow)
25 / 132
LLVM-IR - Type classes
I Primitive
  integer, floating point, label, metadata, void, x86mmx
I Derived
  array, function, pointer, structure, packed structure, vector, opaque
First class types - Non first class types
26 / 132
LLVM-IR - Integer types
I Fixed bitwidth
I Any bit width from 1 bit to 2^23 − 1
I Larger types as function parameters → backend dependent
I Signedness not defined
i1 ; Boolean type
i8 ; char
i32 ; 32 bit integer
i64 ; 64 bit integer
i121212 ; Very large integer
27 / 132
LLVM-IR - Floating point types
float ; 32 bit
double ; 64 bit
fp128 ; 128 bit (112-bit mantissa)
x86_fp80 ; 80 bit (X87)
ppc_fp128 ; 128 bit (two 64-bits)
28 / 132
LLVM-IR - Label type
I A (named) reference for a basic block
define i1 @foo() {
start:
br label %next
next:
br label %return
return:
ret i1 0
}
29 / 132
LLVM-IR - Values
I Created through an instruction
I Global values
I Constants
I Undefined
%result = add i32 5, 10
i1 0
i32 15
float undef
30 / 132
LLVM-IR - Constants
i1 true ; Boolean constants
i1 false
i32 -1 ; equal to 2^32 − 1
float 123.421 ; Exact decimal notation
              ; (!) 1.3 has an infinite
              ; binary representation
float 1.23421e+2 ; Exponential notation
double 0x432ff973cafa8000 ; Hexadecimal notation
<type> zeroinitializer ; <type> can be any type
31 / 132
LLVM-IR - Instructions
I Calculations
I Vector/Structure management
I Type conversion
I Memory management
I Control flow instructions
32 / 132
LLVM-IR - Computational instructions
I Side effect free
I Take values as input
I Create a new register value
%sum = add i32 %a, %b
%product = fmul float %a, %b
%unsigned_div = udiv i32 %a, %b
%signed_div = sdiv i32 %a, %b
%division = fdiv float %a, %b
33 / 132
LLVM-IR - Computational instructions - Comparisons
%equal = icmp eq i32 %a, %b
%not_equal = icmp ne i5 %c, %d
%signed_less_than = icmp slt i3 %a, %b
%unsigned_less_than = icmp ult i5 %a, %b
34 / 132
LLVM-IR - Control flow instructions
I (Un)Conditional branch
I Switch
I Return
I Indirect branch, Invoke, Unwind
start:
br i1 true, label %left, label %right
left:
br label %join
right:
br label %join
join:
ret i32 %joinedValue
35 / 132
LLVM-IR - PHI instruction
I Implements the Φ SSA instruction
start:
br i1 true, label %left, label %right
left:
%plusOne = add i32 0, 1
br label %join
right:
br label %join
join:
%joinedValue = phi i32 [ %plusOne, %left],
[ -1, %right]
ret i32 %joinedValue
36 / 132
LLVM-IR - Call instruction
I Calls a function
I Saves the return value in a new register
%result = call i32 @pow(i32 7)
37 / 132
LLVM-IR - Type classes
I Primitiveinteger, floating point, label, metadata, void, x86mmx,
I Derivedarray, function, pointer, structure, packed structure, vector,opaque
First class types - Non first class types
38 / 132
LLVM-IR - Array type
I Set of elements arranged sequentially in memory.
I Takes a type and a constant size
I Only fixed-size multi-dimensional arrays.
I No indexing restrictions by type system.
[20 x i1] ; Array of 20 boolean elements
[100 x float] ; Array of 100 float elements
[20 x [100 x i32]] ; Array of 20 arrays of
; 100 i32 elements
[0 x float] ; Zero element array. Can be used
; to implement variable sized arrays
39 / 132
LLVM-IR - Struct Type
I Collection of data elements in memory.
I Padding matches the ABI of the underlying processor.
I Use a packed structure to remove padding.
{float, i64}
{float, {double, i3}}
{float, [2 x i3]}
<{float, [2 x i3]}> ; Packed structure.
; Removes padding
40 / 132
LLVM-IR - Vector type
I Vector of elements
I Used to apply a single instruction on various elements
I Arbitrary width
<4 x float>
<2 x double>
<123 x i3> ; Probably generates inefficient code
41 / 132
LLVM-IR - Pointer type
I Gives a location in memory
I void pointers or pointers to labels are not permitted; use i8* instead.
I Optional address space qualifier
float* ; Pointer to a float
[5 x float]* ; Pointer to an array
<2 x float>* ; Pointer to a vector
float addrspace(5)* ; Pointer to a float in
; address space 5
42 / 132
LLVM-IR - Named Type
I Types can be named
I Names are aliases for types
I Names are not part of the types
%intv4 = type <4 x i32>
%intv8 = type <8 x i32>
%floatptr = type float*
%mytype = type { %mytype*, i32 }
43 / 132
LLVM-IR - Constants
[i1 true, i1 false] ; Constant array
<i3 5, i3 10> ; Constant vector
{i1 true, float 15} ; Constant structure
<2 x i1> zeroinitializer ; Zero vector
44 / 132
LLVM-IR - Instructions
I Computational instructions
I Vector/structure management
I Type conversion
I Memory management
I Control flow instructions
45 / 132
LLVM-IR - Computational instructions
I Applied element-wise on vector types
%sum = add <2 x i32> %a, %b
%product = fmul <4 x float> %a, %b
%equal = icmp eq <2 x i32> %a, %b
%not_equal = icmp ne <3 x i5> %c, %d
46 / 132
LLVM-IR - Vector management
I Get and set an element
I Shuffle elements by a constant shuffle mask
extractelement <4 x float> %vec, i32 0
; yields float
insertelement <4 x float> %vec, float 1, i32 0
; yields <4 x float>
shufflevector <4 x float> %v1, <4 x float> %v2,
<4 x i32> <i32 0, i32 4, i32 1, i32 5>
; yields <4 x float>
47 / 132
LLVM-IR - Array/Structure management
I Extract an element from a structure/array
I Indices need to be in bounds
I Indices are constants
extractvalue {i32, float} %agg, 0
; yields i32
extractvalue {i32, {float, double}} %agg, 0, 1
; yields double
extractvalue [2 x i32] %array, 0
; yields i32
48 / 132
LLVM-IR - Array/Structure management II
I Insert an element into a structure/array.
%agg1 = insertvalue {i32, float} undef, i32 1, 0
; yields {i32 1, float undef}
%agg2 = insertvalue {i32, float} %agg1, float %val, 1
; yields {i32 1, float %val}
%aggA = insertvalue {i32, float} zeroinitializer,
i32 1, 0
; yields {i32 1, float 0}
49 / 132
LLVM-IR - Allocate memory
I alloca - Allocate memory on the stack
I malloc - Use C stdlib memory allocator
%ptr = alloca i32
%ptr = alloca i32, i32 4
%ptr = alloca i32, i32 4, align 1024
%ptr = alloca i32, align 1024
; All yield i32*
%mallocP = call i8* @malloc(i32 %objectsize)
; yields i8* (void pointer)
50 / 132
LLVM-IR - Load/Store memory
I The only operations that can access memory
%ptr = alloca i32
store i32 3, i32* %ptr
%val = load i32* %ptr
51 / 132
LLVM-IR - Select operation
I Select one value depending on a condition
I a = condition ? valueOne : valueTwo
I No branch (mis)prediction necessary
%X = select i1 true, i8 17, i8 42
; yields i8:17
52 / 132
LLVM-IR - Type conversion
I Size conversion int ↔ int
I Size conversion float ↔ float
I float ↔ int
I int ↔ ptr
I Bitcast - Do not change bit representation
trunc i32 257 to i8 ; yields i8:1
zext i32 257 to i64 ; yields i64:257
sext i8 -1 to i16 ; yields i16:-1 (bit pattern 0xFFFF)
bitcast <2 x i32> %V to i64 ; yields i64: same bits as %V
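To sanity-check these conversion rules, here is a small sketch (an illustration, not from the slides) that models trunc/zext/sext on plain Python integers; the asserted values match the examples above:

```python
def trunc(v, to_bits):
    """Truncation keeps only the lowest to_bits bits."""
    return v & ((1 << to_bits) - 1)

def zext(v, from_bits):
    """Zero-extension never changes the unsigned value."""
    return v & ((1 << from_bits) - 1)

def sext(v, from_bits, to_bits):
    """Sign-extension replicates the sign bit into the new upper bits."""
    v &= (1 << from_bits) - 1
    if v >> (from_bits - 1):  # sign bit set
        v |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return v

assert trunc(257, 8) == 1                 # trunc i32 257 to i8
assert zext(257, 32) == 257               # zext i32 257 to i64
assert sext(-1 & 0xFF, 8, 16) == 0xFFFF   # sext i8 -1 to i16 (i16:-1)
```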
53 / 132
LLVM Passes
I Analysis Passes
  (-domtree, -regions, -basicaa)
I Transformation Passes
  I Canonicalization Passes
    (-reg2mem, -indvars, -loop-simplify, -mergereturn)
  I Optimization Passes
    (-mem2reg, -tailcallelim, -constprop, -gvn, -instcombine, -instsimplify)
I Other Passes
  (-view-cfg, -view-cfg-only, -view-dom, -instnamer, -verify)
Get a complete list with "opt -help"
54 / 132
LLVM Pass Types
I Module Pass
I CallGraphSCC Pass
I Function Pass
I Region Pass / Loop Pass
I Basic Block Pass
I Machine Function Pass
55 / 132
The LLVM Pass Philosophy
I Analysis passes provide high-level abstractions
I Canonicalization passes create a canonical representation
I Transformation passes only work on the canonical representation
56 / 132
-instsimplify
I Remove redundant instructions
I Cannot create new instructions
(X + (−1)) + 1 → X
define i32 @add1(i32 %x) {
%l = add i32 %x, -1
%r = add i32 %l, 1
ret i32 %r
}
to
define i32 @add1(i32 %x) {
ret i32 %x
}
57 / 132
-instcombine
I Combine redundant instructions
I Can create new instructions
(x & z) ^ (y & z) → (x ^ y) & z
define i32 @test1(i32 %x, i32 %y, i32 %z) {
%tmp1 = and i32 %z, %x
%tmp2 = and i32 %z, %y
%tmp3 = xor i32 %tmp1, %tmp2
ret i32 %tmp3
}
to
define i32 @test1(i32 %x, i32 %y, i32 %z) {
%tmp1 = xor i32 %x, %y
%tmp2 = and i32 %tmp1, %z
ret i32 %tmp2
}
58 / 132
Regression tests
I Run with make check
I Stored in src-dir/test
I Each transformation has its own directory
59 / 132
lit.py / llvm-lit
I LLVM integrated tester
I Runs the LLVM and Clang test suite
I Used to run individual tests
I bin/llvm-lit in the build directory
~/llvm_build/bin/llvm-lit ~/llvm_git/test/Analysis/Dominators/
-- Testing: 4 tests, 4 threads --
PASS: LLVM :: Analysis/Dominators/2006-10-02-BreakCritEdges.ll
FAIL: LLVM :: Analysis/Dominators/2007-07-12-SplitBlock.ll
XPASS: LLVM :: Analysis/Dominators/2007-07-11-SplitBlock.ll
XFAIL: LLVM :: Analysis/Dominators/2007-01-14-BreakCritEdges.ll
UNSUPPORTED: LLVM :: Analysis/Dominators/other.ll
60 / 132
A single test file
I Run line specifies the test command
I %s is replaced with the test file itself
I Test fails if the command has a non-zero return value
; RUN: opt -mypass %s
define i32 @test1(i32 %A) {
%B = xor i32 %A, 12345
ret i32 %B
}
61 / 132
FileCheck / not
I Use ’FileCheck’ to check for expected transformations
; RUN: opt < %s -instcombine -S | FileCheck %s
define i1 @test1(i32 %A) {
; CHECK: @test1
; CHECK-NEXT: %C = icmp slt i32 %A, 0
%B = xor i32 %A, 12345
%C = icmp slt i32 %B, 0
ret i1 %C
}
I Use ’not’ to switch return codes
; RUN: not opt %s -S
define i1 @test1(i32 %A) {
; Expected failure because of type mismatch
ret i1 %A
}
62 / 132
Instcombine
I LLVM-IR level peephole optimization
I Run with opt -instcombine
I Source code in lib/Transforms/InstCombine
I FunctionPass
63 / 132
Instcombine - Architecture
while (optimizations found && not timeout) {
  I Search the source code for known patterns
  I Create a simplification for them
  I Replace the original set of instructions with the simplification
}
64 / 132
Instcombine - Instruction matching
(A & B) ^ (A | B) → A ^ B

%0 = and i32 %A, %B
%1 = or i32 %A, %B
%2 = xor i32 %0, %1

[Figure: the expression tree XOR(AND(%A, %B), OR(%A, %B)) is rewritten to XOR(%A, %B)]
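As a quick numeric cross-check (an illustration, not part of the slides), the rewrite (A & B) ^ (A | B) → A ^ B can be verified over random 32-bit values:

```python
import random

# The identity instcombine exploits: for any bit, (a&b)^(a|b) == a^b.
random.seed(0)
for _ in range(1000):
    A = random.getrandbits(32)
    B = random.getrandbits(32)
    assert (A & B) ^ (A | B) == A ^ B
```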
65 / 132
Support/PatternMatch.h
I Match a tree of LLVM instructions
I Capture parts of the instruction tree
Value *Exp = ...
Value *X; ConstantInt *C1;
// Exp == (X | C1)
if (match(Exp, m_Or(m_Value(X), m_ConstantInt(C1)))) {
  ... Pattern is matched and variables are bound ...
}
66 / 132
Support/PatternMatch.h
I Match and ignore
  I m_Value() - Any value
  I m_ConstantInt() - Any integer constant
  I m_Undef() - Any undefined value
  I m_Zero() - Any zeroinitializer
  I m_One() - Integer or vector with all elements = 1
  I m_AllOnes() - Integer or vector with all bits = 1
  I m_SignBit() - Integer or vector with only the sign bit set
  I m_Power2() - Integer or vector where all elements are a power of two
67 / 132
Support/PatternMatch.h
I Match and capture
  I m_Value(Value *&V) - Any value
  I m_ConstantInt(ConstantInt *&I) - Any integer constant
  I m_Constant(Constant *&C) - Any constant
I Match and compare
  I m_Specific(Value *V) - Exactly the value V
68 / 132
Support/PatternMatch.h
I Match binary operators
  I m_Add(LHS, RHS) - Match an add instruction
  I m_Sub(LHS, RHS) - Match a sub instruction
  I ...
I Match unary operators
  I m_SExt(Operand) - Match a sext instruction
  I m_Neg(Operand) - Match an integer negate
  I ...
I Matchers for control flow
69 / 132
How to create Instructions?
I include/llvm/InstrTypes.h defines functions to create instructions
I Three ways to create an instruction:
  I Just create the instruction
  I Create the instruction and insert it before another instruction
  I Create the instruction and insert it at the end of a basic block
Instruction *I1 = BinaryOperator::CreateOr(A, B);
Instruction *I2 = BinaryOperator::CreateAnd(A, B);
70 / 132
IRBuilder
I include/llvm/Support/IRBuilder.h
I A helper to automatically
  I Create and insert instructions
  I Get common types
  I Get common constants
Value *V = Builder->CreateOr(A, B);
Type *T = Builder->getInt32Ty();
ConstantInt *I = Builder->getInt32(512);
71 / 132
Loop Optimizations within LLVM
Analyses
I Loop Detection
I Scalar Evolution
Canonicalization
I Loop Simplification
I Induction Variable Simplification
Transformations
I Loop Rotation
I Loop Idiom Recognition (memcpy, . . . )
I Loop Deletion
I Loop Unrolling (also partial unrolling)
I Intra Basic Block Vectorization (in testing)
72 / 132
What is Scalar Evolution?
I A compiler analysis
I Calculates a closed form expression for the values of scalars at different loop iterations.
I Used for loop trip counts, instruction combination, strength reduction, loop canonicalization, ..., Polly.
Example
scalar = A;
for (int j = 0; j < N; j++)
scalar = scalar + B;
scalar = A + j*B = {A,+,B}<j>
73 / 132
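The closed form can be checked against the loop directly; the following sketch (an illustration, not from the slides) compares {A,+,B}<j>, i.e. A + j*B, with the value the loop computes:

```python
def scalar_at(A, B, j):
    """Closed form {A,+,B}<j>: value of scalar at the start of iteration j."""
    return A + j * B

def scalar_by_loop(A, B, j):
    """Run the actual loop: scalar = A; then j times scalar = scalar + B."""
    scalar = A
    for _ in range(j):
        scalar = scalar + B
    return scalar

for j in range(10):
    assert scalar_at(3, 7, j) == scalar_by_loop(3, 7, j)
```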
History
Research
I Bachmann 1994: Chains of recurrences - A method to expedite the evaluation of closed-form functions
I Van Engelen 2000: Chains of recurrences for loop optimization
I Pop 2003: Analysis of induction variables using chains of recurrences
Compilers
I GCC: first commit 20 June 2004
I LLVM: first commit 2 April 2004
74 / 132
Chain of Recurrence / SCEV
Building blocks
I Operations: +, *, /, sext, zext, trunc, smax, umax
I Constant, Sizeof, Alignof
I Unknown Value, Parameter
I Add Recurrences: {expr,+,expr}<Loop>
75 / 132
A simple example¹
define i64 @foo(i64 %a, i64 %b, i64 %c) {
%t0 = add i64 %b, %a
%t1 = add i64 %t0, 7
%t2 = add i64 %t1, %c
ret i64 %t2
}
; SCEV: (7 + %a + %b + %c)
¹ Taken from Dan Gohman's LLVM Developers' Meeting 2009 presentation
76 / 132
Two dimensional array - Without any loops²
double *bar(double a[10][10], long b, long c) {
return &a[b * 3 + 7][c + 5];
}
define double* @bar([10 x double]* %a, i64 %b, i64 %c) {
%bx3 = mul i64 %b, 3
%bx3a7 = add i64 %bx3, 7
%ca5 = add i64 %c, 5
%z = getelementptr [10 x double]* %a,
i64 %bx3a7, i64 %ca5
ret double* %z
}
; SCEV: (((75 + %c + (30 * %b)) * sizeof(double)) + %a)
; SCEV: (600 + (8 * %c) + (240 * %b) + %a)
² Taken from Dan Gohman's LLVM Developers' Meeting 2009 presentation
77 / 132
Add Recurrences
General form: {base,+,stride}<loop>
void foo(long n, double *p) {
for (long i = 0; i < n; ++i)
p[i] = 0.0;
}
As a SCEV: {%p,+,8}<%for.body>
Optionally, without TargetData:
{%p,+,sizeof(double)}<%for.body>
78 / 132
Pointer Loop - CFG
void pointer_loop () {
int *B = A;
while (B < &A[1024]) {
*B = 1;
++B;
}
}
CFG for 'pointer_loop' function:

bb.nph:
  br label %while.body

while.body:
  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %while.body ]
  %B.02 = getelementptr [1024 x i32]* @A, i64 0, i64 %indvar
  store i32 1, i32* %B.02, align 4
  %tmp = shl i64 %indvar, 2
  %ptrincdec.idx = add i64 %tmp, 4
  %cmp = icmp slt i64 %ptrincdec.idx, 4096
  %indvar.next = add i64 %indvar, 1
  br i1 %cmp, label %while.body, label %while.end

while.end:
  ret void
79 / 132
Pointer Loop - Scalar Evolution
%indvar:       {0,+,1}<while.body>   Exits: 1023
%B.02:         {@A,+,4}<while.body>  Exits: (4092 + @A)
%indvar.next:  {1,+,1}<while.body>   Exits: 1024
backedge-taken count: 1023

[Same CFG for 'pointer_loop' as on the previous slide]
80 / 132
Pointer Loop II - CFG
void c(long* p, long n) {
long i;
for (i = 0; i < n; i++)
*(p+i) = i;
}
CFG for 'compute' function:

entry:
  %cmp1 = icmp sgt i64 %n, 0
  br i1 %cmp1, label %for.body, label %for.end

for.body:
  %0 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %add.ptr = getelementptr i64* %p, i64 %0
  store i64 %0, i64* %add.ptr, align 8
  %inc = add nsw i64 %0, 1
  %exitcond = icmp eq i64 %inc, %n
  br i1 %exitcond, label %for.end, label %for.body

for.end:
  ret void
81 / 132
Pointer Loop II - Scalar Evolution
%0:        {0,+,1}<for.body>  Exits: (-1 + n)
%add.ptr:  {p,+,8}<for.body>  Exits: (-8 + (8 * n) + p)
%inc:      {1,+,1}<for.body>  Exits: n
backedge-taken count: (-1 + n)

[Same CFG for 'compute' as on the previous slide]
82 / 132
Linearized Multidimensional Array - CFG
void c(long* p, long n_row,
long n_col) {
long *ptr_row;
long *ptr_col;
long row, col;
ptr_row = p;
for (row=0; row<n_row; row++) {
ptr_col = ptr_row;
for (col=0; col<n_col; col++)
S: *(ptr_col++) = row+col;
ptr_row += n_col;
}
}

CFG for 'pointer_loop_linearized_multidim' function:

entry:
  %cmp4 = icmp sgt i64 %n_row, 0
  %cmp71 = icmp sgt i64 %n_col, 0
  %or.cond = and i1 %cmp4, %cmp71
  br i1 %or.cond, label %bb.nph.us, label %for.end18

bb.nph.us:
  %row.06.us = phi i64 [ %inc17.us, %for.end.us ], [ 0, %entry ]
  %tmp10 = mul i64 %row.06.us, %n_col
  br label %for.body8.us

for.body8.us:
  %col.03.us = phi i64 [ 0, %bb.nph.us ], [ %inc.us, %for.body8.us ]
  %tmp11 = add i64 %tmp10, %col.03.us
  %ptr_col.02.us = getelementptr i64* %p, i64 %tmp11
  %add.us = add i64 %row.06.us, %col.03.us
  store i64 %add.us, i64* %ptr_col.02.us, align 8
  %inc.us = add nsw i64 %col.03.us, 1
  %exitcond = icmp eq i64 %inc.us, %n_col
  br i1 %exitcond, label %for.end.us, label %for.body8.us

for.end.us:
  %inc17.us = add nsw i64 %row.06.us, 1
  %exitcond9 = icmp eq i64 %inc17.us, %n_row
  br i1 %exitcond9, label %for.end18, label %bb.nph.us

for.end18:
  ret void
83 / 132
Linearized Multidimensional Array - Scalar Evolution
%ptr_col.02.us:
  {{p,+,(8 * n_col)}<bb.nph.us>,+,8}<for.body8.us>
  Exits: {(-8 + (8 * n_col) + p),+,(8 * n_col)}<bb.nph.us>
%col.03.us:
  {0,+,1}<for.body8.us>
  Exits: (-1 + n_col)
backedge-taken count "for.body8.us": (-1 + n_col)
backedge-taken count "bb.nph.us": (-1 + n_row)
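The nested add recurrence above can be evaluated by hand; this sketch (an illustration, not from the slides, treating `p` as a plain byte address) checks that it reproduces the linearized index p[row * n_col + col]:

```python
def address(p, n_col, row, col):
    """Evaluate {{p,+,8*n_col}<outer>,+,8}<inner> at (row, col).

    Element size is 8 bytes (i64)."""
    base_of_row = p + 8 * n_col * row  # outer recurrence {p,+,8*n_col}
    return base_of_row + 8 * col       # inner recurrence {.,+,8}

# Must agree with the linearized access *(p + row*n_col + col):
p, n_col = 1000, 16
for row in range(4):
    for col in range(n_col):
        assert address(p, n_col, row, col) == p + 8 * (row * n_col + col)
```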
84 / 132
Use scalar evolution analysis - The opt binary
# Generate LLVM-IR
> clang -S -emit-llvm your-prog.c -o your-prog.ll
# Canonicalize the code. Most probably not all of -O3
# is needed, but -mem2reg is sufficient.
> opt -O3 your-prog.ll > your-prog.preopt.ll
# Run the analysis
> opt -scalar-evolution -analyze your-prog.preopt.ll
85 / 132
Use Scalar Evolution Analysis
void YourAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
AU.setPreservesAll();
AU.addRequired<ScalarEvolution>();
}
bool YourAnalysis::runOnFunction(Function &F) {
  ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
  // Get the SCEV for the first instruction of the function.
  Instruction *FirstInstruction = F.begin()->begin();
  const SCEV *Evolution = SE.getSCEV(FirstInstruction);
  if (isa<SCEVConstant>(Evolution))
    errs() << "The first instruction is a constant SCEV";
  return false;
}
86 / 132
SCEVVisitor, SCEVExpander
SCEVVisitor
A visitor that walks over a SCEV and allows analysing or modifying it.

SCEVExpander
Recreates LLVM-IR from a (possibly modified) SCEV. This is currently only well tested within the existing scalar loop optimizers.
87 / 132
Vectorization status
88 / 132
Conclusion - Loop Optimizations in LLVM
I Scalar evolution provides induction variable analysis
I Good set of scalar loop optimizations
I Almost no optimizations which change the loop structure
I First draft of basic block vectorization
89 / 132
Outline
The LLVM Project
LLVM Core Libraries and Tools
  Core Tools
  LLVM-IR
  LLVM Passes
  Status of Loop Optimizations in LLVM
  Scalar Evolution
Polly
Contributing
  Submitting Patches
  Polly - Open Projects
  Google Summer of Code
90 / 132
The idea of Polly?
We want:
I Fast and power-efficient code
We have:
I SIMD, Caches, Multi-Core, Accelerators
But:
I Optimized code is needed
I Manual Optimization is complex and not performance portable
I Architectures are too diverse to optimize ahead of time
Goal:
I Automatic high-level optimizations for heterogeneousarchitectures
91 / 132
Get Polly
I Install Pollyhttp://polly.grosser.es/get_started.html
I Load Polly automatically
alias clang clang -Xclang -load -Xclang LLVMPolly.so
alias opt opt -load LLVMPolly.so
I Default behaviour preserved
I clang/opt now provide options to enable Polly
92 / 132
Optimize a program with Polly
gemm.c [1024 x 1024 (static size), double]
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++) {
C[i][j] = 0;
for (int k = 0; k < K; k++)
C[i][j] += A[i][k] * B[k][j];
}
$ clang -O3 gemm.c -o gemm.clang
$ time ./gemm.clang
real 0m15.336s
$ clang -O3 gemm.c -o gemm.polly -mllvm -polly
$ time ./gemm.polly
real 0m2.144s
93 / 132
The Architecture
LLVM IR LLVM IRPSCoP
SCoP Detection
Code Generation
JSCoP
* Loop transformations* Data layout optimizations* Expose parallelism
Transformations
Manual Optimization / PoCC+Pluto
DependencyAnalysis
Export Import
SIMD
OpenMP OpenCL
94 / 132
Can Polly analyze our code?
$ clang -O3 gemm.c \
-mllvm -polly-show-only \
-mllvm -polly-detect-only=gemm
I Highlight the detected Scops
I Only check in function ’gemm’
[Scop graph for 'gemm': entry → entry.split → for.cond1.preheader → for.body3 → for.body8 → for.inc22 → for.inc25 → for.end27, with the detected SCoP highlighted]
95 / 132
Some code can not be analyzed
$ clang -O3 gemm.c \
-mllvm -polly-show-only \
-mllvm -polly-detect-only=gemm
gemm (possible aliasing)
void gemm(double A[N][K],
double B[K][M],
double C[N][M]) {
for (int i = 0; i < N; i++)
for (int j = 0; j < M; j++) {
C[i][j] = 0;
for (int k = 0; k < K; k++)
C[i][j] += A[i][k] * B[k][j];
}
}
[Scop graph for 'gemm': same CFG as before, but no SCoP is highlighted]
%B may possibly alias
%A may possibly alias
%A may possibly alias
96 / 132
How to fix it?
Add ’restrict’
void gemm(double A[restrict N][K],
double B[restrict K][M],
double C[restrict N][M]);
Other options:
I Inlining
I Improved alias analysis
I Run time checks
[Scop graph for 'gemm': with restrict, the whole loop nest is detected as a SCoP]
97 / 132
Extract polyhedral representation
gemm
for (int i = 0; i < 512; i++)
  for (int j = 0; j < 512; j++) {
    C[i][j] = 0;                      // Stmt1
    for (int k = 0; k < 512; k++)
      C[i][j] += A[i][k] * B[k][j];   // Stmt2
  }
$ clang -O3 gemm.c \
-mllvm -polly-run-export-jscop \
-mllvm -polly-detect-only=gemm
Writing JScop ’for.cond1.preheader => for.end27’ in function ’gemm’ to
’./gemm___%for.cond1.preheader---%for.end27.jscop’.
Domain   = {Stmt1[i,j] : 0 <= i,j < 512; Stmt2[i,j,k] : 0 <= i,j,k < 512}
Schedule = {Stmt1[i,j] → [i,j,0]; Stmt2[i,j,k] → [i,j,1,k]}
Writes   = {Stmt1[i,j] → C[i,j]; Stmt2[i,j,k] → C[i,j]}
Reads    = {Stmt2[i,j,k] → A[i,k]; Stmt2[i,j,k] → B[k,j]}
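The schedule assigns each statement instance a time vector, and instances execute in lexicographic order of these vectors; a small sketch (an illustration, not from the slides, with Stmt1's vector padded to equal length) makes the ordering explicit:

```python
def time_stmt1(i, j):
    """Stmt1[i,j] -> [i,j,0], padded with a trailing 0 for comparison."""
    return (i, j, 0, 0)

def time_stmt2(i, j, k):
    """Stmt2[i,j,k] -> [i,j,1,k]."""
    return (i, j, 1, k)

# For fixed (i,j), the initialization runs before every update of C[i][j],
assert time_stmt1(3, 5) < time_stmt2(3, 5, 0)
# and the whole k-loop finishes before the next (i,j) iteration starts.
assert time_stmt2(3, 5, 511) < time_stmt1(3, 6)
```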
98 / 132
The SCoP - Classical Definition
for i = 1 to (5n + 3)
for j = n to (4i + 3n + 4)
A[i-j] = A[i]
if i < (n - 20)
A[i+20] = j
I Structured control flow
  I Regular for loops
  I Conditions
I Affine expressions in:
  I Loop bounds, conditions, access functions
I Side effect free
99 / 132
AST based frameworks
What about:
I Goto-based loops
I C++ iterators
I C++0x foreach loop
Common restrictions
I Limited to subset of C/C++
I Require explicit annotations
I Only canonical code
I Correct? (Integer overflow, Operator overloading, ...)
100 / 132
Semantic SCoP
Thanks to LLVM Analysis and Optimization Passes:
SCoP - The Polly way
I Structured control flow
  I Regular for loops → anything that acts like a regular for loop
  I Conditions
I Affine expressions → expressions that calculate an affine result
I Side effect free → side effects known
I Memory accesses through arrays → arrays + pointers
101 / 132
Valid SCoPs
do..while loop
i = 0;
do {
int b = 2 * i;
int c = b * 3 + 5 * i;
A[c] = i;
i += 2;
} while (i < N);
pointer loop
int A[1024];
void pointer_loop () {
int *B = A;
while (B < &A[1024]) {
*B = 1;
++B;
}
}
102 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S
for (i = 0; i < 32; i++)
for (j = 0; j < 1000; j++)
A[j][i] += 1;
103 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S ◦ T_Interchange
for (j = 0; j < 1000; j++)
for (i = 0; i < 32; i++)
A[j][i] += 1;
104 / 132
Applying transformations
I D = {Stmt[i,j] : 0 <= i < 32 ∧ 0 <= j < 1000}
I S = {Stmt[i,j] → [i,j]}
I T_Interchange = {[i,j] → [j,i]}
I T_StripMine = {[i,j] → [i,jj,j] : jj mod 4 = 0 ∧ jj <= j < jj + 4}
I S' = S ◦ T_Interchange ◦ T_StripMine
for (j = 0; j < 1000; j++)
for (ii = 0; ii < 32; ii+=4)
for (i = ii; i < ii+4; i++)
A[j][i] += 1;
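Both transformations only reorder the iteration space; a sketch (an illustration, not from the slides) checks that the transformed loop nest enumerates exactly the original (i, j) pairs, just in a different order:

```python
# Original nest: for i in [0,32), for j in [0,1000).
original = [(i, j) for i in range(32) for j in range(1000)]

# Interchanged and strip-mined nest, as in the generated code above.
transformed = [(i, j)
               for j in range(1000)        # interchanged j loop
               for ii in range(0, 32, 4)   # strip-mined i loop
               for i in range(ii, ii + 4)]

assert sorted(original) == sorted(transformed)  # same iteration space
assert original != transformed                  # but a different order
```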
105 / 132
Polly takes advantage of available parallelism
It creates automatically:
I OpenMP callsfor loops that are not surrounded by any other parallel loops
I SIMD instructionsfor innermost loops with a constant number of iterations
→ Optimizing code becomes the problem of finding the rightschedule.
106 / 132
Optimizing of Matrix Multiply
[Bar chart: speedup (0x-9x) of a 32x32 double transposed matrix multiply, C[i][j] += A[k][i] * B[j][k], on an Intel Core i5 @ 2.40GHz. Compared: clang -O3, gcc -ffast-math -O3, icc -fast, Polly with only LLVM -O3, then cumulatively adding strip mining, vectorization, hoisting, and unrolling.]
107 / 132
Automatic optimization with the Pluto algorithm
Polly provides two automatic optimizers
PoCC
I -polly-optimizer=pocc
I Original implementation
I We call the pocc binary
I More mature
I Integrated with a large set of research tools

ISL
I -polly-optimizer=isl (default)
I Reimplementation
I ISL is already linked into Polly, no additional library needed
I Still untuned heuristics
I Will be used for production
108 / 132
Polly on Polybench - Sequential execution times
[Two bar charts: speedup relative to "clang -O3" on the Polybench kernels (2mm, 3mm, adi, atax, bicg, cholesky, correlation, covariance, doitgen, durbin, dynprog, fdtd-2d, fdtd-apml, gauss-filter, gemm, gemver, gesummv, gramschmidt, jacobi-1d-imper, jacobi-2d-imper, lu, ludcmp, mvt, reg_detect, seidel, symm, syr2k, syrk, trisolv, trmm) for clang -O3, pollycc -ftile, and pollycc -ftile -fvector. Speedups reach up to 5x in the first chart and up to 16x in the second.]
109 / 132
Polly on Polybench - Parallel execution times
[Two bar charts: speedup relative to "clang -O3" on the same Polybench kernels for clang -O3, pollycc -ftile -fparallel, and pollycc -ftile -fparallel -fvector. Speedups reach up to 25x in the first chart and up to 120x in the second.]
110 / 132
Current Status
[Architecture diagram: LLVM-IR → SCoP Detection → PSCoP → Dependency Analysis → Transformations (loop transformations, data layout optimizations, expose parallelism) → Code Generation (SIMD, OpenMP, OpenCL) → LLVM-IR. A JSCoP export/import path allows manual optimization or PoCC+Pluto. Legend: usable for experiments, under construction, planned.]
111 / 132
How to proceed? Where can we copy?
I Short Vector Instructions → Vectorizing compiler
I Data Locality → Optimizing compilers, Pluto
I Thread Level Parallelism → Optimizing compilers, Pluto
I Vector Accelerators → Par4All, C-to-CUDA, ppcg
The overall problem:
112 / 132
Polly
Idea: Integrated vectorization
I Target the overall problem
I Re-use existing concepts and libraries
113 / 132
Next Steps
My agenda:
I Data-locality optimizations for larger programs (production quality)
I Expose SIMDization opportunities with the core optimizers
I Offload computations to vector accelerators
Your ideas?
I Use Polly to drive instruction scheduling for VLIW architectures
I . . .
114 / 132
Conclusion
Polly
I Language Independent
I Optimizations for Data-Locality & Parallelism
I SIMD & OpenMP code generation support
I Planned: OpenCL Generation
http://polly.grosser.es
115 / 132
Make Polly Production Quality
I Derive width of new induction variables
I Model integer wrapping correctly
I Support variable sized multi-dimensional arrays
I Bound compile-time
I Ensure there are no run-time regressions
I Testing and bug fixing
116 / 132
The size of induction variables
I D = {Stmt[i] : 0 <= i < 32}
I S = {Stmt[i] → [i]}
I TScale = {[i] → [32i]}
I S′ = S
for (int_6 i = 0; i < 32; i++)
A[i] += 1;
117 / 132
The size of induction variables
I D = {Stmt[i] : 0 <= i < 32}
I S = {Stmt[i] → [i]}
I TScale = {[i] → [32i]}
I S′ = S ◦ TScale
for (int_11 i = 0; i < 1024; i+=32)
A[i/32] += 1;
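TScale stretches the values of the induction variable from [0, 31] to {0, 32, ..., 992}, so a variable that fit in 6 signed bits now needs 11. The two loops remain semantically equivalent, which a sketch with ordinary C types (names are illustrative) can check:

```c
#include <assert.h>
#include <string.h>

enum { N = 32 };

void narrow_iv(int *A) {
  for (int i = 0; i < N; i++)           /* values 0..31: fits int_6 */
    A[i] += 1;
}

void scaled_iv(int *A) {
  for (int i = 0; i < 32 * N; i += 32)  /* values 0..992: needs int_11 */
    A[i / 32] += 1;
}
```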
118 / 132
Correctly Model Integer Wrapping
I LLVM-IR instructions of integer types have modulo semantics
I They can be flagged with nsw or nuw (No signed wrap, no unsigned wrap)
I In the absence of these flags we have three choices:
I Add a run-time check that proves absence of wrapping
I Model wrapping in our polyhedral representation
I Fix propagation of these flags (if possible)
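The first choice can be sketched as loop versioning: run a transformed loop only when a cheap up-front check proves the induction variable cannot wrap (the function name and the concrete check are illustrative, not Polly's actual code):

```c
#include <assert.h>
#include <limits.h>

/* Sum A[0..n-1], preferring a strided "optimized" version whose
 * index grows as step * k.  The guard proves the index never
 * exceeds INT_MAX, i.e. the induction variable cannot wrap. */
long long sum_with_guard(const int *A, int n, int step) {
  if (n > 0 && step > 0 && (long long)step * n <= INT_MAX) {
    long long s = 0;                  /* wrap-free by the guard */
    for (int i = 0; i < step * n; i += step)
      s += A[i / step];
    return s;
  }
  long long s = 0;                    /* conservative fallback */
  for (int k = 0; k < n; k++)
    s += A[k];
  return s;
}
```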
119 / 132
Multi dimensional arrays
#define N 1024
void foo(int n, float A[][N], float **B, float C[][n]) {
  A[5][5] = 1;
  *(B + 5 * n + 5) = 1;
  C[5][5] = 1;
}
I A - Constant Size → already linear
I B - Self-made multi dimensional arrays
I Guess & Prove
I Guess & Runtime Check
I C - C99 Variable Length Arrays / Fortran Arrays
I Guess & Prove
I Guess & Runtime Check
I Pass information from Clang / GFORTRAN
120 / 132
Outline
The LLVM Project
LLVM Core Libraries and ToolsCore ToolsLLVM-IRLLVM PassesStatus Loop optimizations in LLVMScalar Evolution
Polly
ContributingSubmitting PatchesPolly - Open ProjectsGoogle Summer of Code
121 / 132
LLVM Development Model
I Development only on trunk
I No stable branches
I No stable API
I Small, incremental changes
I 6 month timed release cycle
→ trunk is normally well tested!
→ Private branches are hard to maintain!
122 / 132
Preparing a Patch
I As small as possible
I Unrelated changes → Separate patches
I No unrelated style changes
I Include a test case
I Follow style of the surrounding code
I Run make check
Goal: Make patch review trivial
123 / 132
Contributing a Patch
I Post patch to [email protected]
I Wait for patch review
I Address possible comments
I Wait for OK to commit
I Commit (or ask for commit)
Trivial changes can be committed directly and reviewed post-commit.
124 / 132
Getting Patch Reviews
This is a game: Know how to play it!
I Establish and maintain a track record
I Continuously contribute simple changes
I Perform code reviews regularly
I Give fast feedback to code reviews
I Play review ping-pong! If you review patches of people working in your area, they are more likely to review yours.
125 / 132
Polly - Open Projects
I Run-Time Check for Absence of Aliasing
I SPEC 2000/2006 Analysis
I Register Tiling
I Loop Interchange Heuristic
I OpenSCoP Import/Export
I Connect Pluto with Polly
126 / 132
GSoC - Google Summer of Code
I Your Project in an Open Source Organization
I 3 Month (June/July/August)
I Earn 5000 US $ (≈ 270.000 Rs)
I 180 Organizations (including LLVM)
I http://code.google.com/soc
127 / 132
GSoC - Is it Research?
I No, but it is strongly related
I Transform research prototypes into production code
I Evaluate your research ideas in practice
I Have your ideas used by millions of users (Your code running on every iPhone?)
I Improve the infrastructure/tools you use for research
128 / 132
GSoC - Projects
I 7 GSoC Students have been working on Graphite
I Tobias Grosser - Transform GIMPLE to GRAPHITE (2008)
I Li Feng - Automatic parallelization in Graphite (2009)
I 4 GSoC Students have been working on Polly
I Hongbin Zheng - SCoP Detection (2010)
I Ragesh Aloor - Memory Access Transformations (2011)
I Yabin Hu - GPGPU Code Generation Infrastructure (2012)
I Junqi Deng - A Data Prefetching Transformation (2012)
I Other interesting students
I Justin Holewinski - LLVM PTX Backend (2011)
129 / 132
GSoC - Opportunities
I 7 GSoC Students have been working on Graphite
I Tobias Grosser - PhD, Ecole Normale Superieure, France
I Li Feng - PhD, Ecole Normale Superieure, France
I 4 GSoC Students have been working on Polly
I Hongbin Zheng - Internship at University of Illinois
I Ragesh Aloor - Invited to CGO and IMPACT 2011 in France
I Yabin Hu - in progress
I Junqi Deng - in progress
I Other interesting students
I Justin Holewinski - Internship at NVIDIA
130 / 132
GSoC - Preparing an Application
I Get in touch early (November, December)
I Discuss your ideas and get feedback from core developers
I Get your hands dirty, prototype some ideas
I Start contributing small patches
→ A good application has a high chance of being accepted, but it also requires proper preparation.
131 / 132
Conclusion
I LLVM an interesting platform for High-Level Optimizations
I Good Scalar Optimizations and Infrastructure
I Polly provides modern, uniform infrastructure for Loop Optimizations
I Many Possibilities to Contribute
132 / 132