PSLP: Padded SLP Automatic Vectorization

Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones†

† University of Cambridge
‡ University of Edinburgh

EuroLLVM, April 2015

slide 1 of 17 www.cl.cam.ac.uk/~vp331/

Why SIMD Vectorization?

• Scalable parallelism
• High Performance
• Energy efficiency
• Supported since mid 90's
• Frequent updates of vector ISAs
• Vector generation not done in hardware
• Low-level programming or capable compiler

[Figure: a. ILP: a Scalar Reg. File feeding four Scalar Func. Units (FU). b. Vector Parallelism: a Vector Reg. File with lanes 0 1 2 3 feeding one Vector Unit. ISA labels shown: SSE4, AVX2.]

slide 2 of 17

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI'00]
• State-of-the-art straight-line code vectorizer
• Implemented in most compilers (including GCC and LLVM)
• In theory it should be a superset of loop-vectorizer
  • Unroll loop and vectorize with SLP
  • Even if loop-vectorizer fails, SLP could partly succeed
• In practice it is missing features present in the Loop vectorizer (Interleaved Loads, Predication)

slide 3 of 17

SLP Vectorization Algorithm

• Input is scalar IR
• Seed instructions are:
  1 Consecutive Stores
  2 Reductions
• Graph contains vectorizable isomorphic instructions
• Cost: weighted instr. count
• Check vectorization profitability
• Emit vectors only if profitable

Flowchart (starting from Scalar Code):
1. Find vectorization seed instructions
2. Generate graph of isomorphic scalar groups
3. Calculate Scalar Cost and Calculate Vector Cost
4. If Vector Cost < Scalar Cost
5. YES: Vectorize groups & emit vectors, then DONE
   NO: DONE

slide 4 of 17
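
The profitability check in steps 3 and 4 can be sketched in C as follows. This is a minimal illustration, not LLVM's cost model: the struct `slp_group`, the unit weights and the per-lane insert/extract penalty `GATHER_WEIGHT` are all assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one group of isomorphic scalars. */
struct slp_group {
    int lanes;          /* number of scalars packed into one vector    */
    int scalar_weight;  /* assumed cost of one scalar instruction      */
    int vector_weight;  /* assumed cost of the single vector instr.    */
    int gathered_lanes; /* inserts/extracts needed around the group    */
};

/* Assumed cost of one insert or extract instruction. */
enum { GATHER_WEIGHT = 1 };

/* Scalar cost: every lane executes its own scalar instruction. */
int scalar_cost(const struct slp_group *g, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++) {
        assert(g[i].lanes > 0);
        cost += g[i].lanes * g[i].scalar_weight;
    }
    return cost;
}

/* Vector cost: one vector instruction per group plus gather/scatter. */
int vector_cost(const struct slp_group *g, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++)
        cost += g[i].vector_weight + g[i].gathered_lanes * GATHER_WEIGHT;
    return cost;
}

/* Step 4: vectorize only if the vector form is strictly cheaper. */
int slp_is_profitable(const struct slp_group *g, size_t n)
{
    return vector_cost(g, n) < scalar_cost(g, n);
}
```

With these made-up weights, two four-lane groups without gathers cost 2 as vectors against 8 as scalars, so they are vectorized. A single four-lane group that needs four inserts and four extracts costs 9 as a vector against 4 as scalars, so the code stays scalar.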

When SLP Fails

1 Data Dependencies
2 Too many gather/scatter instructions. Costs outweigh benefits.
3 Non-isomorphism

[Figure: one example per case. (1) ADD1 to ADD4 linked by data dependencies. (2) Original: scalar ADD1 to ADD4. Vectorized: a vector ADD1 ADD2 ADD3 ADD4 surrounded by Insert1 to Insert4 and Extract1 to Extract4. (3) A candidate group ADD1, ADD2, MUL, ADD4.]

slide 5 of 17

SLP Fails due to non-isomorphism

a. Input C code

...
B[i] = A[i] * 7.0 + 1.0;
B[i+1]= A[i+1] + 5.0;
...

[Figure: b. DFG; c. SLP internal graph; d. SLP vectorized groups. Legend: X = Instruction Node or Constant, arrow = Data Flow Edge. SLP forms group 0 (S S) and group 1 (+ +), then reaches group 2, which pairs * with L: "STOP! NON-ISOMORPHIC". Comparing Vector Cost with Scalar Cost gives "No Benefit".]

slide 6 of 17

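PSLP fixes Non-Isomorphism

[Figure: a. PSLP graphs; b. PSLP padded graphs; c. PSLP groups. Legend: X = Instruction or Constant, arrow = Data Flow Edge, plus a Select Instruction node with Left and Right inputs. The graphs of the two statements (S, +, *, L with constants 7. and 1.; S, +, L with constant 5.) are padded with the missing * 7. and with Select instructions until they match. The groups are numbered 0 to 5 and include S S, + +, * * and L L.]

slide 7 of 17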
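
Written back as C, the padding idea looks roughly like this. The sketch assumes that the right-hand lane is padded with the missing multiplication and that a Select then picks the unmultiplied value. The helper `select_f64` and the function names are hypothetical; `A`, `B` and `i` come from the input code on the previous slide.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for an IR select: cond ? a : b. */
static double select_f64(int cond, double a, double b)
{
    return cond ? a : b;
}

/* Original, non-isomorphic pair of statements. */
void original(const double *A, double *B, int i)
{
    assert(i >= 0);
    B[i]     = A[i] * 7.0 + 1.0;
    B[i + 1] = A[i + 1] + 5.0;
}

/* Padded version: both lanes now run load, *, select, +, store. */
void padded(const double *A, double *B, int i)
{
    assert(i >= 0);
    double m0 = A[i] * 7.0;
    double m1 = A[i + 1] * 7.0;              /* redundant, added by padding */
    double s0 = select_f64(1, m0, A[i]);     /* lane 0 keeps the product    */
    double s1 = select_f64(0, m1, A[i + 1]); /* lane 1 discards the product */
    B[i]     = s0 + 1.0;
    B[i + 1] = s1 + 5.0;
}
```

Both functions store the same values. The padded one does extra work in the right-hand lane, which is the price of making the two lanes isomorphic so that plain SLP can pack them.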

PSLP Algorithm

• Extension to SLP
• Generate multiple graphs (unlike SLP)
• Minimal Padding
• Cost estimation
• Emit redundant code to create isomorphism
• Code vectorized by original SLP

Flowchart:
1. Find vectorization seed instructions
2. Generate a graph for each seed
3. Perform minimal Padding of graphs
4. Calculate Padded Vector Cost, Calculate Vector Cost, Calculate Scalar Cost
5. If Padded Cost is best
6. YES: Emit Padded Scalars
7. to 9. (Vanilla SLP): Generate SLP graph containing groups of isomorphic scalars; If Vector Cost < Scalar Cost, YES: Vectorize groups & emit vectors; NO: DONE

slide 8 of 17
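
Step 5 amounts to a three-way comparison. Below is a minimal sketch, assuming the three costs are already available as plain non-negative integers and that ties go to the less invasive option (the slides do not say how ties are handled). In the flowchart the padded scalars are handed to vanilla SLP, which repeats its own profitability check; the sketch collapses that into a single return value. The enum and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical outcome labels for one seed group. */
enum pslp_choice { KEEP_SCALAR, VANILLA_SLP, PADDED_SLP };

enum pslp_choice pslp_choose(int scalar_cost, int vector_cost,
                             int padded_vector_cost)
{
    /* The sketch assumes non-negative costs. */
    assert(scalar_cost >= 0 && vector_cost >= 0 && padded_vector_cost >= 0);

    /* Step 5: pad only when the padded form strictly beats both others. */
    if (padded_vector_cost < vector_cost && padded_vector_cost < scalar_cost)
        return PADDED_SLP;

    /* Vanilla SLP check: vectorize only if cheaper than scalar code. */
    if (vector_cost < scalar_cost)
        return VANILLA_SLP;

    return KEEP_SCALAR;
}
```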

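Minimal Padding Algorithm

[Figure: step-by-step padding of the two graphs from the running example. g1 (S, +, *, L with constants 7 and 1) and g2 (S, +, L with constant 5) are Non-Isomorphic. The common parts MCS1 and MCS2 are identified in each graph, leaving the differences diff1 and diff2. Adding diff1 and diff2 to the graphs yields MinCS1 and MinCS2, which are Isomorphic. SELECT nodes with Left and Right inputs are then inserted.]

slide 9 of 17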
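
The graph version of this algorithm is hard to show in a few lines, so the following sketch swaps in a deliberately simpler problem: each graph is flattened to a linear chain of opcodes, and the longest common subsequence (LCS) plays the role of the common part. It is not the algorithm from the slides, only an illustration of why the padded size equals the two sizes added together minus the common part. The opcode letters follow the figure (S store, + add, * multiply, L load); `MAX_OPS`, `lcs_length`, `padded_length` and `padding_needed` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

enum { MAX_OPS = 32 };

/* Length of the longest common subsequence of two opcode chains. */
static int lcs_length(const char *g1, const char *g2)
{
    size_t n = strlen(g1), m = strlen(g2);
    int table[MAX_OPS + 1][MAX_OPS + 1];

    assert(n <= MAX_OPS && m <= MAX_OPS);
    for (size_t i = 0; i <= n; i++) {
        for (size_t j = 0; j <= m; j++) {
            if (i == 0 || j == 0)
                table[i][j] = 0;
            else if (g1[i - 1] == g2[j - 1])
                table[i][j] = table[i - 1][j - 1] + 1;
            else
                table[i][j] = table[i - 1][j] > table[i][j - 1]
                            ? table[i - 1][j] : table[i][j - 1];
        }
    }
    return table[n][m];
}

/* Size of the shortest chain that contains both inputs as subsequences. */
int padded_length(const char *g1, const char *g2)
{
    return (int)strlen(g1) + (int)strlen(g2) - lcs_length(g1, g2);
}

/* Number of opcodes that must be added to g to reach the padded chain. */
int padding_needed(const char *g, const char *other)
{
    return padded_length(g, other) - (int)strlen(g);
}
```

For the running example, the chains "S+*L" and "S+L" share "S+L", so `padded_length` is 4 and only the `*` has to be added to the right-hand chain.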

We can do better: Remove redundant Selects

EXAMPLE: Instruction acting as Select

[Figure: the padded Left and Right graphs of the running example (* 7. and + 1. on the left, + 5. on the right), followed by three Select-removal cases: a. Instruction acting as Select; b. Select constants; c. Select same node.]

slide 10 of 17
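
One way to read case a on the running example, offered as an interpretation rather than as the exact transformation drawn on the slide: instead of selecting between `A[i+1] * 7.0` and `A[i+1]`, the padded multiplication can be fed the constant 1.0 in the right-hand lane, so the instruction itself passes the value through and the Select disappears. The function name and the per-lane constant arrays `mul` and `add` are hypothetical.

```c
#include <assert.h>

/* Both lanes run load, *, +, store; only the constants differ. */
void padded_no_select(const double *A, double *B, int i)
{
    /* Lane 1 multiplies by 1.0, the identity for *, instead of a Select. */
    static const double mul[2] = { 7.0, 1.0 };
    static const double add[2] = { 1.0, 5.0 };

    assert(i >= 0);
    for (int lane = 0; lane < 2; lane++)
        B[i + lane] = A[i + lane] * mul[lane] + add[lane];
}
```

The loop body is identical for both lanes, which is the shape a straight-line vectorizer can pack into one vector multiply and one vector add with constant vector operands.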

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computing conjugates in 433.milc)

...
b[0].real = a[0].real
b[0].imag = - a[0].imag
b[1].real = a[1].real
b[1].imag = - a[1].imag
...

Memory: a[0].real, a[0].imag, a[1].real, a[1].imag

2 Isomorphic source code but non-isomorphic IR due to high-level optimizations (jdct of cjpeg)

Source:
tmp1 = quantval[0]*16384
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266

After opt:
tmp1 = quantval[0]<<14
tmp2 = quantval[1]*22725
tmp3 = quantval[2]*21407
tmp4 = quantval[3]*19266

slide 12 of 17
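
The jdct case can be reproduced with a small C sketch. It assumes non-negative `int` inputs (a left shift of a negative signed value is undefined in C) and uses hypothetical function names; the constants are the ones on the slide.

```c
#include <assert.h>

/* As written in the source: four isomorphic multiplications. */
void scale_source(const int *quantval, int *tmp)
{
    tmp[0] = quantval[0] * 16384;
    tmp[1] = quantval[1] * 22725;
    tmp[2] = quantval[2] * 21407;
    tmp[3] = quantval[3] * 19266;
}

/* After strength reduction: lane 0 becomes a shift, breaking isomorphism. */
void scale_optimized(const int *quantval, int *tmp)
{
    assert(quantval[0] >= 0);    /* shifting a negative int is undefined */
    tmp[0] = quantval[0] << 14;  /* 16384 == 1 << 14 */
    tmp[1] = quantval[1] * 22725;
    tmp[2] = quantval[2] * 21407;
    tmp[3] = quantval[3] * 19266;
}
```

The two versions compute the same values, but the first lane now uses a different opcode, so a vectorizer that requires matching opcodes no longer sees four identical multiplications.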

Experimental Setup

• Implemented PSLP in the trunk version of the LLVM 3.6 compiler.
• Target: Intel Core i5-4570 @ 3.2 GHz
• Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune-core-i7 -ffast-math
• Kernels, SPEC 2006 and Mediabench II
• We evaluated the following cases:
  1 All loop, SLP and PSLP vectorizers disabled (O3)
  2 O3 + SLP enabled (SLP)
  3 O3 + PSLP enabled (PSLP)

slide 13 of 17

PSLP increases performance

[Figure: two bar charts comparing O3, SLP and PSLP. "Performance of Kernels (Execution Time)": conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, GMean; y-axis Normalized Time, 0.00 to 1.40. "Whole Benchmarks (Execution Time)": cjpeg, mpeg2dec, 433.milc, 473.astar, GMean; y-axis 0.97 to 1.01.]

slide 14 of 17

PSLP enables or extends vectorization

[Figure: "Vectorization Coverage Breakdown", stacked bars for conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar; y-axis Times Technique Succeeds, 0 to 50, with one off-scale bar annotated 163; series SLP-only, PSLP-extends-SLP, PSLP-only.]

• SLP only: SLP is adequate
• PSLP extends SLP: SLP stops at non-isomorphic code. PSLP extends it.
• PSLP only: SLP fails completely. PSLP succeeds.

slide 15 of 17

Optimizing away redundant Selects

• Select-removal optimizations remove about 21% of the Selects

[Figure: "Percentage of Selects per region before and after Optimizations", bars Original-Selects and Optimized-Selects for conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar, GMean; y-axis Percentage of Selects, 0% to 35%.]

slide 16 of 17

Conclusion

• PSLP improves vectorization coverage compared to the state-of-the-art
• Converts non-isomorphic code into isomorphic by:
  • Relying on the Min Common Supergraph for minimal injection of redundant code
  • Emitting Select instructions to guarantee correctness
  • Optimizing away redundant Selects
• PSLP performs better compared to SLP on commodity SIMD-capable hardware

slide 17 of 17