PSLP: Padded SLP Automatic Vectorization
Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones†
University of Cambridge†
University of Edinburgh‡
EuroLLVM, April 2015
slide 1 of 17 www.cl.cam.ac.uk/~vp331/

Why SIMD Vectorization?
• Scalable parallelism
• High performance
• Energy efficiency
• Supported since the mid-90s
• Frequent updates of vector ISAs
• Vector generation is not done in hardware
• Requires low-level programming or a capable compiler
[Figure: a. ILP: a scalar register file feeding four scalar functional units (FU). b. Vector parallelism: a vector register file feeding one vector unit with lanes 0 to 3. ISA labels: SSE4, AVX2.]

SLP Straight-Line Code Vectorizer
• Superword Level Parallelism [Larsen PLDI'00]
• State-of-the-art straight-line code vectorizer
• Implemented in most compilers (including GCC and LLVM)
• In theory it should be a superset of the loop vectorizer
  • Unroll the loop and vectorize with SLP
  • Even if the loop vectorizer fails, SLP could partly succeed
• In practice it is missing features present in the loop vectorizer (interleaved loads, predication)

SLP Vectorization Algorithm
• Input is scalar IR
• Seed instructions are:
  1 Consecutive stores
  2 Reductions
• Graph contains vectorizable isomorphic instructions
• Cost: weighted instruction count
• Check vectorization profitability
• Emit vectors only if profitable
[Flowchart, starting from the scalar code:
  1. Find vectorization seed instructions
  2. Generate graph of isomorphic scalar groups
  3. Calculate scalar cost and vector cost
  4. If vector cost < scalar cost: YES, go to step 5; NO, DONE
  5. Vectorize groups & emit vectors, then DONE]
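Steps 3 and 4 of the flowchart can be sketched as below. This is only an illustration of a weighted instruction count, not LLVM's cost model: the weight tables, the `PACK_COST` penalty and all function names are assumptions made for the example.

```python
# Hypothetical per-opcode weights; a real compiler queries a target cost model.
SCALAR_WEIGHT = {"load": 1, "store": 1, "add": 1, "mul": 1}
VECTOR_WEIGHT = {"load": 1, "store": 1, "add": 1, "mul": 1}
PACK_COST = 1  # assumed cost of one insert/extract used to gather or scatter a lane


def scalar_cost(groups):
    """Weighted count of the scalar code: one instruction per lane of each group."""
    return sum(SCALAR_WEIGHT[op] * lanes for op, lanes, _ in groups)


def vector_cost(groups):
    """One vector instruction per group, plus the inserts/extracts the group needs."""
    return sum(VECTOR_WEIGHT[op] + PACK_COST * packs for op, _, packs in groups)


def is_profitable(groups):
    """Step 4: vectorize only if the vector cost is strictly lower."""
    return vector_cost(groups) < scalar_cost(groups)


# Each group is (opcode, number of lanes, number of inserts/extracts required).
fully_isomorphic = [("store", 4, 0), ("add", 4, 0), ("load", 4, 0)]
needs_gathering = [("store", 4, 0), ("add", 4, 8)]

print(is_profitable(fully_isomorphic))  # vector cost 3 vs scalar cost 12
print(is_profitable(needs_gathering))   # vector cost 10 vs scalar cost 8
```

The second group list shows how packing overhead alone can make an otherwise vectorizable region unprofitable.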

When SLP Fails
1 Data dependencies
2 Too many gather/scatter instructions. Costs outweigh benefits.
3 Non-isomorphism
[Figure: 1. ADD1 to ADD4 connected by data dependencies. 2. Original (ADD1 to ADD4) versus Vectorized (one vector of ADD1 to ADD4 surrounded by Insert1 to Insert4 and Extract1 to Extract4). 3. A group that mixes ADD1, ADD2 and ADD4 with a MUL.]

SLP Fails due to non-isomorphism
a. Input C code
    ...
    B[i]   = A[i] * 7.0 + 1.0;
    B[i+1] = A[i+1] + 5.0;
    ...
[Figure: b. DFG. c. SLP internal graph. d. SLP vectorized groups. Legend: X = instruction node or constant, arrows = data-flow edges. SLP forms group 0 (S S) and group 1 (+ +), then reaches group 2 (* and L) and stops: STOP! NON-ISOMORPHIC. Scalar cost and vector cost are both shown as 7: no benefit.]
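The failure above can be reproduced with a small sketch of the bottom-up graph construction. The tuple encoding of instructions and the names `node` and `build_groups` are hypothetical; the real SLP pass works on compiler IR and also checks things this sketch ignores (memory adjacency, scheduling, operand reordering).

```python
def node(op, *operands):
    """An instruction or constant as (opcode, operands)."""
    return (op, operands)


# B[i]   = A[i] * 7.0 + 1.0
lane0 = node("store", node("add", node("mul", node("load"), node("const")), node("const")))
# B[i+1] = A[i+1] + 5.0
lane1 = node("store", node("add", node("load"), node("const")))


def build_groups(lanes):
    """Follow use-def chains from the seed stores, grouping isomorphic instructions."""
    groups = []
    worklist = [tuple(lanes)]
    while worklist:
        bundle = worklist.pop(0)
        opcodes = {op for op, _ in bundle}
        arities = {len(operands) for _, operands in bundle}
        if len(opcodes) != 1 or len(arities) != 1:
            continue  # non-isomorphic bundle: this chain stops here
        op, operands = bundle[0]
        if op == "const":
            continue  # constants simply become a vector constant
        groups.append(op)
        for index in range(len(operands)):
            worklist.append(tuple(lane[1][index] for lane in bundle))
    return groups


print(build_groups([lane0, lane1]))  # ['store', 'add']
print(build_groups([lane1, lane1]))  # ['store', 'add', 'load']
```

With the two statements from the slide, grouping ends after the stores and the adds because the next bundle pairs a multiply with a load.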
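
PSLP fixes Non-Isomorphism
[Figure: a. PSLP graphs: one graph per seed store (S, +, *, L with constants 7. and 1.; S, +, L with constant 5.). b. PSLP padded graphs: both graphs now contain S, +, * and L, with Select instructions choosing the Left or Right input. c. PSLP groups: numbered 0 to 5, including (S S), (+ +), (* *) and (L L). Legend: X = instruction or constant, arrows = data-flow edges, Select instruction.]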
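A minimal sketch of what the padded scalar code computes, assuming the select sits on the first operand of the add (the exact placement in the figure is not recoverable from the text). The helper names `select`, `original` and `padded` are hypothetical.

```python
def select(take_left, left, right):
    """Scalar stand-in for a select instruction with a compile-time condition."""
    return left if take_left else right


def original(A, i):
    """The two non-isomorphic statements from the input C code."""
    return A[i] * 7.0 + 1.0, A[i + 1] + 5.0


def padded(A, i):
    """Both lanes now run load, multiply, select, add: the lanes are isomorphic."""
    # Lane 0 needs the product, so its select takes the multiply.
    mul0 = A[i] * 7.0
    out0 = select(True, mul0, A[i]) + 1.0
    # Lane 1 gets a redundant multiply; its select bypasses it.
    mul1 = A[i + 1] * 7.0
    out1 = select(False, mul1, A[i + 1]) + 5.0
    return out0, out1


A = [2.0, 3.0]
print(original(A, 0) == padded(A, 0))  # True
```

The redundant multiply in lane 1 is wasted work, which is why the padded version still has to pass the cost check before it is emitted.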

PSLP Algorithm
• Extension to SLP
• Generate multiple graphs (unlike SLP)
• Minimal padding
• Cost estimation
• Emit redundant code to create isomorphism
• Code vectorized by original SLP
[Flowchart:
  1. Find vectorization seed instructions
  2. Generate a graph for each seed
  3. Perform minimal padding of graphs
  4. Calculate scalar cost, vector cost and padded vector cost
  5. If padded cost is best: YES, go to step 6; NO, go to step 7
  6. Emit padded scalars
  7. If vector cost < scalar cost: YES, go to step 8; NO, DONE
  8. Generate SLP graph containing groups of isomorphic scalars (vanilla SLP)
  9. Vectorize groups & emit vectors (vanilla SLP)]
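The decisions in steps 5 and 7 can be sketched as a small function. How ties are broken is not stated on the slide, so the strict comparisons below are an assumption, as are the function name and the returned labels.

```python
def choose_strategy(scalar_cost, vector_cost, padded_vector_cost):
    """Pick a code-generation path from the three costs computed in step 4."""
    # Step 5: padding wins only if it beats both alternatives (assumed strict).
    if padded_vector_cost < min(scalar_cost, vector_cost):
        return "emit padded scalars, then vanilla SLP"
    # Step 7: otherwise fall back to the usual SLP profitability check.
    if vector_cost < scalar_cost:
        return "vanilla SLP"
    return "keep scalar code"


# Hypothetical costs, chosen only to exercise each branch.
print(choose_strategy(scalar_cost=7, vector_cost=7, padded_vector_cost=5))
print(choose_strategy(scalar_cost=12, vector_cost=3, padded_vector_cost=4))
print(choose_strategy(scalar_cost=7, vector_cost=9, padded_vector_cost=8))
```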
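
Minimal Padding Algorithm
[Figure: the two non-isomorphic graphs g1 (S, +, *, L with constants 7 and 1) and g2 (S, +, L with constant 5). Their common subgraphs MCS1 and MCS2 are identified first; diff1 and diff2 are the nodes of g1 and g2 that lie outside them. Adding the missing diff nodes to each graph produces MinCS1 and MinCS2, which are isomorphic. SELECT nodes with Left and Right inputs are inserted where padded nodes join the original graph.]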
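The padding idea can be illustrated on a deliberately simplified case: each graph is reduced to a linear chain of opcodes, and a longest common subsequence stands in for the common subgraph. This is a swapped-in technique for illustration, not the graph algorithm used by PSLP, and the names `lcs_table` and `minimal_padding` are hypothetical.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def minimal_padding(g1, g2):
    """Return (common supersequence, nodes padded into g1, nodes padded into g2)."""
    table = lcs_table(g1, g2)
    merged, pad_g1, pad_g2 = [], [], []
    i = j = 0
    while i < len(g1) and j < len(g2):
        if g1[i] == g2[j]:              # node shared by both chains
            merged.append(g1[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            merged.append(g1[i])        # node only in g1, so g2 must be padded
            pad_g2.append(g1[i])
            i += 1
        else:
            merged.append(g2[j])        # node only in g2, so g1 must be padded
            pad_g1.append(g2[j])
            j += 1
    for op in g1[i:]:
        merged.append(op)
        pad_g2.append(op)
    for op in g2[j:]:
        merged.append(op)
        pad_g1.append(op)
    return merged, pad_g1, pad_g2


g1 = ["store", "add", "mul", "load"]   # B[i]   = A[i] * 7.0 + 1.0
g2 = ["store", "add", "load"]          # B[i+1] = A[i+1] + 5.0
print(minimal_padding(g1, g2))
```

For the running example only the multiply has to be injected into the second chain, which matches the single redundant `*` shown in the padded graphs.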

We can do better: Remove redundant Selects
EXAMPLE: Instruction acting as Select
[Figure: the padded Left and Right graphs of the running example (S, +, *, L), with the multiply constants shown as 7 and 1, followed by three select-removal cases: a. Instruction acting as Select. b. Select constants. c. Select same node.]
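One possible reading of the three cases is sketched below. The slide text only preserves the panel captions, so the rules as coded here are assumptions: (a) an operation whose padded lane uses its identity constant can replace a select, (b) a select between two constants becomes a per-lane constant, (c) a select whose two inputs are the same node is dropped. The tuple encoding and the name `remove_select` are hypothetical.

```python
IDENTITY = {"add": 0.0, "mul": 1.0}  # x + 0.0 and x * 1.0 leave x unchanged


def remove_select(take_left, left, right):
    """Simplify select(take_left, left, right) for one lane, if a rule applies.

    Values are ("const", v), ("var", name) or (op, value, ("const", c)).
    """
    # c. Select same node: both inputs are identical, no select needed.
    if left == right:
        return left
    # b. Select constants: each lane just uses its own constant.
    if left[0] == "const" and right[0] == "const":
        return left if take_left else right
    # a. Instruction acting as select: left is op(right, c); the lane that
    #    wants `right` runs the same op with the identity constant instead.
    if left[0] in IDENTITY and left[1] == right and left[2][0] == "const":
        constant = left[2] if take_left else ("const", IDENTITY[left[0]])
        return (left[0], right, constant)
    # No rule applies: the select has to stay.
    return ("select", take_left, left, right)


x = ("var", "x")
times7 = ("mul", x, ("const", 7.0))
print(remove_select(True, times7, x))   # ('mul', ('var', 'x'), ('const', 7.0))
print(remove_select(False, times7, x))  # ('mul', ('var', 'x'), ('const', 1.0))
```

In each rule both lanes end up with the same node shape, so the group stays isomorphic without a select; in the running example the multiply constants become 7 and 1.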

Opportunities for PSLP in real-life applications
1 Non-isomorphic source code (e.g. computing conjugates in 433.milc)
    ...
    b[0].real = a[0].real
    b[0].imag = -a[0].imag
    b[1].real = a[1].real
    b[1].imag = -a[1].imag
    ...
  [Figure: memory layout a[0].real, a[0].imag, a[1].real, a[1].imag]
2 Isomorphic source code but non-isomorphic IR due to high-level optimizations (jdct of cjpeg)
  Source:
    tmp1 = quantval[0]*16384
    tmp2 = quantval[1]*22725
    tmp3 = quantval[2]*21407
    tmp4 = quantval[3]*19266
  After opt:
    tmp1 = quantval[0]<<14
    tmp2 = quantval[1]*22725
    tmp3 = quantval[2]*21407
    tmp4 = quantval[3]*19266
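The second case can be reproduced with a toy strength-reduction pass. The lane encoding and the names `strength_reduce`, `is_isomorphic` and `evaluate` are hypothetical and only mimic what an optimizer does to the jdct constants.

```python
def strength_reduce(op, constant):
    """Rewrite a multiply by a power of two as a left shift, as an optimizer would."""
    if op == "mul" and constant > 0 and constant & (constant - 1) == 0:
        return ("shl", constant.bit_length() - 1)
    return (op, constant)


def is_isomorphic(lanes):
    """All lanes must use the same opcode for SLP to group them."""
    return len({op for op, _ in lanes}) == 1


def evaluate(lanes, values):
    """Apply each lane's operation to its input value."""
    return [v << c if op == "shl" else v * c for (op, c), v in zip(lanes, values)]


source = [("mul", 16384), ("mul", 22725), ("mul", 21407), ("mul", 19266)]
optimized = [strength_reduce(op, c) for op, c in source]

quantval = [3, 5, 7, 9]  # arbitrary sample inputs
print(optimized[0])                                            # ('shl', 14)
print(is_isomorphic(source), is_isomorphic(optimized))         # True False
print(evaluate(source, quantval) == evaluate(optimized, quantval))  # True
```

The optimized lanes compute the same values but no longer share an opcode, which is exactly the situation where padding can restore isomorphism.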

Experimental Setup
• Implemented PSLP in the trunk version of the LLVM 3.6 compiler.
• Target: Intel Core i5-4570 @ 3.2GHz
• Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune=core-i7 -ffast-math
• Kernels, SPEC 2006 and Mediabench II
• We evaluated the following cases:
  1 All loop, SLP and PSLP vectorizers disabled (O3)
  2 O3 + SLP enabled (SLP)
  3 O3 + PSLP enabled (PSLP)

PSLP increases performance
[Figure: Performance of Kernels (Execution Time). Series: O3, SLP, PSLP. Y axis: Normalized Time, 0.00 to 1.40. X axis: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, GMean.]
[Figure: Whole Benchmarks (Execution Time). Series: O3, SLP, PSLP. Y axis: Normalized Time, 0.97 to 1.01. X axis: cjpeg, mpeg2dec, 433.milc, 473.astar, GMean.]

PSLP enables or extends vectorization
[Figure: Vectorization Coverage Breakdown. Series: SLP-only, PSLP-extends-SLP, PSLP-only. Y axis: Times Technique Succeeds, 0 to 50, with one value label of 163. X axis: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar.]
• SLP only: SLP is adequate.
• PSLP extends SLP: SLP stops at non-isomorphic code. PSLP extends it.
• PSLP only: SLP fails completely. PSLP succeeds.

Optimizing away redundant Selects
• Select-removal optimizations remove about 21% of the Selects
[Figure: Percentage of Selects per region before and after Optimizations. Series: Original-Selects, Optimized-Selects. Y axis: Percentage of Selects, 0% to 35%. X axis: conjugates, su3-adjoint, make-ahmat-slow, jdct-ifast, floyd-warshall, cjpeg, mpeg2dec, 433.milc, 473.astar, GMean.]

Conclusion
• PSLP improves vectorization coverage compared to the state-of-the-art
• Converts non-isomorphic code into isomorphic by:
  • Relying on the Min Common Supergraph for minimal injection of redundant code
  • Emitting Select instructions to guarantee correctness
  • Optimizing away redundant Selects
• PSLP performs better compared to SLP on commodity SIMD-capable hardware