University of Michigan, Electrical Engineering and Computer Science
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
Nathan Clark, Amir Hormati, Scott Mahlke,
Sami Yehia*, Krisztián Flautner*
University of Michigan *ARM Ltd.
Computational Efficiency
• Low power envelope
• More useful work per transistor
• Hardware accelerators
• Niagara II encryption engine
Source: AMD Analyst Day 12/14/06
How Are Accelerators Used?
Control statically placed in binary
[Diagram: Program, CPU, Accel.]
Problem With Static Control
Not forward/backward compatible
[Diagram: Program, CPU, Accel.]
Solution: Virtualization
• Statically identify accelerated computation
• Abstract accelerator features
• Dynamically retarget binary
[Diagram: Engineer/Compiler, Program, Trans., Proc., Accel.]
Liquid SIMD
• Virtualize SIMD accelerators
• Why virtualize SIMD?
– Intel MMX to SSE2
– ARM v6 to Neon
– Wide vectors useful [Lin 06]
SIMD Accelerator Assumptions
• Same instruction stream
• Separate pipeline – memory interface
[Diagram: pipeline with Fetch, Decode, Scalar Exec, SIMD Exec, Retire]
How to Virtualize
• Use scalar ISA to represent SIMD operations
– Compatibility, low overhead
• Key: easy to translate
[Diagram: Program, Branch]
Virtualization Architecture
[Diagram: Fetch, Decode, Execute, Retire, Accel., uCode Cache, Trans.]
1. Data Parallel Operations
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = r3 & constant;
  C[i] = r4;
}
[Diagram: dataflow from A and B through + and & into C, repeated per element]
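As a rough illustration of the idea on this slide, the C sketch below writes the add-and-mask computation twice: once in the scalar form of the loop, and once with an emulated 4-lane vector in which each inner loop stands in for a single vector instruction. The lane count `LANES`, the function names, and the 32-bit element type are assumptions made for the example and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumed SIMD width, chosen only for illustration */

/* Scalar representation: one element per iteration. */
void add_mask_scalar(const uint32_t *A, const uint32_t *B, uint32_t *C,
                     size_t n, uint32_t constant)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t r1 = A[i];
        uint32_t r2 = B[i];
        uint32_t r3 = r1 + r2;
        uint32_t r4 = r3 & constant;
        C[i] = r4;
    }
}

/* Emulated SIMD form: each inner loop models one vector instruction. */
void add_mask_lanes(const uint32_t *A, const uint32_t *B, uint32_t *C,
                    size_t n, uint32_t constant)
{
    assert(n % LANES == 0); /* this sketch ignores remainder iterations */
    for (size_t i = 0; i < n; i += LANES) {
        uint32_t v1[LANES], v2[LANES], v3[LANES], v4[LANES];
        for (size_t l = 0; l < LANES; l++) v1[l] = A[i + l];          /* vector load  */
        for (size_t l = 0; l < LANES; l++) v2[l] = B[i + l];          /* vector load  */
        for (size_t l = 0; l < LANES; l++) v3[l] = v1[l] + v2[l];     /* vector add   */
        for (size_t l = 0; l < LANES; l++) v4[l] = v3[l] & constant;  /* vector and   */
        for (size_t l = 0; l < LANES; l++) C[i + l] = v4[l];          /* vector store */
    }
}
```

Because no iteration reads a value written by another, both functions produce the same output for any trip count that is a multiple of `LANES`.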
1a. What If There’s No Scalar Equivalent?
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  cmp r3, #FF;
  r3 = movgt #FF;
  ...
}
[Diagram: A and B feeding SADD]
Idioms can always be constructed
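A minimal sketch of such an idiom, assuming unsigned 8-bit elements: the saturating add is spelled as an ordinary add in a wider type, followed by a compare against 0xFF and a conditional clamp, mirroring the cmp / movgt pair in the loop. The function names are hypothetical, and storing the result to `C` is an assumption, since the slide elides the rest of the loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar idiom for an unsigned 8-bit saturating add:
 * add in a wider type, compare with 0xFF, clamp if greater. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned r3 = (unsigned)a + (unsigned)b;
    if (r3 > 0xFFu)      /* cmp r3, #FF   */
        r3 = 0xFFu;      /* movgt r3, #FF */
    return (uint8_t)r3;
}

/* The idiom applied element by element (the store to C is assumed). */
void sat_add_loop(const uint8_t *A, const uint8_t *B, uint8_t *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = sat_add_u8(A[i], B[i]);
}
```

A translator that recognizes this add, compare, conditional-move sequence could replace it with a single saturating vector add where the hardware offers one.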
2. Scalarizing Permutations
for (i = 0; i < 8; i++) {
  …
  r1 = r2 + r3;
  tmp[i] = r1;
}
for (i = 0; i < 8; i++) {
  r1 = offset[i];
  r2 = tmp[r1 + i];
  r3 = r2 & const;
  …
}
offset = {4, 4, 4, 4, -4, -4, -4, -4}
[Diagram: + feeding & through a permutation]
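The two-loop pattern can be made concrete with a short C sketch. It assumes the first loop adds two input arrays `x` and `y` (hypothetical names standing in for the elided part of the slide) and that the second loop reads `tmp` through the `offset` array before masking. With the offsets shown, the effect is to swap the two 4-element halves.

```c
#include <stddef.h>
#include <stdint.h>

#define N 8

/* offset[i] is where element i is read from, relative to i. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

/* Loop 1 produces tmp; loop 2 reads tmp through the offset array,
 * which expresses the permutation in plain scalar code. */
void add_permute_mask(const uint32_t *x, const uint32_t *y,
                      uint32_t *out, uint32_t mask)
{
    uint32_t tmp[N];

    for (int i = 0; i < N; i++) {
        uint32_t r1 = x[i] + y[i];
        tmp[i] = r1;
    }
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];
        uint32_t r2 = tmp[r1 + i];   /* permuted read */
        uint32_t r3 = r2 & mask;
        out[i] = r3;
    }
}
```

Since the offset array is a compile-time constant, a translator can match its contents against known shuffle patterns instead of executing the indexed loads.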
3. Scalarizing Reductions
for (i = 0; i < 8; i++) {
  …
  r1 = A[i];
  r2 = r2 + r1;
  …
}
[Diagram: + reduction]
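The sketch below shows one common way a scalar reduction loop like this one can map onto vector hardware: keep one partial sum per lane, then combine the lanes at the end. This is an illustrative assumption, not necessarily the mapping used by the authors' translator; `LANES` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumed SIMD width, chosen only for illustration */

/* Scalar reduction: a single accumulator carried across iterations. */
uint32_t sum_scalar(const uint32_t *A, size_t n)
{
    uint32_t r2 = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t r1 = A[i];
        r2 = r2 + r1;
    }
    return r2;
}

/* Emulated SIMD reduction: per-lane partial sums, then a horizontal add. */
uint32_t sum_lanes(const uint32_t *A, size_t n)
{
    assert(n % LANES == 0); /* this sketch ignores remainder iterations */
    uint32_t acc[LANES] = {0};
    for (size_t i = 0; i < n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += A[i + l];          /* models one vector add */
    uint32_t total = 0;
    for (size_t l = 0; l < LANES; l++)   /* combine the lanes */
        total += acc[l];
    return total;
}
```

The two functions agree here because unsigned integer addition is associative and commutative even on overflow; with floating-point data the reordering could change the rounded result.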
Applied to ARM Neon
• All instructions supported except…
• VTBL – indirect indexing
v1 = vtbl v2, v3
• Interleaved memory accesses
• Not needed in evaluated benchmarks
[Diagram: VTBL indirect indexing (v1, v2, v3; indices 1 0 1 3) and interleaved memory access (Mem, v1)]
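For reference, the behavior of a VTBL-style byte table lookup can be sketched in C as below. The function name and signature are hypothetical; the rule that out-of-range indices yield zero follows the NEON VTBL definition. One plausible reading of why this is harder to virtualize is that the permutation comes from a register value at run time, not from a constant offset array that a translator could pattern-match.

```c
#include <stddef.h>
#include <stdint.h>

/* VTBL-style lookup: out[i] = table[idx[i]], and an index past the
 * end of the table produces 0. */
void table_lookup(const uint8_t *table, size_t table_len,
                  const uint8_t *idx, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
}
```

With a table of {10, 20, 30, 40} and indices {1, 0, 1, 3}, the result is {20, 10, 20, 40}.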
Translation to SIMD
• Update induction variable
• Use inverse of defined translation rules
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = offset[i];
  C[i + r4] = r3;
}
for (i = 0; i < 8; i += 4) {
  v1 = A[i];
  v2 = B[i];
  v3 = v1 + v2;
  v3 = shuffle v3;
  C[i] = v3;
}
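To check that a scalar loop and its translated form agree, the sketch below emulates both in C. It assumes a 4-lane vector and an offset pattern of {1, -1, 1, -1, ...}, which swaps adjacent elements and therefore stays inside each 4-element group; the slide does not give the offset contents, so this pattern, `LANES`, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumed SIMD width, chosen only for illustration */

/* Scalar form: the permuted store is expressed through an offset array. */
void store_permuted_scalar(const uint32_t *A, const uint32_t *B, uint32_t *C,
                           const int *offset, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t r3 = A[i] + B[i];
        int r4 = offset[i];
        C[(ptrdiff_t)i + r4] = r3;
    }
}

/* Emulated translated form: the induction variable steps by LANES and the
 * offset array has been replaced by a fixed shuffle (swap adjacent lanes). */
void store_permuted_lanes(const uint32_t *A, const uint32_t *B, uint32_t *C,
                          size_t n)
{
    assert(n % LANES == 0); /* this sketch ignores remainder iterations */
    for (size_t i = 0; i < n; i += LANES) {
        uint32_t v3[LANES], shuffled[LANES];
        for (size_t l = 0; l < LANES; l++) v3[l] = A[i + l] + B[i + l]; /* v3 = v1 + v2    */
        for (size_t l = 0; l < LANES; l++) shuffled[l] = v3[l ^ 1];     /* v3 = shuffle v3 */
        for (size_t l = 0; l < LANES; l++) C[i + l] = shuffled[l];      /* C[i] = v3       */
    }
}
```

With `offset = {1, -1, 1, -1, 1, -1, 1, -1}`, both functions write the sum for element `j ^ 1` into `C[j]`, so their outputs match element for element.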
Translator Design
Translator: efficiency, speed, flexibility
[Diagram: Engineer/Compiler, Program, Trans., Proc., Accel.]
Evaluation
• Trimaran ARM
• Hand SIMDized loops
• SimpleScalar model ARM926 w/ Neon SIMD
• VHDL translator, 130nm std. cell
Liquid SIMD Issues
• Code bloat
– <1% overhead beyond baseline
• Register pressure
– Not a problem
• Translator cost
– 0.2 mm² + 2KB cache
• Translation overhead
Translation Overhead
[Chart: translation overhead for SPECfp, MediaBench, and Kernels]
Summary
• Accelerators are more common and evolving
– Costly binary migration
• SIMD virtualization using scalar ISA
– One binary: forward/backward compatibility
– Negligible overhead