The Queen’s Tower
Imperial College London
South Kensington, SW7
6th Jun 2007 | Ashley Brown
Real-Number Optimisation: A Speculative, Profile-Guided Approach
PhD Transfer Presentation, 6th June 2007
Introduction

• Most useful applications use real-number algorithms
  – Chemical modelling
  – Weather forecasting
  – MP3 players
  – Mobile phones
  – Etc.
• Scientific applications = double-precision floating point
  – 64-bit or 80-bit floating point ALU
• Embedded applications = 32-bit fixed point
  – 32-bit integer ALU
Introduction: Our Focus

• Two distinct sets of requirements
• Embedded systems
  – High precision often not important (video/audio processing)
  – Fixed-point implementations possible
• Scientific computation
  – High precision extremely important
  – Reduction in precision or conversion to single precision must be done with great care
  – IEEE-754 floating point
What if…?
• What if we could make these applications run faster?
• What if we could shrink the hardware resources needed to run them?
• What if we could do it more aggressively than current methods?
• What if we took up gambling?
Introduction: Avoiding the Safe Option

• Standard formats (e.g. IEEE-754) are good for generality
• But we can do better
  – Optimise the data format for the job in hand
  – Use reconfigurable technology to change format on-the-fly
• Static analysis steps must be conservative
• Formal proofs are also conservative
• We could be more daring!
Motivation

• Graphics cards have made highly parallel vector processors commodity items
• In the high-performance computing world, FPGAs are available as acceleration cards
  – New HyperTransport cards provide faster communication channels
• Both provide speed increases if used effectively
  – But graphics cards only have single-precision f.p.
  – Simply implementing double-precision f.p. on FPGAs provides no benefits
• Optimise aggressively to provide reduced f.p. capabilities, within the needs of the application
FPGA Focus

• FPGAs provide a good prototyping platform
• Exploiting reconfiguration may provide further benefits
• Limitations:
  – Clock rates are much slower than attached processors
  – Communication channels are typically slow
• Acceleration comes from parallelism
• The rest of this talk considers FPGA/custom hardware implementations
The Problem

• Double-precision floating point on FPGAs uses a lot of area
• Density is improving, but we still want to squeeze more in!
  – Re-using hardware can reduce concurrency
• Scientific applications typically use 64-bit floating point
• Often full precision is (believed to be) required
  – Is this really the case?
• We have more options than single or double
Current Solutions for F.P. Minimisation

• Finding ‘minimal precision’:
  – Tools such as BitSize
  – Select precision for some operands; the tool calculates the rest
  – Test vectors used to gauge errors
• Reducing hardware area:
  – Replacing floating point by fixed point, transparently to the user (Cheung et al.)
  – The above is dangerous in scientific computations
  – It works in this case because trigonometric functions have defined ranges
Current Solutions for F.P. Minimisation (2)

• Approach by Dongarra et al.
  – Inspired by Cell single-precision f.p. performance
  – Use single precision most of the time with an iterative refinement algorithm
  – When approaching convergence, switch to double precision to finish off
  – Works on Cell and with SSE instructions
• Strodzka’s mixed-precision approach
  – Split the loop into low- and high-precision loops
  – Low-precision computation loop
  – High-precision correction loop
Profile-Guided Speculative Optimisation

• Three stages
  – Profile to find key kernels
  – Optimise the data format
  – Generate hardware with a fallback mechanism
• Optimised hardware should produce correct results for most calculations
• We need to know when it gets it wrong
[Workflow diagram: floating point application → floating point profiler (FloatWatch) → identify key kernels → select optimised data format for kernels → generate optimised hardware + error recovery → application with speculative f.p.]
Backup Option If We Get It Wrong!

• Aggressive optimisation
• We could get it wrong
• Must ensure we get correct answers
• We must guess correctly often enough to make falling back insignificant
[Flowchart: next floating point operation → operands in range? → yes: reduced floating point unit; no: software library or other fallback mechanism]
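The dispatch in the flowchart can be sketched in a few lines of Python. This is an illustration only, not the hardware mechanism: both paths here are ordinary float adds, and the exponent bounds are made-up placeholders standing in for a profiled range. The point is the guard plus hit/miss counting, which is what makes falling back "insignificant" measurable.

```python
import math

def full_precision_add(a, b):
    # Fallback path: stands in for the software library / generic unit
    return a + b

def reduced_add(a, b, exp_min=-64, exp_max=63):
    """Guarded add: operands in range -> reduced unit, otherwise -> fallback.
    exp_min/exp_max are hypothetical profiled bounds, not real figures."""
    def in_range(x):
        # frexp gives the binary exponent; zero is always acceptable
        return x == 0.0 or exp_min <= math.frexp(x)[1] <= exp_max
    if in_range(a) and in_range(b):
        reduced_add.hits += 1
        return a + b              # reduced floating point unit
    reduced_add.misses += 1
    return full_precision_add(a, b)

reduced_add.hits = reduced_add.misses = 0
reduced_add(1.0, 2.0)             # in range: reduced path
reduced_add(1e300, 2.0)           # exponent far above bound: fallback path
assert (reduced_add.hits, reduced_add.misses) == (1, 1)
```

The hit/miss ratio is exactly the quantity the speculation needs to be high for the optimisation to pay off.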
Optimisation Opportunities

• Reduce the floating point unit
  – Reduced precision
  – Restricted normalisation
• Use an alternative representation
  – Non-standard floating point (e.g. 48-bit)
  – Fixed point
  – Dual fixed-point
• Minimise redundancy
  – Remove denormal handling unless required
  – Remove or predict zero-value calculations
Background: Floating Point Primer

[Bit-layout diagram: sign (1 bit) | exponent (e bits, bias b = 2^(e-1) - 1) | significand (s bits)]
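The field layout can be checked concretely. A short Python sketch (not part of the talk's tooling) that unpacks an IEEE-754 double into the three fields:

```python
import struct

def decompose(x: float):
    """Split an IEEE-754 double into sign, biased exponent and significand bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                       # 1 bit
    exponent = (bits >> 52) & 0x7FF         # e = 11 bits, bias b = 2**(11-1) - 1 = 1023
    significand = bits & ((1 << 52) - 1)    # s = 52 bits (implicit leading 1 not stored)
    return sign, exponent, significand

# -1.5 = -1 * 2**0 * 1.5: sign set, biased exponent 0 + 1023, top fraction bit set
assert decompose(-1.5) == (1, 1023, 1 << 51)
```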
Simplified F.P. Adder

[Datapath diagram, operands as (sign, exp, significand): Compare Exponents (fractional parts must be aligned according to the exponent difference); Swap Operands (the smaller operand gets shifted); Align Operands (the implicit leading ‘1’ must be shifted in on the left); Add; Select Exponent (the larger exponent is used); Normalise (normalise the result, adjusting the exponent as necessary).]
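The adder stages can be mirrored on plain integers. This is a magnitude-only sketch under simplifying assumptions (no signs, no rounding on the alignment shift, no overflow handling); it exists to make the compare/swap/align/add/normalise sequence concrete, not to model a real unit.

```python
def fp_add(e1, m1, e2, m2, sbits=52):
    """Add two positive values given as (exponent, significand), where the
    significand carries the implicit leading '1' and sbits fraction bits,
    i.e. value = m / 2**sbits * 2**e."""
    # Compare Exponents / Swap Operands: the smaller operand gets shifted
    if e1 < e2:
        e1, m1, e2, m2 = e2, m2, e1, m1
    # Align Operands: shift right by the exponent difference (truncating)
    m2 >>= (e1 - e2)
    # Add; Select Exponent: the larger exponent is used
    m, e = m1 + m2, e1
    # Normalise: a carry-out needs one right shift and an exponent bump
    if m >> (sbits + 1):
        m >>= 1
        e += 1
    return e, m

# 1.5 * 2**1 + 1.0 * 2**0 = 4.0, with a 2-bit fraction for readability
assert fp_add(1, 6, 0, 4, sbits=2) == (2, 4)   # 4/4 * 2**2 = 4.0
```

The alignment and normalisation shifters are exactly the structures the later slides target for reduction.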
Reduce Hardware

• Example using MORPHY
• F.P. values are interesting
  – Most are confined to a narrow range
  – Different data sets do not vary the range
• The full range of double-precision floating point is not required
• Reduce the exponent
  – Limits the size of the shifting logic
  – Smaller data format = lower communication cost
[Histogram: count vs. value magnitude (approx. -256 to 256) for the Methane, Water and Peroxide datasets; all three occupy the same narrow range.]
Reduce Hardware – Alignment/Normalisation

• Most expensive step: shifting for add/subtract
  – Operand alignment
  – Normalisation
• Set limits on alignment to reduce hardware size
  – Trap to software to perform other alignments
• Provisional results: only shift-by-4 required for some applications
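The limited-alignment idea can be sketched as follows; `max_shift=4` echoes the provisional shift-by-4 result, and the exception stands in for the trap to the software path (the real mechanism is hardware, so this is only an analogy):

```python
def align_or_trap(e_large, m_small, e_small, max_shift=4):
    """Alignment stage with a hardware-style shift limit: shifts up to
    max_shift are handled 'in hardware'; anything larger traps out."""
    shift = e_large - e_small
    if shift > max_shift:
        # In the proposed scheme this would invoke the software library
        raise NotImplementedError("trap to software fallback")
    return m_small >> shift

assert align_or_trap(5, 0b11000, 2) == 0b11   # shift of 3: within the limit
```

A barrel shifter limited to 4 positions is far smaller than the full 52-position shifter a double-precision adder needs, which is where the area saving comes from.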
Alternative Representations #1: Custom Floating Point

• No need to use 64- or 32-bit
• Use a compromise instead; maybe 48-bit is enough?

  IEEE Double: 1 sign | exp (11) | mantissa (52)
  IEEE Single: 1 sign | exp (8)  | mantissa (23)
  Custom:      1 sign | exp (9)  | mantissa (38)

• Can we drop the sign bit?
• Reduce hardware
• Reduce communications
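The precision cost of the 48-bit compromise can be estimated in software. The sketch below quantises a double to a hypothetical 38-bit-mantissa format by truncating fraction bits; it deliberately ignores the narrower 9-bit exponent, overflow and denormals, since the aim is only to show the bounded rounding error:

```python
import struct

def to_custom(x, sbits=38):
    """Return the nearest-below value representable with sbits mantissa bits
    (exponent range untouched). Truncation only; a sketch, not a full codec."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    drop = 52 - sbits                    # fraction bits the custom format lacks
    bits &= ~((1 << drop) - 1)           # truncate the low fraction bits
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

x = 1.0 / 3.0
assert abs(x - to_custom(x)) < 2.0 ** (-38)   # error within one 38-bit ulp
assert to_custom(0.5) == 0.5                  # exact values pass through
```

Running a kernel's inputs through such a filter and comparing outputs is a cheap first test of whether 38 mantissa bits "is enough" before committing to hardware.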
Alternative Representations #2: Fixed Point

• For very narrow ranges, fixed point may be an option
• Must be treated with extreme care
• The dual fixed-point format provides another possibility
  – Two different formats with different fixed-point positions
  – 1 bit reserved to switch between formats
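A minimal encode/decode sketch of the dual fixed-point idea, assuming an illustrative 32-bit word with 28 or 12 fraction bits (these widths are invented for the example, not taken from any cited design); the top bit selects the format:

```python
def dfx_encode(x, total=32, frac_a=28, frac_b=12):
    """Pick format A (more fraction bits, small values) if x fits,
    else format B (more integer bits); top bit records the choice."""
    payload = total - 1
    for sel, frac in ((0, frac_a), (1, frac_b)):
        v = round(x * (1 << frac))
        if -(1 << (payload - 1)) <= v < (1 << (payload - 1)):
            return (sel << payload) | (v & ((1 << payload) - 1))
    raise OverflowError("value outside both formats")

def dfx_decode(word, total=32, frac_a=28, frac_b=12):
    payload = total - 1
    sel = word >> payload
    v = word & ((1 << payload) - 1)
    if v >= (1 << (payload - 1)):          # sign-extend the payload
        v -= 1 << payload
    return v / (1 << (frac_a if sel == 0 else frac_b))

assert dfx_decode(dfx_encode(1.5)) == 1.5        # small value: format A
assert dfx_decode(dfx_encode(1000.0)) == 1000.0  # large value: format B
```

The single select bit buys a much wider dynamic range than plain fixed point while keeping the datapath integer-only, which is the format's appeal on FPGAs.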
FloatWatch

• Valgrind-based value profiler
• Can return a number of metrics:
  – Floating point value ranges
  – Variation between 32-bit and 64-bit f.p. executions
  – Difference in magnitude between f.p. operations
• Each metric has uses for optimisation!
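FloatWatch itself instruments x86 binaries under Valgrind; as a rough offline illustration of its value-range metric, one can bin operands by sign and power-of-two magnitude, which is the shape the histograms on the following slides take:

```python
import math
from collections import Counter

def magnitude_histogram(values):
    """Approximate the value-range metric: count operands per sign and
    per power-of-two magnitude bucket (plus an explicit zero bucket)."""
    bins = Counter()
    for v in values:
        if v == 0.0:
            bins["zero"] += 1
        else:
            e = math.frexp(abs(v))[1] - 1     # floor(log2 |v|)
            bins[("-" if v < 0 else "+", e)] += 1
    return bins

h = magnitude_histogram([0.75, 0.6, -3.0, 0.0])
assert h[("+", -1)] == 2      # 0.75 and 0.6 both lie in [0.5, 1)
assert h[("-", 1)] == 1       # -3.0 lies in [2, 4)
```

A tight cluster of occupied buckets is exactly the evidence used later to argue that the full double-precision exponent range is unnecessary.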
FloatWatch

• Operates on x86 binaries under Valgrind
  – x86 machine code is converted to simplified SSA
  – FloatWatch inserts instrumentation code after floating point operations
  – SSA is converted back to x86 and cached
• Outputs a data file with selected metrics
• A processing script produces an HTML+JavaScript report

[Toolflow diagram: x86 binary → Valgrind + FloatWatch → raw output → FloatWatch post-processor (with source files in C, FORTRAN) → HTML report in a web browser; CSV export feeds graphing tools and user data manipulation.]
Report

• Dynamic HTML interface
  – Copy the HTML file from the computing cluster to a desktop; no installation required
• Select/deselect source lines and SSA “instructions”
  – Dynamic in-page graph
  – Table for exporting to gnuplot, Excel etc.
• View value ranges at instruction, source line, function, file and application levels
[Histogram: count (up to ~3×10^8) vs. value magnitude for the source line
UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
and each of its subexpressions (U(I,J), UNEW(I,J), UOLD(I,J), ALPHA, 2*U(I,J), UNEW(I,J)-2.*U(I,J), the bracketed sum, the ALPHA product and the full expression), shown alongside the simplified SSA (VEX IR) listing for the line.]
What does this tell us?

• ALPHA is constant (but we could have found that from the source)
• Memory operands all fall within the same range
• The result falls within the same range as the memory operands
• Intermediate values cause a shift in the range
• Optimisation: we do not need double precision
  – A custom floating point format would suffice
Profiling Results – SPECFP95 ‘swim’

[Histogram: count (up to ~6×10^9) vs. value magnitude, spanning roughly -1×2^47 through zero to 1×2^35. A sawtooth pattern is visible, caused by multiplication.]
‘swim’ Close-up

[Histogram: count (up to ~1.8×10^9) vs. value magnitude over the negative range -1×2^27 to -1×2^-53.]
Profiling Results – SPECFP95 ‘mgrid’

[Histogram: count (up to ~4×10^9) vs. value magnitude. A large spike of operations produce zero, and two ranges with similar shapes are visible.]
Range Close-up

[Histogram: count (up to ~4×10^9) vs. value magnitude over the negative range -1×2^7 to -1×2^-68.]
Profiling Results – MMVB

[Histogram: normalised count (0–120%) vs. value magnitude (approx. -2 to 1) for datasets 9a, 7, 6D2, 12_2 and 13_d3h.]

As with MORPHY, ranges are similar between datasets. But we were using test datasets.
Adaptive Floating Point

• The main focus of future work
• Dynamic modification of acceleration hardware
  – Exploit the reconfigurability of FPGAs
  – Reconfigure the device to meet application requirements
• Important considerations:
  – What happens to data already on the chip?
“Pipeline Prediction”

• Similar concept to branch prediction
• Build a selection of pipelines with different performance characteristics
  – Slow but generic version
  – Fast version with limited range and reduced operand alignment
  – Compromises in between
• Predict which version is best to use (how?)

[Diagram: full microprocessor with f.p. software library alongside a floating point datapath of add and multiply units.]
True Reconfiguration – Temporal Profiling

• Value ranges can vary
  – for different application phases
  – within loops iterating to convergence
• Potential to reconfigure hardware as phases change
• Particularly apparent when iterating to convergence
• Simple example using the Newton-Raphson method
  – Solve a·cos(x) − b·x³ + c = 0
  – Choose an estimate for the solution, iterate to refine
Predictable Behaviour: Newton’s Method

[Line chart: value vs. iteration (1–13) when a=1, b=1, c=0, for starting estimates from 0.1 to 1.2. Each line represents a different starting estimate; the actual value is around 0.86.]
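The iterates on the chart are straightforward to reproduce. A small Python sketch of Newton-Raphson on f(x) = a·cos(x) − b·x³ + c, returning the whole sequence so the narrowing value range (the temporal behaviour the slide is about) is visible:

```python
import math

def newton(a, b, c, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson on f(x) = a*cos(x) - b*x**3 + c; returns the iterates."""
    f  = lambda x: a * math.cos(x) - b * x ** 3 + c
    df = lambda x: -a * math.sin(x) - 3 * b * x ** 2
    xs = [x0]
    while len(xs) <= max_iter and abs(f(xs[-1])) > tol:
        xs.append(xs[-1] - f(xs[-1]) / df(xs[-1]))
    return xs

xs = newton(1, 1, 0, x0=1.0)
assert abs(xs[-1] - 0.865) < 1e-2   # root near 0.86, as the chart shows
```

Early iterates wander over a wide range while late ones differ only in low-order bits, which is what makes stepping the hardware precision up as convergence approaches plausible.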
Resolution Refinement

• Work by Dongarra et al. uses this technique
  – Start with SSE/32-bit f.p.
  – Iterate to “ballpark” convergence
  – Switch to 64-bit for a more precise result
• Alternative with reconfiguration
  – Multi-stepped refinement
  – Start at 8-bit; move to 24, 48, 64, 128 bits depending on the precision required
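A toy scalar analogue of the refinement idea (not the actual SSE/Cell linear-algebra kernels): do the expensive operation, here a division via a reciprocal, at single precision, then correct the residual at double precision. Single precision is simulated by round-tripping through a 32-bit float.

```python
import struct

def to_f32(x):
    # Round a double to single precision (stands in for the low-precision unit)
    return struct.unpack(">f", struct.pack(">f", x))[0]

def refine_divide(a, b, iters=3):
    """Compute a/b: low-precision reciprocal, double-precision residual
    corrections. Most work happens at the cheap precision."""
    inv = to_f32(to_f32(1.0) / to_f32(b))   # approximate reciprocal, ~2**-24 accurate
    x = inv * a                              # initial low-quality quotient
    for _ in range(iters):
        r = a - b * x                        # residual in double precision
        x += inv * r                         # cheap correction step
    return x

assert abs(refine_divide(1.0, 3.0) - 1.0 / 3.0) < 1e-15
```

Each correction multiplies the error by roughly the reciprocal's relative error, so a couple of iterations recover full double accuracy: the same convergence argument behind the mixed-precision solvers above.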
Problems with Our Approach

• No guarantees that values do not occur outside the identified ranges
  – We must have a backup plan if it goes wrong
• Not all applications will behave like MORPHY
  – Value ranges could vary wildly with different datasets
• Valgrind is slow
• Getting FPGAs to provide a speed-up can be difficult and painful
Future Work

• State-based profiling:
  – profile functions based on the call stack
  – allows context-dependent configurations
• Active simulation
  – Test new representations to check for errors
• Use results in practice
  – FPGA implementations for real applications
  – Adaptive representations
The Queen’s Tower
Imperial College London
South Kensington, SW7

Any Questions?

Jezebel
1916 Dennis ‘N’ Type Fire Engine
Royal College of Science Motor Club
Imperial College Union, SW7