The Queen’s Tower
Imperial College London
South Kensington, SW7
6th Jun 2007 | Ashley Brown
Real-Number Optimisation: A Speculative, Profile-Guided Approach
PhD Transfer Presentation, 6th June 2007
Introduction

• Most useful applications use real-number algorithms
  – Chemical modelling
  – Weather forecasting
  – MP3 players
  – Mobile phones
  – Etc.
• Scientific applications = double-precision floating point
  – 64-bit or 80-bit floating point ALU
• Embedded applications = 32-bit fixed point
  – 32-bit integer ALU
Introduction: Our Focus

• Two distinct sets of requirements
• Embedded systems
  – High precision often not important (video/audio processing)
  – Fixed-point implementations possible
• Scientific computation
  – High precision extremely important
  – Reduction in precision or conversion to single precision must be done with great care
  – IEEE-754 floating point
What if…?
• What if we could make these applications run faster?
• What if we could shrink the hardware resources needed to run them?
• What if we could do it more aggressively than current methods?
• What if we took up gambling?
Introduction: Avoiding the Safe Option

• Standard formats (e.g. IEEE-754) are good for generality
• But we can do better
  – Optimise the data format for the job in hand
  – Use reconfigurable technology to change format on-the-fly
• Static analysis steps must be conservative
• Formal proofs are also conservative
• We could be more daring!
Motivation

• Graphics cards have made highly parallel vector processors commodity items
• In the high-performance computing world, FPGAs are available as acceleration cards
  – New HyperTransport cards provide faster communication channels
• Both provide speed increases if used effectively
  – But graphics cards only have single-precision f.p.
  – Simply implementing double-precision f.p. on FPGAs provides no benefits
• Optimise aggressively to provide reduced f.p. capabilities, within the needs of the application
FPGA Focus

• FPGAs provide a good prototyping platform
• Exploiting reconfiguration may provide further benefits
• Limitations:
  – Clock rates are much slower than attached processors
  – Communication channels are typically slow
• Acceleration comes from parallelism
• The rest of this talk considers FPGA/custom hardware implementations
The Problem

• Double-precision floating point on FPGAs uses a lot of area
• Density is improving, but we still want to squeeze more in!
  – Re-using hardware can reduce concurrency
• Scientific applications typically use 64-bit floating point
• Often full precision is (believed to be) required
  – Is this really the case?
• We have more options than single or double
Current Solutions for F.P. Minimisation

• Finding ‘minimal precision’:
  – Tools such as BitSize
  – Select precision for some operands; the tool calculates the rest
  – Test vectors used to gauge errors
• Reducing hardware area:
  – Replacing floating point by fixed point, transparently to the user (Cheung et al.)
  – The above is dangerous in scientific computations
  – It works in this case because trigonometric functions have defined ranges
Current Solutions for F.P. Minimisation (2)

• Approach by Dongarra et al.
  – Inspired by Cell single-precision f.p. performance
  – Use single precision most of the time with an iterative refinement algorithm
  – When approaching convergence, switch to double precision to finish off
  – Works on Cell and with SSE instructions
• Strodzka’s mixed-precision approach
  – Split the loop into low- and high-precision loops
  – Low-precision computation loop
  – High-precision correction loop
Profile-Guided Speculative Optimisation

• Three stages
  – Profile to find key kernels
  – Optimise the data format
  – Generate hardware with a fallback mechanism
• Optimised hardware should produce correct results for most calculations
• We need to know when it gets it wrong
[Workflow diagram: floating point application → floating point profiler (FloatWatch) → identify key kernels → select optimised data format for kernels → generate optimised hardware + error recovery → application with speculative f.p.]
Backup Option If We Get It Wrong!

• Aggressive optimisation
• We could get it wrong
• Must ensure we get correct answers
• We must guess correctly often enough to make falling back insignificant
[Flowchart: next floating point operation → operands in range? → yes: reduced floating point unit; no: software library or other fallback mechanism]
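The dispatch in the flowchart can be sketched in a few lines of Python. This is an illustration only, not the hardware mechanism: both paths here are ordinary float adds, and the exponent bounds are made-up placeholders standing in for a profiled range. The point is the guard plus hit/miss counting, which is what makes falling back "insignificant" measurable.

```python
import math

def full_precision_add(a, b):
    # Fallback path: stands in for the software library / generic unit
    return a + b

def reduced_add(a, b, exp_min=-64, exp_max=63):
    """Guarded add: operands in range -> reduced unit, otherwise -> fallback.
    exp_min/exp_max are hypothetical profiled bounds, not real figures."""
    def in_range(x):
        # frexp gives the binary exponent; zero is always acceptable
        return x == 0.0 or exp_min <= math.frexp(x)[1] <= exp_max
    if in_range(a) and in_range(b):
        reduced_add.hits += 1
        return a + b              # reduced floating point unit
    reduced_add.misses += 1
    return full_precision_add(a, b)

reduced_add.hits = reduced_add.misses = 0
reduced_add(1.0, 2.0)             # in range: reduced path
reduced_add(1e300, 2.0)           # exponent far above bound: fallback path
assert (reduced_add.hits, reduced_add.misses) == (1, 1)
```

The hit/miss ratio is exactly the quantity the speculation needs to be high for the optimisation to pay off.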
Optimisation Opportunities

• Reduce the floating point unit
  – Reduced precision
  – Restricted normalisation
• Use an alternative representation
  – Non-standard floating point (e.g. 48-bit)
  – Fixed point
  – Dual fixed-point
• Minimise redundancy
  – Remove denormal handling unless required
  – Remove or predict zero-value calculations
Background: Floating Point Primer

[Bit-layout diagram: sign (1 bit) | exponent (e bits, bias b = 2^(e-1) - 1) | significand (s bits)]
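The field layout can be checked concretely. A short Python sketch (not part of the talk's tooling) that unpacks an IEEE-754 double into the three fields:

```python
import struct

def decompose(x: float):
    """Split an IEEE-754 double into sign, biased exponent and significand bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                       # 1 bit
    exponent = (bits >> 52) & 0x7FF         # e = 11 bits, bias b = 2**(11-1) - 1 = 1023
    significand = bits & ((1 << 52) - 1)    # s = 52 bits (implicit leading 1 not stored)
    return sign, exponent, significand

# -1.5 = -1 * 2**0 * 1.5: sign set, biased exponent 0 + 1023, top fraction bit set
assert decompose(-1.5) == (1, 1023, 1 << 51)
```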
Simplified F.P. Adder

[Datapath diagram, operands as (sign, exp, significand): Compare Exponents (fractional parts must be aligned according to the exponent difference); Swap Operands (the smaller operand gets shifted); Align Operands (the implicit leading ‘1’ must be shifted in on the left); Add; Select Exponent (the larger exponent is used); Normalise (normalise the result, adjusting the exponent as necessary).]
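The adder stages can be mirrored on plain integers. This is a magnitude-only sketch under simplifying assumptions (no signs, no rounding on the alignment shift, no overflow handling); it exists to make the compare/swap/align/add/normalise sequence concrete, not to model a real unit.

```python
def fp_add(e1, m1, e2, m2, sbits=52):
    """Add two positive values given as (exponent, significand), where the
    significand carries the implicit leading '1' and sbits fraction bits,
    i.e. value = m / 2**sbits * 2**e."""
    # Compare Exponents / Swap Operands: the smaller operand gets shifted
    if e1 < e2:
        e1, m1, e2, m2 = e2, m2, e1, m1
    # Align Operands: shift right by the exponent difference (truncating)
    m2 >>= (e1 - e2)
    # Add; Select Exponent: the larger exponent is used
    m, e = m1 + m2, e1
    # Normalise: a carry-out needs one right shift and an exponent bump
    if m >> (sbits + 1):
        m >>= 1
        e += 1
    return e, m

# 1.5 * 2**1 + 1.0 * 2**0 = 4.0, with a 2-bit fraction for readability
assert fp_add(1, 6, 0, 4, sbits=2) == (2, 4)   # 4/4 * 2**2 = 4.0
```

The alignment and normalisation shifters are exactly the structures the later slides target for reduction.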
Reduce Hardware

• Example using MORPHY
• F.P. values are interesting
  – Most are confined to a narrow range
  – Different data sets do not vary the range
• The full range of double-precision floating point is not required
• Reduce the exponent
  – Limits the size of the shifting logic
  – Smaller data format = lower communication cost
[Histogram: count vs. value magnitude (approx. -256 to 256) for the Methane, Water and Peroxide datasets; all three occupy the same narrow range.]
Reduce Hardware – Alignment/Normalisation

• Most expensive step: shifting for add/subtract
  – Operand alignment
  – Normalisation
• Set limits on alignment to reduce hardware size
  – Trap to software to perform other alignments
• Provisional results: only shift-by-4 required for some applications
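The limited-alignment idea can be sketched as follows; `max_shift=4` echoes the provisional shift-by-4 result, and the exception stands in for the trap to the software path (the real mechanism is hardware, so this is only an analogy):

```python
def align_or_trap(e_large, m_small, e_small, max_shift=4):
    """Alignment stage with a hardware-style shift limit: shifts up to
    max_shift are handled 'in hardware'; anything larger traps out."""
    shift = e_large - e_small
    if shift > max_shift:
        # In the proposed scheme this would invoke the software library
        raise NotImplementedError("trap to software fallback")
    return m_small >> shift

assert align_or_trap(5, 0b11000, 2) == 0b11   # shift of 3: within the limit
```

A barrel shifter limited to 4 positions is far smaller than the full 52-position shifter a double-precision adder needs, which is where the area saving comes from.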
Alternative Representations #1: Custom Floating Point

• No need to use 64- or 32-bit
• Use a compromise instead; maybe 48-bit is enough?

  IEEE Double: 1 sign | exp (11) | mantissa (52)
  IEEE Single: 1 sign | exp (8)  | mantissa (23)
  Custom:      1 sign | exp (9)  | mantissa (38)

• Can we drop the sign bit?
• Reduce hardware
• Reduce communications
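The precision cost of the 48-bit compromise can be estimated in software. The sketch below quantises a double to a hypothetical 38-bit-mantissa format by truncating fraction bits; it deliberately ignores the narrower 9-bit exponent, overflow and denormals, since the aim is only to show the bounded rounding error:

```python
import struct

def to_custom(x, sbits=38):
    """Return the nearest-below value representable with sbits mantissa bits
    (exponent range untouched). Truncation only; a sketch, not a full codec."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    drop = 52 - sbits                    # fraction bits the custom format lacks
    bits &= ~((1 << drop) - 1)           # truncate the low fraction bits
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

x = 1.0 / 3.0
assert abs(x - to_custom(x)) < 2.0 ** (-38)   # error within one 38-bit ulp
assert to_custom(0.5) == 0.5                  # exact values pass through
```

Running a kernel's inputs through such a filter and comparing outputs is a cheap first test of whether 38 mantissa bits "is enough" before committing to hardware.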
Alternative Representations #2: Fixed Point

• For very narrow ranges, fixed point may be an option
• Must be treated with extreme care
• The dual fixed-point format provides another possibility
  – Two different formats with different fixed-point positions
  – 1 bit reserved to switch between formats
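A minimal encode/decode sketch of the dual fixed-point idea, assuming an illustrative 32-bit word with 28 or 12 fraction bits (these widths are invented for the example, not taken from any cited design); the top bit selects the format:

```python
def dfx_encode(x, total=32, frac_a=28, frac_b=12):
    """Pick format A (more fraction bits, small values) if x fits,
    else format B (more integer bits); top bit records the choice."""
    payload = total - 1
    for sel, frac in ((0, frac_a), (1, frac_b)):
        v = round(x * (1 << frac))
        if -(1 << (payload - 1)) <= v < (1 << (payload - 1)):
            return (sel << payload) | (v & ((1 << payload) - 1))
    raise OverflowError("value outside both formats")

def dfx_decode(word, total=32, frac_a=28, frac_b=12):
    payload = total - 1
    sel = word >> payload
    v = word & ((1 << payload) - 1)
    if v >= (1 << (payload - 1)):          # sign-extend the payload
        v -= 1 << payload
    return v / (1 << (frac_a if sel == 0 else frac_b))

assert dfx_decode(dfx_encode(1.5)) == 1.5        # small value: format A
assert dfx_decode(dfx_encode(1000.0)) == 1000.0  # large value: format B
```

The single select bit buys a much wider dynamic range than plain fixed point while keeping the datapath integer-only, which is the format's appeal on FPGAs.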
FloatWatch

• Valgrind-based value profiler
• Can return a number of metrics:
  – Floating point value ranges
  – Variation between 32-bit and 64-bit f.p. executions
  – Difference in magnitude between f.p. operations
• Each metric has uses for optimisation!
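FloatWatch itself instruments x86 binaries under Valgrind; as a rough offline illustration of its value-range metric, one can bin operands by sign and power-of-two magnitude, which is the shape the histograms on the following slides take:

```python
import math
from collections import Counter

def magnitude_histogram(values):
    """Approximate the value-range metric: count operands per sign and
    per power-of-two magnitude bucket (plus an explicit zero bucket)."""
    bins = Counter()
    for v in values:
        if v == 0.0:
            bins["zero"] += 1
        else:
            e = math.frexp(abs(v))[1] - 1     # floor(log2 |v|)
            bins[("-" if v < 0 else "+", e)] += 1
    return bins

h = magnitude_histogram([0.75, 0.6, -3.0, 0.0])
assert h[("+", -1)] == 2      # 0.75 and 0.6 both lie in [0.5, 1)
assert h[("-", 1)] == 1       # -3.0 lies in [2, 4)
```

A tight cluster of occupied buckets is exactly the evidence used later to argue that the full double-precision exponent range is unnecessary.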
FloatWatch

• Operates on x86 binaries under Valgrind
  – x86 machine code is converted to simplified SSA
  – FloatWatch inserts instrumentation code after floating point operations
  – SSA is converted back to x86 and cached
• Outputs a data file with selected metrics
• A processing script produces an HTML+JavaScript report

[Toolflow diagram: x86 binary → Valgrind + FloatWatch → raw output → FloatWatch post-processor (with source files in C, FORTRAN) → HTML report in a web browser; CSV export feeds graphing tools and user data manipulation.]
Report

• Dynamic HTML interface
  – Copy the HTML file from the computing cluster to a desktop; no installation required
• Select/deselect source lines and SSA “instructions”
  – Dynamic in-page graph
  – Table for exporting to gnuplot, Excel etc.
• View value ranges at instruction, source line, function, file and application levels
[Histogram: count (up to ~3×10^8) vs. value magnitude for the source line
UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
and each of its subexpressions (U(I,J), UNEW(I,J), UOLD(I,J), ALPHA, 2*U(I,J), UNEW(I,J)-2.*U(I,J), the bracketed sum, the ALPHA product and the full expression), shown alongside the simplified SSA (VEX IR) listing for the line.]
What does this tell us?

• ALPHA is constant (but we could have found that from the source)
• Memory operands all fall within the same range
• The result falls within the same range as the memory operands
• Intermediate values cause a shift in the range
• Optimisation: we do not need double precision
  – A custom floating point format would suffice
Profiling Results – SPECFP95 ‘swim’

[Histogram: count (up to ~6×10^9) vs. value magnitude, spanning roughly -1×2^47 through zero to 1×2^35. A sawtooth pattern is visible, caused by multiplication.]
‘swim’ Close-up

[Histogram: count (up to ~1.8×10^9) vs. value magnitude over the negative range -1×2^27 to -1×2^-53.]
Profiling Results – SPECFP95 ‘mgrid’

[Histogram: count (up to ~4×10^9) vs. value magnitude. A large spike of operations produce zero, and two ranges with similar shapes are visible.]
Range Close-up

[Histogram: count (up to ~4×10^9) vs. value magnitude over the negative range -1×2^7 to -1×2^-68.]
Profiling Results – MMVB

[Histogram: normalised count (0–120%) vs. value magnitude (approx. -2 to 1) for datasets 9a, 7, 6D2, 12_2 and 13_d3h.]

As with MORPHY, ranges are similar between datasets. But we were using test datasets.
Adaptive Floating Point

• The main focus of future work
• Dynamic modification of acceleration hardware
  – Exploit the reconfigurability of FPGAs
  – Reconfigure the device to meet application requirements
• Important considerations:
  – What happens to data already on the chip?
“Pipeline Prediction”

• Similar concept to branch prediction
• Build a selection of pipelines with different performance characteristics
  – Slow but generic version
  – Fast version with limited range and reduced operand alignment
  – Compromises in between
• Predict which version is best to use (how?)

[Diagram: full microprocessor with f.p. software library alongside a floating point datapath of add and multiply units.]
True Reconfiguration – Temporal Profiling

• Value ranges can vary
  – for different application phases
  – within loops iterating to convergence
• Potential to reconfigure hardware as phases change
• Particularly apparent when iterating to convergence
• Simple example using the Newton-Raphson method
  – Solve a·cos(x) − b·x³ + c = 0
  – Choose an estimate for the solution, iterate to refine
Predictable Behaviour: Newton’s Method

[Line chart: value vs. iteration (1–13) when a=1, b=1, c=0, for starting estimates from 0.1 to 1.2. Each line represents a different starting estimate; the actual value is around 0.86.]
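The iterates on the chart are straightforward to reproduce. A small Python sketch of Newton-Raphson on f(x) = a·cos(x) − b·x³ + c, returning the whole sequence so the narrowing value range (the temporal behaviour the slide is about) is visible:

```python
import math

def newton(a, b, c, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson on f(x) = a*cos(x) - b*x**3 + c; returns the iterates."""
    f  = lambda x: a * math.cos(x) - b * x ** 3 + c
    df = lambda x: -a * math.sin(x) - 3 * b * x ** 2
    xs = [x0]
    while len(xs) <= max_iter and abs(f(xs[-1])) > tol:
        xs.append(xs[-1] - f(xs[-1]) / df(xs[-1]))
    return xs

xs = newton(1, 1, 0, x0=1.0)
assert abs(xs[-1] - 0.865) < 1e-2   # root near 0.86, as the chart shows
```

Early iterates wander over a wide range while late ones differ only in low-order bits, which is what makes stepping the hardware precision up as convergence approaches plausible.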
Resolution Refinement

• Work by Dongarra et al. uses this technique
  – Start with SSE/32-bit f.p.
  – Iterate to “ballpark” convergence
  – Switch to 64-bit for a more precise result
• Alternative with reconfiguration
  – Multi-stepped refinement
  – Start at 8-bit; move to 24, 48, 64, 128 bits depending on the precision required
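A toy scalar analogue of the refinement idea (not the actual SSE/Cell linear-algebra kernels): do the expensive operation, here a division via a reciprocal, at single precision, then correct the residual at double precision. Single precision is simulated by round-tripping through a 32-bit float.

```python
import struct

def to_f32(x):
    # Round a double to single precision (stands in for the low-precision unit)
    return struct.unpack(">f", struct.pack(">f", x))[0]

def refine_divide(a, b, iters=3):
    """Compute a/b: low-precision reciprocal, double-precision residual
    corrections. Most work happens at the cheap precision."""
    inv = to_f32(to_f32(1.0) / to_f32(b))   # approximate reciprocal, ~2**-24 accurate
    x = inv * a                              # initial low-quality quotient
    for _ in range(iters):
        r = a - b * x                        # residual in double precision
        x += inv * r                         # cheap correction step
    return x

assert abs(refine_divide(1.0, 3.0) - 1.0 / 3.0) < 1e-15
```

Each correction multiplies the error by roughly the reciprocal's relative error, so a couple of iterations recover full double accuracy: the same convergence argument behind the mixed-precision solvers above.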
Problems with Our Approach

• No guarantees that values do not occur outside the identified ranges
  – We must have a backup plan if it goes wrong
• Not all applications will behave like MORPHY
  – Value ranges could vary wildly with different datasets
• Valgrind is slow
• Getting FPGAs to provide a speed-up can be difficult and painful
Future Work

• State-based profiling:
  – profile functions based on the call stack
  – allows context-dependent configurations
• Active simulation
  – Test new representations to check for errors
• Use results in practice
  – FPGA implementations for real applications
  – Adaptive representations
The Queen’s Tower
Imperial College London
South Kensington, SW7

Any Questions?

Jezebel
1916 Dennis ‘N’ Type Fire Engine
Royal College of Science Motor Club
Imperial College Union, SW7