Performance Analysis of Generics in Scientific Computing
Laurentiu Dragan, Stephen M. Watt
Ontario Research Centre for Computer Algebra
University of Western Ontario
SYNASC 2005
Overview
Motivation
Parametric Polymorphism Implementation
Generalizing a Numeric Benchmark
Language Issues
Results
Potential Optimizations
Conclusion
Motivation
Increasing demand for generic code
Scientific code requires high performance, which makes optimizations very important
Generic code is not as fast as specialized code
No tools to measure the performance of generic code
Benchmarks: tools to measure performance
SciGMark: a benchmark for generic code
Compilers should optimize generic code so that its performance is close to hand-specialized code
Parametric Polymorphism Implementation
Some languages with support for generics
- Aldor, C++
- Java, C#
Some types can be given as parameters
Implementations
- Homogeneous: Java, C#
  Share the generic code
  Example: Vector<Integer> → Vector with elements of type Object
- Heterogeneous: C++, C#
  Specialize the generic code
  Example: std::vector<int> → new specialized class
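The homogeneous approach can be observed directly in Java. The sketch below is illustrative only (the class name ErasureDemo is not from the talk): two differently parameterized lists turn out to share a single runtime class, because the type argument is erased at compile time.

```java
import java.util.ArrayList;
import java.util.List;

// Homogeneous generics: one compiled class serves every instantiation.
class ErasureDemo {
    // Returns true when two differently parameterized lists share one runtime class.
    static boolean sharesRuntimeClass() {
        List<Integer> ints = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        return ints.getClass() == strings.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sharesRuntimeClass()); // prints "true"
    }
}
```

A heterogeneous implementation such as C++ templates would instead generate a separate class for each instantiation.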
Generalizing a Numeric Benchmark
SciMark 2
Polynomial Multiplication
Implemented in Aldor, C++, C#, Java
SciMark 2
Fast Fourier transform: 1024
- Complex arithmetic, shuffling, non-constant memory references, trigonometric functions
Jacobi successive over-relaxation: 100x100
- Typical access patterns in finite difference applications
Monte Carlo integration
- Random number generator, function inlining
Sparse matrix multiplication: 1000, 5000 non-zeros
- Indirect addressing, irregular memory references
Dense LU factorization: 100x100
- Dense matrix operations
From SciMark 2 to SciGMark
SciMark: double is hardcoded
- Arrays are of type double
- Any change requires extensive modifications to the code
SciGMark: classes are parametric
- Changing the representation requires minimal code changes
- double becomes the parameter R
Diagram: double is replaced by the type parameter R (instantiated with DoubleRing), and the + operator by the methods R a(R o) and void ae(R o).
SciMark:
    class SOR { double[] array; }
SciGMark:
    class SOR<R extends IRing<R>> { R[] array; }
Basic Generic Types
IRing
- Provides operations for addition, subtraction, multiplication, and division, in mutating and non-mutating forms
- Conversions to and from int and double
- Factories to produce new elements of these types
DoubleRing: wrapper for double
- Implements IRing
Complex
- Implements IComplex (a simple extension to IRing)
- Complex<R extends IRing<R>> implements IComplex<Complex<R>,R>
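A minimal sketch of how such types might look in Java. The interface here is a reduced, hypothetical version: the method names a and ae are read as "add" and "add in place", which is an assumption, and fromDouble, toDouble, and the class RingSketch are invented for illustration. The generic sum works for any ring but stores every element as an object, while the specialized sum works on primitive doubles.

```java
class RingSketch {
    // Reduced, hypothetical version of IRing: the talk's interface also covers
    // subtraction, multiplication, division, and int conversions.
    interface IRing<R> {
        R a(R o);               // non-mutating add: returns a new element (meaning assumed)
        void ae(R o);           // mutating add: this += o (meaning assumed)
        R fromDouble(double d); // factory method (name assumed)
    }

    // Wrapper for double that implements the ring operations.
    static final class DoubleRing implements IRing<DoubleRing> {
        private double value;
        DoubleRing(double value) { this.value = value; }
        public DoubleRing a(DoubleRing o) { return new DoubleRing(value + o.value); }
        public void ae(DoubleRing o) { value += o.value; }
        public DoubleRing fromDouble(double d) { return new DoubleRing(d); }
        double toDouble() { return value; }
    }

    // Generic code: works for any ring, but each element is a heap object.
    static <R extends IRing<R>> R sum(R[] xs, R seed) {
        R acc = seed.fromDouble(0.0); // the seed doubles as an object factory
        for (R x : xs) acc.ae(x);
        return acc;
    }

    // Hand-specialized counterpart on primitive doubles.
    static double sum(double[] xs) {
        double acc = 0.0;
        for (double x : xs) acc += x;
        return acc;
    }

    public static void main(String[] args) {
        DoubleRing[] xs = { new DoubleRing(1.5), new DoubleRing(2.5) };
        System.out.println(sum(xs, xs[0]).toDouble());      // prints 4.0
        System.out.println(sum(new double[] { 1.5, 2.5 })); // prints 4.0
    }
}
```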
Generic Tests
GenFFT
- Uses R: Complex<DoubleRing>
- Complex numbers: two consecutive entries in the array
  Depending on the application, a different representation may be used (e.g. Hermitian matrix)
GenMat, GenLU
- Use R: DoubleRing
- The classes contain more methods, so the whole class takes a type parameter
GenSOR, GenMonteCarlo
- Use R: DoubleRing
- Have a single static method with a type parameter
Polynomial Multiplication
40 coefficients
Dense representation: one-dimensional array
Regular memory access, creation of temporary objects (memory allocation)
Implementation
- DensePolynomial
  DensePolynomialG<E extends IRing<E>> implements IRing<DensePolynomialG<E>>
- SmallPrimeField
  Represented by an int
  SmallPrimeFieldG implements IRing<SmallPrimeFieldG>
Specializing Polynomial Multiplication
The code was initially implemented using generics
Inlined all the calls to SmallPrimeField
Replaced all the instances of SmallPrimeField with int
Essentially the inverse of the operation performed to "generalize" SciMark
No changes to the algorithm: all changes could be performed automatically
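The specialization step can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: the modulus P = 101, the reduced IRing interface, the method names, and the class PolyMulDemo are all hypothetical, and coefficients are assumed to be non-negative and smaller than P. Both versions run the same schoolbook algorithm; the specialized one replaces the wrapper type with int and inlines its operations.

```java
import java.util.Arrays;

class PolyMulDemo {
    static final int P = 101; // small prime modulus, chosen only for illustration

    // Minimal ring interface (hypothetical reduction of IRing).
    interface IRing<E> {
        E add(E o);
        E mul(E o);
    }

    // Generic field element: an int wrapped in an object.
    static final class SmallPrimeFieldG implements IRing<SmallPrimeFieldG> {
        final int v;
        SmallPrimeFieldG(int v) { this.v = v % P; } // assumes v >= 0
        public SmallPrimeFieldG add(SmallPrimeFieldG o) { return new SmallPrimeFieldG(v + o.v); }
        public SmallPrimeFieldG mul(SmallPrimeFieldG o) { return new SmallPrimeFieldG(v * o.v); }
    }

    // Generic dense multiplication: allocates a temporary object per operation.
    static <E extends IRing<E>> E[] mulGeneric(E[] a, E[] b, E zero) {
        E[] c = Arrays.copyOf(a, a.length + b.length - 1); // reuses a's runtime array type
        Arrays.fill(c, zero);
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                c[i + j] = c[i + j].add(a[i].mul(b[j]));
        return c;
    }

    // Specialized version: same algorithm, SmallPrimeFieldG replaced by int, calls inlined.
    static int[] mulSpecialized(int[] a, int[] b) {
        int[] c = new int[a.length + b.length - 1];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                c[i + j] = (c[i + j] + a[i] * b[j]) % P;
        return c;
    }

    public static void main(String[] args) {
        // (1 + 2x)(3 + x) = 3 + 7x + 2x^2
        int[] c = mulSpecialized(new int[] { 1, 2 }, new int[] { 3, 1 });
        System.out.println(Arrays.toString(c)); // prints [3, 7, 2]
    }
}
```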
Language Issues
Java
- No operator overloading
- Homogeneous: erasure technique, subclassing
- Implemented at the language level with no virtual machine support; the resulting limitations require an object factory
- Type inference for generics is invariant: pass the type as an argument
  Complex<R extends IRing<R>> implements IComplex<Complex<R>,R>
C#
- Reference types are homogeneous (as in Java); primitive types are heterogeneous (as in C++)
- Structures instead of classes: structures in collections are boxed
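The object-factory requirement in Java follows from erasure: generic code cannot write new R() or new R[n]. A minimal sketch of the workaround is shown below. It uses java.util.function, which is newer than the JDK 1.5 used in the talk, and the names FactoryDemo, Cell, and filled are hypothetical.

```java
import java.util.function.IntFunction;
import java.util.function.Supplier;

class FactoryDemo {
    // Placeholder element type, used only for this illustration.
    static final class Cell {
        double v;
    }

    // R is erased at run time, so "new R()" and "new R[n]" do not compile here.
    // The caller passes factories for the array and for its elements instead.
    static <R> R[] filled(int n, IntFunction<R[]> newArray, Supplier<R> newElement) {
        R[] out = newArray.apply(n);
        for (int i = 0; i < n; i++) out[i] = newElement.get();
        return out;
    }

    public static void main(String[] args) {
        Cell[] cells = filled(3, Cell[]::new, Cell::new);
        System.out.println(cells.length); // prints 3
    }
}
```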
Language Issues
C++
- Heterogeneous
- Parametric polymorphism (templates) works like a macro processor
- No bounded polymorphism
- No way to test a generic class until it is instantiated
Aldor
- Homogeneous
- Supports dependent types
- Polymorphic types are constructed using domain-constructing functions
SciGMark Results
Results in MFlops
Testing environment:
- Pentium IV, 3.2 GHz (1 MB cache), 2 GB RAM
- Windows XP SP2
- Cygwin/GCC 3.4.4
- Sun JDK 1.5.0_04
- Microsoft .NET v2.0.50215
- Aldor 1.0.2
SciGMark Results
N/A35920320244415743471Comp.
401566321282274836562PM
100x10055354031898274780103LU
1000, 500048544773941011173987MM
N/A20390622826226546MC
100x10041715417226816641971SOR
1024340124273212336559FFT
SpeGenSpeGenSpeGenSpeGenSize
AldorC#JavaC++Test
Aldor Results
Testing environment:
- Pentium IV, 3.2 GHz (1 MB cache), 2 GB RAM
- Linux Fedora Core 3
- Aldor 1.0.2
Stanford benchmark
- Aldor's performance can be almost as good as C++
Aldor Results
Test            Aldor               C++
                Time   Iterations   Time   Iterations
Permutations    0.43   23400        0.37   26901
Towers          0.58   17297        0.21   46924
8-Queen         0.52   19700        0.45   21987
Mat Mult        0.65   15386        0.20   49155
Puzzle          2.89   3484         2.16   4626
Quick Sort      0.79   12538        0.66   15214
Bubble Sort     0.74   13526        0.53   19089
Tree Sort       1.00   10           2.00   10
FP Mat Mult     0.69   14342        0.49   20355
Oscar FFT       0.38   26838        0.24   40719
Comp FP         1.07                1.05
Comp int        1.43                1.29
Potential Optimizations
6 to 18 times performance improvement
Specialized code
- Same algorithm
- Generic types replaced by specialized types
- Eliminate generic wrapper objects: use primitive types
Test Case: Aldor
Domain-producing function:
    PolynomialVect(C: Ring) == add {
        Rep == Vector Polynomial C;
        (f: %) + (g: %): % == {
            res := new(#f);
            rf := rep f; rg := rep g;
            for k in 1..#f for i in rf for j in rg repeat
                res(k) := i + j;
            per res
        }
    }
    PC == PolynomialVect(Complex DoubleFloat);
    PQ == PolynomialVect(Rational);
Test Case: Aldor
Specialize the domain-producing function
    PC == add {
        Rep == Vector Polynomial Complex DoubleFloat;
        (f: %) + (g: %): % == {
            res := new(#f);
            rf := rep f; rg := rep g;
            for k in 1..#f for i in rf for j in rg repeat
                res(k) := i + j;    -- '+' from Complex
            per res
        }
    }
Optimize Data Representation
Scalar product of vectors of complex numbers
    dot(u: Vector Complex R, v: Vector Complex R): Complex R == {
        s: Complex R := 0;
        for i in 1..n repeat
            s := s + u.i*v.i;
        return s;
    }

    dot(u: Vector Complex R, v: Vector Complex R): Complex R == {
        x: R := 0;
        y: R := 0;
        for i in 1..n repeat {
            x := x + real(u.i)*real(v.i) - imag(u.i)*imag(v.i);
            y := y + real(u.i)*imag(v.i) + imag(u.i)*real(v.i);
        }
        return complex(x,y);
    }
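The same transformation can be sketched in Java, assuming a small immutable Complex class over double. The class and method names are hypothetical and this is not the benchmark's code. The first version allocates two temporary Complex objects per iteration; the second accumulates the real and imaginary parts in scalars and allocates only the result.

```java
class DotDemo {
    // Immutable complex number over double (illustrative only).
    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
        Complex add(Complex o) { return new Complex(re + o.re, im + o.im); }
        Complex mul(Complex o) {
            return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
        }
    }

    // Direct version: one product object and one sum object per iteration.
    static Complex dotObjects(Complex[] u, Complex[] v) {
        Complex s = new Complex(0.0, 0.0);
        for (int i = 0; i < u.length; i++) s = s.add(u[i].mul(v[i]));
        return s;
    }

    // Optimized representation: the accumulator is unpacked into two scalars.
    static Complex dotScalars(Complex[] u, Complex[] v) {
        double x = 0.0;
        double y = 0.0;
        for (int i = 0; i < u.length; i++) {
            x = x + u[i].re * v[i].re - u[i].im * v[i].im;
            y = y + u[i].re * v[i].im + u[i].im * v[i].re;
        }
        return new Complex(x, y);
    }

    public static void main(String[] args) {
        Complex[] u = { new Complex(1, 2), new Complex(3, 4) };
        Complex[] v = { new Complex(5, 6), new Complex(7, 8) };
        Complex s = dotScalars(u, v);
        System.out.println(s.re + " " + s.im); // prints -18.0 68.0
    }
}
```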
Conclusion
Generics are important for scientific computing: rich mathematical models are easy to implement with generic code
A tool is needed to measure a compiler's ability to produce efficient code
We have seen differences of 6 to 18 times between generic and specialized code, which leaves room for improvement in compiler capabilities
Presented some optimization ideas
http://www.orrca.on.ca/benchmarks/scigmark/1.0/