Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley

Titanium: A High Performance Language Based on Java

Kathy Yelick
http://titanium.cs.berkeley.edu/

U.C. Berkeley

Also the UPC project at LBNL
http://upc.nersc.gov

Titanium Group (Past and Present)
• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls
• Andrew Begel
• Dan Bonachea
• Kaushik Datta
• David Gay
• Ed Givelberg
• Arvind Krishnamurthy
• Ben Liblit
• Peter McCorquodale (LBNL)
• Sabrina Merchant
• Carleton Miyamoto
• Chang Sun Lin
• Geoff Pike
• Luigi Semenzato (LBNL)
• Jimmy Su
• Tong Wen (LBNL)
• Siu Man Yau

(and many undergrad researchers)


Context
• Most parallel programs are written using explicit parallelism, either:
  • Message passing with an SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in C or Java
    • Usually for non-scientific applications
    • Easier to program, but usually provides less scalable performance
• Global Address Space languages take the best of both:
  • Global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • Local/global distinction, i.e., layout matters (performance)


Titanium
• Based on Java, a cleaner C++
  • Classes, automatic memory management, etc.
  • Compiled to C and then to a native binary (no JVM)
• Scalable parallelism model
  • SPMD with a global address space
• Optimizing compiler
  • Static (compile-time) optimizer, not a JIT
  • Communication and memory optimizations
  • Synchronization analysis (e.g., static barrier analysis)
  • Cache and other uniprocessor optimizations


Summary of Features Added to Java

1. Scalable parallelism (Java threads replaced)
2. Multidimensional arrays with iterators
3. Checked synchronization
4. Immutable (“value”) classes
5. Operator overloading
6. Templates
7. Zone-based memory management (regions)
8. Libraries for collective communication, distributed arrays, bulk I/O


Immutable Classes in Titanium
• For small objects, would sometimes prefer
  • To avoid a level of indirection
  • Pass by value (copying of the entire object)
  • Especially when immutable, i.e., fields are never modified
• Examples:
  • Complex type
  • Multiple fields (pressure, velocity, force) in a grid
• Titanium introduces immutable classes
  • All fields are final (constant), plus
  • The compiler implements them as above
• Note: considering an extension to allow mutation


Example of Immutable Classes

• An immutable class has few additions:

    immutable class Complex {               // "immutable" is the new keyword
        Complex () { real = 0; imag = 0; }  // zero-argument constructor required
        ...                                 // rest unchanged; no assignment to
                                            // fields outside of constructors
    }

• Use of immutable complex values:

    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1.add(c1);

• Addresses performance and programmability
  • Similar to structs in C in terms of performance
  • Adds support for complex types
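For readers who know standard Java better than Titanium, the nearest analogue is a final class whose fields are all final. The sketch below is only an approximation: the `add` method and the two-argument constructor are inferred from the usage on the slide, and standard Java still allocates each `Complex` on the heap, so it does not reproduce the by-value layout that Titanium's immutable classes are meant to provide.

```java
// Plain-Java approximation of the slide's immutable Complex class.
// Assumption: Titanium would store such values inline (no indirection);
// standard Java still heap-allocates each instance.
final class Complex {
    final double real;
    final double imag;

    // Zero-argument constructor, mirroring the one the slide says is required.
    Complex() {
        this(0.0, 0.0);
    }

    Complex(double real, double imag) {
        this.real = real;
        this.imag = imag;
    }

    // Returns a new value; no field is modified after construction.
    Complex add(Complex c) {
        return new Complex(real + c.real, imag + c.imag);
    }
}

class ComplexDemo {
    public static void main(String[] args) {
        Complex c1 = new Complex(7.1, 4.3);
        c1 = c1.add(c1); // rebinds the variable; the old object is unchanged
        System.out.println(c1.real + " + " + c1.imag + "i");
    }
}
```

Because every operation returns a fresh value, such objects can be copied freely without changing program meaning, which is the property that lets a compiler pass them by value.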


Operator Overloading
• Titanium adds operator overloading, important for readability in scientific code
  • Very similar to operator overloading in C++

    public Complex operator+(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
    }

    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1 + c1;

• Adds to programmability, not performance
• Must be used judiciously


Templates

• Many applications use containers:
  • E.g., arrays parameterized by dimensions, element types
  • Java supports this kind of parameterization through inheritance; Java templates are based on this as well
    • May only put Object types into containers
    • Inefficient when used extensively
• Titanium provides a template mechanism closer to that of C++
  • E.g., can instantiate with “double” or “immutable Complex”


Example of Templates

    template <class Element> class Stack {
        . . .
        public Element pop() {...}
        public void push( Element arrival ) {...}
    }

    template Stack<int> list = new template Stack<int>();  // int: not an object
    list.push( 1 );
    int x = list.pop();                // strongly typed, no dynamic cast

• Addresses programmability and performance
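To make the contrast with inheritance-based containers concrete, the sketch below compares a standard Java generic container, which must box each `int` into an `Integer`, with a hand-written specialization that keeps the elements primitive. `IntStack` is a hypothetical class written for this illustration; it stands in for what instantiating the slide's `Stack` template with `int` would provide, and is not Titanium code.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hypothetical hand-written specialization: stores raw ints, no boxing.
final class IntStack {
    private int[] items = new int[8];
    private int size = 0;

    void push(int arrival) {
        if (size == items.length) {
            items = Arrays.copyOf(items, 2 * size); // grow when full
        }
        items[size++] = arrival;
    }

    int pop() {
        if (size == 0) {
            throw new IllegalStateException("empty stack");
        }
        return items[--size];
    }

    boolean isEmpty() {
        return size == 0;
    }
}

class StackDemo {
    public static void main(String[] args) {
        // Standard Java generics: each int is boxed into an Integer object.
        ArrayDeque<Integer> boxed = new ArrayDeque<>();
        boxed.push(1);
        int x = boxed.pop();

        // Specialized version: elements stay primitive, no cast on pop.
        IntStack list = new IntStack();
        list.push(1);
        int y = list.pop();
        System.out.println(x == y);
    }
}
```

A template mechanism generates the specialized form automatically for each element type, so the programmer writes the container once.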


Multidimensional Arrays
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked
  • Safe but potentially slow
• Multidimensional arrays as arrays-of-arrays
  • General, but may be slow due to memory layout and difficulty of compiler analysis
  • Hand-coding (array libraries) can confuse the optimizer


Multidimensional Arrays in Titanium

• New kind of multidimensional array added
  • Sub-arrays are supported
  • Indexed by Points (tuple of ints)
• Very expressive sub-array support, e.g.,
  • Can refer to a row or column as a sub-array
  • Can refer to the boundary region of an array
• Optimized by the compiler for caches
• Addresses programmability and performance
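One way to picture sub-arrays that share storage with their parent is a flat buffer plus an offset and a stride. The sketch below is a plain-Java illustration under that assumption; `Grid2D`, `slice`, `row`, and `column` are hypothetical names chosen for this example and are not Titanium's array API.

```java
// Hypothetical 2D array stored in one flat row-major buffer.
// Sub-arrays are views: they share the buffer and copy no data.
final class Grid2D {
    private final double[] data; // shared by a grid and all of its views
    private final int offset;    // position of element (0, 0) in data
    private final int rows;
    private final int cols;
    private final int stride;    // distance in data between consecutive rows

    Grid2D(int rows, int cols) {
        this(new double[rows * cols], 0, rows, cols, cols);
    }

    private Grid2D(double[] data, int offset, int rows, int cols, int stride) {
        this.data = data;
        this.offset = offset;
        this.rows = rows;
        this.cols = cols;
        this.stride = stride;
    }

    int rows() { return rows; }
    int cols() { return cols; }

    // Bounds-checked mapping from a 2D index to a position in the buffer.
    private int index(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return offset + i * stride + j;
    }

    double get(int i, int j) { return data[index(i, j)]; }
    void set(int i, int j, double v) { data[index(i, j)] = v; }

    // View of rows [r0, r1) and columns [c0, c1) of this grid.
    Grid2D slice(int r0, int r1, int c0, int c1) {
        if (r0 < 0 || r1 > rows || r0 > r1 || c0 < 0 || c1 > cols || c0 > c1) {
            throw new IndexOutOfBoundsException("bad slice bounds");
        }
        return new Grid2D(data, offset + r0 * stride + c0, r1 - r0, c1 - c0, stride);
    }

    Grid2D row(int i) { return slice(i, i + 1, 0, cols); }
    Grid2D column(int j) { return slice(0, rows, j, j + 1); }
}
```

Writing through `g.column(2)` changes the parent grid `g`, which is the behavior that makes rows, columns, and boundary regions cheap to name and pass around.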


Unordered iteration

• With arrays, Titanium adds unordered iteration
  • Helps the compiler with loop analysis
  • Also avoids some indexing details

    foreach (p within A.domain()) { A[p]... }

  • p is a Point (tuple of ints) that can be used to index arrays
  • Works for arrays of any dimension
• Provides programmability and performance
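The idea of iterating over a domain of points, independent of dimension, can be sketched in plain Java as follows. `RectDomain` here is a hypothetical class written for this illustration, with points represented as `int[]`. The sketch visits points in lexicographic order, which is just one of the orders an unordered `foreach` would be free to choose.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical rectangular index domain with inclusive bounds per dimension.
final class RectDomain implements Iterable<int[]> {
    private final int[] lo;
    private final int[] hi;

    RectDomain(int[] lo, int[] hi) {
        if (lo.length != hi.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        this.lo = lo.clone();
        this.hi = hi.clone();
    }

    private boolean isEmpty() {
        for (int d = 0; d < lo.length; d++) {
            if (lo[d] > hi[d]) return true;
        }
        return false;
    }

    @Override
    public Iterator<int[]> iterator() {
        return new Iterator<int[]>() {
            // Next point to hand out, or null when the domain is exhausted.
            private int[] cursor = isEmpty() ? null : lo.clone();

            @Override
            public boolean hasNext() {
                return cursor != null;
            }

            @Override
            public int[] next() {
                if (cursor == null) throw new NoSuchElementException();
                int[] point = cursor.clone();
                advance();
                return point;
            }

            // Odometer-style increment: bump the last dimension, carry leftward.
            private void advance() {
                for (int d = cursor.length - 1; d >= 0; d--) {
                    if (cursor[d] < hi[d]) {
                        cursor[d]++;
                        return;
                    }
                    cursor[d] = lo[d];
                }
                cursor = null; // every dimension wrapped around
            }
        };
    }
}
```

A loop such as `for (int[] p : domain) { ... }` then has the same body regardless of whether the domain is 1D, 2D, or 3D.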


Parallelism Model
• Titanium starts a copy of “main” on each processor (SPMD parallelism)
  • Only major restriction to Java semantics
  • Replaces Java’s thread model
• Many programs written with more general threads do:

    for i = 1 to p fork

• Handling dynamic thread creation on 1000s of processors is difficult
• Design is purely a performance consideration; dynamic threads are a future direction
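The SPMD pattern can be imitated in standard Java by starting a fixed number of threads that all run the same body and differ only in their rank. This is a sketch for illustration, not how Titanium is implemented: `Spmd`, `Body`, `launch`, and `parallelSum` are hypothetical names, and Java threads in one JVM stand in for processes on separate processors.

```java
import java.util.concurrent.CyclicBarrier;

final class Spmd {
    // The single program that every rank executes.
    interface Body {
        void run(int rank, int numProcs, CyclicBarrier barrier) throws Exception;
    }

    // Starts numProcs copies of body and waits for all of them to finish.
    static void launch(int numProcs, Body body) throws InterruptedException {
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];
        for (int p = 0; p < numProcs; p++) {
            final int rank = p;
            procs[p] = new Thread(() -> {
                try {
                    body.run(rank, numProcs, barrier);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            procs[p].start();
        }
        for (Thread t : procs) {
            t.join();
        }
    }

    // Sums 1..n: each rank adds its share, then rank 0 combines the partial sums.
    static long parallelSum(int n, int numProcs) throws InterruptedException {
        long[] partial = new long[numProcs];
        long[] total = new long[1];
        launch(numProcs, (rank, procs, barrier) -> {
            for (int i = rank + 1; i <= n; i += procs) {
                partial[rank] += i;   // cyclic distribution of the work
            }
            barrier.await();          // all partial sums are written past this point
            if (rank == 0) {
                for (long s : partial) total[0] += s;
            }
        });
        return total[0];
    }
}
```

The number of participants is fixed at launch, so every rank reaches the same barrier; that regularity is what makes static analysis of synchronization feasible in an SPMD language.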


Global Address Space

• Shared address space is partitioned
• References (pointers) are either local or global (meaning possibly remote)

[Figure: global address space spanning processors p0, p1, ..., pn. Object heaps are shared; program stacks are private. Each processor's stack holds a local reference (l) and a global reference (g); the heaps hold objects with fields x and y.]


Communication
• Titanium has explicit global communication:
  • Broadcast, reduction, etc.
  • Primarily used to set up distributed data structures
• Most communication is implicit through the shared address space
  • Dereferencing a global reference, g.x, can generate communication
  • Arrays have copy operations, which generate bulk communication: A1.copy(A2)


Distributed Data Structures
• Building distributed arrays:

    Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs-1][1d];
    Particle [1d] myParticle = new Particle [0:myParticleCount-1];
    allParticle.exchange(myParticle);

• Now each processor has an array of pointers, one to each processor’s chunk of particles

[Figure: processors P0, P1, P2 performing an all-to-all broadcast]
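The effect of the exchange step can be mimicked with Java threads: each rank allocates its own chunk, publishes a reference to it in a directory slot, and after a barrier every rank can reach every chunk. This is a loose sketch under simplifying assumptions. `ExchangeDemo` is a hypothetical class, particles are reduced to `double` values, each rank owns `rank + 1` particles purely for illustration, and a single shared directory stands in for the per-processor copies that the slide describes.

```java
import java.util.concurrent.CyclicBarrier;

final class ExchangeDemo {
    // Returns, for each rank, the total number of particles it can reach
    // through the directory after the exchange.
    static int[] run(int numProcs) throws InterruptedException {
        double[][] allParticle = new double[numProcs][]; // one slot per rank
        int[] visible = new int[numProcs];
        CyclicBarrier barrier = new CyclicBarrier(numProcs);
        Thread[] procs = new Thread[numProcs];

        for (int p = 0; p < numProcs; p++) {
            final int rank = p;
            procs[p] = new Thread(() -> {
                // Illustrative assumption: rank r owns r + 1 particles.
                double[] myParticle = new double[rank + 1];
                allParticle[rank] = myParticle; // publish this rank's chunk
                try {
                    barrier.await();            // wait until every slot is filled
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                for (double[] chunk : allParticle) {
                    visible[rank] += chunk.length;
                }
            });
            procs[p].start();
        }
        for (Thread t : procs) {
            t.join();
        }
        return visible;
    }
}
```

Only references are exchanged; the particle data stays where it was allocated, and later accesses through another rank's slot correspond to the remote dereferences or bulk copies described on the Communication slide.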


Global Address Space

• Communication through the global address space is designed for
  • Productivity: explicit representation of distributed data structures
  • Performance: exploits efficient one-sided communication (remote put/get) when it exists
  • Tunability: shared memory style uses more global dereferences; distributed style uses more array copies


Region-Based Memory Management
• Extension of Java’s implicit memory management
• Regions are still “safe”, but can avoid or reduce the need for distributed garbage collection

    PrivateRegion r = new PrivateRegion();
    for (int j = 0; j < 10; j++) {
        int[] x = new ( r ) int[j + 1];   // allocate x in region r
        work(j, x);
    }
    try { r.delete(); }                   // frees everything allocated in r
    catch (RegionInUse oops) {
        System.out.println("failed to delete");
    }

• Designed for performance


Applications in Titanium
• Several applications and benchmarks
  • Heart simulation
  • Fluid solvers with Adaptive Mesh Refinement (AMR)
  • Dense linear algebra: LU, MatMul
  • Unstructured mesh kernel: EM3D
  • Finite element benchmark
  • Genetics: micro-array selection
  • Tree-structured n-body code


3D AMR Gas Dynamics
• Hyperbolic Solver [McCorquodale and Colella]
  • Implementation of the Berger-Colella algorithm
  • Mesh generation algorithm included
• 2D Example (3D supported)
  • Mach-10 shock on a solid surface at an oblique angle
• Future: self-gravitating gas dynamics package


Heart Simulation
• Immersed Boundary Method [Peskin/McQueen, Yau]
  • Fibers (e.g., heart muscles) modeled by a list of fiber points
  • Fluid space modeled by a regular lattice
• Irregular fiber lists need to interact with the regular fluid lattice
  • Trade-off between load balancing of fibers and minimizing communication
  • Memory and communication intensive
• Uses several parallel numerical kernels
  • Navier-Stokes solver
  • 3-D FFT solver
  • Soon to be enhanced using an adaptive multigrid solver (possibly written in KeLP)


Productivity Measures

• Performance
• Programmability
• Robustness
• Portability


Serial Performance (Pure Java)

[Figure: bar chart, "Performance on a Pentium IV (1.5GHz)". Y-axis: MFlops (0 to 450). Benchmarks: Overall, FFT, SOR, MC, Sparse, LU. Series: java, C (gcc -O6), Ti, Ti -nobc.]

Note the Ti/Java numbers use Java arrays, not Titanium arrays

Ti -nobc is with bounds-checking disabled


Parallel Performance and Scalability
• Poisson solver using “Method of Local Corrections”
• Communication < 5%; scaled speedup nearly ideal (flat)

[Figures: results on the IBM SP at SDSC and the Cray T3E at NERSC]


Performance Tuning by Compiler

[Figure: "Scale performance" for a scaled version of GUPS. X-axis: percentage of remote accesses (5, 10, 30, 50, 80, 90, 95). Y-axis: time in seconds (0 to 12). Series: 1 thread; 2 thread; 4 thread; 1 thread, write pipeline; 1 thread, read pipeline.]

Advantage of compiled languages (Berkeley UPC compiler)


Programmability

• Heart simulation developed in ~1 year
  • Extended to support 2D structures for Cochlea model in ~1 month
• Preliminary code length measures
  • Simple torus model
    • Serial torus code is 17045 lines long (2/3 comments)
    • Parallel Titanium torus version is 3057 lines long
  • Full heart model
    • Shared memory Fortran heart code is 8187 lines long
    • Parallel Titanium version is 4249 lines long
• Need to be analyzed more carefully, but not a significant overhead for distributed memory parallelism


Robustness

• Robustness is the primary motivation for language “safety” in Java
  • Type-safe, array bounds checked, automatic memory management
  • Study on C++ vs. Java from Phipps at Spirus:
    • C++ has 2-3x more bugs per line than Java
    • Java had 30-200% more lines of code per minute
• Extended in Titanium
  • Checked synchronization avoids barrier deadlocks
  • More abstract array indexing, retains bounds checking
• No attempt to quantify this for Titanium yet
  • Would like to measure speed of error detection (compile time, runtime exceptions, etc.)


Portability

• Heart code and other applications run anywhere Titanium runs
• Runs on serial or shared memory machines with a native C compiler
  • Including my laptop!
  • Very important for programmer productivity
• For distributed memory, requires a communication layer
  • Alpha/Quadrics, IBM SP, Cray T3E, PC/Myrinet, anything with MPI
  • Global Address Space Networking layer (GASNet)
    – With a C compiler, get the Titanium and LBNL/UPC compilers
• FFTW used in heart code: strategy for performance and portability


Performance and Portability Approach

• Use machines, not humans, for architecture-specific tuning
  • Code generation + search-based selection
  • Can adapt to cache size, # registers, network buffering
• Used in
  • Signal processing: FFTW, SPIRAL, UHFFT
  • Dense linear algebra: Atlas, PHiPAC
  • Sparse linear algebra: Sparsity
  • Rectangular grid-based computations: Titanium compiler
  • Global communication: Atlas-derivative
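Search-based selection can be illustrated with a toy example: run the same kernel with several candidate tuning parameters on the target machine and keep the fastest. The sketch below does this for the block size of a blocked matrix multiply. `BlockSizeSearch`, `multiply`, and `selectBlockSize` are hypothetical names for this illustration; the sketch is not taken from any of the tools listed above, and a realistic tuner would repeat timings and warm up the JIT before measuring.

```java
import java.util.Arrays;

final class BlockSizeSearch {
    // Blocked multiply c += a * b for n-by-n row-major matrices.
    static void multiply(double[] a, double[] b, double[] c, int n, int block) {
        for (int ii = 0; ii < n; ii += block) {
            for (int kk = 0; kk < n; kk += block) {
                for (int jj = 0; jj < n; jj += block) {
                    int iEnd = Math.min(ii + block, n);
                    int kEnd = Math.min(kk + block, n);
                    int jEnd = Math.min(jj + block, n);
                    for (int i = ii; i < iEnd; i++) {
                        for (int k = kk; k < kEnd; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jEnd; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }

    // Times each candidate block size once and returns the fastest one.
    static int selectBlockSize(int n, int[] candidates) {
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);

        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int block : candidates) {
            double[] c = new double[n * n];
            long start = System.nanoTime();
            multiply(a, b, c, n, block);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = block;
            }
        }
        return best;
    }
}
```

A call such as `BlockSizeSearch.selectBlockSize(256, new int[]{8, 16, 32, 64})` may return different values on different machines, which is the point: the choice follows the measured hardware rather than a programmer's guess.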


Summary

• Titanium designed for performance and programmability
  • Some compromises (regions, local/global refs)
• Retains robustness (safety) of Java
  • Also a big help in learning, avoiding certain kinds of bugs
• Tunability and performance transparency
  • Aggressive automatic optimizations can make this worse
• Advertising (all open source):
  • Titanium compiler: http://titanium.cs.berkeley.edu
  • Berkeley UPC compiler: http://upc.nersc.gov
  • Automatic tuning: http://www.cs.berkeley.edu/~richie/bebop
