Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley
Titanium
Titanium: A High Performance Language Based on Java
Kathy Yelick
http://titanium.cs.berkeley.edu/
U.C. Berkeley
Also the UPC project at LBNL
http://upc.nersc.gov
Titanium Group (Past and Present)
• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls
• Andrew Begel
• Dan Bonachea
• Kaushik Datta
• David Gay
• Ed Givelberg
• Arvind Krishnamurthy
• Ben Liblit
• Peter McCorquodale (LBNL)
• Sabrina Merchant
• Carleton Miyamoto
• Chang Sun Lin
• Geoff Pike
• Luigi Semenzato (LBNL)
• Jimmy Su
• Tong Wen (LBNL)
• Siu Man Yau
(and many undergrad researchers)
Context
• Most parallel programs are written using explicit parallelism, either:
  • Message passing with an SPMD model
    – Usually for scientific applications with C++/Fortran
    – Scales easily
  • Shared memory with threads in C or Java
    – Usually for non-scientific applications
    – Easier to program, but usually provides less scalable performance
• Global Address Space Languages take the best of both
  • global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters (performance)
Titanium
• Based on Java, a cleaner C++
  • classes, automatic memory management, etc.
  • compiled to C and then native binary (no JVM)
• Scalable parallelism model
  • SPMD with a global address space
• Optimizing compiler
  • static (compile-time) optimizer, not a JIT
  • communication and memory optimizations
  • synchronization analysis (e.g. static barrier analysis)
  • cache and other uniprocessor optimizations
Summary of Features Added to Java
1. Scalable parallelism (Java threads replaced)
2. Multidimensional arrays with iterators
3. Checked synchronization
4. Immutable (“value”) classes
5. Operator overloading
6. Templates
7. Zone-based memory management (regions)
8. Libraries for collective communication, distributed arrays, bulk I/O
Immutable Classes in Titanium
• For small objects, would sometimes prefer
  • to avoid level of indirection
  • pass by value (copying of entire object)
  • especially when immutable (fields never modified)
• Examples:
  • complex type
  • multiple fields (pressure, velocity, force) in a grid
• Titanium introduces immutable classes
  • all fields are final (constant), plus
  • compiler implements as above
• Note: considering extension to allow mutation
Example of Immutable Classes
• An immutable class has few additions
    immutable class Complex {            // "immutable" is the new keyword
        Complex () { real=0; imag=0; }   // zero-argument constructor required
        ...                              // rest unchanged; no assignment to fields outside of constructors
    }
• Use of immutable complex values
    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1.add(c1);
• Addresses performance and programmability
  • Similar to structs in C in terms of performance
  • Adds support for complex types
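For comparison, here is a rough plain-Java approximation of the immutable Complex class. It is a sketch under stated assumptions, not Titanium: Java has no `immutable` keyword, so the fields are marked final by hand, and the object is still reached through a reference (the indirection that Titanium's immutable classes are meant to avoid). The names `ComplexValue`, `re`, and `im` are illustrative only.

```java
// Plain-Java approximation of an immutable value class (assumption: this is
// ordinary Java, not Titanium syntax or Titanium compiler output).
final class ComplexValue {
    private final double real;
    private final double imag;

    // Mirrors the zero-argument constructor that the slide says is required.
    ComplexValue() { this(0.0, 0.0); }

    ComplexValue(double real, double imag) {
        this.real = real;
        this.imag = imag;
    }

    double re() { return real; }
    double im() { return imag; }

    // Returns a new value; neither the receiver nor the argument is modified.
    ComplexValue add(ComplexValue other) {
        return new ComplexValue(real + other.real, imag + other.imag);
    }

    public static void main(String[] args) {
        ComplexValue c1 = new ComplexValue(7.1, 4.3);
        c1 = c1.add(c1);
        System.out.println(c1.re() + " + " + c1.im() + "i");
    }
}
```

Because every field is final, `add` has to build a fresh value instead of updating in place, which matches the `c1 = c1.add(c1)` usage on the slide.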
Operator Overloading
• Titanium adds operator overloading, important for readability in scientific code
  • Very similar to operator overloading in C++
    public Complex operator+(Complex c) {
        return new Complex(c.real + real, c.imag + imag);
    }
    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1 + c1;
• Adds to programmability, not performance
  • Must be used judiciously
Templates
• Many applications use containers:
  • E.g., arrays parameterized by dimensions, element types
  • Java supports this kind of parameterization through inheritance; Java templates based on this as well
    – May only put Object types into containers
    – Inefficient when used extensively
• Titanium provides a template mechanism closer to that of C++
  • E.g., can instantiate with “double” or “immutable Complex”
Example of Templates
    template <class Element> class Stack {
        . . .
        public Element pop() {...}
        public void push( Element arrival ) {...}
    }

    template Stack<int> list = new template Stack<int>();   // int is not an object
    list.push( 1 );
    int x = list.pop();                                     // strongly typed, no dynamic cast
• Addresses programmability and performance
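To make the boxing cost concrete, the sketch below contrasts a standard Java generic container, which can only hold `Integer` objects, with a hand-written `int` specialization. The specialization is roughly what a C++-style instantiation such as `Stack<int>` would provide automatically. `IntStack` is a hypothetical class written for this illustration, not part of Titanium or the Java library.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hand-written int specialization of a stack (illustrative; a C++-style
// template mechanism would generate something similar from Stack<int>).
final class IntStack {
    private int[] data = new int[8];
    private int size = 0;

    void push(int arrival) {
        if (size == data.length) {
            data = Arrays.copyOf(data, 2 * size);   // grow the backing array
        }
        data[size++] = arrival;
    }

    int pop() {
        if (size == 0) throw new IllegalStateException("empty stack");
        return data[--size];
    }

    boolean isEmpty() { return size == 0; }

    public static void main(String[] args) {
        // Java generics: every int is boxed into an Integer object.
        ArrayDeque<Integer> boxed = new ArrayDeque<>();
        boxed.push(1);
        int a = boxed.pop();          // unboxing on the way out

        // Specialized container: ints are stored directly in an int[].
        IntStack list = new IntStack();
        list.push(1);
        int x = list.pop();
        System.out.println(a + " " + x);
    }
}
```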
Multidimensional Arrays
• Arrays in Java are objects
• Only 1D arrays directly supported
• Array bounds are checked
  • Safe but potentially slow
• Multidimensional arrays as arrays-of-arrays
  • General, but may be slow due to memory layout and difficulty of compiler analysis
  • Hand-coding (array libraries) can confuse optimizer
Multidimensional Arrays in Titanium
• New kind of multidimensional array added
  • Sub-arrays are supported
  • Indexed by Points (tuple of ints)
• Very expressive sub-array support, e.g.,
  • Can refer to a row or column as a sub-array
  • Can refer to the boundary region of an array
• Optimized by the compiler for caches
• Addresses programmability and performance
Unordered iteration
• With arrays, Titanium adds unordered iteration
  • Helps compiler with loop analysis
  • Also avoids some indexing details
    foreach (p within A.domain()) { A[p]... }
• p is a Point (tuple of ints) that can be used to index arrays
• Works for any dimension array
• Provides programmability and performance
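The ideas behind multidimensional arrays, sub-array views, and domain iteration can be approximated in plain Java. The sketch below assumes a 2D array stored contiguously in row-major order, with a `restrict` method that returns a view sharing the same storage and a `forEachPoint` method standing in for `foreach`. `Grid2D`, `PointBody`, `restrict`, and `forEachPoint` are hypothetical names for this illustration and are not Titanium's array API.

```java
// Illustrative 2D array over a rectangular index domain, stored row-major in
// one flat double[]. All names are hypothetical; this is not Titanium's API.
final class Grid2D {
    interface PointBody { void at(int i, int j); }

    private final int lo0, hi0, lo1, hi1;   // inclusive bounds of this (sub-)array
    private final double[] data;            // storage shared with sub-array views
    private final int baseLo0, baseLo1, rowLen;

    Grid2D(int lo0, int hi0, int lo1, int hi1) {
        this(lo0, hi0, lo1, hi1,
             new double[(hi0 - lo0 + 1) * (hi1 - lo1 + 1)], lo0, lo1, hi1 - lo1 + 1);
    }

    private Grid2D(int lo0, int hi0, int lo1, int hi1,
                   double[] data, int baseLo0, int baseLo1, int rowLen) {
        this.lo0 = lo0; this.hi0 = hi0; this.lo1 = lo1; this.hi1 = hi1;
        this.data = data; this.baseLo0 = baseLo0; this.baseLo1 = baseLo1;
        this.rowLen = rowLen;
    }

    // Bounds are checked against this view's domain, then mapped to the flat store.
    private int offset(int i, int j) {
        if (i < lo0 || i > hi0 || j < lo1 || j > hi1) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return (i - baseLo0) * rowLen + (j - baseLo1);
    }

    double get(int i, int j) { return data[offset(i, j)]; }
    void set(int i, int j, double v) { data[offset(i, j)] = v; }

    // Sub-array view (a row, a column, a boundary strip) that shares storage.
    Grid2D restrict(int l0, int h0, int l1, int h1) {
        if (l0 < lo0 || h0 > hi0 || l1 < lo1 || h1 > hi1) {
            throw new IndexOutOfBoundsException("sub-domain outside domain");
        }
        return new Grid2D(l0, h0, l1, h1, data, baseLo0, baseLo1, rowLen);
    }

    // Visits every point of the domain; callers should not rely on the order.
    void forEachPoint(PointBody body) {
        for (int i = lo0; i <= hi0; i++) {
            for (int j = lo1; j <= hi1; j++) {
                body.at(i, j);
            }
        }
    }

    public static void main(String[] args) {
        Grid2D a = new Grid2D(0, 3, 0, 3);
        a.forEachPoint((i, j) -> a.set(i, j, 1.0));
        Grid2D topRow = a.restrict(0, 0, 0, 3);       // row 0 as a sub-array
        topRow.forEachPoint((i, j) -> topRow.set(i, j, 5.0));
        System.out.println(a.get(0, 2) + " " + a.get(1, 2));
    }
}
```

Writing through the `topRow` view changes the parent array because both share one backing store, which is the property that makes sub-arrays cheap.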
Parallelism Model
• Titanium starts a copy of “main” on each processor (SPMD parallelism)
  • Only major restriction to Java semantics
  • Replaced Java’s thread model
  • Many programs written with more general threads do:
        for i = 1 to p fork
  • Handling dynamic thread creation on 1000s of processors is difficult
• Design is purely a performance consideration; dynamic threads are a future direction
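The SPMD pattern can be imitated in plain Java by starting a fixed number of threads that all run the same body and meet at a barrier. This is only a shared-memory sketch of the programming model, not how Titanium is implemented; `SpmdSketch` and `run` are illustrative names, and the rank passed to each thread plays the role that a processor number would play in a real SPMD program.

```java
import java.util.concurrent.CyclicBarrier;

// Shared-memory imitation of SPMD: a fixed set of threads all run the same
// body and synchronize at a barrier (illustrative names, not Titanium).
final class SpmdSketch {
    // Runs the same body on `procs` threads; returns what each rank computed.
    static int[] run(int procs) {
        CyclicBarrier barrier = new CyclicBarrier(procs);
        int[] local = new int[procs];     // slot r is written only by rank r
        int[] result = new int[procs];
        Thread[] workers = new Thread[procs];
        for (int r = 0; r < procs; r++) {
            final int rank = r;
            workers[r] = new Thread(() -> {
                try {
                    local[rank] = rank + 1;   // phase 1: purely local work
                    barrier.await();          // every rank reaches the barrier
                    int sum = 0;              // phase 2: read all ranks' data
                    for (int v : local) sum += v;
                    result[rank] = sum;
                } catch (Exception e) {
                    throw new IllegalStateException(e);
                }
            });
            workers[r].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

The thread count is fixed at startup and never changes, which is the restriction the slide describes: there is no dynamic fork inside the parallel body.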
Global Address Space
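• Shared address space is partitioned
• References (pointers) are either local or global (meaning possibly remote)
[Figure: global address space partitioned across processors p0, p1, ..., pn. Object heaps are shared; program stacks are private. Each stack holds a local reference (l) and a global reference (g); the heaps hold objects with fields x and y (x: 1, y: 2; x: 5, y: 6; x: 7, y: 8).]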
Communication
• Titanium has explicit global communication:
  • Broadcast, reduction, etc.
  • Primarily used to set up distributed data structures
• Most communication is implicit through the shared address space
  • Dereferencing a global reference, g.x, can generate communication
  • Arrays have copy operations, which generate bulk communication: A1.copy(A2)
Distributed Data Structures
• Building distributed arrays:
    Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs()-1][1d];
    Particle [1d] myParticle = new Particle [0:myParticleCount-1];
    allParticle.exchange(myParticle);
• Now each processor has an array of pointers, one to each processor’s chunk of particles
[Figure: all-to-all broadcast among processors P0, P1, P2]
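The effect of `exchange` can be imitated with Java threads: each simulated processor allocates its own chunk, publishes a reference to it in a shared directory, and after a barrier can reach every other chunk through that directory. This is a shared-memory sketch of the resulting data structure only; it does not model remote references or network traffic. `ExchangeSketch` and its method names are hypothetical, and the barrier is a stand-in for the collective `exchange` call.

```java
import java.util.concurrent.CyclicBarrier;

// Shared-memory imitation of a directory of per-processor chunks
// (illustrative; the barrier stands in for Titanium's exchange()).
final class ExchangeSketch {
    // Returns, for each rank, the total number of particles it can reach.
    static int[] totalSeenByEachRank(int procs) {
        double[][] allParticle = new double[procs][];   // directory: one slot per rank
        int[] seen = new int[procs];
        CyclicBarrier exchangeDone = new CyclicBarrier(procs);
        Thread[] workers = new Thread[procs];
        for (int r = 0; r < procs; r++) {
            final int rank = r;
            workers[r] = new Thread(() -> {
                double[] myParticle = new double[rank + 1];  // chunk sizes may differ
                allParticle[rank] = myParticle;              // publish own chunk
                awaitAll(exchangeDone);                      // all slots now filled
                int count = 0;
                for (double[] chunk : allParticle) count += chunk.length;
                seen[rank] = count;
            });
            workers[r].start();
        }
        for (Thread t : workers) joinQuietly(t);
        return seen;
    }

    private static void awaitAll(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(totalSeenByEachRank(3)));
    }
}
```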
Global Address Space
• Communication through global address space designed for
  • Productivity: explicit representation of distributed data structures
  • Performance: exploits efficient one-sided communication (remote put/get) when it exists
  • Tunability: shared memory style uses more global dereferences; distributed style uses more array copies
Region-Based Memory Management
• Extension of Java’s implicit memory management
• Regions are still “safe”, but can avoid or reduce need for distributed garbage collection
    PrivateRegion r = new PrivateRegion();
    for (int j = 0; j < 10; j++) {
        int[] x = new ( r ) int[j + 1];
        work(j, x);
    }
    try { r.delete(); }
    catch (RegionInUse oops) {
        System.out.println("failed to delete");
    }
• Designed for performance
Applications in Titanium
• Several applications and benchmarks
  • Heart simulation
  • Fluid solvers with Adaptive Mesh Refinement (AMR)
  • Dense linear algebra: LU, MatMul
  • Unstructured mesh kernel: EM3D
  • Finite element benchmark
  • Genetics: micro-array selection
  • Tree-structure n-body code
3D AMR Gas Dynamics
• Hyperbolic Solver [McCorquodale and Colella]
  • Implementation of Berger-Colella algorithm
  • Mesh generation algorithm included
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package
Heart Simulation
• Immersed Boundary Method [Peskin/McQueen, Yau]
  • Fibers (e.g., heart muscles) modeled by list of fiber points
  • Fluid space modeled by a regular lattice
• Irregular fiber lists need to interact with regular fluid lattice
  • Trade-off between load balancing of fibers and minimizing communication
  • Memory and communication intensive
• Uses several parallel numerical kernels
  • Navier-Stokes solver
  • 3-D FFT solver
  • Soon to be enhanced using an adaptive multigrid solver (possibly written in KeLP)
Productivity Measures
• Performance
• Programmability
• Robustness
• Portability
Serial Performance (Pure Java)
Performance on a Pentium IV (1.5GHz)
[Figure: bar chart of MFlops (axis 0 to 450) for Overall, FFT, SOR, MC, Sparse, and LU; series: java, C (gcc -O6), Ti, Ti -nobc]
Note the Ti/Java numbers use Java arrays, not Titanium arrays
Ti -nobc is with bounds-checking disabled
Parallel Performance and Scalability
• Poisson solver using “Method of Local Corrections”
• Communication < 5%; Scaled speedup nearly ideal (flat)
[Figures: results on the IBM SP at SDSC and the Cray T3E at NERSC]
Performance Tuning by Compiler
[Figure: “Scale performance” for a scaled version of GUPS. Time (seconds, axis 0 to 12) versus percentage of remote accesses (5, 10, 30, 50, 80, 90, 95); series: 1, 2, and 4 threads, each in a base, writepipeline, and readpipeline variant]
Advantage of compiled languages (Berkeley UPC compiler)
Programmability
• Heart simulation developed in ~1 year
  • Extended to support 2D structures for Cochlea model in ~1 month
• Preliminary code length measures
  • Simple torus model
    – Serial torus code is 17045 lines long (2/3 comments)
    – Parallel Titanium torus version is 3057 lines long
  • Full heart model
    – Shared memory Fortran heart code is 8187 lines long
    – Parallel Titanium version is 4249 lines long
  • Need to be analyzed more carefully, but not a significant overhead for distributed memory parallelism
Robustness
• Robustness is the primary motivation for language “safety” in Java
  • Type-safe, array bounds checked, auto memory management
  • Study on C++ vs. Java from Phipps at Spirus:
    – C++ has 2-3x more bugs per line than Java
    – Java had 30-200% more lines of code per minute
• Extended in Titanium
  • Checked synchronization avoids barrier deadlocks
  • More abstract array indexing, retains bounds checking
• No attempt to quantify this for Titanium yet
  • Would like to measure speed of error detection (compile time, runtime exceptions, etc.)
Portability
• Heart code and other applications run anywhere Titanium runs
  • Runs on serial or shared memory machines with native C compiler
    – Including my laptop!
    – Very important for programmer productivity
  • For distributed memory, requires communication layer
    – Alpha/Quadrics, IBM SP, Cray T3E, PC/Myrinet, anything with MPI
    – Global Address Space Networking layer (GASNet)
    – With C compiler, get Titanium and LBNL/UPC compilers
• FFTW used in heart code: strategy for performance and portability
Performance and Portability Approach
• Use machines, not humans, for architecture-specific tuning
  • Code generation + search-based selection
  • Can adapt to cache size, # registers, network buffering
• Used in
  • Signal processing: FFTW, SPIRAL, UHFFT
  • Dense linear algebra: Atlas, PHiPAC
  • Sparse linear algebra: Sparsity
  • Rectangular grid-based computations: Titanium compiler
  • Global communication: Atlas-derivative
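Search-based selection can be shown in miniature: run several variants of a kernel on the target machine, time each, and keep the fastest. The toy sketch below tunes only the block size of a blocked matrix multiply. The systems listed above go much further (they generate code variants and search larger parameter spaces), so this is an illustration of the idea under simplified assumptions, and `AutotuneSketch`, `multiply`, and `pickBlockSize` are hypothetical names.

```java
import java.util.Arrays;

// Toy search-based tuning: time a blocked matrix multiply for several block
// sizes and keep the fastest (illustrative; real autotuners do far more).
final class AutotuneSketch {
    // Blocked multiply C = A * B for n x n matrices in row-major order.
    static double[] multiply(double[] a, double[] b, int n, int block) {
        double[] c = new double[n * n];
        for (int ii = 0; ii < n; ii += block) {
            for (int kk = 0; kk < n; kk += block) {
                for (int jj = 0; jj < n; jj += block) {
                    int iEnd = Math.min(ii + block, n);
                    int kEnd = Math.min(kk + block, n);
                    int jEnd = Math.min(jj + block, n);
                    for (int i = ii; i < iEnd; i++) {
                        for (int k = kk; k < kEnd; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jEnd; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
        return c;
    }

    // Times each candidate block size on this machine and returns the fastest.
    static int pickBlockSize(int n, int[] candidates) {
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);
        int best = candidates[0];
        long bestNanos = Long.MAX_VALUE;
        for (int block : candidates) {
            multiply(a, b, n, block);              // warm-up run, not timed
            long start = System.nanoTime();
            multiply(a, b, n, block);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestNanos) {
                bestNanos = elapsed;
                best = block;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int chosen = pickBlockSize(256, new int[] {8, 16, 32, 64});
        System.out.println("selected block size: " + chosen);
    }
}
```

The selected value depends on the machine and can vary between runs, which is the point: the choice is measured on the target rather than fixed by the programmer.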
Summary
• Titanium designed for performance and programmability
  • Some compromises (regions, local/global refs)
• Retains robustness (safety) of Java
  • Also a big help in learning, avoiding certain kinds of bugs
• Tunability and performance transparency
  • Aggressive automatic optimizations can make this worse
• Advertising (all open source):
  • Titanium compiler: http://titanium.cs.berkeley.edu
  • Berkeley UPC compiler: http://upc.nersc.gov
  • Automatic tuning: http://www.cs.berkeley.edu/~richie/bebop