Microarchitectural Characterization
of Production JVMs and Java Workloads
work in progress
Jungwoo Ha (UT Austin)
Magnus Gustafsson (Uppsala Univ.)
Stephen M. Blackburn (Australian Nat’l Univ.)
Kathryn S. McKinley (UT Austin)
2/22/08 2
Challenges of JVM Performance Analysis
- Controlling nondeterminism
  - Just-In-Time compilation driven by nondeterministic sampling
  - Garbage collectors
  - Other helper threads
- Production JVMs are not created equal
  - Thread model (kernel vs. user threads)
  - Types of helper threads
- Need a solid measurement methodology: isolate each JVM part
Forest and Trees
- What performance metrics explain performance differences and bottlenecks? Cache misses (L1 or L2)? TLB misses? Instruction counts?
- Inspecting one or two metrics is not always enough
- Performance counters expose only a small number of counters at a time, so multiple invocations are inevitable for the measurement
Case Study: jython
[Graphs: application performance (cycles); L1 instruction cache misses/cycle; L1 data cache misses/cycle; total instructions executed (retired); L2 data cache misses/cycle]
Project Status
- Established a methodology to characterize application-code performance
  - Large number of metrics (40+) measured from hardware performance counters
  - Apples-to-apples comparison of JVMs using standard interfaces (JVMTI, JNI)
- Simulator data for detailed analysis
  - Limit studies: what if the L1 cache had no misses?
  - More performance metrics, e.g., uop mix
Performance Counter Methodology
- Measurement sequence: warm up the JVM, stop the JIT, run a full-heap GC, then take the measured runs, changing the measured metric and invoking the JVM as many times as needed
- 1st - xth iterations: warmup
- (x+1)th iteration: stop JIT, full-heap GC
- (x+2)th - (x+2+(n/p)k)th iterations: measurement
- Collecting n metrics:
  - x warmup iterations (x = 10)
  - p performance counters (at most p metrics can be measured per iteration)
  - n/p iterations needed for measurement
  - k redundant measurements for statistical validation (k = 1)
- The workload must be held constant across the multiple measurements
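The iteration budget implied by this scheme can be sketched as follows (a minimal illustration; the function name and the ceiling treatment of n/p are my own, not from the slides):

```c
#include <assert.h>

/* Iteration budget for one benchmark/JVM pair: x warmup iterations, one
 * iteration to stop the JIT and force a full-heap GC, then ceil(n/p)
 * measured iterations, repeated k times for statistical validation. */
static int total_iterations(int x, int n, int p, int k) {
    int measured = ((n + p - 1) / p) * k;  /* ceil(n/p) metric groups, k repeats */
    return x + 1 + measured;               /* warmup + stabilization + measured */
}
```

With the numbers from the slides (x = 10 warmup iterations, p = 2 hardware counters, n = 40 metrics, k = 1), this gives 10 + 1 + 20 = 31 JVM iterations.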
Performance Counter Methodology
- Stop-the-world garbage collector (no concurrent marking)
- One perfctr instance per pthread; JVM internal threads are different pthreads from the application
- JVMTI callbacks:
  - Thread start: start counter
  - Thread finish: stop counter
  - GC start: pause counter (only for user-level threads)
  - GC stop: resume counter (only for user-level threads)
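The callback logic can be modeled as a small state machine (an illustrative sketch in plain C; a real agent would drive perfctr/PAPI from JVMTI event handlers, whereas here the "counter" is just a struct so the logic is self-contained):

```c
#include <stdbool.h>

/* Per-thread counter state driven by the JVMTI events listed above. */
typedef struct {
    bool running;     /* counter currently accumulating events */
    bool user_level;  /* thread belongs to a user-level threading model */
} counter_t;

static void on_thread_start(counter_t *c) { c->running = true; }
static void on_thread_finish(counter_t *c) { c->running = false; }

/* GC events pause/resume only for user-level threads, where the collector
 * may run in the same pthread as the application and pollute its counts. */
static void on_gc_start(counter_t *c) { if (c->user_level) c->running = false; }
static void on_gc_stop(counter_t *c)  { if (c->user_level) c->running = true; }
```

For kernel-thread JVMs the GC runs in its own pthread with its own perfctr instance, so the application's counter keeps running across collections.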
Methodology Limitations
- Memory-barrier overhead cannot be factored out; use the garbage collector with the least application overhead
- If a helper thread runs in the same pthread as the application (user-level threading), it will cause perturbation; no evidence of this in J9, HotSpot, or JRockit
- Instrumented-code overhead must be included in the measurement
Experiment
- Performance counter experiment:
  - Pentium M uniprocessor
  - 32KB 8-way L1 caches (data & instruction), 2MB 4-way L2 cache
  - 2 hardware counters (18 if multiplexed)
  - 1GB memory
  - 32-bit Linux 2.6.20 with the perfctr patch
  - PAPI 3.5.0 library
- Simulator experiment:
  - PTLsim (http://www.ptlsim.org), an x86 simulator
  - 64-bit AMD Athlon
Experiment
- 3 production JVMs x 2 versions each: IBM J9, Sun HotSpot, JRockit (perfctr only); versions 1.5 and 1.6
- Heap size = max(16MB, 4 x minimum heap size)
- 18 benchmarks: 9 DaCapo benchmarks, 8 SPEC JVM 98, 1 PseudoJBB
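The heap sizing rule reduces to a one-liner (illustrative sketch; the function name is hypothetical and sizes are assumed to be in megabytes):

```c
/* Heap sizing rule from the experimental setup:
 * each benchmark runs with max(16 MB, 4 x its minimum heap size). */
static int heap_size_mb(int min_heap_mb) {
    int sized = 4 * min_heap_mb;
    return sized > 16 ? sized : 16;
}
```

This floors tiny benchmarks at 16MB while giving larger benchmarks generous headroom, keeping GC pressure comparable across workloads.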
Experiment
- 40+ metrics: 40 distinct metrics from performance counters
  - L1 and L2 cache misses (instruction, data, read, write)
  - I-TLB misses
  - Branch predictions
  - Resource stalls
- Richer metrics from the simulator
  - Micro-operation mix
  - Load to store
Performance Counter Results (Cycle Counts)
[Graphs: PseudoJBB, pmd, jython, jess, jack, hsqldb, compress, db]
Performance Counter Results
- IBM J9 1.6 performed better than Sun HotSpot 1.6 on average
- JRockit has the most variation in performance
- Full results: ~800 graphs
  - Full jython results in the paper
  - http://z.cs.utexas.edu/users/habals/jvmcmp, or Google my name (Jungwoo Ha)
Future Work
- JVM activity characterization: garbage collector, JIT
- Statistical analysis of performance metrics
  - Metric correlation
  - Methodology to identify performance bottlenecks
- Multicore performance analysis
Conclusions
- Methodology for production JVM comparison
- Performance evaluation data
- Simulator results for deeper analysis
Thank you!
Simulation Results
[Graphs: perfect cache (compress); perfect cache (db)]