Con5388 maier

transcript

Java Application Design Practices to Avoid When Dealing with Sub-100 ms SLAs

Daryl Maier (IBM Canada Lab), Anil Kumar (Intel Corporation)

1st October, 2012

Important Disclaimers

§ THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY.

§ WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS-IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESSED OR IMPLIED.

§ ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON HARDWARE, SOFTWARE, OR INFRASTRUCTURE DIFFERENCES.

§ ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE.

§ IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM, WITHOUT NOTICE.

§ IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION.

§ NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE EFFECT OF:
– CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS

Introduction to the speakers

Daryl Maier

– 12 years' experience developing and deploying Java SDKs at IBM Canada Lab

– Recent work focus:
• x86 Java just-in-time compiler development and performance
• Java benchmarking

– Contact: maier@ca.ibm.com

Anil Kumar

– 10 years' experience in server Java performance, ensuring the best customer experience on all Intel Architecture based platforms

– Contact: anil.kumar@intel.com

Credits

The contents of this presentation were jointly produced with Elena Sayapina, Java Performance, Intel.

Intel and IBM collaborate to ensure the best user experience across all Intel Architecture based platforms.

What this talk is about…

§ Learn what contributes to higher transactional response times within a Java application

§ How to measure response time

§ Java application design practices that lead to lower response times

§ How to tune the environment in which your application runs for better response time

§ How to determine if you can achieve an even better response time

§ Lots of practical examples

Service Level Agreements

§ SLA == Service Level Agreement
– A commitment to provide a service that meets a prescribed level of performance
– Can be informal or contractually obligated

[Figure: example SLA dimensions: availability, storage, concurrent users, response time]

Response time

§ Measure of time needed to complete a transaction in response to a request to do work

§ Lower response times generally have positive effects

§ Different perceptions of response time: user interface, real time event, service level commitments, …

§ Isn’t improving response time simply a matter of increasing throughput? Not necessarily…

How do you measure response time?

§ Be sure what you’re measuring is the response time you’re interested in

[Diagram: requests arrive on a request queue, transactions (A, B, C) are taken from a transaction queue by an executor thread pool, and responses leave through a response queue. The same pipeline is shown three times with a different span highlighted each time.]

Measuring response time from request made to response received?

Measuring response time from transaction submitted to response received?

Measuring the time to complete the transaction?

§ Make sure your timing measurement isn’t part of the response time!

§ Be aware of accuracy and precision of Java timing methods
– System.nanoTime()
– System.currentTimeMillis()
– …and don’t use too many timers!

§ Beware of clock skew in virtual environments– May need to keep time on an external system
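A minimal sketch of the timing advice above, assuming a transaction can be modelled as a `Runnable`. The class name `ResponseTimer` and the method `timeNanos` are hypothetical, not part of the talk. It uses one pair of `System.nanoTime()` reads around the work, so the measurement itself adds as little as possible to the response time.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: times one transaction with a single pair of clock reads. */
final class ResponseTimer {

    /** Runs the transaction and returns the elapsed time in nanoseconds. */
    static long timeNanos(Runnable transaction) {
        long start = System.nanoTime();   // monotonic clock; only differences are meaningful
        transaction.run();
        return System.nanoTime() - start; // always compare by subtraction, never with < or >
    }

    public static void main(String[] args) {
        long elapsed = timeNanos(() -> {
            try {
                Thread.sleep(5);          // stand-in for real transaction work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("elapsed ms: " + TimeUnit.NANOSECONDS.toMillis(elapsed));
    }
}
```

`System.currentTimeMillis()` reports wall-clock time and can jump when the system clock is adjusted, so it is usually better suited to time stamps than to measuring intervals.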

Sample of transaction response times for an injection rate (IR) of 3000 ops/sec. Most long transactions fall above the 95th percentile.
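Percentiles such as the 95th and 99th can be computed from recorded samples in several ways. The sketch below uses the nearest-rank definition, which is an assumption here and not necessarily what the benchmark uses. The class `Percentiles` and the sample values are made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentiles over recorded response times. */
final class Percentiles {

    /** Returns the smallest sample such that at least p percent of samples are at or below it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up response times in milliseconds with one long outlier.
        long[] ms = {12, 15, 11, 14, 250, 13, 12, 16, 11, 13};
        System.out.println("median: " + percentile(ms, 50));
        System.out.println("p99:    " + percentile(ms, 99));
    }
}
```

A single slow transaction barely moves the median but dominates the high percentiles, which is why the case studies in this talk report both.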

Influences on response time are not localized

[Diagram: the stack of layers that influence response time: Application, Framework, Java VM, Operating System, Hardware]

You must design and tune the entire stack in order to achieve your response time targets

SPECjbb2012

§ Next generation Java business logic benchmark from SPEC

§ Business model is a supermarket supply chain: headquarters, supermarkets, suppliers

§ Scalable, self-injecting workload with multiple supported configurations

§ Customer-relevant technologies: security, XML, JDK 7 features

§ Metrics: max-jOPS (throughput) and critical-jOPS (response time)

§ Will be used for case studies in this presentation

Application design influences response time

[Stack diagram: Application layer highlighted]

• design for scalability

• eliminate serial bottlenecks

• use appropriate JCL packages

• avoid needless synchronization

• avoid excessive object allocations

• cache data locally

• use non-blocking I/O

• be careful with logging and tracing

Design for scalability

§ Scalability: the ability to increase throughput as more resources are applied

§ Prepare your application to run on modern multi-core architectures

§ Create more parallelism in your application and eliminate serial bottlenecks
– Change algorithms

§ Organize your application into parallel tasks
– Leverage TaskExecutor framework for high-level tasks
– Consider ForkJoin in Java 7 for fine-grained task decomposition
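As an illustration of fine-grained task decomposition, here is a small sketch using the Java 7 fork/join classes (`ForkJoinPool`, `RecursiveTask`). The class `ParallelSum` and its `THRESHOLD` value are hypothetical, and the workload (summing an array) is only a stand-in for real application work.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical example: sums an array by recursively splitting it into subtasks. */
final class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;   // illustrative cut-off, tune per workload

    private final long[] data;
    private final int from;
    private final int to;

    ParallelSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {              // small enough: do the work serially
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                               // schedule the left half asynchronously
        return right.compute() + left.join();      // compute the right half in this thread
    }

    /** Sums the array on a pool sized to the available processors. */
    static long sum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new ParallelSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println("sum = " + sum(data));
    }
}
```

A long-running application would normally create the pool once and reuse it rather than building one per call as this sketch does.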

Use the java/util/concurrent package

§ j/u/c introduced in Java 5, additional features in Java 6/7

§ Contains building blocks for developing scalable applications
– Built on state-of-the-art non-blocking synchronization algorithms
– More variety in locking operations (Lock interface, multiple Conditions)
– Atomic variables (atomic math ops such as increment, test-and-set)
– Concurrent collections
– Coarse and fine-grained task management

§ Use j/u/c classes as base classes for new data structures

§ Optimized by modern JVMs
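A short sketch of how these building blocks combine: a concurrent collection holding atomic counters, updated from an executor-managed thread pool without any `synchronized` blocks. The class `TransactionCounters` and the key `"purchase"` are hypothetical. `computeIfAbsent` with a lambda assumes Java 8 or later; on Java 7 `putIfAbsent` would play the same role.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical example: per-type transaction counters with no explicit locking. */
final class TransactionCounters {
    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    /** Atomically increments the counter for the given transaction type. */
    void record(String type) {
        counts.computeIfAbsent(type, k -> new AtomicLong()).incrementAndGet();
    }

    long count(String type) {
        AtomicLong c = counts.get(type);
        return c == null ? 0 : c.get();
    }

    /** Runs the given number of threads, each recording perThread events, and returns the total. */
    static long demo(int threads, int perThread) {
        TransactionCounters counters = new TransactionCounters();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    counters.record("purchase");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counters.count("purchase");
    }

    public static void main(String[] args) {
        System.out.println("recorded: " + demo(4, 1000));
    }
}
```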

Avoid unnecessary Java synchronization

§ Synchronization is required for correctness, so it cannot always be avoided

§ Built-in Java synchronization is coarse grained and can inhibit scalability
– Useful when true mutual exclusion is the goal
– JVMs can help

§ Strongly consider using j/u/c for finer-grained locking
– Building blocks for scalable locking

§ Eliminate contended locks

§ Use volatile fields when appropriate
– No locking
– May be suitable for single writer, multiple-reader (e.g., time stamps)
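The single-writer, multiple-reader time stamp case might look like the sketch below. The class `LastUpdateClock` is hypothetical. It relies on the language guarantee that reads and writes of a `volatile long` are atomic and visible across threads.

```java
/** Hypothetical example: a time stamp published by one writer thread and read by many. */
final class LastUpdateClock {
    // volatile makes each write visible to readers without taking a lock
    private volatile long lastUpdateMillis;

    /** Called by the single writer thread only. */
    void touch() {
        lastUpdateMillis = System.currentTimeMillis();
    }

    /** Safe to call from any number of reader threads. */
    long lastUpdateMillis() {
        return lastUpdateMillis;
    }

    public static void main(String[] args) {
        LastUpdateClock clock = new LastUpdateClock();
        clock.touch();
        System.out.println("last update: " + clock.lastUpdateMillis());
    }
}
```

This does not extend to read-modify-write updates from several writers (for example `counter++`). Those still need an atomic variable or a lock.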

Avoid excessive object allocations

§ Understand the effect of object creation on the heap and the strain on garbage collection

§ Consider hoisting allocations from loops

§ Consider using weak/soft references when appropriate
– Useful for caches, object metadata, or easily rematerializable data

§ Be aware of immutable classes that implicitly return new objects
– e.g., BigDecimal, Integer
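Two of these points can be sketched in a few lines. The class `AllocationExamples`, its method names, and the `"receipt-"` prefix are hypothetical. The first pair of methods contrasts accumulating in an immutable `BigDecimal` (one new object per `add`) with accumulating in a primitive. The last method hoists a `StringBuilder` out of a loop.

```java
import java.math.BigDecimal;

/** Hypothetical examples of reducing allocations in hot loops. */
final class AllocationExamples {

    /** Each add() returns a new BigDecimal because the class is immutable. */
    static BigDecimal totalAllocating(long[] priceCents) {
        BigDecimal total = BigDecimal.ZERO;
        for (long cents : priceCents) {
            total = total.add(BigDecimal.valueOf(cents, 2)); // allocates on every iteration
        }
        return total;
    }

    /** Same result with a primitive accumulator and one BigDecimal created at the end. */
    static BigDecimal totalPrimitive(long[] priceCents) {
        long cents = 0;
        for (long c : priceCents) {
            cents += c;
        }
        return BigDecimal.valueOf(cents, 2);
    }

    /** The StringBuilder is hoisted out of the loop and reset for each element. */
    static String[] labels(long[] ids) {
        StringBuilder sb = new StringBuilder(32);
        String[] out = new String[ids.length];
        for (int i = 0; i < ids.length; i++) {
            sb.setLength(0);
            out[i] = sb.append("receipt-").append(ids[i]).toString();
        }
        return out;
    }

    public static void main(String[] args) {
        long[] prices = {199, 250, 1};
        System.out.println(totalAllocating(prices) + " == " + totalPrimitive(prices));
        System.out.println(String.join(", ", labels(new long[] {7, 42})));
    }
}
```

Whether the primitive version is worthwhile depends on the value range and on whether exact decimal semantics are needed at every step. It is shown only to make the hidden allocations visible.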

Case study: SPECjbb2012

§ Example of design choices around receipt storage in the benchmark

• Some impact on throughput

• No impact on median response time

• Significant impact on 99th-percentile response time

§ Example of design choices where background tasks become heavier
– Increase in background task of Data Mining (DM)

• Some impact on throughput

• Significant impact on 99th-percentile response time

Reduce data access latency

§ Often a problem in client/server systems

§ Cache data locally to avoid remote communication
– Particularly effective with data unlikely to change

§ Pitfall: caching more data improves remote access latency, but accumulating too much strains garbage collection
– An example of where local benefits to throughput have broader negative effects

§ Use Java NIO (Java SE 1.4) and NIO2 (Java SE 7)
– Can leverage high performance features

§ Carefully consider non-blocking, unbounded data structures (e.g., ConcurrentLinkedQueue)
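One way to keep a local cache from growing until it strains the garbage collector is to bound it. The sketch below is an assumption about how that could be done, not the benchmark's approach: a small least-recently-used cache built on `LinkedHashMap`. The class `BoundedCache` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical example: an LRU cache that never holds more than maxEntries entries. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true);          // accessOrder = true gives least-recently-used ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // evict the least recently used entry when over the bound
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("a", "apples");
        cache.put("b", "bread");
        cache.get("a");                  // touch "a" so "b" becomes the eldest entry
        cache.put("c", "cheese");
        System.out.println(cache.keySet());
    }
}
```

`LinkedHashMap` is not thread-safe, so shared use would need external synchronization such as `Collections.synchronizedMap`. The same bounding idea applies to queues: a bounded `ArrayBlockingQueue` applies back-pressure where an unbounded `ConcurrentLinkedQueue` would keep accumulating.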

Performance effects of caching supermarket data over not caching it

• Throughput reduces by half

• Minor impact on median response time

• Some impact on 99th-percentile response time

Application frameworks

[Stack diagram: Framework layer highlighted]

• application containers (e.g., application servers, Eclipse)

• 3rd party packages (e.g., Apache Commons, Grizzly)

• understand thread management and local caching policies

Java virtual machine tuning

[Stack diagram: Java VM layer highlighted]

• garbage collection

• heap tuning

• 64-bit addressing

Java virtual machine architecture

[Diagram: layered architecture of a Java runtime environment (e.g. J9 R26), with components colour-coded as user code, Java platform API, VM-aware, or core VM]

Components, roughly top to bottom:
– Debuggers, profilers, Java application code, user natives
– JVMTI, Java API classes (e.g. Java 6 / Java 7: JSE6 classes, Harmony classes), Java Native Interface (JNI)
– GC / JIT / class library natives
– Core VM (interpreter, verifier, stack walker), trace & dump engines
– Port library (files, sockets, memory), thread library
– Operating systems: AIX, Linux, Windows, z/OS
– Architectures: PPC-32, PPC-64, x86-32, x86-64, zArch-31, zArch-64

Garbage collection

§ Determine the best garbage collection policy to use for your application
– Often a response time vs. throughput tradeoff

§ Most GC policies involve a “stop-the-world” phase that works against response times
– “throughput” policies tend to incur longer pauses but fewer interruptions
– “concurrent” policies lower average pause times by completing some tasks concurrently
– “balanced” policies carve heap into regions to improve parallelism and reduce pauses

§ Tune your heap parameters

§ Use -verbose:gc to correlate GC events with application events
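Alongside the verbose GC log, an application can sample cumulative GC activity itself through the standard `java.lang.management` API and log it next to its own events. The class `GcSnapshot` below is a hypothetical sketch. What counts as a "collection" and how time is accounted varies by JVM and GC policy.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: cumulative GC counts and times as reported by the JVM. */
final class GcSnapshot {

    /** Total number of collections across all collectors (unsupported values are skipped). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();   // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("collections so far: " + totalCollections());
        System.out.println("GC time so far (ms): " + totalCollectionMillis());
    }
}
```

Sampling these totals before and after a slow transaction gives a rough indication of whether a collection overlapped with it.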

§ Example showing the effect of different GC policies and heap tunings

• Small throughput reduction from ConcMarkSweep

• No impact on median response time

• ConcMarkSweep 99th-percentile response time higher but consistent

64-bit addressing

§ Heap addressability beyond 32 bits (> 3.5 GB)
– Common for applications with large in-memory working set (e.g., databases, object caches)

§ 64-bit addressing is a less efficient representation than 32-bit
– Cache & TLB effects stress hardware

§ Solution: build a 64-bit JVM with near 32-bit efficiency
– Use 32-bit values (offsets) to represent object reference fields
– With scaling, between 4 GB and 32 GB can be addressed

§ Enable with -XX:+UseCompressedOops (HotSpot) or -Xcompressedrefs (J9)
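The 4 GB and 32 GB figures follow from simple arithmetic: a 32-bit offset distinguishes 2^32 positions, and shifting it left by the object alignment (3 bits for 8-byte alignment) multiplies the reach by 8. The sketch below is a simplified model of that scaling, not either JVM's actual implementation. The class `CompressedRefs` and its methods are hypothetical.

```java
/** Hypothetical model of scaled 32-bit references; real JVMs differ in detail. */
final class CompressedRefs {

    /** Bytes addressable by a 32-bit offset that is shifted left by the given number of bits. */
    static long addressableBytes(int shift) {
        return (1L << 32) << shift;
    }

    /** Expands a 32-bit compressed offset into a full address relative to a heap base. */
    static long decode(long heapBase, int compressed, int shift) {
        return heapBase + ((compressed & 0xFFFFFFFFL) << shift); // treat the int as unsigned
    }

    public static void main(String[] args) {
        System.out.println("no scaling:       " + (addressableBytes(0) >> 30) + " GB");
        System.out.println("8-byte alignment: " + (addressableBytes(3) >> 30) + " GB");
    }
}
```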

Operating system tuning

[Stack diagram: Operating System layer highlighted]

• large pages

• thread scheduling

Large data and code pages

§ OS paging architecture requires memory addresses to be mapped to more granular “pages” that are mapped to physical memory
– Translation Lookaside Buffers (TLBs) cache these mappings
– Using larger page sizes increases TLB effectiveness

§ Large pages must be enabled by the OS
– BUT require enough physical pages to be allocated together to be most effective

§ Modern JVMs place both heap and compiled code in large pages

§ Enable with -Xlp (J9) or -XX:+UseLargePages (HotSpot)
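Why larger pages help can be seen from the TLB's "reach", the amount of memory its entries can map at once: entries multiplied by page size. The entry count of 64 below is a made-up illustrative figure, not a property of any particular processor, and the class `TlbReach` is hypothetical.

```java
/** Hypothetical arithmetic: how much memory a TLB can map for a given page size. */
final class TlbReach {

    /** Memory covered when every TLB entry maps one page of the given size. */
    static long reachBytes(int entries, long pageSizeBytes) {
        return entries * pageSizeBytes;
    }

    public static void main(String[] args) {
        int entries = 64;                                   // illustrative assumption only
        long small = reachBytes(entries, 4L * 1024);        // 4 KB pages
        long large = reachBytes(entries, 2L * 1024 * 1024); // 2 MB pages
        System.out.println("4 KB pages: " + (small / 1024) + " KB of reach");
        System.out.println("2 MB pages: " + (large / (1024 * 1024)) + " MB of reach");
    }
}
```

With the same number of entries, 2 MB pages cover 512 times as much memory as 4 KB pages, so a large Java heap incurs far fewer TLB misses.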

§ Example showing the effect of large pages

• Increase throughput by ~13%

• Helps in keeping 99th-percentile response time lower at higher load

Thread scheduling

§ Context switches

– Voluntary (e.g., a thread blocks while waiting for a lock)

– Involuntary (e.g., too many active threads)

§ Watch for thread migration

Hardware tuning

[Stack diagram: Hardware layer highlighted]

• power management

• BIOS settings

Hardware tuning

§ Power management

§ Insufficient resources
– Physical memory, amount and latency
– I/O storage latency
• RAID
• SSDs
– Network I/O bandwidth

§ Tune your BIOS settings carefully
– Hyperthreading
– Prefetching
– Power management

Know your Intel® Xeon® Processor Family

Know your Intel® Xeon® Processor SKU:

§ Example showing the effect of 8 cores vs. 4 cores
– Assumes application leveraging parallelism of multiple cores

• Increases throughput by ~100%

• 8 cores deliver much lower 99th-percentile response time

Leveraging your hardware topology

§ Understand the underlying hardware topology to reduce latency and increase throughput

§ For NUMA, affinitize JVMs to core/memory subsets to improve performance
– Improve NUMA performance
– Optimize the cache hierarchy of the underlying processors

• Increases throughput by ~12%

• Much lower 99th-percentile response time

Evaluating your response time

§ Even though you may be achieving an acceptable SLA, are there tell-tale signs that you could be achieving even better?
– Lack of multi-threadedness in your application
– Lock contention
– Low CPU utilization
– Excessive time (>10%) being spent in OS kernel

§ Tooling to help diagnose response time issues
– IBM HealthCenter
• What is my JVM doing? Is everything ok?
• Why is my application running slowly? Why is it not scaling?
• Am I using the right options?
– Garbage Collector and Memory Visualizer
• Online analysis of heap usage, pause times, many others
– Memory Analyzer
• Offline tool providing insight into Java heaps

Questions?

References

§ Get Products and Technologies
– IBM Java Runtimes and SDKs:
• https://www.ibm.com/developerworks/java/jdk/
– IBM Monitoring and Diagnostic Tools for Java:
• https://www.ibm.com/developerworks/java/jdk/tools/
– SPEC benchmarking:
• http://www.spec.org

§ Learn
– IBM Java InfoCenter:
• http://publib.boulder.ibm.com/infocenter/javasdk/v6r0/index.jsp

§ Discuss
– IBM Java Runtimes and SDKs Forum:
• http://www.ibm.com/developerworks/forums/forum.jspa?forumID=367&start=0

Copyright and Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., and registered in many jurisdictions worldwide.

Other product and service names might be trademarks of IBM or other companies.

A current list of IBM trademarks is available on the Web – see the IBM “Copyright and trademark information” page at URL: www.ibm.com/legal/copytrade.shtml

SPECjbb2012 architecture

[Diagram: single application set and multi-application set configurations, built from Controller (Ctrl), Transaction Injector (TxI), and Backend (BE) components]

Controller (Ctrl)
– Controls and evaluates the runs

Transaction Injector (TxI)
– Issues “Requests” at a given rate
– Measures response time by sending probe requests

Backend SUT (BE)
– Some % of transactions go across BEs exercising inter-JVM process communication

SPECjbb2012 architecture

[Diagram: two backends, each hosting one group (Backend 1 with Group 1, Backend 2 with Group 2). Each group contains a headquarters, supermarkets, and suppliers.]

SM: Supermarket, HQ: Headquarters, SP: Supplier

Be aware of the impact of logging and tracing

§ Tracing and logging events from your application can have hidden costs
– I/O latency

– Storage requirements

– Overhead of the tests that guard tracing code

– Impact on JIT compilation

§ Do try to correlate application tracing information with events in other system or JVM logs
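A common way to limit the cost of tracing is to guard the call so that the message is only built when the level is enabled. The sketch below uses `java.util.logging`. The class `GuardedTracing`, the `describe` helper, and the `expensiveCalls` counter are hypothetical and exist only to show when the costly work runs.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical example: trace output guarded by a level check. */
final class GuardedTracing {
    static final Logger LOG = Logger.getLogger(GuardedTracing.class.getName());

    /** Counts how often the costly formatting ran (for demonstration only). */
    static int expensiveCalls = 0;

    /** Stand-in for costly message construction. */
    static String describe(long[] receipt) {
        expensiveCalls++;
        return Arrays.toString(receipt);
    }

    static void process(long[] receipt) {
        // The guard is itself a small cost on every call, but it avoids building
        // the message string when FINE tracing is switched off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing " + describe(receipt));
        }
        // ... the actual transaction work would go here ...
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO);
        process(new long[] {1, 2, 3});
        System.out.println("formatting calls with tracing off: " + expensiveCalls);
    }
}
```

The guard trades a cheap test on every call for skipping the expensive formatting, which is the "test guarding" overhead mentioned above.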
