Date post: | 10-May-2015 |
Category: |
Technology |
Upload: | srinivaskothuri |
Android JIT
Introduction
Just-In-Time (JIT)/Dynamic Compilation
JIT Design
Dalvik JIT
JIT Compiler
Intermediate Representation
Optimization Techniques
Data- Control- Flow Analysis
Introduction:
The Java language is made to be interpreted to achieve the critical goal of application portability.
[Figure: HW.java (public class HW { ... void hello() { ... } }) is compiled by javac into HW.class (ca fe ba be 08 1a 42 ..), which java runs on the Java Virtual Machine together with other classes.]
Just as microprocessors have instruction sets that define the operations they can perform, so does the VM. Java source code is compiled into VM instructions in a format known as bytecode.
It is through the VM that executable bytecode Java classes are executed and ultimately routed to appropriate native system calls.
Problem: “A Java program executing within the VM is executed one bytecode at a time”
Figure labels: Java source file, class file (bytecode)
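To make the cost of executing "a bytecode at a time" concrete, here is a minimal sketch of a stack-based interpreter loop. The opcodes and the ToyInterpreter class are hypothetical and are not real JVM bytecodes; the point is only that every instruction pays for a fetch, a decode and a dispatch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stack-machine interpreter; the opcodes are invented for illustration only. */
class ToyInterpreter {
    static final int PUSH = 0;  // push the next code word as a constant
    static final int ADD  = 1;  // pop two values, push their sum
    static final int MUL  = 2;  // pop two values, push their product
    static final int HALT = 3;  // stop and return the top of the stack

    /** Fetches, decodes and executes one instruction per loop iteration. */
    static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            int op = code[pc++];
            switch (op) {
                case PUSH: stack.push(code[pc++]); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                case HALT: return stack.pop();
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Encodes (3 + 4) * 2.
        int[] program = {PUSH, 3, PUSH, 4, ADD, PUSH, 2, MUL, HALT};
        System.out.println(run(program)); // prints 14
    }
}
```

A JIT removes the per-instruction dispatch by turning such a sequence into straight-line native code.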
Problem (Contd.):
The conventional approach resulted in significantly lower performance than compiled languages such as C/C++ because of the additional processor and memory usage during interpretation.
As a result, slow and space-constrained computing devices have tended not to include virtual computing technology (i.e., a JVM).
Initiatives:
JSR-30: J2ME CLDC (Connected Limited Device Configuration) Specification
- Reference implementation of the J2ME CLDC (Connected Limited Device Configuration) in April 1999, got approval in August 1999
- Final public release of CLDC 1.0 in May 2003
The HotSpot engine was developed to address the perception that Java virtual machine performance was insufficient for many mainstream applications.
By implementing a host of performance-enhancing techniques that went beyond innovations like just-in-time (JIT) compilers, the performance of the Java virtual machine increased by an order of magnitude.
Just-In-Time (JIT)/Dynamic Compilation :
The Just-In-Time (JIT) compiler is a component of the Java Runtime Environment. It improves the performance of Java applications by compiling bytecodes to native machine code at run time.
[Figure: Just-In-Time (JIT) Compiler. Components shown: bytecodes, JVM, GC, runtime, profiler, and the Just-In-Time compiler with its intermediate representation generator, optimizer and code generator.]
Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
JIT Compilation Strategies:
With a JIT compiler, Java programs are compiled into the native processor's instructions one block of code at a time as they execute, to achieve higher performance. The process involves generating an internal representation of a method that is different from bytecodes but at a higher level than the target processor's native instructions. The compiler performs optimization to improve quality and efficiency, and finally a code-generation step translates the optimized internal representation to the target processor's native instructions.
To avoid the overhead of compiling and optimizing all of an application's classes at once, a number of incremental compilation strategies have evolved. The general strategy of only compiling the “hot” parts of an application will often result in only a small percentage of an application being compiled, thus saving considerable compilation time.
“A continuously operating sampling profiler identifies the program's hot regions for code reoptimization”
“The JIT compiler operates on a compilation thread that's separate from the application threads so that the application doesn't need to wait for a compilation to occur”
Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
A Java class that has been loaded into memory by the VM contains a V-table (virtual table), which is a list of the addresses for all the methods in the class.
Each address in the V-table points to the executable bytecode for the particular method.
[Figure: V-table with entries Method 1 to Method 4, each paired with the corresponding method's bytecode.]
Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
When the JIT is loaded, each bytecode address in the V-table is replaced with the address of the JIT compiler itself.
[Figure: V-table with entries Method 1 to Method 5, and the Just-In-Time compiler.]
When the VM calls a method through the address in the V-table, the JIT compiler is executed instead.
Just-In-Time (JIT)/Dynamic Compilation (Contd.) :
The JIT compiler steps in and compiles the Java bytecode into native code and then patches the native code address back to the V-table.
[Figure: V-table with entries Method 1 to Method 5, the Just-In-Time compiler, and the native code for Method 5.]
From now on, each call to the method results in a call to the native version.
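The V-table patching described above can be sketched as follows. This is a minimal model under stated assumptions: VTableDemo, its single slot and the squaring "native code" are all hypothetical, and a lambda stands in for both the JIT stub and the generated machine code.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical model of a V-table whose slot starts out pointing at a JIT stub. */
class VTableDemo {
    final IntUnaryOperator[] vtable = new IntUnaryOperator[1];
    int compilations = 0; // how many times the stub "compiled" the method

    VTableDemo() {
        // Slot 0 initially holds the stub, standing in for the JIT compiler's address.
        vtable[0] = arg -> {
            compilations++;
            IntUnaryOperator nativeCode = x -> x * x; // stands in for generated native code
            vtable[0] = nativeCode;                   // patch the slot with the new address
            return nativeCode.applyAsInt(arg);        // complete the call that triggered the JIT
        };
    }

    /** Every call goes through the table, as the VM's dispatch would. */
    int call(int slot, int arg) {
        return vtable[slot].applyAsInt(arg);
    }
}
```

The first call to slot 0 pays for the compilation; every later call reaches the patched entry directly, so the stub runs only once.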
JIT Design :
Challenges (Price of Platform neutrality):
The time it takes to compile the code is added to the program's running time. JIT typically causes a slight delay in initial execution of an application, due to the time taken to load and compile the bytecode.
Optimizations:
Modern JIT compilers take one of two approaches:
1. Compile all the code, but without performing any expensive analyses or transformations, so that the code is generated quickly.
2. Devote compilation resources to only a small number of methods that execute frequently.
Combine interpretation and JIT compilation. The application code is initially interpreted, but the JVM monitors which sequences of bytecode are frequently executed and translates them to machine code for direct execution on the hardware.
JIT Design (Contd.) :
There are four reasons why a JIT for the complete bytecode set was not implemented and the combined use of interpreter and JIT became unavoidable.
1. If thread context switching had to be performed while executing generated native code, this would have added complexity to code generation, runtime support, and the base VM code. By only performing context switching in the interpreter, no changes were needed to the way thread scheduling was done in the VM.
2. The generated machine code would have needed to be more rigorous in the way it dealt with error conditions and other exceptional conditions. As it is, the machine code only needs to check for error conditions. When they occur, the error-handling bytecodes can then be executed by the interpreter, which can deal with the details of how the error should be processed.
3. A complete JIT would have required more complicated interactions between the generated machine code and the virtual machine as a whole. For example, the generated machine code could cause the compiler, class loader, garbage collector, or native code to run. In retrospect some of these restrictions were not strictly necessary, but the system probably has fewer undiscovered bugs, and it does not seem to have limited the performance of the type of compute-intensive software that is the target of the design.
JIT Design (Contd.) :
4. A debugging technique (discussed below) was used which could not have been employed so easily with a complete JIT.
Therefore the system was designed to allow execution to pass from the compiled code to the interpreter at any time, and also for the interpreter to be able to return to generated code in a timely fashion.
Additionally, to keep the interpreter from getting trapped in a long loop of bytecodes it was necessary to be able to return to compiled code in the middle of a method as well as at the start.
“The JIT lets the interpreter deal with complex tasks such as class loading, exception handling, synchronization, garbage collection, etc.”
The basic interpreter loop is as follows:
Start:
    Try to enter compiled code.
    Interpret the next bytecode.
    goto Start.
If the current method has not been compiled then checks are performed to determine if it can be.
JIT Design (Contd.) :
Compilation may not be possible for one of the following reasons.
1. A native function was called.
2. The method has more than a certain number of parameters or local variables, or is unusually large.
3. There is no available memory for more compiled code.
4. An object could not be created without running the garbage collector.
5. An operation was attempted that required a class to be initialized.
6. The start of an exception handler was reached.
7. An exception or error occurred. The interpreter always processes these.
8. A part of a method was reached for which no corresponding machine code could be generated.
9. A function was called for which there was no compiled code.
10. A method return was executed but there was no compiled code to return to because the code buffer had been flushed.
[Figure: V-table with entries Method 1 to Method 4, each paired with the corresponding method's bytecode, and the Just-In-Time compiler.]
JIT Design (Contd.) :
1. The JVM interprets a method until its call count exceeds a JIT threshold.
2. After a method is compiled, its call count is reset to zero; subsequent calls to the method continue to increment its count.
3. When the call count of a method reaches a JIT recompilation threshold, the JIT compiles it a second time, this time applying a larger selection of optimizations than on the previous compilation (because the method has proven to be a significant part of the whole program).
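[Figure: .class files run on the JVM on top of the operating system. With JIT=OFF the interpreter executes the code. With JIT=ON and Threshold=10, code executed fewer than 10 times goes to the interpreter, and code executed 10 or more times goes to the JIT, which produces native code.]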
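The three threshold steps listed above can be sketched as a small per-method counter. TieredCounter, its Tier enum and the threshold parameters are hypothetical names chosen for this illustration; a real JVM keeps such counters inside its method metadata and performs the compilation itself.

```java
/** Hypothetical per-method call counter that follows the three threshold steps. */
class TieredCounter {
    enum Tier { INTERPRETED, COMPILED, RECOMPILED }

    private final int jitThreshold;
    private final int recompileThreshold;
    private int callCount = 0;
    private Tier tier = Tier.INTERPRETED;

    TieredCounter(int jitThreshold, int recompileThreshold) {
        this.jitThreshold = jitThreshold;
        this.recompileThreshold = recompileThreshold;
    }

    /** Records one call and returns the tier the method is in after that call. */
    Tier onCall() {
        callCount++;
        if (tier == Tier.INTERPRETED && callCount > jitThreshold) {
            tier = Tier.COMPILED;   // step 1: count exceeded the JIT threshold
            callCount = 0;          // step 2: count is reset after compilation
        } else if (tier == Tier.COMPILED && callCount >= recompileThreshold) {
            tier = Tier.RECOMPILED; // step 3: recompile with more optimizations
        }
        return tier;
    }
}
```

With thresholds of 2 and 3, for instance, the third call triggers the first compilation and the sixth call triggers the recompilation.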
JIT Design (Contd.) :
Dalvik JIT :
Dalvik Execution Environment:
1. Register-based architecture (register machine)
   Stack-based machines (JVMs) must use instructions to load data on the stack and manipulate that data, and thus require more instructions than register machines.
2. Very compact representation
   Java bytecode is converted into an alternate instruction set used by the Dalvik VM. dx is a tool used to convert some (but not all) Java .class files into the .dex format.
3. Emphasis on code/data sharing to reduce memory usage
   Multiple classes are included in a single .dex file.
4. Highly tuned, very fast (2x similar) Dalvik interpreter, good enough for most applications
   For compute-intensive applications, the Native Development Kit was released to allow Dalvik applications to call out to statically compiled (native) methods.
Dalvik JIT (Contd.):
The other part of the solution is the Dalvik JIT:
Translates bytecode to optimized native code at run time.
1. Method Compiler
2. Trace Compiler
1. Method Compiler
- Most common model for server JITs
- Interprets with profiling to detect hot methods
- Compiles and optimizes method-sized chunks
- Strengths
  • Larger optimization window
  • Machine state sync with interpreter only at method call boundaries
- Weaknesses
  • Cold code within hot methods gets compiled
  • Much higher memory usage during compilation and optimization
  • Longer delay between the point at which a method goes hot and the point that a compiled and optimized method delivers benefits
Dalvik JIT (Contd.):
2. Trace Compiler
- Most common model for low-level code migration systems
- Interprets with profiling to identify hot execution paths
- Compiled fragments chained together in translation cache
- Strengths
  • Only hottest of hot code is compiled, minimizing memory usage
  • Tight integration with interpreter allows focus on common cases
  • Very rapid return of performance boost once hotness detected
- Weaknesses
  • Smaller optimization window limits peak gain
  • More frequent state synchronization with interpreter
  • Difficult to share translation cache across processes
Dalvik JIT (Contd.):
(Method Vs Trace):
                Size               Share
Full program    4,695,780 bytes
Hot methods     396,230 bytes      8% of program
Hot traces      396,230 bytes      26% of hot methods, 2% of program
Method JIT: best optimization window
Trace JIT: best speed/space tradeoff
Dalvik JIT (Contd.):
The provisional decision was to start with trace for the following reasons:
• Minimizing memory usage is critical for mobile devices
• Important to deliver performance boost quickly
  - User might give up on new app if we wait too long to JIT
• Leave open the possibility of supplementing with method-based JIT
  - The two styles can co-exist
  - A mobile device looks more like a server when it’s plugged in
  - Best of both worlds
    • Trace JIT when running on battery
    • Method JIT in background while charging
The Dalvik JIT can be considered as an extension of the Interpreter because it is the Interpreter which profiles and triggers trace selection mode when a potential trace head goes hot.
Dalvik JIT (Contd.):
Dalvik Trace JIT Flow:
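[Figure: Dalvik trace JIT flow. From Start, the interpreter runs until the next potential trace head and updates the profile count for that location. If the threshold is not reached (NO), interpretation continues. If it is reached (YES) and a translation exists (YES), control enters that translation in the translation cache, where translations are linked through their Exit 0 / Exit 1 points. If no translation exists (NO), the interpreter interprets and builds a trace request, then submits a compilation request to the compiler thread, which installs the new translation in the translation cache.]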
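The decision taken at each potential trace head in the flow above can be sketched as follows. This is a simplified, single-threaded model: TraceDispatch, its Action enum and the install method are hypothetical names, the translation is represented by a plain string, and the separate compiler thread is reduced to an explicit install call.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, single-threaded model of the trace-JIT dispatch decision. */
class TraceDispatch {
    enum Action { INTERPRET, BUILD_TRACE_REQUEST, RUN_TRANSLATION }

    private final int threshold;
    private final Map<Integer, Integer> profile = new HashMap<>();          // trace head -> count
    private final Map<Integer, String> translationCache = new HashMap<>();  // trace head -> code

    TraceDispatch(int threshold) {
        this.threshold = threshold;
    }

    /** Called whenever the interpreter reaches a potential trace head. */
    Action atTraceHead(int location) {
        int count = profile.merge(location, 1, Integer::sum); // update profile count
        if (count < threshold) {
            return Action.INTERPRET;            // not hot yet: keep interpreting
        }
        if (translationCache.containsKey(location)) {
            return Action.RUN_TRANSLATION;      // hot and already translated
        }
        return Action.BUILD_TRACE_REQUEST;      // hot, no translation: request one
    }

    /** Stands in for the compiler thread installing a finished translation. */
    void install(int location, String nativeCode) {
        translationCache.put(location, nativeCode);
    }
}
```

Because the profile update and threshold test run in the interpreter, the JIT can be seen as an extension of the interpreter, as noted earlier.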
Dalvik JIT (Contd.):
Features:
• Trace request is built during interpretation
  - Allows access to actual run-time values
  - Ensures that trace only includes bytecodes that have successfully executed at least once (useful for some optimizations)
• Trace requests handed off to compiler thread, which compiles and optimizes into native code
• Compiled traces chained together in translation cache
• Per-process translation caches (sharing only within security sandboxes)
• Simple traces, generally 1 to 2 basic blocks long
• Local optimizations
  - Register promotion
  - Load/store elimination
  - Redundant null-check elimination
  - Heuristic scheduling
• Loop optimizations
  - Simple loop detection
  - Invariant code motion
  - Induction variable optimization
JIT Compiler:
JIT Compiler Work Flow:
In order to execute bytecode, the JIT compiler goes through three stages.
1. Baseline: Generates code that is “obviously correct”. The process involves generating an internal representation of the Java code that is different from bytecodes but at a higher level than the target processor's native instructions (Intermediate Representation (IR)). “IR allows more effective machine-specific optimizations”
2. Optimizing: Applies a set of optimizations to a class when it is loaded at run time.
3. Adaptive: Methods are compiled with a non-optimizing compiler first, and then “hot” methods are selected for recompilation based on run-time profiling information.
“A key part of the JIT design was to split the compilation process into two passes. The first pass transforms the standard, stack-based bytecodes into a simple 3-address intermediate representation in which all temporary statement results are placed into new local variables instead of entries on an evaluation stack. The second pass converts this three-address form into native machine code.”
Intermediate Representation:
An IR instruction is an N-tuple (an ordered list of N elements), consisting of an operator and some number of operands.
“The Intermediate Representation is a machine- and language-independent version of the original source code”
An operator is the instruction to perform.
Operands are used to represent symbolic registers, physical registers, memory locations, constants, branch targets, method signatures, types, etc.
An IR code must be convenient to translate into real assembly code for all desired target machines.
Intermediate Representation (contd.):
Three Address Code (TAC or 3AC):
1. Three-address code is a form of intermediate code (IR) used by compilers to aid in the implementation of code-improving transformations.
2. Each instruction in three-address code can be described as a 4-tuple: (operator, operand1, operand2, result), as shown:
       result := operand1 operator operand2
   such as
       x := y + z
3. Expressions containing more than one fundamental operation, such as
       p = x + y * z
   are not representable in three-address code as a single instruction. Instead, they are decomposed into an equivalent series of instructions, such as
       t1 := y * z
       p := x + t1
“The key features of three-address code are that every instruction implements exactly one fundamental operation, and that the source and destination may refer to any available register”
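The decomposition of p = x + y * z into three-address instructions can be sketched with a small recursive lowering. TacGen and its Var and Bin node classes are hypothetical names used only for this illustration; temporaries are named t1, t2, ... as in the example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical expression tree lowered to three-address code. */
class TacGen {
    interface Expr {}

    static final class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
    }

    static final class Bin implements Expr {
        final char op;
        final Expr left, right;
        Bin(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    /** Emits instructions for e and returns the name that holds its value. */
    private String lower(Expr e) {
        if (e instanceof Var) {
            return ((Var) e).name;
        }
        Bin b = (Bin) e;
        String l = lower(b.left);
        String r = lower(b.right);
        String t = "t" + (++temps);              // fresh temporary for this operation
        code.add(t + " := " + l + " " + b.op + " " + r);
        return t;
    }

    /** Lowers "target = e" so that the final operation writes target directly. */
    static List<String> assign(String target, Expr e) {
        TacGen g = new TacGen();
        if (e instanceof Var) {
            g.code.add(target + " := " + ((Var) e).name);
        } else {
            Bin b = (Bin) e;
            String l = g.lower(b.left);
            String r = g.lower(b.right);
            g.code.add(target + " := " + l + " " + b.op + " " + r);
        }
        return g.code;
    }
}
```

Each emitted line performs exactly one operation, which is the property the quoted definition describes.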
Intermediate Representation (contd.):
Static Single Assignment form (SSA):
1. A refinement of three-address code and a property of an intermediate representation (IR), which says that each variable is assigned exactly once.
2. Existing variables in the original IR are split into versions (new variables, typically indicated in textbooks by the original name with a subscript), so that every definition gets its own version.
Benefits (by Example):
TAC          SSA
y := 1       y1 := 1
y := 2       y2 := 2
x := y       x := y2
1. Humans can see that the first assignment is not necessary.
2. The value of y being used in the third line comes from the second assignment of y.
On the TAC form, a program would have to perform “reaching definition analysis” to do these optimizations.
With SSA, 1 and 2 are immediate: “y1” is never used, so omitting its definition won't affect the rest of the code, and the use in the third line names “y2” directly.
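For straight-line code, the renaming into versions can be sketched in a few lines. SsaRename is a hypothetical helper that handles only simple copies of the form "dst := src" and no control flow, so it needs no phi functions; unlike the slide, it also gives x a version number (x1).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical SSA renamer for straight-line copies of the form "dst := src". */
class SsaRename {
    /** Each input pair is {dst, src}; src may be a variable name or a literal. */
    static List<String> rename(String[][] assignments) {
        Map<String, Integer> version = new HashMap<>(); // latest version per variable
        List<String> out = new ArrayList<>();
        for (String[] a : assignments) {
            String dst = a[0];
            String src = a[1];
            // A use refers to the latest version of the variable, if it has one.
            String use = version.containsKey(src) ? src + version.get(src) : src;
            // A definition always creates a new version.
            int v = version.merge(dst, 1, Integer::sum);
            out.add(dst + v + " := " + use);
        }
        return out;
    }
}
```

Once every definition has its own name, checking whether y1 is ever read is a plain name lookup rather than a data-flow analysis.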
Intermediate Representation (contd.):
3 levels of IR:
[Figure: Levels of IR: bytecode, HIR, MIR, LIR, machine code.]
1. IRs that are close to a high-level language are called high-level IRs, and IRs that are close to assembly are called low-level IRs.
2. A high-level IR might preserve things like array subscripts or field accesses whereas a low-level IR converts those into explicit addresses and offsets.
Original            HIR               MIR            LIR
float a[10][20]     t1 = a[i, j+2]    t1 = j+2       r1 = [fp-4]
a[i][j+2]                             t2 = i*20      r2 = r1+2
                                      t3 = t1+t2     r3 = [fp-8]
                                      t4 = 4*t3      r4 = r3*20
                                      t5 = addr a    r5 = r4+r2
                                      t6 = t5+t4     r6 = 4*r5
                                      t7 = *t6       r7 = fp-216
                                                     f1 = [r7+r6]
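The MIR column makes the address arithmetic for a[i][j+2] explicit. As a worked check, the first four MIR steps can be written out in Java; ArrayAddress and byteOffset are hypothetical names, and the sketch assumes 4-byte floats and a row length of 20, as in the example.

```java
/** Hypothetical worked example of the MIR address arithmetic for a[i][j+2]. */
class ArrayAddress {
    /** Returns the byte offset of a[i][j+2] from the start of float a[10][20]. */
    static int byteOffset(int i, int j) {
        int t1 = j + 2;     // column index
        int t2 = i * 20;    // elements in the rows that precede row i
        int t3 = t1 + t2;   // linear element index
        int t4 = 4 * t3;    // scale by the size of a float in bytes
        return t4;
    }
}
```

For i = 1 and j = 0 this gives 4 * (20 + 2) = 88 bytes, which the MIR then adds to the base address of a.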
Intermediate Representation (contd.):
1. HIR (High-Level IR)
a) IR that is closer to a high-level language (operators similar to Java bytecode)
b) Usually preserves information such as loop structure and if-then-else statements
c) Operates on symbolic registers instead of an implicit stack
HIR Generation:
Java Code (.java):
class AdditionMethodTest {
    public static void main(String args[]) {
        int a = 3;
        int b = 4;
        int c = a + b;
        int d = getValue(c);
        return;
    } // End method main

    public static int getValue(int var) {
        return var * var;
    } // End method getValue
}
Bytecode (.class):
Method void main(java.lang.String[])
   0 iconst_3
   1 istore_1
   2 iconst_4
   3 istore_2
   4 iload_1
   5 iload_2
   6 iadd
   7 istore_3
   8 iload_3
   9 invokestatic #2 <Method int getValue(int)>
  12 istore 4
  14 return
Method int getValue(int)
   0 iload_0
   1 iload_0
   2 imul
   3 ireturn
Intermediate Representation (contd.):
Conversion from Java bytecode to HIR:
The compiler that performs this conversion contains 2 parts.
1. The BC2IR algorithm, which translates bytecode to HIR and performs on-the-fly optimizations during translation.
2. Additional optimizations performed on the HIR after translation.
BC2IR Translation:
1. Discovers extended basic blocks
2. Constructs an exception table for the method
3. Creates HIR instructions for bytecodes
4. Performs on-the-fly optimizations
   a) Copy propagation
   b) Constant propagation
   c) Register renaming for local variables
   d) Dead-code elimination
   e) Short final or static methods are inlined
Note: Even though these optimizations are also performed in later phases, doing them here reduces the size of the generated HIR and thus compile time.
Intermediate Representation (contd.):
Example of on-the-fly optimization:
Copy propagation can be seen here.
Source statement:
    y = x + 5
Java bytecode:
    iload x
    iconst 5
    iadd
    istore y
Generated IR (optimization off):
    INT_ADD  tint, xint, 5
    INT_MOVE yint, tint
Generated IR (optimization on):
    INT_ADD  yint, xint, 5
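The effect shown in the example, where the INT_ADD and the following INT_MOVE collapse into a single INT_ADD, can be sketched as a tiny peephole pass. CopyPropagation and Instr are hypothetical names, not the BC2IR implementation. The sketch assumes that temporaries have names starting with "t" and are read only by the move that immediately follows their definition.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical peephole pass that removes a copy through a single-use temporary. */
class CopyPropagation {
    /** An instruction "op dst, a, b"; b is null for a two-operand INT_MOVE. */
    static final class Instr {
        final String op, dst, a, b;
        Instr(String op, String dst, String a, String b) {
            this.op = op;
            this.dst = dst;
            this.a = a;
            this.b = b;
        }
        @Override
        public String toString() {
            return op + " " + dst + ", " + a + (b == null ? "" : ", " + b);
        }
    }

    /** Rewrites "OP t, a, b; INT_MOVE y, t" into "OP y, a, b". */
    static List<Instr> fold(List<Instr> in) {
        List<Instr> out = new ArrayList<>();
        for (Instr cur : in) {
            Instr prev = out.isEmpty() ? null : out.get(out.size() - 1);
            boolean copiesPrevTemp = prev != null
                    && cur.op.equals("INT_MOVE")
                    && cur.a.equals(prev.dst)
                    && prev.dst.startsWith("t"); // assumption: "t..." marks a temporary
            if (copiesPrevTemp) {
                // Let the previous instruction write the final destination directly.
                out.set(out.size() - 1, new Instr(prev.op, cur.dst, prev.a, prev.b));
            } else {
                out.add(cur);
            }
        }
        return out;
    }
}
```

Doing this while the IR is being generated means the temporary and the move are never materialized, which keeps the HIR smaller.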
********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I
-13 LABEL0 Frequency: 0.0
-2  EG ir_prologue l0i(I,d) =
2   int_mul t2i(I) = l0i(I,d), l0i(I,d)
3   int_move t1i(I) = t2i(I)
-3  return t1i(I)
-1  bbend BB0 (ENTRY)
********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.getValue (I)I
Intermediate Representation (contd.):
The HIR generated code for AdditionMethodTest.java:
********* START OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V
-13 LABEL0 Frequency: 0.0
-2  EG ir_prologue l0i([Ljava/lang/String;,d) =
1   int_move l1i(B) = 3
3   int_move l2i(B) = 4
7   int_move l3i(B) = 7
9   EG call l5i(I) AF CF OF PF SF ZF = 66668, static"AdditionMethodTest.getValue (I)I", <unused>, 7
-3  return <unused>
-1  bbend BB0 (ENTRY)
********* END OF IR DUMP Initial HIR FOR AdditionMethodTest.main ([Ljava/lang/String;)V
Intermediate Representation (contd.):
Optimizations for HIR:
The following optimizers are provided for basic optimization.
1. CF  // Constant Folding
2. CPF // Constant Propagation and Folding (triggered by the propagation)
3. CSE // Common Sub-expression Elimination (within basic blocks)
4. DCE // Dead Code Elimination
5. GT  // Global Variable Temporalization (within basic block)
The optimizers CF and GT do not require data flow analysis; however, CPF, CSE and DCE require some results of data flow analysis.
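Constant propagation and folding can be sketched with a table of variables whose values are known at compile time. ConstProp and its methods are hypothetical names and handle only integer moves and additions within one basic block; this is not the COINS implementation. It reproduces the effect visible in the main method's HIR, where c = a + b with a = 3 and b = 4 appears as the constant 7.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical constant propagation and folding over one basic block. */
class ConstProp {
    private final Map<String, Integer> known = new HashMap<>(); // variable -> constant value

    /** Records "dst = value", where value is a literal. */
    void move(String dst, int value) {
        known.put(dst, value);
    }

    /** Handles "dst = a + b"; returns the folded constant, or null if it is not known. */
    Integer add(String dst, String a, String b) {
        Integer x = known.get(a); // propagation: look up operand values
        Integer y = known.get(b);
        if (x == null || y == null) {
            known.remove(dst);    // dst no longer holds a known constant
            return null;
        }
        int folded = x + y;       // folding: evaluate at compile time
        known.put(dst, folded);
        return folded;
    }
}
```

The folded result can itself be propagated into later instructions, which is why folding is described as being triggered by the propagation.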
A complete description is available at http://www.coins-project.org/international/COINSdoc.en/hiropt/hiropt.html
Intermediate Representation (contd.):
2. Medium-Level IRs (MIR)
a) Support a range of features in a set of source languages, but in a language-independent way.
b) Good basis for generation of efficient machine code for one or more architectures.
Example: register transfer languages
3. Low-Level IRs (LIR)
a) Almost one-to-one correspondence to target-machine instructions: quite architecture-dependent.
<MIR & LIR to be added>
Optimization Techniques:
Why Optimization:
1. Programmers do not always write optimal code.
   a) For example, ways to improve code are not always recognized (e.g. moving loop-invariant code out of loops, avoiding re-computation of the same expression).
2. A high-level language may not allow a programmer to avoid redundant computation (or may make it inconvenient):
       a[i][j] = a[i][j] + 1
3. The programmer should not be bothered with the target machine architecture. Moreover, modern machine architectures assume optimization; it has become hard to optimize by hand.
Goal:
Let programmers write clean, high-level source code, and still produce programs that approach assembly-code performance.
Optimization: the transformation of a program P into a program P´ that has the same input/output behavior, but is somehow “better”. Better might mean:
• faster, or
• smaller, or
• uses less power, or
• whatever you care about
P´ is not necessarily optimal, and may even be worse than P.
1. Inlining (also at lower levels)
2. Specialization
3. Constant folding
4. Constant propagation
5. Value numbering
6. Dead code elimination
7. Loop-invariant code motion
8. Common sub-expression elimination
9. Strength reduction
10. Branch prediction/optimization
11. Register allocation
12. Loop unrolling
13. Cache optimization
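Two of the techniques in this list, loop-invariant code motion and strength reduction, can be illustrated by applying them by hand. LoopOpts and both methods are hypothetical; the second method is what an optimizer might produce from the first, with the same input/output behavior.

```java
/** Hypothetical before/after pair for two classic loop optimizations. */
class LoopOpts {
    /** Straightforward version: recomputes scale * offset and i * stride each iteration. */
    static long naive(int n, int scale, int offset, int stride) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) scale * offset + (long) i * stride;
        }
        return sum;
    }

    /** Same result after hoisting the invariant product and replacing the multiply by an add. */
    static long optimized(int n, int scale, int offset, int stride) {
        long invariant = (long) scale * offset; // loop-invariant code motion
        long step = 0;                          // strength reduction: tracks i * stride
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += invariant + step;
            step += stride;                     // an addition replaces the multiplication
        }
        return sum;
    }
}
```

Both methods return the same value for any arguments, which matches the definition of optimization given above: P´ keeps the behavior of P while doing less work per iteration.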
Optimization Techniques: