JIT Instrumentation – A Novel Approach To Dynamically Instrument Operating Systems
Marek Olszewski, Keir Mierle, Adam Czajkowski, Angela Demke Brown
University of Toronto
2
Instrumenting Operating Systems
Operating systems are growing in complexity
Kernel instrumentation can help
Used for: debugging, profiling, monitoring, and security auditing...
Dynamic instrumentation
No recompilation & no reboot
Good for debugging systemic problems
Feasible in production settings
3
Current Approach: Probe-Based
Dynamic instrumentation tools for OSs are probe based
Overwrite existing code with jump/trap
Efficient on fixed-length architectures
Slow on variable-length architectures
Not safe to overwrite multiple instructions with a jump:
A branch to between the instructions might exist
A thread might be sleeping in between the instructions
Must use trap instruction
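The hazard on variable-length architectures can be illustrated with a little byte arithmetic. The sketch below is not from the slides: the function name `clobbered` and the example instruction lengths are hypothetical. It relies only on two x86 facts: a near jump with a 32-bit displacement is 5 bytes long and `int3` is 1 byte long.

```python
from typing import List

JMP_REL32_LEN = 5  # x86 near jump with a 32-bit displacement
INT3_LEN = 1       # x86 breakpoint trap


def clobbered(lengths: List[int], index: int, patch_len: int) -> List[int]:
    """Return the indices of the instructions overwritten by a patch.

    `lengths` holds the byte length of each instruction in order, and the
    patch of `patch_len` bytes is written at the start of instruction `index`.
    (Hypothetical helper, for illustration only.)
    """
    starts = []
    offset = 0
    for n in lengths:
        starts.append(offset)
        offset += n
    patch_start = starts[index]
    patch_end = patch_start + patch_len
    # An instruction is clobbered if its byte range overlaps the patch range.
    return [i for i, (s, n) in enumerate(zip(starts, lengths))
            if s < patch_end and s + n > patch_start]


if __name__ == "__main__":
    lengths = [2, 3, 1, 5]  # made-up instruction lengths
    print(clobbered(lengths, 0, JMP_REL32_LEN))  # jump spills into the next instruction
    print(clobbered(lengths, 0, INT3_LEN))       # trap stays inside one instruction
```

A one-byte trap never extends past the instruction it is placed on, which is why a probe on a short instruction has to fall back to the trap.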
4
Current Approach: Trap-based
Instrumentation Code:
add $1,count_l
adc $0,count_h
Area of interest (int3 overwrites inc 14(edx)):
mov $ffffe000,edx
and esp,edx
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
sub $6c,esp
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
Trap Handler:
1. Save processor state
2. Look up which instrumentation to call
3. Call instrumentation
4. Emulate overwritten instruction
5. Restore processor state
Very Expensive!
5
Alternative: JIT Instrumentation
Propose to use just-in-time dynamic instrumentation
Rewrite code to insert new instructions in between existing ones
More Efficient.
More Powerful. Supports:
Instrumenting branch directions
Basic block-level instrumentation
Per execution-path instrumentation
Proven itself in user space (Pin, Valgrind)
6
JIT Instrumentation
Instrumentation Code:
add $1,count_l
adc $0,count_h
Area of Interest:
mov $ffffe000,edx
and esp,edx
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
sub $6c,esp
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
Code Cache:
mov $ffffe000,edx
and esp,edx
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
sub $6c,esp
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
7
JIT Instrumentation
Instrumentation Code:
add $1,count_l
adc $0,count_h
Area of Interest:
mov $ffffe000,edx
and esp,edx
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
sub $6c,esp
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
Code Cache:
mov $ffffe000,edx
and esp,edx
sub $6c,esp
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
Inserted into the Code Cache:
pushf
call instrmtn
popf
8
JIT Instrumentation
Instrumentation Code:
add $1,count_l
adc $0,count_h
Area of Interest:
mov $ffffe000,edx
and esp,edx
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
sub $6c,esp
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
Code Cache:
mov $ffffe000,edx
and esp,edx
sub $6c,esp
mov 28(edi),eax
add $1,eax
or $c,eax
and $3,eax
mov eax,(ebx)
or $f,ebp
mov ebp,4(ebx)
mov 2c(edi),ebx
mov 30(edi),ebp
add $2,ebp
inc 14(edx)
Inserted into the Code Cache:
pushf
call instrmtn
popf
9
Dynamic Binary Rewriting
Use dynamic binary rewriting to insert the new instructions.
Interleaves binary rewriting with execution
Performed by a runtime system
Typically at basic block granularity
Code is rewritten into a code cache
Rewritten code must be:
Efficient
Unaware of its new location
10
Dynamic Binary Rewriting
Original Code: bb1, bb2, bb3, bb4
Runtime System
Code Cache: bb1
11
Dynamic Binary Rewriting
Original Code: bb1, bb2, bb3, bb4
Runtime System
Code Cache: bb1, bb2
12
Dynamic Binary Rewriting
Original Code: bb1, bb2, bb3, bb4
Runtime System
Code Cache: bb1, bb2, bb4
No longer need to enter runtime system
13
Dynamic Binary Rewriting
Used for rewriting operating systems
Virtualization (VMware)
Emulation (QEMU)
Never used for instrumentation of OSs
Never used to rewrite host OS in a general manner
Allows instrumentation of live system
14
Outline
Prototype (JIFL)
Design
OS Issues
Performance comparison
Kprobes vs JIFL
Example Plugin
Checking branch hint directions
15
Prototype Design
JIFL - JIT Instrumentation Framework for Linux
Instruments code reachable from system calls
16
JIFL Software Architecture
User Space:
JIFL Plugin Starter
Kernel Space:
JIFL Plugin (Loadable Kernel Module)
JIFL (Loadable Kernel Module): Runtime System, Dispatcher, JIT Compiler, Memory Allocator, Heap, Code Cache
Linux Kernel (all code reachable from system calls)
17
Gaining Control
Runtime System must gain control before it can start rewriting/instrumenting OS
Update system call table entry to point to dynamically emitted entry stub
Calls per-system call instrumentation
Calls dispatcher
Passing original system call pointer
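The entry-stub idea can be mimicked in user-space Python. This is only a sketch of the control flow under stated assumptions: the table, the stub factory, and the stand-in dispatcher are hypothetical names, and nothing here reflects JIFL's actual kernel code.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for kernel structures; none of these names come from JIFL.
syscall_table: Dict[str, Callable[[int], int]] = {}
log: List[str] = []


def dispatcher(original: Callable[[int], int], arg: int) -> int:
    """Stand-in for the runtime system's dispatcher; here it just runs the original."""
    log.append("dispatcher")
    return original(arg)


def make_entry_stub(name: str, original: Callable[[int], int]) -> Callable[[int], int]:
    """Build a stub that instruments the call, then hands control to the dispatcher."""
    def entry_stub(arg: int) -> int:
        log.append(f"instrument:{name}")   # per-system-call instrumentation
        return dispatcher(original, arg)   # pass the original system call pointer
    return entry_stub


def hook(name: str) -> None:
    """Point the table entry at a freshly built entry stub."""
    syscall_table[name] = make_entry_stub(name, syscall_table[name])


if __name__ == "__main__":
    syscall_table["demo"] = lambda x: x + 1  # made-up system call
    hook("demo")
    print(syscall_table["demo"](41), log)
```

Callers keep going through the table, so they reach the stub without being modified themselves.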
18
Dispatcher
Saves registers and condition code states
Dispatcher checks if target basic block is in code cache
If so it jumps to this basic block
Otherwise it invokes the JIT to compile and instrument the new basic block
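The lookup-or-compile behaviour described above can be sketched as a dictionary keyed by the original address of each basic block. The class and attribute names are hypothetical, the JIT is passed in as a plain function, and register and condition code saving is not modelled.

```python
from typing import Callable, Dict


class Dispatcher:
    """Toy dispatcher: maps original block addresses to rewritten blocks."""

    def __init__(self, jit: Callable[[int], str]) -> None:
        self.code_cache: Dict[int, str] = {}  # original address -> rewritten block
        self.jit = jit                        # hypothetical JIT entry point
        self.jit_invocations = 0

    def dispatch(self, target_addr: int) -> str:
        block = self.code_cache.get(target_addr)
        if block is None:
            # Miss: compile and instrument the block, then remember it.
            self.jit_invocations += 1
            block = self.jit(target_addr)
            self.code_cache[target_addr] = block
        # Hit (or freshly compiled): a real dispatcher would jump to the block here.
        return block


if __name__ == "__main__":
    d = Dispatcher(lambda addr: f"block@{addr:#x}")
    d.dispatch(0x1000)
    d.dispatch(0x1000)
    print(d.jit_invocations)  # the second dispatch is served from the cache
```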
19
JIT Compiler
Like a conventional JIT compiler, except its input and output are x86 machine code
Compiles at a dynamic basic block granularity
All instructions except the final control flow instruction are copied directly into the code cache
Control flow instructions are modified to account for the new location of the code
Communicates with the JIFL plugin to determine what instrumentation to insert
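One reason control flow instructions cannot be copied verbatim is that x86 relative branches encode their target as a displacement from the end of the branch instruction. The sketch below shows only that displacement arithmetic; the function names and addresses are made up, and a real rewriter would also choose where the branch should go (for example into the code cache or back to the dispatcher), which is not modelled here.

```python
import struct


def branch_target(insn_addr: int, insn_len: int, disp: int) -> int:
    """Target of an x86 relative branch: end of the instruction plus displacement."""
    return insn_addr + insn_len + disp


def relocate_disp(old_addr: int, new_addr: int, insn_len: int, old_disp: int) -> int:
    """Displacement that makes the copied branch reach the same target as before."""
    target = branch_target(old_addr, insn_len, old_disp)
    return target - (new_addr + insn_len)


def encode_rel32(disp: int) -> bytes:
    """Encode a displacement as a signed 32-bit little-endian field."""
    return struct.pack("<i", disp)


if __name__ == "__main__":
    old_addr, new_addr = 0xC0100000, 0xD0000000  # hypothetical addresses
    new_disp = relocate_disp(old_addr, new_addr, 5, 0x20)
    print(hex(branch_target(new_addr, 5, new_disp)))
    print(encode_rel32(new_disp).hex())
```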
20
JIT: Inserting Instrumentation
Instrumentation is added by inserting a call instruction into the basic block
Additional instructions are also needed to:
Push/Pop instrumentation parameters
Save/Restore volatile registers (eax, edx, ecx)
Save/Restore condition code register
Several optimizations can be performed to reduce instrumentation cost
21
Eliminating Redundant State Saving
Eliminate dead register and condition code saving code
Perform liveness analysis
Reduce state saving overhead
Per-basic block Instrumentation
Search for the cheapest place to insert it
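A simplified view of the condition code part of this analysis: the flags need not be saved at a point if some later instruction overwrites all of them before any instruction reads them. The sketch below works on mnemonics only, uses a deliberately tiny and incomplete instruction table, and conservatively treats the flags as live when the end of the block is reached. All names are hypothetical, and this is not presented as JIFL's actual analysis.

```python
from typing import List

# Tiny, incomplete tables (an assumption for illustration only).
WRITES_ALL_FLAGS = {"add", "sub", "and", "or", "cmp", "test"}
READS_FLAGS = {"adc", "sbb", "jz", "jnz", "pushf"}


def flags_dead_before(block: List[str], index: int) -> bool:
    """True if the condition codes are dead just before block[index]."""
    for mnemonic in block[index:]:
        if mnemonic in READS_FLAGS:
            return False   # someone still needs the current flags
        if mnemonic in WRITES_ALL_FLAGS:
            return True    # flags are overwritten before being read
    return False           # end of block reached: conservatively assume live


def cheapest_insertion_point(block: List[str]) -> int:
    """First position where no flag save is needed, else the block start."""
    for i in range(len(block)):
        if flags_dead_before(block, i):
            return i
    return 0


if __name__ == "__main__":
    block = ["adc", "mov", "add", "jz"]
    print(cheapest_insertion_point(block))
```

Instructions that update only some flags (such as `inc`, which leaves the carry flag untouched) are treated as neither reading nor killing the flags, which keeps the answer conservative.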
22
Inlining Instrumentation
Small instrumentation can be inlined into the basic block
Removes the call and ret instructions
Constant parameters are propagated to remove stack accesses
Copy propagation and dead-code elimination are applied to specialize the instrumentation routine for its context
All done on native x86 code. No IR!
23
Effect of Optimizations
Average system call latencies with per-basic block instrumentation (normalized execution time):
Baseline: 6.82
Eliminating Redundant Saves: 3.53
Eliminating Redundant Saves and Inlining: 2.66
24
Prototype
Operating System Issues
25
Memory Allocator
While JITing, JIFL often needs to allocate dynamic memory
Cannot rely on Linux kmalloc and vmalloc routines as they are not reentrant
Instead, we created our own memory allocator
Pre-allocate a heap when JIFL starts up
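The slides do not say how the allocator is organised. As one possible illustration, the sketch below uses a bump allocator over a buffer reserved up front, so that allocation never calls back into any other allocator. The class name, the 8-byte default alignment, and the choice of bump allocation are all assumptions.

```python
from typing import Optional


class PreallocatedHeap:
    """Bump allocator over a fixed buffer reserved at start-up (illustrative only)."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)  # reserved once, before any JIT work
        self.next = 0               # offset of the first free byte

    def alloc(self, nbytes: int, align: int = 8) -> Optional[int]:
        """Return the offset of a new region, or None if the heap is exhausted.

        `align` must be a power of two.
        """
        start = (self.next + align - 1) & ~(align - 1)  # round up to alignment
        if start + nbytes > len(self.buf):
            return None
        self.next = start + nbytes
        return start


if __name__ == "__main__":
    heap = PreallocatedHeap(64)
    print(heap.alloc(10), heap.alloc(10), heap.alloc(100))
```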
26
Releasing Control
Calls to schedule() have to be redirected
Otherwise, JIFL keeps control even after context switch
Have to:
Save return address in hash table
Call schedule()
Look up and call dispatcher
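The three steps can be mimicked as follows. Keying the hash table by a thread identifier is an assumption (the slide only says that a hash table is used), and `schedule` and `dispatcher` are passed in as plain callables, so this shows the ordering of the steps and nothing more.

```python
from typing import Callable, Dict

# Hypothetical table: thread id -> address at which to resume (assumed keying).
saved_return: Dict[int, int] = {}


def redirected_schedule(thread_id: int,
                        return_addr: int,
                        schedule: Callable[[], None],
                        dispatcher: Callable[[int], int]) -> int:
    """Stand-in for the redirected call to schedule()."""
    saved_return[thread_id] = return_addr     # 1. save return address
    schedule()                                # 2. call schedule(); other threads may run
    resume_addr = saved_return.pop(thread_id) # 3. look up the saved address...
    return dispatcher(resume_addr)            #    ...and re-enter through the dispatcher


if __name__ == "__main__":
    events = []
    redirected_schedule(7, 0x1234,
                        lambda: events.append("schedule"),
                        lambda addr: events.append(("dispatch", addr)) or addr)
    print(events)
```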
27
Performance Comparison
JIFL vs. Kprobes
28
Performance Evaluation
Instrument every system call with three types of instrumentation:
System Call Monitoring (Coarse Grained)
Call Tracing (Medium Grained)
Basic Block Counting (Fine Grained)
LMbench and ApacheBench2 benchmarks
Test Setup:
4-way Intel Pentium 4 Xeon SMP - 2.8GHz
Linux 2.6.17.13
With SMP support and no preemption
29
System Call Monitoring
Normalized execution time:
Benchmark                  JIFL   Kprobes
read                       1.53   1.78
write                      1.51   1.85
stat                       1.31   1.18
fstat                      1.26   1.65
select (500 fd's)          1.47   1.07
select (500 tcp fd's)      1.38   1.16
open + close               1.45   1.27
fork + execve              1.40   1.00
Geometric mean             1.41   1.34
30
Call Tracing
Normalized execution time (plotted on a log scale):
Benchmark                  JIFL   Kprobes
read                       1.54   8.7
write                      1.97   8.3
stat                       1.70   16.6
fstat                      1.38   7.5
select (500 fd's)          1.77   46.3
select (500 tcp fd's)      1.76   45.0
open + close               2.35   17.4
fork + execve              1.50   9.4
Geometric mean             1.73   15.3
31
Basic Block Counting
Normalized execution time (plotted on a log scale):
Benchmark                  JIFL   Kprobes
read                       1.98   91
write                      2.02   84
stat                       2.17   222
fstat                      1.63   69
select (500 fd's)          7.02   1490
select (500 tcp fd's)      5.06   1051
open + close               2.83   208
fork + execve              1.80   142
Geometric mean             2.66   220
32
Apache Throughput
Normalized requests per second:
Instrumentation                            JIFL   Kprobes
System Call Monitoring (Coarse Grained)    0.97   1.00
Call Tracing (Medium Grained)              0.95   0.22
Basic Block Counting (Fine Grained)        0.83   0.03
33
Example Plugin
Checking Correctness of Branch Hints
34
Example Plugin: Checking Branch Hints
Int correct_count = 0
Int incorrect_count = 0

// Called for every newly discovered basic block.
Procedure Basic_Block_Callback
    if last instruction is not a hinted branch
        return
    if hinted in the branch-not-taken direction
        call Insert_Branch_Not_Taken_Instrumentation(Increment_Counter, &correct_count)
        call Insert_Branch_Taken_Instrumentation(Increment_Counter, &incorrect_count)
    else
        // Insert same instrumentation but for reverse branch directions

// Executed for every instrumented branch.
Procedure Increment_Counter(Int *Counter)
    *Counter = *Counter + 1
35
Example Plugin: Checking Branch Hints
5 system calls with bad branch hint performance
Misprediction rates > 75%
Contained > 30% of hinted branches executed
Examined using a second plugin
Monitored individual branches
Found 4 greatest contributors
Mapped back to source code
Can’t fix: Not hinted by programmer!
36
Conclusions
JIT instrumentation viable for operating systems
Developed a prototype for the Linux kernel (JIFL)
Results are very competitive
JIFL outperforms Kprobes by orders of magnitude
Enables more powerful instrumentation
e.g. Branch Hints