PinOS: A Programmable Framework for Whole-System Dynamic Instrumentation
Prashanth P. Bungale, 14th June 2007
Joint work with Chi-Keung Luk
Outline
Pin Overview
PinOS motivation and goals
Architecture
Design Issues
Evaluation
Future work
What is Pin?
A Dynamic Binary Instrumentation System
– Injects and deletes instructions in the instruction stream at run time, without source code
Programmable Instrumentation
– Provides APIs to write instrumentation tools (called PinTools) in C/C++
Multiplatform
– Supports 32-bit and 64-bit x86, Itanium
– Supports Linux, Windows, MacOS
Robust
– Instruments real-life and multithreaded applications
  • Databases, search engines, web browsers
Increasingly Popular
– Over 10,000 downloads since Pin was released in June 2004
Pin Instrumentation Uses
Computer Architecture Research
– Branch predictor simulation
– Cache simulation
– Trace generation
– Instruction emulation
  • E.g., emulate newly proposed instructions
Software Instrumentation
– Profiling for optimization
  • Basic block counts, edge counts
– Bug checking
PinOS Goals
Extend Pin to instrument OS code as well
– Programmable through extended Pintool API
Fine-grain instrumentation of both kernel- and user-level code
– No limitation on where and what kind of instrumentation can be inserted
– Not achievable by existing probe-based tools (e.g., DTrace and Kprobes)
Only active when needed
– Attach/detach PinOS to/from the guest as and when needed
Generalized Infrastructure
– Single framework to instrument Linux, Windows, etc.
PinTool on PinOS: Tracing Memory Writes

#include "pin.H"  // standard Pin header; assumed to also declare the PinOS extensions

FILE * trace;

// Print a memory write record: instruction pointer, virtual address,
// physical address, and size of the write
VOID RecordMemWrite(VOID * ip, VOID * va, VOID * pa, UINT32 size) {
    Host_fprintf(trace, "%p: W %p %p %u\n", ip, va, pa, size);
}

// Called for every instruction
VOID Instruction(INS ins, VOID *v) {
    if (INS_IsMemoryWrite(ins))
        INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                       IARG_INST_PTR, IARG_MEMORYWRITE_VA, IARG_MEMORYWRITE_PA,
                       IARG_MEMORYWRITE_SIZE, IARG_END);
}

int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    trace = Host_fopen("atrace.out", "w");  // file is opened on the host side
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();  // Never returns
    return 0;
}
Architecture
(1) To run PinOS between guest and hardware: use Xen
(2) Virtualize and present a fake processor to the guest OS
[Figure: architecture diagram. The Xen Virtual Machine Monitor (VMM) runs on the hardware. Xen-Domain0 holds the Host OS. Xen-DomainU holds the Guest OS and PinOS. PinOS components: Engine, Code Cache, PinTool, I/O.]
Xen 3.0 - A Convenient Environment
Uses Intel VT to run unmodified operating systems
Open-source availability
We modify Xen 3.0 to customize it for PinOS purposes:
– Steal physical and virtual memory for PinOS
– Provide I/O services to PinOS
– Hijack initial control of guest domain
– Perform PinOS attach/detach
Provides support for debugging PinOS
Stealing Physical Memory
Memory requirements
– PinOS exe, Pintool exe, Code Cache, PinOS stack, heap, I/O buffers
Physical Memory
– Pre-allocate a separate range of machine pages for PinOS
[Figure: Physical Pages and Machine Pages]
Stealing Virtual Memory
Steal some portion of guest address space
– Current strategy: steal part of guest’s kernel address space
  • Minimizes chance of VA space conflicts
Map stolen VA space to pre-allocated pages in Xen shadow page tables
Propagate stealing to every shadow table
– i.e., in every address space ever encountered in the guest OS
Detect and report any conflicts
– No guest OS mapping activity encountered so far in stolen VA space
– Should be less of an issue with 64-bit address space
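The conflict detection mentioned on this slide comes down to a range-overlap test between the stolen VA range and any range the guest tries to map. A minimal sketch, assuming half-open ranges; `VaRange` and `ConflictsWithStolen` are hypothetical names, not part of PinOS or Xen:

```cpp
#include <cassert>
#include <cstdint>

using Addr = std::uint64_t;

// Half-open virtual address range [begin, end). Hypothetical type.
struct VaRange {
    Addr begin;
    Addr end;
};

// True when a guest mapping request touches any byte of the stolen range.
constexpr bool ConflictsWithStolen(VaRange stolen, VaRange guest) {
    return guest.begin < stolen.end && stolen.begin < guest.end;
}
```

A range that merely ends where the stolen range begins (or begins where it ends) is not reported, because the ranges are half-open.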
Memory Virtualization
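[Figure: the Guest OS page table maps V0 → P0, V1 → P1, ..., Vi → Pi. The Xen shadow page table maps V0 → M0, V1 → M1, ..., Vi → Mi, ..., Vk → Mk, Vk+1 → Mk+1, ..., Vn → Mn; a region of the shadow table is labelled PinOS Memory.]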
I/O Services for PinOS
I/O service requirements
– PinOS’s own debugging log, Pintools’ input/output
I/O channels implemented as shared ring buffers
– PinOS writes I/O requests to a buffer shared between guest and host domains
– Daemon process in host domain periodically polls and processes requests
Sharing the ring buffers
– Allocated in guest domain
– “Mapped in” by host domain
[Figure: PinOS in the Guest Domain; Daemon Process in the Host Domain]
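The shared ring buffer described on this slide can be pictured as a single-producer, single-consumer queue: PinOS pushes requests, and the host daemon polls for them. The sketch below is only an illustration of that pattern; `IoRequest`, `RequestRing`, and the field layout are assumptions, and the real PinOS request format is not given in the slides.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical request record; the actual PinOS layout is not shown.
struct IoRequest {
    std::uint32_t channel;  // e.g. debug log vs. Pintool output
    std::uint32_t length;   // payload length in bytes
};

// Single-producer / single-consumer ring with N slots (N a power of two).
template <std::size_t N>
class RequestRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Producer side (PinOS in the guest). Returns false when the ring is full.
    bool Push(const IoRequest& r) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        slots_[head & (N - 1)] = r;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    // Consumer side (daemon in the host). Returns false when nothing is pending.
    bool Poll(IoRequest* out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;
        *out = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    std::array<IoRequest, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

Because the daemon polls periodically, the producer never blocks on the consumer; it only has to handle the "ring full" case, for example by retrying later.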
PinOS Attach/Detach
Attach/Detach allows PinOS to be used only on the execution of interest
– Avoid overhead
  • e.g., can avoid PinOS being active during OS boot every time
– Precision / accuracy
  • Running PinOS on the entire run may pollute the collected instrumentation data
Implementing Attach
– Read entire state of guest machine
– Start PinOS activity from that point on
– Use VT support for reading and setting hidden register state
[Figure: timeline Native → Attach → PinOS → Detach → Native]
Code-Cache Indexing and Sharing
Pin uses VA as code cache index
In PinOS, different processes can use same VA for different code
– Virtual address alone is not sufficient to distinguish code
Option 1: <AddressSpaceID, VirtualAddress>
– Easy to implement (on x86, use the CR3 value)
– But, no sharing of code across address spaces
Option 2: <PhysicalAddress, VirtualAddress>
– Can share code across address spaces
– Persistence across application runs
– But, much more challenging to implement
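To make the difference between the two indexing options concrete, the sketch below counts how many translations a code-cache directory would create for the same set of executions under each key. It is a toy model, not the PinOS data structure; `Exec`, `IndexBy`, and `TranslationsNeeded` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Addr = std::uint64_t;

// One execution of a piece of code: which address space ran it, and where it lives.
struct Exec {
    Addr cr3;  // address-space ID (on x86, the CR3 value)
    Addr va;   // virtual address of the code
    Addr pa;   // physical address backing that virtual address
};

enum class IndexBy { AddressSpace, Physical };

// Counts the translations created for the given executions under one option.
inline std::size_t TranslationsNeeded(const std::vector<Exec>& execs, IndexBy mode) {
    std::map<std::pair<Addr, Addr>, std::size_t> directory;  // key -> trace number
    for (const Exec& e : execs) {
        const Addr first = (mode == IndexBy::Physical) ? e.pa : e.cr3;
        const auto key = std::make_pair(first, e.va);
        if (directory.find(key) == directory.end()) {
            const std::size_t trace_number = directory.size();
            directory[key] = trace_number;  // miss: translate and cache
        }
    }
    return directory.size();
}
```

With two processes running the same shared code page at the same VA, Option 1 translates it once per address space, while Option 2 translates it once in total.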
Results on booting FC4-Linux
<PhysicalAddress, VirtualAddress> is the Clear Winner!
[Figure: two bar charts comparing AddressSpaceID and PhysicalAddress indexing. Panel "Execution time": y-axis Execution Time (secs), 0 to 6000, bar labels 5340 and 840. Panel "Code cache space used": y-axis Code cache space used (MB), 0 to 1800, bar labels 71 and 1538.]
Correctness Issue with Trace Linking
Guest Code in Process A:
  <V1, P1>: jmp V2
  <V2, P2>
Guest Code in Process B:
  <V1, P1>: jmp V2
  <V2, P3>
Code Cache:
  V1’: Translation of <V1, P1>, ending in jmp V2’
  V2’: Translation of <V2, P2>
Step 1: Process A is instrumented and its translation is cached.
Step 2: Process B is instrumented and finds that <V1, P1> is already translated. So, no need to re-translate.
However, the jump to V2’ is incorrect because V2 is now mapped to P3 instead of P2!
Code-Cache Indexing and Sharing
Our solution:
– Check predicted page mapping against actual one at each trace entry
– Maintain “SoftTLB” that caches current guest page mappings
– Assign once and always use same TLB entry for a given VA->PA mapping
  • So that the trace entry check can involve a constant address lookup

A Translated Trace in Code Cache:
  V2’: Translation of <V2, P2>
    if (SoftTLB[V2] != P2) {
        // <V2, P2> is invalid.
        call PinOS();  // Never returns
    }
    // <V2, P2> is still valid.
    // Execute the rest of the trace.

SoftTLB:
  VA   PA
  V1   P1
  V2   P3
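The trace-entry check can be sketched as follows. This is a simplified model under stated assumptions: it keys the SoftTLB by virtual page (as the slide's pseudocode `SoftTLB[V2]` suggests), and `SoftTLB`, `TraceEntryCheck`, and `EntryCheckPasses` are illustrative names rather than the PinOS implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = std::uint64_t;
constexpr Addr kInvalidPage = ~Addr{0};

// Hypothetical SoftTLB: each virtual page gets one permanently assigned slot.
class SoftTLB {
public:
    // Returns the slot for a virtual page, assigning one on first use.
    // The slot never changes afterwards, so a trace can embed it as a constant.
    std::size_t SlotFor(Addr vpage) {
        auto it = slot_of_.find(vpage);
        if (it != slot_of_.end()) return it->second;
        slots_.push_back(kInvalidPage);
        slot_of_[vpage] = slots_.size() - 1;
        return slots_.size() - 1;
    }
    void Set(std::size_t slot, Addr ppage) { slots_[slot] = ppage; }
    Addr Get(std::size_t slot) const { return slots_[slot]; }
private:
    std::unordered_map<Addr, std::size_t> slot_of_;
    std::vector<Addr> slots_;
};

// What a translated trace remembers about the mapping it was built under.
struct TraceEntryCheck {
    std::size_t slot;   // constant SoftTLB slot of the trace's virtual page
    Addr predicted_pa;  // physical page predicted at translation time
};

// True: the prediction holds and the rest of the trace may run.
// False: the trace must exit to the engine (the slide's "call PinOS()").
inline bool EntryCheckPasses(const SoftTLB& tlb, const TraceEntryCheck& check) {
    return tlb.Get(check.slot) == check.predicted_pa;
}
```

In the Process A / Process B scenario, the trace for <V2, P2> passes the check while V2 maps to P2 and fails it once the slot holds P3, so the stale link is never followed.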
Coherence: Handling Page-Mapping Changes
Problem
– Guest’s page mappings may change after PinOS caches them in SoftTLB
Solution
– Xen already marks guest page-table pages as read-only and thus tracks all writes to them
– Modify Xen to inform PinOS once it figures out which page-table entries get changed
– PinOS then invalidates these page mappings in its SoftTLB
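The invalidation path can be sketched as a small cache with a notification hook. The notification interface between Xen and PinOS is not described in the slides, so `OnGuestPteWrite` and the rest of this class are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

// Hypothetical cache of guest VA -> PA page mappings.
class MappingCache {
public:
    // Record the mapping observed when a page is first translated.
    void Fill(Addr vpage, Addr ppage) { map_[vpage] = ppage; }

    // Assumed hook: called when the VMM reports that the guest wrote the
    // page-table entry covering vpage. The cached mapping can no longer be trusted.
    void OnGuestPteWrite(Addr vpage) { map_.erase(vpage); }

    // Returns true and stores the physical page if a valid mapping is cached.
    bool Lookup(Addr vpage, Addr* ppage) const {
        auto it = map_.find(vpage);
        if (it == map_.end()) return false;
        *ppage = it->second;
        return true;
    }
private:
    std::unordered_map<Addr, Addr> map_;
};
```

After an invalidation, the next lookup misses and the mapping has to be re-read from the guest page table before any trace that depends on it can run.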
Interrupt/Exception Virtualization
PinOS virtualizes interrupts and exceptions:
– Maintaining control
  • Ex: Timer interrupt triggering process preemption
– Maintaining transparency
  • Ex: Guest interrupt handler attempting to identify thread ID based on ESP
Install own interrupt handlers in Interrupt Descriptor Table (IDT)
– So all interrupts and exceptions are routed through PinOS
Handling interrupts (asynchronous)
– When received by PinOS, put it on a queue
– Add a pending-interrupts check at every trace entry
– Set up interrupted guest context with trace address and context
– Continue instrumentation at corresponding guest interrupt handler
Handling exceptions (synchronous)
– Recover excepting guest address and context, and set up context
– Continue instrumentation at corresponding guest exception handler
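The asynchronous path (queue on arrival, deliver at the next trace entry) can be sketched as below. `GuestContext` and `InterruptVirtualizer` are hypothetical, and the context is reduced to two registers for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Reduced guest context; a real one would hold the full register state.
struct GuestContext {
    std::uint32_t eip;
    std::uint32_t esp;
};

class InterruptVirtualizer {
public:
    // Called from the handlers installed in the IDT: only record the vector.
    void OnHardwareInterrupt(std::uint8_t vector) { pending_.push_back(vector); }

    // Called at every trace entry. If an interrupt is pending, fills in the
    // interrupted guest context (with the guest address of the trace about to
    // run, not a code-cache address) and the vector to deliver, then returns true.
    bool CheckAtTraceEntry(std::uint32_t guest_trace_addr, const GuestContext& live,
                           GuestContext* interrupted, std::uint8_t* vector) {
        if (pending_.empty()) return false;
        *vector = pending_.front();
        pending_.pop_front();
        *interrupted = live;
        interrupted->eip = guest_trace_addr;
        return true;
    }
private:
    std::deque<std::uint8_t> pending_;  // delivered in arrival order
};
```

Delivering only at trace entries means the guest always sees an interrupted address that corresponds to a real guest instruction, which is what keeps the interposition transparent.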
Exception Virtualization
Precise Exception Delivery
– In the face of “pseudo” instruction boundaries
– Log and roll back all guest-visible state changes until most recent guest instruction boundary
Faithful Exception Delivery
– While emulating instructions, conditions must be checked, and exceptions raised as guaranteed by hardware semantics

Original Guest Code      Translated Code
movw %ds, (%edx)         spill %eax
                         movw M.%ds, %ax
                         movw %ax, (%edx)
                         restore %eax
call proc                pushl <current-eip>
                         jmp xlated-proc

Boundaries between translated instructions that belong to one guest instruction are “pseudo” instruction boundaries; the boundary between the translations of movw and call is a guest instruction boundary.
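The log-and-rollback idea for precise exceptions can be sketched as an undo log that is committed at each guest instruction boundary. This is a generic illustration; `RollbackLog` is a hypothetical name and guest-visible state is modelled as plain 32-bit words.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Undo log for guest-visible state changed by a multi-instruction translation.
class RollbackLog {
public:
    // Perform a guest-visible write, remembering the old value first.
    void Write(std::uint32_t* location, std::uint32_t new_value) {
        undo_.push_back(Entry{location, *location});
        *location = new_value;
    }
    // A guest instruction boundary was reached: the changes become permanent.
    void Commit() { undo_.clear(); }
    // An exception hit at a "pseudo" boundary: restore state in reverse order
    // so the guest sees the state of the most recent guest instruction boundary.
    void Rollback() {
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) {
            *it->location = it->old_value;
        }
        undo_.clear();
    }
private:
    struct Entry {
        std::uint32_t* location;
        std::uint32_t old_value;
    };
    std::vector<Entry> undo_;
};
```

For the translated `call proc` above, a fault after the push but before the jump would roll the stack write back, so the guest observes the state from before the call.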
Coherence: Handling Self-Modifying Code
Self-modifying code problem
– Content of a code page may change after Pin has cached that page
Write-monitoring Solution
– Standard page-table trick
Bookkeeping
– Maintain a reverse page-mapping table
  • i.e., a PA -> VA mapping table
– Upon bringing in code from given physical page:
  • Write-protect all virtual pages that ever map into this physical page
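The bookkeeping can be sketched with a reverse map from physical pages to the virtual pages that alias them. `WriteMonitor` is a hypothetical name; the sketch only tracks which pages would be write-protected and does not touch real page tables. Protecting aliases that appear after the code was cached is this sketch's reading of "ever map".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using Addr = std::uint64_t;

class WriteMonitor {
public:
    // Record every VA -> PA mapping observed, building the reverse (PA -> VA) table.
    void NoteMapping(Addr vpage, Addr ppage) {
        reverse_[ppage].insert(vpage);
        // A new alias of a page that already holds cached code is protected too.
        if (code_pages_.count(ppage) != 0) protected_.insert(vpage);
    }
    // Code was brought into the code cache from ppage: protect all known aliases.
    void OnCodeCached(Addr ppage) {
        code_pages_.insert(ppage);
        for (Addr vpage : reverse_[ppage]) protected_.insert(vpage);
    }
    bool IsWriteProtected(Addr vpage) const { return protected_.count(vpage) != 0; }
private:
    std::map<Addr, std::set<Addr>> reverse_;  // PA -> all VAs mapping to it
    std::set<Addr> code_pages_;               // physical pages with cached code
    std::set<Addr> protected_;                // virtual pages to write-protect
};
```

A guest write to any protected virtual page would then fault, giving the engine the chance to discard the stale translations.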
Experiment Setup
Environment:
– Xen 3.0.2 running on Intel VT-enabled machines
– Guest domain installed with Fedora Core 4 Linux
Benchmarks:
– Fedora Core 4 Linux boot
– Apache-bench (web server)
– Mysql-test (database server)
Pintools:
– Insmix
  • Code profiler that collects basic-block and instruction mix info
– CMP$im
  • Cache simulator that models a multi-level cache hierarchy
  • Results in paper
Distribution of Kernel and User-level Instructions
Basic Block Count Results
Top 5 hottest kernel-level basic blocks of mysql-test-alter-table

Bbl Addr     Bbl Symbol Name               Count      Num-Ins   Ins % Contribution
0xc0111a40   delay_pit + 0x1a              93531291   2         1.17%
0xc8aac20b   ext3_do_update_inode + 0x82   10177398   6         0.38%
0xc011d58f   __might_sleep + 0x2a          5170776    5         0.16%
0xc011d57f   __might_sleep + 0x1a          5170776    4         0.13%
0xc011d565   __might_sleep                 5170776    10        0.32%
Insmix Results
Privileged-instruction counts for fc4-boot and mysql-test-alter-table. The two counts in each row are run together in the source and are reproduced here unsplit.

Privileged Instruction   fc4-boot, mysql-test-alter-table
CLI                      28069918912950
STI                      8459212217286
IRETD                    574204599646
OUT                      57181551209
OUTSW                    9824990104
IN                       31994311762
INSW                     48403240
HLT                      4458207
INVLPG                   54923619
CLTS                     801350
RDTSC                    17777043
LGDT                     20
LTR                      10
INVD                     00
WBINVD                   00
RDMSR                    150
RDPMC                    00
LMSW                     00
MOV CR                   NANA
LIDT                     20
LLDT                     20
WRMSR                    00
MOV DR                   NANA
Performance of PinOS
Related Work I
Dynamic Optimization
– Dynamo [2000], DynamoRIO [2003]
– Mojo [2000]
Software Dynamic Translation
– Strata [2003]
Dynamic Binary Analysis and Instrumentation
– Shade [1994] - SPARC & MIPS
– Walkabout [2002], Valgrind [2004]
– Pin [2005], HDTrans [2006]
Probe-based Dynamic Binary Instrumentation
– KernInst [1999], DynInst [2000], LTT [2000], DProbes [2001], KProbes [2004]
– DTrace [2004], SystemTap [2005]
Related Work II
Full Machine Simulation/Emulation
– Embra (SimOS) [1996] - MIPS
– Simics [2002]
– Bochs [2002], QEMU [2005]
Para-Virtualization
– Denali [2002], Xen [2003]
Full Virtualization
– VMware [2002]
Hardware-assisted Virtualization
– Intel Virtualization Technology (VT) [2006]
– AMD Pacifica Technology [2006]
Future Work
Make PinOS capable of instrumenting Windows
PinOS Infrastructure Support
– 64-bit support (x86_64)
– Multi-Processor support (MP)
Now that we have this powerful infrastructure, let’s write Pintools!
– Interesting Pintools include debuggers, profilers, tracing tools, etc.
Plan to release to public
– Interesting users and uses may demand further enhancements
Acknowledgments
Thanks to the entire Pin team
– For giving us a robust Pin to start with
Thanks to:
– Mark Charney
  • For helping us better understand XED
  • For fixing XED issues (only a few) very promptly
– Greg Lueck
  • For many helpful discussions, esp. about signals
  • For fixing related bugs in mainline Pin
– Prof. Jonathan Shapiro and Swaroop Sridhar
  • For collaboration on initial ideas about segmentation virtualization
Thank You!
Questions?
Backup Slides…
Virtualization of System-Level State
Segmentation Support
• Segment Registers
• GDT/LDT
Paging Support
• CR3 (PDBR)
• Page-table structures
Interrupt/Exception Delivery
• IDT
Task support
• TR
EFLAGS
• Including privileged bits like IF
Review of IA-32 Memory Management
Review of segment addressing
[Figure: each of the Segment Registers (CS, DS, SS, ES, FS, GS) holds a segment selector that refers to a segment descriptor in the LDT or the GDT; each table holds up to 8K entries.]
Courtesy: Gregory Lueck
Review of segment addressing
Segment Selector
– index
– Table indicator: 0 – GDT, 1 – LDT
– Privilege info
Segment Descriptor
– base address
– limit
– other
Courtesy: Gregory Lueck
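The selector fields listed on this slide follow the IA-32 layout: bits 15..3 hold the table index, bit 2 is the table indicator, and bits 1..0 carry the requested privilege level. A minimal decoding sketch (`SelectorFields` and `DecodeSelector` are names chosen here for illustration):

```cpp
#include <cassert>
#include <cstdint>

struct SelectorFields {
    std::uint16_t index;  // bits 15..3: entry number in the GDT or LDT
    bool uses_ldt;        // bit 2: table indicator (0 = GDT, 1 = LDT)
    std::uint8_t rpl;     // bits 1..0: requested privilege level
};

constexpr SelectorFields DecodeSelector(std::uint16_t selector) {
    return SelectorFields{static_cast<std::uint16_t>(selector >> 3),
                          (selector & 0x4) != 0,
                          static_cast<std::uint8_t>(selector & 0x3)};
}
```

For example, selector 0x10 decodes to index 2 in the GDT with privilege level 0, and 0x0F decodes to index 1 in the LDT with privilege level 3.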
Review of segment addressing
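[Figure: for mov %fs:0x10, %eax, the FS selector (index, table indicator 1) selects a segment descriptor (base address, limit, other) in the LDT; the descriptor's base address is added to the offset to form the effective address.]
Courtesy: Gregory Lueck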
Hidden Part of Segment Register
Segment register layout:
– visible part: index, GDT/LDT
– hidden part: base, limit, acc. rights
Hidden part “cached” from LDT / GDT
Might be out-of-sync, software depends on this!
Saving segment register writes only visible part to memory
Restoring reads hidden part from GDT / LDT
Asymmetry: save / restore may change contents!
Courtesy: Gregory Lueck
Irreversible Segmentation Problem
[Figure: three snapshots of the GDT and the DS register, with the Instrumentation Engine performing Save DS / Restore DS between the last two.]
1. Initially: GDT[0x10] = A. DS: Selector: 0x10, Desc. Cache: A
2. Guest writes B into GDT[0x10]. DS: Selector: 0x10, Desc. Cache: A
3. Gratuitous load performed by instrumentation system (Save DS, Restore DS). DS: Selector: 0x10, Desc. Cache: B
Wrong! Should still be A, as the guest has not yet explicitly performed a load into DS!
Segmentation Virtualization
Key Insight: Just virtualize hardware descriptor caches
– Don’t virtualize segmentation tables GDT/LDT at all!
As and when guest explicitly loads hardware registers:
– Copy guest segment descriptors into corresponding caches
– Issue hardware register load instructions with modified selector
– Use dynamic translation for doing this
[Figure: the guest issues mov 0x10 -> ds against the Guest GDT/LDT (entry 0x10). PinOS copies the descriptor into the DS Desc. Cache, one of the PinOS stolen entries (CS, DS, ES, FS, GS, SS, LDTR, and TR Desc. Caches) in the PinOS GDT active on H/W. mov 0x2 -> ds is issued on hardware, and the Emulated DS Register is updated with 0x10.]
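The scheme on this slide can be modelled in a few lines: guest loads copy the descriptor into a per-register cache slot, saves return the emulated selector, and engine-initiated reloads read only the cached copy. Everything below is a hypothetical model (`SegmentVirtualizer`, the slot numbering, the two-field `Descriptor`), not PinOS code; the slot numbers are chosen only so that DS lands on 0x2, as in the slide's figure.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Reduced descriptor; a real one also carries access rights.
struct Descriptor {
    std::uint32_t base;
    std::uint32_t limit;
};

enum SegReg { kCS, kDS, kES, kFS, kGS, kSS, kNumSegRegs };

class SegmentVirtualizer {
public:
    // The guest's own table. It is never installed on hardware in this scheme.
    std::map<std::uint16_t, Descriptor> guest_table;

    // Emulates a guest "mov selector -> segreg": copy the guest descriptor into
    // the register's cache slot, remember the guest's selector, and return the
    // modified selector that would be loaded on hardware instead.
    std::uint16_t GuestLoad(SegReg reg, std::uint16_t guest_selector) {
        cache_[reg] = guest_table.at(guest_selector);
        emulated_selector_[reg] = guest_selector;
        return HardwareSelector(reg);
    }
    // Guest reads of the register see the selector the guest loaded.
    std::uint16_t GuestStore(SegReg reg) const { return emulated_selector_[reg]; }

    // A gratuitous save/restore by the engine reloads from the cache slot,
    // so later guest writes to its table cannot leak in.
    Descriptor EngineReload(SegReg reg) const { return cache_[reg]; }

    static std::uint16_t HardwareSelector(SegReg reg) {
        return static_cast<std::uint16_t>(kFirstStolenSlot + reg);
    }
private:
    static constexpr int kFirstStolenSlot = 1;  // illustrative numbering only
    Descriptor cache_[kNumSegRegs] = {};
    std::uint16_t emulated_selector_[kNumSegRegs] = {};
};
```

Because the reload reads the cache slot rather than the guest table, the guest's write of B into entry 0x10 stays invisible until the guest itself reloads DS.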
Irreversible Segmentation Problem Solved
[Figure: the same three snapshots under the PinOS scheme, with the Instrumentation Engine performing Save DS / Restore DS between the last two.]
1. Initially: Guest GDT[0x10] = A; H/W GDT[0x2] = A. DS: Selector: 0x2, Desc. Cache: A. Emulated DS: Selector: 0x10
2. Guest writes B into GDT[0x10]; H/W GDT[0x2] = A. DS: Selector: 0x2, Desc. Cache: A. Emulated DS: Selector: 0x10
3. Gratuitous load performed by instrumentation system (Save DS, Restore DS); H/W GDT[0x2] = A. DS: Selector: 0x2, Desc. Cache: A. Emulated DS: Selector: 0x10
Correct!
Implications of Virtualization Scheme
Gratuitous loads now performed with cached descriptors
– Ensures preservation of guest-expected hardware semantics
Allows PinOS to easily steal rest of table for own descriptors
With this scheme, no need for tracking guest table writes!
However, need to tame/emulate all segmentation instructions
– lds/es/fs/gs/ss
– mov ds/es/fs/gs/ss, […]
– mov […], ds/es/fs/gs/ss
– pop ds/es/fs/gs/ss
– push ds/es/fs/gs/ss
– lgdt, sgdt
– lldt, sldt
– lar, lsl, verr, verw
– ltr, str, task gate transfer through interrupt
– Far jumps, calls and returns, iret, sysenter and sysexit
– Software interrupt: int n, into, int 3
– Hardware interrupt / exception