Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)
Umbra: Efficient and Scalable Memory Shadowing
CGO 2010, Toronto, Canada, April 26, 2010
Shadow Memory
• Meta-data
  – Track properties of application memory
• Synchronized Update
  – Application data and meta-data
CGO, Toronto, Canada, 4/26/2010 2
[Figure: application memory regions (a.out, heap, libc, stack), each paired with a corresponding shadow memory region]
Examples
• Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
• Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
• Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
• Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]
Issues
• Performance
  – Runtime overhead
    • Example: MemCheck 25x [VEE’07]
• Scalability
  – 64-bit architecture
• Dependence
  – OS
  – Hardware
• Development
  – Implemented with specific analysis
  – Lack of a general framework
Memory Shadowing System
• Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
• Memory Management
  – Memory allocation / free
    • Monitor application memory management
    • Manage shadow memory
  – Mapping translation scheme (addrA → addrS)
    • DMS: Direct Mapping Scheme
    • SMS: Segmented Mapping Scheme
Direct Mapping Scheme (DMS)
• Single memory region for entire address space.
• Translation: addrS = addrA + disp
    lea [addr] → %r1
    add %r1, disp → %r1
• Issue: address conflict between memA and memS
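The DMS translation is a single addition, which a minimal C sketch makes concrete. The displacement value `DMS_DISP` and the name `dms_translate` are illustrative assumptions, not part of the original slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical displacement between application and shadow memory. */
#define DMS_DISP ((uintptr_t)0x20000000u)

/* addrS = addrA + disp: one addition, no table lookup. */
static inline uintptr_t dms_translate(uintptr_t addr_a)
{
    /* DMS needs addr_a + disp to stay clear of application memory;
       this sketch only guards against arithmetic wrap-around. */
    assert(addr_a <= UINTPTR_MAX - DMS_DISP);
    return addr_a + DMS_DISP;
}
```

Because every address shares one displacement, the translation is cheap, but the whole shadow region must fit somewhere the application never uses.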
[Figure: slowdown relative to native execution. DMS-32: 1.80, SMS-32: 2.40, SMS-64: 4.67; no value shown for DMS-64]
[Figure: application memory and shadow memory regions under DMS]
Segmented Mapping Scheme (SMS)
• Shadow segment per application segment
• Translation: addrS = addrA + disp_seg
  – Segment lookup (address indexing)
  – Address translation
    lea [addr] → %r1
    mov %r1 → %r2
    shr %r2, 16 → %r2
    add %r1, disp[%r2] → %r1
[Figure: segment table mapping application segments (App 1, App 2) to shadow segments (Shd 1, Shd 2); addrA is translated to addrS]
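The two SMS steps (segment lookup, then address translation) can be sketched in C as follows. The table name `seg_disp` and the 32-bit address space are assumptions for illustration; the 16-bit shift mirrors the `shr` in the instruction sequence on this slide.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT 16            /* segment index = addr >> 16 */
#define NUM_SEGS  (1u << 16)    /* 2^32 / 2^16 entries for a 32-bit space */

/* One displacement per application segment (hypothetical layout).
   Unsigned arithmetic wraps, so a "negative" displacement can be
   stored as its two's complement value. */
static uint32_t seg_disp[NUM_SEGS];

static inline uint32_t sms_translate(uint32_t addr_a)
{
    uint32_t seg = addr_a >> SEG_SHIFT;   /* segment lookup */
    assert(seg < NUM_SEGS);
    return addr_a + seg_disp[seg];        /* address translation */
}
```

The same indexing on a 48-bit user address space would need 2^48 / 2^16 = 2^32 entries, which is why a single-level table stops being practical on 64-bit systems.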
Umbra
• Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
[Figure: 64-bit address space layout with labels a.out, stack, user space (up to 2^47), unusable space, kernel space, and vsyscall (up to 2^64)]
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)
  – Multi-Level SMS
    • Even more expensive
    • Fast path on lower 32G (MemCheck)
[Figure: slowdown relative to native execution. DMS-32: 1.80, SMS-32: 2.40, SMS-64: 4.67; no value shown for DMS-64]
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
• Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
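One way to picture the Umbra solution is a table with one entry per allocated region that is walked linearly. The C sketch below is an illustration under stated assumptions: `map_entry_t` and `table_translate` are hypothetical names, and the real table layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per allocated application region (hypothetical layout). */
typedef struct {
    uintptr_t app_base;   /* start of the application region */
    uintptr_t app_end;    /* one past the last byte */
    intptr_t  disp;       /* shadow address = app address + disp */
} map_entry_t;

/* Walk the compact table; return 1 and set *addr_s on a hit, else 0. */
static int table_translate(const map_entry_t *tab, size_t n,
                           uintptr_t addr_a, uintptr_t *addr_s)
{
    assert(tab != NULL && addr_s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr_a >= tab[i].app_base && addr_a < tab[i].app_end) {
            *addr_s = addr_a + (uintptr_t)tab[i].disp;
            return 1;
        }
    }
    return 0;   /* no shadow memory registered for this address */
}
```

The table grows with the number of regions the application actually maps, not with the size of the address space, at the price of a walk on every lookup.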
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection
Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore
[Figure: application segments (App 1, App 2) and their shadow segments (Shd 1, Shd 2)]
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection
Unoptimized System
• Small overhead from DynamoRIO
• Slower than SMS-64
  – Need to walk the global translation table
• Why so slow?
  – 41.79% instructions are memory references
  – For each of these instructions
    • Full context switch
    • Table lookup
    • Call-out instrumentation
[Figure: global translation table]
[Figure: slowdown relative to native execution (y-axis 0 to 20; the Unoptimized bar is clipped and labeled ~100)]
Configuration                    Slowdown
SMS-64                           4.7
DynamoRIO                        1.1
Unoptimized                      100.0
Thread-local translation cache   15.8
Hashtable lookup                 15.2
Memoization mini-cache           12.0
Reference uni-cache              8.3
Context switch reduction         3.1
Reference grouping               2.5
Optimization
• Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
• Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout
[Figure: global translation table]
[Figure: slowdown relative to native execution for each optimization step]
1. Thread-Local Translation Cache
• Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find matching entry
    • Walk global table if not found in thread-local cache
• Inlined instrumentation
[Figure: Thread 1 and Thread 2, each with a thread-local translation cache backed by the global translation table]
[Figure: slowdown relative to native execution; ~100 unoptimized, 15.8 with the thread-local translation cache]
2. Hashtable Lookup
• Hashtable per thread
• Fixed number of slots
• Hash(addrA) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache
[Figure: Thread 1 and Thread 2, each with a hashtable in front of its thread-local translation cache, backed by the global translation table]
[Figure: slowdown relative to native execution; 15.8 before, 15.2 with hashtable lookup]
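A minimal C sketch of a per-thread hashtable placed in front of a thread-local translation cache is given below. The slot count, the hash function, the unit size, and all type and function names are assumptions made for illustration; the slides only state that the table has a fixed number of slots and falls back to walking the local cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16    /* assumed granularity used for hashing */
#define HASH_SLOTS 64    /* fixed number of slots (assumed value) */

typedef struct {
    uintptr_t app_base, app_end;   /* application range [base, end) */
    intptr_t  disp;                /* shadow = app + disp */
} map_entry_t;

typedef struct {
    map_entry_t        local[16];          /* thread-local translation cache */
    size_t             n_local;
    const map_entry_t *slot[HASH_SLOTS];   /* hashtable into local[] */
} thread_maps_t;

static int in_entry(const map_entry_t *e, uintptr_t a)
{
    return e != NULL && a >= e->app_base && a < e->app_end;
}

/* Returns the matching entry, or NULL if the global table must be walked. */
static const map_entry_t *lookup(thread_maps_t *t, uintptr_t addr_a)
{
    assert(t != NULL);
    size_t h = (addr_a >> UNIT_SHIFT) % HASH_SLOTS;
    if (in_entry(t->slot[h], addr_a))
        return t->slot[h];                      /* hashtable hit */
    for (size_t i = 0; i < t->n_local; i++) {   /* walk the local cache */
        if (in_entry(&t->local[i], addr_a)) {
            t->slot[h] = &t->local[i];          /* remember for next time */
            return &t->local[i];
        }
    }
    return NULL;
}
```

Since both structures are private to one thread, neither the hit path nor the fallback walk needs a lock.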
3. Memoization Mini-Cache
• Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Units found in last table lookup
• If no match, hashtable lookup
  – 68.93% hit ratio
[Figure: Thread 1 and Thread 2, each with a memoization mini-cache, a hashtable, and a thread-local translation cache, backed by the global translation table]
[Figure: slowdown relative to native execution; 15.2 before, 12.0 with the memoization mini-cache]
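The four-entry mini-cache can be sketched in C as a tiny array that is checked before any larger structure. The entry layout and the names `mini_cache_t`, `mini_cache_translate`, and `mini_cache_update` are hypothetical; only the four roles (stack, heap, application, last lookup result) come from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t app_base, app_end;   /* application range [base, end) */
    intptr_t  disp;                /* shadow = app + disp */
} map_entry_t;

/* Four entries: stack, heap, application image, last lookup result. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };

typedef struct { map_entry_t e[MC_SIZE]; } mini_cache_t;

/* 1 on hit; 0 means the caller falls back to the hashtable lookup. */
static int mini_cache_translate(const mini_cache_t *mc, uintptr_t addr_a,
                                uintptr_t *addr_s)
{
    assert(mc != NULL && addr_s != NULL);
    for (int i = 0; i < MC_SIZE; i++) {
        if (addr_a >= mc->e[i].app_base && addr_a < mc->e[i].app_end) {
            *addr_s = addr_a + (uintptr_t)mc->e[i].disp;
            return 1;
        }
    }
    return 0;
}

/* After a slower lookup succeeds, memoize the unit it found. */
static void mini_cache_update(mini_cache_t *mc, const map_entry_t *found)
{
    mc->e[MC_LAST] = *found;
}
```

A zero-initialized entry has an empty range, so unused slots never match.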
4. Reference Uni-Cache
• Software uni-cache per instr per thread
  – Last reference unit tag
  – Last translation displacement
• If no match, memoization mini-cache check
  – 99.93% hit ratio
[Figure: Thread 1 and Thread 2, each with reference uni-caches in front of the memoization mini-cache, hashtable, and thread-local translation cache, backed by the global translation table. Example instructions carrying a reference uni-cache:]
    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX
[Figure: slowdown relative to native execution; 12.0 before, 8.3 with the reference uni-cache]
5. Context Switch Reduction
• Register liveness analysis
  – Use dead register
  – Avoid flags save/restore
[Figure: Thread 1 and Thread 2, each with reference uni-caches, a memoization mini-cache, a hashtable, and a thread-local translation cache, backed by the global translation table. Example instructions:]
    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX
[Figure: slowdown relative to native execution; 8.3 before, 3.1 with context switch reduction]
#/#Instr            SPEC2006
Memory Reference    41.79%
Eflag Steal         2.55%
Register Steal      8.20%
6. Reference Grouping
• One reference cache for multiple references
  – Stack local variables
  – Different members of the same object
[Figure: Thread 1 and Thread 2, each with reference uni-caches, a memoization mini-cache, a hashtable, and a thread-local translation cache, backed by the global translation table. Example instructions:]
    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX
[Figure: slowdown relative to native execution; 3.1 before, 2.5 with reference grouping]
#/#Instr               SPEC2006
Memory Reference       41.79%
Ref Uni-Cache Checks   22.76%
3-stage Code Layout
• Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
• Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
• Callout (C function)
  – C function with complete context switch
[Figure: 3-stage code layout. Stages, in order: inline stub, lean procedure, callout. Checks, in order: uni-cache check, memoization check, hashtable lookup, local cache lookup, then c_function() performing the global table lookup between two full context switches, followed by the app instruction]
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection
Client API
Event Hooks            Description
client_init            Process initialization
client_exit            Process exit
client_thread_init     Thread initialization
client_thread_exit     Thread exit
shadow_memory_create   Shadow memory creation
shadow_memory_delete   Shadow memory deletion
instrument_update      Insert meta-data update code
Umbra Client: Shared Memory Detection
• Meta-data maintains a bitmap recording which threads have accessed the associated memory
    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info, mem_ref_t *ref,
                      instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map -> [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }
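The effect of the inserted `lock or` can be shown at the C level with an atomic OR on a shadow word. This is a sketch under assumptions: one bit per thread in a 32-bit word, and the hypothetical helper names `record_access` and `is_shared`; it does not use the Umbra or DynamoRIO APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit shadow word per tracked unit; bit i set means thread i
   touched the unit. The bit-per-thread assignment is an assumption. */
static inline void record_access(_Atomic uint32_t *shadow, unsigned thread_idx)
{
    assert(thread_idx < 32);
    /* Atomic read-modify-write, comparable to "lock or" on x86. */
    atomic_fetch_or(shadow, (uint32_t)1 << thread_idx);
}

/* The unit is shared once more than one thread bit is set. */
static inline int is_shared(_Atomic uint32_t *shadow)
{
    uint32_t m = atomic_load(shadow);
    return (m & (m - 1)) != 0;
}
```

Repeated accesses by the same thread leave the word unchanged, so the bitmap only changes when a new thread touches the unit.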
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API √
• Experimental Result
  – Performance evaluation
  – Statistics collection
Performance Evaluation
[Figure: slowdown relative to native execution (y-axis 0.0 to 5.0)]
Configuration   Slowdown
DMS-32          1.80
SMS-32          2.40
SMS-64          4.67
Umbra-64        2.49
EMS64: Efficient Memory Shadowing for 64-bit
• Translation
  – addrS = addrA + disp_rc
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
    • Reg steal; save flags; compare & jump; restore
• EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp
EMS64 Preliminary Result
[Figure: slowdown relative to native execution (y-axis 0.0 to 5.0)]
Configuration   Slowdown
DMS-32          1.80
SMS-32          2.40
SMS-64          4.67
Umbra-64        2.49
EMS-64          1.81
Thanks
• Download
  – http://people.csail.mit.edu/qin_zhao/umbra/
• Q & A