
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU

Date post: 24-Jan-2017
Upload: linaro
Page 1: SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU

Towards multi-threaded TCG

Alex Bennée
Linaro Connect SFO15


Introduction


Hello!

Alex Bennée
Works for Linaro
IRC: stsquad/ajb-linaro
Mostly ARM emulation, a little KVM on the side
Uses Emacs


What is multi-threaded TCG?


TCG?

Tiny Code Generator
Running non-native code on your desktop


Current process model


How it looks


Multi-threaded TCG


Reality?


Why do we want it?


Living in a Multi-core world


Raspberry Pi 2

Quad-core Cortex A7 @ 900MHz

$25


Dragonboard 410c

Quad-core Cortex A53 @ 1.4GHz

$75


Nexus 5

Quad-core Krait 400 @ 2.26GHz

$339


My Desktop

Intel i7 (4 cores + 4 hyperthreads) @ 3.4GHz

$600


Build Server

2 x Intel Xeon (6+6 hyperthreads) @ 3.46GHz

$2-3k


Android Emulation

Android emulator uses QEMU as base
Most modern Android devices are multi-core


Per-core performance

via @HenkPoly


Other reasons to care


Using QEMU for System bring up

Increasingly used for prototyping
    new multi-core systems
    new heterogeneous systems
Want concurrent behaviour
Bad software should fail in QEMU!


As a development tool

Instrumentation and inspection
Record and playback
Reverse debugging


Cross Tooling


Building often complex

http://lukeluo.blogspot.co.uk/2014/01/linux-from-scratch-for-cubietruck-c4.html


Just use qemu-linux-user?

Make sure binfmt_misc is set up
Mess around with multilib/chroots
Hope threads/signals are not used


Or boot a multi-core system


Things in our way

Global State in QEMU
Guest Memory Models


Global State

Numerous globals in TCG generation
TCG Runtime Structures
Device emulation structures


Guest Memory models

Atomic behaviour
LL/SC Semantics
Memory barriers


How can we do it?


3 broad approaches


Use threads/locks


Use processes/IPC

http://ipads.se.sjtu.edu.cn/_media/publications/coremu-ppopp11.pdf


Re-write from scratch


Pros/Cons of each approach

Approach   Threads/Locks            Process/IPC            Re-write
Pros       Performance              Correctness            Shiny and New!
Cons       Performance, Complexity  Performance, Invasive  Wasted Legacy, New problems


What we have done

Protected code generation
Serialised the run loop
    translated code multi-threaded
New memory semantics
Multi-threaded device emulation


Things in our way

Global State in QEMU
Guest Memory Models


Code generator globals

[Diagram: two threads, vCPU 1 and vCPU 2, each write and then read the shared TCG variable cpu_V0]


TCG Runtime structures

SoftMMU TLB
Translation Buffer Jump Cache
Condition Variables (tcg_halt_cond)
Flags (exit_request)


per-CPU variables

tcg_halt_cond -> cpu->halt_cond
exit_request -> cpu->exit_request


Quick reminder of how TCG works


Code Generation

target machine code
intermediate form (TCG ops)
generate host binary code


Input Code

    ldr r2, [r3]
    add r2, r2, #1
    str r2, [r3]
    bx lr


TCG Ops

    mov_i32 tmp5,r3
    qemu_ld_i32 tmp6,tmp5,leul,3
    mov_i32 r2,tmp6

    movi_i32 tmp5,$0x1
    mov_i32 tmp6,r2
    add_i32 tmp6,tmp6,tmp5
    mov_i32 r2,tmp6

    mov_i32 tmp5,r3
    mov_i32 tmp6,r2
    qemu_st_i32 tmp6,tmp5,leul,3

    exit_tb $0x7ff368a0baab
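
To make the intermediate form concrete, here is a minimal Python sketch that interprets the ops listed on this slide. It is a toy model, not QEMU's implementation: the run_ops helper and the tuple encoding are hypothetical, the memory-op flags (leul,3) and exit_tb are left out, and guest memory is a plain dict.

```python
# Toy interpreter for the TCG ops shown above. Illustrative only: run_ops and
# the tuple encoding are assumptions made for this sketch, not QEMU data
# structures, and the "leul,3" memory-op arguments are ignored.

def run_ops(ops, regs, memory):
    """Execute (opcode, *args) tuples against guest registers and memory."""
    temps = {}

    def read(name):
        # A value lives either in a guest register or in a TCG temporary.
        return regs[name] if name in regs else temps[name]

    def write(name, value):
        value &= 0xFFFFFFFF  # every op in this listing is 32-bit
        if name in regs:
            regs[name] = value
        else:
            temps[name] = value

    for op, *args in ops:
        if op == "mov_i32":
            write(args[0], read(args[1]))
        elif op == "movi_i32":
            write(args[0], args[1])
        elif op == "add_i32":
            write(args[0], read(args[1]) + read(args[2]))
        elif op == "qemu_ld_i32":
            write(args[0], memory[read(args[1])])
        elif op == "qemu_st_i32":
            memory[read(args[1])] = read(args[0])
        else:
            raise ValueError(f"unknown op {op}")
    return regs, memory


# ldr r2, [r3] ; add r2, r2, #1 ; str r2, [r3]
INCREMENT_OPS = [
    ("mov_i32", "tmp5", "r3"),
    ("qemu_ld_i32", "tmp6", "tmp5"),
    ("mov_i32", "r2", "tmp6"),
    ("movi_i32", "tmp5", 0x1),
    ("mov_i32", "tmp6", "r2"),
    ("add_i32", "tmp6", "tmp6", "tmp5"),
    ("mov_i32", "r2", "tmp6"),
    ("mov_i32", "tmp5", "r3"),
    ("mov_i32", "tmp6", "r2"),
    ("qemu_st_i32", "tmp6", "tmp5"),
]

regs, memory = run_ops(INCREMENT_OPS, {"r2": 0, "r3": 0x1000}, {0x1000: 41})
```

The load, add and store are three separate steps, which is why the same sequence is not atomic once several vCPUs run at the same time.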


Output Code

    mov (%rsi),%ebp
    inc %ebp
    mov %ebp,(%rsi)


Basic Block


Block Chaining

[Diagram: four translated blocks, each with a prologue, code, exit 1 and exit 2, linked to one another through their exits]
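
The idea behind chaining can be sketched in a few lines of Python: the first time a block exit is taken, control returns to the run loop, which finds or translates the next block and patches the exit so later executions jump straight there. The names below (TranslationBlock, run, translate) are hypothetical and the model is single-threaded; it only illustrates why patching exits is shared state that needs protecting.

```python
# Illustrative model of translation-block lookup and chaining. All names here
# are assumptions for the sketch, not QEMU's structures.

class TranslationBlock:
    def __init__(self, pc, next_pcs):
        self.pc = pc
        self.next_pcs = next_pcs      # guest PCs reached via exit 1 / exit 2
        self.chained = [None, None]   # direct links, patched in on first use


def run(start_pc, translate, max_blocks):
    tb_cache = {}     # guest PC -> TranslationBlock
    lookups = 0       # how often we had to come back to the run loop
    trace = []
    tb, pc = None, start_pc
    prev, prev_exit = None, None
    for _ in range(max_blocks):
        if tb is None:
            # Back in the run loop: find or translate the block for this PC.
            lookups += 1
            tb = tb_cache.get(pc)
            if tb is None:
                tb = tb_cache[pc] = translate(pc)
            if prev is not None:
                prev.chained[prev_exit] = tb   # patch the exit we came from
        trace.append(tb.pc)
        exit_idx = 0                           # this toy always takes exit 1
        prev, prev_exit = tb, exit_idx
        pc = tb.next_pcs[exit_idx]
        tb = tb.chained[exit_idx]              # None means "return to run loop"
    return trace, lookups


def translate(pc):
    # Three blocks at 0, 4 and 8 that loop back to the start.
    return TranslationBlock(pc, [(pc + 4) % 12, 0])


trace, lookups = run(0, translate, max_blocks=9)
```

In this run only the first pass around the loop goes through the run loop; once every exit has been patched the blocks execute back to back.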


TCG Global State

Code generation globals
Global runtime


Translated code is safe

Only accesses vCPU structures
We need to be careful when leaving the translated code


Exit Destinations

Back to Run Loop
Helper Function


Exit to run loop

[Diagram: enter JIT code, run through chained blocks (prologue, code, exit 1, exit 2), then return to the run loop]


Simplified Run Loop


Helper Functions

[Diagram labels: cpu_tb_exec; chained blocks (prologue, code, exit 1, exit 2); Return to runloop; QEMU C Code; Complex Op; System Op; vCPU State; Registers; Jump Cache; Global State]


Types of Helper

Complex Operations
    should only touch private vCPU state
    no locking required*
System Operations
    locking for cross-cpu things
    some operations affect all vCPUs


Stop the World!

Using locks
    expensive for frequently read vCPU structures
    complex when modifying multiple vCPUs' data
Ensure relevant vCPUs halted, modify at "leisure"


Deferred Work

Existing queued_work mechanism
    add work to queue
    signal vCPU to exit
New queued_safe_work
    waits for all vCPUs to halt
    no lock held when run
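
The "safe work" idea can be sketched with Python threads. This is only a model under stated assumptions: SafeWorkQueue, vcpu_exit_point and demo are hypothetical names, each vCPU polls for pending work when it leaves translated code, and the last vCPU to halt runs the queued work while every other vCPU is parked and no lock is held.

```python
import threading

# Illustrative model of deferring work until all vCPUs are halted. The class
# and method names are assumptions for this sketch, not QEMU's API.

class SafeWorkQueue:
    def __init__(self, n_vcpus):
        self.n_vcpus = n_vcpus
        self.cond = threading.Condition()
        self.pending = []      # queued safe work items
        self.halted = 0        # vCPUs currently parked
        self.generation = 0    # bumped each time the world is restarted

    def queue_safe_work(self, func):
        with self.cond:
            self.pending.append(func)

    def vcpu_exit_point(self):
        """Called by a vCPU each time it leaves translated code."""
        with self.cond:
            if not self.pending:
                return
            self.halted += 1
            if self.halted < self.n_vcpus:
                # Park until the last vCPU has run the work.
                gen = self.generation
                while self.generation == gen:
                    self.cond.wait()
                return
            work, self.pending = self.pending, []
        # Last vCPU to halt: all the others are parked in cond.wait(), so the
        # work runs with the world stopped and without holding the lock.
        for func in work:
            func()
        with self.cond:
            self.halted = 0
            self.generation += 1
            self.cond.notify_all()


def demo(n_vcpus=3):
    queue = SafeWorkQueue(n_vcpus)
    stop, done = threading.Event(), threading.Event()
    observed = []

    def vcpu():
        while not stop.is_set():
            queue.vcpu_exit_point()   # stands in for "run a block, then exit"

    def work():
        observed.append(queue.halted)  # how many vCPUs were halted
        done.set()

    threads = [threading.Thread(target=vcpu) for _ in range(n_vcpus)]
    for thread in threads:
        thread.start()
    queue.queue_safe_work(work)
    done.wait(timeout=10)
    stop.set()
    for thread in threads:
        thread.join()
    return observed
```

Calling demo(3) returns the number of halted vCPUs seen by the work item, which should equal the number of vCPUs.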


TCG Summary

Move global vars to per-CPU/Thread
    exit and condition variables
Make use of tb_lock
    uses existing TCG context tb_lock
    protects all code generation/patching
    protects all manipulation of tb_jump_cache
Add async safe work mechanism
    Defer tasks until all vCPUs halted


Things in our way

Global State in QEMU
Guest Memory Models


No Atomic TCG Ops


Atomic Behaviour is easy when Single Threaded


Considerably harder when Multi-threaded


Load-link/Store-conditional (LL/SC)

RISC alternative to atomic CAS
Multi-instruction sequence
Store only succeeds if memory has not been touched since the link
LL/SC can emulate other atomic operations
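
As a rough illustration of the last point, the sketch below builds an atomic add out of a load-link/store-conditional retry loop. It is a single-threaded Python model with hypothetical names (Memory, atomic_add, and the interfere hook used to simulate another CPU writing between the link and the conditional store); real hardware monitors differ in granularity and detail.

```python
# Illustrative single-threaded model of LL/SC semantics. Names and the
# per-address link granularity are assumptions made for this sketch.

class Memory:
    def __init__(self):
        self.cells = {}
        self.links = {}   # cpu id -> address it currently has linked

    def load_link(self, cpu, addr):
        self.links[cpu] = addr
        return self.cells.get(addr, 0)

    def store(self, cpu, addr, value):
        # A store breaks every other CPU's link to the same address.
        self.cells[addr] = value
        for other, linked in list(self.links.items()):
            if other != cpu and linked == addr:
                del self.links[other]

    def store_conditional(self, cpu, addr, value):
        if self.links.get(cpu) != addr:
            return False              # link was broken: the store fails
        del self.links[cpu]
        self.store(cpu, addr, value)
        return True


def atomic_add(mem, cpu, addr, delta, interfere=None):
    """Retry LL/SC until the conditional store succeeds."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_link(cpu, addr)
        if interfere is not None:
            interfere(attempts)       # hypothetical hook: another CPU runs here
        if mem.store_conditional(cpu, addr, old + delta):
            return old, attempts


mem = Memory()
mem.store(1, 0x100, 10)


def other_cpu_writes(attempt):
    # CPU 1 sneaks in a write during CPU 0's first attempt only.
    if attempt == 1:
        mem.store(1, 0x100, 20)


result = atomic_add(mem, 0, 0x100, 1, interfere=other_cpu_writes)
```

The first attempt fails because CPU 1 wrote to the linked address, so the loop reloads the new value and succeeds on the second attempt.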


LL/SC in QEMU

Introduce new TCG ops
    qemu_ldlink_i32/64
    qemu_stcond_i32/64
Can be used to emulate
    load/store exclusive
    atomic instructions


SoftMMU


What it does

Maps guest loads/stores to host memory
    uses an addend offset
Fast path in generated code
Slow path in C code
    Victim cache lookup
    Target page table walk
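
The fast-path/slow-path split can be modelled in a few lines of Python. This is a sketch under assumptions: the SoftMMU class, the 1 KiB page size, the 256-entry direct-mapped table and the dict standing in for the guest page tables are all illustrative, and the victim cache is left out.

```python
# Illustrative model of a software TLB with an addend. Sizes and names are
# assumptions for this sketch, not QEMU's actual layout.

PAGE_BITS = 10                       # assumed 1 KiB pages
PAGE_MASK = ~((1 << PAGE_BITS) - 1)
TLB_SIZE = 256                       # assumed number of direct-mapped entries


class SoftMMU:
    def __init__(self, page_table, host_memory):
        self.page_table = page_table   # guest page address -> host page offset
        self.host = host_memory        # bytearray standing in for host RAM
        self.tlb = [None] * TLB_SIZE   # entries are (guest_page, addend)
        self.slow_path_hits = 0

    def load_byte(self, guest_addr):
        index = (guest_addr >> PAGE_BITS) & (TLB_SIZE - 1)
        entry = self.tlb[index]
        if entry is None or entry[0] != (guest_addr & PAGE_MASK):
            entry = self._slow_path_fill(guest_addr, index)
        # Fast path: host address is simply guest address plus the addend.
        return self.host[guest_addr + entry[1]]

    def _slow_path_fill(self, guest_addr, index):
        self.slow_path_hits += 1
        guest_page = guest_addr & PAGE_MASK
        host_page = self.page_table[guest_page]   # stands in for the table walk
        entry = (guest_page, host_page - guest_page)
        self.tlb[index] = entry
        return entry


host = bytearray(2048)
host[1024 + 0x10] = 0xAB
mmu = SoftMMU({0x40000000: 1024}, host)
first = mmu.load_byte(0x40000010)    # misses, fills the TLB via the slow path
second = mmu.load_byte(0x40000011)   # hits the entry filled above
```

Only the first access to the page takes the slow path; the second is served by the cached addend.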


How it works: Stage one


How it works: Stage two


How it works: Stage three


How does this help with LL/SC?

Introduced new TCG ops
    qemu_ldlink_i32/64
    qemu_stcond_i32/64
Using the SoftMMU slow path we can implement the backend in a generic way


LL/SC in Pictures


LL/SC Summary

New TLB_EXCL flag marks page
All access now follows slow-path
    trip exclusive flag
Store conditional always slow-path
    Will fail if flag tripped
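
The summary above can be read as the following Python model. It is a sketch only: ExclusiveModel and its methods are hypothetical, the page size is assumed, only stores are modelled, and the choice to trip a link on a write to the same address (rather than anywhere in the page) is an assumption rather than a statement about the patch set.

```python
# Illustrative model of a page-level exclusive flag that forces stores onto a
# slow path. Names, page size and link granularity are assumptions.

class ExclusiveModel:
    PAGE_BITS = 10   # assumed page size, illustration only

    def __init__(self):
        self.memory = {}
        self.excl_pages = set()   # pages carrying the modelled TLB_EXCL flag
        self.links = {}           # cpu -> address of its untripped link
        self.slow_path_accesses = 0

    def _page(self, addr):
        return addr >> self.PAGE_BITS

    def _trip(self, cpu, addr):
        # A write to addr trips the exclusive flag of every other CPU.
        for other in [c for c, a in self.links.items() if c != cpu and a == addr]:
            del self.links[other]

    def _unflag_idle_pages(self):
        # Pages with no live link can go back to the fast path.
        live = {self._page(a) for a in self.links.values()}
        self.excl_pages &= live

    def ldlink(self, cpu, addr):
        self.excl_pages.add(self._page(addr))
        self.links[cpu] = addr
        return self.memory.get(addr, 0)

    def store(self, cpu, addr, value):
        if self._page(addr) in self.excl_pages:
            self.slow_path_accesses += 1     # flagged page: slow path
            self._trip(cpu, addr)
            self._unflag_idle_pages()
        self.memory[addr] = value

    def stcond(self, cpu, addr, value):
        self.slow_path_accesses += 1         # always takes the slow path
        ok = self.links.pop(cpu, None) == addr
        if ok:
            self._trip(cpu, addr)
            self.memory[addr] = value
        self._unflag_idle_pages()
        return ok


m = ExclusiveModel()
m.store(0, 0x400, 1)                  # page not flagged: fast path
linked_value = m.ldlink(0, 0x400)     # flags the page
m.store(1, 0x400, 5)                  # slow path, trips CPU 0's link
succeeded = m.stcond(0, 0x400, linked_value + 1)
```

Here the conditional store from CPU 0 fails because CPU 1 wrote to the linked address in between, so the value written by CPU 1 is preserved.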


Memory Model Summary

Multi-threading brings a number of challenges
New TCG ops to support atomic-like operations
SoftMMU allows fairly efficient implementation
Memory barriers still an issue.


Device Emulation


KVM has already done it ;-)

added thread safety to a number of systems
introduced memory API
introduced I/O thread


TCG access to device memory

All MMIO pages are flagged in the SoftMMU TLB
The slowpath helper passes the access to the memory API
The memory API defines regions of memory as:
    lockless (the eventual driver worries about concurrency)
    locked with the BQL
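
The two kinds of region can be pictured with a short Python sketch. The names (MemoryRegion, mmio_read, bql) are hypothetical stand-ins; the point is only that a locked region's callback runs under one global lock while a lockless region's callback is called directly and must handle concurrency itself.

```python
import threading

# Illustrative model of dispatching MMIO reads to lockless or BQL-protected
# regions. All names are assumptions for this sketch.

bql = threading.Lock()   # stands in for the Big QEMU Lock


class MemoryRegion:
    def __init__(self, name, read_fn, lockless=False):
        self.name = name
        self.read_fn = read_fn
        self.lockless = lockless


def mmio_read(region, offset):
    if region.lockless:
        # The device model is responsible for its own concurrency control.
        return region.read_fn(offset)
    with bql:
        # Device code for this region only ever runs under the global lock.
        return region.read_fn(offset)


def report_lock_state(offset):
    # Toy device: returns whether the global lock is held during the access.
    return bql.locked()


locked_region = MemoryRegion("uart", report_lock_state)
lockless_region = MemoryRegion("timer", report_lock_state, lockless=True)
held_for_locked = mmio_read(locked_region, 0)
held_for_lockless = mmio_read(lockless_region, 0)
```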


Thanks KVM!


Current state


What's left

LL/SC Patches
MTTCG Patches
Memory Barriers
Enabling all front/back ends
Testing & Documentation


LL/SC Patches

Majority of patch set independent from MTTCG
Been through a number of review cycles
Hope to get merged soonish now tree is open

Who/where?

Alvise Rigo of Virtual Open Systems

Latest branch: slowpath-for-atomic-v4-no-mttcg
https://git.virtualopensystems.com/dev/qemu-mt.git


MTTCG Patches

Clean-up and rationalisation patches
    starting to go into maintainer trees
Delta to full MTTCG reducing

Who/where?

Frederic Konrad of Greensocs

Latest branch: multi_tcg_v7
http://git.greensocs.com/fkonrad/mttcg.git


Emilio's Patches

Recent patch series posted to list
Alternate solutions
    AIE helpers for LL/SC
    Example implementation of barrier semantics


Memory Barriers

Some example code (Emilio's patches)
Use a number of barrier TCG ops
Hard to trigger barrier issues on x86 backend


Enabling all front/back ends

Current testing is ARM32 on x86
Aim to enable MTTCG on all front/backends
Front-ends need to use new TCG ops
Back-ends need to support new TCG ops
    may require incremental updates


Testing & Documentation

Both important for confidence in design
Torture tests
    hand-rolled
    using kvm-unit-tests
Want to have reference in docs/ on how it should work


Questions?


The End

Thank you


Extra Material


Full TLB Walk Diagram


Annotated TLB Walk Code (In)

    0x40000000: e3a00000 mov r0, #0 ; 0x0
    0x40000004: e59f1004 ldr r1, [pc, #4] ; 0x40000010


Annotated TLB Walk Code (Ops)

    ---- prologue
    ld_i32 tmp5,env,$0xfffffffffffffff4
    movi_i32 tmp6,$0x0
    brcond_i32 tmp5,tmp6,ne,$L0

    ---- 0x40000000
    movi_i32 tmp5,$0x0
    mov_i32 r0,tmp5

    ---- 0x40000004
    movi_i32 tmp5,$0x4000000c
    movi_i32 tmp6,$0x4
    add_i32 tmp5,tmp5,tmp6
    qemu_ld_i32 tmp6,tmp5,leul,1
    mov_i32 r1,tmp6


Annotated TLB Walk Code (Opt Op)

OP after optimization and liveness analysis:

    ---- prologue
    ld_i32 tmp5,env,$0xfffffffffffffff4
    movi_i32 tmp6,$0x0
    brcond_i32 tmp5,tmp6,ne,$L0

    ---- 0x40000000
    movi_i32 r0,$0x0

    ---- 0x40000004
    movi_i32 tmp5,$0x40000010
    qemu_ld_i32 tmp6,tmp5,leul,1 (val, addr, index, opc)
    mov_i32 r1,tmp6


Annotated TLB Walk Code (Out Asm)

    ---- prologue
    0x7fffe1ba1000: mov -0xc(%r14),%ebp
    0x7fffe1ba1004: test %ebp,%ebp
    0x7fffe1ba1006: jne 0x7fffe1ba10c9
    ---- 0x40000000
    0x7fffe1ba100c: xor %ebp,%ebp
    0x7fffe1ba100e: mov %ebp,(%r14)
    ---- 0x40000004
    - movi_i32
    0x7fffe1ba1011: mov $0x40000010,%ebp
    - qemu_ld_i32
    0x7fffe1ba1016: mov %rbp,%rdi - r0
    0x7fffe1ba1019: mov %ebp,%esi - r1

    0x7fffe1ba101f: and $0xfffffc03,%esi

    - index into tlb_table[mem_index][0]+target_page
    0x7fffe1ba101b: shr $0x5,%rdi
    0x7fffe1ba1025: and $0x1fe0,%edi

    0x7fffe1ba102b: lea 0x2c18(%r14,%rdi,1),%rdi
    0x7fffe1ba1033: cmp (%rdi),%esi
    0x7fffe1ba1035: mov %ebp,%esi
    0x7fffe1ba1037: jne 0x7fffe1ba111b
    --- offset to "host address"
    0x7fffe1ba103d: add 0x10(%rdi),%rsi
    --- actual load
    0x7fffe1ba1041: mov (%rsi),%ebp


    0x7fffe1ba1041: mov (%rsi),%ebp
    --- mov_i32 r1, tmp6
    0x7fffe1ba1043: mov %ebp,0x4(%r14)

    ----- slow path function call
    0x7fffe1ba111b: mov %r14,%rdi
    0x7fffe1ba111e: mov $0x21,%edx
    0x7fffe1ba1123: lea -0xe7(%rip),%rcx # 0x7fffe1ba1043
    0x7fffe1ba112a: mov $0x555555653980,%r10 # helper_le_ldul_mmu
    0x7fffe1ba1134: callq *%r10
    0x7fffe1ba1137: mov %eax,%ebp
    0x7fffe1ba1139: jmpq 0x7fffe1ba1043
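
The masks in the listing can be checked with a little arithmetic. The Python sketch below reproduces the two computations (function names are hypothetical). Reading the constants, 0x1fe0 equals 255 << 5, which would be consistent with 256 TLB entries of 32 bytes each and a 10-bit page shift, and the 0x3 kept in the compare mask would make an unaligned 4-byte access miss the fast path; these are inferences from the numbers on the slide, not statements about the build that produced it.

```python
# Worked example of the index and compare arithmetic in the assembly above.
# Function names are assumptions for this sketch.

def tlb_entry_offset(guest_addr):
    # shr $0x5,%rdi ; and $0x1fe0,%edi
    return (guest_addr >> 5) & 0x1FE0


def tlb_compare_value(guest_addr):
    # and $0xfffffc03,%esi : page bits plus the two low alignment bits
    return guest_addr & 0xFFFFFC03


aligned = 0x40000010      # the address loaded in the listing
next_page = 0x40000410    # same address one (assumed) 1 KiB page further on
unaligned = 0x40000011

offsets = (tlb_entry_offset(aligned), tlb_entry_offset(next_page))
compares = (tlb_compare_value(aligned), tlb_compare_value(unaligned))
```

For the aligned address the compare value equals the page address 0x40000000, so a filled entry would match; the unaligned address keeps a low bit set and cannot match a page-aligned tag.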


Locking in run loop


SoftMMU Slowpath Reasons

Missing mapping
    first access (fill)
    crossed target page (refill)
Mapping invalidated
Page not dirty
Page is MMIO

