U.S. Army Research, Development and Engineering Command
Cycle-Accurate 8080 Emulation Using an ARM11 Processor
With Dynamic Binary Translation
James Ross
Engility Corporation
David Richie
Brown Deer Technology
UNCLASSIFIED / APPROVED FOR PUBLIC RELEASE
This work was supported by the U.S. Army Research Laboratory (ARL) under the Advanced Computing research project. We expressly thank Song Park (ARL) and Dale Shires (ARL) for technical discussions during this effort.
Fast Cycle-Accurate Emulation
Optimize an Intel 8080 emulator on an ARM-based Raspberry Pi to run the original 1978 Space Invaders ROM binary
Software solutions only, no additional hardware or modifications
Benchmark was replay of recorded game using clock-cycle time-stamped events
Emulated game must arrive at correct final framebuffer
Cycle-accurate emulation made the problem challenging and constrained allowable emulator optimization
More details available here: https://caesr.uwaterloo.ca/memocode/
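The replay of time-stamped events can be pictured with a small sketch. The record layout and the names `InputEvent` and `apply_due_events` are hypothetical; the slides do not describe the contest's actual trace format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record: not the contest's real trace format. */
typedef struct {
    uint64_t cycle;  /* 8080 clock cycle at which the input changes */
    uint8_t  port;   /* emulated input port number */
    uint8_t  value;  /* new value visible on that port */
} InputEvent;

/* Apply every event whose time stamp has been reached.
 * Returns the index of the first event that is still pending. */
size_t apply_due_events(const InputEvent *ev, size_t next, size_t count,
                        uint64_t now, uint8_t ports[256])
{
    while (next < count && ev[next].cycle <= now) {
        ports[ev[next].port] = ev[next].value;
        next++;
    }
    return next;
}
```

Because events are keyed to the 8080 cycle count, the emulator has to track emulated cycles exactly for the replay to reach the same final framebuffer.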
[Figure: 8 KB ROM and emulator]
Intel 8080
Registers/Instructions
8-bit microprocessor
NOP, LXI(R,word), STAX(R), INX(R), INR(R), DCR(R), MVI(R,byte), RLC, DAD(R), LDAX(R), DCX(R), RRC, RAL, RAR, SHLD(word), DAA, LHLD(word), CMA, STA(word), STC, LDA(word), CMC, MOV(R0,R1), HLT
RNZ, POP(R), JNZ(word), JMP(word), CNZ(word), PUSH(R), ADI(byte), RST(pc), RZ, RET, JZ(word), CZ(word), CALL(word), ACI(byte), RNC, JNC(word), OUT(byte), CNC(word), SUI(byte), RC, JC(word), IN(byte), CC(word), SBI(byte)
ADD(R), ADC(R), SUB(R), SBB(R), ANA(R), XRA(R), ORA(R), CMP(R)
RPO, JPO(word), XTHL, CPO(word), ANI(byte), RPE, PCHL, JPE(word), XCHG, CPE(word), XRI(byte), RP, POP_PSW, JP(word), DI, CP(word), PUSH_PSW, ORI(byte), RM, SPHL, JM(word), EI, CM(word), CPI(byte)
256* Instructions (1-3 byte instruction width)
Original Emulator
Original 8080 emulator from http://emulator101.com
C code, indirect threaded dispatch design
Processor state represented in memory accessed as a C struct
[Figure: control flow of the original emulator. Boxes: Load 8080 Binary; Next Interrupt Calc.; Event Input Update; Cycle > Interrupt?; Push PC to Stack; Interrupt Handler; Last Event? Break; Call Emulate Op; Lookup Opcode @ State->PC; Switch-Case Statement using Opcode; Update State (different for each of the 256 instructions); Return # Cycles[Opcode]; Update Cycle Count]
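A minimal sketch of this kind of design follows: a state struct in memory, a switch on the opcode, and a cycle table lookup on return. The struct fields and the name `emulate_one` are illustrative assumptions, not the emulator101 code, and only three opcodes are modelled.

```c
#include <stdint.h>

/* Hypothetical, cut-down processor state. */
typedef struct {
    uint8_t  a, b, c;
    uint16_t pc;
    uint8_t *memory;
} State8080;

/* 8080 cycle counts for the three opcodes modelled below. */
static const uint8_t cycles8080[256] = {
    [0x00] = 4,   /* NOP        */
    [0x01] = 10,  /* LXI B,word */
    [0x3e] = 7,   /* MVI A,byte */
};

/* Emulate a single instruction and return its cycle count.
 * Unmodelled opcodes leave the state unchanged and return 0. */
int emulate_one(State8080 *s)
{
    const uint8_t *op = &s->memory[s->pc];
    switch (op[0]) {
    case 0x00: s->pc += 1; break;
    case 0x01: s->c = op[1]; s->b = op[2]; s->pc += 3; break;
    case 0x3e: s->a = op[1]; s->pc += 2; break;
    default:   break;
    }
    return cycles8080[op[0]];
}
```

Every instruction pays for a function call, a memory-resident state access and a table lookup, which is the overhead the later slides remove step by step.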
General Optimizations
General optimizations applied to existing design produced 2x speedup
Optimizations did not alter the basic design of the original emulator
Emulate8080p() call changed to return only when program complete
Original call emulated single instruction
Cycle check and interrupt service calls moved inside this routine
Combine 8-bit registers into 16-bit registers with union for 16-bit operations
Specifically: B8,C8 → BC16 ; D8,E8 → DE16 ; H8,L8 → HL16
Paired 8-bit memory operations combined into single 16-bit memory operation
Remove cycle count table lookup and insert into instruction code blocks
Use 256-way lookup tables for setting register flags (Z,S,P)
Replace cycle count check with cycle down-counter and check if less than 0
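Three of the optimizations listed above can be sketched in a few lines of C. The type and function names are assumptions made for illustration, and the union layout assumes a little-endian host such as the Raspberry Pi.

```c
#include <stdint.h>

/* Register pair as a union. Assumes a little-endian host, so the low
 * byte (for example C) is stored before the high byte (for example B). */
typedef union {
    struct { uint8_t lo, hi; } r8;
    uint16_t r16;
} RegPair;

/* Bit positions of the flags in the 8080 status byte. */
enum { FLAG_P = 0x04, FLAG_Z = 0x40, FLAG_S = 0x80 };

static uint8_t zsp_table[256];

/* Fill the 256-entry table once, so that each ALU result needs a single
 * load instead of three separate flag computations. */
void init_zsp_table(void)
{
    for (int v = 0; v < 256; v++) {
        int ones = 0;
        for (int bit = 0; bit < 8; bit++)
            ones += (v >> bit) & 1;
        zsp_table[v] = (uint8_t)((v == 0 ? FLAG_Z : 0) |
                                 ((v & 0x80) ? FLAG_S : 0) |
                                 (ones % 2 == 0 ? FLAG_P : 0));
    }
}

uint8_t zsp_flags(uint8_t result)
{
    return zsp_table[result];
}

/* Cycle down-counter: subtract the instruction cost and test the sign,
 * instead of comparing a running total with the next interrupt time. */
int interrupt_due(int32_t *cycles_left, int cost)
{
    *cycles_left -= cost;
    return *cycles_left < 0;
}
```

A sign test after a subtraction is cheap on most processors because the subtraction itself can set the condition flags.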
Direct Register Mapping
Replace abstract processor state with direct mapping of hardware registers
Creates emulator that is qualitatively different: it “crosses a line of abstraction”
Mapping uses up registers for dedicated purpose
Creates highly constrained mode of operation, denies compiler the flexibility needed for ordinary compiled code generation
First step towards direct mapping between emulated/target ISAs
Even a proximate mapping will produce fastest emulation
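[Figure: direct register mapping. Each 32-bit ARM register holds one item: BC (8-bit registers B and C), DE (D and E), HL (H and L), A, SP, PC, cc, rmem (base memory address), rcycle (cycle down-counter). Legend: 8-bit 8080 register; 16-bit 8080 register; 32-bit ARM register]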
Custom Call Interface
Custom call interface protocol is implemented as a special function written directly in ARM assembly
Call into the emulator core requires the following steps:
1. All registers saved on stack
2. Mapped registers loaded with values for emulated processor state
3. Return address saved in ARM link return register lr
4. Jump to dynamic (re-)entry address where direct register emulation is to begin or continue
Return to normal execution possible at any point within emulator core with lr → pc operation
Upon return, following steps are required:
1. Address in register r0 is immediately saved in processor state C struct as new re-entry address
2. All mapped registers are saved in processor state C struct
3. Previous register values from normal execution are restored
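The entry and return steps can be modelled in portable C, purely to show the order of operations. The real interface is written in ARM assembly and works on hardware registers; here a local struct stands in for the mapped registers, an integer stands in for the re-entry address returned in r0, and all names (`Mapped`, `CoreState`, `call_core`, `toy_core`) are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the values that live in mapped ARM registers. */
typedef struct {
    uint32_t bc, de, hl, a, sp, cc;
    int32_t  cycles;              /* cycle down-counter */
} Mapped;

/* Processor state kept in memory between calls into the core. */
typedef struct {
    Mapped regs;
    int    reentry;               /* stand-in for the re-entry address */
} CoreState;

/* A core runs on the live values and returns where to resume next time. */
typedef int (*CoreFn)(Mapped *live, int entry);

void call_core(CoreState *st, CoreFn core)
{
    Mapped live = st->regs;                 /* load mapped registers      */
    int resume = core(&live, st->reentry);  /* jump to the re-entry point */
    st->reentry = resume;                   /* save new re-entry address  */
    st->regs = live;                        /* save mapped registers      */
}

/* Toy core for demonstration: each step costs 4 cycles and the core
 * returns as soon as the down-counter has gone negative. */
int toy_core(Mapped *live, int entry)
{
    while (live->cycles >= 0) {
        live->a += 1;
        live->cycles -= 4;
        entry++;
    }
    return entry;
}
```

Saving and restoring the caller's own registers has no C equivalent here; in this model the compiler's normal calling convention takes care of it.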
Indirect-Threaded Dispatch
Coded in Assembly
Equivalent of emulator switch statement coded directly in ARM assembly
Each emulated instruction operates directly on register processor state
“Calling into” the emulator core requires custom interface protocol
Test design to verify correctness of assembly before dynamic binary translation
[Figure: indirect-threaded dispatch coded in assembly. Conventional execution: Load 8080 Binary; Call Emulate Op; Next Interrupt Calc.; Event Input Update; Push PC to Stack; Interrupt Handler; Last Event? Return. The custom call interface (direct register emulation interface) connects it to direct register emulation, coded entirely in ARM assembly: Jump Instruction @ State->PC; Update State; Decrement Cycle Count; Cycle < 0?]
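The dispatch loop in the figure can be approximated in C with a table of handlers indexed by opcode. This is only an analogue of the assembly design: the `Cpu` struct, the handler names and `run_until_interrupt` are assumptions, flag updates are omitted, and only two opcodes are filled in.

```c
#include <stdint.h>

typedef struct {
    uint8_t        a;
    uint16_t       pc;
    int32_t        cycles_left;   /* cycle down-counter */
    const uint8_t *memory;
} Cpu;

typedef void (*Handler)(Cpu *);

/* Each handler updates the state and decrements the cycle counter. */
static void op_nop(Cpu *c)   { c->pc += 1; c->cycles_left -= 4; }
static void op_inr_a(Cpu *c) { c->a += 1; c->pc += 1; c->cycles_left -= 5; }

/* Jump table indexed by opcode; unmodelled entries stay NULL. */
static const Handler handlers[256] = {
    [0x00] = op_nop,    /* NOP   */
    [0x3c] = op_inr_a,  /* INR A */
};

/* Dispatch until the down-counter goes negative or an unmodelled opcode
 * is reached. Returns the number of instructions executed. */
int run_until_interrupt(Cpu *c)
{
    int executed = 0;
    while (c->cycles_left >= 0) {
        Handler h = handlers[c->memory[c->pc]];
        if (!h)
            break;
        h(c);
        executed++;
    }
    return executed;
}
```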
Dynamic Binary Translation
The point of the exercise ...
Fast translator designed to process original binary and emit ARM instruction sequence to create an equivalent binary program
Emitted binary based on machine code templates requiring dynamic “fix-up” to specialize
Fast means several O(N) passes; translation accounts for about 1% of benchmark time, so its cost is negligible
Cycle-accuracy requires interlaced interrupt checks to be emitted
Current design is complicated, emits instruction blocks with redundant trailing fast code
[Figure: translation of 8080 ops to ARM ops. The 8080 Register File maps to the ARM Register File. 8080 Memory Map, instruction addresses: 0x0000, 0x0001, 0x0003, 0x0006, 0x0007, 0x0009, 0x000A, 0x000B, 0x000C, 0x000F. ARM Memory Map, instruction sequence addresses: 0x0000, 0x0010, 0x001C, 0x0020, 0x002C]
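One of the O(N) passes could be a linear sweep that records where each 8080 instruction's translation will start. The sketch below is an assumption about how such a pass might look, not the authors' translator; `OpInfo`, `build_xref` and `NO_XREF` are hypothetical names, and the sweep assumes the ROM can be decoded sequentially from address 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-opcode tables. Real values would come from the 8080
 * encoding and from the sizes of the assembled templates. */
typedef struct {
    uint8_t len8080[256];    /* 8080 instruction length in bytes (1..3) */
    uint8_t arm_words[256];  /* template length in 32-bit ARM words     */
} OpInfo;

#define NO_XREF 0xFFFFFFFFu  /* marks bytes that do not start an instruction */

/* Single linear pass: for each 8080 instruction start, record the word
 * offset of its translation. Returns the total number of ARM words. */
uint32_t build_xref(const uint8_t *rom, size_t rom_size,
                    const OpInfo *info, uint32_t *xref)
{
    uint32_t arm_off = 0;
    size_t pc = 0;

    for (size_t i = 0; i < rom_size; i++)
        xref[i] = NO_XREF;

    while (pc < rom_size) {
        uint8_t op = rom[pc];
        size_t len = info->len8080[op] ? info->len8080[op] : 1;
        xref[pc] = arm_off;
        arm_off += info->arm_words[op];
        pc += len;
    }
    return arm_off;
}
```

A later pass can then use the table to resolve branch targets, since every 8080 jump destination has a known offset in the emitted ARM program.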
Modified Direct Register Mapping
Direct register mapping must be slightly modified
Dedicated register rcross holds address of instruction translation cross-reference to support dynamic address faults
No longer a need for an emulated program counter, subsumed by actual ARM program counter(!)
[Figure: modified direct register mapping. Each 32-bit ARM register holds one item: BC (8-bit registers B and C), DE (D and E), HL (H and L), A, SP, cc, rmem (base memory address), rcycle (cycle down-counter), rcross (address cross-index). PC: not used for dynamic binary translation, where the 8080 PC is no longer emulated. rcross: added for dynamic binary translation to support dynamic address faults. Legend: 8-bit 8080 register; 16-bit 8080 register; 32-bit ARM register]
Dynamic Binary Translation
Emulator core used to test direct register mapping replaced with dynamically translated binary
Uses same custom interface protocol
Execution of the binary proceeds with periodic interrupts and re-entries to achieve cycle-accurate emulation based on pre-recorded events
[Figure: dynamic binary translation. Conventional execution: Load/Translate 8080 Binary; Call Emulate Op; Next Interrupt Calc.; Event Input Update; Push PC to Stack; Interrupt Handler; Last Event? Return. The custom call interface (direct register emulation interface) connects it to direct register emulation with binary translation: an ARM binary program with preemption, made up of instruction blocks]
Machine Code Templates
ARM assembly code created for purpose of generating machine code fragments
Fragments of machine code are extracted from assembled binary object
8080 instructions form classes based on required fix-up
Machine code fix-up requires sub-instruction bit-field modification
CANNOT be replicated with compiled C code even with inline assembly
Examples of assembly code fragments used to generate machine code templates:

instr_0x4a:                     @ MOV C,D
    bic rbc, rbc, #255
    lsr r0, rde, #8
    and r0, r0, #255
    orr rbc, r0
Emitted machine code requires no fix-up

instr_0x01:                     @ LXI B,word
    mov rbc, #XBYTE_LO
    orr rbc, rbc, #XBYTE_HI
Emitted machine code requires fix-up to replace dummy values for XBYTE_LO and XBYTE_HI

instr_0xc2:                     @ JNZ addr
    tst rcc, #ZFLAG
    beq instr_0xc2
Emitted machine code requires fix-up to adjust PC-relative branch
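The two kinds of fix-up named above amount to rewriting a bit field inside an already assembled 32-bit ARM instruction word. The sketch below relies on the standard ARM (A32) encodings: the 8-bit immediate of a data-processing instruction sits in bits 7..0, and the signed word offset of a branch sits in bits 23..0, relative to the branch address plus 8. The function names are hypothetical and this is not the authors' translator code.

```c
#include <stdint.h>

/* Replace the 8-bit immediate (bits 7..0) of a data-processing
 * instruction, leaving opcode, registers and rotate field untouched. */
uint32_t fixup_imm8(uint32_t insn, uint8_t value)
{
    return (insn & 0xFFFFFF00u) | value;
}

/* Replace the 24-bit word offset (bits 23..0) of a B/Bcc instruction so
 * that it lands on target_addr. Unsigned arithmetic wraps, which gives
 * the two's-complement encoding for backward branches. */
uint32_t fixup_branch(uint32_t insn, uint32_t insn_addr, uint32_t target_addr)
{
    uint32_t byte_off = target_addr - (insn_addr + 8u);
    uint32_t imm24 = (byte_off >> 2) & 0x00FFFFFFu;
    return (insn & 0xFF000000u) | imm24;
}
```

For a 16-bit operand such as the one in LXI, the low byte would go into a `mov` template and the high byte into an `orr` template whose rotate field already shifts the immediate left by 8, so both patches reduce to `fixup_imm8`.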
Instruction Blocks
Dynamic binary translation emits ARM instruction blocks corresponding to runs of 8080 instructions to emulate the original binary
Each code block has a maximum path in terms of clock cycles before reaching the bottom or jumping to another code block
[Figure: layout of an emitted instruction block. The block opens with "cmp rcyc, #BLOCK_CYCLE" and "bmi ALPHA", a check of whether a cycle interrupt can occur within the instruction block. Safe instructions: Instruction 1 to Instruction N, each preceded by a cycle check, executed close to a cycle interrupt and ending with "b BETA". Fast instructions: Instruction 1 to Instruction N without cycle checks, executed when there is no possibility of a cycle interrupt. ALPHA and BETA are the branch target labels; both paths continue to the next instruction block]
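The choice between the two copies of a block can be modelled in C as follows. This is a sketch under assumptions: the function name, the per-instruction cost array and the direction of the comparison are illustrative and are not taken from the emitted ARM code.

```c
#include <stdint.h>

/* Run one block of n instructions with the given per-instruction costs.
 * Returns the number of instructions completed before the cycle
 * down-counter went negative (n if the whole block finished). */
int run_block(int32_t *cycles_left, const uint8_t *cost, int n,
              int32_t block_max_cycles)
{
    if (*cycles_left >= block_max_cycles) {
        /* Fast copy: the longest path through the block fits in the
         * remaining budget, so no per-instruction check is needed. */
        for (int i = 0; i < n; i++)
            *cycles_left -= cost[i];
        return n;
    }

    /* Safe copy: an interrupt may fall inside the block, so the
     * counter is checked before every instruction. */
    for (int i = 0; i < n; i++) {
        if (*cycles_left < 0)
            return i;
        *cycles_left -= cost[i];
    }
    return n;
}
```

The cost of this scheme is code size, since each block is emitted twice, which matches the slide's remark about redundant trailing fast code.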
Address Space Layout
0x40000 base address alignment allows 32-bit ARM address to be represented in 16-bits, maintaining emulated stack size
0x8000 offset enabled dynamic address fault detection/correction
Offsets from the base address (base address with 0x40000 alignment):
  0x0000 - 0x1FFF   8080 Program Instructions
  0x2000 - 0x23FF   Stack (↑)
  0x2400 - 0x3FFF   Video RAM
  base + 0x8000     ARM Program Instructions (binary translation from 8080 to ARM)
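One way the 16-bit representation could work is sketched below. It rests on an assumption that the slides do not state: that a 4-byte-aligned ARM address inside the 0x40000-byte region is stored as its word index, which fits in 16 bits because 0x40000 / 4 = 0x10000. All function names are hypothetical.

```c
#include <stdint.h>

#define ARM_CODE_OFF 0x8000u  /* translated code starts at base + 0x8000 */

/* Assumed packing: word index of a 4-byte-aligned address relative to
 * the 0x40000-aligned base. */
uint16_t pack_addr(uint32_t base, uint32_t arm_addr)
{
    return (uint16_t)((arm_addr - base) >> 2);
}

uint32_t unpack_addr(uint32_t base, uint16_t packed)
{
    return base + ((uint32_t)packed << 2);
}

/* Under the same assumption, a packed value below 0x8000 / 4 = 0x2000
 * cannot point into the ARM program, so it can be recognized as an
 * original 8080 address that still needs translation. */
int is_untranslated(uint16_t packed)
{
    return packed < (ARM_CODE_OFF >> 2);
}
```

With this reading, a return address pushed by a translated CALL still occupies two bytes on the emulated stack, which would explain how the emulated stack size is maintained.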
Results
Original 8080 processor: average 7.72 cycles per Op
Contest submission achieved approximately 4:1 overhead on ARM processor
Fastest emulator close to 1:1 in cycle efficiency
Method                       Performance (seconds)   ARM Cycles per 8080 Op   Speedup
Original Emulator            12.97                   104.60                   N/A
Optimized Emulator           6.32                    50.97                    2.05x
Direct Register Emulation*   3.37                    27.18                    3.85x
Direct Threaded              1.52                    12.26                    8.53x
Dynamic Binary Translation   1.08                    8.71                     12.01x
*Contest submission
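The derived columns can be checked with two small helpers. The function names are illustrative; the inputs are the figures reported in the table and the 7.72 cycles per op quoted for the original 8080.

```c
/* Speedup relative to the original emulator's run time. */
double speedup(double baseline_seconds, double seconds)
{
    return baseline_seconds / seconds;
}

/* ARM cycles spent per original 8080 cycle. */
double cycle_overhead(double arm_cycles_per_op, double cycles_8080_per_op)
{
    return arm_cycles_per_op / cycles_8080_per_op;
}
```

For example, `speedup(12.97, 1.08)` is about 12.01, `cycle_overhead(27.18, 7.72)` is about 3.5 for the contest submission, and `cycle_overhead(8.71, 7.72)` is about 1.13 for dynamic binary translation.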