Date post: 13-Feb-2019
U.S. Army Research, Development and Engineering Command

Cycle-Accurate 8080 Emulation Using an ARM11 Processor

With Dynamic Binary Translation

James Ross

Engility Corporation

David Richie

Brown Deer Technology

UNCLASSIFIED / APPROVED FOR PUBLIC RELEASE

This work was supported by the U.S. Army Research Laboratory (ARL) under the Advanced Computing research project. We expressly thank Song Park (ARL) and Dale Shires (ARL) for technical discussions during this effort.

Fast Cycle-Accurate Emulation

Optimize an Intel 8080 emulator on an ARM-based Raspberry Pi to run the original 1978 Space Invaders ROM binary

Software solutions only, no additional hardware or modifications

Benchmark was replay of recorded game using clock-cycle time-stamped events

Emulated game must arrive at correct final framebuffer

Cycle-accurate emulation made the problem challenging and constrained allowable emulator optimization

More details available here: https://caesr.uwaterloo.ca/memocode/

[Figure: components combined to form the emulator, including the 8 KB ROM]

Intel 8080

Registers/Instructions

8-bit microprocessor

NOP, LXI(R,word), STAX(R), INX(R), INR(R), DCR(R), MVI(R,byte), RLC, DAD(R), LDAX(R), DCX(R), RRC, RAL, RAR, SHLD(word), DAA, LHLD(word), CMA, STA(word), STC, LDA(word), CMC, MOV(R0,R1), HLT

RNZ, POP(R), JNZ(word), JMP(word), CNZ(word), PUSH(R), ADI(byte), RST(pc), RZ, RET, JZ(word), CZ(word), CALL(word), ACI(byte), RNC, JNC(word), OUT(byte), CNC(word), SUI(byte), RC, JC(word), IN(byte), CC(word), SBI(byte)

ADD(R), ADC(R), SUB(R), SBB(R), ANA(R), XRA(R), ORA(R), CMP(R)

RPO, JPO(word), XTHL, CPO(word), ANI(byte), RPE, PCHL, JPE(word), XCHG, CPE(word), XRI(byte), RP, POP_PSW, JP(word), DI, CP(word), PUSH_PSW, ORI(byte), RM, SPHL, JM(word), EI, CM(word), CPI(byte)

256* Instructions (1-3 byte instruction width)

Original Emulator

Original 8080 emulator from http://emulator101.com

C code, indirect threaded dispatch design

Processor state represented in memory accessed as a C struct

[Figure: flowchart of the original emulator loop. Main loop: Load 8080 Binary → Call Emulate Op → Update Cycle Count → Cycle > Interrupt? (if so: Push PC to Stack, Interrupt Handler, Next Interrupt Calc.) → Event Input Update → Last Event? Break. Inside Emulate Op: Lookup Opcode @ State->PC → Switch-Case Statement using Opcode → Update State (different for each of the 256 instructions) → Return # Cycles[Opcode].]
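
The emulator loop on this slide can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the emulator101 code: the `State8080` layout and the names `emulate_op` and `run` are hypothetical, flag updates are omitted, and only three opcodes are handled (NOP at 4 cycles, INR B at 5, JMP at 10).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical processor state held in memory and accessed as a C struct. */
typedef struct {
    uint8_t  b;        /* 8080 register B */
    uint16_t pc;       /* 8080 program counter */
    uint8_t *memory;   /* emulated address space */
} State8080;

/* Emulate one instruction and return the 8080 cycles it consumed.
   A full emulator has one case per opcode; flags are not modelled here. */
static int emulate_op(State8080 *s)
{
    const uint8_t *op = &s->memory[s->pc];
    switch (op[0]) {
    case 0x00:                                   /* NOP */
        s->pc += 1;
        return 4;
    case 0x04:                                   /* INR B */
        s->b++;
        s->pc += 1;
        return 5;
    case 0xC3:                                   /* JMP word */
        s->pc = (uint16_t)(op[1] | (op[2] << 8));
        return 10;
    default:
        return -1;                               /* not covered by this sketch */
    }
}

/* Step one instruction at a time until `limit` cycles have elapsed.
   Returns the number of cycles actually consumed. */
static long run(State8080 *s, long limit)
{
    assert(s != 0 && s->memory != 0);
    long cycles = 0;
    while (cycles < limit) {
        int n = emulate_op(s);
        if (n < 0)
            break;
        cycles += n;
    }
    return cycles;
}
```

In a loop of this shape every instruction pays for a function call, a memory-resident state update, and a cycle comparison, which is the overhead the later slides remove.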

General Optimizations

General optimizations applied to existing design produced 2x speedup

Optimizations did not alter the basic design of the original emulator

Emulate8080p() call changed to return only when program complete

Original call emulated single instruction

Cycle check and interrupt service calls moved inside this routine

Combine 8-bit registers into 16-bit registers with union for 16-bit operations

Specifically: B8,C8 → BC16 ; D8,E8 → DE16 ; H8,L8 → HL16

Paired 8-bit memory operations combined into single 16-bit memory operation

Remove cycle count table lookup and insert into instruction code blocks

Use 256-way lookup tables for setting register flags (Z,S,P)

Replace cycle count check with cycle down-counter and check if less than 0
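
Two of these optimizations can be sketched in C: the register-pair union and the 256-way flag table. The type and function names (`RegPair`, `zsp_table`, `init_zsp_table`) are assumptions for illustration. The union layout assumes a little-endian host, and the flag bit positions follow the 8080 status byte (S = 0x80, Z = 0x40, P = 0x04).

```c
#include <assert.h>
#include <stdint.h>

/* Register pair viewed as two 8-bit halves or as one 16-bit value.
   Assumes a little-endian host, where `lo` overlays the low byte of `w`. */
typedef union {
    struct { uint8_t lo, hi; } b8;   /* e.g. C and B */
    uint16_t w;                      /* e.g. BC */
} RegPair;

enum { FLAG_S = 0x80, FLAG_Z = 0x40, FLAG_P = 0x04 };

static uint8_t zsp_table[256];

/* Precompute the Z, S and P flags for every possible 8-bit result. */
static void init_zsp_table(void)
{
    for (int v = 0; v < 256; v++) {
        int ones = 0;
        for (int bit = 0; bit < 8; bit++)
            ones += (v >> bit) & 1;

        uint8_t f = 0;
        if (v == 0)          f |= FLAG_Z;   /* zero result */
        if (v & 0x80)        f |= FLAG_S;   /* sign bit set */
        if ((ones & 1) == 0) f |= FLAG_P;   /* even parity */
        zsp_table[v] = f;
    }
}

/* INX on a pair becomes a single 16-bit increment through the union. */
static void inx(RegPair *r)
{
    assert(r != 0);
    r->w++;
}
```

After `init_zsp_table()`, an arithmetic instruction can set three flags with one indexed load such as `zsp_table[result]` in place of three separate tests.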

Direct Register Mapping

Replace abstract processor state with direct mapping of hardware registers

Creates emulator that is qualitatively different – “crosses a line of abstraction”

Mapping uses up registers for dedicated purpose

Creates highly constrained mode of operation, denies compiler the flexibility needed for ordinary compiled code generation

First step towards direct mapping between emulated/target ISAs

Even an approximate mapping will produce fastest emulation

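[Figure: direct register map. Each 32-bit ARM register is dedicated to one item: the 8080 register pairs B/C, D/E and H/L; the 8080 registers A, cc, SP and PC; rmem (Base Memory Address); and rcycle (Cycle Down-Counter). Legend: 8-bit 8080 register, 16-bit 8080 register, 32-bit ARM register.]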
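
When a register pair lives in one 32-bit host register, 8-bit operations on one half become mask-and-shift sequences. The sketch below models that in portable C, with plain `uint32_t` values standing in for the dedicated ARM registers. The names `rbc` and `rde` follow the assembly fragments shown later in the deck; the function names and the convention that the pair occupies the low 16 bits are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* MOV C,D: copy D (high byte of the DE pair) into C (low byte of BC). */
static uint32_t mov_c_d(uint32_t rbc, uint32_t rde)
{
    assert(rbc <= 0xFFFFu && rde <= 0xFFFFu);
    rbc &= ~0xFFu;               /* clear C */
    rbc |= (rde >> 8) & 0xFFu;   /* insert D */
    return rbc;
}

/* INX B: 16-bit increment of BC, wrapping from 0xFFFF to 0x0000. */
static uint32_t inx_b(uint32_t rbc)
{
    assert(rbc <= 0xFFFFu);
    return (rbc + 1u) & 0xFFFFu;
}
```

The explicit mask in `inx_b` is the price of holding a 16-bit value in a 32-bit register: the wrap-around that 16-bit hardware does for free has to be emitted as an extra instruction.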

Custom Call Interface

Custom call interface protocol is implemented as a special function written directly in ARM assembly

Call into the emulator core requires the following steps:

1. All registers saved on stack

2. Mapped registers loaded with values for emulated processor state

3. Return address saved in ARM link register lr

4. Jump to dynamic (re-)entry address where direct register emulation is to begin or continue

Return to normal execution possible at any point within emulator core with lr → pc operation

Upon return, following steps are required:

1. Address in register r0 is immediately saved in processor state C struct as new re-entry address

2. All mapped registers are saved in processor state C struct

3. Previous register values from normal execution are restored

Indirect-Threaded Dispatch Coded in Assembly

Equivalent of emulator switch statement coded directly in ARM assembly

Each emulated instruction operates directly on register processor state

“Calling into” the emulator core requires custom interface protocol

Test design to verify correctness of assembly before dynamic binary translation

[Figure: flowchart split into two regions joined by the Custom Call Interface. Conventional Execution: Load 8080 Binary, Call Emulate Op, Push PC to Stack, Interrupt Handler, Next Interrupt Calc., Event Input Update, Last Event? Return. Direct Register Emulation (coded entirely in ARM assembly, entered through the Direct Register Emulation Interface): Jump Instruction @ State->PC, Update State, Decrement Cycle Count, Cycle < 0?]
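
The control flow of this design (jump to a handler selected by the opcode at PC, update state, decrement the cycle counter, leave when it goes negative) can be approximated in C with a handler table. This is only an analogy to the ARM assembly core described on the slide: the `Cpu` struct, the handler names, and the table are assumptions, and standard C cannot pin state to machine registers or jump handler-to-handler the way the assembly does.

```c
#include <assert.h>
#include <stdint.h>

typedef struct Cpu Cpu;
typedef void (*Handler)(Cpu *);

struct Cpu {
    uint16_t       pc;
    uint8_t        a;
    int32_t        rcycle;   /* cycle down-counter; negative means interrupt due */
    const uint8_t *mem;
};

/* Each handler updates state and charges its own cycle cost,
   so no separate cycle-count table lookup is needed. */
static void op_nop(Cpu *c)   { c->pc += 1; c->rcycle -= 4; }
static void op_inr_a(Cpu *c) { c->a++; c->pc += 1; c->rcycle -= 5; }
static void op_jmp(Cpu *c)
{
    c->pc = (uint16_t)(c->mem[c->pc + 1] | (c->mem[c->pc + 2] << 8));
    c->rcycle -= 10;
}

/* Only three opcodes are populated; the remaining entries are null. */
static const Handler dispatch[256] = {
    [0x00] = op_nop,
    [0x3C] = op_inr_a,
    [0xC3] = op_jmp,
};

/* Run until the down-counter goes negative or an unhandled opcode is met. */
static void run_until_interrupt(Cpu *c)
{
    assert(c != 0 && c->mem != 0);
    while (c->rcycle >= 0) {
        Handler h = dispatch[c->mem[c->pc]];
        if (h == 0)
            break;
        h(c);
    }
}
```

Control returns to the caller only when `rcycle` drops below zero, mirroring the "Cycle < 0?" exit back to conventional execution.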

Dynamic Binary Translation

The point of the exercise ...

Fast translator designed to process original binary and emit ARM instruction sequence to create an equivalent binary program

Emitted binary based on machine code templates requiring dynamic “fix-up” to specialize

Fast means several O(N) passes; translation accounts for about 1% of benchmark time, so its cost is negligible

Cycle-accuracy requires interlaced interrupt checks to be emitted

Current design is complicated, emits instruction blocks with redundant trailing fast code

[Figure: the translator maps the 8080 register file onto the ARM register file and the 8080 memory map onto the ARM memory map. Address labels shown for the variable-width 8080 ops: 0x0000, 0x0001, 0x0003, 0x0006, 0x0007, 0x0009, 0x000A, 0x000B, 0x000C, 0x000F. Address labels shown for the translated ARM code: 0x0000, 0x0010, 0x001C, 0x0020, 0x002C.]
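
One of the O(N) passes such a translator needs is a sweep that records where each 8080 instruction's translation will land in the ARM code. The sketch below is an assumption about how such a pass could look, not the authors' translator: `build_cross_index`, `NO_ENTRY`, and the made-up sizing rule in `demo_template_bytes` are hypothetical. The instruction lengths cover the documented 8080 opcodes, and the linear sweep treats every byte as code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_ENTRY 0xFFFFFFFFu   /* address is not the start of an instruction */

/* Length in bytes of a documented 8080 instruction. */
static int i8080_length(uint8_t op)
{
    switch (op) {
    /* LXI, SHLD, LHLD, STA, LDA, JMP, CALL, conditional jumps and calls */
    case 0x01: case 0x11: case 0x21: case 0x31:
    case 0x22: case 0x2A: case 0x32: case 0x3A:
    case 0xC3: case 0xCD:
    case 0xC2: case 0xCA: case 0xD2: case 0xDA:
    case 0xE2: case 0xEA: case 0xF2: case 0xFA:
    case 0xC4: case 0xCC: case 0xD4: case 0xDC:
    case 0xE4: case 0xEC: case 0xF4: case 0xFC:
        return 3;
    /* MVI, immediate arithmetic/logic, OUT, IN */
    case 0x06: case 0x0E: case 0x16: case 0x1E:
    case 0x26: case 0x2E: case 0x36: case 0x3E:
    case 0xC6: case 0xCE: case 0xD6: case 0xDE:
    case 0xE6: case 0xEE: case 0xF6: case 0xFE:
    case 0xD3: case 0xDB:
        return 2;
    default:
        return 1;
    }
}

/* Hypothetical size in bytes of the ARM code emitted for one opcode.
   A real translator would take this from its machine code templates. */
static uint32_t demo_template_bytes(uint8_t op)
{
    return 4u + 4u * (uint32_t)i8080_length(op);
}

/* Single linear pass: xref[addr] receives the ARM code offset for the
   instruction starting at 8080 address addr. Returns total ARM bytes. */
static uint32_t build_cross_index(const uint8_t *rom, size_t n, uint32_t *xref)
{
    assert(rom != 0 && xref != 0);
    for (size_t i = 0; i < n; i++)
        xref[i] = NO_ENTRY;

    uint32_t arm_off = 0;
    for (size_t addr = 0; addr < n; addr += (size_t)i8080_length(rom[addr])) {
        xref[addr] = arm_off;
        arm_off += demo_template_bytes(rom[addr]);
    }
    return arm_off;
}
```

With such a table, a later pass can resolve each 8080 jump target to an ARM offset when it fixes up the corresponding branch.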

Modified Direct Register Mapping

Direct register mapping must be slightly modified

Dedicated register rcross holds address of instruction translation cross-reference to support dynamic address faults

No longer a need for an emulated program counter, subsumed by actual ARM program counter(!)

[Figure: modified direct register map. 32-bit ARM registers hold the 8080 register pairs B/C, D/E and H/L; the 8080 registers A, cc and SP; rmem (Base Memory Address); and rcycle (Cycle Down-Counter). PC: not used for dynamic binary translation, where the 8080 PC is no longer emulated. rcross (Address Cross-Index): added for dynamic binary translation to support dynamic address faults. Legend: 8-bit 8080 register, 16-bit 8080 register, 32-bit ARM register.]

Dynamic Binary Translation

Emulator core used to test direct register mapping replaced with dynamically translated binary

Uses same custom interface protocol

Execution of the binary proceeds with periodic interrupts and re-entries to achieve cycle-accurate emulation based on pre-recorded events

[Figure: flowchart split into two regions joined by the Custom Call Interface. Conventional Execution: Load/Translate 8080 Binary, Call Emulate Op, Push PC to Stack, Interrupt Handler, Next Interrupt Calc., Event Input Update, Last Event? Return. Direct Register Emulation with Binary Translation (entered through the Direct Register Emulation Interface): an ARM binary program with preemption, made up of a chain of Instruction Blocks.]

Machine Code Templates

ARM assembly code created for purpose of generating machine code fragments

Fragments of machine code are extracted from assembled binary object

8080 instructions form classes based on required fix-up

Machine code fix-up requires sub-instruction bit-field modification

CANNOT be replicated with compiled C code even with inline assembly

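Examples of assembly code fragments used to generate machine code templates:

instr_0x4a: @ MOV C,D
    bic rbc, rbc, #255          @ clear C (low byte of rbc)
    lsr r0, rde, #8             @ shift D (high byte of rde) down
    and r0, r0, #255            @ keep only the 8 bits of D
    orr rbc, r0                 @ merge D into the C position

Emitted machine code requires no fix-up

instr_0x01: @ LXI B,word
    mov rbc, #XBYTE_LO          @ low byte of word (dummy value in the template)
    orr rbc, rbc, #XBYTE_HI     @ high byte of word (dummy value in the template)

Emitted machine code requires fix-up to replace dummy values for XBYTE_LO and XBYTE_HI

instr_0xc2: @ JNZ addr
    tst rcc, #ZFLAG             @ test the emulated zero flag
    beq instr_0xc2              @ taken when the flag is clear; target is patched later

Emitted machine code requires fix-up to adjust PC-relative branch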
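
The two kinds of fix-up named here can be illustrated on 32-bit ARM (A32) instruction words. In the A32 encoding, a data-processing immediate keeps its 8-bit value in bits 7:0 with a 4-bit rotate in bits 11:8, and a branch keeps a signed word offset, relative to the branch address plus 8, in bits 23:0. The helper names below are assumptions, and the template words in the usage note use r4 as a stand-in because the deck does not say which ARM register rbc is bound to.

```c
#include <assert.h>
#include <stdint.h>

/* Replace the 8-bit immediate of a data-processing instruction,
   leaving the rotate field and all other bits untouched. */
static uint32_t fixup_imm8(uint32_t insn, uint8_t value)
{
    return (insn & 0xFFFFFF00u) | value;
}

/* Rewrite the 24-bit word offset of a B/Bcc instruction located at
   insn_addr so that it branches to target. Both addresses must be
   4-byte aligned; the ARM PC reads as insn_addr + 8. */
static uint32_t fixup_branch(uint32_t insn, uint32_t insn_addr, uint32_t target)
{
    assert((insn_addr & 3u) == 0 && (target & 3u) == 0);
    uint32_t delta = target - (insn_addr + 8u);      /* wraps for backward jumps */
    uint32_t imm24 = (delta >> 2) & 0x00FFFFFFu;
    return (insn & 0xFF000000u) | imm24;
}
```

For example, `fixup_imm8(0xE3A040FFu, 0x34)` turns a template `mov r4, #0xFF` into `mov r4, #0x34`, and `fixup_imm8(0xE3844CFFu, 0x12)` turns `orr r4, r4, #0xFF00` into `orr r4, r4, #0x1200`. Together the two patched words load 0x1234 for an `LXI B,0x1234`.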

Instruction Blocks

Dynamic binary translation emits ARM instruction blocks corresponding to runs of 8080 instructions to emulate the original binary

Each code block has a maximum path in terms of clock cycles before reaching the bottom or jumping to another code block

[Figure: layout of one emitted instruction block.]

The block opens with a check of whether a cycle interrupt can occur within the instruction block:

    cmp rcyc, #BLOCK_CYCLE
    bmi ALPHA

It then holds two copies of Instruction 1, Instruction 2, ... Instruction N:

Safe instructions: executed close to a cycle interrupt, with a cycle check prior to each instruction. This sequence ends with b BETA.

Fast instructions: executed when there is no possibility of a cycle interrupt.

The labels ALPHA and BETA mark the branch targets inside the block, and both paths continue into the next instruction block.
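
The two-path idea can be modelled in C. This sketch only illustrates the decision made at block entry. It does not reproduce the branch sense or thresholds of the emitted ARM code, and the block contents, `BLOCK_CYCLE` value, and function name are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block of three 8080 instructions and their cycle costs. */
enum { BLOCK_LEN = 3, BLOCK_CYCLE = 19 };   /* 19 = longest path through the block */
static const int block_cost[BLOCK_LEN] = { 4, 5, 10 };

/* Execute one block against the cycle down-counter.
   Returns how many instructions ran; fewer than BLOCK_LEN means the
   block was left early because a cycle interrupt fell due. */
static int run_block(int32_t *rcyc)
{
    assert(rcyc != 0);
    int done = 0;

    if (*rcyc >= BLOCK_CYCLE) {
        /* Fast copy: the counter cannot go negative inside the block,
           so no per-instruction checks are needed. */
        for (; done < BLOCK_LEN; done++)
            *rcyc -= block_cost[done];
    } else {
        /* Safe copy: check the counter before every instruction. */
        for (; done < BLOCK_LEN; done++) {
            if (*rcyc < 0)
                break;              /* interrupt due: hand control back */
            *rcyc -= block_cost[done];
        }
    }
    return done;
}
```

Because interrupts are rare relative to instructions, nearly all blocks take the fast copy and pay a single comparison for the whole block.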

Address Space Layout

0x40000 base address alignment allows 32-bit ARM address to be represented in 16 bits, maintaining emulated stack size

0x8000 offset enabled dynamic address fault detection/correction

[Figure: memory map relative to a base address with 0x40000 alignment]

base + 0x0000 to 0x1FFF    8080 Program Instructions
base + 0x2000 to 0x23FF    Stack (↑)
base + 0x2400 to 0x3FFF    Video RAM
base + 0x8000              ARM Program Instructions (binary translation from 8080 to ARM)
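
The slide does not say how a 32-bit ARM address is reduced to 16 bits. One encoding consistent with its numbers, offered purely as an assumption, relies on two facts: a 0x40000-aligned region spans 18 bits of offset, and ARM instructions are 4-byte aligned, so the low two bits of a code address are always zero. Dropping them leaves a 16-bit value that fits in an emulated 8080 stack slot. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_ALIGN 0x40000u   /* alignment of the base address, from the slide */

/* Compress an ARM code address inside the region into 16 bits. */
static uint16_t pack_addr(uint32_t base, uint32_t arm_addr)
{
    assert(base % BASE_ALIGN == 0);          /* base is 0x40000-aligned   */
    assert(arm_addr - base < BASE_ALIGN);    /* address lies in the region */
    assert((arm_addr & 3u) == 0);            /* ARM code is word-aligned   */
    return (uint16_t)((arm_addr - base) >> 2);
}

/* Recover the full 32-bit ARM address from its 16-bit form. */
static uint32_t unpack_addr(uint32_t base, uint16_t packed)
{
    return base + ((uint32_t)packed << 2);
}
```

Under this assumed scheme a translated CALL could push a 16-bit value of the same width as an 8080 return address, so the emulated stack keeps its original size.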

Results

Original 8080 processor: average 7.72 cycles per Op

Contest submission achieved approximately 4:1 overhead on ARM processor

Fastest emulator close to 1:1 in cycle efficiency

Method                        Performance (seconds)   ARM Cycles per 8080 Op   Speedup
Original Emulator             12.97                   104.60                   N/A
Optimized Emulator            6.32                    50.97                    2.05x
Direct Register Emulation*    3.37                    27.18                    3.85x
Direct Threaded               1.52                    12.26                    8.53x
Dynamic Binary Translation    1.08                    8.71                     12.01x

*Contest submission

