1 Moo-Kyoung Chung©
Improvement of Compiled Instruction Set Simulator by Increasing Flexibility and
Reducing Compile Time
Moo-Kyoung Chung, Chung-Min Kyung
Department of EECS, KAIST
Outline
Previous works
  Introduction to Instruction Set Simulator (ISS)
    • Native code execution
    • Interpretive ISS
    • Compiled ISS
Improvement of compiled ISS
  New approaches
  Reducing compile time
  Increasing flexibility
Experimental result
Conclusion
Instruction Set Simulator (ISS)
Used for processor design and software design
  Instruction set simulation / architecture exploration
  Early system verification
  Pre-silicon software development
Essential for hardware/software co-simulation
  For embedded system and SoC design
  Connected with an HDL simulator or emulator
[Figure: two co-simulation setups. In both, the ISS runs the S/W on the host. The H/W runs on an HDL simulator or SystemC on the host (first setup), or on an emulator or H/W prototype (second setup).]
Instruction Set Simulator
Native code execution
Interpretive ISS
Compiled ISS
  Static compiled ISS
  Dynamic compiled ISS
Native Code Execution
Target application code is compiled for, and executed on, the host machine
Fastest
Inaccurate
  Only for functionality verification
Only supports high-level languages
  Cannot support hardware-dependent instructions
  Cannot support assembly language
  Cannot support libraries or an OS whose source code is not available
Difficult to measure performance
  Target processor instructions may differ from host processor instructions
Difficult to handle I/O access
  Trap-based method
Interpretive ISS
Simulation loop ( fetch, decode, execute )
Flexible & accurate
  Works in a way similar to the behavior of the processor
Easy to implement, easy to estimate performance
  Almost all commercially available simulators are of this type
Slow
  Several million simulated instructions per second (MIPS)

    for( ; ; ){
        inst = fetch( pc );
        opcode = decode( inst );
        switch( opcode ){
        …
        case ADD:
            …
            break;
        }
    }
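The fetch/decode/execute loop can be made concrete with a small runnable sketch. Everything here is an assumption made for illustration: the toy instruction encoding, the opcode names, the `Cpu` structure and the sample program are invented and are not ARM or the simulator discussed in these slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy ISA invented for this sketch (not ARM):
 * bits 31..24 opcode, 23..16 rd, 15..8 rs, 7..0 immediate. */
enum { OP_HALT = 0, OP_LI = 1, OP_ADD = 2, OP_SUBI = 3, OP_BNEZ = 4 };

#define ENC(op, rd, rs, imm)                                   \
    (((uint32_t)(op) << 24) | ((uint32_t)(rd) << 16) |         \
     ((uint32_t)(rs) << 8) | (uint32_t)(imm))

typedef struct {
    uint32_t pc;       /* index into the program, in instructions */
    int32_t  r[4];     /* register file */
    uint64_t executed; /* simulated instruction count */
} Cpu;

/* Fetch, decode, execute until HALT; returns the executed instruction count. */
static uint64_t interpret(Cpu *cpu, const uint32_t *prog, uint32_t len)
{
    for (;;) {
        assert(cpu->pc < len);
        uint32_t inst = prog[cpu->pc++];      /* fetch */
        uint32_t op   = inst >> 24;           /* decode */
        uint32_t rd   = (inst >> 16) & 0xff;
        uint32_t rs   = (inst >> 8) & 0xff;
        uint32_t imm  = inst & 0xff;
        assert(rd < 4 && rs < 4);
        cpu->executed++;
        switch (op) {                         /* execute */
        case OP_LI:   cpu->r[rd] = (int32_t)imm;          break;
        case OP_ADD:  cpu->r[rd] += cpu->r[rs];           break;
        case OP_SUBI: cpu->r[rd] -= (int32_t)imm;         break;
        case OP_BNEZ: if (cpu->r[rd] != 0) cpu->pc = imm; break;
        case OP_HALT: return cpu->executed;
        default:      assert(!"unknown opcode"); return cpu->executed;
        }
    }
}

/* Sample program: r1 = 5 + 4 + 3 + 2 + 1. */
static const uint32_t sum_prog[] = {
    ENC(OP_LI,   0, 0, 5),  /* 0: r0 = 5              */
    ENC(OP_LI,   1, 0, 0),  /* 1: r1 = 0              */
    ENC(OP_ADD,  1, 0, 0),  /* 2: r1 += r0            */
    ENC(OP_SUBI, 0, 0, 1),  /* 3: r0 -= 1             */
    ENC(OP_BNEZ, 0, 0, 2),  /* 4: if r0 != 0 goto 2   */
    ENC(OP_HALT, 0, 0, 0),  /* 5: stop                */
};
```

Calling `interpret` on a zero-initialized `Cpu` with `sum_prog` runs the loop to completion. Every simulated instruction pays for the fetch, the field extraction and the `switch` dispatch, which is the per-instruction overhead that makes this style slow.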
Static Compiled ISS
[Figure: generation flow of a static compiled ISS. Target application code -> target compiler -> target executable binary, followed by one of two paths.]
(A) Using binary translation: target executable binary -> binary translation -> host executable (ISS)
(B) Using C intermediate code: target executable binary -> C code generation -> simulation C code -> host C compiler -> host executable (ISS)
No compatibility across host machines
Static Compiled ISS
Advantage
Fast
  • Faster than the corresponding interpretive simulator
  • The instruction fetch and decode steps are moved into the compile process
  • The host C compiler optimizes the simulation C code
    – Powerful optimization effect of the host C compiler
    – Unnecessary activities of the processor hardware are not simulated
    – e.g., the carry flag need not be updated by every data processing instruction if the following instructions overwrite it without using it
Accurate
Easy to estimate performance
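The carry flag example can be sketched in C. This is only an illustration under assumed names (`State`, `adds`, `run_interpreted`, `run_compiled` are hypothetical): the second function is hand-written to show what a host compiler is free to produce once the per-instruction code is laid out statically and the first carry update is visibly dead.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t r[4];   /* simulated registers */
    unsigned carry;  /* simulated carry flag */
} State;

/* Flag-setting add as an interpreter would execute it: always updates carry. */
static void adds(State *s, int rd, int rn, int rm)
{
    assert(rd >= 0 && rd < 4 && rn >= 0 && rn < 4 && rm >= 0 && rm < 4);
    uint64_t wide = (uint64_t)s->r[rn] + s->r[rm];
    s->r[rd] = (uint32_t)wide;
    s->carry = (unsigned)(wide >> 32);
}

/* Interpretive style: r2 = r0 + r1; r3 = r2 + r1. Carry is written twice. */
static void run_interpreted(State *s)
{
    adds(s, 2, 0, 1);
    adds(s, 3, 2, 1);
}

/* Compiled style: the first carry write is overwritten before any use,
 * so it is dropped; only the last one is kept. */
static void run_compiled(State *s)
{
    s->r[2] = s->r[0] + s->r[1];                 /* no carry update needed */
    uint64_t wide = (uint64_t)s->r[2] + s->r[1];
    s->r[3] = (uint32_t)wide;
    s->carry = (unsigned)(wide >> 32);
}
```

Both functions leave the simulated state identical; the compiled form simply does less work per simulated instruction.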
Static Compiled ISS
Disadvantage
Cannot support dynamic program code
  • All target instructions must be compiled statically, before simulation
    – Self-modifying code
    – External code (loading)
    – Dynamically linked libraries
    – Multiple instruction sets (ARM/Thumb)
Enormous compile time overhead for the software designer
Indirect branch instructions
  • Hard for the C compiler to optimize the code
  • Performance (simulation speed) drops
Large applications
  • Enormous memory usage
  • The generated binary is much larger than the original binary
Low locality of the binary
  • Basic blocks are larger on the same scale
Dynamic Compiled ISS
Dynamic compilation
  Moves the compilation step into simulation run-time
  Uses binary translation
    • Cannot use intermediate C code
    • Compiling chunks of C code at run-time is problematic
  Uses a translation cache
Relatively slow
  Run-time compilation (binary translation) overhead
Flexible and relatively accurate
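[Figure: dynamic compiled ISS flow. Target executable -> instruction fetch -> cache hit? Yes: execution from the translation cache. No: binary translation fills the translation cache, then execution.]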
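The translation cache lookup can be sketched as follows. This is a minimal sketch under stated assumptions: a direct-mapped cache indexed by target PC, host function pointers standing in for translated code, and a made-up `translate` stub. None of these names or sizes come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A host function pointer stands in for a block of translated host code. */
typedef void (*HostBlock)(int32_t *regs);

#define TC_SIZE 64u

typedef struct {
    uint32_t  tag[TC_SIZE];   /* target PC that owns each slot */
    HostBlock code[TC_SIZE];  /* translated code, NULL if the slot is empty */
    unsigned  hits, misses;
} TransCache;

/* Hypothetical translator: a real one would emit host code for the target
 * block at pc. Here it just picks one of two prebuilt handlers. */
static void block_inc(int32_t *regs) { regs[0] += 1; }
static void block_dbl(int32_t *regs) { regs[0] *= 2; }
static HostBlock translate(uint32_t pc)
{
    return (pc & 4u) ? block_dbl : block_inc;
}

/* Cache hit: reuse the translated code. Miss: translate, store, then reuse. */
static HostBlock tc_lookup(TransCache *tc, uint32_t pc)
{
    assert(tc != NULL);
    uint32_t slot = (pc >> 2) % TC_SIZE;
    if (tc->code[slot] != NULL && tc->tag[slot] == pc) {
        tc->hits++;
        return tc->code[slot];
    }
    tc->misses++;                   /* run-time translation overhead is paid here */
    tc->tag[slot]  = pc;
    tc->code[slot] = translate(pc);
    return tc->code[slot];
}
```

The translation cost is paid only on a miss, so code that is executed repeatedly amortizes it, while code that runs once pays the full run-time compilation overhead.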
ISS
Accuracy: Static compiled ISS = Interpretive ISS > Dynamic compiled ISS > Native code execution
Simulation speed: Native code execution > Static compiled ISS > Dynamic compiled ISS > Interpretive ISS
Simplicity: Native code execution > Interpretive ISS > Static compiled ISS > Dynamic compiled ISS
Compilation speed: Native code execution = Interpretive ISS = Dynamic compiled ISS > Static compiled ISS
Flexibility: Interpretive ISS = Dynamic compiled ISS > Native code execution > Static compiled ISS
Objective
How to reduce the compile time (startup cost) of the static compiled ISS?
How to increase the flexibility of the static compiled ISS?
Improvement of Compiled-ISS
New approach
  Using object files (relocatable format, ELF) as input instead of the binary executable file
  Making the generated simulation program have the same data and control flow as the target program
  Giving the static compiled ISS a built-in interpreter
Advantages
  Reducing the compile time and recompile time
  Increasing flexibility
    • Supporting indirect branch efficiently
    • Supporting dynamic code
  Fast speed
    • Taking all the advantages of the static compiled ISS
ISS Generation Flow
[Figure: ISS generation flow. Target source files (Source 1, 2, 3) -> cross compile (compile, excluding link) -> relocatable files and library files (Object 1, 2, 3) -> simulation code generation (C code generation) -> target simulation C code (C 1, 2, 3) -> host compile -> host executable ISS (simulator).]
Relocatable file: the output after the COMPILE step and before the LINK step of C compilation.
The target simulation C code has the same structure as the target source files.
C Code Generation
[Figure: simulation code generation for Object 1. The ELF loader extracts the symbol and text sections. In code analysis, the decoder produces decoded data (decoded info.) and the code analyzer produces the CFG and DFG. C code generation then emits Simulation C Code 1.]
ELF (Executable and Linkable Format) is the most widely used file format for object, executable, and library files.
The simulation C code has a structure identical to the target source file.
The CFG and DFG are extracted using the symbol table and the decoded information.
Generating Constructed C Code

(A) Target C Source

    int result;
    void cfunction( int number ){
        if( number >= 5 )
            result = number - 5;
    }

(B) Object File (Disassembled)

    ...
    14:[e51b3010] ldr r3,[r11,#0x10]
    18:[e3530004] cmps r3,#0x4
    1c:[da000003] ble #0xc
    20:[e59f300c] ldr r3,#0xc
    24:[e51b2010] ldr r2,[r11,#0x10]
    28:[e2422005] sub r2,r2,#0x5
    2c:[e5832000] str r2,[r3,#0x0]
    ...

(C) Simulation C Code

    int T_result;
    void T_cfunction()
    {
        ...
        LDType=W;Rd=3;Rn=11;LDDir=PRE_DOWN;Imm=0x10;LDWBack=0; LDR_L_I();
        Rn=3;Imm=0x4;SType=SHT_LSL;SAmt=0x0; CMP_I();
        WR_COND();
        Imm=0xc;Cond=0x000d; B();
        if( conpass ) goto T___newsym_30;
        LDType=W;Rd=3;Rn=15;LDDir=PRE_UP;Imm=0xc;LDWBack=0; LDR_L_I();
        R[3] = &T_result;
        LDType=W;Rd=2;Rn=11;LDDir=PRE_DOWN;Imm=0x10;LDWBack=0; LDR_L_I();
        Rd=2;Rn=2;Imm=0x5;SType=SHT_LSL;SAmt=0x0; SUB_I();
        LDType=W;Rd=2;Rn=3;LDDir=PRE_UP;Imm=0x0;LDWBack=0; STR_L_I();
        *(R[3]+0) = R[2];
    T___newsym_30:
        ...
    }
Reducing Compile Time
Previous static compiled ISS
  The simulation C code is one large function that contains all of the simulation code generated from the target binary.
    • As the function size increases, the C compilation time increases even more because of host compiler optimization.
  Even a slight change to the source code triggers the whole time-consuming compilation process.
How to reduce compile time
  The simulation C code is composed of many small functions.
    • The generated C file has the same structure as the target C code
      – The same CFG/DFG
    • This speeds up compiler optimization and reduces compile time
  Selective compilation
    • Compiling only the files that have changed
      – Using the “make” utility
    • This speeds up regeneration of the ISS
Reducing Compile Time
[Figure: previous vs. new simulator generation process for a target program consisting of A.c, B.c, C.c, D.c.]
Previous simulator generation process: A.c, B.c, C.c, D.c -> compile -> A.o, B.o, C.o, D.o -> link -> Appl.exe -> code gen -> Gen.c (generated C code) -> compile -> Gen.o -> link -> simulator
  Compilation is time-consuming because it handles a single large function.
  Every recompilation has to repeat all of the time-consuming compilation steps.
New simulator generation process: A.c, B.c, C.c, D.c -> compile -> A.o, B.o, C.o, D.o -> code gen -> A_G.c, B_G.c, C_G.c, D_G.c (generated C code) -> compile -> A_G.o, B_G.o, C_G.o, D_G.o -> link -> simulator
  Compilation is fast because it handles many small C files.
  Only the modified files go through these steps.
Supporting Indirect Branch
The branch target address cannot be determined at compile time
Previous static compiled ISS
  • To support indirect branches, a label has to be inserted at the start of every instruction's simulation code in the simulation C file.
  • Those labels split basic blocks apart.
  • This interferes with compiler optimization.
Runtime branch target search
  • In normal C usage, every possible branch target address in the target C code has a symbol (label).
  • Since the simulation C code has the same CFG as the target code, it also has the corresponding symbols.
  • I made a so-called “Dynamic Branch Handler”, which finds the destination symbol (label) and jumps to that address.
  • Indirect branches can be handled without adding labels.
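A handler of this kind could be sketched as a sorted symbol table searched at run time. This is an assumption-laden illustration, not the author's implementation: the table layout, the addresses, the `T_func_a`/`T_func_b` stand-ins and the signature of `DynamicBH` are all invented here (only the name `DynamicBH` appears in the slides).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*SimFunc)(void);   /* a generated simulation function */

typedef struct {
    uint32_t addr;  /* target address of the symbol */
    SimFunc  fn;    /* simulation code generated for that symbol */
} SymEntry;

static int calls_a, calls_b;
static void T_func_a(void) { calls_a++; }  /* stand-ins for generated code */
static void T_func_b(void) { calls_b++; }

/* Sorted by target address; the addresses are made up for this sketch. */
static const SymEntry sym_table[] = {
    { 0x00008000u, T_func_a },
    { 0x00008040u, T_func_b },
};

static int cmp_addr(const void *key, const void *elem)
{
    uint32_t a = *(const uint32_t *)key;
    uint32_t b = ((const SymEntry *)elem)->addr;
    return (a > b) - (a < b);
}

/* Finds the simulation function whose symbol sits at target_addr,
 * or returns NULL if no symbol exists there. */
static SimFunc DynamicBH(uint32_t target_addr)
{
    const SymEntry *e = bsearch(&target_addr, sym_table,
                                sizeof sym_table / sizeof sym_table[0],
                                sizeof sym_table[0], cmp_addr);
    return e ? e->fn : NULL;
}

/* What the generated code for an indirect branch would do with the
 * register value holding the destination address. */
static void indirect_branch(uint32_t target_addr)
{
    SimFunc target = DynamicBH(target_addr);
    assert(target != NULL);
    (*target)();
}
```

The lookup cost is paid only at indirect branches, so the rest of the generated code needs no per-instruction labels and stays open to host compiler optimization.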
Supporting Indirect Branch

Generated C Code of Previous Compiled-ISS

    Target_code(){
    Label_250: ...
    Label_700: ...
    Label_1000: Inst.1 simulation code
    Label_1001: Inst.2 simulation code
    Label_1002: Inst.3 simulation code
    Label_1003: Inst.4 simulation code
    Label_1004: // Bx R1
        goto addr2(R1)
    ...
    }

  The ISS does not know at compile time which address will be the destination.
  Unnecessary labels interfere with optimization and make simulation slow.

Generated C Code of New Approach

    Function_A(){...}
    Function_B(){...}
    Function_C(){
        Inst.1 simulation code
        Inst.2 simulation code
        Inst.3 simulation code
        Inst.4 simulation code
        // Bx R1
        target = DynamicBH( R1 )
        (*target)()
        ...
    }

  The ISS knows the possible branch destination addresses, where there should be a label.
  No redundant labels, so the host C compiler can optimize better.
Supporting Dynamic Code
Static compiled ISS
  Cannot support run-time changes to the executed code.
    • Self-modifying code
    • External memory code
    • Downloaded code
Dynamic Code Handler
  Built-in interpreter
    • Handles the dynamic code.
    • Fetch, Decode, Dispatch Cache
  Target processor resources are shared between the two ISSs.
  It is necessary to check whether the binary about to be executed has been modified.
  Code executed by the interpreter runs without the speedup of the compiled ISS.
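The check that routes execution between the compiled code and the built-in interpreter might look like the sketch below. The TEXT base and size, the one-flag-per-word table, and the names `note_store` and `select_path` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TEXT_BASE  0x8000u  /* assumed start of the statically compiled TEXT */
#define TEXT_WORDS 16u      /* assumed size, in 32-bit instruction words */

typedef struct {
    bool modified[TEXT_WORDS];  /* self-modifying code table, one flag per word */
    unsigned compiled_runs;
    unsigned interpreted_runs;
} DynCodeState;

typedef enum { RUN_COMPILED, RUN_INTERPRETED } Path;

static bool in_text(uint32_t addr)
{
    return addr >= TEXT_BASE && addr < TEXT_BASE + 4u * TEXT_WORDS;
}

/* Called for every simulated store: a store into TEXT marks that word
 * as modified, so its compiled simulation code is no longer valid. */
static void note_store(DynCodeState *s, uint32_t addr)
{
    assert(s != NULL);
    if (in_text(addr))
        s->modified[(addr - TEXT_BASE) / 4u] = true;
}

/* Decides how the instruction at next_pc is simulated: compiled code for
 * unmodified TEXT, the built-in interpreter for everything else. */
static Path select_path(DynCodeState *s, uint32_t next_pc)
{
    assert(s != NULL);
    if (in_text(next_pc) && !s->modified[(next_pc - TEXT_BASE) / 4u]) {
        s->compiled_runs++;
        return RUN_COMPILED;
    }
    s->interpreted_runs++;      /* external or self-modified code */
    return RUN_INTERPRETED;
}
```

Because both paths operate on the same simulated registers and memory, control can switch between them at any instruction boundary; only the code that falls to the interpreter loses the compiled speedup.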
Supporting Dynamic Code
[Figure: simulation flow and data access with the Dynamic Code Handler. Blocks: Compiled ISS (Next PC, "TEXT range?", "Modified code?", Execute instruction simulation code), Dynamic Code Handler with built-in interpreter (Decode, Dispatch Cache, Dispatch Cache Manager, "Modified?", Execute), Self-Modifying Code Table (fed by "Store to TEXT", Addr/Data), External Code, and the shared Target Processor Resource.]
External code: the next instruction was not compiled at static time.
Self-modifying code: detected through stores to TEXT and the self-modifying code table.
The compiled ISS shares the target processor resource data with the built-in interpreter.
The speedup of the compiled ISS is lost only for the dynamic code.
Experimental Result
Performance of Compiled ISS
Platform
  • CPU: Intel Xeon CPU 2GHz, 512K Cache
  • OS: Linux Redhat 7.2
Target Processor
  • ARM 7
Target Application
  • IDCT
  • Matrix multiply
  • FIR
  • JPEG Decoder
  • MP3 Decoder
Simulation Speed
Benchmark          OBSIM         Commercial ISS   GNU (GDB) ISS   Native execution   Executed
(target program)   (sec.)        (sec.)           (sec.)          (sec.)             instruction count
FIR                65.8 (x44)    225.5 (x150)     217.8 (x145)    1.51 (x1)          1,812 M
IDCT               31.2 (x38)    143 (x176)       137 (x169)      0.81 (x1)          1,140 M
Matrix Multiply    33.9 (x35)    198.3 (x205)     179.1 (x185)    0.97 (x1)          1,601 M
Commercial ISS and GNU (GDB) ISS are interpretive ISSs. xN is the run time relative to native execution.
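The measured times can be turned into simulated throughput with simple arithmetic. The helper names below are hypothetical; the inputs are taken from the row with 1,812 M executed instructions (65.8 sec. for OBSIM, 225.5 sec. for the commercial ISS).

```c
#include <assert.h>

/* Simulated throughput in MIPS: millions of target instructions
 * executed per second of host time. */
static double simulated_mips(double executed_millions, double seconds)
{
    assert(seconds > 0.0);
    return executed_millions / seconds;
}

/* How many times faster one simulator is than another on the same run. */
static double speedup(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}
```

For that row, `simulated_mips(1812.0, 65.8)` gives roughly 27.5 MIPS for OBSIM and `simulated_mips(1812.0, 225.5)` roughly 8.0 MIPS for the commercial ISS, a speedup of about 3.4.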
Compile Time
Benchmark          Source                        Existing method                 OBSIM
(target program)                                 Total compile   Recompile       Total compile   Recompile
                                                 time (sec.)     time (sec.)     time (sec.)     time (sec.)
MP3 Decoder        18 C files, 199,220 lines     78.6            76.9            65.2            9.0
JPEG Decoder       12 C files, 137,875 lines     52.3            50.7            43.9            7.2
Summary
New approach
  Keeping the speed of the static compiled ISS
  Reducing the compile time
  Increasing the flexibility
    • Supporting indirect branch without speed loss
    • Supporting dynamic code
Practical use
  Co-simulation for embedded system exploration
    • Fast simulation speed
    • Fast compilation/recompilation speed
    • Easy to estimate performance
    • Powerful semihosting features