High Performance Embedded Computing
© 2007 Elsevier
Lecture 8: Embedded Processor Issues
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
© 2006 Elsevier
Topics
Bus encoding. Security-oriented architectures. CPU simulation. Configurable processors.
Bus encoding
Encode information on bus to reduce toggles and dynamic energy consumption.
Count energy consumption by toggle counts.
Bus encoding is invisible to rest of architecture.
Some schemes transmit side information about encoding.
[Figure: CPU and memory (mem) connected through an encoder (enc) and decoder (dec) by an encoded bus plus side information.]
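As a minimal sketch of counting energy by toggle counts, the function below counts bit transitions between successive bus words. The name `count_toggles`, the 8-bit default width, and the assumption that every bus line starts at 0 are illustrative choices, not taken from the slides.

```python
def count_toggles(words, width=8):
    """Count bit transitions between successive words on a `width`-line bus."""
    mask = (1 << width) - 1
    toggles = 0
    prev = 0  # assumption: every bus line starts at 0
    for w in words:
        w &= mask
        # XOR marks the lines that change; count the 1 bits.
        toggles += bin(prev ^ w).count("1")
        prev = w
    return toggles


alternating = count_toggles([0x00, 0xFF, 0x00, 0xFF])  # worst case: all lines flip
counting_up = count_toggles([0x00, 0x01, 0x03, 0x07])  # one line changes per word
print(alternating, counting_up)
```

Under this model, a sequence with fewer toggles is taken to cost less dynamic energy on the bus.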
Bus-invert coding
Stan and Burleson: take advantage of correlation between successive bus values.
Choose sending true or complement form of bus values to minimize toggles.
  Why might this approach work well?
Can break bus into fields and apply bus-invert coding to each field.
  How might the bus be divided?
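The choice between true and complement form can be sketched as follows. This is a simplified illustration, assuming one extra invert line as side information, a bus that starts at 0, and a decision based only on the data lines; the function names are hypothetical.

```python
def popcount(x):
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_bit) pairs for each word."""
    mask = (1 << width) - 1
    prev = 0  # assumption: bus lines start at 0
    encoded = []
    for w in words:
        w &= mask
        # Send the complement when more than half of the lines would toggle.
        if popcount(prev ^ w) > width // 2:
            sent, inv = ~w & mask, 1
        else:
            sent, inv = w, 0
        encoded.append((sent, inv))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width=8):
    mask = (1 << width) - 1
    return [(~s & mask) if inv else s for s, inv in encoded]


def total_toggles(encoded):
    """Toggles on the data lines plus toggles on the invert line."""
    prev_s, prev_i, total = 0, 0, 0
    for s, inv in encoded:
        total += popcount(prev_s ^ s) + (prev_i ^ inv)
        prev_s, prev_i = s, inv
    return total


enc = bus_invert_encode([0x00, 0xFF, 0x00, 0xFF])
print(enc, total_toggles(enc), bus_invert_decode(enc))
```

For this alternating pattern the data lines never change; only the invert line toggles. Applying the same function to separate fields of a wider bus would require one invert line per field.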
Working zone encoding
Musoll et al.: Used to encode address buses.
Uses the observation that the majority of the execution time for a program is spent in a small range of addresses.
Divides addresses into sets called working zones.
An address in a working zone is sent as an offset from the base in a one-hot code.
  Why is a one-hot code used?
Addresses that are not in a working zone have the entire value sent.
Compared to bus-invert coding, what would you expect to be the advantages and disadvantages of this approach?
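A heavily simplified sketch of the hit/miss decision is shown below. It assumes a fixed list of zone base addresses and a fixed zone size; the published scheme manages its zones dynamically, which is not modeled here. The names `wz_encode` and `wz_decode` and the message tuples are hypothetical.

```python
def wz_encode(addr, zone_bases, zone_size=8):
    """Return ('hit', zone_index, one_hot_offset) or ('miss', full_address)."""
    for i, base in enumerate(zone_bases):
        offset = addr - base
        if 0 <= offset < zone_size:
            # One-hot: exactly one of zone_size lines is set.
            return ("hit", i, 1 << offset)
    return ("miss", addr)


def wz_decode(msg, zone_bases):
    if msg[0] == "hit":
        _, i, one_hot = msg
        # Position of the single set bit recovers the offset.
        return zone_bases[i] + one_hot.bit_length() - 1
    return msg[1]


bases = [0x1000, 0x8000]  # assumed working-zone bases
for a in (0x1003, 0x8000, 0x4000):
    m = wz_encode(a, bases)
    print(hex(a), m, hex(wz_decode(m, bases)))
```

With a one-hot offset, moving between two addresses in the same zone changes at most two offset lines, which is one way to think about the question posed above.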
Address bus encoding
Benini et al.: cluster correlated address bits and then encode clusters.
Compute correlation coefficients of transition variables to determine clusters:
Need to ensure clusters don’t become too large, since this can increase encode/decode logic.
Use logic synthesis to design encoders and decoders for each cluster
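The slide's formula for the correlation coefficient did not survive in this transcript, so the sketch below uses the ordinary Pearson coefficient as a stand-in; that choice is an assumption, not the authors' definition. A transition variable for bit i is taken to be 1 when bit i toggles between consecutive addresses.

```python
def transition_vars(addresses, width):
    """t[i][k] is 1 if bit i toggles between addresses k and k+1."""
    pairs = list(zip(addresses, addresses[1:]))
    return [[((a ^ b) >> i) & 1 for a, b in pairs] for i in range(width)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


t = transition_vars(list(range(8)), width=3)  # sequential addresses 0..7
print(pearson(t[1], t[2]))  # bits 1 and 2 tend to toggle together
```

Pairs of bits with a high coefficient would be candidates for the same cluster; how clusters are actually formed and size-limited is left to the cited work.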
Benini et al. results
[Ben98] © 1998 IEEE
What important tradeoffs of the address encoding technique are not shown in the table below?
Dictionary-based encoding
Takes advantage of the observation that many values are repeated on buses.
Divides bus into three parts:
Only the upper bits of the bus are stored in the dictionary and used to match dictionary values that are indexed by the index part.
When the upper bits match, they are put in a high-Z state and the remaining bits are sent; otherwise all bits are sent.
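The match-or-send-all behavior can be sketched as below. The field widths (16 upper bits, 4 index bits, 12 low bits), the class name `DictBusCodec`, and the explicit "hit"/"miss" tag standing in for side information are all assumptions made for illustration. Encoder and decoder each keep their own copy of the dictionary and update it identically.

```python
class DictBusCodec:
    """One end of the bus; encoder and decoder hold mirrored dictionaries."""

    def __init__(self, index_bits=4, low_bits=12):
        self.index_bits = index_bits
        self.low_bits = low_bits
        self.table = [None] * (1 << index_bits)  # stores upper bits only

    def split(self, value):
        low = value & ((1 << self.low_bits) - 1)
        index = (value >> self.low_bits) & ((1 << self.index_bits) - 1)
        upper = value >> (self.low_bits + self.index_bits)
        return upper, index, low

    def encode(self, value):
        upper, index, low = self.split(value)
        if self.table[index] == upper:
            return ("hit", index, low)       # upper lines would be left high-Z
        self.table[index] = upper            # remember for next time
        return ("miss", upper, index, low)   # all bits are sent

    def decode(self, msg):
        if msg[0] == "hit":
            _, index, low = msg
            upper = self.table[index]
        else:
            _, upper, index, low = msg
            self.table[index] = upper
        shift = self.low_bits + self.index_bits
        return (upper << shift) | (index << self.low_bits) | low


enc, dec = DictBusCodec(), DictBusCodec()
values = [0x12345678, 0x12345ABC, 0xDEAD5678]
msgs = [enc.encode(v) for v in values]
print([m[0] for m in msgs], [hex(dec.decode(m)) for m in msgs])
```

The second value shares its upper and index bits with the first, so only the index and low bits need to be driven.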
Lv et al. dictionary-based architecture
[Lv03] © 2003 IEEE
Lv et al. energy savings
[Lv03] © 2003 IEEE
Security-oriented architectures
There are a variety of security attacks:
  Typical desktop/server attacks, such as Trojan horses and viruses.
  Physical access allows side channel attacks.
Cryptographic instruction sets have been developed for several architectures.
Embedded systems architecture must add protection for side effects and consider energy consumption.
Secure architectures
SmartMIPS and ARM SecureCore offer security extensions.
  Include encryption instructions, specialized memory management units, etc.
SAFE-OPS
  Designed to protect against software modification.
  Compiler embeds a watermark into code based on register assignment.
  FPGA accelerator checks the validity of the watermark during execution.
Power attacks
Kocher et al.: Adversary can observe power consumption at pins and deduce data, instructions within CPU.
Yang et al.: Dynamic voltage/frequency scaling (DVFS) can be used as a countermeasure.
[Yan05] © 2005 ACM Press
CPU simulation
Performance vs. energy/power simulation.
Temporal accuracy.
Trace vs. execution.
Simulation vs. direct execution.
Simulate using appropriate benchmarks for embedded systems.
  Don’t use SPEC CPU Benchmarks!
  Embedded benchmarks include EEMBC, MediaBench, MiBench.
  Benchmarks often should be domain-specific.
Trace-based analysis
Instrumentation generates side information.
PC-sampling checks PC value during execution.
Can measure control flow, memory accesses.
Program counter (PC) sampling
Example: Unix prof.
Interrupts are used to sample PC periodically.
Must run on the platform.
Doesn’t provide complete trace.
Subject to sampling problems: undersampling, periodicity problems.
Generates a call-graph report that indicates the percentage of execution time spent in each function.
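The periodicity problem can be illustrated with a toy model in which the "program" is a list holding the PC value at each time tick and sampling means taking every `period`-th entry. The function name `sample_pcs` and the four-address loop are made up for the example; a real profiler samples through timer interrupts.

```python
from collections import Counter


def sample_pcs(trace, period, offset=0):
    """Estimate the fraction of time per PC from every `period`-th tick."""
    samples = trace[offset::period]
    counts = Counter(samples)
    return {pc: n / len(samples) for pc, n in counts.items()}


# A loop over four addresses, run 100 times: each PC truly gets 25% of the time.
loop_trace = [0x100, 0x104, 0x108, 0x10C] * 100

aliased = sample_pcs(loop_trace, period=4)   # sampling period equals loop length
spread = sample_pcs(loop_trace, period=3)    # period not locked to the loop
print(aliased)
print(spread)
```

When the sampling period matches the loop length, every sample lands on the same address and the profile attributes all time to it; a period that is not locked to the loop gives estimates close to the true 25% each.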
Program instrumentation
Example: dinero.
Modify the program to write trace information.
  Track entry into basic blocks.
Requires editing object files.
Provides complete trace.
Microarchitecture-modeling simulators
Varying levels of detail:
  Instruction scheduler is not cycle-accurate.
  Cycle timers are cycle-accurate.
Can simulate for performance or energy/power.
Typically written in general-purpose programming language (e.g., C), not hardware description language.
Cycle-accurate simulator
Models the microarchitecture.
Simulating one instruction requires executing routines for instruction fetch, decode, execute, etc.
Models pipeline state.
Microarchitectural registers are exposed to the simulator.
[Figure: simulator state showing registers (reg), IR, PC, and I-box.]
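A toy sketch of this structure follows: a three-stage pipeline (fetch, decode, execute) over a made-up two-instruction ISA, with the PC, the instruction register, and the pipeline latch held as simulator state. Everything here (the `CPU` class, the `addi`/`add` opcodes, the absence of hazards and memory) is an assumption chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CPU:
    program: list                       # tuples: (op, dst, a, b)
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0
    ir: Optional[tuple] = None          # fetch/decode latch
    decoded: Optional[tuple] = None     # decode/execute latch
    cycles: int = 0
    retired: int = 0

    def step(self):
        """Advance one clock cycle; stages run back to front so each
        stage sees the latch value written in the previous cycle."""
        if self.decoded is not None:                  # execute stage
            op, dst, a, b = self.decoded
            if op == "addi":
                self.regs[dst] = self.regs[a] + b
            elif op == "add":
                self.regs[dst] = self.regs[a] + self.regs[b]
            self.retired += 1
        self.decoded = self.ir                        # decode stage
        if self.pc < len(self.program):               # fetch stage
            self.ir = self.program[self.pc]
            self.pc += 1
        else:
            self.ir = None
        self.cycles += 1

    def run(self):
        while (self.pc < len(self.program)
               or self.ir is not None or self.decoded is not None):
            self.step()


cpu = CPU([("addi", 0, 0, 5), ("addi", 1, 1, 7), ("add", 2, 0, 1)])
cpu.run()
print(cpu.regs, cpu.cycles, cpu.retired)
```

Three instructions take five cycles here because the pipeline needs two cycles to fill before the first instruction executes, which is timing that a purely functional instruction-level simulator would not report.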
Trace-based vs. execution-based
Trace-based:
  Gather trace first, then generate timing information.
  Basic timing information is simpler to generate.
  Full timing information may require regenerating information from the original execution.
  Requires owning the platform.
Execution-based:
  Simulator fully executes the instruction.
  Requires a more complex simulator.
  Requires explicit knowledge of the microarchitecture, not just instruction execution times.
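Basic trace-based timing can be as simple as summing a per-instruction-class execution time over a previously gathered trace. The latency table and the instruction classes below are invented for illustration; real numbers would come from the processor's documentation or measurement.

```python
def trace_cycles(trace, latency):
    """Estimate cycles for a trace given fixed per-class execution times."""
    return sum(latency[op] for op in trace)


# Hypothetical per-class latencies in cycles.
LATENCY = {"add": 1, "load": 3, "branch": 2}

trace = ["load", "add", "add", "branch"]
print(trace_cycles(trace, LATENCY))
```

This estimate ignores pipeline interactions and cache behavior, which is why an execution-based simulator needs explicit knowledge of the microarchitecture.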
Power simulation
Model capacitance in the processor.
Keep track of activity in the processor.
  Requires full simulation.
Activity determines capacitive charge/discharge, which determines power consumption.
CPU power simulators include:
  SimplePower and Wattch for embedded general-purpose processors.
  Trimaran with EPIC Explorer for embedded VLIW.
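A minimal sketch of the activity-based model is given below, using the standard approximation that each toggle of a node with capacitance C at supply voltage Vdd dissipates 0.5 * C * Vdd^2. The unit names, capacitances, and toggle counts are made-up inputs, and the function names are hypothetical; this is not how any of the simulators named above is implemented.

```python
def dynamic_energy(toggle_counts, capacitance_f, vdd):
    """Energy in joules: sum of 0.5 * C * Vdd^2 per toggle, per unit."""
    return sum(0.5 * capacitance_f[unit] * vdd ** 2 * n
               for unit, n in toggle_counts.items())


def average_power(energy_j, cycles, freq_hz):
    """Average power in watts over `cycles` clock cycles at `freq_hz`."""
    return energy_j * freq_hz / cycles


# Assumed activity gathered from a simulation run, and assumed capacitances.
toggles = {"alu": 1000, "regfile": 2000}
caps = {"alu": 2e-12, "regfile": 1e-12}  # farads

energy = dynamic_energy(toggles, caps, vdd=1.0)
print(energy, average_power(energy, cycles=1000, freq_hz=100e6))
```

The toggle counts are what a full simulation has to supply, which is why activity tracking is listed as a requirement.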
Automated CPU design
Customize aspects of CPU for application:
  Instruction set.
  Functional units.
  Memory system (including register files).
  Buses, I/O, and peripherals.
Tools help design and implement custom CPUs.
FPGAs make it easier to implement custom CPUs.
Application-specific instruction processor (ASIP) has custom instruction set.
Configurable processor is generated by a tool set.
Techniques
Architecture optimization tools help choose the instruction set and microarchitecture.
Configuration tools implement the microarchitecture (and perhaps compiler).
Early example: MIMOLA [1984] analyzed programs, created microarchitecture and instructions, synthesized logic.
CPU configuration process
Tensilica configuration options
© 2004 Tensilica
Tensilica EEMBC comparison
© 2004 Tensilica
Tensilica energy consumption by subsystem
© 2006 Tensilica
Toshiba MePcore
LISA language
[Hof01] © 2001 IEEE
LISA descriptions and generation
Memory model includes registers and other memories.
Uses clause binds operations to hardware.
Timing specified by PIPELINE, IN, ACTIVATION, ENTITY.
Generates hierarchical VHDL design.
PEAS-III
Synthesis driven by:
Architectural parameters such as number of pipeline stages.
Declaration of function units.
Instruction format definitions.
Interrupt conditions and timing.
Micro-operations for instructions and interrupts.
Generates both simulation and synthesis models in VHDL.
Instruction set synthesis
Generate instruction set from application program, other requirements.
Sun et al. analyzed design space for simple BYTESWAP() program.
[Sun04] © 2004 IEEE
Complex function definition
Atasu et al. try to combine many operations into an instruction:
  Disjoint operator graphs.
  Multi-output instructions.
Operator graph must be convex: a value cannot leave, then re-enter the instruction.
Textbook discusses several other approaches.
[Ata03] © 2003 ACM Press
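The convexity condition can be checked on a dataflow graph as sketched below: a candidate set of operations is convex if no path leaves the set and later re-enters it. The graph representation (a dict mapping each node to its successors) and the function name `is_convex` are assumptions for this example; it is not the selection algorithm from the cited paper.

```python
def is_convex(succ, subset):
    """succ: dict node -> list of successor nodes in a DAG.
    Return True if no path from `subset` passes through an outside
    node and then re-enters `subset`."""
    subset = set(subset)
    # Start from outside nodes that directly consume a value from the subset.
    stack = [v for u in subset for v in succ.get(u, []) if v not in subset]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for v in succ.get(node, []):
            if v in subset:
                return False  # value left the candidate and came back
            stack.append(v)
    return True


# Diamond-shaped operator graph: a feeds b and c, which both feed d.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_convex(graph, {"a", "b", "d"}))       # c sits between a and d
print(is_convex(graph, {"a", "b", "c", "d"}))
```

Leaving `c` out of the candidate makes it non-convex, because the instruction would need the result of `c` while `c` itself needs the result of `a` from inside the same instruction.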
Limited-precision arithmetic
Fang et al. used affine arithmetic to analyze numerical characteristics of algorithms.
Mahlke et al. synthesize variable bit-width architectures given bit-width requirements.
Cluster operations to find a small number of distinct bit widths.
What advantages and disadvantages might this approach have?
[Mah01] © 2001 IEEE