Hardware/Software Codesign - Computer Systems @ JSIcs.ijs.si/papa/courses/HW-SW-Codesign.pdf ·...

0 - 1

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

0. Organization

doc. dr. Gregor Papa

0 - 2

Overview
Administration
Course synopsis
Introduction and motivation

0 - 3

Organization (1)
Lecture: introductory course + consultations
Exercises: delivered during consultations

Contact: Gregor [email protected]

Web page: http://csd.ijs.si/papa/courses.php

0 - 4

Organization (2)
Course materials:
• slide copies, exercise sheets, papers
• the slides contain material from Marco Platzner, Peter Marwedel, Lothar Thiele, Frank Vahid, Reinhard Wilhelm

References:
• P. Marwedel: Embedded System Design, Springer, 2006.
• F. Vahid, T. Givargis: Embedded System Design: A Unified Hardware/Software Introduction, John Wiley & Sons, 2002.

Exam: written seminar + oral, Slovenian or English

0 - 5

Textbook & slides
The course is based
• on the book and the slides “Embedded System Design” by Peter Marwedel
• on the slides “Hardware/Software Codesign” by Lothar Thiele

0 - 6


0 - 7

Course Synopsis
Different Levels of Model Representation
• Specifications
• Models
• Abstraction Levels
Dealing with Contradictory Constraints
• Exploration
• Simulation (Worst-Case Execution Time)
• Optimization
Hardware/Software Mapping
• Partitioning
• Scheduling
• Allocation
Software Code Optimizations
• Compilation
Estimation

0 - 8

Benefits? Learn about …
… challenges and approaches in modern system design
… useful optimization methods
… performance estimation of embedded systems
… a current research area

0 - 9


0 - 10

What is HW/SW Codesign?
... integrated design of systems that consist of hardware and software components
• Analysis of HW/SW boundaries and interfaces
• Evaluation of design alternatives

0 - 11

Hardware/Software Boundaries
General purpose systems (PC, workstation)
• processor design: processor, compiler, operating system
Embedded systems (cell phone, automotive electronics)
• design of specialized processors: processor, compiler, operating system
• system design: processors, dedicated hardware devices

0 - 12

Target Architectures

0 - 13

Why Codesign? (1)
Modern embedded systems require “design” optimization
• many functions, great variability, high flexibility
• heterogeneous target systems: processors, ASICs, FPGAs, systems-on-chip, …
• many design goals: performance, cost, power consumption, reliability, ...
Advances in formal / automated design methods
• automation on the system level becomes possible
• reduction of cost and time-to-market

0 - 14

Why Codesign? (2)
Optimization of the “design process”
[Figure: classic design vs. co-design]

0 - 15

Codesign methodologies
• Different Levels of Model Representation
• Dealing with Contradictory Constraints
• Hardware/Software Mapping
• Software Code Optimizations
• Estimation

0 - 16

System Design

0 - 17

System Design

0 - 18

Motivation (1)
According to forecasts, the future of IT is characterized by terms such as
• Disappearing computer,
• Ubiquitous computing,
• Pervasive computing,
• Ambient intelligence,
• Post-PC era,
• Cyber-physical systems.
Basic technologies:
• Embedded Systems
• Communication technologies

0 - 19

Motivation (2)
“Information technology (IT) is on the verge of another revolution. … networked systems of embedded computers ... have the potential to change radically the way people interact with their environment by linking together a range of devices and sensors that will allow information to be collected, shared, and processed in unprecedented ways. ... The use … throughout society could well dwarf previous milestones in the information revolution.”
Source: Edward A. Lee, UC Berkeley, ARTEMIS Embedded Systems Conference, Graz, 5/2006

0 - 20

Embedded Systems & Cyber-Physical Systems
“Dortmund” Definition [Peter Marwedel]:
Information processing systems embedded into a larger product
Berkeley [Edward A. Lee]:
Embedded software is software integrated with physical processes. The technical problem is managing time and concurrency in computational systems.
Definition: Cyber-Physical (cy-phy) Systems (CPS) are integrations of computation with physical processes [Edward Lee, 2006].

0 - 21

Embedded Systems and ubiquitous computing
Ubiquitous computing: Information anytime, anywhere. Embedded systems provide fundamental technology.
Communication Technology
• Optical networking
• Network management
• Distributed applications
• Service provision
• UMTS, DECT, Hiperlan, ATM
Embedded Systems
• Robots
• Control systems
• Feature extraction and recognition
• Sensors/actors
• A/D-converters
Pervasive/Ubiquitous computing
• Distributed systems
• Embedded web systems
Real-time, Dependability, Quality of service

0 - 22

Growing importance of embedded systems
• Spending on GPS units exceeded $100 mln during Thanksgiving week, up 237% from 2006 … More people bought GPS units than bought PCs, NPD found. [www.itfacts.biz, Dec. 6th, 2007]
• …, the market for remote home health monitoring is expected to generate $225 mln revenue in 2011, up from less than $70 mln in 2006, according to Parks Associates. [www.itfacts.biz, Sep. 4th, 2007]
• According to IDC the identity and access management (IAM) market in Australia and New Zealand (ANZ) … is expected to increase at a compound annual growth rate (CAGR) of 13.1% to reach $189.3 mln by 2012 [www.itfacts.biz, July 26th, 2008].
• Accessing the Internet via a mobile device up by 82% in the US, by 49% in Europe, from May 2007 to May 2008 [www.itfacts.biz, July 29th, 2008]

0 - 23

Automotive electronics
Functions by embedded processing:
• ABS: Anti-lock braking systems
• ESP: Electronic stability control
• Airbags
• Efficient automatic gearboxes
• Theft prevention with smart keys
• Blind-angle alert systems
• ... etc ...
Multiple networks
• Body, engine, telematics, media, safety
Multiple processors
• Up to 100
• 8-bit – door locks, lights, etc.
• 16-bit – most functions
• 32-bit – engine control, airbags
Processing where the action is
• Sensors and actuators distributed all over the vehicle
• Networked together

0 - 24

Avionics

• Flight control systems,
• anti-collision systems,
• pilot information systems,
• power supply system,
• flap control system,
• entertainment system,
• …
Dependability is of utmost importance.

0 - 25

Railways

Safety features contribute significantly to the total value of trains, and dependability is extremely important

0 - 26

Telecommunication
Mobile phones have been one of the fastest growing markets in recent years.
• Multiprocessor: 8-bit/32-bit for UI, DSP for signals, 32-bit in IR port, 32-bit in Bluetooth
• 8-100 MB of memory
• All custom chips
• Power consumption & battery life depend on software
Base stations
• Massive signal processing
• Several processing tasks per connected mobile phone
• Based on DSPs: standard or custom, 100s of processors
Further applications: Geo-positioning systems, Fast Internet connections, Closed systems for police, ambulances, rescue staff.

0 - 27

Medical systems

For example:
• Artificial eye: several approaches, e.g.:
  • Camera attached to glasses; computer worn at belt; output directly connected to the brain, “pioneering work by William Dobelle”. Previously at [www.dobelle.com]
  • Translation into sound; claiming much better resolution. [http://www.seeingwithsound.com/etumble.htm]

0 - 28

Extremely Large
Functions requiring computers:
• Radar
• Weapons
• Damage control
• Navigation
• basically everything
Computers:
• Large servers
• 1000s of processors

0 - 29

Inside your PC
Custom processors
• Graphics, sound
32-bit processors
• IR, Bluetooth
• Network, WLAN
• Harddisk
• RAID controllers
8-bit processors
• USB
• Keyboard, mouse

0 - 30

Authentication systems

• Fingerprint sensors
• Access control
• Airport security systems
• Smartpen®
• Smart cards
• …

0 - 31

Examples

Consumer electronics

0 - 32

Examples

Industrial automation

0 - 33

Forestry Machines

© Jakob Engblom

Networked computer system
• Controlling arms & tools
• Navigating the forest
• Recording the trees harvested
• Crucial to efficient work
Operator panel
• Graphical display
• Touch panel
• Joystick
• Buttons
• Keyboard
“Tough enough to be out in the woods”

0 - 34

Examples: Smart buildings
Integrated cooling, lighting, room reservation, emergency handling, communication
Goal: “Zero-energy building”

0 - 35

Robotics
“Pipe-climber”
Robot “Johnnie”
Lego mindstorms
• Standard controller: 8-bit processor, 64 kB of memory
• Electronics to interface to motors and sensors

0 - 36

Estimation
Suitability of hardware, software and the system as a whole

1 - 1

Hardware/Software Codesign
Jožef Stefan International Postgraduate School
1. Introduction
doc. dr. Gregor Papa

1 - 2

Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems

1 - 3

Embedded Systems
Embedded systems (ES) = information processing systems embedded into a larger product
Main reason for buying is not information processing
Examples: [images not preserved]

1 - 4

Embedded Systems
[Figure: an embedded system connected to an external process through sensors and actuators, and to the user through a human interface]

1 - 5

Parallel and distributed target platforms
[Figure: networked automotive control units: ACC, ABS, ESP, ASR, engine control, powertrain control]

1 - 6

Example: Cell Processor
The Cell Processor (CBE) combines
• a general-purpose architecture core with
• coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.

1 - 7

Communicating Embedded Systems
• sensor networks (civil engineering, buildings, environmental monitoring, traffic, emergency situations)
• smart products, wearable/ubiquitous computing

1 - 8

Trends in Information and Communication
[Figure: evolution from Centralized Systems via Networked Systems and the Internet to Large-scale Distributed Systems, leading to New Applications and System Paradigms]

1 - 9

Comparison
Embedded Systems
• Few applications that are known at design-time.
• Not programmable by end user.
• Fixed run-time requirements (additional computing power not useful).
• Criteria: cost, power consumption, predictability, meeting time bounds, …
General Purpose Computing
• Broad class of applications.
• Programmable by end user.
• Faster is better.
• Criteria: cost, average speed

1 - 10

Design Challenges
increasing application complexity even in standard and large volume products
• large systems with legacy functions
• mixture of event driven and data flow tasks
• examples: multimedia, automotive, mobile communication
increasing target system complexity
• mixture of different technologies, processor types, and design styles
• large systems-on-a-chip combining components from different sources, distributed system implementations
numerous constraints and design objectives
• examples: cost, power consumption, timing constraints, dependability

1 - 11

Challenges for Embedded Software
• Dynamic environments
• Capture the required behaviour!
• Validate specifications
• Efficient translation of specifications into implementations!
• How can we check that we meet real-time constraints?
• How do we validate embedded real-time software? (large volumes of data, testing may be safety-critical)

1 - 12

Implementation Alternatives
Trade-off: performance and power efficiency vs. flexibility
Application-specific integrated circuits (ASICs)
Application-specific instruction set processors (ASIPs)
• Microcontroller
• DSPs (digital signal processors)
General-purpose processors
Programmable hardware
• FPGA (field-programmable gate arrays)

1 - 13

Dependability
ES must be dependable:
• Reliability R(t) = probability of system working correctly provided that it was working at t=0
• Maintainability M(d) = probability of system working correctly d time units after error occurred
• Availability A(t) = probability of system working at time t
• Safety: no harm to be caused
• Security: confidential and authentic communication
Even perfectly designed systems can fail if the assumptions about the workload and possible errors turn out to be wrong.
Making the system dependable must not be an after-thought; it must be considered from the very beginning.

1 - 14

Efficiency
ES must be efficient:
• Code-size efficient (especially for systems on a chip)
• Run-time efficient
• Weight efficient
• Cost efficient
• Energy efficient

1 - 15

Real-time constraints
Many ES must meet real-time constraints.
• A real-time system must react to stimuli from the controlled object (or the operator) within the time interval dictated by the environment.
• For real-time systems, right answers arriving too late are wrong.
• “A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe” [Kopetz, 1997]. All other time-constraints are called soft.
• A guaranteed system response has to be explained without statistical arguments.

1 - 16

Real-Time Systems
Embedded and Real-Time Synonymous?
• Most embedded systems are real-time
• Most real-time systems are embedded
[Figure: overlapping sets “embedded” and “real-time”]
© Jakob Engblom

1 - 17

Reactive & hybrid systems
Typically, ES are reactive systems. Behavior depends on input and current state:
• automata model appropriate,
• model of computable functions inappropriate.
Hybrid systems (analog + digital parts).

1 - 18

Dedicated systems
• Dedicated towards a certain application: knowledge about behavior at design time can be used to minimize resources and to maximize robustness
• Dedicated user interface (no mouse, keyboard and screen)

1 - 19

Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems

1 - 20

Abstraction, Models and Synthesis
Model:
• Formal description of selected properties of a system or subsystem
• A model consists of data and associated methods
Classification of models:
• Degree of abstraction, granularity: system, architecture, logic, transistor, module, block, function, …
• View: behavior, structural, physical
Synthesis:
• Linking adjacent levels of abstraction (refinement)
• Stepwise adding of structural information

1 - 21

Levels of Abstraction
[Figure: abstraction levels (system, architecture, RTL, process/module, function, …) shown in the behavior and structure views]

1 - 22

Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems

1 - 23

Design approaches
Definition: Synthesis is the process of generating the description of a system in terms of related lower-level components from some high-level description of the expected behavior.
“describe-and-synthesize” paradigm by Gajski, '94
In contrast to the traditional “specify-explore-refine” approach, also known as “design-and-simulate” approach.
Manual design steps are more error-prone than automatic synthesis and, therefore, simulation is more important.

1 - 24

System Design
[Figure: system design flow with the elements Specification, System Synthesis, Estimation, SW-Compilation, HW-Synthesis, Instruction Set, Intellectual Prop. Code, Intellectual Prop. Block, Machine Code, Net lists]

��- ��

�i��o��sso��it��t��


Specification

System Synthesis


Estimation

�nstruction Set



��- �6

��i�ation��i�i�� o��


Specification

System Synthesis


Estimation

�nstruction Set



��- ��

��i�ation��i�i��nst��tion��t��o��sso�


Specification

System Synthesis


Estimation

�nstruction Set



1 - 28

System Design
… is a complex synthesis task:
• software synthesis and code generation
• hardware synthesis
• interface and communication synthesis
• hardware/software partitioning and component selection
• hardware/software scheduling
…:
• application specification
• design space exploration and system optimization
• estimation

��- ��

��a��in��o��m

��- 30

�� a��in��an��in��

Partitioning of system function to programmable components (software), hard-wired or parameterized components (hardware) or application specific instruction set processors.
Similarity to the scheduling and load distribution problem in real-time operating systems:
• time constraints, context switch and context switch overhead, process synchronization and communication
Differences to real-time operating systems:
• larger design space with very different solutions
• high optimization requirements (motivation for hardware design)
• underlying hardware is not fixed

��- 3�

�� a��in��an��in�Similarity to allocation �or load distribution� problem in high-le�el synthesis �or real-time operating systems�

dedicatedHWcomponents

P1

P3

P2

P4

SW(processors)

��- 3�

Estimation
The principle of synthesis based on abstraction only makes sense if there are estimation methods available:
• Estimate properties of the next layer(s) of abstraction.
• Design decisions are based on these estimated properties.
• If the estimation is not correct (or not accurate enough), the design will be sub-optimal or may even not work correctly.
[Figure: design space exploration, with estimation of lower-level properties linking high abstraction to low abstraction]

2 - 1

Hardware/Software Codesign
Jožef Stefan International Postgraduate School
2. Specification and Models of Computation
doc. dr. Gregor Papa

2 - 2

SW-Compilation HW-Synthesis

System DesignSpecification

System Synthesis

Machine Code Net lists

Estimation

Instruction Set

IntellectualProp. Block

IntellectualProp. Code

2 - 3

Consider a simple example
“The Observer pattern defines a one-to-many dependency between a subject object and any number of observer objects so that when the subject object changes state, all its observer objects are notified and updated automatically.”
Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns, Addison-Wesley, 1995

2 - 4

Example: Observer pattern in Java
public void addListener(listener) {…}
public void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
Will this work in a multithreaded context?

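2 - 5

Observer pattern with mutexes
public synchronized void addListener(listener) {…}
public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
Javasoft recommends against this. What's wrong with it?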

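2 - 6

Mutexes using monitors are minefields
public synchronized void addListener(listener) {…}
public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
valueChanged() may attempt to acquire a lock on some other object and stall. If the holder of that lock calls addListener(): deadlock!
[Figure: valueChanged() requests a lock held by another thread, which in turn calls addListener() and waits for the mutex]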

2 - 7

Simple observer pattern gets complicated
public synchronized void addListener(listener) {…}
public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}
While holding the lock, make a copy of the listeners to avoid race conditions.
Notify each listener outside of the synchronized block to avoid deadlock.
This still isn't right. What's wrong with it?

2 - 8

Simple observer pattern: How to make it right?
public synchronized void addListener(listener) {…}
public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}
Suppose two threads call setValue(). One of them will set the value last, leaving that value in the object, but listeners may be notified in the opposite order. The listeners may be alerted to the value-changes in the wrong order!
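One possible way to avoid the ordering problem is to funnel every update and its notifications through a single worker thread, so that values are stored and announced in one total order. The sketch below is only an illustration of that idea, not the solution proposed in the slides; the class `OrderedSubject` and the methods `drainAndGet` and `demo` are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

public class OrderedSubject {
    private final List<IntConsumer> listeners = new CopyOnWriteArrayList<>();
    // Assumption of this sketch: all updates and notifications run on this
    // single thread, in the order in which they were submitted.
    private final ExecutorService notifier = Executors.newSingleThreadExecutor();
    private volatile int value;

    public void addListener(IntConsumer listener) {
        listeners.add(listener);
    }

    public void setValue(int newValue) {
        // The caller never holds a lock while listeners run, and the
        // store and the notifications for one update are not interleaved
        // with those of another update.
        notifier.execute(() -> {
            value = newValue;
            for (IntConsumer l : listeners) {
                l.accept(newValue);
            }
        });
    }

    /** Stops accepting updates, waits for pending notifications, returns the final value. */
    public int drainAndGet() {
        notifier.shutdown();
        try {
            notifier.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return value;
    }

    /** Two threads race on setValue(); the last notification must match the stored value. */
    public static boolean demo() {
        OrderedSubject subject = new OrderedSubject();
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        subject.addListener(seen::add);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) subject.setValue(i); });
        Thread t2 = new Thread(() -> { for (int i = 1000; i < 2000; i++) subject.setValue(i); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        int last = subject.drainAndGet();
        return seen.size() == 2000 && seen.get(seen.size() - 1) == last;
    }

    public static void main(String[] args) {
        System.out.println("last notification matches final value: " + demo());
    }
}
```

The price of this design is that setValue() becomes asynchronous: a caller can no longer assume the listeners have run when the call returns.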

2 - 9

Problems with thread-based concurrency
Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans.
Search for non-thread-based models: which are the requirements for appropriate specification techniques?

2 - 10

Contents
Models of Computation
StateCharts
Data-Flow Models

2 - 11

Requirements for Specification Techniques (1)
Hierarchy: Humans not capable to understand systems containing more than a few objects.
Most actual systems require more objects ⇒ Hierarchy
• Behavioral hierarchy. Examples: states, processes, procedures.
• Structural hierarchy. Examples: processors, racks, printed circuit boards

2 - 12

Requirements for Specification Techniques (2)
• Timing behavior.
• State-oriented behavior: Required for reactive systems.
• Dataflow-oriented behavior: Components send streams of data to each other.
• No obstacles for efficient implementation

2 - 13

Models of Computation: Definition
What does it mean, “to compute”?
Models of computation define:
• Components and an execution model for computations for each component
• Communication model for exchange of information between components:
  • Shared memory
  • Message passing
[Figure: components C-1 and C-2 exchanging information]

2 - 14

Shared memory
Potential race conditions (inconsistent results possible)
⇒ Critical sections = sections at which exclusive access to resource r (e.g. shared memory) must be guaranteed.

process a {
    ..
    P(S)  // obtain lock
    ..    // critical section
    V(S)  // release lock
}
process b {
    ..
    P(S)  // obtain lock
    ..    // critical section
    V(S)  // release lock
}

Race-free access to shared memory protected by S possible
This model may be supported by:
• mutual exclusion for critical sections
• cache coherency protocols
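As a minimal sketch of the P(S)/V(S) scheme above, the two processes can be written as Java threads that guard a shared counter with a binary `java.util.concurrent.Semaphore`. The class name `SharedCounter` and the counter itself are assumptions made for illustration.

```java
import java.util.concurrent.Semaphore;

public class SharedCounter {
    private static final Semaphore S = new Semaphore(1); // binary semaphore
    private static int shared;                           // the shared memory

    private static void process(int increments) {
        for (int i = 0; i < increments; i++) {
            S.acquireUninterruptibly(); // P(S): obtain lock
            shared++;                   // critical section
            S.release();                // V(S): release lock
        }
    }

    /** Runs processes a and b concurrently and returns the final counter value. */
    public static int run(int incrementsPerProcess) {
        shared = 0;
        Thread a = new Thread(() -> process(incrementsPerProcess));
        Thread b = new Thread(() -> process(incrementsPerProcess));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(run(100_000)); // 200000: no increment is lost
    }
}
```

Without the acquire/release pair the unsynchronized `shared++` would be a race condition and the result could be smaller than expected.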

2 - 15

Non-blocking/asynchronous message passing
Sender does not have to wait until message has arrived; potential problem: buffer overflow
… send () …
… receive () …

2 - 16

Blocking/synchronous message passing
Sender will wait until receiver has received message
… send () …
… receive () …
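The two communication styles can be contrasted with the standard Java queue classes. This is a rough sketch, assuming a bounded `ArrayBlockingQueue` as the asynchronous buffer and a `SynchronousQueue` as the rendez-vous channel; `MessagePassing`, `asyncSend` and `syncSend` are names invented for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.SynchronousQueue;

public class MessagePassing {
    /**
     * Asynchronous send: the sender never waits. Returns how many of
     * `count` messages fit into a buffer of the given capacity.
     */
    public static int asyncSend(int capacity, int count) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            // offer() returns false instead of blocking: this is the buffer overflow case
            if (buffer.offer(i)) {
                accepted++;
            }
        }
        return accepted;
    }

    /** Synchronous send: put() returns only after the receiver has taken the message. */
    public static int syncSend(int message) {
        SynchronousQueue<Integer> channel = new SynchronousQueue<>();
        int[] received = new int[1];
        Thread receiver = new Thread(() -> {
            try {
                received[0] = channel.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        receiver.start();
        try {
            channel.put(message); // blocks until the hand-over happens
            receiver.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return received[0];
    }

    public static void main(String[] args) {
        System.out.println(asyncSend(2, 5)); // 2: three messages overflowed
        System.out.println(syncSend(42));    // 42
    }
}
```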

2 - 17

Synchronous message passing: CSP
CSP (communicating sequential processes) [Hoare, 1985], rendez-vous-based communication:
Example:
process A
    ..
    var a ...
    a := 3;
    c!a;  -- output
end
process B
    ..
    var b ...
    ...
    c?b;  -- input
end

2 - 18

Components (1)
Discrete event model
[Figure: event queue holding (time, action) entries for the signals a, b, c]
Von Neumann model
Sequential execution, program memory etc.

2 - 19

Components (2)
Finite state machines
Differential equations
$\frac{\partial^2 x}{\partial t^2} = b$

2 - 20

Example Discrete Event: VHDL
VHDL (hardware description language) is commonly used as a design-entry language for digital circuits.

2 - 21

Sensitivity lists in VHDL
Sensitivity lists are a shorthand for a single wait on-statement at the end of the process body:
process (x, y)
begin
    prod <= x and y;
end process;
is equivalent to
process
begin
    prod <= x and y;
    wait on x, y;
end process;

2 - 22

No language meets all language requirements ⇒ using compromises

2 - 23

Contents
Models of Computation
StateCharts
Data-Flow Models

2 - 24

Classical Automata
Internal state Z, input X, output Y, clock
Next state $Z^+$ computed by function $\delta$
Output computed by function $\lambda$
• Moore-automata: $Y = \lambda(Z)$; $Z^+ = \delta(X, Z)$
• Mealy-automata: $Y = \lambda(X, Z)$; $Z^+ = \delta(X, Z)$
Moore- + Mealy automata = finite state machines (FSMs)
[Figure: example state diagram with transitions labelled e=1]
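The difference between the two output functions can be made concrete with a small parity automaton. In this sketch the state Z holds the parity of the inputs seen so far and the next-state function toggles it on every 1 input; the Moore output depends on Z only, the Mealy output on X and Z. The automaton and the class name `ParityAutomata` are assumptions chosen for illustration, not the example drawn on the slide.

```java
import java.util.Arrays;

public class ParityAutomata {
    // Next-state function delta(X, Z): toggle the state on every 1 input.
    static int delta(int x, int z) { return z ^ x; }

    // Moore output lambda(Z): depends on the current state only.
    static int lambdaMoore(int z) { return z; }

    // Mealy output lambda(X, Z): depends on the current input as well.
    static int lambdaMealy(int x, int z) { return z ^ x; }

    public static int[] runMoore(int[] inputs) {
        int z = 0;
        int[] y = new int[inputs.length];
        for (int t = 0; t < inputs.length; t++) {
            y[t] = lambdaMoore(z);   // output reflects inputs up to the previous tick
            z = delta(inputs[t], z); // clock edge: Z := Z+
        }
        return y;
    }

    public static int[] runMealy(int[] inputs) {
        int z = 0;
        int[] y = new int[inputs.length];
        for (int t = 0; t < inputs.length; t++) {
            y[t] = lambdaMealy(inputs[t], z); // output reacts to the current input
            z = delta(inputs[t], z);
        }
        return y;
    }

    public static void main(String[] args) {
        int[] x = {1, 0, 1, 1};
        System.out.println(Arrays.toString(runMoore(x))); // [0, 1, 1, 0]
        System.out.println(Arrays.toString(runMealy(x))); // [1, 1, 0, 1]
    }
}
```

The Moore output lags the Mealy output by one clock tick, which is the usual practical consequence of $Y = \lambda(Z)$ versus $Y = \lambda(X, Z)$.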

2 - 25

StateCharts
Classical automata not useful for complex systems (complex graphs cannot be understood by humans).
⇒ StateCharts [Harel, 1987]

2 - 26

Introducing hierarchy
FSM will be in exactly one of the substates of S if S is active (either in A or in B or ..)

2 - 27

Definitions
• Current states of FSMs are also called active states.
• States which are not composed of other states are called basic states.
• States containing other states are called super-states.
• For each basic state s, the super-states containing s are called ancestor states.
• Super-states S are called OR-super-states if exactly one of the sub-states of S is active whenever S is active.
[Figure: hierarchy showing a superstate, its substates, and the ancestor state of E]

2 - 28

Default State Mechanism
Try to hide internal structure from outside world!
⇒ Default state
Filled circle indicates sub-state entered whenever super-state is entered.
Not a state by itself!

2 - 29

History Mechanism
For input m, S enters the state it was in before S was left (can be A, B, C, D, or E). If S is entered for the very first time, the default mechanism applies.
History and default mechanisms can be used hierarchically.
(behavior different from last slide)

2 - 30

Combining History and Default State Mechanism
[Figure: two equivalent notations with the same meaning]

2 - 31

Concurrency
Convenient ways of describing concurrency are required.
AND-super-states: FSM is in all (immediate) sub-states of a super-state.

2 - 32

Entering and Leaving AND-Super-States

Line-monitoring and key-monitoring are entered and left, when service switch is operated.

incl.

2 - 33

Tree representation of state sets
[Figure: state hierarchy drawn as a tree whose nodes are basic states, OR-super-states and AND-super-states]

2 - 34

Computation of state sets
Computation of state sets by traversing the tree from leaves to root:
• basic states: state set = state
• OR-super-states: state set = union of children
• AND-super-states: state set = Cartesian product of children
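The three rules translate directly into a recursive function over the state tree. The following is a minimal sketch; the class `StateSets`, its `State` type and the small hierarchy in `example()` (an AND-super-state with two OR-children) are hypothetical and do not reproduce the tree from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StateSets {
    enum Kind { BASIC, OR, AND }

    static final class State {
        final String name;
        final Kind kind;
        final List<State> children;

        State(String name, Kind kind, State... children) {
            this.name = name;
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns all configurations; each one lists the basic states that are active together. */
    public static List<List<String>> stateSet(State s) {
        List<List<String>> result = new ArrayList<>();
        switch (s.kind) {
            case BASIC: // state set = the state itself
                result.add(Collections.singletonList(s.name));
                break;
            case OR:    // state set = union of the children's state sets
                for (State child : s.children) {
                    result.addAll(stateSet(child));
                }
                break;
            case AND:   // state set = Cartesian product of the children's state sets
                result.add(new ArrayList<>()); // neutral element of the product
                for (State child : s.children) {
                    List<List<String>> next = new ArrayList<>();
                    for (List<String> prefix : result) {
                        for (List<String> cfg : stateSet(child)) {
                            List<String> combined = new ArrayList<>(prefix);
                            combined.addAll(cfg);
                            next.add(combined);
                        }
                    }
                    result = next;
                }
                break;
        }
        return result;
    }

    /** Hypothetical hierarchy: AND(OR(A, B), OR(C, D, E)). */
    public static State example() {
        State x = new State("X", Kind.OR,
                new State("A", Kind.BASIC), new State("B", Kind.BASIC));
        State y = new State("Y", Kind.OR,
                new State("C", Kind.BASIC), new State("D", Kind.BASIC), new State("E", Kind.BASIC));
        return new State("Root", Kind.AND, x, y);
    }

    public static void main(String[] args) {
        System.out.println(stateSet(example())); // 2 * 3 = 6 configurations
    }
}
```

The product rule is the reason AND-super-states keep diagrams small: two parallel components with 2 and 3 states describe 6 global states without drawing any of them explicitly.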

2 - 35

Types of States
In StateCharts, states are either
• basic states, or
• AND-super-states, or
• OR-super-states

2 - 36

Timers
Since time needs to be modeled in embedded systems, timers need to be modeled.
In StateCharts, special edges can be used for timeouts.
If event a does not happen while the system is in the left state for 20 ms, a timeout will take place.

2 - 37

Using Timers in Answering Machine

2 - 38

Representation of Computations
Besides states, arbitrarily many other variables can be defined. This way, not all states of the system are modeled explicitly.
These variables can be changed as a result of a state transition (“action”). State transitions can be dependent on these variables (“condition”).
[Figure: unstructured state space of variables, linked to the state diagram through conditions and actions]

2 - 39

General Form of Edge Labels
event [condition] / action
Events:
• Exist only until the next evaluation of the model
• Can be either internally or externally generated
Conditions:
• Refer to values of variables that keep their value until they are reassigned
Actions:
• Can either be assignments for variables or creation of events
Example:
• service-off [not in Lproc] / service:=0

2 - 40

Events and actions
Events can be composed of several events:
• (e1 and e2): event that corresponds to the simultaneous occurrence of e1 and e2.
• (e1 or e2): event that corresponds to the occurrence of either e1 or e2 or both.
• (not e): event that corresponds to the absence of event e.
Actions can also be composed:
• (a1; a2): actions a1 and a2 are executed in parallel.
All events, states and actions are globally visible.

2 - 41

Example
[Figure: states x, y, z with edges labelled e/a1 and [c]/a2, and the resulting true/false waveforms of e, a1, a2 and c]

2 - 42

The StateCharts Simulation Phases
How are edge labels evaluated?
Three phases:
1. Effect of external changes on events and conditions is evaluated,
2. The set of transitions to be made in the current step and right hand sides of assignments are computed,
3. Transitions become effective, variables obtain new values.

2 - 43

Example
In phase 2, variables a and b are assigned to temporary variables. In phase 3, these are assigned to a and b. As a result, variables a and b are swapped.
In a single phase environment, executing the left state first would assign the old value of b (=0) to a and b. Executing the right state first would assign the old value of a (=1) to a and b. The execution would be non-deterministic.
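The effect of separating phases 2 and 3 can be replayed in a few lines. This sketch assumes the left state performs a := b and the right state performs b := a, with a = 1 and b = 0 before the step; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class TwoPhaseSwap {
    /** Phase 2 evaluates all right-hand sides on old values; phase 3 commits them. */
    public static int[] threePhaseStep(int a, int b) {
        int nextA = b; // phase 2: right-hand side of a := b
        int nextB = a; // phase 2: right-hand side of b := a
        return new int[] {nextA, nextB}; // phase 3: both variables get their new values
    }

    /** Single phase: each assignment takes effect immediately, so the order matters. */
    public static int[] singlePhaseStep(int a, int b, boolean leftFirst) {
        if (leftFirst) {
            a = b; // left state first: a := b
            b = a; // right state now sees the new a
        } else {
            b = a; // right state first: b := a
            a = b; // left state now sees the new b
        }
        return new int[] {a, b};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(threePhaseStep(1, 0)));         // [0, 1] swapped
        System.out.println(Arrays.toString(singlePhaseStep(1, 0, true)));  // [0, 0]
        System.out.println(Arrays.toString(singlePhaseStep(1, 0, false))); // [1, 1]
    }
}
```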

2 - 44

Steps
Execution of a StateChart model consists of a sequence of (status, step) pairs
Status = values of all variables + set of events + current time
Step = execution of the three phases
[Figure: Status → phase 1 → phase 2 → phase 3]

2 - 45

Reflects Model of Clocked Hardware
In an actual clocked (synchronous) hardware system, both registers would be swapped as well.
Same separation into phases found in other languages as well, especially those that are intended to model hardware.

2 - 46

More on semantics of StateCharts
Unfortunately, there are several time-semantics of StateCharts in use. This is another possibility:
• A step is executed in arbitrarily small time.
• Internal (generated) events exist only within the next step.
• External events can only be detected after a stable state has been reached.
[Figure: time line t with external events, steps (state transitions, transport of internal events) and stable states]

2 - 47

Examples
[Figure: state diagram reaching a stable state]

2 - 48

Example: Non-determinism
[Figure: StateChart with states A to H, all transitions triggered by event a, and the derived state diagram over the state pairs (A,B), (C,D), (E,H), (F,G)]

2 - 49

Example
[Figure: state diagram (only stable states are represented, only a and b are external)]

2 - 50

Evaluation of StateCharts (1)
Pros:
• Hierarchy allows arbitrary nesting of AND- and OR-super states.
• Semantics defined in a follow-up paper to original paper.
• Large number of commercial simulation tools available (StateMate, StateFlow/Matlab, BetterState, UML, ...)
• Available “back-ends” translate StateCharts into C or VHDL, thus enabling software or hardware implementations.

2 - 51

Evaluation of StateCharts (2)
Cons:
• Generated C programs frequently inefficient,
• Not useful for distributed applications,
• No description of non-functional behavior,
• No object-orientation,
• No description of structural hierarchy.

2 - 52

SDL
Specification and Description Language (SDL) is a specification language targeted at the unambiguous specification and description of the behaviour of reactive and distributed systems.
Used here as a (prominent) example of a model of computation based on asynchronous message passing.
Appropriate also for distributed systems.

2 - 53

Communication among SDL-FSMs
Communication between FSMs (or “processes”) is based on message-passing, assuming a potentially indefinitely large FIFO-queue.
• Each process fetches next entry from FIFO,
• checks if input enables transition,
• if yes: transition takes place,
• if no: input is discarded (exception: SAVE-mechanism).

2 - 54

Deterministic?
Let tokens be arriving at FIFO at the same time:
Order in which they are stored is unknown
All orders are legal: simulators can show different behaviors for the same input, all of which are correct.

2 - 55

Contents
Models of Computation
StateCharts
Data-Flow Models

2 - 56

Dataflow Language Model
Processes communicating through FIFO buffers
[Figure: Process 1, Process 2 and Process 3 connected by FIFO buffers]

2 - 57

Philosophy of Dataflow Languages
Drastically different way of looking at computation:
• Imperative language style: program counter is king
• Dataflow language: movement of data is the priority
• Scheduling responsibility of the system, not the programmer
Basic characteristics:
• All processes run “simultaneously”
• Processes can be described with imperative code
• Processes can only communicate through buffers
• Sequence of read tokens is identical to the sequence of written tokens

2 - 58

Dataflow Languages
Appropriate for applications that deal with streams of data:
• Fundamentally concurrent: maps easily to parallel hardware
• Perfect fit for block-diagram specifications (control systems, signal processing)
• Matches well current and future trend towards multimedia applications
Representation:
• Host Language (process description), e.g. C, C++, Java, ...
• Coordination Language (network description), usually “home made”, e.g. XML.

2 - 59

Example: MPEG-2 video decoder

2 - 60

Kahn Process Networks
Proposed by Kahn in 1974 as a general-purpose scheme for parallel programming:
• read: destructive and blocking (reading an empty channel blocks until data is available)
• write: non-blocking
• FIFO: infinite size
Unique attribute: determinate

2 - 61

A Kahn Process
From Kahn's original 1974 paper:
process f(in int u, in int v, out int w)
{
    int i; bool b = true;
    for (;;) {
        i = b ? wait(u) : wait(v);
        printf("%i\n", i);
        send(i, w);
        b = !b;
    }
}
What does this do?
Process alternately reads from u and v, prints the data value, and writes it to w

2 - 62

A Kahn Process
From Kahn's original 1974 paper:
process g(in int u, out int v, out int w)
{
    int i; bool b = true;
    for (;;) {
        i = wait(u);
        if (b) send(i, v); else send(i, w);
        b = !b;
    }
}
Process reads from u and alternately copies it to v and w

2 - 63

A Kahn Process
From Kahn's original 1974 paper:
process h(in int u, out int v, int init)
{
    int i = init;
    send(i, v);
    for (;;) {
        i = wait(u);
        send(i, v);
    }
}
Process sends initial value, then passes through values.

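2 - 64

A Kahn Process Network
What does this do?
Prints an alternating sequence of 0's and 1's.
• h (init = 0): Emits a 0 once and then copies input to output
• h (init = 1): Emits a 1 once and then copies input to output
[Figure: f feeds g; the two outputs of g feed the two h processes; their outputs are the inputs of f]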
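The network of f, g and the two h processes can be simulated with one Java thread per process and unbounded `LinkedBlockingQueue` channels, which give blocking reads and non-blocking writes. This is a sketch under the assumption that g's first output feeds the h with init = 0 and that f reads from that h first; the class `KahnNetwork`, the queue names and the `observed` channel (standing in for printf) are introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class KahnNetwork {
    interface Body { void run() throws InterruptedException; }

    private static Thread spawn(Body body) {
        Thread t = new Thread(() -> {
            try {
                body.run();
            } catch (InterruptedException e) {
                // network was shut down; let the process end
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs the network until f has emitted `count` tokens and returns them. */
    public static List<Integer> run(int count) {
        BlockingQueue<Integer> fToG = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> gToH0 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> gToH1 = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> h0ToF = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> h1ToF = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> observed = new LinkedBlockingQueue<>();
        List<Thread> processes = new ArrayList<>();

        // f: alternately reads both inputs, records the value, forwards it
        processes.add(spawn(() -> {
            boolean b = true;
            for (;;) {
                int i = b ? h0ToF.take() : h1ToF.take();
                observed.put(i);
                fToG.put(i);
                b = !b;
            }
        }));
        // g: reads its input and alternately copies it to its two outputs
        processes.add(spawn(() -> {
            boolean b = true;
            for (;;) {
                int i = fToG.take();
                if (b) gToH0.put(i); else gToH1.put(i);
                b = !b;
            }
        }));
        // h with init = 0 and h with init = 1: emit init once, then copy
        processes.add(spawn(() -> { h0ToF.put(0); for (;;) h0ToF.put(gToH0.take()); }));
        processes.add(spawn(() -> { h1ToF.put(1); for (;;) h1ToF.put(gToH1.take()); }));

        List<Integer> result = new ArrayList<>();
        try {
            for (int k = 0; k < count; k++) {
                result.add(observed.take());
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        for (Thread t : processes) {
            t.interrupt();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(6)); // [0, 1, 0, 1, 0, 1]
    }
}
```

Although four threads run with arbitrary interleaving, the observed sequence does not depend on the thread scheduler, because every process only uses blocking reads on a fixed channel.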

2 - 65

Determinacy
Random:
• A system is random if the information known about the system and its inputs is not sufficient to determine its outputs.
Determinate:
• Define the history of a channel to be the sequence of tokens that have been both written and read. A process network is said to be determinate if the histories of all channels depend only on the histories of the input channels.
Importance:
• Functional behavior is independent of timing (scheduling, communication time, execution time of processes).
• Separation of functional properties and timing.

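2 - 66

Determinacy
Kahn process: monotonic mapping of input sequences to output sequences
$F: [x_1, x_2, x_3, \dots] \mapsto [y_1, y_2, y_3, \dots]$

2 - 67

Determinacy
Formal definition:
• sequence (stream): $X = [x_1, x_2, x_3, \dots]$
• prefix ordering: $[x_1] \sqsubseteq [x_1, x_2] \sqsubseteq [x_1, x_2, x_3, \dots]$
• process: $F$ maps tuples of input sequences to tuples of output sequences
• monotonic process: $X \sqsubseteq X' \Rightarrow F(X) \sqsubseteq F(X')$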

2 - 6�

�r��Determini�m� ��determinate�

��y��

Rea�oning��,��y��y��y��y��y,��,��y��,��y��y��

2 - 6�

��in��n��eterminacy��y��

��

��amp�e ��

��y��

2 - ��

��in��n��eterminacy

F

�1��[�,��]

�2��[�]

F��[�,��,��]��1�,��2��

� ��[�],�[�]�� [�,��],�[�]��F�� F��[�,��]� [�,��,��]

F

�1 ��[�]

�2 ��[�]

F��[�,��]� ��1,��2�

2 - ��

�c�e��in��a�n��et��r��

� �

� ��y��

��y��

��

2 - �2

Deman��ri�en��c�e��in��y��y��

� �

� ��y�

��

��y��

��y��

��

2 - ��

Tom Parks' Algorithm (bounded memory)

Start with bounded buffer sizes.
Use any scheduling technique.
If the system runs without deadlock, continue.
If a deadlock occurs because a buffer is full, increase its size and continue.

2 - ��

Fr�m��n�inite�t��Finite��er��i�e��

��y��n,��n ��y��

��y��

��2

2 - �5

Dea��c��am��e��x��y��2��

�

�

��,��1,��1,��1,��

��

�

��,��

��

�

2 - �6

��am��e��Finite��i�e��er��in��

�

�

��1,��1,��1,��

��

�

��1,��1,��1,��1,��1,��1,��

�

�

��

��

��2��1

2 - ��

�ar��rit�m�in��cti�n��1��,��,��,��

� �

� �

��y��

�1

�2

�3

� � � ��1 1 1 � ��2 � 1 1 1�3 � 1 1 �

2 - ��

�ar��rit�m�in��cti�n��y��y�

� �

� �

��y��

�1

�2

�3

� � � � � � ��1 1 1 � � 1 � ��2 � 1 1 1 1 1 ��3 � 1 1 � � � ��

2 - ��

��a��ati�n��a�n��r�ce��et��r��ro�

��y��x��

�on��y��y��x��y��

2 - ��

�ync�r�n��Data��DF��,��y,�1��

�estri�tion ��i�ed n�m�er o� token��

��amp�e��1��

1 1 2 3 2 � � � � 1

��

2 - ��

�DF��c�e��in��c�ed��e ��y�at compi�e time��y�� sta��ish re�ati�e e�e��tion rates �y��y��

�� etermine �eriodi� s�hed��e �y��y��

��

Re��t��x��y��

2 - �2

Balancing Equations

[Figure: SDF graph with actors a, b, c, d and token production/consumption rates on the edges]

$3a - 2b = 0$
$4b - 3d = 0$
$b - 3c = 0$
$2c - a = 0$
$d - 2a = 0$
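The balancing equations can be solved mechanically. The sketch below is an illustration under assumptions: each edge is written as `(src, produced, dst, consumed)`, the edge directions are guessed from the equations (only the rates matter), and the graph is assumed to be connected. It propagates rational firing rates through the graph and scales them to the smallest positive integers.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(edges):
    """Smallest positive integer firing counts, or None if the rates are inconsistent."""
    adj = {}
    for src, produced, dst, consumed in edges:
        # produced * rate[src] == consumed * rate[dst]
        adj.setdefault(src, []).append((dst, Fraction(produced, consumed)))
        adj.setdefault(dst, []).append((src, Fraction(consumed, produced)))
    start = edges[0][0]
    rate = {start: Fraction(1)}
    stack = [start]
    while stack:
        node = stack.pop()
        for other, ratio in adj[node]:
            value = rate[node] * ratio
            if other not in rate:
                rate[other] = value
                stack.append(other)
            elif rate[other] != value:
                return None  # balancing equations have only the zero solution
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {n: int(r * scale) for n, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {n: c // common for n, c in counts.items()}

# Edges matching 3a = 2b, 4b = 3d, b = 3c, 2c = a, d = 2a
EXAMPLE = [("a", 3, "b", 2), ("b", 4, "d", 3), ("b", 1, "c", 3),
           ("c", 2, "a", 1), ("d", 1, "a", 2)]
```

For the example edges this gives a = 2, b = 3, c = 1, d = 4, which satisfies all five equations.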

2 - ��

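Solving the Balancing Equations

Main SDF scheduling theorem (Lee '86):

A connected SDF graph with n actors has a periodic schedule iff its topology matrix M has rank n-1. If M has rank n-1, then there exists a unique smallest integer solution q to M q = 0.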

��amp�e�

2 - ��

Determine��eri��ic��c�e��e�o��i��e �c�ed��e��

��

��

��

…

�y��y��e�i�i�it��

��,��y��

�

�

12

3

2

�

�

3

�1

3

21

�

�- �

�ar��are�ar��are��t�are��e�i�n��t�are��e�i�n

��

��De�i�n��ace��rati�n

��c��r��re��r��a�a

�- 2

�� y��

�y�tem�De�i�n��

�y��y��

��

��

��

��

��

�- �

De�i�n��ace��rati�n

��icati�n �rc�itect�re

�a��in�

��timati�n

�- �

Detai�e��ie��De�i�n��ace��rati�n

m��ti��ecti�e��timi�ati�n

e�a��ati�n

��

��

c�n�tr�ctarc�itect�re

ma�a��icati�n

e�timate�er��rmance

��

��

��

��

�- 5

��am��e��im��e��e�

��

��

��,��y,��

1

2

3

��

,, 21 ��

3 - 6

Example 1: Evolutionary Algorithms for DSE

��

“chromosome” = encoded allocation + binding

design point(implementation)

allocation

binding

individual

decode allocation

decode binding

scheduling

selectionrecombinationmutation

fitness evaluationfitness

user constraints

3 - �

Example 1: Basic Model

1

2

3

4

5

6

7

RISC

HWM1

HWM2

SB

PTP

GP EM GA

Definition: A specification graph is a graph $G_S = (V_S, E_S)$ consisting of a problem graph $G_P$, an architecture graph $G_A$, and edges $E_M$. In particular, $V_S = V_P \cup V_A$ and $E_S = E_P \cup E_A \cup E_M$.

data flow

3 - �

Example 1: Mapping

1

2

3

4

5

6

7

RISC

HWM1

SB

1

0

8

1

20

1

2

α

τ

0

1

21

30

1

21

29

β RISC HWM1

HWM2

sharedbus

PTP bus

3 - �

Example 1: Challenges

Encoding of (allocation + binding)
• simple encoding
  • e.g. one bit per resource, one variable per binding
  • easy to implement
  • many infeasible partitioning solutions
• encoding + repair
  • e.g. simple encoding, modified such that for each v_p ∈ V_P there exists at least one v_a ∈ V_A with β(v_p) = v_a
  • reduces number of infeasible partitioning solutions

Generation of the initial population, mutation

Recombination
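The "encoding + repair" idea can be illustrated as follows. This is a hedged sketch, not the method used in the case study: the dictionary encoding, the `candidates` table (which resources may execute each task) and the repair policy are all assumptions made for illustration.

```python
import random

def repair(allocation, binding, candidates, rng=random):
    """Return a feasible (allocation, binding) pair.

    allocation: resource -> 0/1 (one bit per resource)
    binding:    task -> resource (one variable per binding)
    candidates: task -> list of resources that can execute the task
    """
    allocation = dict(allocation)
    binding = dict(binding)
    for task, options in candidates.items():
        bound = binding.get(task)
        if bound in options and allocation.get(bound):
            continue  # binding already points to an allocated, suitable resource
        allocated = [r for r in options if allocation.get(r)]
        if allocated:
            binding[task] = rng.choice(allocated)   # rebind to an allocated resource
        else:
            binding[task] = options[0]              # allocate a suitable resource
            allocation[options[0]] = 1
    return allocation, binding
```

After repair, every task is bound to an allocated resource, so the decoder never sees a chromosome without a valid binding.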

3 - 10

Example 1: Case Study

3 - 11

Example 1: Case Study

3 - 12

Example 1: Case Study

• frame memory
• dual ported frame memory
• block matching module
• input module
• output module
• Huffman encoder
• DCT/IDCT module
• subtract/add module

3 - 13

Example 1: Solution 1

[Figure: architecture with the components INM, OUTM, FM, RISC2 and SBS]

3 - 14

Example 1: Solution 2

[Figure: architecture with the components INM, OUTM, DPFM, HC, DCTM, BMM, SAM and SBF]

3 - 1�

��am�le �� o�t�are ��t�es�s

C D �� 2 �

A F� 2 � ��

CD DATB

��

��

�ec�s�o�s�

CODE(A)CODE(B)CODE(A)CODE(B)CODE(C)

CALL(A)CALL(B)CALL(A)CALL(B)CALL(C)

FOR 1 TO 2CODE(A)CALL(B)CODE(C)CODE(A)

I��

S��

ABABABCCABABA�

C��

3 - 1�

��am�le �� t�m��at�o� �r�ter�a

2A

�

PROCEDURE AFOR 1 TO 3CALL(A)CODE(B)CODE(B)

��

��

��

��

P��

��

��

��

D��

B

3 - 1�

��am�le �� rade�o��s

D��

P��

��

��

��

��

��

3 - 1�

��am�le �� rade�o�� ur�aces

3 - 1�

��am�le �� lorat�o� �trate��am�le �� lorat�o� �trate��

3 - 2�

��am�le �� a�� rocess �et�or��am�le �� a�� rocess �et�or�

3 - 21

��am�le �� ard�are �rc��tecture

��

3 - 22

��am�le �� esult o� �u�ct�o�al ��mulat�o�n(p)

��p

b(s)

��s

3 - 23

��am�le �� esult o� �lat�orm �e�c�mar�s

P��(�) ��

I��

��(p��)p

��

3 - 2�

��

��

��

��

��

��

��am�le �� ac��o��t�e�e��elo�e ��al�s�s

3 - 2�

��

��2

��am�le �� am�le ��lorat�o� �esult��am�le �� am�le ��lorat�o� �esult

��

��2

�- 1

�ard�are�ard�are//�o�t�are �odes��o�t�are �odes��

��S��I��P��S��

�� stem ��mulat�o�

doc� dr� �re�or �a�a

�- 2

S� �C�� H� �S��

��stem �es��S��

S��S��

M��C�� N��

��

I��S��

I��P��B��

I��P��C��

�- 3

�utl��e

��

D��S��

��S��C

S��H��A��

�- �

��stem a�d �odelA�� I��S��D��T��A��

�- �

�tateT�� T��

��S��

�- �

�tateI�� A��p��s�

��

�- �

��meI��I��

��p��s��

�- �

��e�ts a�d ��screte ��e�t ��stemsA��T��I��I��

��I��

�- �

��screte ��e�t ��stems ��A�D�S��A��D�S��T��D�S��I��D�S��O�� P��

�- 1�

��me�dr��e� �s� ��e�t�dr��e��

�(�)�

��

��

��

��

��

��

��

��2��

��

�- 11

��me�dr��e� �s� ��e�t�dr��e��-��-��

T��T��A��

��

�

�- 12

��me�dr��e� �s� ��e�t�dr��e��-��

S��A��

��

�

��

�

��

�2��

��

��

��

��

�- 13

�utl��e

S��C��

��

��S��C

S��H��A��

�- 1�

��screte��e�t �odel�� a�d ��mulat�o��

��

T�� -��

�- 1�

�om�o�e�ts o� a ��screte��e�t ��mulat�o��

��I��

��T��

��C��C��A��P��

�- 1�

��screte��e�t ��mulat�o� ��e��

I��

��D��

��U��

��t� rout��e

��le��

set ��to ��

u�date stat�st�cal ��ormat�o�

�e�erate s�mulat�o� re�ort

�rocess ��b� call�� subs�stem module�s�� remo�e e�e�t �rom ��

�- 1�

��screte��e�t ��mulat�o�

�� 2 ��

��

A��P��s��n� may “produce” new events.

Problem: Within the same simulation cycle, “cause” and “effect” events share the same time of occurrence.
Solution: The simulator uses a zero-duration virtual time interval, called delta-cycle (δ).

The role of a delta-cycle is to order “simultaneous” events within a simulation cycle, i.e. to identify which event caused another; “causes” and “effects” are separated by delta-cycles.

Simulation cycles may be composed of several delta-cycles (δ).
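The ordering role of delta-cycles can be shown with a small event queue. This is a minimal sketch under assumptions: events are plain names, `handlers` maps an event to the (delay, event) pairs it produces, and the queue is ordered by (time, delta). None of these names come from the slides.

```python
import heapq

def simulate(initial, handlers):
    """Return the trace of processed events as (time, delta, name) tuples."""
    queue = [(t, 0, i, name) for i, (t, name) in enumerate(initial)]
    heapq.heapify(queue)
    seq = len(queue)          # tie-breaker keeps insertion order stable
    trace = []
    while queue:
        time, delta, _, name = heapq.heappop(queue)
        trace.append((time, delta, name))
        for delay, produced in handlers.get(name, []):
            if delay == 0:
                key = (time, delta + 1)   # same time, next delta-cycle
            else:
                key = (time + delay, 0)   # later simulation cycle
            heapq.heappush(queue, (key[0], key[1], seq, produced))
            seq += 1
    return trace
```

An effect produced with zero delay is processed at the same simulation time as its cause, but one delta-cycle later, so causes always precede effects in the trace.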

A C D B C E

4 - 18

Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

4 - 1�

S��te� ��O�e��ie�

4 - ��

��le��

4 - �1

��le��O�

4 - ��

�o�ule�

processes

4 - ��

��o�e��e�

4 - �4

�o�ule�

4 - ��

Process Communication
Processes can directly communicate through signals.

[Figure: a module containing two processes connected by an internal signal; the module has input ports, an I/O port and output ports; processes are sensitive to signals]

4 - ��

��n�e��o��uni��tionSystemC �.� introduces general�purpose primitives�

Channel
• A container for communication and synchronization, e.g. can have state and private data, transport data, transport events.
• A channel implements one or more interfaces.

Interface
• Specifies a set of access methods to the channel.
• Does not implement those methods.

Event
• Flexible, low-level synchronization primitive, used to construct other forms of synchronization.
• Has no type and no value.

Other comm. & sync. models can be built based on the above primitives.

4 - ��

��nnel��n��o�t�

4 - 28

Wait and Notify

Wait: halt process execution until event is raised
wait() with arguments => dynamic sensitivity
• wait(sc_event)
• wait(time)
• wait(time_out, sc_event)

Notify: raise an event
notify() with arguments => delayed notification
• my_event.notify(); // notify immediately
• my_event.notify(SC_ZERO_TIME); // notify next delta cycle
• my_event.notify(time); // notify after time

4 - 2�

Simulation Elements: Main Program

[Flowchart]
• Initialization phase: execute all the processes until a blocking point; update signals.
• Delta cycle: compute the set of “ready” processes; if the number of “ready” processes is not zero, execute all the processes until a blocking point and update signals.
• Clock cycle: if no process is ready, advance simulation time and continue.

4 - ��

��a��le �i��le �� Channel

4 - ��

��a��le �i��le �� Channel ��nterfa�e

4 - �2

��a��le �i��le �� Channel

4 - ��

��a��le �i��le Prod��er�Cons��er

4 - �4

��a��le� �ahn Pro�ess Net�or�

4 - ��


4 - ��


The KPN will deadlock unless an initial token is put into the loop:

output1.write(0.0);

4 - �7


4 - �8


4 - ��

SystemC and Models of Computation

4 - 40

Outline

System Classification

Discrete Event Simulation

Example SystemC

Simulation at High Abstraction Levels

4 - 4�

Multiple Levels of Abstraction

(Un-)timed, Functional Level
• Use: model (un-)timed functionality
• Communication: shared variables, messages
• Typical languages: C, Matlab

Transaction Level
• Use: SoC architecture analysis, early SW development, timing estimation
• Communication: method calls to channels
• Typical languages: SystemC

Register Transfer Level / Pin Level
• Use: HW design and verification
• Communication: wires and registers
• Typical languages: Verilog, VHDL

[Figure: stack of levels: Functional, Transaction-Level, Register Transfer Level]

4 - 42

Abstraction Models

Time granularity for communication/computation objects can be classified into 3 basic categories: Un-Timed, Approximate-Timed, Cycle-Timed.
Models B, C, D and E could be classified as Transaction Level Models (TLM).

System Modeling Graph (2003, Dan Gajski and Lukai Cai)
[Figure: models A to F placed in a grid with computation on one axis and communication on the other, each ranging over un-timed, approximate-timed and cycle-timed]

A. "Un-timed functional model"
B. "Timed functional model"
C. "Transaction model"
D. "Cycle-accurate communication model"
E. "Cycle-accurate computation model"
F. "Register transfer model"

4 - 4�

A: "Un-Timed Functional Model"

Computation
• Un-timed behavior

Communication
• Un-timed transfer
• Variables

[Figure: behaviors B1 to B4 communicating through the variables v1, v2 and v3]
B1: v1 = a*a;
B2: v2 = v1 + b*b;
B3: v3 = v1 - b*b;
B4: v4 = v2 + v3; c = sequ(v4);

sequential execution: B1, B2||B3, B4
parallel execution: B2 || B3

4 - 44

��

B: “Timed Functional Model”

Computation (on processing elements - PEs)
• Time annotation (estimate)

Communication
• Message-passing, no protocol implementation
• Un-timed transfer

Mapping
• PEs (architecture) allocation and process-to-PE mapping

[Figure: B1 (v1 = a*a;) on PE1, B2 (v2 = v1 + b*b;) on PE2, B3 (v3 = v1 - b*b;) and B4 (v4 = v2 + v3; c = sequ(v4);) on PE3, connected by message-passing channels; the code carries time estimates, e.g. DELAY() or wait()]

4 - 4�

Example B: Software Code Annotation

INPUT: ANSI C source code
OUTPUT: functionally equivalent C code augmented by execution times

[Flow: specification (ANSI C input); analyze basic blocks and compute delays (using a delay characterization); annotate C code; SW model (C code + execution delay); compile generated C and run natively; performance estimation]

Original code:

v__st_tmp = v__st;
startup(proc);
if (events[proc][0] & 1)
  execute(proc);

Annotated code:

v__st_tmp = v__st;
__DELAY(LI+LI+LI+LI+LI+LI+OPc);
startup(proc);
if (events[proc][0] & 1) {
  __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
  execute(proc);
}
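The effect of the `__DELAY` annotations can be mimicked with a cycle counter. The following is a sketch under assumptions: the per-operation cycle costs in `COST` are hypothetical values, and `DelayModel` and `annotated_step` are illustrative names that do not appear in the slides.

```python
# Hypothetical cycle costs per operation class (not taken from the slides).
COST = {"LI": 1, "LD": 2, "OPi": 1, "OPc": 1, "IF": 2}

class DelayModel:
    """Accumulates estimated execution time while the code runs natively."""

    def __init__(self, cost):
        self.cost = cost
        self.cycles = 0

    def delay(self, *ops):
        # Counterpart of __DELAY(...): add the estimate of one basic block.
        self.cycles += sum(self.cost[op] for op in ops)

def annotated_step(model, event_pending):
    """Mirror of the annotated fragment: one delay call per basic block."""
    model.delay("LI", "LI", "LI", "LI", "LI", "LI", "OPc")
    if event_pending:
        model.delay("OPi", "LD", "LI", "OPc", "LD", "OPi", "OPi", "IF")
    return model.cycles
```

The functional behavior is untouched; only the accumulated estimate depends on which basic blocks are actually executed.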

4 - 46

C: “Transaction Model”

Computation
• Approximate-timed (estimate)

Communication
• Approximate-timed (estimate) using simplified (abstract) bus protocols

Mapping
• Mapping of computation and communication

[Figure: PE1 (B1: v1 = a*a;), PE2 (B2: v2 = v1 + b*b;) and PE3 (B3: v3 = v1 - b*b; B4: v4 = v2 + v3; c = sequ(v4);) connected through the channels cv11, cv12 and cv2 to PE4 (arbiter); 1 = master interface, 2 = slave interface, 3 = arbiter interface]

4 - 47

D: “Cycle-Accurate Communication Model”

Computation
• Approximate-timed (estimate)

Communication
• Protocol bus channels (time/cycle-accurate and pin-accurate)

Mapping
• Mapping of computation and communication

[Figure: PE1 (B1), PE2 (B2), PE3 (B3, B4) and PE4 (arbiter) connected by a protocol bus with ready, ack, address and data lines; 1 = master interface, 2 = slave interface, 3 = arbiter interface]

4 - 48

E: “Cycle-Accurate Computation Model”

Computation
• Cycle-accurate

Communication
• Approximate-timed (estimate) using simplified (abstract) bus protocols

Wrappers
• Simulation interfaces between cycle-accurate PEs and abstract bus channel interfaces
• cycle-accurate and pin-accurate

[Figure: PE1 and PE2 modeled at instruction level (register operations on r1, r2), PE3 and PE4 modeled as state machines with states S0 to S4, connected through the channels cv11, cv12 and cv2; 1 = master interface, 2 = slave interface, 3 = arbiter interface, 4 = wrapper]



4 - 4�

Example E: What is an ISS?

An Instruction Set Simulator (ISS) is a simulation model, coded in a high-level language, which mimics the behavior of a processor by “reading” instructions and maintaining internal variables which represent the processor's registers.

• Instruction-accurate
• Cycle-accurate

Simulate (execute and monitor) machine code instructions, compiled for a target processor.

4 - 50

Example E: Types of ISS

Interpretive ISS

original assembly code:

…
add r1, r2, r3
…

ISS code:

int Reg[32];
…
while (1) {
  Fetch();
  Decode();
  Execute();
  InterruptHandler();
}

switch (INSN) {
  case ADD: r3 = r1 + r2; break;
  case SUB: ...
}

Compiled ISS

original C code:

…
a = b + c;
…

compilation to assembly code (add r1, r2, r3), followed by intermediary C code generation and recompilation:

…
Add(r1, r2, r3);
…

#define Add(r1, r2, r3) \
  r3 = r1 + r2
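The interpretive variant boils down to a fetch-decode-execute loop. Below is a toy sketch, assuming a hypothetical three-instruction ISA (`li`, `add`, `sub`) encoded as tuples `(op, a, b, dst)`; it is instruction-accurate only (one step per instruction) and ignores interrupts.

```python
def run_iss(program, max_steps=1000):
    """Interpret the program; return (register file, executed instruction count)."""
    regs = {}
    pc = 0
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, a, b, dst = program[pc]          # fetch and decode
        if op == "li":                       # load immediate a into dst
            regs[dst] = a
        elif op == "add":                    # dst = a + b, as in "add r1, r2, r3"
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
        steps += 1
    return regs, steps
```

A compiled ISS would avoid the decode step at run time by translating each target instruction into host code once, at the cost of a recompilation step.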

4 - 51

F: “Register Transfer Model”

Computation and Communication
• cycle-timed
• modeled on the level of combinatorial (stateless) functions, memory and digital signals

Example
• PE1, PE2: microprocessors
• PE3, PE4: custom hardware

[Figure: PE1 to PE4 at register transfer level, with state machines S0 to S4, register operations and interrupt lines]

4 - 5�

Different Abstraction Models

| Model | Communication time | Computation time | Communication scheme | PE interface |
| A. Un-Timed Functional Model | No | No | Variables | (no PE) |
| B. Timed Functional Model | No | Approximate | Abstract channel | Abstract |
| C. Transaction Model | Approximate | Approximate | Abstract bus channel | Abstract |
| D. Cycle-Accurate Communication Model | Cycle-accurate | Approximate | Protocol bus channel | Abstract |
| E. Cycle-Accurate Computation Model | Approximate | Cycle-accurate | Abstract bus channel | Pin-accurate |
| F. Register Transfer Model | Cycle-accurate | Cycle-accurate | Bus (wires) | Pin-accurate |

4 - 5�

Trace-Based Simulation

A (Un-timed Functional Model) and C (Transaction Model)

• Higher simulation speed (for large hardware/software systems, multiprocessors)
• Uses estimates of non-functional behavior

4 - 54

Trace-Based Simulation: 2 Phases

1. Trace generation
• Input: application specification
• Output: execution traces = sequence of read, write and compute events
• Method: un-timed functional simulation

2. Trace-based simulation
• Input: execution traces, architecture specification, mapping specification
• Output: performance estimation results, e.g. execution time, processor load and bus load
• Method: map abstract read, write and compute primitives onto virtual machines that reflect binding and resource sharing (mapping)

4 - 55

Cosimulation: Motivation

Mixed models: the simulation is very much dependent on the system description model.

How to simulate several abstraction levels or several models of computation?

Motivation:
1. Different abstraction levels
2. Different description languages
3. Different models of computation

[Figure: interfaces from more abstract to less abstract: packet; address, data, cmd, cnfg, status; languages include C/C++]

4 - 5�

Cosimulation: Example

Environments for multiprocessor system cosimulation:

• Several ISSs coupled with HW (RTL) simulation: accurate, but slow (especially for multiple ISSs running in parallel)
[Figure: tasks T1, T2, T3 run on an OS on each ISS; the ISSs are connected through cosimulation interfaces to the HW (RTL) simulator (SystemC), which contains the interconnect and the HW IP]

• ISSs are replaced with higher-level simulation models: speeds up simulation
[Figure: tasks T1, T2, T3 run on an OS model with native execution (UNIX); the OS models are connected through cosimulation interfaces to the HW (RTL) simulator (SystemC), which contains the interconnect and the HW IP]

4 - 5�

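Cosimulation: Single vs. Multiple Engines

Single simulation engine
[Figure: models 1, 2, ..., n are combined into a unified model that runs on one simulator]

Multiple simulation engines
[Figure: models 1, 2, ..., n each run on their own simulator (#1, #2, ..., #n); the simulators are coupled through a cosimulation bus]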

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

5 - 1

5. Worst Case Execution Time Analysis

doc. dr. Gregor Papa

5 - �

System Design

[Figure labels: Specification; System Synthesis; SW-Compilation; HW-Synthesis; Machine Code; Net lists; Estimation; Instruction Set; Intellectual Prop. Block; Intellectual Prop. Code]

5 - �

Contents

• Introduction
  • problem statement, tool architecture
• Program Path Analysis
• Value Analysis
• Caches
  • must, may analysis
• Pipelines
  • Abstract pipeline models
• Integrated analyses

5 - 4

Industrial Needs

Hard real-time systems, often in safety-critical applications, abound
• Aeronautics, automotive, train industries, manufacturing control

Examples:
• Wing vibration of airplane, sensing every 4 mSec
• Side airbag in car, reaction in <10 mSec

5 - 5

Hard Real-Time Systems

Embedded controllers are expected to finish their tasks reliably within time bounds.

Task scheduling must be performed.

Essential: upper bound on the execution times of all tasks statically known.

Commonly called the Worst-Case Execution Time (WCET).

Analogously, Best-Case Execution Time (BCET).

5 - �

Measurement: Industry's “best practice”

[Figure: distribution of execution times over the execution time axis, marking the best-case execution time, the worst-case execution time and an upper bound; execution time measurement is marked as unsafe]

Measurement works if either
• the worst-case input can be determined, or
• exhaustive measurement is performed.

Otherwise, determine an upper bound from the execution times of instructions.

5 - �

(Most of) Industry's Best Practice

Measurements: determine execution times directly by observing the execution or a simulation on a set of inputs.
• Does not guarantee an upper bound to all executions.
• Exhaustive execution in general not possible!
• Too large space of input domain × set of initial execution states.

Compute upper bounds along the structure of the program:
• Programs are hierarchically structured.
• Statements are nested inside statements.
• So, compute the upper bound for a statement from the upper bounds of its constituents.

5 - 8

Sequence of Statements

A ≡ A1; A2;

Constituents of A: A1 and A2

Upper bound for A is the sum of the upper bounds for A1 and A2

ub(A) = ub(A1) + ub(A2)

5 - �

Conditional Statement

A ≡ if B then A1 else A2

Constituents of A: condition B; statements A1 and A2

ub(A) = ub(B) + max(ub(A1), ub(A2))

5 - ��

Loops

A ≡ for i ← 1 to 100 do A1

[Flowchart: i ← 1; test i ≤ 100; on yes execute A1 and return to the test; on no exit]

ub(A) = ub(i ← 1) + 100 × (ub(i ≤ 100) + ub(A1)) + ub(i ≤ 100)
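The three rules (sequence, conditional, loop) compose into a recursive bound calculation over the program structure. The sketch below assumes a hypothetical tuple encoding of statements and a `cost` table for the basic constituents; neither is taken from the slides.

```python
def ub(stmt, cost):
    """Upper bound of a structured statement from the bounds of its constituents."""
    kind = stmt[0]
    if kind == "basic":                      # ("basic", name)
        return cost[stmt[1]]
    if kind == "seq":                        # ("seq", s1, s2, ...)
        return sum(ub(s, cost) for s in stmt[1:])
    if kind == "if":                         # ("if", cond, then_branch, else_branch)
        _, cond, then_branch, else_branch = stmt
        return ub(cond, cost) + max(ub(then_branch, cost), ub(else_branch, cost))
    if kind == "for":                        # ("for", init, test, body, iterations)
        _, init, test, body, n = stmt
        # n successful tests with a body each, plus the final failing test
        return ub(init, cost) + n * (ub(test, cost) + ub(body, cost)) + ub(test, cost)
    raise ValueError(f"unknown statement kind: {kind}")
```

With hypothetical costs init = 1, test = 2 and A1 = 10, the loop over 100 iterations is bounded by 1 + 100 × (2 + 10) + 2 = 1203.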

5 - ��

How to start?

Assignment: x ← a + b

load a
load b
add
store x

ub(x ← a + b) = cycles(load a) + cycles(load b) + cycles(add) + cycles(store x)

Cycles per instruction: add …, load m 12, store m 1…, move 1

Assumes constant execution times for instructions.

Not applicable to modern processors!

5 - ��

Modern Hardware Features

Modern processors increase performance by using: caches, pipelines, branch prediction, speculation.

These features make WCET computation difficult: execution times of instructions vary widely.
• Best case: everything goes smoothly: no cache miss, operands ready, needed resources free, branch correctly predicted.
• Worst case: everything goes wrong: all loads miss the cache, resources needed are occupied, operands are not ready.
• Span may be several hundred cycles.

5 - ��

Access Times

x = a + b;

LOAD r2, _a
LOAD r1, _b
ADD r3,r2,r1

[Figure: bar chart of the execution time in clock cycles for the best case and the worst case]

5 - ��

Timing Accidents and Penalties

Timing Accident: cause for an increase of the execution time of an instruction
Timing Penalty: the associated increase

Types of timing accidents
• Cache misses
• Pipeline stalls
• Branch mispredictions
• Bus collisions
• Memory refresh of DRAM
• TLB miss

5 - �5

Overall Approach: Modularization

Micro-architecture Analysis:
• Uses Abstract Interpretation
• Excludes as many Timing Accidents as possible
• Determines WCET for basic blocks (in contexts)

Worst-case Path Determination:
• Maps control flow graph to an integer linear program
• Determines upper bound and associated path

5 - ��

Contents

• Introduction
  • problem statement, tool architecture
• Program Path Analysis
• Value Analysis
• Caches
  • must, may analysis
• Pipelines
  • Abstract pipeline models
• Integrated analyses

5 - ��

Control Flow Graph

what_is_this {
 1   read (a,b);
 2   done = FALSE;
 3   repeat {
 4     if (a>b)
 5       a = a-b;
 6     elseif (b>a)
 7       b = b-a;
 8     else done = TRUE;
 9   } until done;
10   write (a);
}

[Figure: control flow graph with one node per numbered line (1 to 10) and edges labeled a>b, a<=b, a<b, a=b, done and !done]

5 - �8

Program Path Analysis

• Which sequence of instructions is executed in the worst case (longest runtime)?
• Problem: the number of possible program paths grows exponentially with the program length.

Model
• fixed number of cycles for each basic block (from static analysis)
• loops must be bounded

Concept
• Transform structure of CFG into a set of (integer) linear equations.
• Solution of the Integer Linear Program (ILP) yields bound on the WCET.

5 - ��

Basic Block

Definition: A basic block is a sequence of instructions where the control flow enters at the beginning and exits at the end, without stopping in between or branching (except at the end).

t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L

5 - ��

Basic Blocks

Determine the basic blocks of a program:

1. Determine the block beginnings:
• the first instruction
• targets of un/conditional jumps
• instructions that follow un/conditional jumps

2. Determine the basic blocks:
• there is a basic block for each block beginning
• the basic block consists of the block beginning and runs until the next block beginning (exclusive) or until the program ends
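The two steps above can be sketched as follows. The instruction encoding `(label, text, jump_target)` is an assumption made for this illustration; blocks are returned as lists of instruction indices.

```python
def basic_blocks(instrs):
    """Partition a list of (label_or_None, text, jump_target_or_None) into basic blocks."""
    label_pos = {lab: i for i, (lab, _, _) in enumerate(instrs) if lab}
    # Step 1: block beginnings
    leaders = {0} if instrs else set()
    for i, (_, _, target) in enumerate(instrs):
        if target is not None:
            leaders.add(label_pos[target])       # target of a jump
            if i + 1 < len(instrs):
                leaders.add(i + 1)               # instruction following a jump
    # Step 2: each block runs until the next block beginning (exclusive)
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]

PROGRAM = [
    (None, "i := 0", None),
    (None, "t2 := 0", None),
    ("L", "t2 := t2 + i", None),
    (None, "i := i + 1", None),
    (None, "if i < 10 goto L", "L"),
    (None, "x := t2", None),
]
```

For `PROGRAM` this yields three blocks: the initialization, the loop body ending in the conditional jump, and the final assignment.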

5 - ��

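Control Flow Graph with Basic Blocks

“Degenerated” control flow graph (CFG)
• the nodes are the basic blocks

i := 0
t2 := 0

L: t2 := t2 + i
   i := i + 1
   if i < 10 goto L

x := t2

[Figure: three basic blocks; the second block loops back to itself when i < 10 and continues to the third block when i >= 10]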

5 - ��

Example

/* k >= 0 */
s = k;
WHILE (k < 10) {
  IF (ok)
    j++;
  ELSE {
    j = 0;
    ok = true;
  }
  k++;
}
r = j;

[Figure: control flow graph with the basic blocks B1: s = k; B2: WHILE (k<10); B3: if (ok); B4: j++; B5: j = 0; ok = true; B6: k++; B7: r = j;]

5 - ��

Calculation of the WCET

Definition: A program consists of N basic blocks, where each basic block $B_i$ has a worst-case execution time $c_i$ and is executed exactly $x_i$ times. Then, the WCET is given by

$$WCET = \sum_{i=1}^{N} c_i \cdot x_i$$

• the $c_i$ values are determined using the static analysis.
• how to determine $x_i$?
  • structural constraints given by the program structure
  • additional constraints provided by the programmer (bounds for loop counters, etc.; based on knowledge of the program context)

5 - 24

Structural Constraints

[Figure: control flow graph of the example with basic blocks B1 to B7 and edges d1 to d10: d1 enters B1; d2: B1 → B2; d3: B2 → B3; d4: B3 → B4; d5: B3 → B5; d6: B4 → B6; d7: B5 → B6; d8: B6 → B2; d9: B2 → B7; d10 leaves B7]

Flow equations:

d1 = d2 = x1
d2 + d8 = d3 + d9 = x2
d3 = d4 + d5 = x3
d4 = d6 = x4
d5 = d7 = x5
d6 + d7 = d8 = x6
d9 = d10 = x7

5 - 25

Additional Constraints

[Figure: control flow graph of the example with basic blocks B1 to B7 and edges d1 to d10]

loop is executed at most 10 times:

x3 <= 10 · x1

B5 is executed at most once:

x5 <= 1 · x1

5 - 26

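WCET - ILP

ILP with structural and additional constraints:

$$WCET = \max \Big\{ \sum_{i=1}^{N} c_i \cdot x_i \;\Big|\; d_1 = 1 \;\wedge\; \sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i,\ i = 1 \ldots N \;\wedge\; \text{additional constraints} \Big\}$$

• $d_1 = 1$: program is executed once
• $\sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i$: structural constraints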
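For the small example, the ILP can be checked by enumeration instead of a solver. This sketch relies on a derivation from the flow equations: with d1 = 1 they imply x1 = x7 = 1, x2 = x3 + 1, x6 = x3 and x4 = x3 - x5, so only x3 and x5 remain free. The block costs passed in are hypothetical; a real tool would hand the constraints to an ILP solver.

```python
def wcet_example(c, loop_bound=10, b5_bound=1):
    """Maximize sum(c[i] * x[i]) for the 7-block example by enumerating x3 and x5."""
    best = None
    for x3 in range(loop_bound + 1):                 # additional constraint: x3 <= 10 * x1
        for x5 in range(min(b5_bound, x3) + 1):      # additional constraint: x5 <= 1 * x1
            # remaining counts follow from the structural constraints
            x = {1: 1, 2: x3 + 1, 3: x3, 4: x3 - x5, 5: x5, 6: x3, 7: 1}
            total = sum(c[i] * x[i] for i in x)
            if best is None or total > best[0]:
                best = (total, x)
    return best
```

With the hypothetical costs c = {1: 2, 2: 3, 3: 2, 4: 1, 5: 5, 6: 1, 7: 1} the objective reduces to 6 + 7·x3 + 4·x5, which is maximal for x3 = 10 and x5 = 1.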

5 - 2�

Contents

• Introduction
  • problem statement, tool architecture
• Program Path Analysis
• Value Analysis
• Caches
  • must, may analysis
• Pipelines
  • Abstract pipeline models
• Integrated analyses

5 - 2�

Abstract Interpretation (AI)

Semantics-based method for static program analysis

Basic idea of AI: Perform the program's computations using value descriptions or abstract values in place of the concrete values; start with a description of all possible inputs.

AI supports correctness proofs.

5 - 29

Abstract Interpretation: the Ingredients

• abstract domain, related to concrete domain by abstraction and concretization functions, e.g. L → Intervals, where Intervals = LB × UB, LB = UB = Int ∪ {-∞, ∞}, instead of L → Int
• abstract transfer functions for each statement type: abstract versions of their semantics, e.g. + : Intervals × Intervals → Intervals, where [a,b] + [c,d] = [a+c, b+d], with + extended to -∞, ∞
• a join function combining abstract values from different control-flow paths, e.g. ⊔ : Intervals × Intervals → Intervals, where [a,b] ⊔ [c,d] = [min(a,c), max(b,d)]
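The interval ingredients translate almost literally into code. A minimal sketch, assuming intervals are represented as `(lower, upper)` tuples with `math.inf` standing for ∞; the names `TOP`, `add` and `join` are illustrative.

```python
import math

TOP = (-math.inf, math.inf)   # interval describing all possible values

def add(x, y):
    """Abstract transfer function for +: [a,b] + [c,d] = [a+c, b+d]."""
    return (x[0] + y[0], x[1] + y[1])

def join(x, y):
    """Join at control-flow merges: [a,b] ⊔ [c,d] = [min(a,c), max(b,d)]."""
    return (min(x[0], y[0]), max(x[1], y[1]))
```

For example, adding the constant interval [4,4] to a register known to lie in [-4,4] gives [0,8], and joining [0,3] with [5,9] gives [0,9], which also contains values such as 4 that neither path can produce; this over-approximation is what keeps the analysis safe.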

5 - 30

Value AnalysisMotivation:

Provide access information to data-cache/pipeline analysisDetect infeasible pathsDerive loop bounds

Method: calculate intervals at all program points, i.e. lower and upper bounds for the set of possible values occurring in the machine program (addresses, register contents, local and global variables).

5 - 3�

Value Analysis

• Intervals are computed along the CFG edges
• At joins, intervals are “unioned”

[Figure: two incoming intervals for D1 are joined into one enclosing interval]

move #4,D0
add D1,D0
move (A0,D0),D1

[Figure: interval annotations for D0, D1 and A0 before and after each instruction, ending with the interval of addresses accessed by the last instruction]

Which address is accessed here?

5 - 3�

Contents

• Introduction
  • problem statement, tool architecture
• Program Path Analysis
• Value Analysis
• Caches
  • must, may analysis
• Pipelines
  • Abstract pipeline models
• Integrated analyses

5 - 33

Caches: Fast Memory on Chip

Caches are used, because
• Fast main memory is too expensive
• The speed gap between CPU and memory is too large and increasing

Caches work well in the average case:
• Programs access data locally (many hits)
• Programs reuse items (instructions, data)
• Access patterns are distributed evenly across the cache

5 - 3�

Caches

[Figure: processor connected to a cache (fast, small, expensive; access takes ~ 1 cycle) and over the bus to memory ((relatively) slow, large, cheap; access takes ~ 100 cycles)]

5 - 35

Caches: How They Work

CPU wants to read/write at memory address a, sends a request for a to the bus.

Cases:
• Block m containing a in the cache (hit): request for a is served in the next cycle.
• Block m not in the cache (miss): m is transferred from main memory to the cache, m may replace some block in the cache, request for a is served asap while transfer still continues.

Several replacement strategies: LRU, PLRU, FIFO, ... determine which line to replace.

5 - 3�

4-Way Set Associative Cache

5 - 37

LRU Strategy

Each cache set has its own replacement logic => cache sets are independent. Everything is explained in terms of one set.

LRU replacement strategy:
• Replace the block that has been Least Recently Used
• Modeled by ages

Example: 4-way set associative cache

| access | age 0 | age 1 | age 2 | age 3 |
| (initial) | m0 | m1 | m2 | m3 |
| m4 (miss) | m4 | m0 | m1 | m2 |
| m1 (hit) | m1 | m4 | m0 | m2 |
| m5 (miss) | m5 | m1 | m4 | m0 |
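The age-based view of LRU fits in a few lines. A sketch under assumptions: one cache set is a list of block names ordered by age (index 0 is the youngest), the set is assumed to be full, and the block names are illustrative.

```python
def lru_access(ages, block):
    """Return (new set contents ordered by age, hit flag) after accessing block."""
    hit = block in ages
    if hit:
        rest = [b for b in ages if b != block]   # younger blocks age by one
    else:
        rest = ages[:-1]                         # the oldest block is evicted
    return [block] + rest, hit
```

On a hit only the blocks younger than the accessed one age; on a miss every block ages and the block with the highest age leaves the set.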

5 - 3�

Cache Analysis

How to statically precompute cache contents:

Must Analysis: For each program point (and calling context), find out which blocks are in the cache. Determines safe information about cache hits. Each predicted cache hit reduces WCET.

May Analysis: For each program point (and calling context), find out which blocks may be in the cache. Complement says what is not in the cache. Determines safe information about cache misses. Each predicted cache miss increases BCET.
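At control-flow joins the two analyses combine abstract cache states differently. The sketch below assumes an LRU cache and represents an abstract state as a dictionary from block to an age bound (an upper bound for must, a lower bound for may); this representation is an assumption for illustration, not taken from the slides.

```python
def must_join(a, b):
    """Keep only blocks guaranteed on both paths, with their maximal (worst) age."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}

def may_join(a, b):
    """Keep blocks possible on either path, with their minimal (best) age."""
    merged = dict(a)
    for blk, age in b.items():
        merged[blk] = min(age, merged.get(blk, age))
    return merged
```

The must join is an intersection, which explains why information is lost quickly when paths with different cache contents merge.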

5 - 50

Contexts

Cache contents depend on the context, i.e. calls and loops.

First iteration loads the cache: intersection loses most of the information.

Distinguish as many contexts as useful:
• unrolling for caches
• unrolling for branch prediction (pipeline)

[Figure: loop “while cond do” with a join (must) at the loop head]

5 - 5�

Contents

• Introduction
  • problem statement, tool architecture
• Program Path Analysis
• Value Analysis
• Caches
  • must, may analysis
• Pipelines
  • Abstract pipeline models
• Integrated analyses

5 - 5�

Comparison of Architectures

[Figure: timing diagrams of instruction execution for three architectures: single cycle, multiple cycle, pipelining]

5 - 53

Hardware Features: Pipelines

Ideal Case: 1 Instruction per Cycle

[Figure: instructions passing through the stages Fetch, Decode, Execute and WB, each instruction starting one cycle after the previous one]

5 - 5�

D�t��th�o��e��e��ch�tectu�e

5 - 55

Hardware Features: Pipelines

Several instructions can be executed in parallel.

Some pipelines can begin more than one instruction per cycle: VLIW, Superscalar.

Some CPUs can execute instructions out-of-order.

Practical problems: Hazards and cache misses.

5 - 5�

Pipeline Hazards

• Data Hazards: Operands not yet available (Data Dependences)
• Resource Hazards: Consecutive instructions use same resource
• Control Hazards: Conditional branch
• Instruction-Cache Hazards: Instruction fetch causes cache miss

5 - 5�

�o�t�o��d

��

5 - 5�

D�t��d

5 - 5�

��: prediction o� cache hits on instruction or operand �etch or store

�t�t�c��o��h��d�

l�z r4� 2��r1� Hi�

��: analysis o� data�control hazards

��: analysis o� resource hazards

add r4� r5�r6l�z r7� 1��r1�add r8� r4� r4

�pera�dread�

�FE��F

5 - ��

CPU as a (Concrete) State Machine

Processor (pipeline, cache, memory, inputs) viewed as a big state machine, performing transitions every clock cycle.
Starting in an initial state for an instruction, transitions are performed until a final state is reached:
• End state: instruction has left the pipeline
• Number of transitions: execution time of instruction

function exec (b : basic block, s : concrete pipeline state) t : trace
• interprets instruction stream of b starting in state s, producing trace t
• successor basic block is interpreted starting in initial state last(t)
• length(t) gives number of cycles

5 - ��

An Abstract Pipeline for a Basic Block

function exec (b : basic block, s : abstract pipeline state) t : trace
• interprets instruction stream of b (annotated with cache information) starting in state s, producing trace t
• length(t) gives number of cycles

• Abstract states may lack information, e.g. about cache contents.
• Assuming local worst cases is safe (in the case of no timing anomalies).
• Traces may be longer (but never shorter).

5 - �2

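What is different?

Question: What is the starting state for the successor basic block? In particular, if there are several predecessor blocks.

Alternatives:
• sets of states
• combine by assuming that local worst case is safe

[Figure: states of several predecessor blocks flowing into one successor block]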

5 - �3

Summary of Pipeline Analysis

• Simulates the pipeline on the instruction sequence, using statically computed effective addresses and loop bounds
• Safe approximation: assume cache hits where predicted, assume cache misses where predicted or not excluded.
• Only the “worst” result states of an instruction need to be considered as input states for successor instructions!

�- �

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

6. Multicriteria Optimization

doc. dr. Gregor Papa

�- 2

SW�Compilation �W�Synthesis

��te��De��Speci�ication

System Synthesis

Machine Code �et lists

Estimation

Instruction Set

IntellectualProp. Block


�- 3

De��ce�Ex��o��t�o�

��c�t�o� ��ch�tectu�e

��

E�t��t�o�

�multi�ob�ective� optimization

�- �

�ectu�e��o��ptimization�esignImplementation

�- 5

Evolutionary Multiobjective Optimization Algorithms

What are Evolutionary Algorithms?

randomized, problem-independent search heuristics
→ applicable to black-box optimization problems

How do they work?

by iteratively improving a population of solutions by variation and selection
→ can find many different optimal solutions in a single run

�- �

The Knapsack Problem

[Figure: items with weight and profit, e.g. weight = 750g, profit = 5 and weight = 1500g, profit = 8]

Goal: choose subset that
• maximizes overall profit
• minimizes total weight

�- �

The Solution Space

[Figure: all solutions plotted by weight (500g to 3500g) and profit (5 to 20)]

�- �

The Trade-off Front

Observations:
1. there is no single optimal solution, but
2. some solutions are better than others

[Figure: solutions plotted by weight (500g to 3500g) and profit (5 to 20); annotations: finding the good solutions, selecting a solution]
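For a handful of items the trade-off front can be computed by brute force. A sketch under assumptions: items are `(weight, profit)` pairs, a point dominates another if it is no heavier and no less profitable (and differs from it), and the third item in the test is hypothetical. An evolutionary algorithm approximates the same set when enumeration is infeasible.

```python
from itertools import combinations

def pareto_front(items):
    """Non-dominated (total weight, total profit) points over all subsets of items."""
    points = set()
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            points.add((sum(w for w, _ in subset), sum(p for _, p in subset)))

    def dominated(a):
        # b dominates a: not heavier, not less profitable, and different from a
        return any(b != a and b[0] <= a[0] and b[1] >= a[1] for b in points)

    return sorted(p for p in points if not dominated(p))
```

Every point on the returned front is a legitimate answer; picking one of them is the decision-making step, for example by applying a weight limit.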

�- �

Decision Making: Selecting a Solution

Approaches:
• profit more important than cost (ranking)
• weight must not exceed 2400g (constraint)

[Figure: solutions plotted by weight (500g to 3500g) and profit (5 to 20); solutions above the weight limit are marked “too heavy”]

�- ��

When to Make the Decision

Before Optimization:
ranks objectives, defines constraints, …
searches for one (green) solution

After Optimization:
searches for a set of (green) solutions
selects one solution considering constraints
decision making often easier
evolut. algorithms well suited

Figure: candidate subsets plotted by weight (500g to 3500g) against profit (5 to 20); the region beyond the weight limit is marked "too heavy".

�- ��

Optimization Alternatives

Use of classical single objective optimization methods
simulated annealing, tabu search
integer linear program
other constructive or iterative heuristic methods

Decision making (weighting the different objectives) is done before the optimization.

Population based optimization methods
evolutionary algorithms
genetic algorithms

Decision making is done after the optimization.

�- �2

F�t�e��d��u�t��e��ect��e�

y1

y2

y1

y2

��e��t�o��ed do��ce��ed

parameter�orientedscaling�dependent

set�orientedscaling�independent

�ei�h�ed sum

�- �3

We��hted��o�t�Fu�ct�o�

y2

y1

transformation of multiple objectives into a single objective, with parameters $(w_1, w_2, \dots, w_k)$

multiple objectives: $(y_1, y_2, \dots, y_k)$

single objective: $y$

example: weighting approach (maximization problem)

$y = w_1 y_1 + \dots + w_k y_k$

�- ��

�ut��e�o��e�E�o�ut�o��o��th�

�1��

��11 � 1 solution

�itnessevaluation

�111 �itness � 19

matingselection

��11

��

mutation

��11

1�11

environmentalselection recombination

recombination � mutation � variation

�- �5

�t��t��o��t��E�cod��o��o�ut�o��

�

�� 11 11item 1 item 2 item 3 item 4

subset

�- ��

��e�e��c��u�t�o��ect��e�E�archivepopulation

ne� population ne� archive

samplevary

select

updatetruncate

�- ��

��E�o�ut�o��o��th��ct�o�ma�. y2

min. y1

hypothetical trade�o�� ront

�- ��

�1��

t

vv ��t �

2 134

5Rlane�� v� a

ngear

point o� gear

changegear

lanev� a

��

�X

Stretch�Module �andling�Module

lane

a re�uire

Br�r

�r

gears clutch

lane� �� va�gear� n

�ehicle�Module

�ecision�Module

Black-Box Optimization

Optimization Algorithm: only allowed to evaluate f (direct search)

decision vector x → objective function f (e.g. simulation model) → objective vector f(x)

�- ��

De��ce�Ex��o��t�o�

cost

latencypo�er

consumption

Speci�icationSpeci�ication �ptimization�ptimization ImplementationImplementationEvaluationEvaluation

�- 2�

��c�et��oce��et�o��

Mobile InternetMobile Internet

Embedded Internet�evices

Embedded Internet�evices

�ccess Core

�ethod��d��o��to��do��oth��c��co��d��e��e�d��o�

Wearable ComputingWearable Computing

��B��e�

�- 2�

Network Processors

Network processor = high-performance, programmable device designed to efficiently execute communication workloads [Crowley et al.]

�et�o��oce��o�

��

�et�o��oce��o�

��

routing � �or�ardingtranscoding

encryption � decryption

incoming �lo�s�packet streams�

outgoing �lo�s�processed packets�

real�time �lo�s

non�real�time �lo�s

e.g.� voice

e.g.� s�tp

�

�- 22

Optimization Scenario: Overview

Given: specification of the task structure (task model) = for each flow the corresponding tasks to be executed
different usage scenarios (flow model) = sets of flows with different characteristics

Sought: network processor implementation = architecture + task mapping + scheduling

Objectives: maximize performance, minimize cost

Subject to: memory constraint, delay constraints (performance model)

�- 23

Ex��o��t�o��t��te��

� u�t�o��ect��eo�t��t�o�

e��u�t�o�

per�ormance �cost vector

allocationbindings

co��t�uct��ch�tectu�e

��o��

e�t��te�e��o�� ce

per�ormancearchitecture

bindingrestrictions

taskgraph

architecturetemplate

�or each usagescenario separately

�- 2�

�ectu�e��o��ectu�e��o��Introduction��esignImplementation

�- 25

Dominance, Pareto Points

A (design) point $J_k$ is dominated by $J_i$ if $J_i$ is better than or equal to $J_k$ in all criteria and better in at least one criterion.

A point is Pareto-optimal, or a Pareto point, if it is not dominated.

The domination relation imposes a partial order on all design points.

We are faced with a set of optimal solutions.
Divergence of solutions vs. convergence.
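The dominance definition above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample design points are hypothetical, and every criterion is assumed to be maximized (a cost criterion is entered with a negative sign).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if point a dominates point b (all criteria maximized)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_points(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep exactly those points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical design points: (performance, -cost).
designs = [(3, -5), (2, -2), (3, -6), (1, -1), (2, -4)]
front = pareto_points(designs)
```

The points that remain in `front` are mutually incomparable, which is why the relation is only a partial order.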

�- 2�

�u�t��o��ect��e��t��t�o�

�- 2�

Multiobjective Optimization
Maximize (y1, y2, …, yk) = ƒ(x1, x2, …, xn)

Pareto set = set of all Pareto-optimal solutions

y2

y1

�orse

better

incomparable

incomparable

y2

y1

Pareto optimal � not dominated

dominated

�- 2�

Randomized (Black-Box) Search Algorithms

Idea: find good solutions without investigating all solutions
Assumptions:
better solutions can be found in the neighborhood of good solutions
information is available only by function evaluations

Randomized search algorithm:
t = 1: (randomly) choose a solution $x_1$ to start with
t → t+1: (randomly) choose a solution $x_{t+1}$ using solutions $x_1, \dots, x_t$

�- 2�

��e��o��do��ed��e��ch��o��th��e�ect�o�

environmentalselection

matingselection

��t�o�� e� o��

E� ≥ 1 bothevolutionary algorithm

�� 1 no mating selectiontabu search

�� 1 no mating selectionsimulated annealing

6 - 30

Limitations of Randomized Search Algorithms

The No-Free-Lunch Theorem

All search algorithms provide on average the same performance over all possible functions with finite search and objective spaces.

[Wolpert, Macready: 1997]

Remarks:
Not all functions are equally likely and realistic.
We cannot expect to design an algorithm that beats all others.
Ongoing research: which algorithm is suited for which class of problem?

6 - 3�

Course Synopsis
Introduction
Optimization
Design
Implementation

6 - 32

�esign �hoices�esign �hoices

�1��

��11 �111

��11��

��11

1�11

representation fitness assignment mating selection

environmental selection variation operators

parameters

6 - 33

�omparison of Three �mplementations��

��

e�te�ded��

�-o��ective knapsack pro�lem

�rade�off betweendistance and diversity?

6 - 3�


�1��

��11 �111

��11��

��11

1�11

�� fitness assignment mating selection


parameters

6 - 3�

Representation

search space → (decoder) → solution space → (objectives) → objective space

solutions encoded by vectors, matrices, trees, lists, ...
(fixed length or variable length)

Issues:
completeness (each solution has an encoding)
uniformity (all solutions are represented equally often)
redundancy (cardinality of search space vs. solution space)
feasibility (each encoding maps to a feasible solution)

6 - 36

Example: Binary Vector Encoding
Given: graph
Goal: find minimum subset of nodes such that each edge is connected to at least one node of this subset (minimum vertex cover)
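As a small illustration of the binary vector encoding, the sketch below stores one bit per node and checks feasibility of a candidate. The five-node graph and both candidates are made up for this example and are not the graph from the slide.

```python
from typing import Dict, List, Tuple


def is_vertex_cover(selected: Dict[str, int], edges: List[Tuple[str, str]]) -> bool:
    """Feasibility check: every edge touches at least one selected node."""
    return all(selected[u] == 1 or selected[v] == 1 for u, v in edges)


def cover_size(selected: Dict[str, int]) -> int:
    """Objective to minimize: number of selected nodes."""
    return sum(selected.values())


# Hypothetical graph and two encodings (1 = node selected, 0 = not selected).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
candidate = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0}
not_a_cover = {"A": 1, "B": 0, "C": 0, "D": 0, "E": 1}
```

Note that this encoding is complete but not always feasible: `not_a_cover` is a valid bit vector that leaves the edge B-D uncovered.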

1A

��

1�

1�

1�

��

�nodes

selected?

A � � � �

6 - 3�

Example: Integer Vector Encoding
Given: graph, k colors
Goal: assign each node one of the k colors such that the number of connected nodes with the same color is minimized (graph coloring problem)

1A

��

1�

��

1�

��

�nodescolors

A � � � �

6 - 3�

E�ample: Real �ector EncodingE�ample: Real �ector Encoding

�.��x1

�.��x�

1.��x�

�.��x�

��

�.��xnparameters

values

6 - 39

Tree E�ample: �arking a TruckTree E�ample: �arking a Truck

steeringangle

dock

cab

trailer

position �x�y�

u

constant speed

�oal:find function c with

u � c�x� y� d� t�

d

t

6 - �0

Search Space for the Truck �ro�lemSearch Space for the Truck �ro�lem�perators:

Arguments: � position x� position y�� cab angle d�AN� trailer angle t

Search space :set of symbolic expression using the above operators and arguments

6 - ��

E�ample Solution: Tree RepresentationE�ample Solution: Tree Representation

��

��N��

�� AN��

encodes the function �symbolic expression�: u � �x �d� � �y � t�

6 - �2

A Solution Found �y an EAA Solution Found �y an EAtruck simulation encoded tree

6 - �3


�1��

��11 �111

��11��

��11

1�11

representation �� mating selection


parameters

6 - ��

Fitness Assignment
Fitness F = scalar value representing quality of an individual

The simple case:
single objective optimization

More difficult cases:
fitness not only takes into account the different objectives (compliance to Pareto optimality) but also properties of the whole population
multiple optima need to be approximated (diversity)
constraints are involved which have to be met

solution in search spacesolution in solution space

solution in objective space

6 - ��

Simple e�ample: �areto RankingSimple e�ample: �areto Ranking

�itness function:

cost

execution time

�

�

�

�

�

� 0)6(1)5(2)4(1)3(1)2(3)1(

FFFFFF

6 - �6

Constraint Handling
Constraint g(x1, x2, …, xn) ≥ 0 (feasible: ≥ 0, infeasible: < 0)

Approaches:
construct initialization and variation such that infeasible solutions are not generated (resp. not inserted)
representation is such that decoding always yields a feasible solution
calculate constraint violation g(x1, x2, …, xn) and incorporate it into fitness, e.g., F = f - penalty(g(x1, x2, …, xn)), fitness to be maximized; use of a penalty function with penalty(y) = 0 if y ≥ 0
include the constraints as new objectives

feasi�le

infeasi�lesolution in solution space
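The penalty approach can be sketched for the knapsack example used earlier in the lecture. The linear penalty shape and the `factor` value are assumptions chosen for illustration; the slides only require that the penalty vanishes for feasible solutions.

```python
def penalty(g: float, factor: float = 1.0) -> float:
    """Zero for a satisfied constraint (g >= 0), linear in the violation otherwise."""
    return 0.0 if g >= 0 else factor * (-g)


def penalized_fitness(profit: float, weight: float, max_weight: float,
                      factor: float = 0.01) -> float:
    """Fitness to be maximized: profit minus a penalty for excess weight."""
    g = max_weight - weight  # weight limit written as a constraint g >= 0
    return profit - penalty(g, factor)


feasible = penalized_fitness(profit=13, weight=2250, max_weight=2400)
too_heavy = penalized_fitness(profit=18, weight=3000, max_weight=2400)
```

With these hypothetical numbers the heavier subset loses its profit advantage once the 600g excess is penalized.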

6 - ��


�1��

��11 �111

��11��

��11

1�11

representation fitness assignment ��

�� variation operators

parameters

6 - ��

Selection

Two types of selection:

mating selection = select for variation

environmental selection = select for survival

6 - �9

Tournament Selection

T = tournament size (binary tournament selection means T = 2)

population → mating pool

uniformly choose T individuals at random, independently of fitness

compare fitness and copy the best individual into the mating pool
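A minimal sketch of tournament selection as described above. Drawing the contestants with replacement and treating larger fitness as better are assumptions of this sketch; the function and parameter names are illustrative.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def tournament_selection(population: List[T],
                         fitness: Callable[[T], float],
                         pool_size: int,
                         tournament_size: int = 2,
                         rng: Optional[random.Random] = None) -> List[T]:
    """Fill a mating pool by repeated tournaments (larger fitness wins)."""
    rng = rng or random.Random()
    pool = []
    for _ in range(pool_size):
        # Contestants are drawn uniformly, independently of their fitness.
        contestants = [rng.choice(population) for _ in range(tournament_size)]
        # The best contestant is copied into the mating pool.
        pool.append(max(contestants, key=fitness))
    return pool


mating_pool = tournament_selection([3, 1, 4, 1, 5], fitness=lambda v: v,
                                   pool_size=8, rng=random.Random(1))
```

Larger tournaments increase the selection pressure, since weak individuals win only when all contestants are weak.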

6 - �0


�1��

��11 �111

��11��

��11

1�11

representation fitness assignment mating selection

environmental selection ��

parameters

6 - ��

Vector Mutation: Examples

Permutations: swap, rearrange

Bit vectors:

Figure: example permutations and bit vectors before and after mutation.

each bit is flipped with probability 1��
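Bit-flip mutation can be sketched as follows. The flip probability is passed in as a parameter `p`, because its value is a design choice; the function name and the example vector are illustrative.

```python
import random
from typing import List, Optional


def bit_flip_mutation(bits: List[int], p: float,
                      rng: Optional[random.Random] = None) -> List[int]:
    """Return a copy of bits in which each bit is flipped independently with probability p."""
    rng = rng or random.Random()
    return [1 - b if rng.random() < p else b for b in bits]


parent = [1, 0, 1, 1, 1, 0]
child = bit_flip_mutation(parent, p=1 / len(parent), rng=random.Random(42))
```

A common choice is one over the vector length, so that on average one bit changes per mutation.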

6 - �2

Mutation Operators on Trees: Grow

Figure: expression tree before and after the grow mutation.

6 - �3

Mutation Operators on Trees: Shrink

Figure: expression tree before and after the shrink mutation.

6 - ��

Mutation Operators on Trees: Switch

Figure: expression tree before and after the switch mutation.

6 - ��

Mutation Operators on Trees: Replace

Figure: expression tree before and after the replace mutation.

6 - �6

Vector Recombination: Examples

Bit vectors:

Permutations:

Figure: two parent vectors and the resulting child, for bit vectors and for permutations.

6 - ��

Recombination of Trees

Figure: two parent expression trees exchange subtrees.

6 - ��

A Generic Multiobjective EA

Figure: population and archive cycle with the steps sample, vary, select, update, and truncate, producing a new population and a new archive.

6 - �9

SPEA2 Algorithm

Step 1: Generate initial population $P_0$ and empty archive (external set) $A_0$. Set t = 0.

Step 2: Calculate fitness values of individuals in $P_t$ and $A_t$.

Step 3: $A_{t+1}$ = non-dominated individuals in $P_t$ and $A_t$. If size of $A_{t+1}$ > N then reduce $A_{t+1}$, else if size of $A_{t+1}$ < N then fill $A_{t+1}$ with dominated individuals in $P_t$ and $A_t$.

Step 4: If t ≥ T then output the non-dominated set of $A_{t+1}$. Stop.

Step 5: Fill mating pool by binary tournament selection.

Step 6: Apply recombination and mutation operators to the mating pool and set $P_{t+1}$ to the resulting population. Set t = t + 1 and go to Step 2.

6 - 60

SPEA2 Fitness Assignment

Idea (Step 2): calculate dominance rank weighted by dominance count

Note: higher objective function value = better; smaller fitness = better

S (strength) = number of dominated solutions
F (fitness) of non-Pareto solutions = ∑ strengths of dominators

Figure: solutions in the (y1, y2) objective space, annotated with their fitness values.
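The strength-based part of this fitness assignment can be sketched as below. This covers only the sum of dominator strengths; the density term that the full SPEA2 adds on top is left out, and the function names and sample points are hypothetical. Objectives are maximized and smaller fitness is better, as noted on the slide.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a dominates b (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def raw_fitness(points: List[Sequence[float]]) -> List[int]:
    """Fitness of each point = sum of the strengths of the points dominating it."""
    # Strength of a point: how many points it dominates.
    strength = [sum(dominates(p, q) for q in points) for p in points]
    return [sum(strength[j] for j, q in enumerate(points) if dominates(q, p))
            for p in points]


points = [(4, 1), (3, 3), (1, 4), (2, 2), (1, 1)]
fitness = raw_fitness(points)
```

Non-dominated points have no dominators and therefore receive fitness 0.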

6 - 62

Course Synopsis
Introduction
Optimization
Design
Implementation

6 - 6�

Implementation: Components
A framework that

provides ready-to-use modules (algorithms / applications)
is simple to use
is independent of programming language and OS
comes with minimum overhead

Idea: separate problem-dependent from problem-independent part

Selection

Archiving

Representation

Objective functions

Mutation

RecombinationFitness assignment

cut

6 - 6�

The �oncept of ��SA

��A�

N��A��

�A��

Algorithms Applications

knapsack

��

networkprocessordesign

text-based Platform and programming language independent Interface for Search Algorithms [Bleuler et al.]

6 - 66

��SA: �mplementation

selectorprocessselectorprocess

textfiles

sharedfile

system

sharedfile

system

variatorprocessvariatorprocess

application independent:
mating / environmental selection
individuals are described by IDs and objective vectors

handshake protocol:
state / action
individual IDs
objective vectors
parameters

application dependent:
variation operators
stores and manages individuals

�- �

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

7. Mapping Applications To Architectures

doc. dr. Gregor Papa

�- 2

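System Design

Figure: design flow in which a Specification enters System Synthesis, followed by SW-Compilation (producing Machine Code) and HW-Synthesis (producing Net lists); Estimation, the Instruction Set, Intellectual Prop. Block, and Intellectual Prop. Code feed into the flow.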

�- 3

Synthesis

Synthesis transforms behavior into structure.

Synthesis tasks:

allocation: select components

binding: assign functions to components

scheduling: determine execution order

mapping

(allocation and) binding is sometimes called partitioning

7 - 4

Application Specification
Depends on the underlying model of computation.
Examples (see also next slides):

Task graphs (data flow graph, control flow graph)
Process Networks (Kahn Process Network, Synchronous Dataflow)
State Machine Representations (SpecCharts, StateCharts, Polis) [not covered in this course].

For the mapping, very often only the network structure and abstract properties of the processes are relevant (abstraction from detailed process function).

7 - �

Data Flow Graph (DFG)

x = 3*a + b*b - c;
y = a + b*x;
z = b - c*(a + b);

� a

a

ab b b

b

b

c

c

xy �

7 - �

Control Flow Graph (CFG)

what_is_this {
1   read (a,b);
2   done = FALSE;
3   repeat {
4     if (a>b)
5       a = a-b;
6     elseif (b>a)
7       b = b-a;
8     else done = TRUE;
9   } until done;
10  write (a);
}

�

�

�

�

� �

�

�

��

a=b

a>b

a<b

a<=b

done!done

7 - 7

Kahn Process Network
Hierarchical network for MJPEG application:

7 - �

Architecture Specification
Depends on the underlying model of the platform.
Usually a graph notation is used; properties of the underlying platform are usually attached to the elements.

7 - �

��ample �� A�c�itect��e Specification

- <processor name="processor1" type="DSP"><port name="processor_port" type="duplex" /><configuration name="clock" value="100 MHz" />

</processor>+ <processor name="processor2" type="RISC">+ <memory name="sharedmemory" type="DXM">- <hw_channel name="in_tile_link" type="bus">

<port name="port1" type="duplex" /><port name="port2" type="duplex" /><port name="port3" type="duplex" /><configuration name="buswidth" value="32bit" />

</hw_channel>- <connection name="processor1link">

<origin name="processor1"><port name="processor_port" />

</origin><target name="in_tile_link">

<port name="port1" /></target>

</connection>+ <connection name="processor2link">+ <connection name="memorylink">

busbus

DSPDSP R�SCR�SC D�MD�M

7 - ��

Mapping Specification
Relates application and architecture specification:

maps processes to computing resources
maps communication between processes (in case of process networks) to communication paths of the architecture
specifies resource sharing disciplines and scheduling

7 - ��

Example (1)
Basic model with a data flow graph and static scheduling

Problem graph GP(VP,EP):

Interpretation:

• VP consists of functional nodes VPf (task, procedure) and communication nodes VPc.

• EP represent data dependencies

Figure: data flow graph GP(VP, EP) with nodes 1 to 7.

7 - 12

Example (2)Architecture graph GA(VA,EA):

• VA consists of functional resources VAf (RISC, ASIC) and bus resources VAc. These components are potentially allocatable.

• EA model directed communication.

RISC HWM1

HWM2

sharedbus

PTP bus

RISC HWM1

HWM2

shared bus PTP bus

Architecture Architecture graph

7 - 1�

Example (�)1

2

3

4

5

6

7

RISC

HWM1

HWM2

SB

PTP

GP EM GA

Definition: A specification graph is a graph GS=(VS,ES) consisting of a problem graph GP, an architecture graph GA, and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM

�a�a �l��

7 - 1�

Example (4)
Three main tasks of synthesis:

• Allocation α is a subset of VA.

• Binding β is a subset of EM, i.e., a mapping of functional nodes of VP onto resource nodes of VA.

• Schedule τ is a function that assigns a number (start time) to each functional node.

7 - 1�

Example (5)

Definition: Given a specification graph GS, an implementation is a triple (α,β,τ), where α is a feasible allocation, β is a feasible binding, and τ is a schedule.
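One possible way to hold an implementation (α, β, τ) in code is sketched below. The feasibility rules used here (each task bound to exactly one allocated resource via an existing mapping edge, and start times that respect data dependencies with assumed task delays) are an interpretation for illustration; all names and values are hypothetical.

```python
from typing import Dict, Set, Tuple


def is_feasible_binding(tasks: Set[int],
                        mapping_edges: Set[Tuple[int, str]],
                        allocation: Set[str],
                        binding: Dict[int, str]) -> bool:
    """Each task is bound to one allocated resource along an edge of E_M."""
    return all(t in binding
               and binding[t] in allocation
               and (t, binding[t]) in mapping_edges
               for t in tasks)


def respects_dependencies(schedule: Dict[int, int], delay: Dict[int, int],
                          problem_edges: Set[Tuple[int, int]]) -> bool:
    """A task may start only after each predecessor has finished."""
    return all(schedule[u] + delay[u] <= schedule[v] for u, v in problem_edges)


tasks = {1, 2, 3}
problem_edges = {(1, 2), (2, 3)}
mapping_edges = {(1, "RISC"), (2, "RISC"), (2, "HWM1"), (3, "HWM1")}
alpha = {"RISC", "HWM1"}                      # allocation
beta = {1: "RISC", 2: "HWM1", 3: "HWM1"}      # binding
tau = {1: 0, 2: 5, 3: 7}                      # schedule (start times)
delay = {1: 5, 2: 2, 3: 4}                    # assumed execution times
```

Keeping allocation, binding, and schedule as three separate structures mirrors the triple in the definition and lets each be checked or varied on its own.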

1

2

3

4

5

6

7

RISC

HWM1

SB

1

0

8

1

20

1

2

α

τ

0

1

21

30

1

21

29

β RISC HWM1

HWM2

sharedbus

PTP bus

7 - 1�

�� ompilation �� nthesis

��em �e��pecification

��stem ��nthesis


Estimation

�nstruction �et

�ntellectualProp� Bloc�


7 - 1�

Design Space Exploration

Determine mapping
Determine important parameters (end-to-end delay, throughput, buffer space, output jitter, …)
Give feedback to optimization

Figure: Application and Architecture feed into Mapping, followed by Estimation.

7 - 1�

�e�� e�� a�� el

�- 1

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

8. System Partitioning

doc. dr. Gregor Papa

�- 2

�� ompilation �� nthesis

��em �e��pecification

��stem ��nthesis


Estimation

�nstruction �et

�ntellectualProp� Bloc�


�- �

Partitioning

low level: at the register transfer (RT) level, at the netlist level

• split a digital circuit and map it to several devices (FPGAs, ASICs)

• system parameters are relatively well-known (area, delay)

high level: at the system level
• comparison of design alternatives mandatory (design space exploration)
• system parameters are unknown
• importance of estimation (analysis, simulation, rapid prototyping)

�- �

��el�� (see pre�ious lecture�)�� model application�� define architectural template�� identif� possi�le �indings

��Ver� often, parameters are attached to the a�o�e models that simpl� allow to ��of the partitioning (allocation and �inding)��ometimes, �� (simulation, anal�sis) are applied to gi�e more accurate predictions�� allocation gi�es cost � as the sum of the allocated component costs� scheduling gi�es latenc� �� constraints� feasi�le schedule � �ma�� feasi�le allocation � �ma�

�- �

The Partitioning Problem
The partitioning problem is to assign n objects O = {o1, ..., on} to m blocks (also called partitions) P = {p1, ..., pm}, such that

$p_1 \cup p_2 \cup \dots \cup p_m = O$

$p_i \cap p_j = \emptyset \quad \forall i, j : i \neq j$ and
cost c(P) are minimized.

In system synthesis (simple model):
objects = data flow graph nodes

blocks = architecture graph nodes

�- �

Cost Functions
measure quality of a design point

may include
C … system cost in [$]
L … latency in [sec]
P … power consumption in [W]

requires estimation to find C, L, P

example: linear cost function with penalty

f(C, L, P) = k1·hC(C,Cmax) + k2·hL(L,Lmax) + k3·hP(P,Pmax)

hC, hL, hP … denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax

k1, k2, k3 … weighting and normalization
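The linear cost function with penalty can be sketched as follows. The slides leave the shape of hC, hL, and hP open, so the relative-violation form used here, like the weights and the sample values, is an assumption made for illustration.

```python
from typing import Tuple


def violation(value: float, limit: float) -> float:
    """Assumed penalty term: relative amount by which value exceeds limit, else 0."""
    return max(0.0, (value - limit) / limit)


def cost(C: float, L: float, P: float,
         Cmax: float, Lmax: float, Pmax: float,
         k: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """f(C, L, P) = k1*hC(C, Cmax) + k2*hL(L, Lmax) + k3*hP(P, Pmax)."""
    return (k[0] * violation(C, Cmax)
            + k[1] * violation(L, Lmax)
            + k[2] * violation(P, Pmax))


within_limits = cost(C=80, L=10, P=2, Cmax=100, Lmax=20, Pmax=5)
over_budget = cost(C=150, L=10, P=2, Cmax=100, Lmax=20, Pmax=5)
```

Dividing by the limit normalizes the three terms, so the weights k1 to k3 only express relative importance.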

�- 7

General Partitioning Methods

• enumeration
• Integer Linear Programs (ILP)

constructive methods

• random mapping
• hierarchical clustering

iterative methods
• Kernighan-Lin Algorithm
• Simulated Annealing
• Evolutionary Algorithms (EA) … see next lecture

8 - 8

Integer Programming Models
Ingredients:

Cost function
Constraints

Involving linear expressions of integer variables from a set X

Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)

Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.

If all xi are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.

8 - �

Example

minimize: $C = 5x_1 + 6x_2 + 4x_3$

subject to: $x_1 + x_2 + x_3 \ge 2$, $\quad x_1, x_2, x_3 \in \{0, 1\}$

Figure: the feasible solutions, with the optimal one marked.

Remarks on Integer Programming
Maximizing the cost function can be done by setting C' = -C

Integer programming is NP-complete.

In practice, running times can increase exponentially with the size of the problem, but problems of some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.

IP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.

8 - ��

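Integer Linear Program for Partitioning (1)
Binary variables $x_{i,k}$

$x_{i,k} = 1$: object $o_i$ in block $p_k$

$x_{i,k} = 0$: object $o_i$ not in block $p_k$

Cost $c_{i,k}$, if object $o_i$ is in block $p_k$

Integer linear program:

$x_{i,k} \in \{0, 1\}, \quad 1 \le i \le n,\ 1 \le k \le m$

$\sum_{k=1}^{m} x_{i,k} = 1, \quad 1 \le i \le n$

minimize $\sum_{k=1}^{m} \sum_{i=1}^{n} x_{i,k}\, c_{i,k}, \quad 1 \le k \le m,\ 1 \le i \le n$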

8 - ��

Integer Linear Program for Partitioning (2)
Additional constraints

example: maximum number of $h_k$ objects in block $k$

$\sum_{i=1}^{n} x_{i,k} \le h_k, \quad 1 \le k \le m$

The idea of mapping the synthesis problem to an ILP is very popular:

Scheduling can be integrated.
Various additional constraints can be added.
If not solving to optimality, run times are acceptable and a solution with a guaranteed quality can be determined.
Finding the right equations to model the constraints is an art.
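The partitioning program can be illustrated by enumerating assignments directly. In this sketch an assignment vector (one block index per object) replaces the binary variables, so the "exactly one block per object" constraint holds by construction; the cost matrix and the capacities are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple


def best_partition(cost: List[List[float]], capacity: List[int]
                   ) -> Optional[Tuple[float, Tuple[int, ...]]]:
    """cost[i][k]: cost of object i in block k; capacity[k]: max objects in block k."""
    n, m = len(cost), len(cost[0])
    best = None
    for assign in product(range(m), repeat=n):
        # Additional constraint: at most capacity[k] objects in block k.
        if any(assign.count(k) > capacity[k] for k in range(m)):
            continue
        total = sum(cost[i][assign[i]] for i in range(n))
        if best is None or total < best[0]:
            best = (total, assign)
    return best


cost_matrix = [[1, 4], [2, 3], [5, 1]]
unconstrained = best_partition(cost_matrix, capacity=[3, 3])
constrained = best_partition(cost_matrix, capacity=[1, 2])
```

Tightening the capacity of block 0 forces one object into a more expensive block, which shows how additional constraints change the optimum.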

8 - ��

Constructive Methods
Random mapping

each object is assigned to a block randomly

Hierarchical clustering
stepwise grouping of objects
closeness function determines how desirable it is to group two objects

Constructive methods
are often used to generate a starting partition for iterative methods
show the difficulty of finding proper closeness functions

8 - 14

Hierarchical Clustering - Example (1)

2010

10�

4 �

v1

v3v2

v4

v5 = v1 v3

10

7

4 v4

v5

v2

closeness function: arithmetic mean of weights

8 - 1�

Hierarchical Clustering - Example (�)

v�= v2 v5

5�5

v4

v�10

7

4 v4

v5

v2

8 - 1�


v7 = v� v4

v75�5

v4

v�

8 - 1�


v7 = v� v4

v4

v�= v2 v5

v5 = v1 v3

v1 v2 v3

ste� �:

ste� �:

ste� �:

cut lines��artitions�

8 - 18

Iterative Methods - Kernighan-Lin (1)
Simple greedy heuristic:

Until there is no improvement in cost: re-group a pair of objects which leads to the largest gain in cost

v�

v2

v4v5

v7

v1

v3v�

v�

e�am�le: cost � num�er of e�ges crossing the �artitions�efore re�grou�: � � after re�grou�: � � gain � �
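The simple greedy heuristic can be sketched for a two-block partition, with cost equal to the number of edges crossing the partition. This is the plain greedy variant, not the improved Kernighan-Lin algorithm; the graph and starting partition are made up for illustration.

```python
from typing import Dict, List, Tuple


def cut_size(edges: List[Tuple[int, int]], side: Dict[int, int]) -> int:
    """Cost: number of edges whose end points lie in different blocks."""
    return sum(side[u] != side[v] for u, v in edges)


def greedy_regroup(edges: List[Tuple[int, int]], side: Dict[int, int]) -> Dict[int, int]:
    """Repeatedly swap the pair of objects with the largest gain until no swap improves the cost."""
    side = dict(side)
    while True:
        current = cut_size(edges, side)
        best_gain, best_pair = 0, None
        block0 = [v for v in side if side[v] == 0]
        block1 = [v for v in side if side[v] == 1]
        for u in block0:
            for v in block1:
                side[u], side[v] = 1, 0          # try the swap
                gain = current - cut_size(edges, side)
                side[u], side[v] = 0, 1          # undo it
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            return side
        u, v = best_pair
        side[u], side[v] = 1, 0


# Two triangles joined by one edge; nodes 3 and 4 start in the wrong blocks.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
start = {1: 0, 2: 0, 4: 0, 3: 1, 5: 1, 6: 1}
result = greedy_regroup(edges, start)
```

Because only improving swaps are accepted, this loop stops at the first local minimum it reaches.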

8 - 1�

Iterative Methods - Kernighan-Lin (2)
Problem

Simple greedy heuristic can get stuck in a local minimum.

Improved algorithm (Kernighan-Lin):
as long as a better partition is found:

• from all possible pairs of objects, virtually re-group the "best" (lowest cost of the resulting partition); then from the remaining not yet touched objects virtually re-group the "best" pair, etc., until all objects have been re-grouped

• from these n/2 partitions take the one with smallest cost and actually perform the corresponding re-group operations

8 - ��

Iterative Methods - Simulated Annealing
From physics:

metal and gas take on a minimal-energy state during cooling down (under certain constraints):

• at each temperature, the system reaches a thermodynamic equilibrium
• the temperature is decreased sufficiently slowly

Probability that a particle "jumps" to a higher-energy state:

$P(e_i, e_{i+1}, T) = e^{\frac{e_i - e_{i+1}}{k_B T}}$

Application to combinatorial optimization:
energy = cost of a solution (partition)
cost decreases with temperature; sometimes (with a certain probability) increases in cost are accepted

8 - �1

Iterative Methods - Simulated Annealing

temp = temp_start;
cost = c(P);
while (Frozen() == FALSE) {
    while (Equilibrium() == FALSE) {
        P' = RandomMove(P);
        cost' = c(P');
        deltacost = cost' - cost;
        if (Accept(deltacost, temp) > random[0,1)) {
            P = P';
            cost = cost';
        }
    }
    temp = DecreaseTemp(temp);
}

$\mathrm{Accept}(deltacost, temp) = e^{-\frac{deltacost}{k \cdot temp}}$

8 - 22

Iterative Methods - Simulated Annealing
Cooling Down: DecreaseTemp(), Frozen()

• temp_start = 1.0
• temp = α • temp (typical: 0.8 ≤ α ≤ 0.99)
• terminate when temp < temp_min or there is no more improvement

Equilibrium: Equilibrium()
• after defined number of iterations or when there is no more improvement

Complexity
from exponential to constant, depending on the implementation of the functions Equilibrium(), DecreaseTemp(), and Frozen()
the longer the runtime, the better the quality of results
typical: construct functions to get polynomial runtimes
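The annealing loop and the cooling rules can be put together in a short runnable sketch. Several choices here are assumptions made for illustration: a fixed number of moves per temperature stands in for Equilibrium(), the temp_min test stands in for Frozen(), the best solution seen is tracked in addition to the current one, and the toy problem at the end is made up.

```python
import math
import random
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")


def accept(deltacost: float, temp: float, k: float = 1.0) -> float:
    """Acceptance value exp(-deltacost / (k * temp)), capped at 1 for improvements."""
    if deltacost <= 0:
        return 1.0
    return math.exp(-deltacost / (k * temp))


def simulated_annealing(start: S,
                        cost: Callable[[S], float],
                        random_move: Callable[[S, random.Random], S],
                        temp_start: float = 1.0,
                        temp_min: float = 0.01,
                        alpha: float = 0.9,
                        moves_per_temp: int = 50,
                        rng: Optional[random.Random] = None) -> Tuple[S, float]:
    rng = rng or random.Random()
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = temp_start
    while temp >= temp_min:                      # stands in for Frozen()
        for _ in range(moves_per_temp):          # stands in for Equilibrium()
            candidate = random_move(current, rng)
            candidate_cost = cost(candidate)
            if accept(candidate_cost - current_cost, temp) > rng.random():
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp = alpha * temp                      # DecreaseTemp()
    return best, best_cost


# Toy problem: minimize (x - 3)^2 over the integers, moving one step at a time.
def toy_cost(x: int) -> float:
    return float((x - 3) ** 2)


def toy_move(x: int, rng: random.Random) -> int:
    return x + rng.choice((-1, 1))


best, best_cost = simulated_annealing(20, toy_cost, toy_move, rng=random.Random(0))
```

With these settings the runtime is polynomial: the number of temperature levels is fixed by alpha and temp_min, and each level performs a constant number of moves.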

�- �

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

9. Allocation

doc. dr. Gregor Papa

�- 2

Integer programming models
Ingredients:

Cost function
Constraints

Involving linear expressions of integer variables from a set X

Cost function: $C = \sum_{x_i \in X} a_i x_i$ with $a_i \in \mathbb{R},\ x_i \in \mathbb{N}$ (1)

Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j}\, x_i \ge c_j$ with $b_{i,j}, c_j \in \mathbb{R}$ (2)

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer (linear) programming (ILP) problem.

If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer (linear) programming problem.

�- �

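Example

minimize: $C = 5x_1 + 6x_2 + 4x_3$

subject to: $x_1 + x_2 + x_3 \ge 2$, $\quad x_1, x_2, x_3 \in \{0, 1\}$

Figure: the feasible solutions, with the optimal one marked.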

�- �

Remarks on integer programming

Maximizing the cost function: just set C' = -C
Integer programming is NP-complete.
Running times depend exponentially on problem size, but problems of >1000 vars are solvable with a good solver (depending on the size and structure of the problem).
The case of $x_i \in \mathbb{R}$ is called linear programming (LP).
LP has polynomial complexity, but most algorithms are exponential; in practice they are still faster than for ILP problems.
The case of some $x_i \in \mathbb{R}$ and some $x_i \in \mathbb{N}$ is called mixed integer-linear programming.
ILP/LP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.

�- �

Simulated Annealing

General method for solving combinatorial optimization problems.

Based on the model of slowly cooling crystal liquids.

Some configuration is subject to changes.

Special property of simulated annealing: changes leading to a poorer configuration (with respect to some cost function) are accepted with a certain probability.

This probability is controlled by a temperature parameter: the probability is smaller for smaller temperatures.

�- �

Explanation
Initially, some random initial configuration is created.
Current temperature is set to a large value.
Outer loop:
• Temperature is reduced for each iteration
• Terminated if (temperature ≤ lower limit) or (number of iterations ≥ upper limit).
Inner loop: For each iteration:
• New configuration generated from current configuration
• Accepted if (new cost ≤ cost of current configuration)
• Accepted with temperature-dependent probability if (cost of new config. > cost of current configuration).

�- �

Multiobjective Optimization
Maximize (y1, y2, …, yk) = ƒ(x1, x2, …, xn)

y2

y1

worse

better

incomparable

incomparable

y2

y1

Pareto optimal = not dominated

dominated

Pareto set = set of all Pareto-optimal solutions

9 - 8

Summary
Single objective optimization methods
decision is performed during optimization
Examples: integer programming, simulated annealing

Multiple objective optimization methods
decision is done after optimization
Example: Evolutionary algorithms
Refer to publications of Thiele or Schwefel et al. for more information

Concept of Pareto points
eliminates large set of non-relevant design points
allows separating optimization and decision

9 - 9

Improve predictability for caches
Loop caches
Mapping code to less used part(s) of the index space
Cache locking/freezing
Changing the memory allocation for code or data
Mapping pieces of software to specific ways
Methods:

- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
"Caches behave almost like a scratch pad"

9 - ��

Summary

Allocation strategies for SPM
• Dynamic sets of processes
• Multiprocessors
• MMUs
• Sharing between SPMs in a multi-processor

Optimizations for Caches
• Code Layout transformations
• Way prediction

��- �

Hardware/Software Codesign

Jožef Stefan International Postgraduate School

10. Code optimization

doc. dr. Gregor Papa

��- �

Task-level concurrency management

Granularity: size of tasks (e.g. in instructions)
Readable specifications and efficient implementations can possibly require different task structures.

Granularity changes

��- �

Merging of tasks

Reduced overhead of context switches,
More global optimization of machine code,
Reduced overhead for inter-process/task communication.

��- �

Splitting of tasks

No blocking of resources while waiting for input,
more flexibility for scheduling, possibly improved result.

��- �

Merging and splitting of tasks

The most appropriate task graph granularity depends upon the context; merging and splitting may be required.
Merging and splitting of tasks should be done automatically, depending upon the context.

��- �

system��am��e �

��- �

Attributes of a system that needs rewriting

Tasks blocking after they have already started running

��- 8

Work by Cortadella et al.
1. Transform each of the tasks into a Petri net,
2. Generate one global Petri net from the nets of the tasks,
3. Partition global net into "sequences of transitions",
4. Generate one task from each such sequence

Mature, commercial approach not yet available

��- 9

�esu�t� as �u��s�e� �y �orta�e��aReads only at the beginning

�nitialization task

�lways true

�evertrue

��- ��

��t�m��e� �ers�o� o� ��

Tin () �RE�� (��, sample, 1)�sum �= sample� i��T� = sample� d = ��T��: �� (i < �) retur��T� = sum�� d = ��T��d = d�c� � R�TE(��T,d,1)�sum = �� i = ��retur��lways true

j==i-1j i

�ever true

��- ��

�as��e�e� �o��urre��y ma�a�eme�t ��

The dynamic behavior of applications is getting more attention.
Energy consumption reduction is the main target.
Some classes of applications (e.g. video processing) have a considerable variation in processing power requirements depending on input data.

Static design-time methods are becoming insufficient.
Runtime-only methods are not feasible for embedded systems.

How about mixed approaches?

��- ��

��am��e o� a m��e� ��

�� e��um� �tt��me��e��

Static (compile-time) methods can ensure WCET feasible schedules, but waste energy in the average case.

Mixed methods use compile-time analysis to define a set of possible execution parameters for each task.

Runtime scheduler selects the most energy saving, deadline preserving combination.

…or they can define a probability for violating the deadline.

Figure: timelines of Task1, Task2, and Task3 against the deadline for each of the cases above.

��- ��

Floating-point to fixed-point conversion

Pros:
Lower cost
Faster
Lower power consumption
Sufficient SQNR, if properly scaled
Suitable for portable applications

Cons:
Decreased dynamic range
Finite word-length effect, unless properly scaled
• Overflow and excessive quantization noise
Extra programming effort

© Ki-Il Kum, et al. (Seoul �ational �niversity): � �loating-point To �ixed-point C Converter �or �ixed-point �igital Signal Processors, 2nd S�� orkshop, 1��

��- ��

��e��Po��t �ata �ormat

S 1 � � . . . � � � � 1 �

hypothetical binary point

�� =�

S 1 � � . . . � � � � 1 �

(a) �nteger

(b) �ixed-Point

��

© Ki-Il Kum, et al

�loating-Point vs. �ixed-Point�loating-Point vs. �ixed-Point �nteger vs. �ixed-Point�nteger vs. �ixed-Point

exponent, mantissa�loating-Point

� automatic computation and update of each exponent at run-time

�ixed-Point� implicit exponent� determined off-line

exponent, mantissa�loating-Point

� automatic computation and update of each exponent at run-time

�ixed-Point� implicit exponent� determined off-line

��- ��

�ss��me�t a�� t�o��Su�tra�t�o�

�ssume y = x, with- x (�� =2) and- y (�� =�):

s

s

�

��

y

s

�et result = x � y:e�ualizing each ��

sy

sresu�t

�

© Ki-Il Kum, et al

s

�

��

s

��- ��

�u�t��at�o�

�ssume result = x � y, with

- x (�� =2) and- y (�� =�)- -� result (�� =2��) s

�

� y

s

s

resu�t

© Ki-Il Kum, et al

s

s

��- ��

�e�e�o�me�t Pro�e�ure

�a��e �st�mat�o�� Pro�ram

��

��oat��Po��t� Pro�ram

��e��Po��t� Pro�ram

��-��

��-��

��

�a�ua�s�e��at�o�

��

© Ki-Il Kum, et al

Range Estimator

[Figure: Floating-Point C Program -> C pre-processor -> C front-end -> ID assignment -> subroutine call insertion -> SUIF-to-C converter -> Range Estimation C Program -> IWL information]

Range Estimation C Program:

    float iir1(float x)
    {
        static float s = 0;
        float y;

        y = 0.9 * s + x;
        range(y, 0);
        s = y;
        range(s, 1);

        return y;
    }

© Ki-Il Kum, et al

Operations in fixed point program

[Figure: operands of the fixed-point computation: the scaled constant 0.9 x 2^15, s and x with their IWLs, the overflow condition and the result]

Floating-Point to Fixed-Point Program Converter

Fixed-Point C Program:

    int iir1(int x)
    {
        static int s = 0;
        int y;
        y = sll(mulh(29491, s) + (x >> 5), 1);
        s = y;
        return y;
    }

mulh
- to access the upper half of the multiplied result
- target dependent implementation

sll
- to remove 2nd sign bit
- opt. overflow check

© Ki-Il Kum, et al
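The two operations named above can be approximated in portable C to make the generated code easier to follow. The definitions below are a sketch under assumptions: a 16-bit word, `mulh` and `sll` written as ordinary functions (a real DSP tool chain would map them to target instructions), and arithmetic right shift for negative values, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Assumed stand-in for the target-dependent mulh: upper half of the
 * 32-bit product of two 16-bit words. */
static int16_t mulh(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Assumed stand-in for sll: shift left by n bits. Written as a
 * multiplication to avoid shifting a negative value; the optional
 * overflow check mentioned on the slide is omitted. */
static int16_t sll(int16_t v, int n)
{
    return (int16_t)((int32_t)v * (1 << n));
}

/* Fixed-point filter from the slide, using 16-bit words.
 * 29491 is 0.9 scaled by 2^15. */
static int16_t iir1(int16_t x)
{
    static int16_t s = 0;
    int16_t y = sll((int16_t)(mulh(29491, s) + (x >> 5)), 1);
    s = y;
    return y;
}
```

Multiplying two signed 16-bit words leaves two sign bits in the upper half of the product, which is why the result is shifted left by one afterwards.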

Performance Comparison - Machine Cycles

[Figure: bar chart of cycle counts for a fourth order IIR filter, Fixed-Point (16b) vs. Floating-Point]

© Ki-Il Kum, et al

Performance Comparison - Machine Cycles

[Figure: bar chart of cycle counts for ADPCM, Fixed-Point (16b), Fixed-Point (32b) and Floating-Point]

© Ki-Il Kum, et al

Performance Comparison - SNR

[Figure: SNR (dB) for ADPCM, cases A to D, Fixed-Point (16b), Fixed-Point (32b) and Floating-Point]

© Ki-Il Kum, et al

Impact of memory allocation on efficiency

Array p[j][k]

[Figure: memory layout of the array in row major order (C) and in column major order (FORTRAN)]

Best performance if innermost loop corresponds to rightmost array index

Two loops, assuming row major order (C):

    for (k=0; k<=m; k++)            for (j=0; j<=n; j++)
      for (j=0; j<=n; j++)            for (k=0; k<=m; k++)
        p[j][k] = ...                   p[j][k] = ...

Same behavior for homogeneous memory access, but for row major order:
- left version (j innermost): poor cache behavior
- right version (k innermost): good cache behavior

-> memory architecture dependent optimization
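The reason behind the rule above is the address computation of the two layouts, which can be sketched as follows. The dimensions `ROWS` and `COLS` and the function names are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 8 };   /* hypothetical dimensions of p[ROWS][COLS] */

/* Element offset of p[j][k] in row major order (C):
 * consecutive k values are adjacent in memory. */
static size_t row_major_offset(size_t j, size_t k)
{
    return j * COLS + k;
}

/* Element offset of the same element in column major order (FORTRAN):
 * consecutive j values are adjacent in memory. */
static size_t col_major_offset(size_t j, size_t k)
{
    return k * ROWS + j;
}
```

In row major order, stepping k moves by one element while stepping j jumps by a whole row, so an innermost k loop touches consecutive addresses.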

Program transformation "Loop interchange"

(SUIF interchanges array indexes instead of loops)

Improved locality

Example:

    ...
    #define iter 400000
    int a[20][20][20];

    void computeijk() {
      int i,j,k;
      for (i = 0; i < 20; i++) {
        for (j = 0; j < 20; j++) {
          for (k = 0; k < 20; k++) {
            a[i][j][k] += a[i][j][k];
          }}}}

    void computeikj() {
      int i,j,k;
      for (i = 0; i < 20; i++) {
        for (j = 0; j < 20; j++) {
          for (k = 0; k < 20; k++) {
            a[i][k][j] += a[i][k][j];
          }}}}
    ...
    start=time(&start);
    for(z=0;z<iter;z++)
      computeijk();
    end=time(&end);
    printf("ijk=%16.9f\n",1.0*difftime(end,start));

Strong influence of the memory architecture

Loop structure: i j k

[Figure: time [s] per loop structure for several processors, including Intel Pentium and Sun SPARC, with the reduction to [%]]

Dramatic impact of locality

Not always the same impact ..

[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2002]

Loop fusion (merging), loop fission

    for (j=0; j<=n; j++)            for (j=0; j<=n; j++)
      p[j] = ... ;                    { p[j] = ... ;
    for (j=0; j<=n; j++)                p[j] = p[j] + ... }
      p[j] = p[j] + ...

Left (two loops): loops small enough to allow zero overhead loops.
Right (one loop): better locality for access to p; better chances for parallel execution.

Which of the two versions is best?
Architecture-aware compiler should select best version.

Example: simple loops

    #define size 30
    #define iter 40000
    int a[size][size];
    float b[size][size];

    void ss1() {
      int i,j;
      for (i=0;i<size;i++) {
        for (j=0;j<size;j++) {
          a[i][j] += 17; }}
      for (i=0;i<size;i++) {
        for (j=0;j<size;j++) {
          b[i][j] -= 13; }}}

    void ms1() {
      int i,j;
      for (i=0;i<size;i++) {
        for (j=0;j<size;j++) {
          a[i][j] += 17; }
        for (j=0;j<size;j++) {
          b[i][j] -= 13; }}}

    void mm1() {
      int i,j;
      for (i=0;i<size;i++) {
        for (j=0;j<size;j++) {
          a[i][j] += 17;
          b[i][j] -= 13; }}}

Results: simple loops

[Figure: runtime of ss1, ms1 and mm1 per platform (x86 and Sparc with different gcc versions and optimization levels), 100% = max]

Merged loops superior; except Sparc with -o3

Loop unrolling

    for (j=0; j<=n; j++)
      p[j] = ... ;

    for (j=0; j<=n; j+=2)
      { p[j] = ... ; p[j+1] = ... }

factor = 2

- Better locality for access to p.
- Fewer branches per execution of the loop.
- More opportunities for optimizations.
- Tradeoff between code size and improvement.
- Extreme case: completely unrolled loop (no branch)
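A runnable variant of unrolling by a factor of 2 is sketched below. The loop body (scaling by a constant) and the function names are assumptions for illustration; the extra statement after the loop handles the leftover element when the trip count is odd, a detail the schematic loop above leaves out.

```c
#include <stddef.h>

/* Rolled reference version: one loop branch per element. */
static void scale_rolled(int *p, size_t n, int c)
{
    for (size_t j = 0; j < n; j++)
        p[j] *= c;
}

/* Unrolled by a factor of 2: one loop branch per two elements. */
static void scale_unrolled2(int *p, size_t n, int c)
{
    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        p[j]     *= c;
        p[j + 1] *= c;
    }
    if (j < n)              /* leftover element when n is odd */
        p[j] *= c;
}
```

Both functions compute the same result; the unrolled one trades code size for fewer branches.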

Example: matrixmult

    #define s 30
    #define iter 4000
    int a[s][s],b[s][s],c[s][s];

    void compute(){
      int i,j,k;
      for(i=0;i<s;i++){
        for(j=0;j<s;j++){
          for(k=0;k<s;k++){
            c[i][k]+=a[i][j]*b[j][k];
          }}}}

    extern void compute2(){
      int i, j, k;
      for (i = 0; i < 30; i++) {
        for (j = 0; j < 30; j++) {
          for (k = 0; k <= 28; k += 2){
            {int *suif_tmp;
             suif_tmp = &c[i][k];
             *suif_tmp=*suif_tmp+a[i][j]*b[j][k];}
            {int *suif_tmp;
             suif_tmp=&c[i][k+1];
             *suif_tmp=*suif_tmp+a[i][j]*b[j][k+1];}
          }}}
      return;}

Results

[Figure: runtime vs. unrolling factor for several processors, including Intel Pentium and Sun SPARC]

Benefits quite small; penalties may be large

Results: benefits for loop dependences

Small benefits

[Figure: runtime vs. unrolling factor]

    #define s 50
    #define iter 150000
    int a[s][s], b[s][s];

    void compute() {
      int i,k;
      for (i = 0; i < s; i++) {
        for (k = 1; k < s; k++) {
          a[i][k] = b[i][k];
          b[i][k] = a[i][k-1];
        }}}

Loop tiling/loop blocking: original version

    for (i=1; i<=N; i++)
      for (k=1; k<=N; k++) {
        r=X[i,k]; /* to be allocated to a register */
        for (j=1; j<=N; j++)
          Z[i,j] += r* Y[k,j]
      }

Never reusing information in the cache for Y and Z if N is large or cache is small (2 N^3 references for Z).

[Figure: access patterns of the i, k and j loops on X, Y and Z]

Loop tiling/loop blocking: tiled version

    for (kk=1; kk<=N; kk+=B)
      for (jj=1; jj<=N; jj+=B)
        for (i=1; i<=N; i++)
          for (k=kk; k<=min(kk+B-1,N); k++) {
            r=X[i][k]; /* to be allocated to a register */
            for (j=jj; j<=min(jj+B-1,N); j++)
              Z[i][j] += r* Y[k][j]
          }

Reuse factor of B for Z, N for Y
O(N^3/B) accesses to main memory

Same elements for next iteration of i

Compiler should select best option

Monica Lam: The Cache Performance and Optimization of Blocked Algorithms, ASPLOS, 1991
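The two loop nests can be compared directly in C. The sketch below uses 0-based indices, and the sizes `N = 6` and `B = 4` are hypothetical values picked so that the block size does not divide the matrix size and the `min` bounds are exercised.

```c
enum { N = 6, B = 4 };   /* hypothetical matrix size and block size */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop nest: Z += X * Y. */
static void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = X[i][k];            /* candidate for a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled loop nest: a B x B block of Y is reused for every i
 * before the next block is touched. */
static void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min_int(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min_int(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Tiling only reorders the additions into Z, so both versions produce identical results; any difference is in the memory access pattern.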

Example

In practice, results by Buchwald are disappointing. One of the few cases where an improvement was achieved:
Source: similar to matrix mult.

[Figure: runtime vs. tiling factor on SPARC and Pentium]

Summary

Task concurrency management
- Re-partitioning of computations into tasks
- Dynamic exploitation of slack

Floating-point to fixed-point conversion
- Range estimation
- Conversion
- Analysis of the results

High-level loop transformations
- Fusion
- Unrolling
- Tiling

Transformation "Loop nest splitting"

Example: Separation of margin handling

many if-statements for margin-checking

no checking, efficient

only few margin elements to be processed

��- ��

if (x�=1��y�=1�)for (� y�� y��)for (k=�� k�� k��)

for (l=�� l��l�� )for (i=�� i�� i��)for (j=�� j��j��) �then�block�1� then�block�2�

else �y1=��y�for (k=�� k�� k��) �x2=x1�k-��for (l=�� l�� ) �y2=y1�l-��for (i=�� i�� i��) �x�=x1�i� x�=x2�i�for (j=�� j��j��) �y�=y1�j� y�=y2�j�if (� �� x� �� y�)then-block-1� else else-block-1�if (x�� x��y��y�)then�block�2� else else�block�2�

��

Loop nest from MPEG-4 full search motion estimation

for (z=�� z�2�� z��)for (x=�� x�� x��) �x1=��x�for (y=�� y�� y��) �y1=��y�for (k=�� k�� k��) �x2=x1�k-��for (l=�� l�� ) �y2=y1�l-��for (i=�� i�� i��) �x�=x1�i� x�=x2�i�for (j=�� j��j��) �y�=y1�j� y�=y2�j�if (x�� x��y��y�)then�block�1� else else�block�1�if (x�� x��y��y�)then�block�2� else else�block�2�

��

for (z=�� z�2�� z��)for (x=�� x�� x��) �x1=��x�for (y=�� y�� y��)

analysis of polyhedral domains, selection with genetic algorithm

[H. Falk et al., Inf 12, UniDo, 2002]

Results for loop nest splitting - execution times

[Figure: execution times after loop nest splitting per platform (including Sun, Pentium, PowerPC, TriMedia, ARM and the average) for the benchmarks Cavity, Motion Estimation and QSDPCM]

[H. Falk et al., Inf 12, UniDo, 2002]

Results for loop nest splitting - code sizes

[Figure: code sizes after loop nest splitting per platform (including Sun, Pentium, PowerPC, TriMedia, ARM and the average) for the benchmarks Cavity, Motion Estimation and QSDPCM]

[Falk, 2002]

Array folding

[Figure: initial arrays]

[Figure: unfolded arrays]

[Figure: inter-array folding and intra-array folding]

Application

- Array folding is implemented in the DTSE optimization proposed by IMEC. Array folding adds div and mod ops. Optimizations required to remove these costly operations.
- At IMEC, ADOPT address optimizations perform this task. For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.
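The replacement of a modulo operation by an index that is incremented and reset can be illustrated with a small sketch. It is not the ADOPT implementation; the folded buffer size `FOLD` and both function names are assumptions for this example.

```c
#include <stddef.h>

enum { FOLD = 4 };   /* hypothetical size of the folded array */

/* Folded access with an explicit modulo in every iteration. */
static long sum_with_mod(const int *buf, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i % FOLD];
    return sum;
}

/* Same access pattern without modulo: the index is incremented and
 * reset to 0 when it reaches the folded size. */
static long sum_with_wrap(const int *buf, size_t n)
{
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[idx];
        if (++idx == FOLD)
            idx = 0;
    }
    return sum;
}
```

The second version replaces a division-class operation per access by an increment and a compare, which is the kind of saving the address optimization aims at.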

Results (Mcycles for cavity benchmark)

[Figure: cycle counts on several processors (including Pentium, MIPS and TriMedia) for the initial code and for the code after DTSE and/or ADOPT]

ADOPT & DTSE required to achieve real benefit

[C. Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]

10 - 48

Code adaptation

- porting the description from ANSI-C to Handel-C
  - VHDL requires considerably more changes
- the C description of the algorithm has to be adapted appropriately before hardware implementation
  - SystemC and Handel-C contain only a subset of the statements of ordinary C
  - floating-point arithmetic, which hardware implementations generally do not support, has to be realized differently
    • it takes up too many of the available resources
    • it lowers the operating frequency
  - insertion of statements for parallel execution of code sections
  - adjustment of the size of all variables

10 - 4�

Adapting the program code: replacement for floating-point arithmetic

- use of fixed point
- use of integer values (smaller unit of measure)

Fixed-point values are multiplied and represented as integer values

si� � �62�� si� � �.62�

The integer part and the fractional part are represented as the upper and lower part of an integer variable

signed int � var�, var2�signed int �6 si��

si� � 0x0�a0� ��si� � �.62�var� � si�[��:�]� �� var� � 0x0� � �var2 � si�[�:0]� �� var2 � 0xa0 � �60

10 - �0

Adapting the program code: statements for parallel execution of code sections

- statement par instead of seq
  • where possible, depending on the loop body

for �i � 0� i �� 3� i��

a[i] � b[2�i]��

se��a[i] � b[2�i]�a[i] � a[i] � c[i]�b[2�i] � a[i]�

�

par �i � 0� i �� 3� i��

a[i] � b[2�i]��

se��

par�

a[i] � b[2�i]�a[i] � a[i] � c[i]�

�b[2�i] � a[i]�

�

10 - �1

Adapting the program code: adjusting the size of all variables

- all sizes must be defined in advance
  • they should be minimized to reduce resource usage
- signed/unsigned must be decided in advance
- when computing with variables of different sizes
  • use of the concatenation operator: the missing bits are added to the smaller variable
  • use of the lower bits of the larger variable

[signed | unsigned] int n // n-bit

unsigned int �6 var�, var3�unsigned int � var2, var��

var3 � var� � �� var2�var� � var��var2�

11 - 1

Hardware/Software Codesign

11. Compilation

11 - �

Compilers for embedded systems: Why are compilers an issue?

Many reports about low efficiency of standard compilers
- Special features of embedded processors have to be exploited.
- High levels of optimization more important than compilation speed.
- Compilers can help to reduce the energy consumption.
- Compilers could help to meet real-time constraints.

Less legacy problems than for PCs.
- There is a large variety of instruction sets.
- Design space exploration for optimized processors makes sense

11 - �

3 key problems for future memory systems

[Figure: energy and access times of memories]

1. (Average) Speed
2. Energy/Power
3. Predictability

11 - 4

Same as optimization for high performance?

int a[�000]�c � a�for �i � �� i � �00� i�� b �� c� b �� c�� c ��


LD� r3, [r2, �0]ADD r3,r0,r3M�V r0,�2�LD� r0, [r2, r0]ADD r0,r3,r0ADD r2,r2,��ADD r�,r�,��CMP r�,��00�LT LL3

ADD r3,r0,r2M�V r0,�2�M�V r2,r�2M�V r�2,r��M�V r��,rr�0M�V r0,r�M�V r�,r�M�V r�,r�LD� r�, [r�, r0]ADD r0,r3,r�ADD r�,r�,��ADD r�,r�,��CMP r�,��00�LT LL3

�� le��

�� le��

No!
• High performance if the available memory bandwidth is fully used; low energy consumption if memories are at stand-by mode
• Reduced energy if more values are kept in registers

11 - �

Compiler optimizations for improving energy efficiency

- Energy-aware scheduling
- Energy-aware instruction selection
- Operator strength reduction: e.g. replace * by + and <<
- Minimize the bitwidth of loads and stores
- Standard compiler optimizations with energy as a cost function

E.g.: Register pipelining:

    for i:= 0 to 10 do
      C:= 2 * a[i] + a[i-1];

    R2:=a[0];
    for i:= 1 to 10 do
    begin
      R1:= a[i];
      C:= 2 * R1 + R2;
      R2 := R1;
    end;

- Exploitation of the memory hierarchy
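Register pipelining can be written in C as shown below. This is a sketch under assumptions: the result of each iteration is stored into an output array `c[]` (the slide writes a scalar C) so that both versions can be compared, and the loops start at index 1 so that `a[i-1]` is always a valid element.

```c
enum { LEN = 11 };   /* elements a[0..10], matching the slide's loop bound */

/* Straightforward version: a[i] and a[i-1] are both loaded
 * from memory in every iteration. */
static void combine_plain(const int a[LEN], int c[LEN])
{
    for (int i = 1; i < LEN; i++)
        c[i] = 2 * a[i] + a[i - 1];
}

/* Register pipelining: the value loaded as a[i] is kept in r2 and
 * reused as a[i-1] in the next iteration, so one load per iteration. */
static void combine_pipelined(const int a[LEN], int c[LEN])
{
    int r2 = a[0];
    for (int i = 1; i < LEN; i++) {
        int r1 = a[i];
        c[i] = 2 * r1 + r2;
        r2 = r1;
    }
}
```

Halving the number of memory reads is what makes this transformation attractive when energy per memory access is the cost function.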

11 - �

Using scratch pad memories (SPM)

[Figure: hierarchy of processor, scratch pad memory (SPM) and main memory; the scratch pad occupies part of the address space (0 ... FFF..) and has no tag memory]

Example: ARM7TDMI cores, well-known for low power consumption

11 - �

Very limited support in ARMcc-based tool flows

1. Use pragma in C-source to allocate to specific section. For example:

    #pragma arm section rwdata = "foo", rodata = "bar"
    int x2 = 5; // in foo (data part of region)
    int const z2[3] = {1,2,3}; // in bar

2. Input scatter loading file to linker for allocating section to specific address range

http://www.arm.com/documentation/Software_Development_Tools/index.html

11 - 8

Global optimization model (Dortmund)

Which memory object (array, loop, etc.) to be stored in SPM?

Non-overlaying ("static") allocation:
Gain $g_k$ and size $s_k$ for each segment $k$. Maximise gain $G = \sum_k g_k$, respecting size of SPM $SSP \geq \sum_k s_k$.
Solution: knapsack algorithm.

Overlaying ("dynamic") allocation:
Moving objects back and forth

[Figure: processor, scratch pad memory (capacity SSP) and main memory; example program with loops (for i, for j, while, repeat), a call, arrays and an int variable as candidate memory objects]
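For the non-overlaying case, the knapsack formulation can be solved with the textbook dynamic program sketched below. The function name, the capacity bound `MAX_CAP` and the integer units for sizes and gains are assumptions of this sketch; the slides do not say which knapsack algorithm was actually used.

```c
#include <string.h>

enum { MAX_CAP = 1024 };   /* hypothetical upper bound on the SPM capacity */

/* 0/1 knapsack by dynamic programming over capacities.
 * gain[k] and size[k] are the gain g_k and the size s_k (>= 0) of
 * segment k. Returns the maximum total gain of a set of segments
 * whose sizes sum to at most ssp, or -1 if ssp is out of range. */
static long spm_knapsack(const long *gain, const int *size, int n, int ssp)
{
    long best[MAX_CAP + 1];

    if (ssp < 0 || ssp > MAX_CAP)
        return -1;
    memset(best, 0, sizeof best);
    for (int k = 0; k < n; k++)
        /* descending capacity: each segment is placed at most once */
        for (int c = ssp; c >= size[k]; c--)
            if (best[c - size[k]] + gain[k] > best[c])
                best[c] = best[c - size[k]] + gain[k];
    return best[ssp];
}
```

For three segments with gains 60, 100, 120 and sizes 10, 20, 30, a scratch pad of capacity 50 is best filled with the second and third segment.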

11 - �

IP representation - migrating functions and variables

Symbols:
$S(var_k)$ = size of variable $k$
$n(var_k)$ = number of accesses to variable $k$
$e(var_k)$ = energy saved per variable access, if $var_k$ is migrated
$E(var_k)$ = energy saved if variable $var_k$ is migrated ($= e(var_k)\, n(var_k)$)
$x(var_k)$ = decision variable, $=1$ if variable $k$ is migrated to SPM, $=0$ otherwise
$K$ = set of variables; similar for functions ($I$)

Integer programming formulation:
Maximize $\sum_{k \in K} x(var_k)\, E(var_k) + \sum_{i \in I} x(F_i)\, E(F_i)$
subject to the constraint
$\sum_{k \in K} S(var_k)\, x(var_k) + \sum_{i \in I} S(F_i)\, x(F_i) \leq SSP$

11 - 10

Reduction in energy and average runtime

[Figure: cycles [x100] and energy [µJ] for Multi_sort (mix of sort algorithms)]

Feasible with standard compiler & postpass optimization

Measured processor / external memory energy + CACTI values for SPM (combined model)

Numbers will change with technology, algorithms remain unchanged.

11 - 11

Allocation of basic blocks

Fine-grained granularity smoothens dependency on the size of the scratch pad.

Requires additional jump instructions to return to "main" memory.

[Figure: basic blocks BB1 and BB2 in main memory with the jump instructions Jump1 to Jump4]

For consecutive basic blocks: statically 2 jumps, but only one is taken

11 - 1�

Allocation of basic blocks, sets of adjacent basic blocks and the stack

Requires generation of additional jumps (special compiler)

[Figure: cycles [x100] and energy [µJ]]

11 - 1�

Savings for memory system energy alone

Combined model for memories

11 - 14

Timing predictability

aiT:
- WCET analysis tool
- support for scratchpad memories by specifying different memory access times
- also features experimental cache analysis for ARM7

11 - 1�

Architectures considered

ARM7TDMI with 3 different memory architectures:

1. Main memory
   LDR-cycles: (CPU,IF,DF)=(3,2,2)
   STR-cycles: (2,2,2)
   * = (1,2,0)
2. Main memory + unified cache
   LDR-cycles: (CPU,IF,DF)=(3,12,6)
   STR-cycles: (2,12,3)
   * = (1,12,0)
3. Main memory + scratch pad
   LDR-cycles: (CPU,IF,DF)=(3,0,2)
   STR-cycles: (2,0,0)
   * = (1,0,0)

11 - 1�

�e��lt� �or ��

References:
• Wehmeyer, Marwedel: Influence of Onchip Scratchpad Memories on WCET: 4th Intl Workshop on worst-case execution time (WCET) analysis, Catania, Sicily, Italy, June 29, 2004
• Second paper on SP/Cache and WCET at DATE, March 2005

[Figure: results using scratchpad vs. using unified cache]

11 - 1�

Multiple scratch pads

11 - 18

Optimization for multiple scratch pads

Minimize

$C = \sum_j e_j \sum_i x_{j,i}\, n_i$

with $e_j$: energy per access to memory $j$, $x_{j,i} = 1$ if object $i$ is mapped to memory $j$, $= 0$ otherwise, and $n_i$: number of accesses to memory object $i$, subject to the constraints:

$\forall j: \sum_i x_{j,i}\, S_i \leq SSP_j$

$\forall i: \sum_j x_{j,i} = 1$

with $S_i$: size of memory object $i$, $SSP_j$: size of memory $j$.

11 - 1�

Considered partitions

11 - �0

�e��lt� �or �art� o� �� oder�de�oder

A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.

"Working set"

11 - �1

Dynamic replacement within scratch pad

[Figure: CPU, SPM and memory]

Effectively results in a kind of compiler-controlled segmentation/paging for SPM

Address assignment within SPM required (paging or segmentation-like)

Reference: Verma, Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS 2004

11 - ��

Scratch pads based on liveness analysis

MO = {A, T1, T2, T3, T4}
SP Size = |A| = |T1| ... = |T4|

Solution: A -> SP & T3 -> SP

[Figure: control flow graph of the example with the memory objects and the inserted spill operations]

11 - ��

Summary

High-level transformations
- Loop nest splitting
- Array folding

Impact of memory architecture on execution times & energy.

The SPM provides
- Runtime efficiency
- Energy efficiency
- Timing predictability

Achieved savings are sometimes dramatic, for example:savings of � �� of the memory system energy

1�- 1

Hardware/Software Codesign

12. Performance Estimation

1�- �

System Design

[Figure: design flow from Specification via System Synthesis to SW-Compilation (Machine Code) and HW-Synthesis (Net lists), with Estimation; further inputs: Instruction Set, Intellectual Prop. Block]


1�- �

Motivation

The values of the objective functions that should guide the design space exploration are obtained through estimation.

Design space exploration intends to change
- mapping (binding and resource sharing)
- architecture (hardware platform)
- application (choice between different algorithms and/or partitioning into concurrent components)

[Figure: Application and Architecture are combined by Mapping, followed by Estimation]

1�- 4

Outline

- Overview
- Performance Metrics
- Subsystems
- Abstraction Levels
- Performance Estimation Methods

1�- �

Performance Estimation - Global Picture

[Figure: overview along four dimensions.
SUBSYSTEM TO ANALYZE: CPU subsystem (CPU, AD, Mem, I/O), SW subsystem (tasks, API), HW subsystem (HW IP, HW interface), interconnect subsystem, MPSoC.
ABSTRACTION LEVEL: high-level (e.g. functional, HLL), intermediary level (e.g. TLM, OS), low-level (e.g. RTL, ISA).
PERFORMANCE ESTIMATION METHOD: statistic, simulation, analytic, e.g. x(y) = x0 * exp(-k0*y) with x0 = 105, k0 = 1.2593.
METRIC: time, cost, area, power, other (quality, SNR, ...).]

Note: RTL - Register Transfer Level; ISA - Instruction Set Architecture; TLM - Transaction-Level Model; OS - Operating System; HLL - High-Level Language

1�- �

Position in the system design flow

High-level (functional) estimation
- Advantages: short simulation time, no details of implementation necessary
- Drawbacks: limited accuracy, e.g. no information about timing

Low-level estimation (closer to the implementation)
- Advantages: higher accuracy
- Drawbacks: long simulation time, many implementation details need to be known

[Figure: design flow from the parallel specification through mapping and partitioning, refinement and compilation to the implementation, with estimation at each stage]

1�- �

��e� o� t�e ��ti�atio�

Prere�uisite for ��

part of the feedback cycle (see global flow)
functional and non-functional validation (e.g. power, energy, timing, memory consumption)

��show e�uivalence of specification and implementationfunctional and non-functional aspects

1�- 8



1�- �



��

Performance Metrics

Performance metric = function defined on relevant non-functional properties of a system which indicates a quantitative performance of the system.

- Time [second]: for example end-to-end delay, throughput, latency
- Power, Energy, Temperature [mW, mJ, °C]: for example power consumed by the network, energy to execute a task, maximal temperature
- Area [mm2]: for example area of an integrated circuit
- Cost [$]: for example cost of parts, labor, development cost
- Other metrics: SNR (signal to noise ratio), quality of the video image/sound, size of the hardware platform

Usually, performance metrics are conflicting.

��

Examples of Performance Trade-offs

Mapping Domain
- change the mapping of the application to the architecture
- see example 1

Architecture Domain
- change the hardware platform
- see example 2

Application Domain
- change the application implementation (e.g. degree of parallelization, partitioning into concurrent processes, use of different algorithms with a similar functional behavior)

��

Ex1: Trade-offs in the Mapping Domain

MPEG mapping optimization (mapping optimization space with two objectives)
- obj1: worst load of computation node
- obj2: worst load of communication node (worst bus load)

[Figure: mapping alternatives plotted over obj1 and obj2]

��

Ex2: Trade-offs in the Hardware Platform

[Figure: hardware platforms arranged between timing performance / energy efficiency and flexibility]
- Application-specific integrated circuits (ASICs)
- Application-specific instruction set processors (ASIPs): microcontroller, DSPs (digital signal processors)
- General-purpose processors
- Programmable hardware: FPGA (field-programmable gate arrays)

��


��



��

System Composition

[Figure: an architecture is composed from scheduling and arbitration templates (proportional share, static, dynamic, fixed priority, ...), communication templates and computation templates]

��

Why is Estimation Difficult?

Computation and communication
- (Non-deterministic) computations in processing nodes
- (Non-deterministic) communication delays
- Complex resource interaction via scheduling and arbitration policies

Cyclic timing dependencies
- Internal data streams interact on computing and communication resources
- Interaction determines stream characteristics

Uncertain environment
- Different load scenarios
- Unknown (worst case) inputs

��

Illustration of Evaluation Difficulties

[Figure: an input stream of events (a b a c c b) enters a buffer and is processed by a task on a processor, subject to task communication and task scheduling]

Complex input:
- timing (jitter, bursts, ...)
- different event types

Variable resource availability

Variable execution demand:
- input (different event types)
- internal state (program, cache, ...)

��

Requirements for Performance Estimation

Estimation should be composable in terms of
- subsystems and their interactions, i.e. HW, SW, interconnect
- computation, communication, and scheduling/arbitration

Estimation should cover different metrics, for example power, energy, delay, memory, throughput

Estimation method should represent a reasonable trade-off between (a) estimation effort in terms of computation/simulation time and set-up time and (b) accuracy

��


��




��

A Brief History in Abstraction

[Figure: successive abstraction steps, each with its model and tools:
- technology level (transistors, layouts): transistor model (t=RC)
- signal level (gate, schematic, RTL): gate level model, register-transfer level model (critical path latency)
- transaction level (SW, HW systems): SW tasks, OS/drivers, CPU core, HW adaptation, on-chip communication network
- token level (SW tasks, comm. backbones, IPs)
- tools: simulators up to SystemC/ISS, SoC HW/SW codes./cosim. tools, formal methods]

��


��




12 - 26

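System-Level Performance Estimation Methods

[Figure: range of a metric (e.g. delay) between best-case and worst-case of the real system, compared with the results of measurement, simulation, probabilistic estimation and worst case (formal) analysis]

- Worst case (formal) analysis: presented later
- Simulation: presented in Lecture 6 (next lecture)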

12 - 2�

Overview

How to evaluate a system?

- Measurements: use existing instance of the system to perform performance measurements.
- Simulation: develop a program which implements a model of the system. Perform experiments by running the program.
- Formal Analysis: develop a mathematical abstraction of the system and derive formulas which describe the system performance.
- Statistics: develop a statistical abstraction of the system and derive statistic performance via analysis or simulation.

12 - 2�


[Figure: the estimation tool (method) takes a system model and a model of environment and produces estimation results; the system model combines a model of application and a model of architecture; inputs are input traces, spec. of inputs, data sheets, platform benchmarks, component simulation and designers' experience]

12 - 2�

Analytic Models

Static analytic (symbolic) models:
- Describe computing, communication, and memory resources by algebraic equations, e.g.
  $\text{delay} = \frac{\#\text{words}}{\text{burst\_size}} \cdot \text{comm\_time}$
- Describe system properties by parameters, e.g. data rate
- Combine relations

Fast and simple estimation
Generally inaccurate modeling, e.g. resource sharing not modeled
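As a small worked instance of such a static model, the transfer delay relation can be evaluated in C. The function name and the rounding of the burst count up to a whole number are assumptions of this sketch; for transfers that are exact multiples of the burst size it reduces to the plain quotient.

```c
/* Static analytic model of a bus transfer:
 * delay = (#words / burst_size) * comm_time.
 * Assumption: a partially filled burst costs a full burst, so the
 * number of bursts is rounded up. burst_size must be non-zero. */
static unsigned long transfer_delay(unsigned long words,
                                    unsigned long burst_size,
                                    unsigned long comm_time)
{
    unsigned long bursts = (words + burst_size - 1) / burst_size;
    return bursts * comm_time;
}
```

For 64 words, a burst size of 8 and 5 time units per burst, the model yields 8 bursts and a delay of 40 time units; effects such as bus contention are not captured.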

12 - ��

Dynamic Analytic Models

Combination between
- static models, possibly extended by non-determinism in run-time and event processing
- dynamic models for describing e.g. resource sharing mechanisms (scheduling and arbitration)

Existing approac�es��-�� t�eory ��(statistical bounds)��-�� orst case�best case be�a�ior)

12 - �1

Example - Queuing Systems

Clients request some service from a server over a network.
- Performance of the server
- Performance of the network

12 - 32

Stochastic Models - Queuing Systems

A queuing system is described by
- Arrival rate
- Service mechanism
- Queuing discipline

Performance measures
- average delay in queue (customer point of view)
- time-average number of customers in queue (system point of view)
- proportion of time server is busy

The classical M/M/1 queuing system: (M = Markovian (exp.) distribution)
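For the M/M/1 system, the three performance measures listed above have well-known closed forms in steady state. The sketch below evaluates them for an arrival rate `lambda` and a service rate `mu`; the struct and function names are chosen for this example only.

```c
/* Steady-state measures of an M/M/1 queue.
 * Valid only for 0 <= lambda < mu (stable queue). */
struct mm1 {
    double utilization;        /* proportion of time the server is busy */
    double mean_queue_delay;   /* average delay in queue */
    double mean_queue_length;  /* time-average number of customers in queue */
};

static struct mm1 mm1_measures(double lambda, double mu)
{
    struct mm1 m;
    m.utilization = lambda / mu;                         /* rho */
    m.mean_queue_delay = m.utilization / (mu - lambda);  /* Wq = rho / (mu - lambda) */
    m.mean_queue_length = lambda * m.mean_queue_delay;   /* Lq = lambda * Wq (Little's law) */
    return m;
}
```

With 2 arrivals and 4 services per time unit, the server is busy half of the time, a customer waits 0.25 time units on average, and 0.5 customers are waiting on average.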

12 - 33

Nondeterministic Models - Queuing Systems

A queuing system is described by
- Arrival function (bounds on arrival times)
- Service functions (bounds on server behavior)
- Resource interaction

Performance measures
- worst-case delay in queue
- worst-case number of customers in queue
- worst-case and best-case end-to-end delay in the system

12 - 3�

Simulation

- Consider the underlying hardware platform and the mapping of the application onto that architecture
- Combine functional simulation and performance data
- Evaluate average-case behavior, for one simulation scenario
- Complex set-up and extensive runtimes ... but accurate results and good debugging possibilities

[Figure: input trace -> model (application, hardware platform, mapping) -> output trace]

12 - 3�

Example: Trace-Based Simulation

Abstract simulation at system-level without timing
- Faster than simulation, but still based on a single input trace

Abstraction
- Application: represented by abstract execution traces (graph of events: read, write, and execute)
- Architecture: represented by "virtual machines" and "virtual channels", including non-functional properties (timing, power, energy)

Steps
- Execution trace determined by functional application simulation
- Extension of the event graph by non-functional properties
- Simulation of the extended model

[Figure: application functional model -> complete trace -> abstract event graph; together with the architecture description -> trace simulation -> estimation results]

e.g. [Lahiri et al., Pimentel et al.]
