0 - 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
0. Organization
doc. dr. Gregor Papa
0 - 2
Overview
Administration
Course synopsis
Introduction and motivation
0 - 3
Organization (1)
Lecture: introductory course + consultations
Exercises: delivered during consultations
Contact: Gregor [email protected]
Web page: http://csd.ijs.si/papa/courses.php
0 - 4
Organization (2)
Course materials:
slide copies, exercise sheets, papers
the slides contain material from Marco Platzner, Peter Marwedel, Lothar Thiele, Frank Vahid, Reinhard Wilhelm
References:
P. Marwedel: Embedded System Design, Springer, 2006.
F. Vahid, T. Givargis: Embedded System Design: A Unified Hardware/Software Introduction, John Wiley & Sons, 2002.
Exam: written seminar + oral, Slovenian or English
0 - 5
Textbook & slides
The course is based
on the book and the slides “Embedded System Design” by Peter Marwedel
on the slides “Hardware/Software Codesign” by Lothar Thiele
0 - 7
Course Synopsis
Different Levels of Model Representation
  Specifications
  Models
  Abstraction Levels
Dealing with Contradictory Constraints
  Exploration
  Simulation
  • Worst-Case Execution Time
  Optimization
Hardware/Software Mapping
  Partitioning
  Scheduling
  Allocation
Software Code Optimizations
  Compilation
Estimation
0 - 8
Benefits? Learn about …
… challenges and approaches in modern system design
… useful optimization methods
… performance estimation of embedded systems
… a current research area
0 - 10
What is HW/SW Codesign?
... integrated design of systems that consist of hardware and software components
Analysis of HW/SW boundaries and interfaces
Evaluation of design alternatives
0 - 11
Hardware/Software Boundaries
General purpose systems (PC, workstation)
  processor design: processor, compiler, operating system
Embedded systems (cell phone, automotive electronics)
  design of specialized processors: processor, compiler, operating system
  system design: processors, dedicated hardware devices
0 - 12
Target Architectures
0 - 13
Why Codesign? (1)
Modern embedded systems require “design” optimization
  many functions, great variability, high flexibility
  heterogeneous target systems
  • processors, ASICs, FPGAs, systems-on-chip, …
  many design goals
  • performance, cost, power consumption, reliability, ...
Advances in formal / automated design methods
  automation on the system level becomes possible
  reduction of cost and time-to-market
0 - 14
Why Codesign? (2)Optimization of the “design process”
classic design co-design
0 - 15
Codesign methodologies
Different Levels of Model Representation
Dealing with Contradictory Constraints
Hardware/Software Mapping
Software Code Optimizations
Estimation
0 - 16
System Design
0 - 17
System Design
0 - 18
Motivation (1)
According to forecasts, the future of IT is characterized by terms such as
  Disappearing computer,
  Ubiquitous computing,
  Pervasive computing,
  Ambient intelligence,
  Post-PC era,
  Cyber-physical systems.
Basic technologies:
  Embedded Systems
  Communication technologies
0 - 19
“Information technology (IT) is on the verge of another revolution. …..networked systems of embedded computers ... have the potential to change radically the way people interact with their environment by linking together a range of devices and sensors that will allow information to be collected, shared, and processed in unprecedented ways. ...The use … throughout society could well dwarf previous milestones in the information revolution.”
Source: Edward A. Lee, UC Berkeley, ARTEMIS Embedded Systems Conference, Graz, 5/2006
Motivation (2)
0 - 20
Embedded Systems & Cyber-Physical Systems
“Dortmund” Definition [Peter Marwedel]:
Information processing systems embedded into a larger product
Berkeley [Edward A. Lee]:
Embedded software is software integrated with physical processes. The technical problem is managing time and concurrency in computational systems.
Definition: Cyber-Physical (cy-phy) Systems (CPS) are integrations of computation with physical processes [Edward Lee, 2006].
0 - 21
Embedded Systems and ubiquitous computing
Ubiquitous computing: Information anytime, anywhere. Embedded systems provide fundamental technology.
Communication Technology
  Optical networking
  Network management
  Distributed applications
  Service provision
  UMTS, DECT, Hiperlan, ATM
Embedded Systems
  Robots
  Control systems
  Feature extraction and recognition
  Sensors/actors
  A/D-converters
Pervasive/Ubiquitous computing
Distributed systems
Embedded web systems
Real-time
Dependability
Quality of service
0 - 22
Growing importance of embedded systems
Spending on GPS units exceeded $100 mln during Thanksgiving week, up 237% from 2006 … More people bought GPS units than bought PCs, NPD found. [www.itfacts.biz, Dec. 6th, 2007]
…, the market for remote home health monitoring is expected to generate $225 mln revenue in 2011, up from less than $70 mln in 2006, according to Parks Associates. [www.itfacts.biz, Sep. 4th, 2007]
According to IDC the identity and access management (IAM) market in Australia and New Zealand (ANZ) … is expected to increase at a compound annual growth rate (CAGR) of 13.1% to reach $189.3 mln by 2012 [www.itfacts.biz, July 26th, 2008].
Accessing the Internet via a mobile device up by 82% in the US, by 49% in Europe, from May 2007 to May 2008 [www.itfacts.biz, July 29th, 2008]
0 - 23
Automotive electronics
Multiple networks
  Body, engine, telematics, media, safety
Multiple processors
  Up to 100
  • 8-bit: door locks, lights, etc.
  • 16-bit: most functions
  • 32-bit: engine control, airbags
Processing where the action is
  Sensors and actuators distributed all over the vehicle
  Networked together
Functions by embedded processing:
  ABS: Anti-lock braking systems
  ESP: Electronic stability control
  Airbags
  Efficient automatic gearboxes
  Theft prevention with smart keys
  Blind-angle alert systems
  ... etc ...
0 - 24
Avionics
Flight control systems,anti-collision systems,pilot information systems,power supply system,flap control system,entertainment system,…
Dependability is of utmost importance.
0 - 25
Railways
Safety features contribute significantly to the total value of trains, and dependability is extremely important
0 - 26
Telecommunication
Mobile phones have been one of the fastest growing markets in recent years
• Multiprocessor
• 8-bit/32-bit for UI
• DSP for signals
• 32-bit in IR port
• 32-bit in Bluetooth
• 8-100 MB of memory
• All custom chips
• Power consumption & battery life depend on software
Base stations
• Massive signal processing
• Several processing tasks per connected mobile phone
• Based on DSPs
• Standard or custom
• 100s of processors
Geo-positioning systems,
Fast Internet connections,
Closed systems for police, ambulances, rescue staff.
0 - 27
Medical systems
For example:
• Artificial eye: several approaches, e.g.:
• Camera attached to glasses; computer worn at belt; output directly connected to the brain, “pioneering work by William Dobelle”. Previously at [www.dobelle.com]
• Translation into sound; claiming much better resolution. [http://www.seeingwithsound.com/etumble.htm]
0 - 28
Extremely Large
Functions requiring computers:
  Radar
  Weapons
  Damage control
  Navigation
  basically everything
Computers:
  Large servers
  1000s of processors
0 - 29
Inside your PC
Custom processors
  Graphics, sound
32-bit processors
  IR, Bluetooth
  Network, WLAN
  Harddisk
  RAID controllers
8-bit processors
  USB
  Keyboard, mouse
0 - 30
Authentication systems
Fingerprint sensors
Access control
Airport security systems
Smartpen®
Smart cards
…
0 - 31
Examples
Consumer electronics
0 - 32
Examples
Industrial automation
0 - 33
Forestry Machines
Networked computer system
  Controlling arms & tools
  Navigating the forest
  Recording the trees harvested
  Crucial to efficient work
Operator panel
  Graphical display
  Touch panel
  Joystick
  Buttons
  Keyboard
“Tough enough to be out in the woods”
© Jakob Engblom
0 - 34
Examples
Integrated cooling, lighting, room reservation, emergency handling, communication
Goal: “Zero-energy building”
Smart buildings
0 - 35
Robotics
“Pipe-climber”
Robot “Johnnie”
Lego Mindstorms
  Standard controller
  • 8-bit processor
  • 64 kB of memory
  Electronics to interface to motors and sensors
0 - 36
EstimationHardware, software and system as a whole suitability
��- �
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
1. Introduction
doc. dr. Gregor Papa
��- �
Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems
��- 3
Embedded systems
Main reason for buying is not information processing
Embedded systems (ES) = information processing systems embedded into a larger product
Examples:
��- �
Embedded systems
[Figure: embedded system, external process, human interface, sensors, actuators]
��- �
Parallel and distributed target platforms
[Figure: automotive example with ACC, ABS, ESP, ASR, engine control, powertrain control]
��- 6
Example: Cell Processor
Cell Processor (CBE) combines a general-purpose architecture core with coprocessing elements which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.
��- �
Communicating Embedded Systems
sensor networks (civil engineering, buildings, environmental monitoring, traffic, emergency situations)
smart products, wearable/ubiquitous computing
��������
��- �
Trends in Information and Communication
[Figure: Centralized Systems, Networked Systems, Internet, Large-scale Distributed Systems, New Applications and System Paradigms]
��- �
Comparison
Embedded Systems
  Few applications that are known at design-time.
  Not programmable by end user.
  Fixed run-time requirements (additional computing power not useful).
  Criteria:
  • cost
  • power consumption
  • predictability
  • meeting time bounds
  • …
General Purpose Computing
  Broad class of applications.
  Programmable by end user.
  Faster is better.
  Criteria:
  • cost
  • average speed
��- �0
��si�n���a���n��s��������������������������������������������
increasing application complexity even in standard and large volume products
• large systems with legacy functions
• mixture of event driven and data flow tasks
• examples: multimedia, automotive, mobile communication
increasing target system complexity
• mixture of different technologies, processor types, and design styles
• large systems-on-a-chip combining components from different sources, distributed system implementations
numerous constraints and design objectives
• examples: cost, power consumption, timing constraints, dependability
��- ��
Challenges for Embedded Software
Dynamic environments
Capture the required behaviour!
Validate specifications
Efficient translation of specifications into implementations!
How can we check that we meet real-time constraints?
How do we validate embedded real-time software? (large volumes of data, testing may be safety-critical)
��- ��
Implementation Alternatives
Performance and Power Efficiency versus Flexibility
Application-specific integrated circuits (ASICs)
Application-specific instruction set processors (ASIPs)
• Microcontroller
• DSPs (digital signal processors)
General-purpose processors
Programmable hardware
• FPGA (field-programmable gate arrays)
��- �3
Dependability
ES must be dependable:
  Reliability R(t) = probability of system working correctly provided that it was working at t=0
  Maintainability M(d) = probability of system working correctly d time units after error occurred
  Availability A(t) = probability of system working at time t
  Safety: no harm to be caused
  Security: confidential and authentic communication
Even perfectly designed systems can fail if the assumptions about the workload and possible errors turn out to be wrong.
Making the system dependable must not be an after-thought, it must be considered from the very beginning.
��- ��
Efficiency
ES must be efficient
  Code-size efficient (especially for systems on a chip)
  Run-time efficient
  Weight efficient
  Cost efficient
  Energy efficient
��- ��
Real-time constraints
Many ES must meet real-time constraints
A real-time system must react to stimuli from the controlled object (or the operator) within the time interval dictated by the environment.
For real-time systems, right answers arriving too late are wrong.
“A real-time constraint is called hard, if not meeting that constraint could result in a catastrophe” [Kopetz, 1997].
All other time-constraints are called soft.
A guaranteed system response has to be explained without statistical arguments.
��- �6
Real-Time Systems
Embedded and Real-Time Synonymous?
Most embedded systems are real-time
Most real-time systems are embedded
[Figure: two overlapping sets, “embedded” and “real-time”, with the intersection “embedded real-time”]
© Jakob Engblom
��- ��
Reactive & hybrid systems
Typically, ES are reactive systems: “A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment” [Bergé, 1995]
Behavior depends on input and current state:
automata model appropriate, model of computable functions inappropriate.
Hybrid systems (analog + digital parts)
��- ��
Dedicated systems
Dedicated towards a certain application
Knowledge about behavior at design time can be used to minimize resources and to maximize robustness
Dedicated user interface (no mouse, keyboard and screen)
��- ��
Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems
��- �0
Abstraction, Models and Synthesis
Model
  Formal description of selected properties of a system or subsystem
  A model consists of data and associated methods
Classification of models
  Degree of abstraction, granularity
  • system, architecture, logic, transistor, ...
  • module, block, function, ...
  View
  • behavior, structural, physical
Synthesis
  Linking adjacent levels of abstraction (refinement)
  Stepwise adding of structural information
��- ��
Structure
Beha�ior
�����s�o����st�a�tions
��st�m�rchitecture
R�L
Process��odule
�unction �� ��
�����t��o��
�at��������mo���s��it���������mo���s�i���it�������mo���s���i���������mo���s�a�o�t�mo���s
��- ��
Contents
What is an Embedded System?
Levels of Abstraction in Electronic System Design
Typical Design Flow of Hardware-Software Systems
��- �3
Design approaches
Definition: Synthesis is the process of generating the description of a system in terms of related lower-level components from some high-level description of the expected behavior.
“describe-and-synthesize” paradigm by Gajski, 1994
In contrast to the traditional “specify-explore-refine” approach, also known as “design-and-simulate” approach.
Manual design steps are more error-prone than automatic synthesis and, therefore, simulation is more important.
��- ��
System Design
[Figure labels: Specification; System Synthesis; Estimation; SW-Compilation; HW-Synthesis; Instruction Set; Intellectual Prop. Code; Intellectual Prop. Block; Machine Code; Net lists]
��- ��
Fixed Processor Architecture
[Figure: the system design flow diagram (Specification, System Synthesis, Estimation, SW-Compilation, HW-Synthesis, Instruction Set, Intellectual Prop. Code, Intellectual Prop. Block, Machine Code, Net lists)]
��- �6
Application-Specific HW Block
[Figure: the system design flow diagram (Specification, System Synthesis, Estimation, SW-Compilation, HW-Synthesis, Instruction Set, Intellectual Prop. Code, Intellectual Prop. Block, Machine Code, Net lists)]
��- ��
Application-Specific Instruction Set Processor
[Figure: the system design flow diagram (Specification, System Synthesis, Estimation, SW-Compilation, HW-Synthesis, Instruction Set, Intellectual Prop. Code, Intellectual Prop. Block, Machine Code, Net lists)]
��- ��
System-level design
HW/SW codesign is a complex synthesis task
  software synthesis and code generation
  hardware synthesis
  interface and communication synthesis
  hardware/software partitioning and component selection
  hardware/software scheduling
��������� �������:application specificationdesign space e�ploration and system optimi�ationestimation
��- ��
�����a��in����o���m
��- 30
HW/SW Mapping and Scheduling
Partitioning of system function to programmable components (software), hard-wired or parameterized components (hardware) or application specific instruction set processors.
Similarities to scheduling and load distribution problem in real-time operating systems
  time constraints, context switch and context switch overhead,
  process synchronization and communication
Differences to real-time operating systems
  larger design space with very different solutions
  high optimization requirements (motivation for hardware design)
  underlying hardware is not fixed
��- 3�
HW/SW Mapping and Scheduling
Similarity to allocation (or load distribution) problem in high-level synthesis (or real-time operating systems)
[Figure: processes P1, P2, P3, P4 mapped onto dedicated HW components and SW (processors)]
��- 3�
Estimation
The principle of synthesis based on abstraction only makes sense if there are estimation methods available.
Estimate properties of the next layer(s) of abstraction.
Design decisions are based on these estimated properties. If the estimation is not correct (or not accurate enough), the design will be sub-optimal or even not working correctly.
��si�n���a��E���o�ation
�im��in�o���si�n
��si�n���a��E���o�ation
��si�n���a��E���o�ation
Estimation�o��o�����a�����o���ti�s
�i��a�st�a�tion
�o�a�st�a�tion
���
�- �
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
2. Specification and Models of Computation
doc. dr. Gregor Papa
2 - 2
SW-Compilation HW-Synthesis
System DesignSpecification
System Synthesis
Machine Code Net lists
Estimation
Instruction Set
IntellectualProp. Block
IntellectualProp. Code
2 - �
Consider a simple example
“The Observer pattern defines a one-to-many dependency between a subject object and any number of observer objects so that when the subject object changes state, all its observer objects are notified and updated automatically.”
Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides: Design Patterns, Addison-Wesley, 1995
2 - �
Example: Observer pattern in Java
public void addListener(listener) {…}
public void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
Will this work in a multithreaded context?
2 - �
Observer pattern with mutexes
public synchronized void addListener(listener) {…}
public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
Javasoft recommends against this. What's wrong with it?
2 - �
Mutexes using monitors are minefields
public synchronized void addListener(listener) {…}
public synchronized void setValue(newValue) {
    myValue = newValue;
    for (int i = 0; i < myListeners.length; i++) {
        myListeners[i].valueChanged(newValue);
    }
}
valueChanged() may attempt to acquire a lock on some other object and stall. If the holder of that lock calls addListener(): deadlock!
[Figure: the holder of the other lock calls addListener() and waits for the mutex, while valueChanged() requests the lock held by that caller]
2 - �
Simple observer pattern gets complicated
public synchronized void addListener(listener) {…}
public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}
While holding the lock, make a copy of the listeners to avoid race conditions.
Notify each listener outside of the synchronized block to avoid deadlock.
This still isn't right. What's wrong with it?
2 - �
Simple observer pattern: How to make it right?
public synchronized void addListener(listener) {…}
public void setValue(newValue) {
    synchronized (this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}
Suppose two threads call setValue(). One of them will set the value last, leaving that value in the object, but listeners may be notified in the opposite order. The listeners may be alerted to the value changes in the wrong order!
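One possible way to avoid the ordering problem is sketched below. It is not taken from the slides: it swaps in a different technique, a single notification thread fed from inside the lock, so that notifications leave in the same order in which values are assigned. The class name OrderedValueHolder, the Listener interface and the helper lastNotificationMatchesValue() are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class OrderedValueHolder {
    interface Listener {
        void valueChanged(int newValue);
    }

    // Thread-safe list, so addListener needs no monitor of its own.
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();
    // One worker thread: queued notifications run strictly in submission order.
    private final ExecutorService notifier = Executors.newSingleThreadExecutor();
    private int value;

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    synchronized void setValue(int newValue) {
        value = newValue;
        // Submitted while holding the lock: queue order equals assignment order.
        // Listeners themselves run later, on the worker thread, without the lock.
        notifier.execute(() -> {
            for (Listener l : listeners) {
                l.valueChanged(newValue);
            }
        });
    }

    synchronized int getValue() {
        return value;
    }

    void shutdownAndWait() {
        notifier.shutdown();
        try {
            notifier.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Two writers race; the last notification must match the stored value.
    static boolean lastNotificationMatchesValue() {
        OrderedValueHolder holder = new OrderedValueHolder();
        List<Integer> seen = Collections.synchronizedList(new ArrayList<>());
        holder.addListener(seen::add);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) holder.setValue(i); });
        Thread t2 = new Thread(() -> { for (int i = 1000; i < 2000; i++) holder.setValue(i); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        holder.shutdownAndWait();
        return seen.size() == 2000 && seen.get(seen.size() - 1) == holder.getValue();
    }
}
```

The price of this design is that listeners are informed asynchronously, so a caller of setValue() cannot assume that the listeners have already seen the new value when the call returns.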
2 - �
Problems with thread-based concurrency
Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans.
Search for non-thread-based models: which are the requirements for appropriate specification techniques?
2 - ��
Contents
Models of Computation
StateCharts
Data-Flow Models
Requirements for Specification Techniques (1)
Hierarchy
Humans are not capable of understanding systems containing more than a few objects.
Most actual systems require more objects: Hierarchy
Behavioral hierarchy
Examples: states, processes, procedures.
Structural hierarchy
Examples: processors, racks, printed circuit boards
2 - �2
Requirements for Specification Techniques (2)
Represent timing behavior/requirements
Represent state-oriented behavior
Required for reactive systems.
Represent dataflow-oriented behavior
Components send streams of data to each other.
No obstacles for efficient implementation
2 - ��
Models of Computation: Definition
What does it mean, “to compute”?
Models of computation define:
Components and an execution model for computations for each component
Communication model for exchange of information between components.
• Shared memory
• Message passing
• …
[Figure: two components exchanging information]
2 - ��
Shared memory
Potential race conditions (inconsistent results possible)
Critical sections = sections at which exclusive access to resource r (e.g. shared memory) must be guaranteed.
process a {
  ..
  P(S) // obtain lock
  ..   // critical section
  V(S) // release lock
}
process b {
  ..
  P(S) // obtain lock
  ..   // critical section
  V(S) // release lock
}
Race-free access to shared memory protected by S possible
This model may be supported by:
  mutual exclusion for critical sections
  cache coherency protocols
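The P(S)/V(S) pattern above can be tried out with the standard java.util.concurrent.Semaphore class. The following is a minimal sketch, not part of the original slides; the class name SharedCounter and the method runTwoProcesses are assumptions chosen for the illustration, and the two "processes" are modelled as Java threads.

```java
import java.util.concurrent.Semaphore;

final class SharedCounter {
    // Binary semaphore S guarding the shared resource.
    private final Semaphore s = new Semaphore(1);
    private int counter = 0; // the shared memory

    void increment() {
        s.acquireUninterruptibly(); // P(S): obtain lock
        try {
            counter++;              // critical section
        } finally {
            s.release();            // V(S): release lock
        }
    }

    int get() {
        s.acquireUninterruptibly();
        try {
            return counter;
        } finally {
            s.release();
        }
    }

    // Runs "process a" and "process b", each incrementing n times.
    static int runTwoProcesses(int n) {
        SharedCounter shared = new SharedCounter();
        Runnable body = () -> {
            for (int i = 0; i < n; i++) {
                shared.increment();
            }
        };
        Thread a = new Thread(body);
        Thread b = new Thread(body);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return shared.get();
    }
}
```

Without the semaphore, the unsynchronized counter++ could lose updates, which is the kind of inconsistent result the slide refers to.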
2 - ��
Non-blocking/asynchronous message passing
Sender does not have to wait until message has arrived; potential problem: buffer overflow
… send () …
… receive () …
2 - ��
Blocking/synchronous message passing
Sender will wait until receiver has received message
… send () …
… receive () …
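The difference between the two styles can be observed with standard Java queues. This is an illustrative sketch under stated assumptions, not the slides' own code: a bounded ArrayBlockingQueue stands in for the finite buffer of asynchronous message passing, a SynchronousQueue stands in for a rendezvous channel, and the class and method names are invented for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.SynchronousQueue;

final class MessagePassingDemo {
    // Asynchronous: the sender never waits. With no receiver running,
    // messages beyond the buffer capacity are rejected (buffer overflow).
    static int acceptedWithoutReceiver(int capacity, int messages) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int m = 0; m < messages; m++) {
            if (buffer.offer(m)) { // non-blocking send
                accepted++;
            }
        }
        return accepted;
    }

    // Synchronous: put() returns only after a receiver has taken the message.
    static int rendezvous(int message) {
        SynchronousQueue<Integer> channel = new SynchronousQueue<>();
        Thread sender = new Thread(() -> {
            try {
                channel.put(message); // blocks until received
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        try {
            int received = channel.take(); // blocking receive
            sender.join();
            return received;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // A rendezvous channel has no buffer: a hand-off without a waiting
    // receiver does not succeed.
    static boolean handOffWithoutReceiver() {
        return new SynchronousQueue<Integer>().offer(1);
    }
}
```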
2 - ��
Synchronous message passing: CSP
CSP (communicating sequential processes) [Hoare, 1985], rendez-vous-based communication. Example:
process A
  ..
  var a ...
  a := 3;
  c!a; -- output
end
process B
  ..
  var b ...
  ...
  c?b; -- input
end
2 - ��
�omponents ���
�iscrete e�ent model
a�c
timeactiona��� ���� c��� a��� a���
�ueue
� �� �� �� �����
�
�on Neumann model
Se�uential e�ecution� pro�ram memory etc.
2 - ��
�omponents ���
�inite state machines
�ifferential e�uations
btx2
2
2 - 2�
Example Discrete Event: VHDL
VHDL (hardware description language) is commonly used as a design-entry language for digital circuits.
2 - 2�
Sensitivity lists in VHDL
Sensitivity lists are a shorthand for a single wait on-statement at the end of the process body:
process (x, y)
begin
  prod <= x and y;
end process;
is equivalent to
process
begin
  prod <= x and y;
  wait on x, y;
end process;
2 - 22
No language meets all language requirements: using compromises
2 - 2�
Contents
Models of Computation
StateCharts
Data-Flow Models
2 - 2�
Classical Automata
Classical automata: input X, internal state Z, output Y, clock
Next state $Z^+$ computed by function $\delta$
Output computed by function $\lambda$
• Moore automata: $Y = \lambda(Z)$, $Z^+ = \delta(X, Z)$
• Mealy automata: $Y = \lambda(X, Z)$, $Z^+ = \delta(X, Z)$
[Figure: example state diagram]
Moore and Mealy automata = finite state machines (FSMs)
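The two output functions can be compared on a small machine. The sketch below is an illustration, not an example from the slides: it assumes a one-bit state that simply stores the last input, a Moore output that equals the state, and a Mealy output that flags a change between input and state. All names are invented for the example.

```java
final class Automata {
    // Next-state function delta(X, Z): remember the current input.
    static int delta(int x, int z) {
        return x;
    }

    // Moore: output depends on the state only, Y = lambda(Z).
    static int lambdaMoore(int z) {
        return z;
    }

    // Mealy: output depends on input and state, Y = lambda(X, Z).
    static int lambdaMealy(int x, int z) {
        return x ^ z; // 1 whenever the input differs from the stored bit
    }

    static int[] runMoore(int[] inputs) {
        int z = 0; // initial state
        int[] outputs = new int[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            outputs[i] = lambdaMoore(z);
            z = delta(inputs[i], z);
        }
        return outputs;
    }

    static int[] runMealy(int[] inputs) {
        int z = 0; // initial state
        int[] outputs = new int[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            outputs[i] = lambdaMealy(inputs[i], z);
            z = delta(inputs[i], z);
        }
        return outputs;
    }
}
```

For the input sequence 1, 1, 0, 1 the Moore machine reproduces the input delayed by one step, while the Mealy machine reacts within the same step because its output also sees the current input.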
2 - 2�
StateCharts
Classical automata not useful for complex systems (complex graphs cannot be understood by humans).
Introduction of hierarchy: StateCharts [Harel, 1987]
2 - 2�
Introducing Hierarchy
FSM will be in exactly one of the sub-states of S if S is active (either in A or in B or ...)
2 - 2�
Definitions
Current states of FSMs are also called active states.
States which are not composed of other states are called basic states.
States containing other states are called super-states.
For each basic state s, the super-states containing s are called ancestor states.
Super-states S are called OR-super-states, if exactly one of the sub-states of S is active whenever S is active.
[Figure labels: superstate, substates, ancestor state of E]
2 - 2�
Default State Mechanism
Try to hide internal structure from outside world!
Default state
Filled circle indicates sub-state entered whenever super-state is entered.
Not a state by itself!
2 - 2�
History Mechanism
For input m, S enters the state it was in before S was left (can be A, B, C, D, or E). If S is entered for the very first time, the default mechanism applies.
History and default mechanisms can be used hierarchically.
(behavior different from last slide)
2 - ��
�om�ining �istory and De�a��t State
same meanin�
2 - ��
Concurrency
Convenient ways of describing concurrency are required.
AND-super-states: FSM is in all (immediate) sub-states of a super-state.
2 - 32
Entering and Leaving AND-Super-States
Line-monitoring and key-monitoring are entered and left, when service switch is operated.
incl.
2 - 33
�ree representati�n �� state setsbasicstate
��-super-state ���-super-state
� �
��
�
�
�
� � F
� � L
M
� �
�� �
� � F M
� �
� � L
�
� �
� ��
2 - 3�
Computation of state sets
Computation of state sets by traversing the tree from leaves to root:
  basic states: state set = state
  OR-super-states: state set = union of children
  AND-super-states: state set = Cartesian product of children
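The three rules above translate directly into a recursive traversal. The following sketch is an illustration only: the StateTree class, its factory methods and the example tree in the usage note are assumptions, and a global state is represented as the set of names of the basic states that are active together.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class StateTree {
    enum Kind { BASIC, OR, AND }

    final Kind kind;
    final String name;
    final List<StateTree> children;

    private StateTree(Kind kind, String name, List<StateTree> children) {
        this.kind = kind;
        this.name = name;
        this.children = children;
    }

    static StateTree basic(String name) {
        return new StateTree(Kind.BASIC, name, Collections.<StateTree>emptyList());
    }

    static StateTree or(StateTree... children) {
        return new StateTree(Kind.OR, null, Arrays.asList(children));
    }

    static StateTree and(StateTree... children) {
        return new StateTree(Kind.AND, null, Arrays.asList(children));
    }

    // Computes the state set bottom-up, from the leaves to the root.
    Set<Set<String>> stateSet() {
        Set<Set<String>> result = new HashSet<>();
        switch (kind) {
            case BASIC: // state set = state
                result.add(Collections.singleton(name));
                break;
            case OR:    // state set = union of children
                for (StateTree child : children) {
                    result.addAll(child.stateSet());
                }
                break;
            case AND:   // state set = Cartesian product of children
                result.add(Collections.<String>emptySet());
                for (StateTree child : children) {
                    Set<Set<String>> next = new HashSet<>();
                    for (Set<String> partial : result) {
                        for (Set<String> choice : child.stateSet()) {
                            Set<String> combined = new TreeSet<>(partial);
                            combined.addAll(choice);
                            next.add(combined);
                        }
                    }
                    result = next;
                }
                break;
        }
        return result;
    }
}
```

For a hypothetical AND-super-state with two OR-children, and(or(A, B), or(C, D)), the traversal yields the four combinations {A, C}, {A, D}, {B, C} and {B, D}.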
�� �
� � F M
� �
� � L
2 - 3�
Types of States
In StateCharts, states are either
basic states, or
AND-super-states, or
OR-super-states.
2 - 3�
Timers
Since time needs to be modeled in embedded systems, timers need to be modeled.
In StateCharts, special edges can be used for timeouts.
�f event a does not happen while the system is in the left state for �� ms, a timeout will take place.
2 - 3�
�sing �i�ers in Ans�ering �a��ine
2 - 3�
Representation of Computations
Besides states, arbitrarily many other variables can be defined. This way, not all states of the system are modeled explicitly.
These variables can be changed as a result of a state transition (“action”). State transitions can be dependent on these variables (“condition”).
condition
action unstructuredstate space
variables
2 - 3�
General form of edge labels
event [condition] / action
Events:
  Exist only for the next evaluation of the model
  Can be either internally or externally generated
Conditions:
  Refer to values of variables that keep their value until they are reassigned
Actions:
  Can either be assignments for variables or creation of events
Example:
  service-off [not in Lproc] / service:=0
2 - ��
Events and actions
Events can be composed of several events:
  (e1 and e2): event that corresponds to the simultaneous occurrence of e1 and e2.
  (e1 or e2): event that corresponds to the occurrence of either e1 or e2 or both.
  (not e): event that corresponds to the absence of event e.
Actions can also be composed:
  (a1; a2): actions a1 and a2 are executed in parallel.
All events, states and actions are globally visible.
2 - ��
E�a�ple
e:a1:a2:
c:
x y ze�a1 �c��a2
e:a1:a2:
c:
truefalse
truefalse
2 - �2
The StateCharts Simulation Phases
How are edge labels evaluated?
Three phases:
1. Effect of external changes on events and conditions is evaluated,
2. The set of transitions to be made in the current step and right hand sides of assignments are computed,
3. Transitions become effective, variables obtain new values.
2 - �3
Example
In phase 2, variables a and b are assigned to temporary variables. In phase 3, these are assigned to a and b. As a result, variables a and b are swapped.
In a single phase environment, executing the left state first would assign the old value of b (=0) to a and b. Executing the right state first would assign the old value of a (=1) to a and b. The execution would be non-deterministic.
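The effect of separating evaluation from assignment can be reproduced in a few lines. This is a sketch under assumptions, not the StateCharts semantics itself: the two transitions are taken to carry the actions a := b and b := a, and the class and method names are invented for the illustration.

```java
final class TwoPhaseSwap {
    // Phases 2 and 3 kept apart: all right-hand sides are evaluated on the
    // old values first, then the new values become effective together.
    static int[] synchronousStep(int a, int b) {
        int tmpA = b; // phase 2: right-hand side of a := b
        int tmpB = a; // phase 2: right-hand side of b := a
        return new int[]{tmpA, tmpB}; // phase 3: commit
    }

    // Single-phase execution, left transition first: a := b, then b := a.
    static int[] leftFirst(int a, int b) {
        a = b;
        b = a;
        return new int[]{a, b};
    }

    // Single-phase execution, right transition first: b := a, then a := b.
    static int[] rightFirst(int a, int b) {
        b = a;
        a = b;
        return new int[]{a, b};
    }
}
```

Starting from a = 1 and b = 0, only the two-phase version swaps the values; the single-phase versions end with both variables equal, and which value survives depends on the execution order.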
2 - ��
Steps
Execution of a StateChart model consists of a sequence of (status, step) pairs
Status = values of all variables + set of events + current time
Step = execution of the three phases
[Figure: a status followed by the three phases of a step, leading to the next status]
2 - ��
Reflects Model of Clocked Hardware
In an actual clocked (synchronous) hardware system, both registers would be swapped as well.
Same separation into phases found in other languages as well, especially those that are intended to model hardware.
2 - ��
More on semantics of StateCharts
Unfortunately, there are several time-semantics of StateCharts in use. This is another possibility:
A step is executed in arbitrarily small time.
Internal (generated) events exist only within the next step.
External events can only be detected after a stable state has been reached.
e�ternal events
steptransport of internal events
stablestate
stablestate
tstate transitions
2 - ��
E�a�ples
state diagram:stable state
2 - ��
E�a�ple�on-determinism
A C
B D
E G
F H
a
a a
a
A,B C,DE,H
F,G
a
a
astate diagram:
2 - ��
E�a�ple
� �
� �
� c��a �
��
�
� �
a�c
��
a
state diagram (only stable states are represented, only a and b are e�ternal):
�
���
���
a��
a���
a��� a���� � �
a��� a���� �
2 - ��
Evaluati�n �� State��arts ���
��������������allows arbitrary nesting of ���- and ��-super states.��� ������ ������� in a follow-up paper to original paper.Large number of commercial simulation ����� ���������(StateMate, StateFlow�Matlab, �etterState, �ML, ...)�vailable �back-ends�translate State�harts into � �� ����, thus enabling software or hardware implementations.
2 - ��
Evaluation of StateCharts (2)
Cons:
Generated C programs frequently inefficient,
Not useful for distributed applications,
No description of non-functional behavior,
No object-orientation,
No description of structural hierarchy.
2 - �2
SDL
Specification and Description Language (SDL) is a specification language targeted at the unambiguous specification and description of the behaviour of reactive and distributed systems.
Used here as a (prominent) example of a model of computation based on asynchronous message passing.
Appropriate also for distributed systems
2 - �3
Communication among SDL-FSMs
Communication between FSMs (or “processes”) is based on message-passing, assuming a potentially indefinitely large FIFO-queue.
Each process fetches next entry from FIFO,
checks if input enables transition,
if yes: transition takes place,
if no: input is discarded (exception: SAVE-mechanism).
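The fetch, check, fire-or-discard loop can be sketched as follows. This is a simplified illustration, not SDL itself: the SAVE mechanism is deliberately left out, states and signals are plain strings, and the class SdlProcess with its idle/busy example is an assumption made for the sketch.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class SdlProcess {
    private final Deque<String> fifo = new ArrayDeque<>(); // unbounded input queue
    private final Map<String, Map<String, String>> transitions = new HashMap<>();
    private String state;
    private int discarded = 0;

    SdlProcess(String initialState) {
        this.state = initialState;
    }

    void addTransition(String from, String signal, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(signal, to);
    }

    // Asynchronous send: the signal is only appended to the FIFO.
    void send(String signal) {
        fifo.addLast(signal);
    }

    // Fetch the next entry, check whether it enables a transition,
    // fire the transition if so, otherwise discard the input.
    void runUntilEmpty() {
        while (!fifo.isEmpty()) {
            String signal = fifo.removeFirst();
            String next = transitions
                    .getOrDefault(state, Collections.<String, String>emptyMap())
                    .get(signal);
            if (next != null) {
                state = next;
            } else {
                discarded++;
            }
        }
    }

    // Hypothetical two-state process fed with four signals.
    static String demo() {
        SdlProcess p = new SdlProcess("idle");
        p.addTransition("idle", "start", "busy");
        p.addTransition("busy", "stop", "idle");
        p.send("stop");  // not enabled in idle: discarded
        p.send("start"); // idle -> busy
        p.send("start"); // not enabled in busy: discarded
        p.send("stop");  // busy -> idle
        p.runUntilEmpty();
        return p.state + ":" + p.discarded;
    }
}
```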
2 - ��
Deterministic?
Let tokens be arriving at FIFO at the same time:
Order in which they are stored is unknown
All orders are legal: simulators can show different behaviors for the same input, all of which are correct.
2 - ��
Contents
Models of Computation
StateCharts
Data-Flow Models
2 - ��
Dataflow Language Model
Processes communicating through FIFO buffers
[Figure: three processes connected by FIFO buffers]
2 - ��
Philosophy of Dataflow Languages
Drastically different way of looking at computation:
  Imperative language style: program counter is king
  Dataflow language: movement of data is the priority
  Scheduling responsibility of the system, not the programmer
Basic characteristics:
  All processes run “simultaneously”
  Processes can be described with imperative code
  Processes can only communicate through buffers
  Sequence of read tokens is identical to the sequence of written tokens
2 - ��
Dataflow Languages
Appropriate for applications that deal with streams of data:
  Fundamentally concurrent: maps easily to parallel hardware
  Perfect fit for block-diagram specifications (control systems, signal processing)
  Matches well current and future trend towards multimedia applications
Representations:
  Host Language (process description), e.g. C, C++, Java, ...
  Coordination Language (network description), usually “home made”, e.g. XML.
2 - ��
E�a�ple� ��E�-� vide� de��der
2 - ��
Kahn Process Networks
Proposed by Kahn in 1974 as a general-purpose scheme for parallel programming:
  read: destructive and blocking (reading an empty channel blocks until data is available)
  write: non-blocking
  FIFO: infinite size
Unique attribute: determinate
2 - ��
A Kahn Process
From Kahn's original 1974 paper:
process f(in int u, in int v, out int w)
{
  int i; bool b = true;
  for (;;) {
    i = b ? wait(u) : wait(v);
    printf("%i\n", i);
    send(i, w);
    b = !b;
  }
}
[Figure: process f with inputs u, v and output w]
What does this do?
Process alternately reads from u and v, prints the data value, and writes it to w
2 - �2
A Kahn Process
From Kahn's original 1974 paper:
    process g(in int u, out int v, out int w)
    {
        int i; bool b = true;
        for (;;) {
            i = wait(u);
            if (b) send(i, v); else send(i, w);
            b = !b;
        }
    }
[Figure: process g with input u and outputs v, w]
What does this do?
Process reads from u and alternately copies it to v and w.
2 - �3
A Kahn Process
From Kahn's original 1974 paper:
    process h(in int u, out int v, int init)
    {
        int i = init;
        send(i, v);
        for (;;) {
            i = wait(u);
            send(i, v);
        }
    }
[Figure: process h with input u and output v]
What does this do?
Process sends initial value, then passes through values.
2 - ��
A Kahn Process Network
What does this do? Prints an alternating sequence of 0's and 1's.
[Figure: network of f, g and two instances of h (init = 0 and init = 1)]
h (init = 0): emits a 0 once and then copies input to output
h (init = 1): emits a 1 once and then copies input to output
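The behavior of this network can be checked with a small simulation. The sketch below is only an illustration in Python, not SystemC or Kahn's notation. It assumes the wiring from Kahn's paper, since the figure did not survive extraction: f feeds g, g feeds the two h instances, and their outputs feed back into f. The channel names `x`, `y`, `z`, `t1`, `t2` and the function `kahn_network` are hypothetical. Unbounded `queue.Queue` objects stand in for the infinite FIFOs: `get()` blocks on an empty channel and `put()` never blocks.

```python
import queue
import threading


def kahn_network(n_values):
    """Run the f/g/h network until f has printed n_values tokens."""
    # Unbounded FIFOs: put() never blocks, get() blocks on an empty channel.
    x, y, z, t1, t2 = (queue.Queue() for _ in range(5))
    printed = []
    done = threading.Event()

    def f(u, v, w):
        # Alternately read u and v, record the value, forward it to w.
        b = True
        for _ in range(n_values):
            i = u.get() if b else v.get()
            printed.append(i)
            w.put(i)
            b = not b
        done.set()

    def g(u, v, w):
        # Read u and alternately copy the token to v and w.
        b = True
        while True:
            i = u.get()
            (v if b else w).put(i)
            b = not b

    def h(u, v, init):
        # Emit the initial token once, then copy input to output.
        v.put(init)
        while True:
            v.put(u.get())

    # Assumed wiring: f -> g -> (h init=0, h init=1) -> back to f.
    processes = [(f, (y, z, x)), (g, (x, t1, t2)),
                 (h, (t1, y, 0)), (h, (t2, z, 1))]
    for target, args in processes:
        threading.Thread(target=target, args=args, daemon=True).start()
    done.wait()
    return printed
```

Calling `kahn_network(6)` returns the same sequence on every run, whatever order the operating system schedules the threads in, which is the determinacy property discussed on the following slides.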
2 - ��
Determinacy
Random:
A system is random if the information known about the system and its inputs is not sufficient to determine its outputs.
Determinate:
Define the history of a channel to be the sequence of tokens that have been both written and read. A process network is said to be determinate if the histories of all channels depend only on the histories of the input channels.
Importance:
Functional behavior is independent of timing (scheduling, communication time, execution time of processes).
Separation of functional properties and timing.
2 - 66
Determinacy
���������������monotonic mapping���������������������������������������
������������������x������������������������������������x��������������������������������������������������������������������������y������������������������������������������������������������������������,����������������������������������������������������������������
F[x1,x2,x3,…] [y1,y2,y3,…]
2 - 6�
Determinacy
�orma� de�inition�������������������� ����[x1,�x2,�x3,����]�����x����������� [x1] [x1,�x2] [x1,�x2,�x3,����]���������������������� � �����,��1,�����,���� ��
�������������������������� � ������ ������ ������������ F���� ��
������������������ � �� F��� F����
F[x1,x2,x3,…] [y1,y2,y3,…]
2 - 6�
�r�������Determini�m� ����������������������������������������������������������������determinate�
�������������������������������������������������������������������������������������������������y�������������
Rea�oning�����������������,���������y�����������y����������������������������y�������������������������������y������������y,�����,���������������������������������������������y���������,����������������������������������y����������������������������������������������������������������y�������
2 - 6�
���in����n��eterminacy��������������������y��������������������������������������
�����������������������������������������������������������������������������������������������������������������������������������������
��amp�e �������������
������������y���������������
2 - ��
���in����n��eterminacy
F
�1���[�,��]
�2���[�]
F�������[�,��,��]������1�,��2��
� ������[�],�[�]�� �[�,��],�[�]��F��� F��������[�,��]� [�,��,��]
F
�1 ��[�]
�2 ��[�]
F������[�,��]� ����1,��2�
2 - ��
�c�e���in���a�n��et��r����������������������������������������������������������������������������������������������
� �
� ����y�������������
���y�����������������������
���������������������������
2 - �2
Deman���ri�en��c�e���in�����y����������������������������������������������y���������������������
� �
� �����y�
�������������
����y��������������
���y�����������������������
�����������������
2 - ��
��m��ar�������rit�m�������������������������������������o�nded memor�������������������
�tart ����������������ith �o�nded ����er si�es �������������������������������������������������������������������any s�hed��in� te�hni��e �����������������x�������������������������������������������������������������������������������������������������������������������������������itho�t dead�o����������y��������������������ontin�e�����y�����dead�o�����������������������������,�in�rease si�e��������������������������������
2 - ��
Fr�m��n�inite�t��Finite�����er��i�e�������������������������������������������������������������
�������y�������������������������������������������������������������������������������������������������������������������������������������n,�����������������������������������n �����������������y����������������������������������������������������������������������������������������������������������������������������������������������
������������������������������������������������������������y��������������������������������������������
�����2
2 - �5
Dea���c����am��e����x������������������������������������������������y���������������������������������2��
�
�
�������������������,����������������������������1,���������1,���������1,����
��
�
������������������,����������������������������������������������
��
�
2 - �6
��am��e��Finite��i�e�����er��in��������������������������������������������������������������������������������������
�
�
�����������������������������������������1,�������������������1,�������������������1,����
��
�
���������������������1,����������1,�����������1,�����������������������������1,�������������������1,�������������������1,�������
�
�
��
��
����������2����������1
2 - ��
�ar�������rit�m�in��cti�n���������������������������1������,��,��,���
� �
� �
���y�����������������������
�1
�2
�3
� � � ��1 1 1 � ��2 � 1 1 1�3 � 1 1 �
2 - ��
�ar�������rit�m�in��cti�n��������������������������������������������y�������������������������y�
� �
� �
���y�����������������������
�1
�2
�3
� � � � � � ��1 1 1 � � 1 � ����2 � 1 1 1 1 1 ����3 � 1 1 � � � ���
2 - ��
��a��ati�n�����a�n��r�ce����et��r���ro�
�����������y��������������������������������������������������������������������������������������������������������������������������������������������x�����������������������������������������������������������������
�on�������������������������������������������������������������������������y��������������y�����������������������������������������������������������������������x����������������y�������������
2 - ��
�ync�r�n����Data�������DF�����������������������������������,���������y,�1����
�estri�tion �����������������������������������������������������������������������������������i�ed n�m�er o� token�������������������������������������������������
��amp�e������������������������������������������1�����������������������������
1 1 2 3 2 � � � � 1
��������� �����������
2 - ��
�DF��c�e���in��c�ed��e ���������������������������y�at compi�e time�������������y����������������������� �sta��ish re�ati�e e�e��tion rates �y������������y��������
����������������������������������������� �etermine �eriodi� s�hed��e �y�������������y�����������
�����������������������������������������������������������������������������������
Re���t�����������������������x����������������y���������������������������������������
2 - �2
�a�ancin�����ati�n�������������������������������
�
�
12
3
2
�
�
3
�1
3
21
�
3a �2�������3d ���
��3����2��a ���d �2a ���
����������������������
� �
2 - ��
����in��t�e��a�ancin�����ati�n�ain �D� �c�ed��ing t�eorem ����������
���������������������������n ����������������������������������������������y������x�� ���������n�1��������������n�1�������������x����������������������������������������������������� ������������������y���������y������������������������������������������y����������������������������������������������
��amp�e�
2 - ��
Determine��eri��ic��c�e���e�o��i��e �c�ed��e��
�������������
�������������
�������������
…
�y��������������������y�������������������������������������������e�i�i�it��
���������������������������������,���������y���������������������
�
�
12
3
2
�
�
3
�1
3
21
�
�- �
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
3. Design Space Exploration
doc. dr. Gregor Papa
�- 2
�� ������������ �� ��y�������
�y�tem�De�i�n�������������
�y������y��������
������������ ���������
����������
���������������
�����������������������
����������������������
�- �
De�i�n���ace������rati�n
����icati�n �rc�itect�re
�a��in�
��timati�n
�- �
Detai�e���ie���De�i�n���ace������rati�n
m��ti���ecti�e��timi�ati�n
e�a��ati�n
������������������������
������������������
c�n�tr�ctarc�itect�re
ma�a���icati�n
e�timate�er��rmance
�����������������������
�������������������
���������
��������������������
�- 5
��am��e�����im��e����e�
��������������������������������������������������������
������������������������������������������������������
���������������,�������y,������������������
1
2
3
��
,, 21 ��
3 - 6
Example 1: Evolutionary Algorithms for DSE
“chromosome” = encoded allocation + binding
[Figure: evolutionary loop. An individual is decoded (decode allocation, decode binding, scheduling) into a design point (implementation) with its allocation and binding; fitness evaluation under user constraints yields the fitness; selection, recombination and mutation produce new individuals.]
3 - �
Example 1: Basic Model
Definition: A specification graph is a graph G_S = (V_S, E_S) consisting of a problem graph G_P, an architecture graph G_A, and edges E_M. In particular, V_S = V_P ∪ V_A, E_S = E_P ∪ E_A ∪ E_M.
[Figure: problem graph G_P (nodes 1 to 7, data flow), mapping edges E_M, and architecture graph G_A (RISC, HWM1, HWM2, SB, PTP)]
3 - �
Example 1: Mapping
1
2
3
4
5
6
7
RISC
HWM1
SB
1
0
8
1
20
1
2
α
τ
0
1
21
30
1
21
29
β RISC HWM1
HWM2
sharedbus
PTP bus
3 - �
Example 1: Challenges
Encoding of (allocation + binding)
simple encoding
  e.g. one bit per resource, one variable per binding
  easy to implement
  many infeasible partitioning solutions
encoding + repair
  e.g. simple encoding and modify such that for each vp ∈ VP there exists at least one va ∈ VA with β(vp) = va
  reduces number of infeasible partitioning solutions
Generation of the initial population, mutation
Recombination
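The "encoding + repair" idea can be sketched in a few lines. The following Python fragment is a minimal illustration under stated assumptions, not the method used in the case study. The chromosome is assumed to be one allocation bit per resource plus one binding per process, `mapping_edges` is a hypothetical dictionary listing the resources each process may be bound to, and the repair policy (rebind to the first allocated candidate, otherwise allocate the first candidate) is an arbitrary choice for this example.

```python
def repair(allocation, binding, mapping_edges):
    """Return a repaired (allocation, binding) pair.

    allocation: dict resource -> 0/1 (one bit per resource)
    binding: dict process -> resource (one variable per binding)
    mapping_edges: dict process -> list of resources it may run on
    """
    allocation = dict(allocation)  # work on copies, leave the inputs intact
    binding = dict(binding)
    for process, candidates in mapping_edges.items():
        bound = binding.get(process)
        if bound in candidates and allocation.get(bound, 0) == 1:
            continue  # binding is already feasible
        allocated = [r for r in candidates if allocation.get(r, 0) == 1]
        if not allocated:
            # No candidate resource is allocated: allocate the first one.
            allocation[candidates[0]] = 1
            allocated = [candidates[0]]
        binding[process] = allocated[0]
    return allocation, binding


# Hypothetical example: three processes, two resources.
edges = {1: ["RISC"], 2: ["RISC", "HWM1"], 3: ["HWM1"]}
alloc, bind = repair({"RISC": 1, "HWM1": 0},
                     {1: "RISC", 2: "HWM1", 3: "HWM1"}, edges)
```

After repair every process is bound to an allocated resource it can actually run on, so the evolutionary algorithm wastes fewer evaluations on infeasible individuals.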
3 - ��
Example 1: Case Study
3 - ��
Example 1: Case Study
3 - 12
Example 1: Case Study
Modules: frame memory, dual ported frame memory, block matching module, input module, output module, Huffman encoder, DCT/IDCT module, subtract/add module
3 - 13
��am�le �� �olut�o� �
INMINM OUTMOUTM FMFM RISC2RISC2
SBS
3 - 1�
��am�le �� �olut�o� �
INMINM OUTMOUTM DPFMDPFM HCHC
SBF
DCTMDCTM BMMBMM SAMSAM
3 - 1�
��am�le �� �o�t�are ���t�es�s
C D �� � 2 �
A F� 2 � �� �
CD DATB
��������������������������� �����������������������
��������������
�ec�s�o�s�
CODE(A)CODE(B)CODE(A)CODE(B)CODE(C)
CALL(A)CALL(B)CALL(A)CALL(B)CALL(C)
FOR 1 TO 2CODE(A)CALL(B)CODE(C)CODE(A)
I������������������������������������������������������
S�������
ABABABCCABABA�
C��������������������
3 - 1�
��am�le �� ��t�m��at�o� �r�ter�a
2A
�
PROCEDURE AFOR 1 TO 3CALL(A)CODE(B)CODE(B)
���������������
��������������
��������������������
������������������
P��������������
��������������
��������������������
��������������������
D����������
B
3 - 1�
��am�le �� �rade�o��s
D����������
P������������� ��������������
�������
������������������
���� ��������
�����
�����������
3 - 1�
��am�le �� �rade�o�� �ur�aces
3 - 1�
��am�le �� ���lorat�o� �trate����am�le �� ���lorat�o� �trate��
3 - 2�
��am�le �� ����� �a�� �rocess �et�or���am�le �� ����� �a�� �rocess �et�or�
3 - 21
��am�le �� �ard�are �rc��tecture
���������������
3 - 22
��am�le �� �esult o� �u�ct�o�al ��mulat�o�n(p)
��������p
b(s)
��������s
3 - 23
��am�le �� �esult o� �lat�orm �e�c�mar�s
P�������������������������������������������(�) �����������������������
I��������������������������
���������(p���)p
�����������
3 - 2�
���������������������������������
��������������������������������������������������
�������������������
�����������������������������������
�������������������������������
���������������������������������
��am�le �� �ac��o��t�e�e��elo�e ��al�s�s
3 - 2�
����
���2
��am�le �� �am�le ���lorat�o� �esult��am�le �� �am�le ���lorat�o� �esult
����
���2
�- 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
4. System Simulation
doc. dr. Gregor Papa
�- 2
S� �C���������� H� �S��������
��stem �es���S������������
S������S���������
M�������C��� N��������
����������
I�����������S��
I�����������P�����B����
I�����������P�����C���
�- 3
�utl��e
���������������������
D��������������S���������
��������S�����C
S�������������H����A�����������������
�- �
��stem a�d �odelA������� ����������������������������������������������������������������������������������������������������������������������I����S��������D���������������������������������������T������A������������������������������������������������������������������������������������������������������
�- �
�tateT����������������������������� ������������������������������������������������������������������������������ �� ��������������������������������������������������� �� �T�������� ���������������������������������������������������������
���������S�����������������������������������������������������
�- �
�tateI����������������������������������������� ���������������������������������������������������������A������������������������������������������������������p��s�
������������������������������������������������������������������������������������������������������������������������������������������������������������
�- �
��meI����������������������������������������������������������������������������������������������������I��������������������������������������������������������������������������������������������������
����p��s����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
�- �
��e�ts a�d ��screte ��e�t ��stemsA�����������������������������������������������������������������������������T������������������������������������I��������������������������������������������������������������I�������������������������������������������������������������������������
�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������I�������������������������������������������������������������������������������������������������������������������������������������������������������������
�- �
��screte ��e�t ��stems �����A�D�S�����������������������������������������������������������������������������������������������������������������������A����������������������������D�S��������������������������������������������������������������������T��������������������D�S�����������������������������������������������������I�������������D�S�������������������������������������������������������������O������������������������������������������������������������������� ������������������������������������P����������������������������������������������������������������������
�- 1�
��me�dr��e� �s� ��e�t�dr��e������������������������
�(�)�
����������� ������������
������������
������
����������
�����
�����������������������
����
��2�� �� ���� �� ��
����������������
�- 11
��me�dr��e� �s� ��e�t�dr��e���������-����������-������������������
T��������������������������������������������������������������������T������������������������������������������������������������������������������������������������������������������������������������������������������������A������������������������������������������������������
����
�
�- 12
��me�dr��e� �s� ��e�t�dr��e������-������������������
S������������������������������������A�������������������������������������������������������������������
����
�
����
�
�� �� ��� �� �� �����
�2�� �� ���� �� ��
��������������
������������
������
�� �� ��� �� �� �����
�- 13
�utl��e
S������C�������������
�������������������������
��������S�����C
S�������������H����A�����������������
�- 1�
��screte���e�t �odel��� a�d ��mulat�o��������������������������������������������������������������
��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
T����������������������������������������� �����-�����������������
�- 1�
�om�o�e�ts o� a ��screte���e�t ��mulat�o�����������������
�����������������������������������������I�����������������������������������������������������������������������������������������������
����������������������������������������������������������������������������T�������������������������������������������������������������������������������������������������������������������
�����������������C�����������������������������������������������C��������������������������������������A������������������������������������������������������������������������������������������������P��������������������������������������������������������������������
�- 1�
��screte���e�t ��mulat�o� �����e����������������������
I���������������������������������������������������������������������������������������������������������������
��������������D�����������������������������������������������������������������������������������������������������������������������
�������������U�������������������������������������������������������������
���t� rout��e
���le����������������
set ���to ���������������
u�date stat�st�cal ���ormat�o�
�e�erate s�mulat�o� re�ort
�rocess ����������b� call��� subs�stem module�s�� remo�e e�e�t �rom �����������
�- 1�
��screte���e�t ��mulat�o�
���� ����2 �����
����������������
A���������������������������������������������������������P�����������������������������������������������������������������������������������������������s����n� may “produce” new events.
Problem: Within the same simulation cycle, “cause” and “effect” events share the same time of occurrence.
Solution: The simulator uses a zero-duration virtual time interval, called delta-cycle (δ).
The role of a delta-cycle is to order “simultaneous” events within a simulation cycle, i.e. identifying which event caused another; “causes” and “effects” are separated by delta-cycles.
Simulation cycles may be composed of several delta-cycles (δ).
[Figure: events A, C, D, B, C, E on a time axis, separated by delta-cycles]
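How delta-cycles order simultaneous events can be illustrated with a tiny event queue. This is a minimal Python sketch, not a model of any particular simulator. The function `simulate`, the `(time, delta, name)` event format and the `effects` table are assumptions made for the example. An effect with zero delay is scheduled at the same simulated time but one delta-cycle later than its cause.

```python
import heapq


def simulate(initial_events, effects):
    """Process events in (time, delta) order and return the trace.

    initial_events: list of (time, name)
    effects: dict name -> list of (delay, name) events caused by it
    """
    # Queue entries are (time, delta, name); delta orders same-time events.
    pending = [(time, 0, name) for time, name in initial_events]
    heapq.heapify(pending)
    trace = []
    while pending:
        time, delta, name = heapq.heappop(pending)
        trace.append((time, delta, name))
        for delay, effect in effects.get(name, []):
            if delay == 0:
                # Same simulated time, next delta-cycle: cause before effect.
                heapq.heappush(pending, (time, delta + 1, effect))
            else:
                heapq.heappush(pending, (time + delay, 0, effect))
    return trace


# Hypothetical causality: A causes C, C causes D, B causes E, all with zero delay.
trace = simulate([(10, "A"), (20, "B")],
                 {"A": [(0, "C")], "C": [(0, "D")], "B": [(0, "E")]})
```

In the resulting trace, A, C and D all occur at time 10 but in delta-cycles 0, 1 and 2, so the cause always precedes its effect even though simulated time does not advance.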
4 - 18
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
4 - 1�
S��te� ��O�e��ie�
4 - ��
�����le�����
4 - �1
�����le���O�
4 - ��
�o�ule�
processes
4 - ��
��o�e��e�
4 - �4
�o�ule�
4 - ��
Process Communication
Processes can directly communicate through signals.
[Figure: a module containing Process 1 and Process 2 connected by an internal signal, with input ports, an IO port, output ports, and sensitivity of the processes to signals]
4 - ��
Advanced Communication
SystemC 2.0 introduces general-purpose primitives:
Channel
  A container for communication and synchronization, e.g. can have state and private data, transport data, transport events.
  They implement one or more interfaces.
Interface
  Specifies a set of access methods to the channel.
  But it does not implement those methods.
Event
  Flexible, low-level synchronization primitive, used to construct other forms of synchronization.
  Has no type and no value.
Other comm. & sync. models can be built based on the above primitives.
4 - ��
���nnel���n���o�t�
4 - 28
Wait and Notify
Wait: halt process execution until event is raised
wait() with arguments => dynamic sensitivity
• wait(sc_event)
• wait(time)
• wait(time_out, sc_event)
Notify: raise an event
notify() with arguments => delayed notification
• my_event.notify(); // notify immediately
• my_event.notify(SC_ZERO_TIME); // notify next delta cycle
• my_event.notify(time); // notify after time
4 - 2�
Simulation Elements: Main Program
[Flowchart: Initialization phase: execute all the processes until a blocking point, then update signals. Main loop: compute the set of "ready" processes and check the number of "ready" processes. While there are ready processes, execute all the processes until a blocking point and update signals (delta cycle). When there are none, advance simulation time (clock cycle).]
4 - ��
Example: Simple FIFO Channel
4 - 31
Example: Simple FIFO Channel (Interface)
4 - 32
Example: Simple FIFO Channel
4 - 33
Example: Simple Producer/Consumer
4 - 34
Example: Kahn Process Network
4 - 35
Example: Kahn Process Network
4 - 36
Example: Kahn Process Network
The KPN will deadlock unless an initial token is put into the loop:
output1.write(0.0);
4 - 37
Example: Kahn Process Network
4 - 38
Example: Kahn Process Network
4 - 39
SystemC and Models of Computation
4 - 40
Outline
System Classification
Discrete Event Simulation
Example SystemC
Simulation at High Abstraction Levels
4 - 4�
Multiple Levels of Abstraction
(Untimed) Functional Level
  Use: model (un-)timed functionality
  Communication: shared variables, messages
  Typical languages: C/C++, Matlab
Transaction Level
  Use: SoC architecture analysis, early SW development, timing estimation
  Communication: method calls to channels
  Typical languages: SystemC
Register Transfer Level / Pin Level
  Use: HW design and verification
  Communication: wires and registers
  Typical languages: Verilog, VHDL
[Figure: abstraction stack: Functional, Transaction-Level, Register Transfer Level]
4 - 42
Abstraction Models
Time granularity for communication/computation objects can be classified into 3 basic categories: Un-Timed, Approximate-Timed, Cycle-Timed.
Models B, C, D and E could be classified as Transaction Level Models (TLM).
System Modeling Graph (2003, Dan Gajski and Lukai Cai); axes: Computation and Communication, each Un-timed, Approximate-timed or Cycle-timed:
A. "Un-timed functional model"
B. "Timed functional model"
C. "Transaction model"
D. "Cycle-accurate communication model"
E. "Cycle-accurate computation model"
F. "Register transfer model"
4 - 4�
A: "Un-Timed Functional Model"
Computation: un-timed behavior
Communication: un-timed transfer, variables
[Figure: behaviors B1 to B4 communicating through variables]
    B1: v1 = a*a;
    B2: v2 = v1 + b*b;
    B3: v3 = v1 - b*b;
    B4: v4 = v2 + v3; c = sequ(v4);
sequential execution: B1, B2||B3, B4
parallel execution: B2 || B3
4 - 44
B: "Timed Functional Model"
Computation (on processing elements, PEs): time annotation (estimate)
Communication: message-passing, no protocol implementation; un-timed transfer
Mapping: PEs (architecture) allocation and process-to-PE mapping
code: time estimates, e.g. __DELAY() or wait()
[Figure: PE1 runs B1 (v1 = a*a;), PE2 runs B2 (v2 = v1 + b*b;), PE3 runs B3 (v3 = v1 - b*b;) and B4 (v4 = v2 + v3; c = sequ(v4);); the PEs exchange v1 and v2 by message-passing over channels cv11, cv12 and cv2]
4 - 4�
Example B: Software Code Annotation
INPUT = ANSI C source code
OUTPUT = functionally equivalent C code augmented by execution times
Flow: Specification (ANSI C input) -> analyze basic blocks, compute delays (delay characterization) -> annotate C code -> TLM model (C code + execution delay) -> compile generated C and run natively -> performance estimation
Original C code:
    v__st_tmp = v__st;
    startup(proc);
    if (events[proc][0] & 1)
        execute(proc);
Annotated C code:
    v__st_tmp = v__st;
    __DELAY(LI+LI+LI+LI+LI+LI+OPc);
    startup(proc);
    if (events[proc][0] & 1) {
        __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
        execute(proc);
    }
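The annotation step amounts to summing a characterized delay for every instruction class in a basic block and emitting one `__DELAY(...)` call per block. A minimal Python sketch follows. The delay values in `DELAY` are invented for illustration and are not taken from the slides, the helper names `block_delay` and `annotate` are hypothetical, and the sketch places the delay call at the start of each block, which is a simplification.

```python
# Hypothetical delay characterization in cycles per instruction class.
DELAY = {"LD": 3, "LI": 1, "OPc": 1, "OPi": 1, "IF": 2}


def block_delay(instruction_classes, delay_table=DELAY):
    """Sum the characterized delays of one basic block."""
    return sum(delay_table[cls] for cls in instruction_classes)


def annotate(blocks, delay_table=DELAY):
    """Prefix each basic block with a __DELAY() call.

    blocks: list of (c_lines, instruction_classes), one entry per basic block
    """
    annotated = []
    for c_lines, classes in blocks:
        annotated.append(f"__DELAY({block_delay(classes, delay_table)});")
        annotated.extend(c_lines)
    return annotated


first_block = block_delay(["LI"] * 6 + ["OPc"])
example = annotate([(["x = 1;"], ["LI", "OPc"])])
```

Because the annotated program is ordinary C compiled for the host, it runs at native speed while still accumulating an estimate of target execution time.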
4 - 46
C: “Transaction Model”
Computation: approximate-timed (estimate)
Communication: approximate-timed (estimate) using simplified (abstract) bus protocols
Mapping: mapping of computation and communication
[Figure: PE1 (B1: v1 = a*a;), PE2 (B2: v2 = v1 + b*b;), PE3 (B3: v3 = v1 - b*b; B4: v4 = v2 + v3; c = sequ(v4);) and PE4 (arbiter) connected by an abstract bus carrying channels cv11, cv12 and cv2; master, slave and arbiter interfaces]
D: “Cycle-Accurate Communication Model”
Computation: approximate-timed (estimate)
Communication: protocol bus channels (time/cycle-accurate and pin-accurate)
Mapping: mapping of computation and communication
[Figure: PE1 (B1: v1 = a*a;), PE2 (B2: v2 = v1 + b*b;), PE3 (B3: v3 = v1 - b*b; B4: v4 = v2 + v3; c = sequ(v4);) and PE4 (arbiter) connected by a protocol bus with ready, ack, address and data signals; master, slave and arbiter interfaces]
4 - 4�
E: “Cycle-Accurate Computation Model”
Computation: cycle-accurate
Communication: approximate-timed (estimate) using simplified (abstract) bus protocols
Wrappers: simulation interfaces between cycle-accurate PEs and abstract bus channel interfaces
[Figure: PE1 and PE2 modeled at instruction level (assembly code), PE3 and PE4 modeled as state machines; each PE is cycle-accurate and pin-accurate and is attached through a wrapper to the abstract bus channels cv11, cv12 and cv2; master, slave and arbiter interfaces]
4 - 4�
Example E: What is an ISS?
An Instruction Set Simulator (ISS) is a simulation model, coded in a high-level language, which mimics the behavior of a processor by “reading” instructions and maintaining internal variables which represent the processor's registers.
An ISS can be instruction-accurate or cycle-accurate.
It simulates (executes and monitors) machine code instructions compiled for a target processor.
4 - 50
Example E: Types of ISS
Interpretive ISS:
    int Reg[32];
    …
    while (1) {
        Fetch();
        Decode();
        Execute();
        InterruptHandler();
    }
    /* inside Execute(): */
    switch (INSN) {
        case ADD: r3 = r1 + r2; break;
        case SUB: ...
    }
Compiled ISS:
    original C code:          … a = b+c; …
    compilation
    original assembly code:   … add r1, r2, r3 …
    intermediary C code generation and recompilation
    ISS code:                 … Add(r1, r2, r3); …
    #define Add(r1, r2, r3) r3 = r1 + r2
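The fetch/decode/execute loop of an interpretive ISS can be sketched in a few lines. The Python fragment below is an instruction-accurate toy only. The instruction set (`li`, `add`, `sub`, `halt`), the tuple encoding and the function `run_iss` are hypothetical and do not correspond to any real processor. The operand order of `add` follows the slide, where `add r1, r2, r3` computes `r3 = r1 + r2`.

```python
def run_iss(program, num_regs=8):
    """Interpret a list of instruction tuples; return (registers, count)."""
    reg = [0] * num_regs  # internal variables representing the registers
    pc = 0
    executed = 0
    while pc < len(program):
        opcode, *operands = program[pc]  # fetch and decode
        pc += 1
        executed += 1
        if opcode == "li":       # li rd, value: load immediate
            rd, value = operands
            reg[rd] = value
        elif opcode == "add":    # add r1, r2, r3: r3 = r1 + r2
            r1, r2, r3 = operands
            reg[r3] = reg[r1] + reg[r2]
        elif opcode == "sub":    # sub r1, r2, r3: r3 = r1 - r2
            r1, r2, r3 = operands
            reg[r3] = reg[r1] - reg[r2]
        elif opcode == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return reg, executed


regs, count = run_iss([("li", 1, 5), ("li", 2, 7), ("add", 1, 2, 3), ("halt",)])
```

A cycle-accurate ISS would additionally keep a cycle counter and model pipeline and memory timing for every instruction. A compiled ISS removes the decode step by translating each target instruction into host code ahead of time.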
4 - 5�
F: “Register Transfer Model”
Computation and Communication: cycle-timed, modeled on the level of combinatorial (stateless) functions, memory and digital signals
PE1, PE2: microprocessors
PE3, PE4: custom hardware
[Figure: PE1 and PE2 executing assembly code, PE3 and PE4 as state machines, connected by signal-level buses with interrupt lines]
4 - 5�
Different Abstraction Models
Model                                    | Communication time | Computation time | Communication scheme | PE interface
A. Un-timed functional model             | No                 | No               | Variables            | (no PE)
B. Timed functional model                | No                 | Approximate      | Abstract channel     | Abstract
C. Transaction model                     | Approximate        | Approximate      | Abstract bus channel | Abstract
D. Cycle-accurate communication model    | Cycle accurate     | Approximate      | Protocol bus channel | Abstract
E. Cycle-accurate computation model      | Approximate        | Cycle accurate   | Abstract bus channel | Pin-accurate
F. Register transfer model               | Cycle accurate     | Cycle accurate   | Bus (wires)          | Pin-accurate
4 - 5�
Trace-Based Simulation
Between A (Un-timed Functional Model) and C (Transaction Model)
Higher simulation speed (for large hardware/software systems, multiprocessors)
Uses estimates of non-functional behavior
4 - 54
Trace-Based Simulation: 2 Phases
Trace generation
  Input: application specification
  Output: execution traces = sequence of events ∈ {read; write; compute}
  Method: un-timed functional simulation
Trace-based simulation
  Input: execution traces, architecture specification, mapping specification
  Output: performance estimation results, e.g. execution time, processor load and bus load
  Method: map abstract read, write and compute primitives onto virtual machines that reflect binding and resource sharing (mapping)
[Figure: trace generation followed by trace-based simulation]
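The two phases can be sketched as two small functions. This Python fragment is an illustration under strong simplifying assumptions. `generate_trace` is a stand-in for an un-timed functional simulation, the cost table is invented, and `simulate_traces` only accumulates busy time per resource without modeling contention or bus arbitration. All names are hypothetical.

```python
def generate_trace(n_items):
    """Phase 1 (stand-in): record abstract events of an un-timed run."""
    trace = []
    for _ in range(n_items):
        trace += [("read", 1), ("compute", 4), ("write", 1)]
    return trace


def simulate_traces(traces, mapping, cost):
    """Phase 2: charge every traced event to the resource its process is bound to.

    traces: dict process -> list of (kind, amount)
    mapping: dict process -> resource
    cost: dict resource -> dict kind -> cycles per unit
    """
    busy = {}
    for process, trace in traces.items():
        resource = mapping[process]
        for kind, amount in trace:
            busy[resource] = busy.get(resource, 0) + amount * cost[resource][kind]
    return busy


traces = {"p1": generate_trace(2), "p2": generate_trace(1)}
load = simulate_traces(traces,
                       {"p1": "cpu0", "p2": "cpu0"},
                       {"cpu0": {"read": 2, "write": 2, "compute": 1}})
```

Because the traces are independent of the architecture, the first phase runs once, and only the cheap second phase has to be repeated for every candidate mapping during design space exploration.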
4 - 55
Cosim�lation ��otivation �ixed �odels������ �������������������and the simulation is �ery much dependent on the system description model
How to ��� ��������������se�eral abstraction le�els or se�eral models of computation�
�oti�ating ���������1. Different abstraction le�els2. Different description languages3. Different models of computation
�more abstract less abstract
pac�et
addressdatacmdcnfgstatus
�
����C�C++ �
4 - 5�
Cosimulation: Example
Environments for multiprocessor system cosimulation:
Several ISSs coupled with HW (RTL) simulation: accurate, but slow (especially for multiple ISSs running in parallel).
ISSs are replaced with higher-level simulation models: speeds up simulation.
[Figure, accurate setup: tasks T1, T2, T3 run on an OS on each ISS; cosimulation interfaces connect the ISSs to the HW (RTL) simulator (SystemC) containing the interconnect and HW IP]
[Figure, fast setup: tasks T1, T2, T3 run on an OS model under native execution (UNIX); cosimulation interfaces connect them to the HW (RTL) simulator (SystemC) containing the interconnect and HW IP]
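Cosimulation: Single vs. Multiple Engines
[Figure, single simulation engine: models 1, 2, ..., n are combined into a unified model that runs on one simulator]
[Figure, multiple simulation engines: models 1, 2, ..., n each run on their own simulator (#1, #2, ..., #n), connected by a cosimulation bus]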
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
5 - 1
5. Worst Case Execution Time Analysis
doc. dr. Gregor Papa
5 - �
[Figure: System Design flow: Specification, System Synthesis, then SW-Compilation and HW-Synthesis producing Machine Code and Net lists, with Estimation; inputs: Instruction Set, Intellectual Prop. Block, Intellectual Prop. Code]
5 - �
Contents
Introduction
  problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
  must, may analysis
Pipelines
  Abstract pipeline models
Integrated analyses
5 - 4
Industrial Needs
Hard real-time systems, often in safety-critical applications, abound:
Aeronautics, automotive, train industries, manufacturing control
� ing �ibration of airplane, sensing e�ery � mSec
Sideairbag in car,Reaction in �1� mSec
5 - 5
Hard Real-Time Systems
Embedded controllers are expected to finish their tasks reliably within time bounds.
Task scheduling must be performed.
Essential: upper bound on the execution times of all tasks statically known.
Commonly called the Worst-Case Execution Time (WCET)
Analogously, Best-Case Execution Time (BCET)
5 - �
Measurement: Industry's "best practice"
[Figure: distribution of execution times along the execution time axis, marking the Best Case Execution Time, the Worst Case Execution Time, the (unsafe) execution time measurement and the upper bound]
Works if either
  worst-case input can be determined, or
  exhaustive measurement is performed.
Otherwise, determine upper bound from execution times of instructions.
5 - �
(Most of) Industry's Best Practice
Measurements: determine execution times directly by observing the execution or a simulation on a set of inputs.
Does not guarantee an upper bound to all executions.
Exhaustive execution in general not possible!
Too large space of (input domain) x (set of initial execution states).
Compute upper bounds along the structure of the program:
Programs are hierarchically structured.
Statements are nested inside statements.
So, compute the upper bound for a statement from the upper bounds of its constituents.
5 - 8
Sequence of Statements
A ≡ A1; A2;
Constituents of A: A1 and A2
Upper bound for A is the sum of the upper bounds for A1 and A2:
ub(A) = ub(A1) + ub(A2)
5 - �
Conditional Statement
A ≡ if B then A1 else A2
[Figure: flowchart: condition B with yes branch to A1 and no branch to A2]
Constituents of A:
1. condition B
2. statements A1 and A2
ub(A) = ub(B) + max(ub(A1), ub(A2))
5 - ��
Loops
A ≡ for i ← 1 to 100 do A1
[Figure: flowchart: i ← 1, test i ≤ 100, yes branch to A1 and back to the test, no branch to exit]
ub(A) = ub(i ← 1) + 100 × ( ub(i ≤ 100) + ub(A1) ) + ub(i ≤ 100)
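The three rules for sequences, conditionals and loops compose into a recursive computation over the statement structure. The Python sketch below is only an illustration. The tuple-based tree encoding and the function name `ub` applied to it are assumptions, and the cycle counts in `cycles` are invented numbers rather than measurements of any processor.

```python
def ub(stmt, cycles):
    """Upper bound of a statement tree built from nested tuples.

    ("basic", name)                     cost looked up in cycles
    ("seq", s1, s2, ...)                sum of the parts
    ("if", cond, then_part, else_part)  condition plus the worse branch
    ("for", init, cond, body, n)        loop with a known bound n
    """
    kind = stmt[0]
    if kind == "basic":
        return cycles[stmt[1]]
    if kind == "seq":
        return sum(ub(part, cycles) for part in stmt[1:])
    if kind == "if":
        _, cond, then_part, else_part = stmt
        return ub(cond, cycles) + max(ub(then_part, cycles), ub(else_part, cycles))
    if kind == "for":
        _, init, cond, body, n = stmt
        # n passing tests with the body, plus one final failing test.
        return (ub(init, cycles)
                + n * (ub(cond, cycles) + ub(body, cycles))
                + ub(cond, cycles))
    raise ValueError(f"unknown statement kind: {kind}")


# Hypothetical cycle counts for the elementary pieces.
cycles = {"init": 1, "cond": 2, "A1": 10, "A2": 4}
loop = ("for", ("basic", "init"), ("basic", "cond"), ("basic", "A1"), 100)
loop_bound = ub(loop, cycles)
```

With these numbers the loop bound is 1 + 100 × (2 + 10) + 2 cycles. The scheme needs a known iteration bound for every loop and, as the following slides explain, a constant cost per instruction, which is what breaks down on modern processors.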
5 - ��
How to start?
Assignment x ← a + b
    load a
    load b
    add
    store x
ub(x ← a + b) = cycles(load a) + cycles(load b) + cycles(add) + cycles(store x)
cyclesadd �load m 12store m 1�move 1
Assumes constant execution times for instructions.
Not applicable to modern processors!
5 - ��
Modern Hardware Features
Modern processors increase performance by using: Caches, Pipelines, Branch Prediction, Speculation.
These features make WCET computation difficult: execution times of instructions vary widely.
Best case: everything goes smoothly: no cache miss, operands ready, needed resources free, branch correctly predicted.
Worst case: everything goes wrong: all loads miss the cache, resources needed are occupied, operands are not ready.
Span may be several hundred cycles.
5 - ��
Access Times
x = a + b;
    LOAD r2, _a
    LOAD r1, _b
    ADD r3,r2,r1
[Figure: bar chart of execution time (clock cycles) for this code, best case vs. worst case]
5 - ��
Timing Accidents and Penalties
Timing Accident: cause for an increase of the execution time of an instruction
Timing Penalty: the associated increase
Types of timing accidents:
Cache misses
Pipeline stalls
Branch mispredictions
Bus collisions
Memory refresh of DRAM
TLB miss
5 - �5
Overall Approach: Modularization
Micro-architecture Analysis:
  Uses Abstract Interpretation
  Excludes as many Timing Accidents as possible
  Determines WCET for basic blocks (in contexts)
Worst-case Path Determination:
  Maps control flow graph to an integer linear program
  Determines upper bound and associated path
5 - 16
Contents
Introduction
  problem statement, tool architecture
Program Path Analysis
Value Analysis
Caches
  must, may analysis
Pipelines
  Abstract pipeline models
Integrated analyses
5 - 17
Control Flow Graph (CFG)
what_is_this {
 1  read (a,b);
 2  done = FALSE;
 3  repeat {
 4    if (a>b)
 5      a = a-b;
 6    elseif (b>a)
 7      b = b-a;
 8    else done = TRUE;
 9  } until done;
10  write (a);
}
[Figure: CFG with one node per numbered statement (1 to 10); the edges carry the labels a>b, a<=b, a<b, a=b, done and !done.]
5 - 18
Program Path Analysis
Which sequence of instructions is executed in the worst case (longest runtime)?
Problem: the number of possible program paths grows exponentially with the program length.
Model:
  fixed number of cycles for each basic block (from static analysis)
  loops must be bounded
Concept:
  Transform the structure of the CFG into a set of (integer) linear equations.
  The solution of the Integer Linear Program (ILP) yields a bound on the WCET.
5 - 19
Basic Block
Definition: A basic block is a sequence of instructions where the control flow enters at the beginning and exits at the end, without stopping in between or branching (except at the end).
t1 := c - d
t2 := e * t1
t3 := b * t1
t4 := t2 + t3
if t4 < 10 goto L
5 - 20
Basic Blocks
Determine the basic blocks of a program:
1. Determine the block beginnings:
  the first instruction
  targets of un/conditional jumps
  instructions that follow un/conditional jumps
2. Determine the basic blocks:
  there is a basic block for each block beginning
  the basic block consists of the block beginning and runs until the next block beginning (exclusive) or until the program ends
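The two-step procedure can be sketched as follows. The instruction format (a label, the instruction text, and an optional jump target) is a hypothetical simplification chosen for the example; the program is the small loop used on the following slide.

```python
from typing import List, Optional, Tuple

# (label or None, instruction text, jump target label or None)
Instr = Tuple[Optional[str], str, Optional[str]]

def basic_blocks(prog: List[Instr]) -> List[List[str]]:
    label_pos = {lab: i for i, (lab, _, _) in enumerate(prog) if lab}
    leaders = {0}                                # the first instruction
    for i, (_, _, target) in enumerate(prog):
        if target is not None:
            leaders.add(label_pos[target])       # target of a jump
            if i + 1 < len(prog):
                leaders.add(i + 1)               # instruction following a jump
    starts = sorted(leaders)
    ends = starts[1:] + [len(prog)]              # each block runs to the next beginning
    return [[text for _, text, _ in prog[s:e]] for s, e in zip(starts, ends)]

prog = [
    (None, "i := 0", None),
    (None, "t2 := 0", None),
    ("L", "t2 := t2 + i", None),
    (None, "i := i + 1", None),
    (None, "if i < 10 goto L", "L"),
    (None, "x := t2", None),
]
blocks = basic_blocks(prog)
for b in blocks:
    print(b)
```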
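5 - 21
Control Flow Graph with Basic Blocks
"Degenerated" control flow graph (CFG): the nodes are the basic blocks.
i := 0
t2 := 0
L: t2 := t2 + i
i := i + 1
if i < 10 goto L
x := t2
[Figure: CFG with three basic blocks; the loop block branches back to L if i < 10 and continues with x := t2 if i >= 10.]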
5 - 22
Example
/* k >= 0 */
s = k;
WHILE (k < 10) {
  IF (ok)
    j++;
  ELSE {
    j = 0;
    ok = true;
  }
  k++;
}
r = j;
[Figure: CFG of the example with basic blocks B1: s = k; B2: WHILE (k<10); B3: if (ok); B4: j++; B5: j = 0; ok = true; B6: k++; B7: r = j.]
5 - 23
Calculation of the WCET
Definition: A program consists of N basic blocks, where each basic block B_i has a worst-case execution time c_i and is executed exactly x_i times. Then, the WCET is given by
WCET = \sum_{i=1}^{N} c_i \cdot x_i
The c_i values are determined using static analysis.
How to determine x_i?
• structural constraints given by the program structure
• additional constraints provided by the programmer (bounds for loop counters, etc.; based on knowledge of the program context)
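5 - 24
Structural Constraints
[Figure: CFG of the example with basic blocks B1 to B7 and edges d1 to d10. d1 enters B1, d2 goes from B1 to B2, d3 from B2 to B3, d4 from B3 to B4, d5 from B3 to B5, d6 from B4 to B6, d7 from B5 to B6, d8 from B6 back to B2, d9 from B2 to B7, and d10 leaves B7.]
Flow equations:
d1 = d2 = x1
d2 + d8 = d3 + d9 = x2
d3 = d4 + d5 = x3
d4 = d6 = x4
d5 = d7 = x5
d6 + d7 = d8 = x6
d9 = d10 = x7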
5 - 25
Additional Constraints
[Figure: the same CFG with basic blocks B1 to B7 and edges d1 to d10.]
The loop is executed at most 10 times:
x3 <= 10 · x1
B5 is executed at most once:
x5 <= 1 · x1
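5 - 26
WCET - ILP
ILP with structural and additional constraints:
WCET = \max \Big\{ \sum_{i=1}^{N} c_i \cdot x_i \;\Big|\; d_1 = 1 \;\wedge\; \sum_{j \in in(B_i)} d_j = \sum_{k \in out(B_i)} d_k = x_i,\ i = 1 \ldots N \;\wedge\; \text{additional constraints} \Big\}
d_1 = 1: the program is executed once.
The flow equations for each B_i are the structural constraints.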
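For the seven-block example, the ILP is small enough to solve by enumeration. The sketch below is not an ILP solver: it eliminates the edge variables by hand using the flow equations and tries every loop count allowed by the additional constraints. The block costs `c` are hypothetical values chosen for illustration.

```python
# Hypothetical per-block cycle bounds c[1..7]; illustrative values only.
c = {1: 2, 2: 3, 3: 2, 4: 4, 5: 6, 6: 2, 7: 1}

best = None
for x3 in range(0, 11):                      # additional constraint: x3 <= 10 * x1
    for x5 in range(0, min(1, x3) + 1):      # additional constraint: x5 <= 1 * x1
        x = {
            1: 1,          # d1 = 1: the program is executed once
            2: x3 + 1,     # x2 = d2 + d8 with d2 = 1 and d8 = x6 = x3
            3: x3,
            4: x3 - x5,    # d3 = d4 + d5
            5: x5,
            6: x3,         # d6 + d7 = d8 = x6
            7: 1,          # d9 = d10 = x7
        }
        cost = sum(c[i] * x[i] for i in c)
        if best is None or cost > best[0]:
            best = (cost, x)

wcet, counts = best
print(wcet, counts)
```

With these costs the worst case takes the loop 10 times and passes through B5 once, because B5 is more expensive than B4.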
5 - 27
Contents
Introduction, Program Path Analysis, Value Analysis (current part), Caches, Pipelines, Integrated analyses
5 - 28
Abstract Interpretation (AI)
Semantics-based method for static program analysis.
Basic idea of AI: perform the program's computations using value descriptions or abstract values in place of the concrete values; start with a description of all possible inputs.
AI supports correctness proofs.
5 - 29
Abstract Interpretation: the Ingredients
Abstract domain: related to the concrete domain by abstraction and concretization functions, e.g. L → Intervals, where Intervals = LB × UB, LB = UB = Int ∪ {-∞, ∞}, instead of L → Int.
Abstract transfer functions for each statement type: abstract versions of their semantics, e.g. + : Intervals × Intervals → Intervals, where [a,b] + [c,d] = [a+c, b+d], with + extended to -∞, ∞.
A join function combining abstract values from different control-flow paths, e.g. ⊔ : Intervals × Intervals → Intervals, where [a,b] ⊔ [c,d] = [min(a,c), max(b,d)].
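The interval domain with its transfer function for + and its join can be written down directly. This is a minimal sketch; the tuple representation and the example values are assumptions made for the illustration.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]            # (lower bound, upper bound)
TOP: Interval = (-math.inf, math.inf)     # no information: any value is possible

def add(x: Interval, y: Interval) -> Interval:
    """Abstract transfer function for +: [a,b] + [c,d] = [a+c, b+d]."""
    return (x[0] + y[0], x[1] + y[1])

def join(x: Interval, y: Interval) -> Interval:
    """Join at control-flow merges: [a,b] join [c,d] = [min(a,c), max(b,d)]."""
    return (min(x[0], y[0]), max(x[1], y[1]))

# hypothetical abstract values
print(add((4, 4), (-4, 4)))     # (0, 8)
print(join((0, 3), (5, 9)))     # (0, 9)
print(add((1, 2), TOP))         # (-inf, inf)
```

The join over-approximates: (0, 9) also contains the value 4, which neither incoming path can produce. This loss of precision is the price for a finite description.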
5 - 30
Value Analysis
Motivation:
  Provide access information to data-cache/pipeline analysis
  Detect infeasible paths
  Derive loop bounds
Method: calculate intervals at all program points, i.e. lower and upper bounds for the set of possible values occurring in the machine program (addresses, register contents, local and global variables).
5 - 31
Value Analysis
Intervals are computed along the CFG edges.
At joins, intervals are "unioned".
move #4,D0
add D1,D0
move (A0,D0),D1
Which address is accessed here?
[Figure: interval annotations for the registers D0, D1 and A0 before and after each instruction, and the resulting address interval of the memory access.]
5 - 32
Contents
Introduction, Program Path Analysis, Value Analysis, Caches (current part), Pipelines, Integrated analyses
5 - 33
Caches: Fast Memory on Chip
Caches are used, because
  fast main memory is too expensive
  the speed gap between CPU and memory is too large and increasing
Caches work well in the average case:
  Programs access data locally (many hits)
  Programs reuse items (instructions, data)
  Access patterns are distributed evenly across the cache
5 - 34
Caches
[Figure: Processor, Cache, Bus, Memory. Cache: fast, small, expensive; access takes ~ 1 cycle. Memory: (relatively) slow, large, cheap; access takes ~ 100 cycles.]
5 - 35
Caches: How They Work
The CPU wants to read/write at memory address a and sends a request for a to the bus.
Cases:
  Block m containing a is in the cache (hit): the request for a is served in the next cycle.
  Block m is not in the cache (miss): m is transferred from main memory to the cache, m may replace some block in the cache, and the request for a is served asap while the transfer still continues.
Several replacement strategies (LRU, PLRU, FIFO, ...) determine which line to replace.
5 - 36
A-Way Set Associative Cache
5 - 37
LRU Strategy
Each cache set has its own replacement policy => cache sets are independent: everything is explained in terms of one set.
LRU replacement strategy:
  Replace the block that has been Least Recently Used
  Modeled by ages
Example: 4-way set associative cache
[Table: contents of one set, ordered from age 0 to age 3, initially and after three accesses: a miss, then a hit, then another miss.]
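The age model of LRU can be sketched with a list whose index is the age of a block. The block names and the access sequence below are hypothetical; they follow the same miss, hit, miss pattern as the example.

```python
from typing import List

def lru_access(cache_set: List[str], block: str, ways: int = 4) -> bool:
    """Update one cache set in place (index = age, 0 is the youngest).
    Returns True on a hit, False on a miss."""
    hit = block in cache_set
    if hit:
        cache_set.remove(block)       # only blocks younger than `block` age by one
    elif len(cache_set) == ways:
        cache_set.pop()               # evict the oldest block
    cache_set.insert(0, block)        # the accessed block gets age 0
    return hit

s = ["m0", "m1", "m2", "m3"]          # hypothetical initial contents, age 0..3
for b in ["m4", "m1", "m5"]:
    was_hit = lru_access(s, b)
    print(b, "hit" if was_hit else "miss", s)
```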
5 - 38
Cache Analysis
How to statically precompute cache contents:
Must analysis: for each program point (and calling context), find out which blocks are in the cache. Determines safe information about cache hits. Each predicted cache hit reduces the WCET.
May analysis: for each program point (and calling context), find out which blocks may be in the cache. The complement says what is not in the cache. Determines safe information about cache misses. Each predicted cache miss increases the BCET.
5 - 50
Contexts
Cache contents depend on the context, i.e. calls and loops.
The first iteration loads the cache: intersection loses most of the information.
Distinguish as many contexts as useful:
  unrolling for caches
  unrolling for branch prediction (pipeline)
[Figure: loop "while cond do" with a join (must) at the loop head.]
5 - 51
Contents
Introduction, Program Path Analysis, Value Analysis, Caches, Pipelines (current part), Integrated analyses
5 - 52
Comparison of Architectures
[Figure: instruction timing for three architectures: single cycle, multiple cycle, and pipelining.]
5 - 53
Hardware Features: Pipelines
Ideal case: 1 instruction per cycle
[Figure: overlapped execution of successive instructions, each passing through the stages Fetch, Decode, Execute, WB.]
5 - 54
Datapath of a Pipeline Architecture
5 - 55
Hardware Features: Pipelines
Several instructions can be executed in parallel.
Some pipelines can begin more than one instruction per cycle: VLIW, superscalar.
Some CPUs can execute instructions out-of-order.
Problems: hazards and cache misses.
5 - 56
Pipeline Hazards
Data hazards: operands not yet available (data dependences)
Resource hazards: consecutive instructions use the same resource
Control hazards: conditional branch
Instruction-cache hazards: instruction fetch causes a cache miss
5 - 57
Control Hazard
5 - 58
Data Hazard
5 - 59
Static Exclusion of Hazards
Cache analysis: prediction of cache hits on instruction or operand fetch or store
Dependence analysis: analysis of data/control hazards
Resource reservation tables: analysis of resource hazards
[Figure: instruction sequence of two lwz loads (into r4 and r7, addressed relative to r1) interleaved with add r4, r5, r6 and add r8, r4, r4, annotated with a predicted cache hit, operand readiness and pipeline stage occupancy.]
5 - 60
CPU as a (Concrete) State Machine
The processor (pipeline, cache, memory, inputs) is viewed as a big state machine, performing transitions every clock cycle.
Starting in an initial state for an instruction, transitions are performed until a final state is reached:
  End state: the instruction has left the pipeline
  # transitions: execution time of the instruction
function exec (b : basic block, s : concrete pipeline state) t : trace
  interprets the instruction stream of b starting in state s, producing trace t
  the successor basic block is interpreted starting in initial state last(t)
  length(t) gives the number of cycles
5 - 61
An Abstract Pipeline for a Basic Block
function exec (b : basic block, s : abstract pipeline state) t : trace
  interprets the instruction stream of b (annotated with cache information) starting in state s, producing trace t
  length(t) gives the number of cycles
What is different?
  Abstract states may lack information, e.g. about cache contents.
  Assuming local worst cases is safe (in the case of no timing anomalies).
  Traces may be longer (but never shorter).
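5 - 62
What is different?
Question: what is the starting state for a successor basic block? In particular, if there are several predecessor blocks.
Alternatives:
  sets of states
  combine by assuming that the local worst case is safe
[Figure: several predecessor states merging at the entry of a successor block.]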
5 - 63
Summary
Cache analysis: using statically computed effective addresses and loop bounds
Pipeline analysis:
  assume cache hits where predicted
  assume cache misses where predicted or not excluded
  only the "worst" result states of an instruction need to be considered as input states for successor instructions
6 - 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
6. Multi-Criteria Optimization
doc. dr. Gregor Papa
6 - 2
System Design
[Figure: design flow from Specification via System Synthesis and Estimation to SW-Compilation and HW-Synthesis, producing Machine Code and Net lists; further elements are the Instruction Set, Intellectual Prop. Code and Intellectual Prop. Block.]
6 - 3
Design Space Exploration
[Figure: Application and Architecture are combined by the Mapping, followed by Estimation and (multi-objective) optimization.]
6 - 4
Lecture Synopsis
Introduction, Optimization, Design, Implementation
6 - 5
Evolutionary Multiobjective Optimization Algorithms
What are evolutionary algorithms?
  randomized, problem-independent search heuristics → applicable to black-box optimization problems
How do they work?
  by iteratively improving a population of solutions by variation and selection → can find many different optimal solutions in a single run
6 - 6
The Knapsack Problem
Items:
  weight = 750g, profit = 5
  weight = 1500g, profit = 8
  weight = 300g, profit = 7
  weight = 1000g, profit = 3
Goal: choose a subset that
  maximizes overall profit
  minimizes total weight
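With only four items, all 16 subsets can be enumerated and the trade-off between the two goals made explicit. The sketch assumes the item values are 750 g / 5, 1500 g / 8, 300 g / 7 and 1000 g / 3; the helper names are hypothetical.

```python
from itertools import combinations

# (weight in g, profit); item values assumed as stated in the lead-in
items = [(750, 5), (1500, 8), (300, 7), (1000, 3)]

def evaluate(subset):
    """Total (weight, profit) of a subset given as a tuple of item indices."""
    return (sum(items[i][0] for i in subset), sum(items[i][1] for i in subset))

def dominates(a, b):
    """a dominates b: not heavier, not less profitable, and not identical."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

points = {evaluate(s) for r in range(len(items) + 1)
          for s in combinations(range(len(items)), r)}
front = sorted(p for p in points if not any(dominates(q, p) for q in points))
print(front)
```

Six of the 16 subsets remain; none of them is better than another in both weight and profit.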
6 - 7
The Solution Space
[Figure: all solutions plotted by weight (500g to 3500g) against profit (5 to 20).]
6 - 8
The Trade-off Front
Observations:
  there is no single optimal solution, but
  some solutions are better than others
Two tasks: finding the good solutions, and selecting a solution.
[Figure: solutions plotted by weight (500g to 3500g) against profit (5 to 20).]
6 - 9
Decision Making: Selecting a Solution
Approaches:
  profit more important than cost (ranking)
  weight must not exceed 2400g (constraint)
[Figure: solutions plotted by weight and profit; solutions above the weight limit are marked "too heavy".]
6 - 10
When to Make the Decision
Before optimization:
  ranks objectives, defines constraints, ...
  searches for one (green) solution
After optimization:
  searches for a set of (green) solutions
  selects one solution considering constraints
  decision making often easier
  evolutionary algorithms well suited
[Figure: both alternatives shown on the weight/profit plot with the "too heavy" region.]
6 - 11
Optimization Alternatives
Use of classical single objective optimization methods:
  simulated annealing, tabu search
  integer linear program
  other constructive or iterative heuristic methods
Decision making (weighting the different objectives) is done before the optimization.
Population based optimization methods:
  evolutionary algorithms
  genetic algorithms
Decision making is done after the optimization.
6 - 12
Fitness and Multiple Objectives
Aggregation-based (e.g. weighted sum): parameter-oriented, scaling-dependent
Dominance-based: set-oriented, scaling-independent
[Figure: two plots over the objectives y1 and y2 illustrating both approaches.]
6 - 13
Weighted Cost Function
A transformation with parameters (w1, w2, ..., wk) maps the multiple objectives (y1, y2, ..., yk) to a single objective y.
Example: weighting approach (maximization problem)
y = w_1 y_1 + \ldots + w_k y_k
6 - 14
Outline of a Simple Evolutionary Algorithm
[Figure: loop over a population of bit strings (e.g. 0011 = 1 solution): fitness evaluation (e.g. 0111 with fitness = 19), mating selection, recombination, mutation (e.g. 0011 → 1011), environmental selection.]
recombination + mutation = variation
6 - 15
Starting Point: Encoding of Solutions
A subset is encoded as a bit vector with one bit per item, e.g. 0 0 1 1 for item 1 to item 4.
6 - 16
A Generic Multiobjective EA
[Figure: loop between population and archive with the steps update and truncate (new archive) and sample, select and vary (new population).]
6 - 17
An Evolutionary Algorithm in Action
[Figure: population in the objective space (max. y2, min. y1) together with a hypothetical trade-off front.]
6 - 18
Black-Box Optimization
Optimization algorithm: only allowed to evaluate f (direct search)
decision vector x → objective function f (e.g. simulation model) → objective vector f(x)
[Figure: example of a simulation model for a vehicle, built from a Stretch-Module, a Handling-Module, a Vehicle-Module and a Decision-Module.]
6 - 19
Design Space Exploration
Specification, Optimization, Implementation, Evaluation
[Figure: trade-off between cost, latency and power consumption.]
6 - 20
Packet Processing in Networks
[Figure: network with Access and Core parts connecting Mobile Internet, Embedded Internet Devices and Wearable Computing.]
6 - 21
Network Processors
Network processor = high-performance, programmable device designed to efficiently execute communication workloads [Crowley et al.]
[Figure: a network processor receives incoming flows (packet streams), both real-time flows (e.g., voice) and non-real-time flows (e.g., sftp), performs routing / forwarding, transcoding and encryption / decryption, and emits outgoing flows (processed packets).]
6 - 22
Optimization Scenario: Overview
Given:
  specification of the task structure (task model) = for each flow the corresponding tasks to be executed
  different usage scenarios (flow model) = sets of flows with different characteristics
Sought: network processor implementation = architecture + task mapping + scheduling
Objectives:
  maximize performance
  minimize cost
Subject to:
  memory constraint
  delay constraints
(performance model)
6 - 23
Exploration Strategy
[Figure: the multiobjective optimization (evolution) proposes allocations and bindings; an architecture is constructed from the architecture template, the task graph and the binding restrictions; its performance is estimated for each usage scenario separately; the resulting performance / cost vector is returned to the optimization.]
6 - 24
Lecture Synopsis
Introduction, Optimization (current part), Design, Implementation
6 - 25
Dominance, Pareto Points
A (design) point J_k is dominated by J_i, if J_i is better than or equal to J_k in all criteria and better in at least one criterion.
A point is Pareto-optimal or a Pareto-point, if it is not dominated.
The domination relation imposes a partial order on all design points.
We are faced with a set of optimal solutions.
Divergence of solutions vs. convergence.
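The dominance relation and the resulting set of Pareto points can be sketched in a few lines. The sketch assumes that every criterion is to be minimized; the design points are hypothetical (cost, latency) pairs.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: better or equal in all criteria, better in at least one.
    Assumption: smaller values are better in every criterion."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_points(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep exactly the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]   # hypothetical (cost, latency)
front = pareto_points(designs)
print(front)
```

The points (10, 5), (12, 4) and (8, 6) are mutually incomparable, which is why dominance gives only a partial order.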
6 - 26
Multi-objective Optimization
6 - 27
Multiobjective Optimization
Maximize (y1, y2, ..., yk) = f(x1, x2, ..., xn)
Pareto set = set of all Pareto-optimal solutions
[Figure: two plots over y1 and y2. Left: relative to a given solution, the regions are labeled better, worse and incomparable. Right: Pareto optimal = not dominated solutions versus dominated solutions.]
6 - 28
Randomized (Black Box) Search Algorithms
t = 1: (randomly) choose a solution x_1 to start with
t → t + 1: (randomly) choose a solution x_{t+1} using solutions x_1, ..., x_t
Idea: find good solutions without investigating all solutions.
Assumptions:
  better solutions can be found in the neighborhood of good solutions
  information is available only by function evaluations
6 - 29
Types of Randomized Search Algorithms
The algorithms differ in memory, selection (mating selection, environmental selection) and variation:
  evolutionary algorithm: both mating and environmental selection
  tabu search: no mating selection
  simulated annealing: no mating selection
6 - 30
Limitations of Randomized Search Algorithms
The No-Free-Lunch Theorem:
All search algorithms provide on average the same performance over all possible functions with finite search and objective spaces. [Wolpert, Macready: 1997]
Remarks:
  Not all functions are equally likely and realistic.
  We cannot expect to design an algorithm that beats all others.
  Ongoing research: which algorithm is suited for which class of problems?
6 - 31
Course Synopsis
Introduction, Optimization, Design (current part), Implementation
6 - 32
Design Choices
representation, fitness assignment, mating selection, environmental selection, variation operators, parameters
[Figure: the evolutionary algorithm loop on bit strings, annotated with these design choices.]
6 - 33
Comparison of Three Implementations
SPEA2, VEGA, extended VEGA
2-objective knapsack problem
Trade-off between distance and diversity?
6 - 34
Design Choices: Representation
6 - 35
Representation
search space → (decoder) → solution space → (objectives) → objective space
Solutions are encoded by vectors, matrices, trees, lists, ...
Encodings have fixed length or variable length.
Issues:
  completeness (each solution has an encoding)
  uniformity (all solutions are represented equally often)
  redundancy (cardinality of search space vs. solution space)
  feasibility (each encoding maps to a feasible solution)
6 - 36
Example: Binary Vector Encoding
Given: graph
Goal: find a minimum subset of nodes such that each edge is connected to at least one node of this subset (minimum vertex cover)
[Figure: graph with five nodes and a bit vector with one bit per node (selected?).]
6 - 37
Example: Integer Vector Encoding
Given: graph, k colors
Goal: assign each node one of the k colors such that the number of connected nodes with the same color is minimized (graph coloring problem)
[Figure: graph with five nodes and an integer vector holding one color per node.]
6 - 38
Example: Real Vector Encoding
[Figure: vector of real values, one per parameter x1, ..., xn.]
6 - 39
Tree Example: Parking a Truck
Goal: find a function c with u = c(x, y, d, t)
[Figure: truck with cab and trailer backing up to a dock at constant speed; (x, y) is the position, d and t are angles of cab and trailer, u is the steering angle.]
6 - 40
Search Space for the Truck Problem
Operators: (shown as a figure)
Arguments: position x, position y, cab angle d, trailer angle t
Search space: set of symbolic expressions using the above operators and arguments
6 - 41
Example Solution: Tree Representation
[Figure: expression tree whose leaves are the arguments x, d, y and t.]
The tree encodes a function (symbolic expression) u that combines a subexpression over x and d with a subexpression over y and t.
6 - 42
A Solution Found by an EA
[Figure: truck simulation and the encoded tree.]
6 - 43
Design Choices: Fitness Assignment
6 - 44
Fitness Assignment
Fitness F = scalar value representing the quality of an individual
solution in search space → solution in solution space → solution in objective space
The simple case: single objective optimization, where the fitness follows directly from the objective value.
More difficult cases:
  fitness not only takes into account the different objectives (compliance to Pareto optimality) but also properties of the whole population
  multiple optima need to be approximated (diversity)
  constraints are involved which have to be met
6 - 45
Simple Example: Pareto Ranking
[Figure: six design points, numbered 1 to 6, plotted by cost and execution time.]
Fitness function:
F(1) = 3, F(2) = 1, F(3) = 1, F(4) = 2, F(5) = 1, F(6) = 0
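One common form of Pareto ranking, assumed here since the ranking rule itself is not spelled out, gives each point the number of points that dominate it, so that non-dominated points get fitness 0. The points are hypothetical (cost, execution time) pairs, both to be minimized.

```python
def dominates(a, b):
    # assumption: both criteria (cost, execution time) are minimized
    return all(x <= y for x, y in zip(a, b)) and a != b

def pareto_rank(points):
    """Fitness of a point = number of points that dominate it (0 = non-dominated)."""
    return [sum(dominates(q, p) for q in points) for p in points]

pts = [(4, 4), (1, 5), (3, 2), (5, 1), (3, 4)]   # hypothetical design points
ranks = pareto_rank(pts)
print(ranks)
```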
6 - 46
Constraint Handling
Constraint: g(x1, x2, ..., xn) ≥ 0
[Figure: solution space divided into a feasible region (≥ 0) and an infeasible region (< 0).]
Approaches:
  construct initialization and variation such that infeasible solutions are not generated (resp. not inserted)
  representation is such that decoding always yields a feasible solution
  calculate the constraint violation g(x1, x2, ..., xn) and incorporate it into the fitness, e.g. F = f - penalty(g(x1, x2, ..., xn)) (fitness to be maximized), using a penalty function with penalty(y) = 0 if y ≥ 0
  include the constraints as new objectives
6 - 47
Design Choices: Mating Selection and Environmental Selection
6 - 48
Selection
Two types of selection:
  mating selection = select for variation
  environmental selection = select for survival
6 - 49
Tournament Selection
T = tournament size (binary tournament selection means T = 2)
To fill the mating pool from the population, repeat:
  uniformly choose T individuals at random, independently of fitness
  compare fitness and copy the best individual into the mating pool
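Tournament selection as described can be sketched as follows. The function name, the fixed random seed and the bit-string individuals are assumptions made for the example; fitness is assumed to be maximized.

```python
import random
from typing import Callable, List, Optional, Sequence

def tournament_selection(population: Sequence[str],
                         fitness: Callable[[str], float],
                         pool_size: int,
                         T: int = 2,
                         rng: Optional[random.Random] = None) -> List[str]:
    """Fill a mating pool; in each tournament the fittest of T random individuals wins."""
    rng = rng or random.Random(0)
    pool = []
    for _ in range(pool_size):
        # uniform choice, independent of fitness (with replacement)
        contestants = [rng.choice(population) for _ in range(T)]
        pool.append(max(contestants, key=fitness))
    return pool

population = ["0011", "0111", "0100", "1011"]      # hypothetical bit-string individuals
pool = tournament_selection(population, fitness=lambda s: s.count("1"), pool_size=6)
print(pool)
```

A larger T increases the selection pressure, because weak individuals are less likely to win a tournament.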
6 - 50
Design Choices: Variation Operators
6 - 51
Vector Mutation: Examples
Bit vectors: each bit is flipped with a given probability.
Permutations: swap two elements, or rearrange a part of the permutation.
[Figure: a bit vector and two permutations before and after mutation.]
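Both kinds of vector mutation can be sketched briefly. The mutation probability 1/n (n = vector length) and the seed are assumptions for the illustration, not values taken from the slide.

```python
import random
from typing import List

def bit_flip(bits: List[int], p: float, rng: random.Random) -> List[int]:
    """Flip each bit independently with probability p."""
    return [1 - b if rng.random() < p else b for b in bits]

def swap(perm: List[int], rng: random.Random) -> List[int]:
    """Exchange two randomly chosen positions of a permutation."""
    i, j = rng.sample(range(len(perm)), 2)
    child = list(perm)
    child[i], child[j] = child[j], child[i]
    return child

rng = random.Random(1)
parent = [1, 0, 1, 1, 1, 0]
mutant = bit_flip(parent, p=1 / len(parent), rng=rng)   # assumed rate: 1/n
swapped = swap([1, 2, 3, 4, 5, 6], rng)
print(mutant, swapped)
```

The swap keeps the child a valid permutation, which a plain bit-flip style mutation on the same encoding would not guarantee.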
6 - 52
Mutation Operators on Trees: Grow
[Figure: expression tree before and after the grow mutation.]
6 - 53
Mutation Operators on Trees: Shrink
[Figure: expression tree before and after the shrink mutation.]
6 - 54
Mutation Operators on Trees: Switch
[Figure: expression tree before and after the switch mutation.]
6 - 55
Mutation Operators on Trees: Replace
[Figure: expression tree before and after the replace mutation.]
6 - 56
Vector Recombination: Examples
[Figure: two parents and the resulting child, for bit vectors and for permutations.]
6 - 57
Recombination of Trees
[Figure: two parent trees exchange subtrees.]
6 - 58
A Generic Multiobjective EA
[Figure: loop between population and archive with the steps update and truncate (new archive) and sample, select and vary (new population).]
6 - 59
SPEA2 Algorithm
Step 1: Generate initial population P_0 and empty archive (external set) A_0. Set t = 0.
Step 2: Calculate fitness values of individuals in P_t and A_t.
Step 3: A_{t+1} = non-dominated individuals in P_t and A_t. If size of A_{t+1} > N then reduce A_{t+1}, else if size of A_{t+1} < N then fill A_{t+1} with dominated individuals in P_t and A_t.
Step 4: If t > T then output the non-dominated set of A_{t+1}. Stop.
Step 5: Fill mating pool by binary tournament selection.
Step 6: Apply recombination and mutation operators to the mating pool and set P_{t+1} to the resulting population. Set t = t + 1 and go to Step 2.
6 - 60
SPEA2 Fitness Assignment
Idea (Step 2): calculate dominance rank weighted by dominance count
Note: higher objective function value = better; smaller fitness = better
non-dominated solutions: # dominated solutions
dominated solutions: # of non-Pareto solutions + ∑ strengths of dominators
[Figure: solutions in the (y1, y2) plane annotated with their values.]
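The core of the idea, strengths and the sum of the strengths of the dominators, can be sketched as below. This follows the raw-fitness definition commonly given for SPEA2 and leaves out any additional terms (such as density information), so it may differ in detail from the variant on the slide. The points are hypothetical.

```python
def dominates(a, b):
    # higher objective values are better, as in the note on the slide
    return all(x >= y for x, y in zip(a, b)) and a != b

def strengths_and_raw_fitness(points):
    n = len(points)
    # strength of i: number of solutions that i dominates
    strength = [sum(dominates(points[i], points[j]) for j in range(n))
                for i in range(n)]
    # raw fitness of i: sum of the strengths of all solutions that dominate i
    raw = [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
           for i in range(n)]
    return strength, raw

pts = [(1, 5), (3, 4), (5, 1), (2, 3), (1, 1)]     # hypothetical (y1, y2)
S, R = strengths_and_raw_fitness(pts)
print(S, R)
```

Non-dominated solutions end up with raw fitness 0; a solution dominated by many strong solutions gets a large (bad) value.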
6 - 62
Course Synopsis
Introduction, Optimization, Design, Implementation (current part)
6 - 6�
Implementation: Components
A framework that
  provides ready-to-use modules (algorithms / applications)
  is simple to use
  is independent of programming language and OS
  comes with minimum overhead
Idea: separate the problem-dependent from the problem-independent part
  problem-independent: selection, archiving, fitness assignment
  problem-dependent: representation, objective functions, mutation, recombination
6 - 6�
The Concept of PISA
Algorithms: SPEA2, NSGA-II, PAES
Applications: knapsack, TSP, network processor design
PISA: text-based Platform and programming language independent Interface for Search Algorithms [Bleuler et al.: 2003]
6 - 66
PISA: Implementation
selector process (application independent):
  mating / environmental selection
  individuals are described by IDs and objective vectors
variator process (application dependent):
  variation operators
  stores and manages individuals
Both processes communicate through text files on a shared file system, using a handshake protocol:
  state / action
  individual IDs
  objective vectors
  parameters
7 - 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
7. Mapping Applications To Architectures
doc. dr. Gregor Papa
7 - 2
System Design
[Figure: design flow from Specification via System Synthesis and Estimation to SW-Compilation and HW-Synthesis, producing Machine Code and Net lists; further elements are the Instruction Set, Intellectual Prop. Code and Intellectual Prop. Block.]
7 - 3
Synthesis
Synthesis transforms behavior into structure.
Synthesis tasks:
  allocation: select components
  binding: assign functions to components
  scheduling: determine execution order
[Figure label: mapping]
(allocation and) binding is sometimes called partitioning
7 - 4
Application Specification
Depends on the underlying model of computation.
Examples (see also next slides):
  Task graphs (data flow graph, control flow graph)
  Process networks (Kahn Process Network, Synchronous Dataflow)
  State machine representations (SpecCharts, StateCharts, Polis) [not covered in this course]
For the mapping, very often only the network structure and abstract properties of the processes are relevant (abstraction from detailed process function).
7 - 5
Data Flow Graph (DFG)
x = 3*a + b*b - c;
y = a + b*x;
z = b - c*(a + b);
[Figure: data flow graph of these three assignments with inputs a, b, c and outputs x, y, z.]
7 - 6
Control Flow Graph (CFG)
what_is_this {
 1  read (a,b);
 2  done = FALSE;
 3  repeat {
 4    if (a>b)
 5      a = a-b;
 6    elseif (b>a)
 7      b = b-a;
 8    else done = TRUE;
 9  } until done;
10  write (a);
}
[Figure: CFG with one node per numbered statement (1 to 10); the edges carry the labels a>b, a<=b, a<b, a=b, done and !done.]
7 - 7
Kahn Process Network
Hierarchical network for an MJPEG application:
7 - 8
Architecture Specification
Depends on the underlying model of the platform.
Usually a graph notation is used; properties of the underlying platform are usually attached to its elements.
7 - 9
Example Architecture Specification (XML)
- <processor name="processor1" type="DSP">
    <port name="processor_port" type="duplex" />
    <configuration name="clock" value="100 MHz" />
  </processor>
+ <processor name="processor2" type="RISC">
+ <memory name="sharedmemory" type="DXM">
- <hw_channel name="in_tile_link" type="bus">
    <port name="port1" type="duplex" />
    <port name="port2" type="duplex" />
    <port name="port3" type="duplex" />
    <configuration name="buswidth" value="32bit" />
  </hw_channel>
- <connection name="processor1link">
    <origin name="processor1">
      <port name="processor_port" />
    </origin>
    <target name="in_tile_link">
      <port name="port1" />
    </target>
  </connection>
+ <connection name="processor2link">
+ <connection name="memorylink">
[Figure: DSP, RISC and DXM connected by a bus.]
7 - 10
Mapping Specification
Relates application and architecture specification:
  maps processes to computing resources
  maps communication between processes (in case of process networks) to communication paths of the architecture
  specifies resource sharing disciplines and scheduling
7 - 11
Example (1)
Basic model with a data flow graph and static scheduling
Problem graph GP(VP,EP):
[Figure: data flow graph GP(VP, EP) with nodes 1 to 7.]
Interpretation:
• VP consists of functional nodes VPf (task, procedure) and communication nodes VPc.
• EP represent data dependencies.
7 - 12
Example (2)
Architecture graph GA(VA,EA):
• VA consists of functional resources VAf (RISC, ASIC) and bus resources VAc. These components are potentially allocatable.
• EA model directed communication.
[Figure: architecture with RISC, HWM1, HWM2, shared bus and PTP bus, next to the corresponding architecture graph.]
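7 - 13
Example (3)
Definition: A specification graph is a graph GS=(VS,ES) consisting of a problem graph GP, an architecture graph GA, and edges EM. In particular, VS=VP∪VA, ES=EP∪EA∪EM.
[Figure: specification graph with the problem graph GP (nodes 1 to 7) on the left, the architecture graph GA (RISC, HWM1, HWM2, SB, PTP) on the right, and the edges EM between them.]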
7 - 14
Example (4)
Three main tasks of synthesis:
• Allocation α is a subset of VA.
• Binding β is a subset of EM, i.e., a mapping of functional nodes of VP onto resource nodes of VA.
• Schedule τ is a function that assigns a number (start time) to each functional node.
7 - 15
Example (5)
Definition: Given a specification graph GS, an implementation is a triple (α,β,τ), where α is a feasible allocation, β is a feasible binding, and τ is a schedule.
[Figure: implementation of the example, showing the allocated resources (α), the binding β of the nodes 1 to 7 to resources, and the start times τ annotated at the nodes.]
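A check of whether a triple (α, β, τ) is an implementation can be sketched on a tiny example. Feasibility is not defined formally on these slides, so the conditions below are assumptions: every node is bound exactly once, each binding is an edge of EM onto an allocated resource, and a node starts only after its predecessors have finished. Resource conflicts are not checked. The graph, the delays and all names are hypothetical.

```python
# hypothetical problem graph edges (data dependencies) and mapping edges
EP = [(1, 2), (2, 3)]
EM = {(1, "RISC"), (2, "RISC"), (2, "HWM1"), (3, "HWM1")}
# hypothetical execution time of a node on a resource
delay = {(1, "RISC"): 2, (2, "RISC"): 5, (2, "HWM1"): 1, (3, "HWM1"): 3}

def is_implementation(alpha, beta, tau):
    nodes = {v for e in EP for v in e}
    if set(beta) != nodes:                       # every node is bound exactly once
        return False
    for v, r in beta.items():
        if (v, r) not in EM or r not in alpha:   # binding uses EM and allocated resources
            return False
    # a node may start only after all its predecessors have finished
    return all(tau[u] + delay[(u, beta[u])] <= tau[v] for u, v in EP)

binding = {1: "RISC", 2: "HWM1", 3: "HWM1"}
ok = is_implementation({"RISC", "HWM1"}, binding, {1: 0, 2: 2, 3: 3})
not_allocated = is_implementation({"RISC"}, binding, {1: 0, 2: 2, 3: 3})
too_early = is_implementation({"RISC", "HWM1"}, binding, {1: 0, 2: 1, 3: 3})
print(ok, not_allocated, too_early)
```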
7 - 16
System Design
[Figure: design flow from Specification via System Synthesis and Estimation to SW-Compilation and HW-Synthesis, producing Machine Code and Net lists; further elements are the Instruction Set, Intellectual Prop. Code and Intellectual Prop. Block.]
7 - 17
Design Space Exploration
[Figure: Application and Architecture are combined by the Mapping and evaluated by the Estimation.]
Determine the mapping
Determine important parameters (end-to-end delay, throughput, buffer space, output jitter, ...)
Give feedback to the optimization
7 - 18
Network Processor: Task Model
8 - 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
8. System Partitioning
doc. dr. Gregor Papa
8 - 2
System Design
[Figure: design flow from Specification via System Synthesis and Estimation to SW-Compilation and HW-Synthesis, producing Machine Code and Net lists; further elements are the Instruction Set, Intellectual Prop. Code and Intellectual Prop. Block.]
8 - 3
Partitioning
low level: at the register transfer (RTL) level, at the netlist level
• split a digital circuit and map it to several devices (FPGAs, ASICs)
• system parameters are relatively well-known (area, delay)
high level: at the system level
• comparison of design alternatives mandatory (design space exploration)
• system parameters are unknown
• importance of estimation (analysis, simulation, rapid prototyping)
8 - 4
Models for Partitioning (see previous lecture)
1. model application
2. define architectural template
3. identify possible bindings
Very often, parameters are attached to the above models that simply allow estimating the quality of the partitioning (allocation and binding).
Sometimes, more elaborate estimation methods (simulation, analysis) are applied to give more accurate predictions.
Examples:
• allocation gives cost C as the sum of the allocated component costs
• scheduling gives latency L
• constraints: feasible schedule L ≤ Lmax, feasible allocation C ≤ Cmax
8 - 5
The Partitioning Problem
Definition: The partitioning problem is to assign n objects O = {o1, ..., on} to m blocks (also called partitions) P = {p1, ..., pm}, such that
p1 ∪ p2 ∪ ... ∪ pm = O,
pi ∩ pj = { } ∀ i,j: i ≠ j, and
costs c(P) are minimized.
In system synthesis (simple model):
• objects = data flow graph nodes
• blocks = architecture graph nodes
8 - 6
Cost Functions
measure the quality of a design point
may include
  C = system cost in [$]
  L = latency in [sec]
  P = power consumption in [W]
requires estimation to find C, L, P
Example: linear cost function with penalty
f(C, L, P) = k1·hC(C,Cmax) + k2·hL(L,Lmax) + k3·hP(P,Pmax)
hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax
k1, k2, k3 = weighting and normalization
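The linear cost function with penalty can be sketched once a shape for the h functions is chosen. The slide leaves them open; the sketch assumes the relative excess over the limit (0 if the constraint holds). All numbers are hypothetical.

```python
def violation(value: float, limit: float) -> float:
    """Assumed penalty shape: relative excess over the limit, 0 if the constraint holds."""
    return max(0.0, (value - limit) / limit)

def cost(C, L, P, Cmax, Lmax, Pmax, k=(1.0, 1.0, 1.0)):
    """f(C, L, P) = k1*hC(C, Cmax) + k2*hL(L, Lmax) + k3*hP(P, Pmax)."""
    k1, k2, k3 = k
    return k1 * violation(C, Cmax) + k2 * violation(L, Lmax) + k3 * violation(P, Pmax)

f_ok = cost(C=80, L=9, P=2.0, Cmax=100, Lmax=10, Pmax=2.5)    # all constraints met
f_bad = cost(C=120, L=9, P=2.0, Cmax=100, Lmax=10, Pmax=2.5)  # cost limit exceeded by 20 %
print(f_ok, f_bad)
```

Dividing by the limit makes the three terms dimensionless, which is one way to realize the "normalization" role of the weights.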
8 - 7
General Partitioning Methods
Exact methods:
  enumeration
  Integer Linear Programs (ILP)
Heuristic methods:
  constructive methods
  • random mapping
  • hierarchical clustering
  iterative methods
  • Kernighan-Lin Algorithm
  • Simulated Annealing
  • Evolutionary Algorithms (EA) (see next lecture)
8 - 8
Integer Programming Model
Ingredients:
Cost function
Constraints
Involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$, with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$  (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \geq c_j$, with $b_{i,j}, c_j \in \mathbb{R}$  (2)
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.
If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.
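For very small 0/1 integer programs the definition can be turned directly into an exhaustive search. The following is a minimal sketch, assuming constraints of the form "sum of b[j][i]*x[i] >= c[j]"; the function name `solve01` and the limit `MAX_VARS` are hypothetical. It only illustrates the problem statement and is not how commercial solvers work.

```c
#include <stdbool.h>

#define MAX_VARS 20   /* hypothetical upper bound on the number of variables */

/* Minimise sum(a[i]*x[i]) subject to sum(b[j][i]*x[i]) >= c[j] for each of
   the m constraints, with every x[i] restricted to 0 or 1.
   Returns true and fills best_x / best_cost if a feasible point exists. */
bool solve01(int n, const double a[], int m, double b[][MAX_VARS],
             const double c[], int best_x[], double *best_cost)
{
    bool found = false;
    if (n < 0 || n > MAX_VARS) return false;
    for (unsigned long mask = 0; mask < (1UL << n); mask++) {
        bool feasible = true;
        for (int j = 0; j < m && feasible; j++) {
            double lhs = 0.0;
            for (int i = 0; i < n; i++)
                if (mask & (1UL << i)) lhs += b[j][i];
            if (lhs < c[j]) feasible = false;
        }
        if (!feasible) continue;
        double cost = 0.0;
        for (int i = 0; i < n; i++)
            if (mask & (1UL << i)) cost += a[i];
        if (!found || cost < *best_cost) {   /* keep the cheapest feasible point */
            found = true;
            *best_cost = cost;
            for (int i = 0; i < n; i++) best_x[i] = (int)((mask >> i) & 1UL);
        }
    }
    return found;
}
```

The search visits all 2^n assignments, which matches the exponential worst-case behaviour of integer programming and is only usable for a handful of variables.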
8 - �
Example
minimize: $C = 5x_1 + 6x_2 + 4x_3$
subject to: $x_1 + x_2 + x_3 \geq 2$, $x_1, x_2, x_3 \in \{0, 1\}$
Optimal: $C = 9$
8 - ��
Remarks on Integer Programming
Maximizing the cost function can be done by setting C' = -C
Integer programming is NP-complete.
In practice, running times can increase exponentially with the size of the problem, but problems of some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.
IP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
8 - ��
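Integer Linear Program for Partitioning (1)
Binary variables $x_{i,k}$
$x_{i,k} = 1$: object $o_i$ in block $p_k$
$x_{i,k} = 0$: object $o_i$ not in block $p_k$
Cost $c_{i,k}$, if object $o_i$ is in block $p_k$
Integer linear program:
$x_{i,k} \in \{0, 1\}, \quad 1 \leq i \leq n, \ 1 \leq k \leq m$
$\sum_{k=1}^{m} x_{i,k} = 1, \quad 1 \leq i \leq n$
minimize $\sum_{k=1}^{m} \sum_{i=1}^{n} x_{i,k} \cdot c_{i,k}$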
8 - ��
Integer Linear Program for Partitioning (2)
Additional constraints
example: maximum number of $h_k$ objects in block k
$\sum_{i=1}^{n} x_{i,k} \leq h_k, \quad 1 \leq k \leq m$
The idea of mapping the synthesis problem to an ILP is very popular:
Scheduling can be integrated.
Various additional constraints can be added.
If not solving to optimality, run times are acceptable and a solution with a guaranteed quality can be determined.
Finding the right equations to model the constraints is an art.
8 - ��
Constructive Methods
Random mapping
• each object is assigned to a block randomly
Hierarchical clustering
• stepwise grouping of objects
• closeness function determines how desirable it is to group two objects
Constructive methods
• are often used to generate a starting partition for iterative methods
• show the difficulty of finding proper closeness functions
8 - 14
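Hierarchical Clustering - Example (1)
[Figure: graph with nodes v1 ... v4 and weighted edges; v1 and v3 are merged into v5 = v1 ∪ v3, leaving the nodes v2, v4, v5 with edge weights 10, 7 and 4]
closeness function: arithmetic mean of weights
8 - 15
Hierarchical Clustering - Example (2)
[Figure: v2 and v5 are merged into v6 = v2 ∪ v5, leaving the nodes v4 and v6 with edge weight 5.5]
8 - 16
Hierarchical Clustering - Example (3)
[Figure: v6 and v4 are merged into v7 = v6 ∪ v4]
8 - 17
Hierarchical Clustering - Example (4)
[Figure: clustering tree over v1, v2, v3, v4]
step 1: v5 = v1 ∪ v3
step 2: v6 = v2 ∪ v5
step 3: v7 = v6 ∪ v4
cut lines (partitions)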
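One step of hierarchical clustering with the arithmetic mean as closeness function might look as follows in C. This is a sketch under assumptions: the closeness values are held in a symmetric matrix, a larger value means "closer", and the names `cluster_step` and `NV` are hypothetical.

```c
#define NV 8   /* capacity of the closeness matrix (hypothetical limit) */

/* One clustering step: find the two closest active objects, merge the second
   into the first, and set the closeness between the merged object and every
   other active object to the arithmetic mean of the two old values.
   Returns 0 if fewer than two objects are active, 1 otherwise. */
int cluster_step(double w[NV][NV], int active[NV], int *kept, int *absorbed)
{
    int bi = -1, bj = -1;
    for (int i = 0; i < NV; i++) {
        if (!active[i]) continue;
        for (int j = i + 1; j < NV; j++) {
            if (!active[j]) continue;
            if (bi < 0 || w[i][j] > w[bi][bj]) { bi = i; bj = j; }
        }
    }
    if (bi < 0) return 0;
    for (int k = 0; k < NV; k++) {
        if (!active[k] || k == bi || k == bj) continue;
        w[bi][k] = w[k][bi] = (w[bi][k] + w[bj][k]) / 2.0;
    }
    active[bj] = 0;        /* the absorbed object disappears */
    *kept = bi;
    *absorbed = bj;
    return 1;
}
```

Calling `cluster_step` repeatedly until one object remains yields the clustering tree; cutting the tree at a chosen level gives the partitions.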
8 - 18
Iterative Methods - Kernighan-Lin (1)
Simple greedy heuristic:
Until there is no improvement in cost: re-group a pair of objects which leads to the largest gain in cost
[Figure: graph whose nodes are divided into two partitions]
example: cost = number of edges crossing the partitions; gain = cost before re-group - cost after re-group
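The simple greedy heuristic can be sketched for two blocks as below. The representation (adjacency matrix `adj`, block number 0 or 1 per object in `side`) and all names are assumptions made for this illustration; a re-group is modelled as swapping one object from each block.

```c
#define NN 6   /* number of objects in this sketch (hypothetical) */

/* Cost = number of edges whose end points lie in different blocks. */
int cut_cost(int adj[NN][NN], const int side[NN])
{
    int cost = 0;
    for (int i = 0; i < NN; i++)
        for (int j = i + 1; j < NN; j++)
            if (adj[i][j] && side[i] != side[j]) cost++;
    return cost;
}

/* Repeatedly swap the pair (one object per block) with the largest gain
   until no swap improves the cost. Returns the final cost. */
int greedy_regroup(int adj[NN][NN], int side[NN])
{
    int cost = cut_cost(adj, side);
    for (;;) {
        int best_gain = 0, bi = -1, bj = -1;
        for (int i = 0; i < NN; i++) {
            for (int j = 0; j < NN; j++) {
                if (side[i] != 0 || side[j] != 1) continue;
                side[i] = 1; side[j] = 0;              /* try the swap */
                int gain = cost - cut_cost(adj, side);
                side[i] = 0; side[j] = 1;              /* undo it */
                if (gain > best_gain) { best_gain = gain; bi = i; bj = j; }
            }
        }
        if (bi < 0) return cost;                       /* local minimum reached */
        side[bi] = 1; side[bj] = 0;                    /* perform the best swap */
        cost -= best_gain;
    }
}
```

Because only strictly improving swaps are performed, the loop terminates, but it may stop in a local minimum.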
8 - 1�
Iterative Methods - Kernighan-Lin (2)
Problem
Simple greedy heuristic can get stuck in a local minimum.
Improved algorithm (Kernighan-Lin):
as long as a better partition is found:
• from all possible pairs of objects, virtually re-group the "best" (lowest cost of the resulting partition); then from the remaining not yet touched objects virtually re-group the "best" pair, etc., until all objects have been re-grouped.
• from these n/2 partitions take the one with smallest cost and actually perform the corresponding re-group operations.
8 - ��
Iterative Methods - Simulated Annealing
From physics:
metal and gas take on a minimal-energy state during cooling down (under certain constraints):
• at each temperature, the system reaches a thermodynamic equilibrium
• the temperature is decreased sufficiently slowly
Probability that a particle "jumps" to a higher-energy state:
$P(e_i, e_{i+1}, T) = e^{\frac{e_i - e_{i+1}}{k_B T}}$
Application to combinatorial optimization:
energy = cost of a solution (partition)
cost decreases with temperature; sometimes (with a certain probability) increases in cost are accepted.
8 - �1
Iterative Methods - Simulated Annealing
temp = temp_start;
cost = c(P);
while (Frozen() == FALSE) {
    while (Equilibrium() == FALSE) {
        P' = RandomMove(P);
        cost' = c(P');
        deltacost = cost' - cost;
        if (Accept(deltacost, temp) > random[0,1)) {
            P = P';
            cost = cost';
        }
    }
    temp = DecreaseTemp(temp);
}
$\mathrm{Accept}(deltacost, temp) = e^{-\frac{deltacost}{k \cdot temp}}$
8 - 22
Iterative Methods - Simulated Annealing
Cooling Down: DecreaseTemp(), Frozen()
• temp_start = 1.0
• temp = α · temp (typical: 0.8 ≤ α ≤ 0.99)
• terminate when temp < temp_min or there is no more improvement
Equilibrium: Equilibrium()
• after defined number of iterations or when there is no more improvement
Complexity
• from exponential to constant, depending on the implementation of the functions Equilibrium(), DecreaseTemp(), and Frozen()
• the longer the runtime, the better the quality of results
• typical: construct functions to get polynomial runtimes
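The cooling functions determine the runtime. A minimal sketch of a geometric cooling schedule is shown below; the function names mirror DecreaseTemp() and Frozen() from the slide, while the extra parameters (`alpha`, `temp_min`) and the helper `count_temperature_steps` are assumptions for illustration.

```c
/* Geometric cooling: temp = alpha * temp. */
double decrease_temp(double temp, double alpha)
{
    return alpha * temp;
}

/* Frozen once the temperature has dropped below temp_min. */
int frozen(double temp, double temp_min)
{
    return temp < temp_min;
}

/* Number of outer-loop iterations until the system is frozen,
   or -1 if the parameters would not lead to termination. */
int count_temperature_steps(double temp_start, double alpha, double temp_min)
{
    int steps = 0;
    if (alpha <= 0.0 || alpha >= 1.0 || temp_min <= 0.0) return -1;
    for (double t = temp_start; !frozen(t, temp_min); t = decrease_temp(t, alpha))
        steps++;
    return steps;
}
```

If Equilibrium() is implemented as a fixed number K of moves per temperature, the total number of moves is K times this step count, which is one way to obtain a bounded, polynomial runtime.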
�- �
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
9. Allocation
doc. dr. Gregor Papa
�- 2
Integer programming models
Ingredients:
Cost function
Constraints
Involving linear expressions of integer variables from a set X
Cost function: $C = \sum_{x_i \in X} a_i x_i$, with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$  (1)
Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \geq c_j$, with $b_{i,j}, c_j \in \mathbb{R}$  (2)
Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer (linear) programming (ILP) problem.
If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer (linear) programming problem.
�- �
Example
$C = 5x_1 + 6x_2 + 4x_3$
$x_1 + x_2 + x_3 \geq 2$, $x_1, x_2, x_3 \in \{0, 1\}$
Optimal: $C = 9$
�- �
Remarks on integer programming
Maximizing the cost function: just set C' = -C
Integer programming is NP-complete.
In the worst case, running times grow exponentially with problem size, but problems of >1000 vars are solvable with a good solver (depending on the size and structure of the problem)
The case of $x_i \in \mathbb{R}$ is called linear programming (LP).
LP has polynomial complexity, but most algorithms used are exponential in the worst case; in practice they are still faster than for ILP problems.
The case of some $x_i \in \mathbb{R}$ and some $x_i \in \mathbb{N}$ is called mixed integer-linear programming.
ILP/IP models can be a good starting point for modeling, even if in the end heuristics have to be used to solve them.
�- �
Simulated Annealing
General method for solving combinatorial optimization problems.
Based on the model of slowly cooling crystal liquids.
Some configuration is subject to changes.
Special property of Simulated annealing: Changes leading to a poorer configuration (with respect to some cost function) are accepted with a certain probability.
This probability is controlled by a temperature parameter: the probability is smaller for smaller temperatures.
�- �
Explanation
Initially, some random initial configuration is created.
Current temperature is set to a large value.
Outer loop:
• Temperature is reduced for each iteration
• Terminated if (temperature ≤ lower limit) or (number of iterations ≥ upper limit).
Inner loop: For each iteration:
• New configuration generated from current configuration
• Accepted if (new cost ≤ cost of current configuration)
• Accepted with temperature-dependent probability if (cost of new config. > cost of current configuration).
�- �
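Multiobjective Optimization
Maximize (y1, y2, …, yk) = f(x1, x2, …, xn)
[Figure: objective space (y1, y2); relative to a given point, other points are better, worse, or incomparable]
[Figure: objective space (y1, y2) with dominated and non-dominated points]
Pareto optimal = not dominated
Pareto set = set of all Pareto-optimal solutions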
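The dominance relation and the Pareto set can be written down in a few lines of C. This is a minimal sketch for maximization with two objectives; the names `dominates`, `pareto_set` and `NOBJ` are chosen here for illustration.

```c
#define NOBJ 2   /* number of objectives (y1, y2) in this sketch */

/* a dominates b (maximisation): a is at least as good in every objective
   and strictly better in at least one. */
int dominates(const double a[NOBJ], const double b[NOBJ])
{
    int strictly_better = 0;
    for (int k = 0; k < NOBJ; k++) {
        if (a[k] < b[k]) return 0;
        if (a[k] > b[k]) strictly_better = 1;
    }
    return strictly_better;
}

/* Sets pareto[i] = 1 for every point that no other point dominates.
   Returns the size of the Pareto set. */
int pareto_set(int n, double pts[][NOBJ], int pareto[])
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        pareto[i] = 1;
        for (int j = 0; j < n && pareto[i]; j++)
            if (j != i && dominates(pts[j], pts[i])) pareto[i] = 0;
        count += pareto[i];
    }
    return count;
}
```

Two points where neither dominates the other are incomparable; both stay in the Pareto set unless a third point dominates them.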
9 - 8
Summary
Single objective optimization methods
• decision is performed during optimization
• Examples: integer programming, simulated annealing
Multiple objective optimization methods
• decision is done after optimization
• Example: Evolutionary algorithms
• Refer to publications of Thiele or Schwefel et al. for more information
Concept of Pareto points
• eliminates large set of non-relevant design points
• allows separating optimization and decision
9 - 9
Improving predictability for caches
Loop caches
Mapping code to less used part(s) of the index space
Cache locking/freezing
Changing the memory allocation for code or data
Mapping pieces of software to specific ways
Methods:
- Generating appropriate way in software
- Allocation of certain parts of the address space to a specific way
- Including way-identifiers in virtual to real-address translation
"Caches behave almost like a scratch pad"
9 - ��
Summary
Allocation strategies for SPM
• Dynamic sets of processes
• Multiprocessors
• MMUs
• Sharing between SPMs in a multi-processor
Optimizations for Caches
• Code Layout transformations
• Way prediction
��- �
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
10. Code optimization
doc. dr. Gregor Papa
��- �
Task-level concurrency management
Granularity: size of tasks (e.g. in instructions)
Readable specifications and efficient implementations can possibly require different task structures.
Granularity changes
10 - 3
Merging of tasks
Reduced overhead of context switches,
More global optimization of machine code,
Reduced overhead for inter-process/task communication.
10 - 4
Splitting of tasks
No blocking of resources while waiting for input,
more flexibility for scheduling, possibly improved result.
10 - 5
Merging and splitting of tasks
The most appropriate task graph granularity depends upon the context; merging and splitting may be required.
Merging and splitting of tasks should be done automatically, depending upon the context.
��- �
system���am��e �
��- �
Attributes of a system that needs rewriting
Tasks blocking after they have already started running
10 - 8
Work by Cortadella et al.
1. Transform each of the tasks into a Petri net,
2. Generate one global Petri net from the nets of the tasks,
3. Partition global net into "sequences of transition",
4. Generate one task from each such sequence
Mature, commercial approach not yet available
��- 9
�esu�t� as �u���s�e� �y �orta�e��aReads only at the beginning
�nitialization task
�lways true
�evertrue
��- ��
��t�m��e� �ers�o� o� ���
Tin () �RE�� (��, sample, 1)�sum �= sample� i�����T� = sample� d = ��T����: �� (i < �) retur����T� = sum��� d = ��T��d = d�c� � R�TE(��T,d,1)�sum = �� i = ��retur����lways true
j==i-1j i
�ever true
��- ��
Task-level concurrency management (2)
The dynamic behavior of applications is getting more attention.
Energy consumption reduction is the main target.
Some classes of applications (e.g. video processing) have a considerable variation in processing power requirements depending on input data.
Static design-time methods are becoming insufficient.
Runtime-only methods are not feasible for embedded systems.
How about mixed approaches?
��- ��
Example of a mixed TCM
������ �e���um� �tt���������me���e��
[Figure: schedules of Task1, Task2 and Task3 over time t relative to the deadline]
Static (compile-time) methods can ensure WCET feasible schedules, but waste energy in the average case.
Mixed methods use compile-time analysis to define a set of possible execution parameters for each task.
Runtime scheduler selects the most energy saving, deadline preserving combination.
…or they can define a probability for violating the deadline.
��- ��
Floating-point to fixed point conversion
Pros:
• Lower cost
• Faster
• Lower power consumption
• Sufficient SQNR, if properly scaled
• Suitable for portable applications
Cons:
• Decreased dynamic range
• Finite word-length effect, unless properly scaled
  - Overflow and excessive quantization noise
• Extra programming effort
© Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter For Fixed-point Digital Signal Processors, 2nd SUIF Workshop, 1996
��- ��
Fixed-Point Data Format
[Figure: word layouts with sign bit S; (a) Integer, (b) Fixed-Point with a hypothetical binary point whose position is given by the integer word length IWL]
© Ki-Il Kum, et al
Floating-Point vs. Fixed-Point, Integer vs. Fixed-Point
exponent, mantissa
Floating-Point
• automatic computation and update of each exponent at run-time
Fixed-Point
• implicit exponent
• determined off-line
��- ��
�ss���me�t a�� ����t�o��Su�tra�t�o�
�ssume y = x, with- x (�� �=2) and- y (�� �=�):
s
s
�
����
y
s
�et result = x � y:e�ualizing each �� �
sy
sresu�t
�
© Ki-Il Kum, et al
s
�
����
s
��- ��
�u�t�����at�o�
�ssume result = x � y, with
- x (�� �=2) and- y (�� �=�)- -� result (�� �=2��) s
�
� y
s
s
resu�t
© Ki-Il Kum, et al
s
s
��- ��
Development Procedure
[Figure: development flow with the elements Floating-Point C Program, Range Estimation C Program, Fixed-Point C Program, Manual specification and IWL information]
© Ki-Il Kum, et al
��- �8
Range Estimator
[Figure: the range estimator (C pre-processor, C front-end, ID assignment, Subroutine call insertion, SUIF-to-C converter) turns the Floating-Point C Program into the Range Estimation C Program, whose execution yields the IWL information]
Range Estimation C Program:
float iir1(float x)
{
    static float s = 0;
    float y;
    y = 0.9 * s + x;
    range(y, 0);
    s = y;
    range(s, 1);
    return y;
}
© Ki-Il Kum, et al
��- �9
��erat�o�s �� ���e� �o��t �ro�ram
�.� x 21�siwl=�.xxxxxxxxxxxx
�
�
xiwl=�.xxxxxxxxxxxx
���overflow if
result
��- ��
Floating-Point to Fixed-Point Program Converter
Fixed-Point C Program:
int iir1(int x)
{
    static int s = 0;
    int y;
    y = sll(mulh(29491, s) + (x >> 5), 1);
    s = y;
    return y;
}
mulh
• to access the upper half of the multiplied result
• target dependent implementation
sll
• to remove 2nd sign bit
• opt. overflow check
© Ki-Il Kum, et al
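The slide calls `mulh` and `sll` target dependent. One possible portable C sketch for a 16-bit word length is given below; the 16-bit width, the explicit state pointer in `iir1_fixed` and the exact definitions are assumptions for illustration, not the converter's actual output.

```c
#include <stdint.h>

/* Upper 16 bits of the 32-bit product of two 16-bit operands.
   Right-shifting a negative value is implementation-defined in C;
   an arithmetic shift is assumed here. */
int16_t mulh(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> 16);
}

/* Shift left by n bits; with n = 1 this removes the duplicated sign bit
   left by a fractional multiplication. No overflow check is done, and the
   narrowing conversion is assumed to wrap as on common compilers. */
int16_t sll(int16_t value, int n)
{
    return (int16_t)((uint16_t)value << n);
}

/* Fixed-point filter step as on the slide; the filter state is passed
   explicitly instead of being a static variable. 29491 / 32768 is
   approximately the coefficient 0.9. */
int16_t iir1_fixed(int16_t x, int16_t *s)
{
    int16_t y = sll((int16_t)(mulh(29491, *s) + (x >> 5)), 1);
    *s = y;
    return y;
}
```

On a DSP both helpers typically map to single instructions; the C versions above only make the arithmetic visible.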
��- ��
Performance Comparison - Machine Cycles -
Fourth Order IIR Filter
[Figure: bar chart of cycles for Fixed-Point (16b) and Floating-Point]
© Ki-Il Kum, et al
10 - 22
Performance Comparison - Machine Cycles -
ADPCM
[Figure: bar chart of cycles for Fixed-Point (16b), Fixed-Point (32b) and Floating-Point]
© Ki-Il Kum, et al
10 - 23
Performance Comparison - SNR -
ADPCM
[Figure: SNR (dB) for inputs A, B, C, D with Fixed-Point (16b), Fixed-Point (32b) and Floating-Point]
© Ki-Il Kum, et al
��- ��
Impact of memory allocation on efficiency
Array p[j][k]
Row major order (C)
Column major order (FORTRAN)
[Figure: memory layout of the array elements in row major and in column major order]
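The two layouts differ only in how the index pair (j, k) is mapped to a linear offset. A small sketch, with function names chosen here for illustration:

```c
#include <stddef.h>

/* Offset of element (j, k) when rows are stored one after another (C). */
size_t row_major_offset(size_t j, size_t k, size_t cols)
{
    return j * cols + k;
}

/* Offset of element (j, k) when columns are stored one after another (FORTRAN). */
size_t column_major_offset(size_t j, size_t k, size_t rows)
{
    return k * rows + j;
}
```

In row major order, increasing the rightmost index k moves to the neighbouring memory location; in column major order this holds for the leftmost index j.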
��- ��
Best performance if innermost loop corresponds to rightmost array index
Two loops, assuming row major order (C):
for (k=0; k<=m; k++)
  for (j=0; j<=n; j++)
    p[j][k] = ...
and
for (j=0; j<=n; j++)
  for (k=0; k<=m; k++)
    p[j][k] = ...
For row major order: poor cache behavior for the first version (j innermost), good cache behavior for the second version (k innermost)
Same behavior for homogenous memory access, but:
memory architecture dependent optimization
��- ��
Program transformation "Loop interchange"
(SUIF interchanges array indexes instead of loops)
Improved locality
Example:
…
#define iter 400000
int a[20][20][20];
void computeijk() {
  int i,j,k;
  for (i = 0; i < 20; i++) {
    for (j = 0; j < 20; j++) {
      for (k = 0; k < 20; k++) {
        a[i][j][k] += a[i][j][k];
      }}}}
void computeikj() {
  int i,j,k;
  for (i = 0; i < 20; i++) {
    for (j = 0; j < 20; j++) {
      for (k = 0; k < 20; k++) {
        a[i][k][j] += a[i][k][j];
      }}}}
…
start=time(&start);
for(z=0;z<iter;z++)
  computeijk();
end=time(&end);
printf("ijk=%16.9f\n",1.0*difftime(end,start));
��- ��
stro�� ����ue��e o� t�e memory ar���te�ture
�oop structure: i j k
��m
e �s
�
�Till �uchwald, �iploma thesis, �niv. �ortmund, �nformatik 12, 12�2����
�� ����� ���
��te� Pe�t�um��� �
Su� SP������
Pro�essorre�u�t�o� to ���
�ramatic impact of locality
�ot always the same impact ..
��- �8
Loop fusion (merging), loop fission
Separate loops:
for (j=0; j<=n; j++)
  p[j] = ... ;
for (j=0; j<=n; j++)
  p[j] = p[j] + ...
Merged loop:
for (j=0; j<=n; j++)
  { p[j] = ... ;
    p[j] = p[j] + ... }
Separate loops: loops small enough to allow zero overhead loops.
Merged loop: better locality for access to p; better chances for parallel execution.
Which of the two versions is best?
Architecture-aware compiler should select best version.
��- �9
Example: simple loops
#define size 30
#define iter 40000
int a[size][size];
float b[size][size];
void ss1() {
  int i,j;
  for (i=0;i<size;i++) {
    for (j=0;j<size;j++) {
      a[i][j] += 17; }}
  for (i=0;i<size;i++) {
    for (j=0;j<size;j++) {
      b[i][j] -= 13; }}}
void ms1() {
  int i,j;
  for (i=0;i<size;i++) {
    for (j=0;j<size;j++) {
      a[i][j] += 17; }
    for (j=0;j<size;j++) {
      b[i][j] -= 13; }}}
void mm1() {
  int i,j;
  for (i=0;i<size;i++) {
    for (j=0;j<size;j++) {
      a[i][j] += 17;
      b[i][j] -= 13; }}}
��- ��
Results: simple loops
[Figure: runtime of ss1, ms1 and mm1, relative to the maximum, on several platforms (x86 and Sparc with different gcc versions and optimization levels)]
Merged loops superior; except Sparc with -o3
��- ��
Loop unrolling
for (j=0; j<=n; j++)
  p[j] = ... ;
for (j=0; j<=n; j+=2)
  { p[j] = ... ; p[j+1] = ... }
factor = 2
Better locality for access to p.
Fewer branches per execution of the loop.
More opportunities for optimizations.
Tradeoff between code size and improvement.
Extreme case: completely unrolled loop (no branch)
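The unrolled loop on the slide silently assumes an even number of iterations. A small C sketch with factor 2 that also handles an odd element count is shown below; the summation body and the function names are chosen only for illustration.

```c
/* Reference loop: one loop branch per element. */
long sum_rolled(const int p[], int n)
{
    long s = 0;
    for (int j = 0; j < n; j++)
        s += p[j];
    return s;
}

/* Unrolled by factor 2: one loop branch per two elements, plus a
   clean-up step for the last element when n is odd. */
long sum_unrolled2(const int p[], int n)
{
    long s = 0;
    int j;
    for (j = 0; j + 1 < n; j += 2) {
        s += p[j];
        s += p[j + 1];
    }
    if (j < n)
        s += p[j];
    return s;
}
```

The clean-up code is part of the code-size cost mentioned in the tradeoff.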
��- ��
Example: matrixmult
#define s 30
#define iter 4000
int a[s][s],b[s][s],c[s][s];
void compute(){
  int i,j,k;
  for(i=0;i<s;i++){
    for(j=0;j<s;j++){
      for(k=0;k<s;k++){
        c[i][k] += a[i][j]*b[j][k];
      }}}}
extern void compute2()
{
  int i, j, k;
  for (i = 0; i < 30; i++) {
    for (j = 0; j < 30; j++) {
      for (k = 0; k <= 28; k += 2) {
        { int *suif_tmp;
          suif_tmp = &c[i][k];
          *suif_tmp = *suif_tmp + a[i][j]*b[j][k]; }
        { int *suif_tmp;
          suif_tmp = &c[i][k+1];
          *suif_tmp = *suif_tmp + a[i][j]*b[j][k+1]; }
      }}}
  return;
}
��- ��
�esu�ts�� ���� ��te� Pe�t�umSu� SP���Pro�essor
�enefits �uite small� penalties may be large
�Till �uchwald, �iploma thesis, �niv. �ortmund, �nformatik 12, 12�2����
�a�tor�a�tor
��- ��
�esu�ts� �e�e��ts �or �oo� �e�e��e��es
Small benefits�
�� ����Pro�essorre�u�t�o� to ���
#define s 50#define iter 150000int a[s][s], b[s][s];void compute() {int i,k;for (i = 0; i < s; i++) {for (k = 1; k < s; k++) {a[i][k] = b[i][k];b[i][k] = a[i][k-1];
}}}
�Till �uchwald, �iploma thesis, �niv. �ortmund, �nformatik 12, 12�2����
�a�tor
��- ��
Loop tiling/loop blocking: - Original version -
for (i=1; i<=N; i++)
  for (k=1; k<=N; k++) {
    r = X[i,k]; /* to be allocated to a register */
    for (j=1; j<=N; j++)
      Z[i,j] += r * Y[k,j]
  }
Never reusing information in the cache for Y and Z if N is large or cache is small (2 N³ references for Z).
[Figure: access patterns of the three arrays for the loop indices i, j, k]
��- ��
Loop tiling/loop blocking - tiled version -
for (kk=1; kk<=N; kk+=B)
  for (jj=1; jj<=N; jj+=B)
    for (i=1; i<=N; i++)
      for (k=kk; k<=min(kk+B-1,N); k++) {
        r = X[i][k]; /* to be allocated to a register */
        for (j=jj; j<=min(jj+B-1,N); j++)
          Z[i][j] += r * Y[k][j]
      }
Reuse factor of B for Z, N for Y
O(N³/B) accesses to main memory
[Figure: tiles of the arrays accessed for the loop indices i, j, k, kk, jj]
Same elements for next iteration of i
Compiler should select best option
Monica Lam: The Cache Performance and Optimization of Blocked Algorithms, ASPLOS, 1991
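Both versions can be written as compilable C with 0-based indices. The sketch below uses a small matrix dimension `N` and tile size `B`, both hypothetical values chosen so that `B` does not divide `N` and the `min` clamping is exercised.

```c
#define N 6   /* matrix dimension (hypothetical) */
#define B 4   /* tile size (hypothetical); need not divide N */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original loop nest: Z += X * Y. */
void matmul_original(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = X[i][k];              /* candidate for a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled loop nest: the same updates, visited tile by tile so that a
   B x B block of Y is reused for all values of i. */
void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N])
{
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min_int(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min_int(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions perform exactly the same additions, only in a different order, so for integer data the results are identical; the benefit of the tiled version depends on cache size relative to B.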
��- ��
��am��e
�� �ra�t��e� resu�ts �y �u���a�� are ��sa��o��t������e o� t�e �e� �ases ��ere a� �m�ro�eme�t�as a���e�e��Sour�e� s�m��ar to matr�� mu�t�
�Till �uchwald, �iploma thesis, �niv. �ortmund, �nformatik 12, 12�2����
��������a�tor
SP���
Pe�t�um
��- �8
Summary
Task concurrency management
• Re-partitioning of computations into tasks
• Dynamic exploitation of slack
Floating-point to fixed point conversion
• Range estimation
• Conversion
• Analysis of the results
High-level loop transformations
• Fusion
• Unrolling
• Tiling
��- �9
Transformation "Loop nest splitting"
Example: Separation of margin handling
many if-statements for margin-checking
no checking, efficient
only few margin elements to be processed
��- ��
if (x�=1���y�=1�)for (� y���� y��)for (k=�� k��� k��)
for (l=�� l���l�� )for (i=�� i��� i��)for (j=�� j���j��) �then�block�1� then�block�2�
else �y1=��y�for (k=�� k��� k��) �x2=x1�k-��for (l=�� l��� ) �y2=y1�l-��for (i=�� i��� i��) �x�=x1�i� x�=x2�i�for (j=�� j���j��) �y�=y1�j� y�=y2�j�if (� �� ���x� ��� �� ���y�)then-block-1� else else-block-1�if (x����� ���x���y��������y�)then�block�2� else else�block�2�
������
�oop nest from MPE�-� full search motion estimation
for (z=�� z�2�� z��)for (x=�� x���� x��) �x1=��x�for (y=�� y���� y��) �y1=��y�for (k=�� k��� k��) �x2=x1�k-��for (l=�� l��� ) �y2=y1�l-��for (i=�� i��� i��) �x�=x1�i� x�=x2�i�for (j=�� j���j��) �y�=y1�j� y�=y2�j�if (x��� �� ���x���y��������y�)then�block�1� else else�block�1�if (x����� ���x���y��������y�)then�block�2� else else�block�2�
������
for (z=�� z�2�� z��)for (x=�� x���� x��) �x1=��x�for (y=�� y���� y��)
analysis of polyhedral domains, selection with genetic algorithm
��. �alk et al., �nf 12, �ni�o, 2��2�
��- ��
Results for loop nest splitting - Execution times -
[Figure: bar chart of execution times per processor and on average for the benchmarks Cavity, Motion Estimation and QSDPCM]
[H. Falk et al., Inf 12, UniDo, 2002]
10 - 42
Results for loop nest splitting - Code sizes -
[Figure: bar chart of code sizes per processor and on average for the benchmarks Cavity, Motion Estimation and QSDPCM]
[Falk, 2002]
��- ��
Array folding
Initial arrays
10 - 44
Array folding
Unfolded arrays
10 - 45
Inter-array folding
Intra-array folding
10 - 46
Application
Array folding is implemented in the DTSE optimization proposed by IMEC.
Array folding adds div and mod ops.
Optimizations required to remove these costly operations.
At IMEC, ADOPT address optimizations perform this task.
For example, modulo operations are replaced by pointers (indexes) which are incremented and reset.
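Replacing a modulo operation by an index that is incremented and reset can be illustrated with a folded (circular) array. This is a generic sketch, not the ADOPT tool's output; `BUF` and the function names are hypothetical.

```c
#define BUF 5   /* size of the folded (circular) array; hypothetical */

/* Straightforward version: one modulo operation per access. */
void fill_with_mod(int buf[BUF], int n)
{
    for (int i = 0; i < n; i++)
        buf[i % BUF] = i;
}

/* Same accesses, but the index is incremented and reset instead. */
void fill_with_reset(int buf[BUF], int n)
{
    int idx = 0;
    for (int i = 0; i < n; i++) {
        buf[idx] = i;
        if (++idx == BUF)
            idx = 0;
    }
}
```

The second version trades the division hardware or library call for a compare and a conditional reset, which is usually much cheaper on embedded processors.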
��- ��
Results (Mcycles for cavity benchmark)
[Figure: bar chart for several processors comparing the versions initial, initial + DTSE, initial + ADOPT, and initial + DTSE + ADOPT]
ADOPT&DTSE required to achieve real benefit
[C.Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]
10 - 48
Code adaptation
porting a description from ANSI-C to Handel-C
VHDL requires substantially more changes
the description of the algorithm in C code must be suitably adapted before the hardware implementation
SystemC and Handel-C contain only a subset of the statements of ordinary C
floating-point arithmetic, which hardware implementations in principle do not support, has to be realized differently
• it takes up too many of the available resources
• it lowers the operating frequency
insertion of statements for parallel execution of parts of the code
adjustment of the sizes of all variables
10 - 4�
Adaptation of the program code (1/3)
replacement for floating-point arithmetic
• use of fixed point
• use of integer values (smaller unit of measure)
fixed-point values are multiplied and represented as integer values
si� � �62�� ��si� � �.62�
the integer part and the fractional part are represented as the upper and the lower part of an integer variable
signed int � var�, var2�signed int �6 si��
si� � 0x0�a0� ��si� � �.62�var� � si�[��:�]� �� var� � 0x0� � �var2 � si�[�:0]� �� var2 � 0xa0 � �60
10 - �0
Adaptation of the program code (2/3)
statements for parallel execution of parts of the code
the par statement instead of seq
• where possible, depending on the contents of the loop
for �i � 0� i �� 3� i����
a[i] � b[2�i]��
se��a[i] � b[2�i]�a[i] � a[i] � c[i]�b[2�i] � a[i]�
�
par �i � 0� i �� 3� i����
a[i] � b[2�i]��
se��
par�
a[i] � b[2�i]�a[i] � a[i] � c[i]�
�b[2�i] � a[i]�
�
10 - �1
Adaptation of the program code (3/3)
adjustment of the sizes of all variables
all sizes must be defined in advance
• for lower resource usage they should be minimized
signed/unsigned must be specified in advance
when computing with variables of different sizes
• use of the concatenation operator: the missing bits are added to the smaller variable
• use of the lower bits of the larger variable
[signed | unsigned] int n // n-bit
unsigned int �6 var�, var3�unsigned int � var2, var��
var3 � var� � ��������� var2�var� � var�������var2�
11 - 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
11. Compilation
doc. dr. Gregor Papa
11 - �
Compilers for embedded systems: Why are compilers an issue?
Many reports about low efficiency of standard compilers
- Special features of embedded processors have to be exploited.
- High levels of optimization more important than compilation speed.
- Compilers can help to reduce the energy consumption.
- Compilers could help to meet real-time constraints.
Fewer legacy problems than for PCs.
- There is a large variety of instruction sets.
- Design space exploration for optimized processors makes sense
11 - �
3 key problems for future memory systems
[Figure: energy and access times of memories]
1. (Average) Speed
2. Energy/Power
3. Predictability/WCET
11 - 4
Same as optimization for high performance?
int a[�000]�c � a�for �i � �� i � �00� i��� � b �� �c� b �� ��c���� c �� ���
int a[�000]�c � a�for �i � �� i � �00� i��� � b �� �c� b �� ��c���� c �� ���
LD� r3, [r2, �0]ADD r3,r0,r3M�V r0,�2�LD� r0, [r2, r0]ADD r0,r3,r0ADD r2,r2,��ADD r�,r�,��CMP r�,��00�LT LL3
ADD r3,r0,r2M�V r0,�2�M�V r2,r�2M�V r�2,r��M�V r��,rr�0M�V r0,r�M�V r�,r�M�V r�,r�LD� r�, [r�, r0]ADD r0,r3,r�ADD r�,r�,��ADD r�,r�,��CMP r�,��00�LT LL3
���� ���le������ ��
���� ���le������ ��
No!
• High-performance if available memory bandwidth fully used; low-energy consumption if memories are at stand-by mode
• Reduced energy if more values are kept in registers
11 - �
Compiler optimizations for improving energy efficiency
Energy-aware scheduling
Energy-aware instruction selection
Operator strength reduction: e.g. replace * by + and <<
Minimize the bitwidth of loads and stores
Standard compiler optimizations with energy as a cost function
E.g.: Register pipelining:
for i:= 1 to 10 do
  C:= 2 * a[i] + a[i-1];
becomes
R2:= a[0];
for i:= 1 to 10 do
begin
  R1:= a[i];
  C:= 2 * R1 + R2;
  R2:= R1;
end;
Exploitation of the memory hierarchy
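Register pipelining can be sketched in C as follows. Unlike the slide, which assigns to a scalar C, this sketch stores each result in an output array `c` so that the two versions can be compared; the array length `LEN` and the function names are assumptions for illustration.

```c
#define LEN 11   /* elements a[0..10], matching loop bounds 1 to 10 */

/* Direct version: a[i] and a[i-1] are both loaded in every iteration. */
void direct(const int a[LEN], int c[LEN])
{
    for (int i = 1; i < LEN; i++)
        c[i] = 2 * a[i] + a[i - 1];
}

/* Register pipelining: the value loaded in iteration i is kept in r2 and
   reused as a[i-1] in iteration i+1, so each element is loaded only once. */
void pipelined(const int a[LEN], int c[LEN])
{
    int r2 = a[0];
    for (int i = 1; i < LEN; i++) {
        int r1 = a[i];
        c[i] = 2 * r1 + r2;
        r2 = r1;
    }
}
```

Halving the number of memory loads is what saves energy here, since a register access is much cheaper than a memory access.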
11 - �
Using Scratch Pad Memories (SPM)
[Figure, hierarchy: processor, SPM and main memory]
[Figure, example: address space starting at 0 that contains the scratch pad memory; no tag memory]
ARM7TDMI cores, well-known for low power consumption
11 - �
Very limited support in ARMcc-based tool flows
1. Use pragma in C-source to allocate to specific section:
For example:
#pragma arm section rwdata = "foo", rodata = "bar"
int x2 = 5; // in foo (data part of region)
int const z2[3] = {1,2,3}; // in bar
2. Input scatter loading file to linker for allocating section to specific address range
http://www.arm.com/documentation/Software_Development_Tools/index.html
11 - 8
Global optimization model (TU Dortmund)
Which memory object (array, loop, etc.) to be stored in SPM?
Non-overlaying ("static") allocation:
Gain $g_k$ and size $s_k$ for each segment k. Maximise gain $G = \sum g_k$, respecting size of SPM: $SSP \geq \sum s_k$.
Solution: knapsack algorithm.
Overlaying ("dynamic") allocation:
Moving objects back and forth
[Figure: processor with scratch pad memory (capacity SSP) and main memory]
Example: [Figure: candidate memory objects of a program: for loops, while and repeat loops, calls, arrays and int variables]
11 - �
IP representation - migrating functions and variables -
Symbols:
S(var_k) = size of variable k
n(var_k) = number of accesses to variable k
e(var_k) = energy saved per variable access, if var_k is migrated
E(var_k) = energy saved if variable var_k is migrated (= e(var_k) n(var_k))
x(var_k) = decision variable, =1 if variable k is migrated to SPM, =0 otherwise
K = set of variables
Similar for functions (I)
Integer programming formulation:
Maximize $\sum_{k \in K} x(var_k) E(var_k) + \sum_{i \in I} x(F_i) E(F_i)$
Subject to the constraint
$\sum_{k \in K} S(var_k) x(var_k) + \sum_{i \in I} S(F_i) x(F_i) \leq SSP$
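With a single capacity constraint, this formulation is a 0/1 knapsack problem, which can be solved by dynamic programming when sizes are integers. The sketch below treats variables and functions uniformly as "memory objects"; the names `allocate_spm`, `MAX_OBJ` and `MAX_CAP` and the integer units are assumptions for illustration.

```c
#define MAX_OBJ 64     /* hypothetical limit on the number of memory objects */
#define MAX_CAP 1024   /* hypothetical limit on the SPM capacity */

/* 0/1 knapsack by dynamic programming.
   size[k]   : size of memory object k (non-negative, same unit as capacity)
   saving[k] : energy saved if object k is migrated to the SPM
   chosen[k] : set to 1 if object k is placed in the SPM, else 0
   Returns the maximal total saving, or -1 for unsupported arguments. */
long allocate_spm(int n, const int size[], const long saving[],
                  int capacity, int chosen[])
{
    static long best[MAX_OBJ + 1][MAX_CAP + 1];
    if (n < 0 || n > MAX_OBJ || capacity < 0 || capacity > MAX_CAP) return -1;
    for (int c = 0; c <= capacity; c++) best[0][c] = 0;
    for (int k = 1; k <= n; k++)
        for (int c = 0; c <= capacity; c++) {
            best[k][c] = best[k - 1][c];                 /* object k stays in main memory */
            if (size[k - 1] <= c) {
                long with = best[k - 1][c - size[k - 1]] + saving[k - 1];
                if (with > best[k][c]) best[k][c] = with; /* object k goes to the SPM */
            }
        }
    /* Trace back which objects were selected. */
    for (int k = n, c = capacity; k >= 1; k--) {
        chosen[k - 1] = (best[k][c] != best[k - 1][c]);
        if (chosen[k - 1]) c -= size[k - 1];
    }
    return best[n][capacity];
}
```

The table has (n+1) x (capacity+1) entries, so the approach is practical for scratch pads of a few kilobytes when sizes are counted in words or bytes.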
11 - 10
Reduction in energy and average run-time
Multi_sort (mix of sort algorithms)
[Figure: cycles [x100] and energy [µJ]]
Feasible with standard compiler & postpass optimization
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
11 - 11
Allocation of basic blocks
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
[Figure: basic blocks BB1 and BB2 with additional jumps (Jump1 to Jump4) between main memory and the scratch pad]
For consecutive basic blocks
Statically 2 jumps, but only one is taken
11 - 1�
Allocation of basic blocks, sets of adjacent basic blocks and the stack
Requires generation of additional jumps (special compiler)
[Figure: cycles [x100] and energy [µJ]]
Savings for memory system energy alone
Combined model for memories
11 - 14
Timing predictability
aiT: WCET analysis tool
• support for scratchpad memories by specifying different memory access times
• also features experimental cache analysis for ARM7
11 - 1�
�r��ite�t�re� �o��ideredA�M�TDMI with 3 different memory architectures:
�� �ai� � e� or�LD�-cycles: �CP�,I�,D����3,2,2�ST�-cycles: �2,2,2�� � ��,2,0�
�� �ai� �e�or� � ��i�ied �a��eLD�-cycles: �CP�,I�,D����3,�2,6�ST�-cycles: �2,�2,3�� � ��,�2,0�
�� �ai� �e�or� � ��rat�� �adLD�-cycles: �CP�,I�,D����3,0,2�ST�-cycles: �2,0,0�� � ��,0,0�
11 - 1�
�e��lt� �or �����
�eferences:• Wehmeyer, Marwedel: Influence of �nchip Scratchpad Memories on
WCET: �th Intl Workshop on worst-case execution time �WCET�analysis, Catania, Sicily, Italy, �une 2�, 200�
• Second paper on SP�Cache and WCET at DATE, March 200�
�sing Scratchpad: �sing �nified Cache:
11 - 1�
��lti�le ��rat�� �ad�
11 - 18
Optimization for multiple scratch pads
Minimize $C = \sum_j e_j \cdot \sum_i x_{j,i} \cdot n_i$
With $e_j$: energy per access to memory j,
and $x_{j,i} = 1$ if object i is mapped to memory j, $= 0$ otherwise,
and $n_i$: number of accesses to memory object i,
subject to the constraints:
$\forall j: \sum_i x_{j,i} \cdot S_i \leq SSP_j$
$\forall i: \sum_j x_{j,i} = 1$
With $S_i$: size of memory object i,
$SSP_j$: size of memory j.
11 - 1�
Considered partitions
11 - 20
Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
("Working set")
11 - �1
Dynamic replacement within scratch pad
[Figure: CPU, memory and SPM]
Effectively results in a kind of compiler-controlled segmentation/paging for SPM
Address assignment within SPM required (paging or segmentation-like)
Reference: Verma, Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS 2004
11 - ��
��rat�� �ad� �a�ed o� live�e��a�al��i�
M� � �A, T�, T2, T3, T��SP Size � �A� � �T�� � � �T��
Solution:A SP & T3 SP
Solution:A SP & T3 SP
�P��������������P�������������
�P��������������P�������������
�P������������
�P������������
��
��� �
��� �
��� �
��� � ��� ��
��� ��
��
��
��
��
��
��
��
��
��
���
11 - ��
Summary
High-level transformations
• Loop nest splitting
• Array folding
Impact of memory architecture on execution times & energy.
The SPM provides
• Runtime efficiency
• Energy efficiency
• Timing predictability
Achieved savings are sometimes dramatic, for example:savings of � ��� of the memory system energy
1�- 1
Hardware/Software Codesign
Jožef Stefan International Postgraduate School
12. Performance Estimation
doc. dr. Gregor Papa
1�- �
System Design
[Figure: design flow from Specification through System Synthesis to SW-Compilation (Instruction Set, Intellectual Prop. Code; result: Machine Code) and HW-Synthesis (Intellectual Prop. Block; result: Net lists), accompanied by Estimation]
1�- �
Motivation
The values of the objective functions that should guide the design space exploration are obtained through performance estimation.
Design space exploration intends to change
• mapping (binding and resource sharing)
• architecture (hardware platform)
• application (choice between different algorithms and/or partitioning into concurrent components)
[Figure: Application and Architecture are combined by Mapping, followed by Estimation]
1�- 4
Outline
Overview (current topic)
Performance Metrics
Subsystems
Abstraction Levels
Performance Estimation Methods
1�- �
Performance Estimation - Global Picture
[Figure: performance estimation classified along four axes: subsystem to analyze (CPU subsystem, SW subsystem, HW subsystem, interconnect subsystem, MPSoC), abstraction level (high-level, e.g. functional, HLL; intermediary level, e.g. TLM, OS; low-level, e.g. RTL, ISA), performance estimation method (statistic, simulation, analytic) and metric (time, cost, area, power, other: quality, SNR, …)]
Note: RTL – Register Transfer Level, ISA – Instruction Set Architecture, TLM – Transaction-Level Model, OS – Operating System, HLL – High-Level Language
1�- �
Position in the system design flow
High-level performance estimation
Advantages: short simulation time, no details of implementation necessary
Drawbacks: limited accuracy, e.g. no information about timing
Low-level performance estimation
Advantages: higher accuracy
Drawbacks: long simulation time, many implementation details need to be known
[Figure: design flow from the high-level (functional) specification through refinement (mapping and partitioning, communication, parallel specification with HW and SW modules) to the low-level specification (closer to the implementation) and the implementation, with estimation at each stage]
12 - 7
Uses of the Estimation
Prerequisite for design space exploration
• part of the feedback cycle (see global flow)
• functional and non-functional validation (e.g. power, energy, timing, memory consumption)
Validation
• show equivalence of specification and implementation
• functional and non-functional aspects
12 - 10
Performance Metrics
Performance metric = a function defined on relevant non-functional properties of a system that quantifies the performance of the system.
Time [second]: for example end-to-end delay, throughput, latency
Power, Energy, Temperature [mW, mJ, °C]: for example power consumed by the network, energy to execute a task, maximal temperature
Area [mm²]: for example area of an integrated circuit
Cost [$]: for example cost of parts, labor, development cost
Other metrics: SNR (signal-to-noise ratio), quality of the video image/sound, size of the hardware platform
Performance metrics usually conflict with each other.
12 - 11
Examples of Performance Trade-offs
Mapping domain
• change the mapping of the application to the architecture
• see example 1
Architecture domain
• change the hardware platform
• see example 2
Application domain
• change the application implementation (e.g. degree of parallelization, partitioning into concurrent processes, use of different algorithms with a similar functional behavior)
12 - 12
Ex. 1: Trade-offs in the Mapping Domain
MPEG mapping optimization (2D mapping optimization space)
• obj1: worst load of a computation node
• obj2: worst load of a communication node
[Figure: candidate mappings plotted by obj1 against obj2 (worst bus load).]
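With two conflicting objectives, the mappings worth keeping are the non-dominated (Pareto-optimal) ones. The following is a minimal Python sketch of that filtering step, assuming both objectives are minimized. The function names and the load figures are hypothetical and are not taken from the slides.

```python
from typing import List, Tuple

# Each candidate mapping is scored by two objectives, both to be minimized:
# (worst load of a computation node, worst load of a communication node).
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in both objectives and is not equal to b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


# Hypothetical load figures for five candidate mappings (illustration only).
candidates = [(0.9, 0.2), (0.6, 0.4), (0.7, 0.5), (0.4, 0.8), (0.5, 0.9)]
front = pareto_front(candidates)
print(front)
```

The quadratic scan is sufficient for a handful of candidates; a real exploration tool would typically use a dedicated multi-objective optimizer.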
12 - 13
Ex. 2: Trade-offs in the Hardware Platform
[Figure: implementation alternatives compared by timing performance, energy efficiency, and flexibility.]
Application-specific integrated circuits (ASICs)
Application-specific instruction set processors (ASIPs)
• Microcontroller
• DSPs (digital signal processors)
General-purpose processors
Programmable hardware
• FPGA (field-programmable gate arrays)
12 - 16
System Composition
[Figure: an architecture is composed from computation templates, communication templates, and scheduling and arbitration templates. Recoverable labels include DSP, SDRAM, and interface for the components, and proportional share, static, dynamic, fixed priority, EDF, and TDMA for the scheduling and arbitration policies.]
12 - 17
Why Is Estimation Difficult?
Computation and communication
• (Non-deterministic) computations in processing nodes
• (Non-deterministic) communication delays
• Complex resource interaction via scheduling and arbitration policies
Cyclic timing dependencies
• Internal data streams interact on computing and communication resources
• Interaction determines stream characteristics
Uncertain environment
• Different load scenarios
• Unknown (worst case) inputs
12 - 19
Illustration of Evaluation Difficulties
[Figure: an input stream of events (a b a c c b) enters a buffer and is processed by a task running on a processor.]
Complex input
• Timing (jitter, bursts, ...)
• Different event types
Task communication
Task scheduling
Variable resource availability
Variable execution demand
• Input (different event types)
• Internal state (program, cache, ...)
12 - 20
Requirements for Performance Estimation
Estimation should be composable in terms of
• subsystems and their interactions, i.e. HW, SW, interconnect
• computation, communication, and scheduling/arbitration
Estimation should cover different metrics, for example power, energy, delay, memory, throughput
The estimation method should represent a reasonable trade-off between (a) estimation effort in terms of computation/simulation time and set-up time and (b) accuracy
12 - 23
A Brief History in Abstraction
[Figure: timeline of rising abstraction levels, each step abstracting a cluster of the previous one.
Models: transistor model (t = RC), gate level model, register-transfer level model (critical path latency), then clusters of SW tasks, OS/drivers, CPU cores, HW adaptation, IPs, and an on-chip communication network.
Units of abstraction: technology (transistors, layouts), signal (gate, schematic, RTL), transaction (SW, HW systems), tokens (SW tasks, comm. backbones, IPs).
Evaluation: simulators (e.g. SPICE, SystemC/ISS, SoC HW/SW codesign and cosimulation tools) and formal methods.]
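12 - 26
System-Level Performance Estimation Methods
[Figure: on a performance axis (e.g. delay), the behavior of the real system lies between its best case and its worst case; the estimation methods below are positioned relative to this range.]
Measurement
Probabilistic Estimation
Worst-Case (Formal) Analysis
presented later
Simulation
presented in Lecture 6 (next lecture)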
12 - 27
Overview
System: how to evaluate?
Measurements: use an existing instance of the system to perform performance measurements.
Formal analysis: develop a mathematical abstraction of the system and derive formulas which describe the system performance.
Simulation: develop a program which implements a model of the system. Perform experiments by running the program.
Statistics: develop a statistical abstraction of the system and derive statistical performance via analysis or simulation.
12 - 28
Performance Estimation Methods
[Figure: an estimation tool (method) takes a system model and a model of the environment and produces estimation results. The system model combines a model of the application and a model of the architecture. The model of the environment is built from input traces or a specification of inputs. Sources of model data: data sheets, platform benchmarks, component simulation, designer's experience.]
12 - 29
Analytic Models
Static analytic (symbolic) models:
• Describe computing, communication, and memory resources by algebraic equations, e.g.
$\mathrm{delay} = \dfrac{\#\mathrm{words}}{\mathrm{burst\_size}} \cdot \mathrm{comm\_time}$
• Describe system properties by parameters, e.g. data rate
• Combine relations
Fast and simple estimation
Generally inaccurate modeling, e.g. resource sharing not modeled
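As a minimal sketch, the static delay relation above can be evaluated directly in Python. The function name and the numbers are illustrative. Rounding a partially filled burst up to a whole burst is an added assumption that the slide formula does not state.

```python
import math


def transfer_delay(words: int, burst_size: int, comm_time: float) -> float:
    """Static estimate: delay = (#words / burst_size) * comm_time.

    Assumption (not on the slide): a partially filled burst costs as much
    as a full one, so the number of bursts is rounded up.
    """
    if words < 0 or burst_size <= 0 or comm_time < 0:
        raise ValueError("words and comm_time must be >= 0, burst_size > 0")
    bursts = math.ceil(words / burst_size)
    return bursts * comm_time


# Hypothetical bus: bursts of 8 words, 0.5 time units per burst.
print(transfer_delay(64, 8, 0.5))
print(transfer_delay(65, 8, 0.5))
```

The model ignores contention: if another master shares the bus, the real delay can be larger, which is the inaccuracy noted on the slide.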
12 - 30
Dynamic Analytic Models
Combination between
• Static models, possibly extended by non-determinism in run-time and event processing
• Dynamic models for describing e.g. resource sharing mechanisms (scheduling and arbitration)
Existing approaches
• Classical real-time scheduling theory
• Stochastic queuing theory (statistical bounds)
• Non-deterministic queuing theory (worst case/best case behavior)
12 - 31
Example - Queuing Systems
Clients request some service from a server over a network.
• Performance of the server
• Performance of the network
12 - 32
Stochastic Models - Queuing Systems
A queuing system is described by
• Arrival rate
• Service mechanism
• Queuing discipline
Performance measures
• average delay in queue (customer point of view)
• time-average number of customers in queue (system point of view)
• proportion of time the server is busy
The classical M/M/1 queuing system (M = Markovian (exp.) distribution)
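For the M/M/1 system, the three performance measures listed above have well-known closed forms in the steady state. The sketch below evaluates them in Python. The names `mm1` and `MM1Result` and the example rates are illustrative and not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class MM1Result:
    utilization: float           # proportion of time the server is busy
    mean_delay_in_queue: float   # average waiting time before service starts
    mean_number_in_queue: float  # time-average number of waiting customers


def mm1(arrival_rate: float, service_rate: float) -> MM1Result:
    """Steady-state M/M/1 measures for arrival rate lambda and service rate mu."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate          # utilization
    wq = rho / (service_rate - arrival_rate)   # mean delay in queue
    lq = rho * rho / (1.0 - rho)               # mean number in queue
    return MM1Result(rho, wq, lq)


# Hypothetical server: 2 requests per second arrive, 4 per second can be served.
result = mm1(2.0, 4.0)
print(result)
```

The two queue measures are linked by Little's law: the mean number in queue equals the arrival rate times the mean delay in queue.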
12 - 33
Nondeterministic Models - Queuing Systems
A queuing system is described by
• Arrival function (bounds on arrival times)
• Service functions (bounds on server behavior)
• Resource interaction
Performance measures
• worst-case delay in queue
• worst-case number of customers in queue
• worst-case and best-case end-to-end delay in the system
[Figure: event streams passing through a chain of shared resources; labels not recoverable.]
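To illustrate how bounds on arrivals and service yield worst-case measures, the following Python sketch uses two curve shapes that are common in network calculus but are an assumption here, since the slide does not fix them: a token-bucket arrival bound alpha(t) = burst + rate * t and a rate-latency service bound beta(t) = service_rate * max(0, t - latency). All names and numbers are hypothetical.

```python
from typing import Tuple


def worst_case_bounds(burst: float, rate: float,
                      service_rate: float, latency: float) -> Tuple[float, float]:
    """Return (delay bound, backlog bound) for one server.

    Assumed curves (not from the slides):
      arrival bound  alpha(t) = burst + rate * t
      service bound  beta(t)  = service_rate * max(0, t - latency)
    The delay bound is the largest horizontal distance between the curves,
    the backlog bound the largest vertical distance.
    """
    if min(burst, rate, latency) < 0 or service_rate <= 0:
        raise ValueError("parameters must be non-negative, service_rate > 0")
    if rate > service_rate:
        raise ValueError("long-term arrival rate exceeds service rate: unbounded")
    delay_bound = latency + burst / service_rate
    backlog_bound = burst + rate * latency
    return delay_bound, backlog_bound


# Hypothetical stream: bursts of 4 events, 1 event per time unit on average,
# served at 2 events per time unit after at most 3 time units of latency.
print(worst_case_bounds(burst=4.0, rate=1.0, service_rate=2.0, latency=3.0))
```

Unlike the stochastic model, these values are hard bounds: they hold for every input that respects the arrival bound, not only on average.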
12 - 34
Simulation
Consider the underlying hardware platform and the mapping of the application onto that architecture
Combine functional simulation and performance data
Evaluate average-case behavior, for one simulation scenario
Complex set-up and extensive runtimes ...
But accurate results and good debugging possibilities
[Figure: an input trace drives a model of the application, hardware platform, and mapping, which produces an output trace.]
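The input-trace to output-trace idea can be sketched with a deliberately tiny model: a single processor that serves tasks in first-come-first-served order. This is an illustrative Python sketch under that assumption, not the simulator used in the course. The trace values are hypothetical.

```python
from typing import List, Tuple

# Input trace entry: (arrival time, execution time on the assumed processor).
# Output trace entry: (start time, finish time).
InEvent = Tuple[float, float]
OutEvent = Tuple[float, float]


def simulate_fifo(input_trace: List[InEvent]) -> List[OutEvent]:
    """Simulate one FCFS processor; input_trace must be sorted by arrival time."""
    output_trace: List[OutEvent] = []
    free_at = 0.0  # time at which the processor becomes idle again
    for arrival, exec_time in input_trace:
        start = max(arrival, free_at)  # wait if the processor is still busy
        finish = start + exec_time
        output_trace.append((start, finish))
        free_at = finish
    return output_trace


def average_delay(input_trace: List[InEvent], output_trace: List[OutEvent]) -> float:
    """Mean time from arrival to completion over the whole trace."""
    delays = [finish - arrival
              for (arrival, _), (_, finish) in zip(input_trace, output_trace)]
    return sum(delays) / len(delays)


# Hypothetical scenario with three task activations.
trace_in = [(0.0, 2.0), (1.0, 2.0), (5.0, 1.0)]
trace_out = simulate_fifo(trace_in)
print(trace_out, average_delay(trace_in, trace_out))
```

The result describes only this one scenario: a different input trace may produce a very different delay, which is why simulation gives average-case rather than worst-case evidence.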
12 - 35
Example: Trace-Based Simulation
Abstract simulation at system level without timing
• Faster than simulation, but still based on a single input trace
Abstraction
• Application: represented by abstract execution traces (graph of events: read, write, and execute)
• Architecture: represented by "virtual machines" and "virtual channels", including non-functional properties (timing, power, energy)
Steps
• Execution trace determined by functional application simulation
• Extension of the event graph by non-functional properties
• Simulation of the extended model
[Figure: the application functional model yields a complete trace; together with the architecture description it gives an abstract event graph, which is fed to trace simulation to obtain estimation results.]
e.g. [Lahiri et al.], [Pimentel et al.]
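The three steps can be sketched in Python as follows. For brevity the event graph is reduced to a linear sequence of events, and the architecture description to a per-unit latency table. Both simplifications, and all names and numbers, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Step 1 (assumed already done): functional simulation produced this trace.
# Each event is (kind, amount), with kind in {"read", "write", "execute"}.
Event = Tuple[str, int]
AnnotatedEvent = Tuple[str, int, float]


def annotate(trace: List[Event], cost_per_unit: Dict[str, float]) -> List[AnnotatedEvent]:
    """Step 2: extend each event by a non-functional property (here: latency)."""
    return [(kind, amount, amount * cost_per_unit[kind]) for kind, amount in trace]


def simulate(annotated: List[AnnotatedEvent]) -> float:
    """Step 3: replay the extended trace; events run strictly one after another."""
    return sum(latency for _, _, latency in annotated)


# Hypothetical architecture description: time units per word or per instruction.
architecture = {"read": 2.0, "write": 3.0, "execute": 1.0}
trace = [("read", 4), ("execute", 10), ("write", 2)]

annotated = annotate(trace, architecture)
print(simulate(annotated))
```

Swapping in a different cost table re-evaluates the same trace on another candidate architecture without repeating the functional simulation, which is where the speed advantage comes from.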