1
Structure of Computer Systems
(Advanced Computer Architectures)
Course: Gheorghe Sebestyen
Lab. works: Anca Hangan, Madalin Neagu, Ioana Dobos
2
Objectives and content
design of computer components and systems
study of methods used for increasing the speed and the efficiency of computer systems
study of advanced computer architectures
3
Bibliography
Baruch, Z. F., Structure of Computer Systems, U.T.PRES, Cluj-Napoca, 2002
Baruch, Z. F., Structure of Computer Systems with Applications, U.T.PRES, Cluj-Napoca, 2003
Gorgan, G. Sebestyen, Proiectarea calculatoarelor, Editura Albastra, 2005
Gorgan, G. Sebestyen, Structura calculatoarelor, Editura Albastra, 2000
J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 1st-5th editions
D. Patterson, J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 1st-3rd editions
any book about computer architecture, microprocessors, microcontrollers or digital signal processors
Search: Intel Academic Community, Intel technologies (http://www.intel.com/technology/product/demos/index.htm), etc.
my web page: http://users.utcluj.ro/~sebestyen
4
Course Content
Factors that influence the performance of computer systems, technological trends
Computer arithmetic – ALU design
CPU design strategies:
pipeline architectures, super-pipeline
parallel architectures (multi-core, multiprocessor systems)
RISC architectures
microprocessors
Interconnection systems
Memory design:
ROM, SRAM, DRAM, SDRAM, etc.
cache memory
virtual memory
Technological trends
5
Performance features
execution time
reaction time to external events
memory capacity and speed
input/output facilities (interfaces)
development facilities
dimension and shape
predictability, safety and fault tolerance
costs: absolute and relative
6
Performance features
Execution time
execution time of:
• operations – arithmetical operations
e.g. multiply is 30-40 times slower than addition
single or multiple clock periods
• instructions
simple and complex instructions have different execution times
average execution time = Σ t_instruction(i) * p_instruction(i)
• where p_instruction(i) – the probability of instruction "i"
dependable/predictable systems – with fixed execution times for instructions
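As a quick check of the weighted-sum formula above, a short Python sketch; the instruction mix and cycle counts are made-up illustration values, not measurements of any real CPU:

```python
# Average instruction execution time as a probability-weighted sum:
#   t_avg = sum_i t_instruction(i) * p_instruction(i)
# The mix below is hypothetical, chosen only to illustrate the formula.

# instruction: (execution time in clock cycles, probability of occurrence)
instruction_mix = {
    "add":      (1,  0.50),
    "load":     (2,  0.30),
    "branch":   (2,  0.15),
    "multiply": (35, 0.05),   # roughly 30-40 times slower than an add
}

# the probabilities of the mix must sum to 1
assert abs(sum(p for _, p in instruction_mix.values()) - 1.0) < 1e-9

t_avg = sum(t * p for t, p in instruction_mix.values())
print(f"average execution time: {t_avg:.2f} clock cycles")  # 3.15 cycles
```

Note how a single slow instruction (multiply) dominates the average even at a 5% frequency, which is why reducing the probability of long instructions pays off.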
7
Performance features
Execution time
execution time of:
• procedures, tasks
the time to solve a given function (e.g. sorting, printing, selection, I/O operations, context switch)
• transactions
execution of a sequence of operations to update a database
• applications
e.g. 3D rendering, simulation of fluid flow, computation of statistical data
8
Performance features
reaction time
response time to a given event
solutions:
• best effort – batch programming
• interactive systems – event-driven systems
• real-time systems – worst-case execution time (WCET) is guaranteed
scheduling strategies for single- or multi-processor systems
influences:
• execution time of interrupt routines or procedures
• context-switch time
• background execution of the operating system's threads
9
Performance features
memory capacity and speed:
cache memory: SRAM, very high speed (<1 ns), low capacity (1-8 MB)
internal memory: SRAM or DRAM, average speed (15-70 ns), medium capacity (1-8 GB)
external memory (storage): HD, DVD, CD, Flash (1-10 ms), very large capacity (0.5-12 TB)
input/output facilities (interfaces):
very diverse or dedicated to a purpose
input devices: keyboard, mouse, joystick, video camera, microphone, sensors/transducers
output devices: printer, video, sound, actuators
input/output: storage devices
development facilities:
OS services (e.g. display, communication, file system, etc.)
programming and debugging frameworks
development kits (minimal hardware and software for building dedicated systems)
10
Performance features
dimension and shape
supercomputers – minimal dimensional restrictions
personal computers – desktop, laptop, tablet PC – some limitations
mobile devices – "hand-held devices": phones, medical devices
dedicated systems – significant dimensional and shape-related restrictions
predictability, safety and fault tolerance
predictable execution time
controllable quality and safety
safety-critical systems, industrial computers, medical devices
costs
absolute or relative (cost/performance, cost/bit)
cost restrictions for dedicated or embedded systems
11
Physical performance parameters
Clock signal's frequency
a good measure of performance over a long period of time
depends on:
• the integration technology – the dimension of a transistor and the path lengths
• the supply voltage and the relative distance between the high and low states
clock period = the time delay of the longest signal path
             = no_of_gates * delay_of_a_gate
the clock period grows with the complexity of the CPU
• RISC computers increase the clock frequency by reducing the CPU complexity
12
Physical performance parameters
Clock signal's frequency
we can compare computers with the same internal architecture
for different architectures the clock frequency is less relevant
after 60 years of steady growth in frequency, the frequency is now saturated at 2-3 GHz because of power dissipation limitations:
dynamic_power = α * C * V² * f
• where: α – activation factor (0.1-1), C – capacitance, V – voltage, f – frequency
increasing the clock frequency:
• technological improvement – smaller transistors, through better lithographic methods
• architectural improvement – simpler CPU, shorter signal paths
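The dynamic power relation can be explored numerically; the α, C, V and f values below are arbitrary illustration numbers, not data for any real chip:

```python
# Dynamic (switching) power of a CMOS circuit: P = alpha * C * V^2 * f
# All parameter values below are arbitrary, for illustration only.

def dynamic_power(alpha: float, c: float, v: float, f: float) -> float:
    """Switching power in watts (alpha: activation factor, c: capacitance
    in farads, v: supply voltage in volts, f: clock frequency in Hz)."""
    return alpha * c * v ** 2 * f

p_base = dynamic_power(0.2, 1e-9, 1.2, 3e9)   # baseline
p_fast = dynamic_power(0.2, 1e-9, 1.2, 6e9)   # double the frequency
p_lowv = dynamic_power(0.2, 1e-9, 0.6, 3e9)   # halve the supply voltage

# power grows linearly with f but quadratically with V, which is why
# raw frequency scaling ran into the power-dissipation wall
assert abs(p_fast / p_base - 2.0) < 1e-9
assert abs(p_lowv / p_base - 0.25) < 1e-9
print(f"baseline dynamic power: {p_base:.3f} W")
```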
13
Physical performance parameters
Average instructions executed per second (IPS)
IPS = 1 / Σ (p_i * t_i)
where: p_i – probability of using instruction i
p_i = no_instr_i / total_no_instructions
t_i – execution time of instruction i
instruction types:
• short instructions (e.g. addition) – 1-5 clock cycles
• long instructions (e.g. multiply) – 100-120 clock cycles
• integer instructions
• floating-point instructions (slower)
measuring units: MIPS, MFlops, TFlops
can compare computers with the same or similar instruction sets
not good for CISC vs. RISC comparison

Type      Year   Freq.      MIPS
I4004     1971   0.74 MHz   0.09
I80286    1982   12 MHz     2.66
I80486    1992   66 MHz     52
P III     2000   600 MHz    2,054
Intel I7  2011   3.33 GHz   177,730
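The IPS formula combines the instruction mix with the clock frequency; a sketch with assumed (made-up) numbers for a hypothetical 600 MHz CPU:

```python
# IPS = 1 / sum(p_i * t_i), with t_i = cycles_i / f_clk.
# The mix and cycle counts are hypothetical illustration values.

f_clk = 600e6  # clock frequency in Hz (assumed)

# instruction type: (clock cycles, probability of occurrence)
mix = {
    "short (e.g. add)":     (2,   0.9),
    "long (e.g. multiply)": (110, 0.1),
}

t_avg = sum((cycles / f_clk) * p for cycles, p in mix.values())  # seconds
ips = 1.0 / t_avg
print(f"{ips / 1e6:.1f} MIPS")  # millions of instructions per second
```

Two CPUs with different instruction sets can report very different MIPS for the same amount of useful work, which is the reason MIPS is a poor metric for CISC vs. RISC comparisons.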
14
Physical performance parameters
Execution time of a program
more realistic
can compare computers with different architectures
influenced by the operating system, communication and storage systems
How to select a good program for comparison? (a good benchmark)
• real programs: compilers, coding/decoding, zip/unzip
• significant parts of a real program: OS kernel modules, mathematical libraries, graphical processing functions
• synthetic programs: a combination of instructions in percentages typical for a group of applications (with no real outcome):
Dhrystone – combination of integer instructions
Whetstone – contains floating-point instructions too
issues with benchmarks:
• processor architectures optimized for benchmarks
• compilation optimization techniques eliminate useless instructions
15
Physical performance parameters
Other metrics:
number of transactions per second
• in case of databases or server systems
• number of concurrent accesses to a database or warehouse
• operations: read-modify-write, communication, access to external memory
• describes the whole computer system, not only the CPU
communication bandwidth
• number of Mbytes transmitted per second
• total bandwidth or useful/usable bandwidth
context switch time
• for embedded and real-time systems
• example: EEMBC – EDN Embedded Microprocessor Benchmark Consortium
16
Principles for performance improvement
Moore's Law
Amdahl's Law
Locality: time and space
Parallel execution
17
Principles for performance improvement
Moore's Law (1965, Gordon Moore) – "the number of transistors on integrated circuits doubles approximately every two years"
18-months law (David House, Intel) – "the performance of a computer doubles every 18 months" (1.5 years), as a result of more transistors and faster ones
18
[Figure: Moore's law – transistor count growth across processor generations: 4004, 8080, 8086, '286, '386, '486, Pentium, Pentium 4]
19
Principles for performance improvement
Moore's law (cont.)
the growth will continue, but not for long! (2013-2018)
now the doubling period is 3 years
Intel predicts a limit at the 16-nanometer technology (read more on Wikipedia)
Other similar growth trends:
clock frequency – saturated 3-4 years ago
capacity of internal memories (DRAMs)
capacity of external memories (HD, DVD)
number of pixels for image and video devices
Semiconductor manufacturing processes (source: Wikipedia):
• 10 µm – 1971
• 3 µm – 1975
• 1.5 µm – 1982
• 1 µm – 1985
• 800 nm – 1989
• 600 nm – 1994
• 350 nm – 1995
• 250 nm – 1998
• 180 nm – 1999
• 130 nm – 2000
• 90 nm – 2002
• 65 nm – 2006
• 45 nm – 2008
• 32 nm – 2010
• 22 nm – 2012
• 14 nm – approx. 2014
• 10 nm – approx. 2016
• 7 nm – approx. 2018
• 5 nm – approx. 2020
20
Principles for performance improvement
Precursors:
• 90/10 principle: 90% of the time the processor executes 10% of the code
• principle: "make the common case fast"
• invest more in those parts that count more
Amdahl's law
How to measure the impact of a new technology?
speedup – η – how many times the execution is faster
t_exec_new = (1 - f) * t_exec_old + (f / η') * t_exec_old
η = t_exec_old / t_exec_new = 1 / [(1 - f) + f / η']
where: η' – the speedup of the improved component
f – the fraction of the program that benefits from the improvement
• Consequence: the overall speedup is limited by Amdahl's law
Numerical example:
f = 0.1; η' = 2 => η = 1.052 (5% gain)
f = 0.1; η' = ∞ => η = 1.111 (11% gain)
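Amdahl's law is easy to verify numerically; this sketch reproduces the two cases of the numerical example above:

```python
# Amdahl's law: eta = 1 / ((1 - f) + f / eta_prime)
# f         - fraction of the program that benefits from the improvement
# eta_prime - speedup of the improved part

def amdahl_speedup(f: float, eta_prime: float) -> float:
    return 1.0 / ((1.0 - f) + f / eta_prime)

# f = 0.1, eta' = 2: only about a 5% overall gain
print(f"{amdahl_speedup(0.1, 2):.3f}")            # 1.053

# even an infinitely fast improved part cannot beat 1 / (1 - f)
print(f"{amdahl_speedup(0.1, float('inf')):.3f}") # 1.111
```

The second case shows the consequence stated above: with f = 0.1, no improvement of that component, however large, can push the overall speedup past 1/(1 - f) ≈ 1.11.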
21
Principles for performance improvement
Locality principles
Time locality
• "if a memory location is accessed, then it has a high probability of being accessed again in the near future"
• explanations:
execution of instructions in a loop
a variable is used a number of times in a program sequence
• consequence:
good practice: bring the newly accessed memory location closer to the processor for a better access time on the next access => justification of cache memories
22
Principles for performance improvement
Locality principles
Space locality
• "if a memory location is accessed, then its neighboring locations have a high probability of being accessed in the near future"
• explanations:
execution of consecutive instructions in a sequence or a loop
consecutive accesses to the elements of a data structure (vector, matrix, record, list, etc.)
• consequence:
good practice:
• bring the location's neighbors closer to the processor for a better access time on the next access => justification of cache memories
• transfer blocks of data instead of single locations; block transfer on DRAMs is much faster
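The payoff of space locality can be illustrated with a toy direct-mapped cache model; the block size, line count and access patterns below are arbitrary illustration choices, not parameters of any real cache:

```python
# Toy direct-mapped cache: sequential (stride-1) accesses reuse each
# fetched block and hit often; accesses strided by a whole block size
# touch a new block every time and, with few cache lines, always miss.

BLOCK_SIZE = 16   # array elements per cache block (assumed)
NUM_LINES = 16    # number of cache lines (assumed)

def hit_rate(addresses):
    cache = {}    # line index -> block number currently stored there
    hits = 0
    for addr in addresses:
        block = addr // BLOCK_SIZE     # which memory block holds addr
        line = block % NUM_LINES       # direct-mapped placement
        if cache.get(line) == block:
            hits += 1
        else:
            cache[line] = block        # miss: the whole block is fetched
    return hits / len(addresses)

sequential = list(range(1024))                             # stride 1
strided = [(i * BLOCK_SIZE) % 1024 for i in range(1024)]   # stride 16

print(f"sequential: {hit_rate(sequential):.2%} hits")  # 93.75% hits
print(f"strided:    {hit_rate(strided):.2%} hits")     # 0.00% hits
```

The sequential walk misses once per block and then hits on the 15 neighbors brought in with it; the strided walk never touches a neighbor, so every access is a miss.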
23
Principles for performance improvement
Parallel execution principle
"when the technology limits the speed increase, a further improvement may be obtained through parallel execution"
parallel execution levels:
• data level – multiple ALUs
• instruction level – pipeline architectures, super-pipeline and superscalar, wide instruction set computers
• thread level – multi-core, multiprocessor systems
• application level – distributed systems, Grid and cloud systems
parallel execution is one of the explanations for the speedup of the latest processors (look at the table on slide 13)
24
Improving the CPU performance
Execution time – the measure of the CPU performance
t_exec = no_instr / IPS
t_exec = no_instr * CPI * T_clk = no_instr * CPI / f_clk
where: IPS – instructions per second
CPI – cycles per instruction
T_clk, f_clk – clock signal's period and frequency
Goal – reduce the execution time in order to have a better CPU performance
Solution – influence (reduce or increase) the parameters in the above formulas in order to reduce the execution time
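The execution-time formula can be applied directly; the instruction count, CPI and clock frequency below are hypothetical values chosen only for illustration:

```python
# t_exec = no_instr * CPI * T_clk = no_instr * CPI / f_clk
# Hypothetical program and CPU (illustration values only).

no_instr = 1_000_000_000   # instructions executed
cpi = 1.5                  # average cycles per instruction
f_clk = 2e9                # clock frequency: 2 GHz
t_clk = 1.0 / f_clk        # clock period

t_exec = no_instr * cpi / f_clk
assert abs(t_exec - no_instr * cpi * t_clk) < 1e-12  # both forms agree

# attacking any one parameter shortens t_exec: fewer instructions,
# a smaller CPI, or a higher clock frequency
print(f"execution time: {t_exec:.2f} s")  # 0.75 s
```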
25
Improving the CPU performance
Solutions: increase the number of instructions per second
IPS = 1 / Σ (p_i * t_i)   (external view)
IPS = 1 / (CPI * T_clk) = f_clk / CPI   (architectural view)
• How to do it?
reduce the duration of instructions
reduce the frequency (probability) of long and complex instructions (e.g. replace multiply operations)
reduce the clock period and increase the frequency
reduce the CPI
• external factors that may influence IPS:
the access time to instruction code and data may drastically influence the execution time of an instruction
example: for the same instruction type (e.g. addition):
• < 1 ns for instruction and data in the cache memory
• 15-70 ns for instruction and data in the main memory
• 1-10 ms for instruction and data in the virtual (HD) memory
26
Improving the CPU performance
Solutions: reduce the number of instructions
Instr_no – number of instructions executed by the CPU during an application's execution
• improve algorithms
• reduce the complexity of the algorithm
• more powerful instructions: multiple operations during a single instruction
parallel ALUs, SIMD architectures, string operations
Instr_no = op_no / op_per_instr
• op_no – number of elementary operations required to solve a given problem (application)
• op_per_instr – number of operations executed in a single instruction (average value)
• increasing op_per_instr may increase the CPI (the next parameter in the formula)
27
Improving the CPU performance
Solutions (cont.): reduce the CPI
CPI – cycles per instruction – the number of clock periods needed to execute an instruction
• instructions have variable CPIs; an average value is needed:
CPI_av = Σ (n_i * CPI_i) / Σ n_i = Σ (p_i * CPI_i)
where: n_i – number of instructions of type "i" in the analyzed program sequence
CPI_i – CPI for instructions of type "i"
• methods to reduce the CPI:
pipeline execution of instructions => CPI close to 1
superscalar, superpipeline => CPI ∈ (0.25 – 1)
simplify the CPU and the instructions – RISC architecture
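The averaging formula can be checked with an assumed instruction profile; the counts and per-type CPIs below are made up for illustration:

```python
# CPI_av = sum(n_i * CPI_i) / sum(n_i) = sum(p_i * CPI_i)
# Hypothetical profile of an analyzed program sequence.

# instruction type: (count n_i, cycles per instruction CPI_i)
profile = {
    "alu":        (600, 1),
    "load/store": (250, 3),
    "branch":     (150, 2),
}

total = sum(n for n, _ in profile.values())
cpi_av = sum(n * cpi for n, cpi in profile.values()) / total

# the equivalent probability-weighted form, with p_i = n_i / total
cpi_av_p = sum((n / total) * cpi for n, cpi in profile.values())
assert abs(cpi_av - cpi_av_p) < 1e-9

print(f"average CPI: {cpi_av:.2f}")  # 1.65
```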
28
Improving the CPU performance
Solutions (cont.): reduce the clock signal's period or increase the frequency
T_clk – the period of the clock signal, or f_clk – the frequency of the clock signal
Methods:
• reduce the dimension of a switching element and increase the integration ratio
• reduce the operating voltage
• reduce the length of the longest path – simplify the CPU architecture
[Figure: switching delay Δt as a function of the supply voltage Vcc]
29
Conclusions
ways of increasing the speed of processors:
fewer instructions
smaller CPI – simpler instructions
parallel execution at different levels
higher clock frequency