1 - Performance

CS152: Computer Architecture and Engineering

Computer Architecture II: Performance
Chapter 1, Hennessy & Patterson
Augusto Salazar
Departamento de Ingeniería de Sistemas
Universidad del [email protected]
Taken from Northwestern University

Performance Concepts

Speed is often an important design criterion. However, other applications have other criteria, e.g., power, reliability, and EMI.

Performance Perspectives

Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best performance / cost?

Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best performance / cost?

Both require
- a basis for comparison
- a metric for evaluation

Our goal: understand the cost and performance implications of architectural choices.

Two Notions of Performance

Plane        Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200

(pmph = passenger-miles per hour, i.e., passengers x speed)

Which has higher performance?
- Execution time (response time, latency, ...): time to do a task
- Throughput (bandwidth, ...): tasks per unit of time
- Response time and throughput are often in opposition

Definitions

- Performance is typically in units per second; bigger is better.
- If we are primarily concerned with response time:

    performance = 1 / execution_time

"X is n times faster than Y" means

    n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)

Example

Time of Concorde vs. Boeing 747?
- Concorde is 1350 mph / 610 mph = 2.2 times faster (equivalently, 6.5 hours / 3 hours)

Throughput of Concorde vs. Boeing 747?
- Concorde is 178,200 pmph / 286,700 pmph = 0.62 times as fast
- Boeing is 286,700 pmph / 178,200 pmph = 1.60 times faster

Boeing is 1.6 times (60%) faster in terms of throughput.
Concorde is 2.2 times (120%) faster in terms of flying time.
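The comparison above can be reproduced with a few lines of Python. This is only a sketch: the dictionary layout and the helper names (`throughput_pmph`, `times_faster`) are made up for illustration, and the figures are the ones quoted in the slides.

```python
# Figures quoted in the slides for the Boeing 747 vs. Concorde comparison.
planes = {
    "Boeing 747": {"speed_mph": 610, "trip_hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "trip_hours": 3.0, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["speed_mph"]


def times_faster(time_slow, time_fast):
    """'X is n times faster than Y': n = time(Y) / time(X)."""
    return time_slow / time_fast


boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Response time: the Concorde completes the trip sooner.
n_time = times_faster(boeing["trip_hours"], concorde["trip_hours"])

# Throughput: the Boeing moves more passenger-miles per hour.
n_throughput = throughput_pmph(boeing) / throughput_pmph(concorde)

print(f"Concorde is {n_time:.2f}x faster in flying time")
print(f"Boeing is {n_throughput:.2f}x faster in throughput")
```

The same two machines rank differently depending on whether response time or throughput is the metric, which is the point of the example.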

We will focus primarily on execution time for a single job.
A program contains many instructions, so instruction throughput is important!

Benchmarks

Evaluation Tools

- Benchmarks, traces and mixes
  - Macrobenchmarks and suites
  - Microbenchmarks
  - Traces
  - Workloads
- Simulation at many levels
  - ISA, microarchitecture, RTL, gate, circuit
  - Trade fidelity for simulation rate (levels of abstraction)
- Other metrics
  - Area, clock frequency, power, cost, ...
- Analysis
  - Queuing theory, back-of-the-envelope
  - Rules of thumb, basic laws and principles

Benchmarks

- Microbenchmarks
  - Measure one performance dimension: cache bandwidth, memory bandwidth, procedure call overhead, FP performance
  - Give insight into the underlying performance factors
  - Not a good predictor of application performance
- Macrobenchmarks
  - Application execution time
  - Measure overall performance, but on just one application
  - Need an application suite

Why Do Benchmarks?

- How we evaluate differences
  - Different systems
  - Changes to a single system
- Provide a target
  - Benchmarks should represent a large class of important programs
  - Improving benchmark performance should help many programs
- For better or worse, benchmarks shape a field
- Good ones accelerate progress
  - Good target for development
- Bad benchmarks hurt progress
  - Help real programs vs. sell machines/papers?
  - Inventions that help real programs don't help the benchmark

Popular Benchmark Suites

- Desktop
  - SPEC CPU2000: CPU-intensive, integer and floating-point applications
  - SPECviewperf, SPECapc: graphics benchmarks
  - SysMark, Winstone, Winbench
- Embedded
  - EEMBC: collection of kernels from 6 application areas
  - Dhrystone: old synthetic benchmark
- Servers
  - SPECweb, SPECfs
  - TPC-C: transaction processing system
  - TPC-H, TPC-R: decision support system
  - TPC-W: transactional web benchmark
- Parallel computers
  - SPLASH: scientific applications and kernels

Most markets have specific benchmarks for design and marketing.

[Figure: SPEC CINT2000]

[Figure: tpC]

Basis of Evaluation

Actual target workload
    Pros: representative
    Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
Full application benchmarks
    Pros: portable; widely used; improvements useful in reality
    Cons: less representative
Small "kernel" benchmarks
    Pros: easy to run, early in design cycle
    Cons: easy to "fool"
Microbenchmarks
    Pros: identify peak capability and potential bottlenecks
    Cons: "peak" may be a long way from application performance

Programs to Evaluate Processor Performance

- (Toy) benchmarks: 10-100 lines; e.g., sieve, puzzle, quicksort
- Synthetic benchmarks: attempt to match average frequencies of real workloads; e.g., Whetstone, Dhrystone
- Kernels: time-critical excerpts

Now it's your turn

Download at least two benchmark apps on your cellphone and run the tests. Compare your results with those obtained by the members of your group and try to make sense of them based on the HW and SW specifications of each phone.

Homework: write a report (in English, at least two pages) and deliver it next class.

Processor Design Metrics

Performance

In this exercise, you should evaluate the difference in performance between two CPU architectures: CISC (Complex Instruction Set Computing) and RISC (Reduced Instruction Set Computing). In general, CISC instructions are more complex than RISC instructions, so fewer instructions are required to perform the same task. However, a CISC instruction, since it is more complex, takes longer to complete than a RISC instruction. Assume that a certain task requires P CISC instructions or 2P RISC instructions, and that a CISC instruction takes 8T ns to complete, while a RISC instruction takes 2T ns. Under this assumption,

Which has better performance?

Performance

Sometimes software optimization can dramatically improve the performance of a computer system.

Assume that the CPU can execute a multiplication in 10 ns and a subtraction in 1 ns.

How long will it take the CPU to calculate the result of

d = a x b - a x c?

Could you rewrite the expression so that it takes less time?
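One way to check both exercises is to add up the operation latencies directly. The sketch below assumes that operations execute strictly one after another with no overlap, and it sets the symbolic P and T to 1 because only the ratio matters; the helper name `total_time` is hypothetical.

```python
MUL_NS = 10  # multiplication latency given in the exercise
SUB_NS = 1   # subtraction latency given in the exercise


def total_time(instruction_count, time_per_instruction):
    """Total time when instructions run one after another (assumption)."""
    return instruction_count * time_per_instruction


# CISC vs. RISC: P and T are symbolic in the exercise. Any positive values
# give the same ratio, so 1 is used for both here (an assumption).
P, T = 1, 1
cisc_time = total_time(P, 8 * T)      # P instructions at 8T ns each
risc_time = total_time(2 * P, 2 * T)  # 2P instructions at 2T ns each
risc_speedup = cisc_time / risc_time

# d = a*b - a*c needs two multiplications and one subtraction.
naive_ns = 2 * MUL_NS + 1 * SUB_NS
# d = a*(b - c) needs one subtraction and one multiplication.
factored_ns = 1 * SUB_NS + 1 * MUL_NS

print(f"RISC is {risc_speedup:.1f}x faster than CISC under these assumptions")
print(f"a*b - a*c: {naive_ns} ns, a*(b - c): {factored_ns} ns")
```

Under these assumptions the RISC version takes 4PT ns against 8PT ns for CISC, and factoring out `a` removes one of the two expensive multiplications.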

Performance measurement and reporting

What does "A is faster than B" mean?
- A desktop user would say that a program runs in less time.
- A server user would say that more tasks are completed per hour.

What is the user interested in reducing?
- The desktop user is interested in response time (run time).
- The data center user is interested in throughput, i.e., the number of tasks completed per unit of time.

Performance and run time

"X is faster than Y" means that the execution time (response time) is lower on X than on Y.

"X is n times faster than Y" means:

    n = ExTime(Y) / ExTime(X)

Since performance is the reciprocal of run time, the following relationship holds:

    n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)

CPU Execution Time

- A program consists of a set of instructions to be executed; the number of instructions executed is I.
- Each instruction takes, on average, a certain number of clock cycles to complete (CPI), measured in cycles / instruction.
- The CPU has a fixed clock cycle time C = 1 / clock rate, measured in seconds / cycle.

Formula for run time: run time is the product of these 3 parameters.

    T = I x CPI x C

where
    T   = execution time of the program, in seconds
    I   = number of instructions executed
    CPI = average CPI of the program
    C   = CPU clock cycle time

CPU Run Time: T = I x CPI x C

A program running on a computer has the following execution parameters:
- Number of executed instructions: 10,000,000
- Average CPI of the program: 2.5 cycles / instruction
- CPU clock rate: 200 MHz (clock cycle: 5 x 10^-9 s)

What is the run time of this program?

    CPU time = Instructions x CPI x Clock cycle
             = 10,000,000 x 2.5 x (1 / clock rate)
             = 10,000,000 x 2.5 x 5 x 10^-9
             = 0.125 seconds
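The formula T = I x CPI x C translates directly into a small function. This is a minimal sketch; the function name `cpu_time` and its parameter names are chosen here for illustration and take the clock rate in Hz rather than the cycle time.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Run time T = I x CPI x C, where C = 1 / clock rate (seconds per cycle)."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time


# Parameters from the example: 10,000,000 instructions, CPI 2.5, 200 MHz.
t = cpu_time(10_000_000, 2.5, 200e6)
print(f"CPU time: {t:.3f} s")
```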

CPU Run Time: T = I x CPI x C

    CPU time = Instructions x CPI x Clock cycle

Number of instructions (I) depends on:
- Program used
- Compiler
- ISA

Average CPI depends on:
- Program used
- Compiler
- ISA
- CPU organization

Clock cycle (C) depends on:
- CPU organization
- Technology (VLSI)

Factors that affect CPU performance

                                      Instructions (I)   CPI   Clock cycle (C)
Program                               X                  X
Compiler                              X                  X
Instruction Set Architecture (ISA)    X                  X
Organization (CPU design)                                X     X
Technology (VLSI)                                              X

Performance Example

Returning to the previous example, a program is run with the following parameters:
- Number of executed instructions: 10,000,000
- Average CPI of the program: 2.5 cycles / instruction
- CPU clock rate: 200 MHz

The same program is then run with these changes:
- A new compiler is used: number of executed instructions 9,500,000, average CPI 3.0
- A faster CPU is used: clock rate 300 MHz

What is the speedup?

    Speedup = old execution time / new execution time
            = (I_old x CPI_old x C_old) / (I_new x CPI_new x C_new)
            = (10,000,000 x 2.5 x 5 x 10^-9) / (9,500,000 x 3.0 x 3.33 x 10^-9)
            = 0.125 / 0.095
            = 1.32, i.e., 32% faster after the changes
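The speedup can be checked by evaluating the run-time formula for both configurations. The sketch below redefines a hypothetical `cpu_time` helper so that it is self-contained; the parameter values are the ones from the example.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """Run time T = I x CPI x C, where C = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Original configuration: old compiler, 200 MHz clock.
old_time = cpu_time(10_000_000, 2.5, 200e6)
# New configuration: fewer instructions, higher CPI, 300 MHz clock.
new_time = cpu_time(9_500_000, 3.0, 300e6)

speedup = old_time / new_time
print(f"old: {old_time:.3f} s, new: {new_time:.3f} s, speedup: {speedup:.2f}")
```

Note that the new compiler raises the CPI from 2.5 to 3.0, so the gain comes from the lower instruction count and the faster clock together, not from any one parameter alone.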

Types of instructions and CPI

Given a program with n types (classes) of instructions, executed on a CPU with the following characteristics:

    C_i   = number of instructions of type i executed
    CPI_i = cycles per instruction for type i

Then:

    CPI = CPU clock cycles / I

Where:

    CPU clock cycles = sum over i of (CPI_i x C_i)
    I                = sum over i of C_i
    i = 1, 2, ..., n

An instruction set has the following 3 instruction classes (for a given CPU design):

Class   CPI
A       1
B       2
C       3

Two code sequences have the following instruction counts:

Code sequence   Instructions per class
                A    B    C
1               2    1    2
2               4    1    1

CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / number of instructions = 10 / 5 = 2

CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
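The per-class calculation for the two code sequences can be sketched as follows. The names `CLASS_CPI`, `SEQUENCES`, and `cycles_and_cpi` are illustrative only; the numbers are those of the example.

```python
# CPI of each instruction class, as given in the example.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def cycles_and_cpi(counts, class_cpi):
    """Return (total clock cycles, average CPI) for one code sequence."""
    cycles = sum(class_cpi[cls] * n for cls, n in counts.items())
    instructions = sum(counts.values())
    return cycles, cycles / instructions


for seq_id, counts in SEQUENCES.items():
    cycles, cpi = cycles_and_cpi(counts, CLASS_CPI)
    print(f"sequence {seq_id}: {cycles} cycles, CPI = {cpi}")
```

Sequence 2 executes more instructions (6 against 5) yet needs fewer cycles, because it relies more on the cheap class A instructions.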

    CPI = CPU cycles / I

Instruction frequencies and CPI

Given a program with n types (classes) of instructions with the following characteristics:

    C_i   = number of instructions of type i executed
    CPI_i = average number of cycles per instruction of type i
    F_i   = frequency (fraction) of instructions of type i = C_i / I

Then:

    CPI = sum over i of (CPI_i x F_i),   i = 1, 2, ..., n

    Fraction of total execution time for instruction type i = (CPI_i x F_i) / CPI

Frequency: example with RISC

Base machine (Reg / Reg):

    CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2
        = .5 + 1 + .3 + .4
        = 2.2
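The frequency-weighted CPI and the share of execution time per instruction type can be computed as in this sketch. The `MIX` dictionary holds the frequencies and per-type CPIs of the RISC base machine example; the function names are hypothetical.

```python
# Instruction mix of the base machine: op -> (frequency F_i, CPI_i).
MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def weighted_cpi(mix):
    """Average CPI = sum of CPI_i x F_i over all instruction types."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_fractions(mix):
    """Fraction of execution time per type = (CPI_i x F_i) / CPI."""
    total = weighted_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}


print(f"CPI = {weighted_cpi(MIX):.1f}")
for op, share in time_fractions(MIX).items():
    print(f"{op}: {share:.0%} of execution time")
```

Loads are only 20% of the instructions but account for the largest share of the time, which is why frequency alone is a poor guide to where the cycles go.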

Op       Freq (F_i)   CPI_i   CPI_i x F_i   % Time
ALU      50%          1       0.5           23% = 0.5/2.2
Load     20%          5       1.0           45% = 1.0/2.2
Store    10%          3       0.3           14% = 0.3/2.2
Branch   20%          2       0.4           18% = 0.4/2.2
Sum                           2.2

Performance metrics

[Figure: levels of a computer system (application, programming language, compiler, ISA, datapath, control, function units, transistors, wires, pins), annotated with the metrics used at each level.]

Metrics:
- Execution time: workload, SPEC, etc.
- (Millions of) instructions per second: MIPS
- (Millions of) floating-point operations per second: MFLOPS
- Megabytes per second
- Cycles per second (clock rate)

Amdahl's Law: Make the Common Case Fast

Speedup due to enhancement E:

    Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

    ExTime(with E) = ((1 - F) + F/S) x ExTime(without E)

    Speedup(with E) = ExTime(without E) / (((1 - F) + F/S) x ExTime(without E))
                    = 1 / ((1 - F) + F/S)
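Amdahl's Law as stated above fits in a single function. The sketch below uses hypothetical values (F = 0.4 with several factors S) that do not come from the slides; they only illustrate how the speedup levels off as S grows.

```python
def amdahl_speedup(fraction, factor):
    """Speedup = 1 / ((1 - F) + F / S) for one enhancement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical values for illustration: 40% of the task is enhanced.
F = 0.4
for S in (2, 10, 100, 1_000_000):
    print(f"F = {F}, S = {S}: speedup = {amdahl_speedup(F, S):.4f}")

# Upper bound as S grows without limit: 1 / (1 - F).
print(f"limit: {1.0 / (1.0 - F):.4f}")
```

However large S becomes, the overall speedup cannot exceed 1 / (1 - F), because the unimproved part of the task still takes the same time.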

Performance improvement is limited by how much the improved feature is used: invest resources where time is spent.

Summary

- Time is the measure of computer performance!
- Good products are created when you have:
  - good benchmarks
  - good ways to summarize performance
- Without good benchmarks and summaries, the choice is between improving the product for real programs and improving it to get more sales; sales almost always wins.
- Remember Amdahl's Law: speedup is limited by the unimproved part of the program.

    CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Amdahl's Law with multiple improvements

The following improvements are proposed, each with the percentage of execution time it affects:

    Speedup 1 = S1 = 10    Percentage 1 = F1 = 20%
    Speedup 2 = S2 = 15    Percentage 2 = F2 = 15%
    Speedup 3 = S3 = 30    Percentage 3 = F3 = 10%

All the improvements are used in the new design, but each affects a different part of the code. What is the resulting speedup?

    Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30]
            = 1 / [.55 + .0333]
            = 1 / .5833
            = 1.71
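The multi-enhancement form generalizes to any number of improvements, provided each one affects a different part of the execution time. A minimal sketch, with the hypothetical function name `amdahl_multi` and the three (F, S) pairs from the example:

```python
def amdahl_multi(enhancements):
    """Speedup for several enhancements that affect disjoint parts of the code.

    enhancements: iterable of (fraction, speedup_factor) pairs.
    """
    enhancements = list(enhancements)
    unaffected = 1.0 - sum(f for f, _ in enhancements)
    if unaffected < 0:
        raise ValueError("fractions must not add up to more than 1")
    # New execution time, relative to an original execution time of 1.
    new_time = unaffected + sum(f / s for f, s in enhancements)
    return 1.0 / new_time


speedup = amdahl_multi([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"speedup = {speedup:.2f}")
```

The three enhanced parts shrink to about 0.033 of the original time, so the unaffected 0.55 dominates the result.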

A graphical view

Before: execution time without the improvements: 1
After: execution time with the improvements: .55 + .02 + .01 + .00333 = .5833

Speedup = 1 / .5833 = 1.71

[Figure: execution-time bars before and after. The unaffected fraction (.55) is unchanged; F1 = .2 is divided by S1 = 10, F2 = .15 by S2 = 15, and F3 = .1 by S3 = 30.]

Date post:	21-Nov-2015
Category:	Documents
Upload:	yesid-soto-cobos