
Performance Measurements

Date post: 30-Dec-2015
Upload: roary-brennan
Transcript
Page 1: Performance Measurements
Page 2: Performance Measurements

Performance

• The entire point of computer hardware is to “perform”

• Operate correctly

• Implement useful operations

• Do so as fast as possible

• What differences do we see in performance?

• Almost all computers operate correctly (within reason)

• Most computers implement useful operations

• This is a matter of taste...

• Computers all operate at different speeds

• Speed is the most important performance metric

2.1

Page 3: Performance Measurements

Measuring speed

• Which is faster?

• School Bus: 57 MPH, 40 people

• Ferrari: 170 MPH, 2 people

• Raw speed

• Ferrari wins

• Throughput

• Ferrari: 170 MPH * 2 people = 340 passenger-MPH

• School Bus: 57 MPH * 40 people = 2280 passenger-MPH

• Other issues...

• Range, reliability, cost

2.1

Page 4: Performance Measurements

Performance of computers

• How long does it take to run my favorite program?

2.2

• To compare two computers, we compare the execution time of the same program on the two computers

• Faster one wins

• Lower execution time is better

• Batch throughput

• CPU time

• Response time

Page 5: Performance Measurements

A little background...

• Computer programs are (usually) written in a high-level language (e.g. C)

• The compiler converts this code into machine-language instructions

• The CPU interprets machine-language instructions and executes them

• The performance of a program depends on:

• The number and types of instructions executed

• How fast the CPU can execute those instructions

2.2

Page 6: Performance Measurements

Tick-tock

• Almost all modern computers are based on a clock

• All events are controlled by and synchronized to a regular clock

• Clocks are just regular periodic waveforms

• Cycle time: time for the waveform to repeat itself

• Also known as the clock period

• Frequency: 1/Period

• Example:

• 10 ns clock cycle --> period = 10^-8 s

• Frequency = 1/10 ns = 1/(10^-8 s) = 10^8 cycles/sec

2.2
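The period/frequency relation can be checked with a few lines of Python. This is only an illustrative sketch; the function name `frequency_hz` is made up for this example and is not part of the slides.

```python
def frequency_hz(period_s: float) -> float:
    """Return the clock frequency (cycles per second) for a clock period in seconds."""
    if period_s <= 0:
        raise ValueError("clock period must be positive")
    return 1.0 / period_s


# The slide's example: a 10 ns clock cycle.
period_s = 10e-9
print(f"period = {period_s:.0e} s, frequency = {frequency_hz(period_s):.3e} cycles/sec")
```

A 10 ns period corresponds to 10^8 cycles per second, i.e. a 100 MHz clock.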

Page 7: Performance Measurements

Execution time

• Since the cycle time of a computer is constant, we can express time in terms of CPU cycles

• Time = cycles * cycle time

• Time = cycles / clock frequency

• Performance can be improved by:

• Decreasing the cycle time

• Hardware solution: Use faster technology

• Decreasing the number of cycles for the program

• Software: Write a better program

• Hardware: Re-design CPU

2.3

Page 8: Performance Measurements

Instruction execution time

• Every instruction takes time to execute

• Some instructions may take more or less time than others

• The time for an instruction is expressed in terms of clock cycles

Example:

Instruction   Cycles
ADD           1
MULT          4
CMP           1
SUB           2

• The time to run a program depends on:

• How many instructions

• What type of instructions

• 30 ADDs and 4 MULTs --> 30 * 1 + 4 * 4 = 46 cycles

2.3
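The cycle count for a program can be tallied directly from the per-instruction cycle table. The sketch below assumes the four-instruction table from the slide; the names `CYCLES` and `total_cycles` are hypothetical.

```python
# Cycles per instruction type, taken from the slide's example table.
CYCLES = {"ADD": 1, "MULT": 4, "CMP": 1, "SUB": 2}


def total_cycles(counts: dict) -> int:
    """Sum cycles over a program described as {instruction name: count}."""
    return sum(CYCLES[op] * n for op, n in counts.items())


# The slide's example: 30 ADDs and 4 MULTs.
print(total_cycles({"ADD": 30, "MULT": 4}))
```

For 30 ADDs and 4 MULTs this gives 30 * 1 + 4 * 4 = 46 cycles.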

Page 9: Performance Measurements

Average CPI

• The Cycles-Per-Instruction (CPI) varies depending on what instructions are used

• Take an Average CPI

• Cycles = Number of Instructions * Average CPI

2.3

• Average CPI should reflect the mix of instructions in the program

• A large proportion of 4-cycle MULTs should raise the CPI, a large proportion of 1-cycle ADDs should lower it

• The average should be the weighted average

Page 10: Performance Measurements

Weighing the average

Example mix of instructions

Instruction   Cycles   %
ADD           1        40
MULT          4        10
CMP           1        20
SUB           2        30

Average CPI = 1 * 40% + 4 * 10% + 1 * 20% + 2 * 30% = .4 + .4 + .2 + .6 = 1.6

Notice: The average CPI depends on the code we’re executing!

2.3
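The weighted average can be written as a short function. This is a sketch under the assumption that the mix is given as fractions summing to 1; `average_cpi` and the `mix` layout are illustrative names, not something defined in the slides.

```python
def average_cpi(mix: dict) -> float:
    """Weighted-average CPI for a mix of {name: (cycles, fraction of instructions)}."""
    total_fraction = sum(fraction for _, fraction in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(cycles * fraction for cycles, fraction in mix.values())


# The slide's example mix.
mix = {
    "ADD": (1, 0.40),
    "MULT": (4, 0.10),
    "CMP": (1, 0.20),
    "SUB": (2, 0.30),
}
print(round(average_cpi(mix), 3))
```

Changing the fractions (for example, more MULTs) changes the result, which is the slide's point that average CPI depends on the code being executed.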

Page 11: Performance Measurements

How long?

• Execution time = Cycles * Cycle Time

• Cycles = Average CPI * Instruction Count

• Execution time = Instruction Count * CPI * Cycle Time

• Remember, lower is better

• Reducing any one of the three components reduces execution time (as long as the other two do not increase)

• Cycle time - Reduced through technology change, change in CPU design

• CPI - Reduced through better code, better compiler, change in CPU design

• Instruction count - Reduced through better code, better compiler, change in CPU design

2.3
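As a rough sketch, the three-factor formula and its rearrangement for CPI look like this in Python. The function names are hypothetical; the sample numbers are the System C and System D figures from the Examples slide.

```python
def execution_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """Execution time = Instruction Count * CPI * Cycle Time."""
    return instruction_count * cpi * cycle_time_s


def cpi_from_time(time_s: float, instruction_count: float, cycle_time_s: float) -> float:
    """Rearranged form: CPI = Time / (Instruction Count * Cycle Time)."""
    return time_s / (instruction_count * cycle_time_s)


# System D: 400,000,000 instructions, CPI 1.10, 22 ns clock.
print(round(execution_time_s(4e8, 1.10, 22e-9), 2))
# System C: 10 s run time, 400,000,000 instructions, 20 ns clock.
print(round(cpi_from_time(10.0, 4e8, 20e-9), 2))
```

Halving any one argument of `execution_time_s` while holding the others fixed halves the result, which mirrors the "reduce any one component" bullet.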

Page 12: Performance Measurements

Examples

System A: 10s to run a program. Clock period is 20ns.

--> Time_A = CPI_A x Period_A x Instructions_A = 10s

System B: Change clock to 10ns, no other changes. How long does it take to run the same program on System B?

--> Time_B = CPI_A x Period_B x Instructions_A = ? (Period_B = Period_A * 0.5)

--> Time_B = CPI_A x (Period_A * 0.5) x Instructions_A = Time_A * 0.5 = 5s

System C: 10s to run a program, 20ns clock, 400,000,000 instr. What is the CPI?

--> CPI_C = Time_C / (Period_C x Instr_C) = 10s / (20 x 10^-9 s x 4 x 10^8) = 1.25

System D: 400,000,000 instr., 22ns clock and a CPI of 1.10. How long does it take to run the program on System D?

--> Time_D = CPI_D x Period_D x Instructions_D = 1.10 x 22ns x 4 x 10^8 = 9.68s

2.3

Page 13: Performance Measurements

Examples

Assume an add takes 1 cycle, a mult 4 cycles, and a sub 2 cycles

Two different compilers produce the following loops for the same code:

A (loop 1000000 times): mult, add, mult, sub

B (loop 1000000 times): add, add, mult, sub, add, add

What’s the CPI?

CPI_A = (4 + 1 + 4 + 2)/4 = 2.75

CPI_B = (1 + 1 + 4 + 2 + 1 + 1)/6 = 1.667

How long does it take to run each program on a 200MHz CPU?

Time_A = CPI_A x Period_A x Instructions_A = 2.75 x 5ns x 4000000 = .055s

Time_B = CPI_B x Period_B x Instructions_B = 1.667 x 5ns x 6000000 = .050s

B executes more instructions but finishes sooner, because its CPI is lower.

2.3
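The two-compiler comparison can be reproduced with a small script. This sketch takes loop A as mult, add, mult, sub and loop B as add, add, mult, sub, add, add, which is the assignment the CPI calculations on the slide imply; all identifiers are illustrative.

```python
# Cycle costs stated on the slide.
CYCLES = {"add": 1, "mult": 4, "sub": 2}

LOOP_A = ["mult", "add", "mult", "sub"]
LOOP_B = ["add", "add", "mult", "sub", "add", "add"]

ITERATIONS = 1_000_000
CLOCK_HZ = 200e6  # 200 MHz, i.e. a 5 ns period


def loop_stats(body):
    """Return (average CPI, run time in seconds) for a loop body run ITERATIONS times."""
    cycles_per_iteration = sum(CYCLES[op] for op in body)
    cpi = cycles_per_iteration / len(body)
    time_s = cycles_per_iteration * ITERATIONS / CLOCK_HZ
    return cpi, time_s


for name, body in (("A", LOOP_A), ("B", LOOP_B)):
    cpi, time_s = loop_stats(body)
    print(f"{name}: CPI = {cpi:.3f}, time = {time_s:.3f} s")
```

Loop A needs 11 cycles per iteration and loop B needs 10, so the run times come out to 0.055 s and 0.050 s at 200 MHz.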

Page 14: Performance Measurements
Page 15: Performance Measurements

Performance metrics

• I’m concerned with how long it takes to run my program

• Chances are, that number isn’t published with the specs for the computer

2.4

• Standardized metrics

• Benchmarks (SPEC, etc.)

• MIPS

• MFLOPS

Page 16: Performance Measurements

Benchmarks

• Run a suite of benchmark programs, average the performance

• Benchmarks - programs thought to be representative of commonly-used programs

• Advantages

• Actually corresponds to execution time!

• Represents a wider range of programs

• Disadvantages

• Are they running your program?

• Who picks the benchmarks? Be wary if the manufacturer does!

2.5

Page 17: Performance Measurements

SPEC Benchmarks

• SPEC (System Performance Evaluation Cooperative) maintains a set of benchmark suites

• SPEC Web Page (www.spec.org)

• New tests use SPEC CPU2000

• CINT2000 - Performance on integer programs

• CFP2000 - Performance on floating-point programs

• Larger numbers indicate better performance

• Tests prior to 2000 used CPU95

• CPU2000 has only a few years of data

2.6

Page 18: Performance Measurements

SPECint95 Results for Intel Processors

[Figure: SPECint95 score (y-axis, 0 to 40) versus Clock Speed in MHz (x-axis, 100 to 800) for Pentium, Pentium Pro, Pentium II, Celeron, and Pentium III. Annotation on the plot: Better cache design (On-chip vs Off-chip)]

Note: Results depend on Cache size, memory system, and motherboard

Page 19: Performance Measurements

SPECfp95 Results for Intel Processors

[Figure: SPECfp95 score (y-axis, 0 to 35) versus Clock Speed in MHz (x-axis, 100 to 800) for Pentium, Pentium Pro, Pentium II, Celeron, and Pentium III]

Note: Results depend on Cache size, memory system, and motherboard

Page 20: Performance Measurements

CINT2000 Results for Various Processors

[Figure: CINT2000 score (y-axis, 400 to 1800) versus Clock Speed in GHz for Pentium 3, Pentium 4, Athlon, P4 Extreme, Opteron, and Pmac G5. Athlon points are labeled with part numbers: 1800+, 1600+, 1500+, 2200+, 2400+, 2600+, 2700+, 3200+]

Note: Results depend on Cache size, memory system, and motherboard

Note: Athlon Part numbers are not the CPU MHz! Part numbers labeled on graph

Page 21: Performance Measurements

CFP2000 Results for Various Processors

[Figure: CFP2000 score (y-axis, 200 to 1600) versus Clock Speed in GHz for Pentium 3, Pentium 4, Athlon, Opteron, Pmac G5, and P4 Extreme. Athlon points are labeled with part numbers: 1800+, 1600+, 1500+, 2200+, 2400+, 2600+, 2700+, 3200+]

Note: Results depend on Cache size, memory system, and motherboard

Note: Athlon Part numbers are not the CPU MHz! Part numbers labeled on graph

Page 22: Performance Measurements

Limited benefits...

• Assume we’re running a program that spends 40% of its time accessing memory

• Now, we upgrade the processor from 200 MHz to 800 MHz

• How much faster does the program run?

• We’ve reduced the time for 60% of the program by a factor of 4

• But we haven’t touched the memory access time

• New total = Old * (40% + (60% / 4)) = Old * (40% + 15%) = Old * 55%

• Speedup = 1 / 0.55, or about 1.8. Not even twice as fast!

2.7

Page 23: Performance Measurements

Amdahl’s Law

• New Execution time = (Execution time affected by improvement / Amount of Improvement) + Unaffected Execution Time

• Example: 70% of my execution time is spent on integer ADDs, and 6% on floating-point ADDs. Total execution time is 100 seconds.

• What’s the effect of making integer ADDs twice as fast?

• New time = (100 * .70) / 2 + (100 * .30) = 35 + 30 = 65 seconds

• What’s the effect of making F.P. ADDs twice as fast?

• New time = (100 * .06) / 2 + (100 * .94) = 3 + 94 = 97 seconds

• Practical effect: “Make the common case fast”

• Corollary: “Forget about the rare case”

2.7
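Amdahl's Law is easy to experiment with in code. The following is a minimal sketch; `amdahl_new_time` is a hypothetical helper, and the inputs are the slide's integer-ADD and floating-point-ADD examples plus the 200 MHz to 800 MHz memory example.

```python
def amdahl_new_time(total_time: float, affected_fraction: float, improvement: float) -> float:
    """New time = affected time / improvement factor + unaffected time."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if improvement <= 0:
        raise ValueError("improvement factor must be positive")
    affected = total_time * affected_fraction
    unaffected = total_time - affected
    return affected / improvement + unaffected


# Integer ADDs: 70% of a 100-second run, made twice as fast.
print(round(amdahl_new_time(100.0, 0.70, 2.0), 2))
# Floating-point ADDs: 6% of the run, made twice as fast.
print(round(amdahl_new_time(100.0, 0.06, 2.0), 2))
# 4x faster processor when 40% of the time is memory access (60% affected).
print(round(amdahl_new_time(1.0, 0.60, 4.0), 2))
```

Speeding up the 70% case saves 35 seconds, while the same speedup on the 6% case saves only 3, which is the "make the common case fast" rule in numbers.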

Page 24: Performance Measurements

(Native) MIPS

• Million Instructions Per Second

• MIPS = (Instructions / second) * 10^-6 = ((cycles / second) / CPI) * 10^-6 = (clock rate / CPI) * 10^-6

• MIPS does not take into account how many instructions must be executed in a program

• Example: Same program, written two ways

1. 1,000 instructions, CPI 1.2, 1.0 MHz clock

• Execution time = 1.2 ms, MIPS = 1/1.2 = .833

2. 500 instructions, CPI 2.0, 1.0 MHz clock

• Execution time = 1.0 ms, MIPS = 1/2.0 = .500

• Version 2 has the lower MIPS rating but the shorter execution time

2.4
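The two versions in the example can be compared side by side with a short script. This is an illustrative sketch; the function names are invented for the example.

```python
def execution_time_s(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time = instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz


def native_mips(cpi: float, clock_hz: float) -> float:
    """MIPS = clock rate / (CPI * 1,000,000); note the instruction count does not appear."""
    return clock_hz / (cpi * 1e6)


CLOCK_HZ = 1.0e6  # 1.0 MHz clock, as in the slide

# (instruction count, CPI) for the two ways of writing the same program.
versions = {"version 1": (1000, 1.2), "version 2": (500, 2.0)}

for name, (count, cpi) in versions.items():
    time_ms = execution_time_s(count, cpi, CLOCK_HZ) * 1e3
    print(f"{name}: time = {time_ms:.1f} ms, MIPS = {native_mips(cpi, CLOCK_HZ):.3f}")
```

Because `native_mips` never sees the instruction count, the version with the higher rating (0.833) is the one that takes longer (1.2 ms versus 1.0 ms).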

Page 25: Performance Measurements

Avoid MIPS (the metric, not the processor)

MIPS = clock rate / (CPI * 1,000,000)

• Higher MIPS doesn’t always mean better performance

• Highest MIPS corresponds to using the smallest (fastest) instructions to lower CPI

• Peak MIPS is pointless

• Peak MIPS is just the MIPS you get with the smallest instructions

• Usually, CPI is 1.0 for this

• Just re-expressing clock rate in MHz

2.4

Page 26: Performance Measurements

MFLOPS

• Million Floating-point Operations Per Second

• MFLOPS is similar to MIPS

• Measures floating-point operations (mult, divide, add,...)

• Suffers same problems as MIPS

• Different operations cost different amounts

• Peak MFLOPS is especially bad

2.4

Page 27: Performance Measurements

Performance Summary

• Execution time is the most important performance metric

• Basic formula for performance:

• Execution time = instructions * cycle time * CPI

• Amdahl’s law describes how making limited improvements affects the bottom line

• Only make improvements in areas that are commonly used

• Standard benchmarks help us to compare performance of various computers

• Beware of overly-simplified comparisons

Page 28: Performance Measurements

Pitfalls and Fallacies

• Processors with the same ISA can be compared by clock rate or a single benchmark suite alone

• We don’t know the pipeline structure and memory system

• Peak performance tracks observed performance

• One processor may operate closer to peak performance most of the time than another

• MIPS is an accurate measure of performance

Page 29: Performance Measurements

Example

We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:

                  M1        M2
Clock Frequency   300 MHz   200 MHz

Two programs were run on both machines and the following measurements were made:

Program   Time on M1   Time on M2
1         6 seconds    4 seconds
2         8 seconds    10 seconds

In addition, the following measurements were made:

Program   No. of Instructions Executed on M1   No. of Instructions Executed on M2
1         180x10^6                             100x10^6

1. For each program, which machine is faster and by how much?

2. Find the clock cycles per instruction (CPI or average CPI) for Program 1 on both machines

3. On M1, each multiplication instruction involves 20 clock cycles. Suppose 20% of the instructions in Program 1 running on M1 are multiplications. What percentage of the CPU time is spent doing multiplications during the execution of Program 1 on M1?

4. Find the instruction execution rate (i.e., the number of instructions executed per second) for each machine when running Program 1

5. Assuming the CPI for the machines is constant, find the instruction count for Program 2 running on each machine using the execution times.

Page 30: Performance Measurements

Solution

1. For program 1, M2 is faster: it takes 2 sec, or (6-4)/6 = 33%, less time
For program 2, M1 is faster: it takes 2 sec, or (10-8)/10 = 20%, less time

2. t_M1P1 = INSTR_M1P1 x CPI_M1P1 x 1/f_M1 => CPI_M1P1 = (t_M1P1 x f_M1)/INSTR_M1P1 = (6 x 300)/180 = 10
Likewise CPI_M2P1 = (4 x 200)/100 = 8

3. INSTR_MULT,M1P1 = 0.2 x 180x10^6 = 36x10^6 instructions
t_MULT,M1P1 = INSTR_MULT,M1P1 x 20 x 1/(300x10^6) = 720/300 = 2.4 sec
t_MULT,M1P1 / t_M1P1 = 2.4/6 = 40%

4. MIPS_M1P1 = (INSTR_M1P1 / t_M1P1) x 10^-6 = 180/6 = 30
MIPS_M2P1 = (INSTR_M2P1 / t_M2P1) x 10^-6 = 100/4 = 25

5. t_M1P2 = INSTR_M1P2 x CPI_M1P2 x 1/f_M1 => INSTR_M1P2 = (t_M1P2 x f_M1)/CPI_M1P1 = (8 x 300x10^6)/10 = 240x10^6
INSTR_M2P2 = (t_M2P2 x f_M2)/CPI_M2P1 = (10 x 200x10^6)/8 = 250x10^6
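The solution's arithmetic can be double-checked with a short script. This is only a sketch for verifying the numbers; the dictionary and function names are made up, and the measurements are copied from the example.

```python
# Measurements from the example slide.
CLOCK_HZ = {"M1": 300e6, "M2": 200e6}
TIME_P1_S = {"M1": 6.0, "M2": 4.0}
TIME_P2_S = {"M1": 8.0, "M2": 10.0}
INSTR_P1 = {"M1": 180e6, "M2": 100e6}


def cpi(machine):
    """Average CPI for Program 1: time * clock rate / instruction count."""
    return TIME_P1_S[machine] * CLOCK_HZ[machine] / INSTR_P1[machine]


def mips(machine):
    """Instruction execution rate for Program 1, in millions per second."""
    return INSTR_P1[machine] / TIME_P1_S[machine] * 1e-6


def instr_p2(machine):
    """Instruction count for Program 2, assuming the same CPI as Program 1."""
    return TIME_P2_S[machine] * CLOCK_HZ[machine] / cpi(machine)


def mult_time_share(machine="M1", mult_fraction=0.20, mult_cycles=20):
    """Fraction of Program 1's CPU time spent in multiplications."""
    mult_time_s = mult_fraction * INSTR_P1[machine] * mult_cycles / CLOCK_HZ[machine]
    return mult_time_s / TIME_P1_S[machine]


for m in ("M1", "M2"):
    print(f"{m}: CPI = {cpi(m):.1f}, MIPS = {mips(m):.1f}, P2 instructions = {instr_p2(m):.3e}")
print(f"MULT share of CPU time on M1 = {mult_time_share():.0%}")
```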

Page 31: Performance Measurements

Example

Page 32: Performance Measurements

Review Questions

• Is CPI constant for a given processor (does not change from one program to another)?

• Two processors with the same Instruction Set Architecture have the same CPI

• True

• False

• Is MIPS constant for a given processor (does not change from one program to another)?

• Two processors with the same Instruction Set Architecture have the same MIPS

• True

• False

Page 33: Performance Measurements

Review Questions

• Which of the following performance metrics is generally easier for the programmer to improve?

• The instruction count

• The average CPI

• The clock frequency

• Peak MIPS

• What would you consider as most important when selecting the fastest processor for a certain application domain?

• The operating clock frequency

• MIPS

• Peak MIPS

• Execution time for relevant benchmarks

• How can you increase a processor’s clock frequency?

• Write a better program

• Use a better compiler

• Implement the processor in a faster VLSI technology

• Use a larger memory

Page 34: Performance Measurements

Example

We wish to consider the performance of two different machines: M1 and M2. The clock frequencies for the two machines are as follows:

                  M1        M2
Clock Frequency   800 MHz   1000 MHz

A program was run on both machines and the following measurements were made:

Time on M1    Time on M2
2.5 seconds   2 seconds

In addition, the following measurements were made:

No. of Instructions Executed on M1   No. of Instructions Executed on M2
100x10^6                             125x10^6

Finally, the frequency with which each instruction occurs in the program on M1 and M2 is shown in the following table:

Instruction   M1%   M2%
ADD           40    60
MULT          10    8
CMP           20    12
SUB           30    20

1. Find the clock cycles per instruction (CPI or average CPI) for the program on both machines

2. How much faster will the program run on M1 and M2 respectively if we

a) reduce the execution time of the ADD instruction by 20%, assuming that an ADD instruction requires 5 cycles on both machines

b) reduce the execution time of the MULT instruction by 20%, assuming a MULT instruction requires 20 cycles on M1 and 25 cycles on M2

c) Which is better for M1 and which for M2?

Page 35: Performance Measurements

Example 2

Integer and floating point operations benchmarks were run on an Intel Atom Z2760 @ 1.8 GHz using Novabench.

• The results were 69815088 IOPs and 41113792 FLOPs per second, respectively

• 1. Calculate the MIOPS and MFLOPS

• 2. Assuming 2 Instructions/integer operation, calculate the CPI

• 3. Calculate the execution time of a program composed of 70% integer operations and 30% floating-point operations with a total of 500 million operations

