8/9/2019 L03 Principles
Roman
Japanese
Chinese (compute in hex?)
COMP 206: Computer Architecture and Implementation
Montek Singh
Thu, Jan 22, 2009
Lecture 3: Quantitative Principles
Quantitative Principles of Computer Design
This is an intro to design and analysis:
- Take Advantage of Parallelism
- Principle of Locality
- Focus on the Common Case
- Amdahl's Law
- The Processor Performance Equation
1) Taking Advantage of Parallelism (examples)
- Increase throughput of a server computer via multiple processors or multiple disks
- Detailed HW design:
  - Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
  - Multiple memory banks searched in parallel in set-associative caches
- Pipelining (next slides)
Pipelining
Overlap instruction execution to reduce the total time to complete an instruction sequence.
Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible.
Classic 5-stage pipeline:
1) Instruction Fetch (Ifetch)
2) Register Read (Reg)
3) Execute (ALU)
4) Data Memory Access (Dmem)
5) Register Write (Reg)
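The payoff of the overlap described above can be sketched numerically. This is a minimal, hypothetical model (not from the slides), assuming one clock cycle per stage and no hazards or stalls; the function name is my own:

```python
# Hypothetical sketch: cycles to complete an instruction sequence on a
# k-stage machine, pipelined vs. unpipelined. Assumes every stage takes
# one cycle and there are no hazards or stalls.

def total_cycles(n_instructions, n_stages, pipelined=True):
    """Clock cycles to finish n_instructions."""
    if pipelined:
        # The first instruction needs n_stages cycles to drain;
        # each later instruction completes one cycle after the previous.
        return n_stages + (n_instructions - 1)
    # Unpipelined: each instruction occupies all stages serially.
    return n_instructions * n_stages

print(total_cycles(4, 5, pipelined=False))  # 20
print(total_cycles(4, 5, pipelined=True))   # 8
```

With 4 instructions and 5 stages, pipelining cuts 20 cycles to 8; the advantage grows with the length of the instruction sequence.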
Pipelined Instruction Execution
[Pipeline diagram: four instructions in program order, overlapped across clock cycles 1-7; each instruction flows through Ifetch, Reg, ALU, Dmem, Reg, advancing one stage per cycle.]
Limits to pipelining
Hazards prevent the next instruction from executing during its designated clock cycle:
- Structural hazards: attempt to use the same hardware to do two different things at once
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Increasing Clock Rate
Pipelining is also used for this. Clock rate is determined by gate delays.
[Diagram: a block of combinational logic between latches/registers.]
2) The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space. Also, they reuse data.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 30 years, HW has relied on locality for memory performance.
Levels of the Memory Hierarchy

    Level           Capacity         Access time               Cost
    CPU registers   100s of bytes    300-500 ps (0.3-0.5 ns)
    L1/L2 cache     10s-100s of KB   ~1 ns - ~10 ns            $1000s/GByte
    Main memory     GBytes           80 ns - 200 ns            ~$100/GByte
    Disk            10s of TBytes    10 ms (10,000,000 ns)     ~$1/GByte
    Tape            infinite         sec-min                   ~$1/GByte

Staging/transfer units between levels (the upper level is faster, the lower level is larger):

    Between           Unit             Managed by       Transfer size
    Registers - L1    Instr. operands  prog./compiler   1-8 bytes
    L1 - L2           Blocks           cache cntl       32-64 bytes
    L2 - Memory       Blocks           cache cntl       64-128 bytes
    Memory - Disk     Pages            OS               4K-8K bytes
    Disk - Tape       Files            user/operator    MBytes
3) Focus on the Common Case
In making a design trade-off, favor the frequent case over the infrequent case.
- e.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
- e.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
The frequent case is often simpler and can be done faster than the infrequent case.
- e.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow
- This may slow down the overflow case, but overall performance is improved by optimizing for the normal case
What is the frequent case, and how much is performance improved by making that case faster? => Amdahl's Law
4) Amdahl's Law (History, 1967)
"Validity of the single processor approach to achieving large scale computing capabilities," G. M. Amdahl, AFIPS Conference Proceedings, pp. 483-485, April 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Historical context:
- Amdahl was demonstrating the continued validity of the single processor approach and the weaknesses of the multiple processor approach
- The paper contains no mathematical formulation, just arguments and simulation:
  "The nature of this overhead appears to be sequential so that it is unlikely to be amenable to parallel processing techniques."
  "A fairly obvious conclusion which can be drawn at this point is that the effort expended on achieving high parallel performance rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
Nevertheless, it is of widespread applicability in all kinds of situations.
Speedup
The book shows two forms of the speedup equation:

    Speedup_overall = ExTime_new / ExTime_old

    Speedup_overall = ExTime_old / ExTime_new

We will use the second because you get speedup factors like 2X.
4) Amdahl's Law

    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Best you could ever hope to do:

    Speedup_maximum = 1 / (1 − Fraction_enhanced)
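Amdahl's Law as stated above is a one-liner in code. This is a small sketch (the function names are my own, not from the slides); the usage line applies it to the I/O-bound server example that follows:

```python
# Amdahl's Law: overall speedup when a fraction of execution time
# is sped up by a given factor.

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Speedup_overall = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

def max_speedup(fraction_enhanced):
    """Limit as speedup_enhanced -> infinity."""
    return 1.0 / (1.0 - fraction_enhanced)

# I/O-bound server: the CPU part (40% of time) is made 10X faster.
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
print(round(max_speedup(0.4), 2))         # 1.67, even with an infinitely fast CPU
```

The second print shows why the 60% waiting time dominates: no CPU improvement alone can push the overall speedup past 1.67X.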
Amdahl's Law example
New CPU is 10X faster. The server is I/O bound, so 60% of its time is spent waiting for I/O (Fraction_enhanced = 0.4):

    Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
                    = 1 / [ (1 − 0.4) + 0.4 / 10 ]
                    = 1 / 0.64
                    = 1.56

It's human nature to be attracted by "10X faster," vs. keeping in perspective that it's just 1.6X faster.
Amdahl's Law for Multiple Tasks

    R_avg = 1 / ( Σ_i F_i / R_i ),  where Σ_i F_i = 1

Here R_avg is the average execution rate (performance), and F_i is the fraction of results generated at rate R_i. Note: F_i is NOT the fraction of time spent working at that rate. In units: [results/second] = 1 / ( [] / [results/second] ).

"Bottleneckology: Evaluating Supercomputers," Jack Worlton, COMPCON 85, pp. 405-406
Example
30% of results are generated at the rate of 1 MFLOPS, 20% at 10 MFLOPS, and 50% at 100 MFLOPS. What is the average performance in MFLOPS? What is the bottleneck?

    R_avg = 1 / ( 0.3/1 + 0.2/10 + 0.5/100 )
          = 1 / ( 0.300 + 0.020 + 0.005 )
          = 1 / 0.325
          = 3.08 MFLOPS

Fractions of time spent at each rate: 0.300/0.325 = 92.3%, 0.020/0.325 = 6.2%, 0.005/0.325 = 1.5%.

Bottleneck: the rate that consumes most of the time -- here the 1 MFLOPS rate, which accounts for 92.3% of the time.
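The weighted-harmonic-mean rate and the time breakdown above can be sketched in a few lines (helper names are my own):

```python
# Average rate over tasks where F_i is the fraction of *results*
# produced at rate R_i (not the fraction of time).

def average_rate(fractions, rates):
    """R_avg = 1 / sum(F_i / R_i)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

def time_fractions(fractions, rates):
    """Share of total time spent at each rate: (F_i/R_i) / sum(F_j/R_j)."""
    times = [f / r for f, r in zip(fractions, rates)]
    total = sum(times)
    return [t / total for t in times]

F, R = [0.3, 0.2, 0.5], [1.0, 10.0, 100.0]
print(round(average_rate(F, R), 2))                       # 3.08 (MFLOPS)
print([round(100 * t, 1) for t in time_fractions(F, R)])  # [92.3, 6.2, 1.5]
```

The second line makes the bottleneck visible at a glance: the slowest rate dominates the time even though it produces only 30% of the results.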
Another Example
Which change is more effective on a certain machine: speeding up 10-fold the floating-point square root operation only, which takes up 20% of execution time, or speeding up 2-fold all floating-point operations, which take up 50% of total execution time? (Assume that the cost of accomplishing either change is the same, and that the two changes are mutually exclusive.)

Definitions:
- F_sqrt = fraction of FP sqrt results; R_sqrt = rate of producing FP sqrt results
- F_non-sqrt = fraction of non-sqrt results; R_non-sqrt = rate of producing non-sqrt results
- F_fp = fraction of FP results; R_fp = rate of producing FP results
- F_non-fp = fraction of non-FP results; R_non-fp = rate of producing non-FP results
- R_before / R_after = average rate of producing results before / after the enhancement

From the given time fractions:

    F_non-sqrt / R_non-sqrt = 4 × ( F_sqrt / R_sqrt )    (sqrt takes 20% of the time)
    F_non-fp / R_non-fp = F_fp / R_fp                    (FP ops take 50% of the time)
Solution using Amdahl's Law

Improve FP sqrt only. Let x = F_sqrt / R_sqrt:

    1/R_before = F_sqrt/R_sqrt + F_non-sqrt/R_non-sqrt = x + 4x = 5x
    1/R_after  = F_sqrt/(10 × R_sqrt) + F_non-sqrt/R_non-sqrt = 0.1x + 4x = 4.1x
    Speedup = R_after / R_before = 5x / 4.1x = 1.22

Improve all FP ops. Let y = F_fp / R_fp:

    1/R_before = F_fp/R_fp + F_non-fp/R_non-fp = y + y = 2y
    1/R_after  = F_fp/(2 × R_fp) + F_non-fp/R_non-fp = 0.5y + y = 1.5y
    Speedup = R_after / R_before = 2y / 1.5y = 1.33

Speeding up all FP operations (1.33X) is more effective than speeding up sqrt alone (1.22X).
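The same comparison can be cross-checked with the time form of Amdahl's Law, which is equivalent to the rate algebra above (the function name is my own):

```python
# Time form of Amdahl's Law: speedup = 1 / ((1 - F) + F / S),
# where F is the fraction of execution time affected.

def amdahl_speedup(fraction, speedup):
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

# Option 1: FP sqrt (20% of time) made 10x faster.
print(round(amdahl_speedup(0.2, 10), 2))  # 1.22
# Option 2: all FP ops (50% of time) made 2x faster.
print(round(amdahl_speedup(0.5, 2), 2))   # 1.33 -- the more effective change
```

Both routes agree: a modest speedup applied to a larger fraction of time beats a dramatic speedup applied to a small fraction.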
Implications of Amdahl's Law
- Improvements provided by a feature are limited by how often the feature is used
- As stated, Amdahl's Law is valid only if the system always works at exactly one of the rates
  - Overlap between CPU and I/O operations? Amdahl's Law as given here is not applicable
- The bottleneck is the most promising target for improvements
  - Make the common case fast
  - Infrequent events, even if they consume a lot of time, will make little difference to performance
- Typical use: change only one parameter of the system, and compute the effect of this change
  - The same program, with the same input data, should run on the machine in both cases
5) Processor Performance

    CPU time [sec/program] = CPU cycles for program × clock cycle time [sec/clock cycle]

or

    CPU time [sec/program] = CPU cycles for program / clock rate
CPI -- Clocks per Instruction

    CPI [clock cycles/instruction] = CPU cycles for program / instruction count

    CPU time [sec/program] = instruction count × CPI × clock cycle time
                           = instruction count × CPI / clock rate
Details of CPI
We can break performance down into individual types of instructions (instructions of type i) -- a simplistic CPU:

    CPU cycles = Σ_i ( CPI_i × IC_i )

    CPI = Σ_i ( CPI_i × IC_i ) / instruction count

    CPU performance = clock rate / Σ_i ( CPI_i × IC_i )
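The per-type breakdown above is easy to compute directly. A small sketch (function names are my own); the usage lines apply it to the load/store mix used in the examples that follow:

```python
# CPI from an instruction mix, and CPU time from the performance equation.

def average_cpi(freqs, cpis):
    """CPI = sum_i CPI_i * IC_i / IC, with IC_i/IC given as frequencies."""
    return sum(f * c for f, c in zip(freqs, cpis))

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# ALU 43% @ 1 cycle; loads 21%, stores 12%, branches 24% @ 2 cycles each.
cpi = average_cpi([0.43, 0.21, 0.12, 0.24], [1, 2, 2, 2])
print(round(cpi, 2))                         # 1.57
print(round(cpu_time(1e9, cpi, 500e6), 2))   # 3.14 seconds for 10^9 instructions
```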
Processor Performance Eqn
How can we improve performance?

    Factor                                   Clock rate   CPI   Instruction count
    Hardware technology (realization)            x
    Hardware organization (implementation)       x         x
    Instruction set (architecture)                         x          x
    Compiler technology                                    x          x
    Program                                                x          x
Example 1
A LOAD/STORE machine has the characteristics shown below. We also observe that 25% of the ALU operations directly use a loaded value that is not used again. Thus we hope to improve things by adding new ALU instructions that have one source operand in memory. The CPI of the new instructions is 2. The only unpleasant consequence of this change is that the CPI of branch instructions will increase from 2 to 3. Overall, will CPU performance increase?

    Instruction type   Frequency   CPI
    ALU ops            0.43        1
    Loads              0.21        2
    Stores             0.12        2
    Branches           0.24        2
Example 1 (Solution)

Before the change:

    Instruction type   Frequency   CPI
    ALU ops            0.43        1
    Loads              0.21        2
    Stores             0.12        2
    Branches           0.24        2

    CPI = 0.43 × 1 + (0.21 + 0.12 + 0.24) × 2 = 1.57
    CPU time = IC × CPI × clock cycle time = 1.57 × IC × T

After the change, let x = 0.25 × 0.43 = 0.1075 be the fraction of (old) instructions converted; the new instruction count is (1 − x) × IC:

    Instruction type   Frequency            CPI
    ALU ops            (0.43 − x)/(1 − x)   1
    Loads              (0.21 − x)/(1 − x)   2
    Stores             0.12/(1 − x)         2
    Branches           0.24/(1 − x)         3
    Reg-mem ops        x/(1 − x)            2

    CPI = [ (0.43 − x) × 1 + (0.21 − x) × 2 + 0.12 × 2 + 0.24 × 3 + x × 2 ] / (1 − x)
        = 1.7025 / 0.8925 = 1.908
    CPU time = (1 − x) × IC × 1.908 × T = 1.703 × IC × T

Since CPU time increases (1.703 > 1.57), the change will not improve performance.
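The arithmetic in Example 1 can be re-checked numerically. This is a sketch (variable names are mine, not the slides'); times are in units of (original IC) × (clock cycle time T):

```python
# Numeric re-check of Example 1: does adding reg-mem ALU ops help?

# Before: base load/store mix.
cpi_before = 0.43 * 1 + (0.21 + 0.12 + 0.24) * 2
time_before = 1.0 * cpi_before      # IC * CPI, in units of IC * T

# After: x = 25% of ALU ops fuse with their load into a reg-mem op (CPI 2);
# branches now cost 3 cycles; each fused pair removes one instruction.
x = 0.25 * 0.43
cycles_after = ((0.43 - x) * 1      # remaining ALU ops
                + (0.21 - x) * 2    # remaining loads
                + 0.12 * 2          # stores
                + 0.24 * 3          # branches (CPI now 3)
                + x * 2)            # new reg-mem ops
time_after = cycles_after           # cycles per *original* instruction

print(time_after > time_before)     # True: the "improvement" hurts performance
```

Note that comparing cycles per original instruction sidesteps the CPI-per-new-instruction normalization entirely; both routes give the same verdict.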
Example 2
A load-store machine has the characteristics shown below. An optimizing compiler for the machine discards 50% of the ALU operations, although it cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns) clock, what is the MIPS rating for optimized code versus unoptimized code? Does the ranking of MIPS agree with the ranking of execution time?

    Instruction type   Frequency   CPI
    ALU ops            43%         1
    Loads              21%         2
    Stores             12%         2
    Branches           24%         2
Example 2 (Solution)

Without optimization:

    CPI = 0.43 × 1 + (0.21 + 0.12 + 0.24) × 2 = 1.57
    CPU time = IC × CPI × clock cycle time = IC × 1.57 × 2 ns = 3.14 ns × IC
    MIPS = 500 MHz / (1.57 × 10^6) = 318.5

With optimization, let x = 0.43/2 = 0.215 be the discarded ALU ops; the instruction count becomes (1 − x) × IC:

    CPI = [ (0.43 − x) × 1 + (0.21 + 0.12 + 0.24) × 2 ] / (1 − x)
        = 1.355 / 0.785 = 1.73
    CPU time = (1 − x) × IC × 1.73 × 2 ns = 2.72 ns × IC
    MIPS = 500 MHz / (1.73 × 10^6) = 289.0

Performance increases, but MIPS decreases!
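The MIPS-versus-time disagreement in Example 2 can be sketched directly (variable names are mine):

```python
# Example 2 re-check: MIPS = clock rate / (CPI * 10^6).

clock_hz = 500e6

# Unoptimized code.
cpi_un = 0.43 * 1 + (0.21 + 0.12 + 0.24) * 2
mips_un = clock_hz / (cpi_un * 1e6)
time_un = 1.0 * cpi_un / clock_hz                 # time per original instruction

# Optimized: half of the ALU ops are discarded (x = 0.215 of the old mix),
# so the instruction count shrinks to (1 - x) and the mix changes.
x = 0.43 / 2
cpi_opt = ((0.43 - x) * 1 + (0.21 + 0.12 + 0.24) * 2) / (1 - x)
mips_opt = clock_hz / (cpi_opt * 1e6)
time_opt = (1 - x) * cpi_opt / clock_hz           # time per original instruction

print(time_opt < time_un)    # True: optimized code runs faster...
print(mips_opt < mips_un)    # True: ...yet its MIPS rating is lower
```

The discrepancy arises because the optimization removes only the cheapest (CPI 1) instructions, raising the average CPI even as total work shrinks.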
Performance of (Blocking) Caches

With no cache misses:

    CPU time = CPU cycles × clock cycle time
    CPU cycles = IC × CPI          (IC = instruction count)

With cache misses:

    CPU time = (CPU cycles + memory stall cycles) × clock cycle time

    Memory stall cycles = number of misses × miss penalty
                        = IC × (misses/instruction) × miss penalty
                        = IC × (memory references/instruction) × (misses/memory reference) × miss penalty
Example
Assume we have a machine where the CPI is 2.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 40% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all memory accesses were cache hits?

    CPI_misses = CPI + (memory refs/instruction) × miss rate × miss penalty
               = 2 + (1 + 0.4) × 0.02 × 25
               = 2 + 0.7 = 2.7

    CPU time_misses / CPU time_no-misses = 2.7 / 2 = 1.35

Why (1 + 0.4)? Every instruction fetch is itself a memory reference, and 40% of instructions additionally make a data reference.
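The stall-cycle arithmetic above fits in one function. A sketch (names are mine):

```python
# Effective CPI including memory stalls.

def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    """CPI + (refs/instruction) * miss rate * miss penalty."""
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

# 1.0 instruction fetch + 0.4 data references per instruction.
cpi = effective_cpi(2.0, 1.0 + 0.4, 0.02, 25)
print(round(cpi, 2))        # 2.7
print(round(cpi / 2.0, 2))  # 1.35 -- the all-hits machine is 1.35x faster
```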
Fallacies and Pitfalls
- Fallacies: commonly held misconceptions. When discussing a fallacy, we try to give a counterexample.
- Pitfalls: easily made mistakes, often generalizations of principles that are true in limited contexts.
We show fallacies and pitfalls to help you avoid these errors.
Fallacies and Pitfalls (1/3)
Fallacy: Benchmarks remain valid indefinitely.
- Once a benchmark becomes popular, there is tremendous pressure to improve performance by targeted optimizations or by aggressive interpretation of the rules for running the benchmark: "benchmarksmanship."
- Of 70 benchmarks from the 5 SPEC releases, 70% were dropped from the next release because they were no longer useful.
Pitfall: A single point of failure.
- Rule of thumb for fault-tolerant systems: make sure that every component is redundant, so that no single component failure can bring down the whole system (e.g., power supply).
Fallacies and Pitfalls (2/3)
Fallacy: The rated MTTF of disks is 1,200,000 hours, or ~140 years, so disks practically never fail.
- Disk lifetime is ~5 years: replace a disk every 5 years, and on average 28 replacement cycles wouldn't fail (140 years is a long time!)
- Is that meaningful? A better unit: the % that fail in 5 years (next slide).
Fallacies and Pitfalls (3/3)

    MTTF = (number of disks × time period) / failed disks
    1,200,000 hours = (1000 disks × 5 × 365 × 24 hours) / failed disks
    => failed disks = 1000 × 43,800 / 1,200,000 ≈ 37

So 3.7% will fail over 5 years. But this is under pristine conditions: little vibration, narrow temperature range, no power failures. In the real world:
- 3% to 6% of SCSI drives fail per year: 3400-6800 FIT, or 150,000-300,000 hour MTTF [Gray & van Ingen 05]
- 3% to 7% of ATA drives fail per year: 3400-8000 FIT, or 125,000-300,000 hour MTTF [Gray & van Ingen 05]
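The failure arithmetic above can be sketched in a couple of lines (the function name is mine; it assumes a constant failure rate of 1/MTTF, as the slide does):

```python
# Expected fraction of a disk population failing within a time window,
# given the rated MTTF and a constant failure rate.

def failed_fraction(mttf_hours, years):
    return years * 365 * 24 / mttf_hours

pct = 100 * failed_fraction(1_200_000, 5)
print(round(pct, 2))   # 3.65 -- the slide rounds 36.5 failed disks up to 37, i.e. ~3.7%
```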
Next Time
Instruction Set Architecture (Appendix B)
References
G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS Conference Proceedings, pp. 483-485, April 1967. http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf