Hybrid System Emulation
Taeweon Suh
Computer Science Education, Korea University
January 2010
2/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
3/36
Scope
[Diagram: CPU – FSB (Front-Side Bus) – North Bridge – Main Memory (DDR2); North Bridge – DMI (Direct Media I/F) – South Bridge]
A typical computer system (up to Core 2)
4/36
Scope (Cont.)
[Diagram: CPU with Main Memory (DDR3); QuickPath (Intel) or HyperTransport (AMD) link to the North Bridge; North Bridge – DMI (Direct Media I/F) – South Bridge]
A Nehalem-based computer system
5/36
Scope (Cont.)
[Diagram: CPU with multiple cores, each with L1 and L2 caches – FSB – North Bridge – Main Memory (DDR2); North Bridge – DMI – South Bridge. The CPU–FSB–North Bridge portion is the scope of this talk.]
6/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
7/36
Background
Computer architecture research has been done mostly with software simulation
– Pros
  Relatively easy to implement
  Flexibility
  Observability
  Debuggability
– Cons
  Simulation time
  Difficulty modeling the real world, such as I/O
8/36
Background (Cont.)
What is an alternative?
– FPGA (Field-Programmable Gate Array)
  Reconfigurability
  – Programmable hardware
  – Short turn-around time
  High operating frequency
  Observability and debuggability
  Many IPs provided
  – CPUs, memory controllers, etc.
9/36
Background (Cont.)
FPGA capability example
– Reconfigurable Pentium
[Photo: a real Pentium next to a reconfigurable Pentium implemented on an FPGA]
10/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
11/36
Related Work
MemorIES (2000)
– Memory Instrumentation and Emulation System from IBM T.J. Watson
– L3 cache and/or coherence protocol emulation
  Plugged into the 6xx bus of an RS/6000 SMP machine
– Passive emulator
12/36
Related Work (Cont.)
RAMP
– Research Accelerator for Multiple Processors
– Parallel computer architecture and multi-core HW/SW research
– Full emulator
– Multi-disciplinary project by UC Berkeley, Stanford, CMU, UT-Austin, MIT and Intel
[Photo: BEE2 board with FPGAs]
13/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
14/36
Hybrid System Emulation
Combination of an FPGA and a real system
– The FPGA is deployed in a system of interest
– The FPGA interacts with the system
  Monitors transactions from the system
  Provides feedback to the system
– System-level active emulation
  Run workloads on a real system
  Research, measure and evaluate the emulated components in a full-system configuration
In this research, the FPGA is deployed on the FSB
15/36
Hybrid System Emulation: Experiment Setup
[Photo: Intel server system with one Pentium-III and the FPGA board]
[Diagram: Front-side bus (FSB) connecting the Pentium-III, the FPGA, and the North Bridge with 2GB SDRAM]
Use an Intel server system equipped with two Pentium-IIIs
Replace one Pentium-III with an FPGA
– The FPGA actively participates in transactions on the FSB
16/36
Hybrid System Emulation: Front-Side Bus (FSB)
FSB protocol
– 7-stage pipelined bus (Pentium-III)
  Request1, request2, error1, error2, snoop, response, data
How does the FPGA participate in FSB transactions?
– Snoop stall
  Part of the cache coherence mechanism
  Delays the snoop response
– Cache-to-cache transfer
  Part of the cache coherence mechanism
  Provides data from a processor's cache to the requester via the FSB
17/36
Hybrid System Emulation: Cache Coherence Protocol
Example: MESI Protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)
[Diagram: two Pentium-IIIs (P0, P1) and main memory on the FSB via the North Bridge; memory initially holds 1234]
Example:
1. P0: read         -> P0: E 1234
2. P1: read         -> P0: S 1234, P1: S 1234 (shared)
3. P1: write (abcd) -> P0: I (invalidation), P1: M abcd
4. P0: read         -> "snoop stall", then cache-to-cache transfer: P0: S abcd, P1: S abcd
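The four-step trace above can be replayed with a small model. This is a minimal sketch of a snoop-based MESI protocol for a single cache line; the functions and data structures are illustrative, not the hardware implementation:

```python
# Minimal snoop-based MESI model for one cache line.
# State names follow MESI; everything else is invented for illustration.

M, E, S, I = "M", "E", "S", "I"

class Cache:
    def __init__(self):
        self.state, self.data = I, None

def read(requester, others, memory):
    """Bus read: snoop the other caches, pick the data source, settle states."""
    for c in others:
        if c.state == M:                 # dirty copy elsewhere:
            requester.data = c.data      # cache-to-cache transfer
            memory[0] = c.data           # memory is also updated
            c.state = requester.state = S
            return
        if c.state in (E, S):            # clean copy elsewhere: share it
            c.state = requester.state = S
            requester.data = memory[0]
            return
    requester.state = E                  # no other copy: Exclusive
    requester.data = memory[0]

def write(writer, others, value):
    """Bus write: invalidate all other copies, dirty the line."""
    for c in others:
        c.state = I                      # invalidation traffic
    writer.state, writer.data = M, value

# Replay the slide's example; memory initially holds "1234"
memory = ["1234"]
p0, p1 = Cache(), Cache()
read(p0, [p1], memory)       # 1. P0: read  -> P0: E
read(p1, [p0], memory)       # 2. P1: read  -> P0: S, P1: S
write(p1, [p0], "abcd")      # 3. P1: write -> P0: I, P1: M
read(p0, [p1], memory)       # 4. P0: read  -> cache-to-cache, both S
print(p0.state, p1.state, p0.data)   # -> S S abcd
```

Step 4 is the case the FPGA exploits on the real FSB: the snooping agent asserts a hit on a modified line and supplies the data itself.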
18/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation ┼
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
┼ Erico Nurvitadhi, Jumnit Hong and Shih-Lien Lu, "Active Cache Emulator," IEEE Transactions on VLSI Systems, 2008
19/36
L3 Cache Emulation: Methodology
– Implement the L3 tags in the FPGA
– On a miss, inject snoop stalls and store the information in the L3 tag
  "New" memory access latency (= L3 miss latency) = snoop stalls + memory access latency
– On a hit, no snoop stall
  L3 latency (= L3 hit latency) = memory access latency
[Diagram: Pentium-III (L1, L2) and the FPGA (L3 tags) on the FSB, with the North Bridge and 2GB SDRAM. On a miss the FPGA injects snoop stalls; on a hit, data returns at plain memory latency.]
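The hit/miss timing rule above can be sketched as a tags-only cache model: the FPGA holds no data, only tags, and its sole output is how many stall cycles to inject. The sizes, names, and the stall count below are assumptions for illustration:

```python
# Tags-only L3 emulation sketch: on a miss, inject snoop stalls so the
# bus "feels" an L3 miss penalty; on a hit, let the access proceed at
# plain memory speed. Sizes and the stall count are assumed values.

LINE = 32            # bytes per cache line (assumed)
SETS = 1 << 15       # 1MB direct-mapped emulated L3 with 32B lines

tags = [None] * SETS

def fsb_access(addr):
    """Return the number of snoop-stall cycles to inject for this address."""
    line = addr // LINE
    index = line % SETS
    tag = line // SETS
    if tags[index] == tag:
        return 0         # L3 hit: no stall, latency = memory latency
    tags[index] = tag    # L3 miss: allocate the tag, stretch the snoop phase
    return 4             # extra stall cycles (emulated miss penalty, assumed)

# First touch of a line misses; the second touch hits.
assert fsb_access(0x1234_0000) > 0
assert fsb_access(0x1234_0000) == 0
```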
20/36
L3 Cache Emulation: Experiment Environment
Operating system
– Windows XP
Validation of the emulated L3 cache
– RightMark Memory Analyzer ┼
┼ RightMark Memory Analyzer, http://cpu.rightmark.org/products/rmma.shtml
21/36
L3 Cache Emulation: Experiment Result
[Chart: RightMark Memory Analyzer result — access latency (CPU cycles / nsec) vs. working set size, with plateaus for the L1 cache, L2 cache, emulated L3 cache, and main memory]
22/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency ┼
– HW/SW Co-Simulation
Conclusions
┼ Taeweon Suh, Shih-Lien Lu and Hsien-Hsin S. Lee, "An FPGA Approach to Quantifying Coherence Traffic Efficiency on Multiprocessor Systems," 17th FPL, 2007
23/36
Evaluation of Coherence Traffic Efficiency: Methodology
– Implement an L2 cache in the FPGA
– Save evicted cache lines into the cache
– Supply the data with a cache-to-cache transfer when the P-III requests it next time
– Measure the execution time of benchmarks and compare with the baseline
[Diagram: Pentium-III (MESI) and the FPGA (D$) on the FSB, with the North Bridge and 2GB SDRAM; the FPGA answers requests with a "cache-to-cache transfer"]
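The methodology above amounts to a small victim-style cache sitting on the bus: it captures write-backs and answers later reads itself. A hedged sketch; the sizes, names, and data format are assumed:

```python
# Sketch of the FSB-side evaluation cache: the FPGA captures cache lines
# written back by the Pentium-III and, on a later read of the same line,
# answers with a cache-to-cache transfer instead of letting memory reply.
# Direct-mapped, as in the talk; sizes and names are assumptions.

LINE, SETS = 32, 1 << 5        # tiny 1KB direct-mapped cache for the sketch

cache = {}                     # set index -> (tag, data)
stats = {"c2c": 0, "reads": 0}

def writeback(addr, data):
    """P-III evicts a dirty line: capture it into the FPGA cache."""
    line = addr // LINE
    cache[line % SETS] = (line // SETS, data)

def bus_read(addr):
    """P-III reads a full line: serve it cache-to-cache on a hit,
    otherwise return None and let main memory supply the data."""
    stats["reads"] += 1
    line = addr // LINE
    entry = cache.get(line % SETS)
    if entry and entry[0] == line // SETS:
        stats["c2c"] += 1
        return entry[1]        # cache-to-cache transfer
    return None                # main memory supplies the data

writeback(0x8000, b"evicted line")
assert bus_read(0x8000) == b"evicted line"   # served cache-to-cache
assert bus_read(0x9000) is None              # miss: main memory answers
```

The `stats` counters mirror what the real design streams out over UART; the hit rate reported later in the talk is `c2c / reads`.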
24/36
Evaluation of Coherence Traffic Efficiency: Experiment Environment
Operating system
– Red Hat Linux 2.4.20-8
Natively run SPEC2000 benchmarks
– The selection of benchmarks does not affect the evaluation as long as a reasonable amount of bus traffic is generated
The FPGA sends statistics to a PC via UART
– # cache-to-cache transfers per second
– # invalidation transactions per second
25/36
Evaluation of Coherence Traffic Efficiency: Experiment Results
[Chart: average # cache-to-cache transfers per second for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf and their average, for FPGA cache sizes 1KB to 256KB; y-axis 100k to 800k, with callouts at 804.2K/sec and 433.3K/sec]
26/36
Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)
Average execution time increase
– Baseline: benchmark execution on a single P-III without the FPGA (data is always supplied from main memory)
[Chart: execution time increase over baseline (sec) vs. cache size in FPGA (1KB to 256KB), y-axis -20 to 200 sec; up to 191 seconds, 171 seconds on average. Average baseline execution time: 5635 seconds (93 min).]
27/36
Evaluation of Coherence Traffic Efficiency: Run-time Breakdown
Run-time estimation with a 256KB cache in the FPGA:

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Note that the execution time increased by 171 seconds on average, out of an average total baseline execution time of 5635 seconds
Cache-to-cache transfer is responsible for at least a 33-second (171-138) increase
Cache-to-cache transfer on the P-III server system is NOT as efficient as main memory access!
28/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation ┼
Conclusions
┼ Taeweon Suh, Hsien-Hsin S. Lee and John Shen, "Initial Observations of Hardware/Software Co-Simulation using FPGA in Architecture Research," 2nd WARFP, 2006
29/36
HW/SW Co-Simulation: Motivation
Gain advantages from both software simulation and hardware emulation
– Flexibility
– High speed
Idea
– Offload heavy software routines onto the FPGA
– The remaining simulator interacts with the FPGA
30/36
HW/SW Co-Simulation: Communication Method
Communication between the P-III and the FPGA
– Use the FSB as the communication medium
– Allocate one page in memory for communication
– Send data to the FPGA: write-through cache mode ("write" bus transaction)
– Receive data from the FPGA: cache-to-cache transfer ("read" bus transaction)
[Diagram: Pentium-III (MESI) and the FPGA on the FSB, with the North Bridge and 2GB SDRAM]
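The send/receive scheme above behaves like a mailbox in one shared page: stores appear on the FSB as write transactions the FPGA decodes, and loads are answered by the FPGA with cache-to-cache transfers. A sketch under assumed offsets and an assumed offloaded function; none of these constants come from the talk:

```python
# FSB mailbox sketch: one write-through page is the channel. CPU stores
# to the page become bus writes the FPGA decodes as commands; CPU loads
# from the page are answered by the FPGA (cache-to-cache) with results.
# PAGE_BASE, the register offsets, and the offloaded routine are assumed.

PAGE_BASE = 0xC000_0000           # physical page reserved by a device driver
CMD_OFF, ARG_OFF, RES_OFF = 0x0, 0x8, 0x10

class FpgaModel:
    """Stands in for the FPGA watching bus writes and serving bus reads."""
    def __init__(self):
        self.regs = {}

    def fsb_write(self, addr, value):     # write-through store from the CPU
        off = addr - PAGE_BASE
        self.regs[off] = value
        if off == CMD_OFF:                # command register acts as a doorbell
            arg = self.regs.get(ARG_OFF, 0)
            # stand-in for an offloaded routine such as mem_access_latency
            self.regs[RES_OFF] = 100 if arg > (1 << 20) else 10

    def fsb_read(self, addr):             # answered via cache-to-cache transfer
        return self.regs.get(addr - PAGE_BASE, 0)

fpga = FpgaModel()
fpga.fsb_write(PAGE_BASE + ARG_OFF, 1 << 22)   # simulator passes an argument
fpga.fsb_write(PAGE_BASE + CMD_OFF, 1)         # ring the doorbell
print(fpga.fsb_read(PAGE_BASE + RES_OFF))      # -> 100
```

The real design needs a device driver only because the simulator must know the page's physical address; the protocol itself is just the two bus transaction types.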
31/36
HW/SW Co-Simulation: Co-Simulation Results
Preliminary experiment with SimpleScalar for a correctness check
– Implemented a simple function (mem_access_latency) in the FPGA

Benchmark   Baseline (h:m:s)   Co-simulation (h:m:s)   Difference (h:m:s)
mcf         2:18:38            2:20:50                 + 0:02:12
bzip2       3:03:58            3:06:50                 + 0:02:52
crafty      2:56:38            2:59:28                 + 0:02:50
eon-cook    2:43:52            2:45:45                 + 0:01:53
gcc-166     3:45:30            3:48:56                 + 0:03:26
parser      3:34:57            3:37:27                 + 0:02:30
perl        2:42:30            2:45:50                 + 0:03:20
twolf       2:43:30            2:45:28                 + 0:01:58
32/36
HW/SW Co-Simulation: Analysis & Learnings
Reasons for the slowdown
– FSB access is expensive
– The offloaded function (mem_access_latency) is too simple
– Device driver overhead
Success criteria
– Time-consuming software routines
– Reasonable FPGA access frequency
33/36
HW/SW Co-Simulation: Research Opportunity
Multi-core research
– Implement distributed lowest-level caches and an interconnection network, such as a ring or mesh, in the FPGA
[Diagram: eight cores (CPU0–CPU7, each with L1 and L2) attached through ring interfaces to per-core L3 slices, all emulated in the FPGA]
34/36
Agenda
Scope
Background
Related Work
Hybrid System Emulation
Case Studies
– L3 Cache Emulation
– Evaluation of Coherence Traffic Efficiency
– HW/SW Co-Simulation
Conclusions
35/36
Conclusions
Hybrid system emulation
– Deploy an FPGA at a place of interest in a system
– System-level active emulation
– Takes advantage of an existing system
Presented 3 use cases in computer architecture research
– L3 cache emulation
– Evaluation of coherence traffic efficiency
– HW/SW co-simulation
FPGA-based emulation provides an alternative to software simulation
36/36
Questions, Comments?
Thanks for your attention!
37/36
Backup Slides
38/36
Evaluation of Coherence Traffic Efficiency: Cache Coherence Protocol
Example: MESI Protocol
– Snoop-based protocol
– Intel implements MESI (Modified, Exclusive, Shared, Invalid)
[Diagram: P0 and P1 on the FSB with the North Bridge and main memory; memory initially holds 1234]
Example:
1. P0: read         -> P0: E 1234
2. P1: read         -> P0: S 1234, P1: S 1234 (shared)
3. P1: write (abcd) -> P0: I (invalidate), P1: M abcd
4. P0: read         -> cache-to-cache transfer: P0: S abcd, P1: S abcd
39/36
L3 Cache Emulation: Motivation
Software simulation has limitations
– Simulation time
– Reduced datasets and workloads
  Results could be off by 100% or more
Passive emulation has limitations
– Monitors transactions only
– The impact of the emulated components on the system cannot be modeled
Full simulation requires much more effort
– Takes much longer to develop
  Develop a full system
  Adapt workloads to a new system
40/36
L3 Cache Emulation: Motivation (Cont.)
Active Cache Emulation (ACE)
– Takes advantage of an existing system
– Deploys an emulated component at a place of interest
41/36
L3 Cache Emulation: HW Design
Implemented modules in the FPGA (Xilinx Virtex-II)
– State machines
  Keep track of up to 8 outstanding FSB transactions
– L3 tags
  The emulated L3 size varies from 1MB to 64MB
  The block size varies from 32B to 512B
– Statistics module
[Diagram: FPGA containing the FSB-pipeline state machines (8), the L3 cache tags, and statistics registers; connected to the FSB, a logic analyzer, and a PC via UART]
42/36
Evaluation of Coherence Traffic Efficiency: HW Design
Implemented modules in the FPGA (Xilinx Virtex-II)
– State machines
  Keep track of FSB transactions
  – Taking evicted data from the FSB (write-back)
  – Initiating cache-to-cache transfers
– Direct-mapped caches
  The cache size in the FPGA varies from 1KB to 256KB
  Note that the Pentium-III has a 256KB 4-way set-associative L2
– Statistics module
[Diagram: FPGA containing the state machines (write-back, cache-to-cache, the rest), a direct-mapped cache (tag + data), and statistics registers; connected to the FSB, a logic analyzer, and a PC via UART]
43/36
HW/SW Co-Simulation: Implementation
Hardware (FPGA) implementation
– State machines
  Monitoring bus transactions on the FSB
  Checking bus transaction types (read or write)
  Managing cache-to-cache transfers
– Software functions moved to the FPGA
– Statistics counters
Software implementation
– Linux device driver
  A specific physical address is needed for communication
  Allocate one page of memory for FPGA access via the Linux device driver
– Simulator modification for accessing the FPGA
44/36
L3 Cache Emulation: Experiment Results (Cont.)
Comparison with SimpleScalar simulation
45/36
Evaluation of Coherence Traffic Efficiency: Motivation
Evaluation of coherence traffic efficiency
– Why is it important?
  Understand the impact of coherence traffic on system performance
  Reflect it in the communication architecture
– Problems with traditional methods
  Evaluation of the protocols themselves
  Software simulations
  Experiments on SMP machines: ambiguous
– Solution
  A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
46/36
Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)
[Chart: average increase of invalidation traffic per second for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf and their average, for FPGA cache sizes 1KB to 256KB; y-axis 0 to 300k, with callouts at 157.5K/sec and 306.8K/sec]
47/36
Evaluation of Coherence Traffic Efficiency: Experiment Results (Cont.)
[Chart: average hit rate (%) in the FPGA's cache for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf and their average, for cache sizes 1KB to 256KB; y-axis 0 to 70%, ranging from 16.9% to 64.89%]
Hit rate = # cache-to-cache transfers / # data reads (full cache line)
48/36
Motivation
Traditionally, evaluations of coherence protocols focused on reducing the bus traffic incurred by state transitions of the coherence protocols
– Trace-based simulations were mostly used for protocol evaluations
Software simulations are too slow for broad-range analysis of system behaviors
– In addition, it is very difficult to model the real world exactly, such as I/O
The system-wide performance impact of coherence traffic has not been explicitly investigated using real systems
This research provides a new method to evaluate and characterize the coherence traffic efficiency of snoop-based invalidation protocols using an off-the-shelf system and an FPGA
49/36
Motivation and Contribution
Evaluation of coherence traffic efficiency
– Motivation
  The memory wall is getting higher
  – Important to understand the impact of communication among processors
  Traditionally, evaluation of coherence protocols focused on the protocols themselves
  – Software-based simulation
  FPGA technology
  – The original Pentium fits into one Xilinx Virtex-4 LX200
  – Recent emulation efforts: RAMP consortium
– Contribution
  A novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency using an emulation technique
[Photos: MemorIES board (ASPLOS 2000) and BEE2 board]
50/36
Cache Coherence Protocols
A well-known technique for data consistency among multiprocessors with caches
Classification
– Snoop-based protocols
  Rely on broadcasting on a shared bus
  – Based on shared memory
  Symmetric access to main memory
  Limited scalability
  Used to build small-scale multiprocessor systems
  – Very popular in servers and workstations
– Directory-based protocols
  Message-based communication via an interconnection network
  – Based on distributed shared memory (DSM)
  Cache-coherent non-uniform memory access (ccNUMA)
  Scalable
  Used to build large-scale systems
  Actively studied in the 1990s
51/36
Cache Coherence Protocols (Cont.)
Snoop-based protocols
– Invalidation-based protocols
  Invalidate shared copies when writing
  1980s: Write-once, Synapse, Berkeley, and Illinois
  Current processors adopt different combinations of the states (M, O, E, S, and I)
  – MEI: PowerPC 750, MIPS64 20Kc
  – MSI: Silicon Graphics 4D series
  – MESI: Pentium class, AMD K6, PowerPC 601
  – MOESI: AMD64, UltraSPARC
– Update-based protocols
  Update shared copies when writing
  Dragon protocol and Firefly
52/36
Cache Coherence Protocols (Cont.)
Directory-based protocols
– Memory-based schemes
  Keep the directory at the granularity of a cache line in the home node's memory
  – One dirty bit, and one presence bit per node
  Storage overhead due to the directory
  Examples: Stanford DASH, Stanford FLASH, MIT Alewife, and SGI Origin
– Cache-based schemes
  Keep only a head pointer for each cache line in the home node's directory
  – Keep forward and backward pointers in the caches of each node
  Long latency due to serialization of messages
  Examples: Sequent NUMA-Q, Convex Exemplar, and Data General
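The memory-based scheme above (one dirty bit plus one presence bit per node, kept per cache line at the home node) can be sketched directly; the node count and method names are assumptions:

```python
# Sketch of a memory-based directory entry: a dirty bit and a presence
# bit vector, one bit per node. On a write, the directory tells the home
# node which sharers must receive invalidations. Names are illustrative.

NODES = 8

class DirEntry:
    def __init__(self):
        self.dirty = False
        self.presence = 0              # bit i set => node i holds a copy

    def record_read(self, node):
        self.presence |= 1 << node

    def record_write(self, node):
        """Return the list of nodes that must be invalidated."""
        sharers = [i for i in range(NODES)
                   if (self.presence >> i) & 1 and i != node]
        self.presence = 1 << node      # writer becomes the sole owner
        self.dirty = True
        return sharers

entry = DirEntry()
entry.record_read(1)
entry.record_read(3)
print(entry.record_write(3))           # -> [1]  (invalidate node 1 only)
```

The storage overhead the slide mentions is visible here: every cache line of home memory needs its own `NODES + 1` directory bits, which is what the cache-based schemes trade away for pointer chains.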
53/36
Emulation Initiatives for Protocol Evaluation
RPM (mid-to-late '90s)
– Rapid Prototyping engine for Multiprocessors from the Univ. of Southern California
– ccNUMA full-system emulation
  A SPARC IU/FPU core is used as the CPU in each node; the rest (L1, L2, etc.) is implemented with 8 FPGAs
  Nodes are connected through Futurebus+
54/36
FPGA Initiatives for Evaluation
Other cache emulators
– RACFCS (1997)
  Reconfigurable Address Collector and Flying Cache Simulator from Yonsei Univ. in Korea
  Plugged into the Intel486 bus
  – Passively collects addresses
– HACS (2002)
  Hardware Accelerated Cache Simulator from Brigham Young Univ.
  Plugged into the FSB of a Pentium-Pro-based system
– ACE (2006)
  Active Cache Emulator from Intel Corp.
  Plugged into the FSB of a Pentium-III-based system
55/36
Background (Cont.)
Example
56/36
Hybrid System Emulation: Experiment Setup (Cont.)
[Photo: Intel server system with the Pentium-III, the FPGA board, a logic analyzer, and a host PC connected via UART]
57/36
Experimental Setup (Cont.)
[Photo: the FPGA board, showing the Xilinx Virtex-II FPGA, the FSB interface, logic analyzer ports, and LEDs]
58/36
FSB Protocol
Snoop stall
[Waveform: FSB pipeline stages (request1, request2, error1, error2, snoop, response, data) with ADS#, address A[35:3]#, HIT# and HITM#. The snooping agent inserts snoop stalls, delaying the snoop phase before a new transaction proceeds.]
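The seven pipeline stages and the snoop stall can be summarized in a toy timing model. Only the stage names come from the talk; the one-cycle-per-stage figure is an assumption for illustration:

```python
# Toy timing model of the 7-stage pipelined FSB: one cycle per stage
# (assumed), plus however many cycles the snooping agent stalls the
# snoop phase. This is how the L3 emulator lengthens a transaction.

STAGES = ["request1", "request2", "error1", "error2",
          "snoop", "response", "data"]

def transaction_cycles(snoop_stalls=0):
    """Total FSB cycles for one transaction under the toy model."""
    return len(STAGES) + snoop_stalls

assert transaction_cycles() == 7
assert transaction_cycles(snoop_stalls=4) == 11   # stretched snoop phase
```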
59/36
FSB Protocol
Cache-to-cache transfer
[Waveform: FSB pipeline stages (request1, request2, error1, error2, snoop, response, data) with ADS#, A[35:3]#, HIT#, HITM#, TRDY#, DRDY#, DBSY# and D[63:0]#. A snoop hit on a modified line (HITM#) is followed by TRDY# (the memory controller is ready to accept data) and a four-beat data transfer (data0–data3) before a new transaction.]
60/36
Evaluation Methodology

Goal
– Measure the intrinsic delay of coherence traffic and evaluate its efficiency

Shortcomings in a multiprocessor environment
– Nearly impossible to isolate the impact of coherence traffic on system performance
– Even worse, there are non-deterministic factors
  Arbitration delay
  Stalls in the pipelined bus
[Diagram: four processors (Processor 0 through Processor 3, each running MESI) on a shared bus with a memory controller and main memory; a "cache-to-cache transfer" crosses the shared bus between processors]
61/36
Evaluation of Coherence Traffic Efficiency: Run-time Breakdown

Run-time estimation with a 256KB cache in the FPGA

                      Invalidation traffic   Cache-to-cache transfer
Latencies             5 ~ 10 FSB cycles      10 ~ 20 FSB cycles
Estimated run-times   69 ~ 138 seconds       381 ~ 762 seconds

Estimated time = (avg. occurrences / sec) × (avg. total execution time) × (latency of each traffic, in cycles) × (clock period, in sec / cycle)

Note that the execution time increased 171 seconds on average, out of the average total execution time (5635 seconds) of the baseline
Cache-to-cache transfer is responsible for at least a 33-second (171 − 138) increase!

Coherence traffic on the P-III server system is NOT as efficient as main memory access
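The estimation formula can be sketched in code. The inputs below are purely illustrative placeholders (not the measured occurrence rates), assuming a 100 MHz FSB clock; only the 5635-second baseline run time comes from the slide.

```python
def estimated_run_time(occ_per_sec, total_exec_sec,
                       latency_cycles, clock_period_sec):
    """Run time added by one class of coherence traffic:
    (occurrences/sec) x (total execution time) gives the occurrence
    count, and each occurrence costs latency_cycles x clock period."""
    return occ_per_sec * total_exec_sec * latency_cycles * clock_period_sec

# Illustrative inputs only: 25,000 events/sec over the 5635-second
# baseline, 10 FSB cycles each, 100 MHz FSB (10 ns period).
added = estimated_run_time(25_000, 5635, 10, 10e-9)  # about 14.09 seconds
```

Plugging in the measured occurrence rates and the 5 ~ 20 cycle latency bounds is what produces the second row of the table.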
62/36
Conclusion

Proposed a novel method to measure the intrinsic delay of coherence traffic and evaluate its efficiency
– Coherence traffic in the P-III-based Intel server system is not as efficient as expected
  The main reason is that, in MESI, main memory must be updated at the same time upon a cache-to-cache transfer

Opportunities for performance enhancement
– For faster cache-to-cache transfer
  Cache line buffers in the memory controller
  – As long as buffer space is available, the memory controller can take data
  MOESI would help shorten the latency
  – Main memory need not be updated upon a cache-to-cache transfer
– For faster invalidation traffic
  Advancing the snoop phase to an earlier stage
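The MESI vs. MOESI difference can be shown with a toy state model. This follows the textbook protocols, not the exact P-III implementation: when a remote read snoop-hits a dirty (Modified) line, MESI must update main memory during the transfer, while MOESI's Owned state defers the write-back.

```python
def dirty_line_snooped(protocol):
    """Toy model: a remote read snoop-hits a Modified line.
    Returns (new owner state, main memory updated now?)."""
    if protocol == "MESI":
        # No Owned state: the line is downgraded to Shared, so main
        # memory must be updated during the cache-to-cache transfer.
        return "S", True
    if protocol == "MOESI":
        # Owned keeps the dirty data in the cache; the write-back
        # to main memory is deferred until eviction.
        return "O", False
    raise ValueError(f"unknown protocol: {protocol}")
```

The deferred write-back is exactly why MOESI would shorten cache-to-cache transfer latency on this bus.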
63/36
HW/SW Co-Simulation: Motivation

Software simulation
– Pros
  Flexible, observable, easy to implement
– Cons
  Intolerable simulation time

Hardware emulation
– Pros
  Significant speedup
  Concurrent execution
– Cons
  Much less flexible and observable
  Low-level design takes longer to implement and validate
64/36
Communication Details

All FSB signals are mapped to FPGA pins

Encoding software function arguments in the FSB address (Simplescalar example)
– For a 4KB page:
  Set its attribute to write-through mode
  The lower 12 bits of the FSB address bus are free to use
  The high 24 bits are used for TLB translation

[Diagram: Pentium-III (MESI) connected to the Xilinx Virtex-II over the front-side bus (FSB)]
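The encoding scheme above can be sketched as follows. Because the page is write-through, a store to an address in it appears on the FSB, and the FPGA can decode the low 12 (untranslated) bits. The 4-bit opcode / 8-bit argument split is an illustrative assumption, not the actual layout used in the talk.

```python
PAGE_BITS = 12  # 4KB page: the low 12 bits bypass TLB translation

def encode_command(page_base, opcode, arg):
    """Pack a command into the page-offset bits of an address inside
    the write-through page (hypothetical 4-bit/8-bit field split)."""
    assert page_base % (1 << PAGE_BITS) == 0, "must be page-aligned"
    assert 0 <= opcode < 16 and 0 <= arg < 256
    return page_base | (opcode << 8) | arg

def decode_command(fsb_addr):
    """FPGA side: recover (opcode, arg) from the low 12 address bits."""
    offset = fsb_addr & ((1 << PAGE_BITS) - 1)
    return (offset >> 8) & 0xF, offset & 0xFF
```

The high bits of the address still go through normal TLB translation, so the software side only needs the virtual base of the mapped page.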
65/36
HW/SW Co-Simulation: Results Analysis

FSB access is expensive
– ~20 FSB cycles (≈ 160 CPU cycles) for each transfer
– One cache line (32 bytes) must be transferred for a cache-to-cache transfer
– P-III MESI requires updating main memory upon a cache-to-cache transfer

The "mem_access_latency" function is too simple
– Even software simulation takes at most a few dozen CPU cycles

Device driver overhead
– System overhead due to the device driver
– It requires one TLB entry, which would otherwise be available to the simulation

Time-consuming software routines and a reasonable FPGA access frequency are needed to benefit from a hardware implementation
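The break-even point implied by these numbers can be sketched with a back-of-the-envelope check. `worth_offloading` is a hypothetical helper; the 160-cycle constant is the ~20 FSB cycles at the roughly 8:1 CPU:FSB clock ratio quoted above.

```python
FSB_TRANSFER_CPU_CYCLES = 160  # ~20 FSB cycles at an ~8:1 CPU:FSB ratio

def worth_offloading(sw_cycles, hw_cycles, transfers_per_call=1):
    """Toy break-even test: offloading a simulator routine to the FPGA
    pays off only when its software cost exceeds the hardware cost
    plus the FSB communication overhead per call."""
    return sw_cycles > hw_cycles + transfers_per_call * FSB_TRANSFER_CPU_CYCLES

# A few-dozen-cycle routine like mem_access_latency loses to the bus
# overhead; a multi-thousand-cycle block would win.
```

This is why the slide concludes that only time-consuming software routines, accessed at a reasonable frequency, benefit from hardware implementation.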
66/36
Conclusions

Proposed a new co-simulation methodology

Preliminary co-simulation using Simplescalar proves the correctness of the methodology
– Hardware/software implementation
– Communication between P-III and FPGA via FSB
– Linux driver

Co-simulation results indicate
– Bus access (FSB) is expensive
– Linux driver overhead also needs to be overcome
– Time-consuming blocks need to be emulated

Multi-core co-simulation would benefit from FPGA
– Implement distributed low-level caches and the interconnection network, which would be complex enough to benefit from hardware modeling