Date post: | 14-Dec-2015 |
Upload: | lexus-fewell |
Outline
• The Powerful and the Fallen
• The Mutualists
• The Just Passing
• The Olympic Sprinters
• The Threads’ Commune
• Breaking the Despotic Rule of the Lock
The Powerful and The Fallen
Multiple Issue Architectures: Increase your IPC / Take advantage of ILP
| Common Name | Issue Structure | Hazard Detection | Scheduling | Distinguishing Characteristics | Examples |
| Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III |
| Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2 |
| Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4 |
| VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860 |
| EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium |
• Register Renaming
• Tomasulo Algorithm
• Reorder Buffer
• Scoreboarding
The Powerful and The Fallen
Register Renaming
Reorder Buffer
Scoreboarding
• Based on the CDC 6000 series architecture (CDC 6600)
• Important feature: the scoreboard
• Hazard checks by stage: issue checks WAW; decode checks RAW; execute and write results check WAR
Tomasulo Algorithm
• Implemented in the IBM 360/91’s floating-point unit
• Important feature: reservation stations and the common data bus (CDB)
• Issue: tag the operands if they are not available, copy them if they are
• Execute: stall on RAW hazards while monitoring the CDB
• Write results: send results to the CDB and dump the store buffer contents
• Exception handling: no instructions can be issued until a branch is resolved
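As an illustration of the register renaming technique named on this slide, here is a minimal sketch. It is not taken from the slides: the `rename` function, the `(dest, src1, src2)` instruction format, and the `rN`/`pN` register names are assumptions made for the example. Every write gets a fresh physical register, so WAW and WAR hazards disappear while true RAW dependences survive.

```python
from itertools import count


def rename(instructions, num_arch_regs=8):
    """Rename destinations so that every write targets a fresh physical register.

    Each instruction is a (dest, src1, src2) tuple over architectural
    names such as "r1". The result uses physical names such as "p9".
    """
    # Hypothetical starting state: architectural rN maps to physical pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)  # number of the next unused physical register
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources use the current mapping, which preserves RAW dependences.
        s1, s2 = mapping[src1], mapping[src2]
        # A new destination register removes WAW and WAR hazards.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # RAW on r1
    ("r1", "r6", "r7"),  # WAW with the first, WAR with the second
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the third instruction writes a different physical register than the first, so it no longer has to wait for either earlier instruction.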
The Powerful and The Fallen
Power5
Dual-core, two-way SMT IBM PowerPC superscalar architecture.
Picture courtesy of IBM from “Power5 System Microarchitecture”
The Powerful and The Fallen
Intel Xeon Out-of-Order Engine Pipeline
Picture courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”
The Mutualists
Vector Processing
• Supercomputers of the past
• SIMD type of design
• Elements of the data stream are worked on by a single type of instruction
• Simplifies hardware design
• Moving toward more “general” purpose vector processing
The Mutualists
The Cell Broadband Engine
Created by STI (Sony, Toshiba, IBM). Composed of nine computing elements.
PPE
• The brain of the system
• Organizer
• Runs Linux
• PowerPC dual-issue architecture
SPE
• A modified vector architecture
• Limited memory: 256 KiB
• All accesses are to and from this local memory
• Main memory accesses are DMA transfers
[Figure: Cell block diagram showing the PPE (with its PPSS), the SPEs with their MFCs, the memory interface, Flex IO, and the BEI]
MFC
• Each SPE has an MFC unit
• Issues and receives DMA transfers to and from main memory
• Gatekeeper of the bus
Interconnect bus (EIB)
• Four rings
• Has QoS in a limited fashion (RAM)
Coherency and consistency are maintained between all memory units (the MFCs, main memory and the PPE caches), but not across the local memories of the SPEs.
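The SPE memory model above (a small local store, with main memory reachable only through DMA) can be sketched as follows. This is an illustrative model in plain Python, not Cell SDK code: `dma_get`, `dma_put`, `spe_scale`, and the 64 KiB buffer budget are hypothetical names and values chosen for the example. Only the 256 KiB local store size comes from the slide.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local store size quoted on the slide


def dma_get(main_memory, start, count):
    """Stand-in for an MFC DMA transfer from main memory to the local store."""
    return main_memory[start:start + count]


def dma_put(main_memory, start, block):
    """Stand-in for an MFC DMA transfer from the local store to main memory."""
    main_memory[start:start + len(block)] = block


def spe_scale(main_memory, factor, element_bytes=4, buffer_bytes=64 * 1024):
    """Scale every element, touching main memory only through DMA-sized chunks."""
    # Assumed budget: the buffer must leave room for code and stack.
    assert buffer_bytes < LOCAL_STORE_BYTES
    chunk = buffer_bytes // element_bytes  # elements that fit in one buffer
    transfers = 0
    for start in range(0, len(main_memory), chunk):
        local = dma_get(main_memory, start, chunk)  # main memory -> local store
        local = [x * factor for x in local]         # compute on local data only
        dma_put(main_memory, start, local)          # local store -> main memory
        transfers += 1
    return transfers


data = list(range(100_000))
transfers = spe_scale(data, 2)
print(transfers, data[:4], data[-1])
```

Real SPE code would typically overlap transfers with computation (double buffering); the sketch keeps one buffer to stay short.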
The Just Passing
Cache
• An “invisible” architecture component
• Not so much in recent years
• PowerPC and other architectures provide instructions to control it
  • dcbf[e], dcbst[e], dcbz[e], icbi[e], isync
• Instructions are available to touch, to zero out, to reserve, or to lock a line in place
• But for some interesting designs look no further than …
The Just Passing
XBOX 360 Xenon Architecture
Picture courtesy of IBM from “XBOX 360 System Architecture”
The Olympic Sprinters
• The Hertz race is over; however …
• Some processors are still at it …
  • Power 6 and 7 running at 4 and 5 GHz
  • Intel Polaris: 3.6 to 6 GHz
• Many hardware redesigns are in order
  • Make pipelines shorter and simpler
  • Get rid of “extra” hardware features
The Olympic Sprinters
13 FO4 versus 23 FO4 pipeline
Power6
Running at frequencies from 4 to 5 GHz
Pictures courtesy of IBM from “Power6 System Microarchitecture”
The Threads’ Commune
• Large shared memory systems are becoming scarce
  • Scalability issues due to synchronization
  • Contention
  • Coherency and consistency
• Novel solutions have emerged
  • Explicit memory hierarchies with very weak memory models
  • Massive multithreading on chip
  • Synchronization in memory
The Threads’ Commune
Cray XMT
• 128 hardware streams
• A stream is 31 64-bit registers, 8 target registers, and a control register
• Three functional units: M, A and C
• 500 MHz
• Full and empty bits per word (2 bits)
• An example of a very high SMT design
The Threads’ Commune
SMT / HT designs
[Figure: issue-slot usage over time for Super Scalar, Coarse MT, Fine MT, and SMT designs]
http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm
The Threads’ Commune
[Figure: Cray MTA2 execution model. Programs running in parallel (serial code plus subproblems A and B, each a loop over iterations i = 1 … n) are broken into concurrent threads of computation. These are mapped onto the 128 hardware streams, some of which remain unused, and the streams feed the instruction ready pool and the pipeline of executing instructions.]
Cray MTA2 picture from John Feo’s “Can programmers and machines ever be friends?”
The Threads’ Commune
Data Race or Race Condition
• “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write”
Synchronization
• The orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races
• Problem: separation of the lock and the guarded data
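A minimal sketch of the "separation of lock and guarded data" problem, using Python's standard `threading` module. The `run_counter` function and its parameters are made up for the example. Note that nothing ties `lock` to `shared`: the data is protected only because every writer remembers to take the lock first.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads under a lock."""
    shared = {"counter": 0}   # the guarded data
    lock = threading.Lock()   # the lock is a separate object from the data

    def worker():
        for _ in range(increments):
            # Without the lock, this read-modify-write would be a data race:
            # two threads could read the same value and lose an update.
            with lock:
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]


total = run_counter()
print(total)
```

Because the association between lock and data exists only by convention, a single code path that forgets `with lock:` reintroduces the race.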
The Threads’ Commune
Coherency and Consistency
• Caching elements and making sure that everyone sees the latest copy
• If an element is written by processor A, how will processors B and C know that they have the latest copy?
• A very difficult problem!
• One of the scalability problems of shared memory
The Threads’ Commune
How does the Cray XMT solve these problems?
• For synchronization: join the lock with each data word and put the synchronization requirement on the memory instead of the processor
• For coherence and consistency: DO NOT cache remote data (outside the local 8 GiB)
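The idea of joining the lock to each data word can be modeled in software. The sketch below is an assumption-laden emulation, not Cray code: the `FullEmptyWord` class and its `write_ef` / `read_fe` methods are hypothetical names, and a condition variable stands in for what the XMT memory system does in hardware with the full/empty bit.

```python
import threading


class FullEmptyWord:
    """A data word paired with a full/empty bit."""

    def __init__(self):
        self._value = None
        self._full = False                  # the full/empty bit
        self._cond = threading.Condition()  # stands in for hardware retry

    def write_ef(self, value):
        """Wait until the word is empty, write it, and mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, read it, and mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


def producer_consumer(items):
    """Hand items from one thread to another through a single word."""
    word = FullEmptyWord()
    received = []

    def consume():
        for _ in items:
            received.append(word.read_fe())

    consumer = threading.Thread(target=consume)
    consumer.start()
    for item in items:
        word.write_ef(item)
    consumer.join()
    return received


received = producer_consumer([1, 2, 3, 4])
print(received)
```

The producer and consumer never name a separate lock: the synchronization state travels with the word itself, which is the point the slide makes.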
Breaking the Despotic Rule of the Lock
Synchronization
• Atomicity and serializability
• Locks and barriers
  • Cost ranges from hundreds to tens of thousands of cycles, and grows linearly (in the best cases) or polynomially (in the worst cases) with the number of processors
• The lock
  • The most used synchronization primitive!
  • Alternatives: lock-free data structures
Breaking the Despotic Rule of the Lock
Lock-Free Data Structures
• Used to implement non-blocking and/or wait-free algorithms
• Prevent deadlocks, livelocks and priority inversions
• Potential problems: the ABA problem
  • The check tells us that no one is working on this now, but not whether someone has done so before
Transactional Memory
• Based on transactions (an atomic bundle of operations)
• If two transactions conflict then one is bound to fail
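The ABA problem mentioned on this slide can be replayed deterministically. The sketch below is a single-threaded simulation with a hand-written interleaving; `Node`, `Stack`, `cas_top`, and `aba_demo` are hypothetical names for the example, and `cas_top` merely models a compare-and-swap that real hardware would perform atomically.

```python
class Node:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


class Stack:
    def __init__(self, top=None):
        self.top = top

    def cas_top(self, expected, new):
        """Model of compare-and-swap on the top pointer."""
        if self.top is expected:
            self.top = new
            return True
        return False


def aba_demo():
    c = Node("C")
    b = Node("B", c)
    a = Node("A", b)
    stack = Stack(a)  # stack: A -> B -> C

    # Thread 1 begins a pop: it reads top (A) and A.next (B), then is preempted.
    seen_top, seen_next = stack.top, stack.top.next

    # Thread 2 runs to completion in the meantime.
    stack.cas_top(a, b)  # pop A   (stack: B -> C)
    stack.cas_top(b, c)  # pop B   (stack: C)
    a.next = c
    stack.cas_top(c, a)  # push A  (stack: A -> C)

    # Thread 1 resumes. Top is A again, so its CAS succeeds, even though
    # the stack changed underneath it and B was already removed.
    succeeded = stack.cas_top(seen_top, seen_next)
    return succeeded, stack.top.name


result = aba_demo()
print(result)
```

The compare sees the same pointer value and concludes nothing happened, leaving the already-popped node B at the top of the stack. Common mitigations include version tags on the pointer or LL/SC-style reservations.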
Side Note
A Review of LL and SC
• Instructions found in PowerPC and many other architectures
• Provide a way to optimistically execute a piece of code
• In case a “violation” has taken place, discard your results
• Many implementations
  • PowerPC: lwarx and stwcx
Side Note
The LL and SC behavior
The lwarx instruction
• Loads a word from a word-aligned location
• Side effects:
  • A reservation is created
  • The storage coherence mechanism is notified that a reservation exists
The stwcx instruction
• Conditionally stores a word to a given memory location
• “Conditionally” depends on the reservation
  • On success, all changes are committed to memory
  • If not, changes are discarded
Side Note
Reservations
• At most one per processor
• A reservation is lost when
  • The processor holding the reservation executes
    • A lwarx or ldarx
    • A stwcx or stdcx (no matter whether the reservation matches or not)
  • Another processor executes
    • A store or a dcbz to the granule
  • Some other mechanism modifies a storage location in the same reservation granule
• Interrupts do not clear reservations
  • But interrupt handlers might
• Granularity
  • The length of the memory block kept under surveillance
Side Note
Examples
Two processors each run a retry loop on the same word a: LL a, compute a new value, SC a, and branch back (brnz) if the SC failed. The first processor computes a *= 100; the second computes a += 100. The storage mechanism tracks one reservation per processor.
1. Both processors execute LL a (a = ?). Both hold a reservation on a. Memory: a = ?
2. A store a = 100 reaches memory. Both reservations are lost (X). Memory: a = 100
3. Both reservations are still lost (X), so neither SC can succeed and both processors branch back. Memory: a = 100
4. The second processor executes LL a again and reads 100. It holds a reservation on a; the first processor does not (X). Memory: a = 100
5. The first processor executes LL a again and reads 100. Both hold a reservation on a. Memory: a = 100
6. The second processor's SC succeeds. The first processor's reservation is lost (X). Memory: a = 200
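The example frames above can be replayed with a small software model of reservations. This is a sketch under stated assumptions: `ReservationMemory`, its `ll` / `sc` / `store` methods, the processor names "P1" and "P2", and the initial value 0 (the slides show "?") are all hypothetical. The final retry by P1 goes one step past the last frame.

```python
class ReservationMemory:
    """One word of memory plus at most one reservation per processor."""

    def __init__(self, value):
        self.value = value
        self.reserved = set()  # processors currently holding a reservation

    def ll(self, cpu):
        """Load-linked: read the word and create a reservation."""
        self.reserved.add(cpu)
        return self.value

    def sc(self, cpu, new_value):
        """Store-conditional: store only if the reservation still exists."""
        if cpu not in self.reserved:
            return False
        self.store(new_value)
        return True

    def store(self, new_value):
        """Any store to the granule clears every reservation on it."""
        self.value = new_value
        self.reserved.clear()


def replay():
    mem = ReservationMemory(0)  # assumed initial value; the slides show "?"
    log = []

    v1 = mem.ll("P1")                   # frame 1: both processors LL
    v2 = mem.ll("P2")
    mem.store(100)                      # frame 2: a = 100 kills both reservations
    log.append(mem.sc("P1", v1 * 100))  # frame 3: both SCs fail
    log.append(mem.sc("P2", v2 + 100))
    v2 = mem.ll("P2")                   # frame 4: P2 retries and reads 100
    v1 = mem.ll("P1")                   # frame 5: P1 retries and reads 100
    log.append(mem.sc("P2", v2 + 100))  # frame 6: P2 succeeds, a = 200
    log.append(mem.sc("P1", v1 * 100))  # P1 fails: its reservation was cleared
    v1 = mem.ll("P1")                   # beyond the slides: P1 retries once more
    log.append(mem.sc("P1", v1 * 100))
    return log, mem.value


log, final_value = replay()
print(log, final_value)
```

Each failed `sc` corresponds to a taken `brnz` in the slides: the processor discards its result and loops back to the `ll`.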
Breaking the Despotic Rule of the Lock
Sun Rock Processor
• Execute Ahead
• Scouting Threads
• Simultaneous Multithreading
• Transactional Memory
• Checkpoint
• Cache memory with extra bits for tracking speculative execution
• 32 logical threads and 16 physical cores
Pictures courtesy of “Rock: A SPARC CMT Processor”
Breaking the Despotic Rule of the Lock
Take a “RISC”-y Approach
• Small transaction HW
• Best effort
• Use the checkpoint mechanism!
• Transactions == software construct
• Checkpoint in case of failure
• Commit on successful transaction
• Executed speculatively by a strand
• Uses the cache and store buffers, and locks cache lines until commit (tracking lines with the “s-bits”)
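The checkpoint, speculate, and commit cycle described above can be sketched in software. This is not Rock's mechanism: `VersionedStore`, `Transaction`, and `conflict_demo` are hypothetical names, and a single store-wide version number replaces the per-line "s-bit" tracking for brevity. It does show the two outcomes named on the slides: commit on success, and discard when two transactions conflict.

```python
class VersionedStore:
    """Shared memory with one version number for the whole store."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = 0


class Transaction:
    def __init__(self, store):
        self.store = store
        self.start_version = store.version  # checkpoint taken at the start
        self.writes = {}                    # speculative writes, buffered

    def read(self, key):
        # A transaction sees its own buffered writes first.
        return self.writes.get(key, self.store.data[key])

    def write(self, key, value):
        self.writes[key] = value            # nothing reaches memory yet

    def commit(self):
        if self.store.version != self.start_version:
            self.writes.clear()             # conflict: discard speculative state
            return False
        self.store.data.update(self.writes)  # success: publish all writes
        self.store.version += 1
        return True


def conflict_demo():
    store = VersionedStore({"x": 10, "y": 0})
    t1 = Transaction(store)
    t2 = Transaction(store)
    t1.write("x", t1.read("x") - 5)  # t1 moves 5 from x to y
    t1.write("y", t1.read("y") + 5)
    t2.write("x", t2.read("x") * 2)  # t2 doubles x, conflicting with t1
    first = t1.commit()
    second = t2.commit()
    return first, second, store.data


outcome = conflict_demo()
print(outcome)
```

In a best-effort design the failed transaction would be retried, or would fall back to a software path such as a lock.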
Multi-core Trends in this Decade
[Figure: timeline from 2001 to 2011 of multi-core chips, grouped by vendor]
IBM
• Power 4: 64-bit PowerPC, 2 cores
• Power5: 64-bit PowerPC, 2 cores with SMT
• Power 6: 64-bit PowerPC, 2 cores with SMT
• Power7
• Xenon: 64-bit PowerPC, 3-core chip
• CBE: PowerPC, 9-core chip
Intel
• Pentium D: IA32 x86, 2-core chip
• Xeon Dual Core: IA32 x86, 2-core chip
• Intel Core Duo: IA32 x86, dual-core chip
• Intel Core 2 Duo: IA32 x86, 2-core chip
• Intel Core 2 (codenames Penryn, Wolfdale): IA32 x86, dual- and quad-core chip
• Codename Nehalem: 1- to 8-core chip
• Codename Sandy Bridge
AMD
• AMD Opteron (code name Denmark): IA32 x86, 2-core chip
• AMD Turion64 X2: IA32 x86, dual-core chip
• AMD code name Barcelona: IA32 x86, native 4-core chip
SUN
• UltraSparc T1 (codename Niagara): 8-core processor, 32 logical threads
• UltraSparc T2 (codename Niagara): 8-core processor, 64 logical threads
• Codename Rock: 16-core processor, 32 logical threads
Sources
The Powerful and the Fallen
• Sinharoy, B. et al., “Power5 System Microarchitecture”, IBM Journal of Research and Development, Vol. 49, June/September 2005
• Marr, D. et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Vol. 6, Issue 1, 2002
The Mutualists
The Just Passing
• Andrews, Jeff and Baker, Nick, “XBOX 360 System Architecture”, IEEE Micro, Vol. 26, Issue 2, March 2006
The Olympic Sprinters
• Le, H. Q. et al., “Power6 System Microarchitecture”, IBM Journal of Research and Development, Vol. 51, November 2007
The Threads’ Commune
• Konecny, P., “Introducing the Cray XMT”, May 5th, 2007
• Feo, J., “Can programmers and machines ever be friends?”
Breaking the Despotic Rule of the Lock
• Chaudhry, S., “Rock: A SPARC CMT Processor”, August 26, 2008