© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads
S.-Y. Lee, A. Arunkumar, and C.-J. Wu
ISCA 2015
(2)
Goal
• Reduce warp-level divergence (disparity in warp execution times) and hence increase throughput
• The key is the identification of critical (lagging) warps
• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence
(3)
Review: Resource Limits on Occupancy
[Block diagram: Kernel Distributor feeding an SM Scheduler and multiple SMs above DRAM. Within an SM: Thread Block Control (TB 0), Warp Schedulers, Warp Contexts, Register File, SPs, and L1/Shared Memory. The register file limits the #threads; warp contexts and L1/shared memory limit the #thread blocks; locality effects also apply. SM = Streaming Multiprocessor, SP = Streaming Processor]
(4)
Evolution of Warps in TB
• Coupled lifetimes of warps in a TB
- Start at the same time
- Synchronization barriers
- Kernel exit (implicit synchronization barrier)
[Figure: warp execution timelines within a TB; completed warps leave a region where latency hiding is less effective]
Figure from P. Xiang et al., "Warp Level Divergence: Characterization, Impact, and Mitigation"
(5)
Warp Execution Time Disparity
• Sources of execution time disparity: branch divergence, interference in the memory system, scheduling policy, and workload imbalance
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(6)
Workload Imbalance
[Figure: two warps in a TB, one executing 10 instructions and the other 10,000 instructions]
• Imbalance exists even without control divergence
(7)
Branch Divergence
• Warp-to-warp variation in dynamic instruction count (without branch divergence)
• Intra-warp branch divergence
• Example: traversal over constant node-degree graphs
(8)
Branch Divergence
Extent of serialization over a CFG traversal varies across warps
(9)
Impact of the Memory System
[Figure: executing warps leave intra-wavefront footprints in the cache]
• Could have been re-used
• Too slow!
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
Scheduling gaps are greater than the re-use distance
(10)
Impact of Warp Scheduler
• Amplifies the critical warp effect
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(11)
Criticality Predictor Logic
[Figure: a branch whose two paths execute m and n instructions]
• Non-divergent branches can generate large differences in dynamic instruction counts across warps (e.g., m >> n)
• Update the CPL counter on a branch: estimate the dynamic instruction count
• Update the CPL counter on instruction commit
(12)
Warp Criticality Problem
[GPU block diagram: host CPU connected over an interconnection bus to the GPU; a kernel management unit with HW work queues and pending kernels; kernel distributor entries (PC, Dim, Param, ExeBL); the SMX scheduler and control registers; multiple SMXs, each with cores, registers, L1 cache/shared memory, warp schedulers, and warp contexts; memory controller, L2 cache, DRAM]
• Available registers: spatial underutilization
• Registers allocated to completed warps in the TB: temporal underutilization
• The last warp in a TB holds its resources; completed (idle) warp contexts sit unused
• Manage resources and schedules around critical warps to attack this temporal & spatial underutilization
(13)
CPL Calculation
[Figure: a branch whose two paths execute m and n instructions]

nCriticality = nInstr × w.CPIavg + nStall

where nInstr estimates the dynamic instruction disparity between warps, w.CPIavg is the warp's average CPI, and nStall accumulates inter-instruction memory stall cycles.
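To make the formula concrete, here is a minimal host-side sketch (CUDA/C++ style) of per-warp criticality bookkeeping. The field names, the EWMA weight for the average CPI, and the exact update points are illustrative assumptions, not the paper's hardware design.

struct WarpCPL {
    long  nInstr = 0;    // estimated dynamic-instruction disparity vs. other warps
    long  nStall = 0;    // accumulated inter-instruction memory stall cycles
    float cpiAvg = 1.0f; // running average CPI for this warp (w.CPIavg)

    // On a branch: adjust the instruction-count estimate by the predicted
    // extra instructions on the longer path (m - n in the slide's figure).
    void onBranch(long extraInstrEstimate) { nInstr += extraInstrEstimate; }

    // On instruction commit: one predicted instruction has now actually
    // committed; refine the average CPI (EWMA weight 0.1 is an assumption).
    void onCommit(long cyclesForInstr) {
        if (nInstr > 0) nInstr -= 1;
        cpiAvg = 0.9f * cpiAvg + 0.1f * (float)cyclesForInstr;
    }

    void onMemStall(long cycles) { nStall += cycles; }

    // nCriticality = nInstr * w.CPIavg + nStall
    float criticality() const { return (float)nInstr * cpiAvg + (float)nStall; }
};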
(14)
Scheduling Policy
• Select a warp based on criticality
• Execute until no more instructions are available
- A form of GTO
• Critical warps get higher priority and a larger time slice (see the sketch below)
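A minimal sketch of this selection rule, reusing the WarpCPL structure from the previous slide; the ready-mask representation is an assumption.

#include <vector>

// Pick the most critical ready warp; keep issuing from it until it stalls
// or finishes (the greedy part of this GTO-like policy).
int selectWarp(const std::vector<WarpCPL>& warps, const std::vector<bool>& ready) {
    int best = -1;
    for (int w = 0; w < (int)warps.size(); ++w) {
        if (!ready[w]) continue;  // skip stalled/finished warps
        if (best < 0 || warps[w].criticality() > warps[best].criticality())
            best = w;
    }
    return best;  // -1 if nothing is ready this cycle
}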
(15)
Behavior of Critical Warp References
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(16)
Criticality Aware Cache Prioritization
• Prediction: critical warp
• Prediction: re-reference interval
• Both used to manage the cache footprint
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
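One plausible rendering of the prioritization in code, assuming a simple protected-bit scheme; the threshold, metadata fields, and victim policy are illustrative, not the exact CACP design.

// A cache line is protected only if it was filled by a critical warp and is
// predicted to be re-referenced; replacement prefers unprotected lines.
struct LineMeta { bool isProtected; };

LineMeta onCacheFill(float warpCriticality, float criticalThreshold,
                     bool predictedReuse) {
    LineMeta meta;
    meta.isProtected = (warpCriticality >= criticalThreshold) && predictedReuse;
    return meta;
}

int pickVictim(const LineMeta* set, int ways) {
    for (int w = 0; w < ways; ++w)
        if (!set[w].isProtected) return w;  // evict non-critical data first
    return 0;  // all protected: fall back (a real design would use LRU, etc.)
}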
(17)
Integration into GPU Microarchitecture
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
[Figure annotations: criticality prediction and cache management blocks]
(18)
Performance
[Performance chart annotations: periodic computation of accuracy for critical warps; due in part to low miss rates]
(19)
Summary
• Warp divergence leads to some lagging warps → critical warps
• Expose the performance impact of critical warps → throughput reduction
• Coordinate scheduler and cache management to reduce warp divergence
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Cache-Conscious Wavefront Scheduling
T. Rogers, M. O'Connor, and T. Aamodt
MICRO 2012
(21)
Goal
• Understand the relationship between schedulers (warp/wavefront) and locality behaviors
- Distinguish between inter-wavefront and intra-wavefront locality
• Design a scheduler to match #scheduled wavefronts with the L1 cache size
- Working set of the wavefronts fits in the cache
- Emphasis on intra-wavefront locality
(22)
Reference Locality
[Figure: locality taxonomy: intra-thread and inter-thread locality map to intra-wavefront and inter-wavefront locality; intra-wavefront locality offers the greatest gain]
• Scheduler decisions can affect locality
• Need schedulers that are not oblivious to locality
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(23)
Key Idea
[Figure: executing wavefronts with intra-wavefront footprints in the cache; ready wavefronts waiting on the issue decision]
• Impact of issuing new wavefronts on the intra-warp locality of executing wavefronts
- Footprint in the cache
- When will a new wavefront cause thrashing?
(24)
Concurrency vs. Cache Behavior
[Chart: round-robin scheduler performance vs. number of wavefronts, peaking below maximum concurrency]
• Adding more wavefronts to hide latency is traded off against creating more long-latency memory references due to thrashing
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(25)
Additions to the GPU Microarchitecture
[SIMT pipeline diagram: I-Fetch, Decode, I-Buffer (pending warps), Issue, RF/PRF, scalar pipelines, D-Cache (data, all hit?), Writeback]
• Control the issue of wavefronts
• Keep track of locality behavior on a per-wavefront basis
• Feedback from the cache to the issue stage
(26)
Intra-Wavefront Locality
• Scalar thread traverses edges
• Edges stored in successive memory locations
• One thread vs. many
• Intra-thread locality leads to intra-wavefront locality
• The right #wavefronts is determined by the reference footprint and the cache size
• The scheduler attempts to find the right #wavefronts
(27)
Static Wavefront Limiting
[SIMT pipeline diagram: I-Fetch, Decode, I-Buffer (pending warps), Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]
• Limit the #wavefronts based on the working set and cache size
• Based on profiling?
• Current schemes allocate #wavefronts based on resource consumption, not effectiveness of utilization
• Seek to shorten the re-reference interval
(28)
Cache Conscious Wavefront Scheduling
• Basic steps
- Keep track of the lost locality of a wavefront
- Control the number of wavefronts that can be issued
• Approach
- List of wavefronts sorted by lost-locality score
- Stop issuing the wavefronts with the least lost locality
[SIMT pipeline diagram: I-Fetch, Decode, I-Buffer (pending warps), Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]
(29)
Keeping Track of Locality
[SIMT pipeline diagram with a locality-tracking structure: indexed by WID, it keeps track of the associated wavefronts; victims indicate "lost locality" and update the lost-locality score]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(30)
Updating Estimates of Locality
[SIMT pipeline diagram; wavefronts are sorted by lost-locality score, a wavefront's score is updated on a victim hit, and the set of wavefronts allowed to issue changes with the #wavefronts]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(31)
Estimating the Feedback Gain
• Drop the scores when there are no VTA hits
• A VTA hit increases the score
• Wavefronts beyond the cumulative cutoff are prevented from issuing

LLDS = (VTAHitsTotal / InstIssuedTotal) × KThrottle

The cumulative lost-locality score cutoff (CumLLSCutoff) scales the gain based on the percentage of the cutoff.
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
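A sketch of the scoring-and-throttling loop implied by the last few slides. The structure names and the keep-at-least-one-wavefront guard are assumptions; the gain follows the slide's LLDS expression.

#include <algorithm>
#include <vector>

struct WF { int id; float lls; };  // lls = lost-locality score

// On a victim tag array (VTA) hit, raise this wavefront's score using the
// slide's gain: LLDS = (VTAHitsTotal / InstIssuedTotal) * KThrottle.
void onVtaHit(WF& wf, long vtaHitsTotal, long instIssuedTotal, float kThrottle) {
    wf.lls += ((float)vtaHitsTotal / (float)instIssuedTotal) * kThrottle;
}

// Wavefronts losing the most locality keep cache exclusivity; stop issuing
// the tail (least lost locality) once the cumulative score passes the cutoff.
std::vector<int> allowedToIssue(std::vector<WF> wfs, float cumLLSCutoff) {
    std::sort(wfs.begin(), wfs.end(),
              [](const WF& a, const WF& b) { return a.lls > b.lls; });
    std::vector<int> allowed;
    float cum = 0.0f;
    for (const WF& w : wfs) {
        cum += w.lls;
        if (cum > cumLLSCutoff && !allowed.empty()) break;  // throttle the rest
        allowed.push_back(w.id);
    }
    return allowed;
}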
(32)
Schedulers
• Loose Round-Robin (LRR)
• Greedy-Then-Oldest (GTO): execute one wavefront until it stalls, then pick the oldest (see the sketch below)
• 2LVL-GTO: two-level scheduler with GTO instead of RR
• Best SWL
• CCWS
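For reference, a minimal sketch of the GTO rule just described; the state layout is illustrative.

#include <vector>

struct WfState { int age; bool ready; };  // smaller age = older wavefront

// Greedy: stick with the current wavefront while it can issue; otherwise
// fall back to the oldest ready wavefront.
int gtoPick(const std::vector<WfState>& wfs, int current) {
    if (current >= 0 && wfs[current].ready) return current;
    int oldest = -1;
    for (int i = 0; i < (int)wfs.size(); ++i)
        if (wfs[i].ready && (oldest < 0 || wfs[i].age < wfs[oldest].age))
            oldest = i;
    return oldest;  // -1 if nothing is ready
}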
(33)
Performance
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(34)
Some General Observations
• Performance is largely determined by
- Emphasis on the oldest wavefronts
- Distribution of references across cache lines, i.e., the extent of (lack of) coalescing
• GTO works well for this reason → it prioritizes older wavefronts
• LRR touches too much data (too little reuse) to fit in the cache
(35)
Locality Behaviors
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Cache-Conscious Wavefront Scheduling," MICRO 2012
(36)
Summary
• Dynamic tracking of relationship between wavefronts and working sets in the cache
• Modify scheduling decisions to minimize interference in the cache
• Tunables: need profile information to create stable operation of the feedback control
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Divergence-Aware Warp Scheduling
T. Rogers, M. O'Connor, and T. Aamodt
MICRO 2013
(38)
Goal
• Understand the relationship between schedulers (warp/wavefront) and control & memory locality behaviors
- Distinguish between inter-wavefront and intra-wavefront locality
• Design a scheduler to match #scheduled wavefronts with the L1 cache size
- Working set of the wavefronts fits in the cache
- Emphasis on intra-wavefront locality
- Couple the effects of control flow divergence
• Differs from CCWS in being proactive
- Deeper look at what happens inside loops
(39)
Key Idea
• Manage the relationship between control divergence, memory divergence and scheduling
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(40)
Key Idea (2)
• Coupling control divergence and memory divergence
• The former indicates reduced cache capacity
• "Learning" eviction patterns
(41)
Key Idea (2)
while (i < C[tid+1])
[Figure: warp 0 and warp 1 executing the loop; a divergent branch in warp 0 fills the cache with 4 references and delays warp 1; intra-thread locality within each warp's references]
• Available room in the cache due to divergence in warp 0 → schedule warp 1
• Use warp 0's behavior to predict interference due to warp 1
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(42)
Goal: Simpler Portable Version
• Each thread computes a row
• Structured similarly to a multithreaded CPU version
• Desirable (see the sketch below)
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
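A minimal CSR SpMV kernel in this style, assuming the standard rowPtr/colIdx/val representation; this is the textbook scalar-CSR form, not necessarily the paper's exact code.

// One thread per row, as in a multithreaded CPU version.
__global__ void spmv_scalar(int numRows, const int* rowPtr, const int* colIdx,
                            const float* val, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        // Each thread walks its own row; threads in a warp touch different
        // rows, so accesses are strided (a source of memory divergence).
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];
        y[row] = sum;
    }
}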
(43)
Goal: GPU-Optimized Version
• Each warp handles a row
• Renumber threads modulo warp size
• Row element index for each thread
• Warps traverse the whole row
• Each thread handles a few row (column) elements
• Need to sum the partial sums computed by each thread in a row (see the sketch below)
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
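The corresponding vector-CSR sketch: one warp per row, lanes stride across the row, and a warp-shuffle reduction combines the partial sums. Again the textbook form (assuming a 32-thread warp and blockDim.x a multiple of 32), not the paper's exact code.

// One warp per row; lane indices are thread IDs renumbered modulo warp size.
__global__ void spmv_vector(int numRows, const int* rowPtr, const int* colIdx,
                            const float* val, const float* x, float* y) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x & 31;
    if (warpId < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
            sum += val[j] * x[colIdx[j]];          // coalesced within the warp
        // Warp-level reduction of the partial sums (one value per lane).
        for (int off = 16; off > 0; off >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, off);
        if (lane == 0) y[warpId] = sum;
    }
}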
(44)
Optimizing the Vector Version
• What do we need to know?
• TB & warp sizes, grid mappings
• Tuning TB sizes as a function of machine size
• Orchestrating the partial product summation
Simplicity (productivity) with no sacrifice in performance
(45)
Goal: Simpler Portable Version vs. GPU-Optimized Version
Make the performance equivalent
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(46)
Additions to the GPU Microarchitecture
• Control the issue of warps
• Implements PDOM for control divergence
[SIMT pipeline diagram: I-Fetch, Decode, I-Buffer/Scoreboard (pending warps), Issue, RF/PRF, scalar pipelines, memory coalescer, D-Cache, Writeback]
(47)
Observation
• Bulk of the accesses in a loop come from a few static load instructions
• Bulk of the locality in (these) applications is intra-loop
[Figure: loops highlighted in kernel code]
(48)
Distribution of Locality
• Bulk of the locality comes from a few static loads in loops
• Find temporal reuse
Hint: can we keep data from the last iteration?
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(49)
A Solution
• Prediction mechanisms for locality across iterations of a loop
• Schedule such that data fetched in one iteration is still present at the next iteration
• Combine with control flow divergence (how much of the footprint needs to be in the cache?)
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(50)
Classification of Dynamic Loads
• Group static loads into equivalence classes → they reference the same cache line
• Identify these groups by a repetition ID
• Prediction for each load by compiler or hardware
[Figure: loads classified by control flow divergence as converged, somewhat diverged, or diverged]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(51)
Coupling Divergence Effects
• Can we improve performance with branch prediction?
- Diverged warps reduce the cache footprint
[Figure: converged loads under control flow divergence]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(52)
Predicting a Warp's Cache Footprint
• Entering the loop body: create a footprint prediction
• Some threads exit the loop (taken/not-taken paths diverge): the predicted footprint drops
• Exiting the loop: reinitialize the prediction
• Predict the locality usage of static loads
- Not all loads increase the footprint
• Combine with control divergence to predict the footprint
• Use the footprint to throttle/not-throttle warp issue
(53)
Predicting a Warp's Active Thread Count
• Predict the cache usage of a thread
• #active threads, i.e., control divergence (taken/not-taken paths)
• Modulate the footprint based on predicted cache usage
- Predictions include control divergence effects
• Classify each static load in a loop based on divergence and repetition ID → profiled DAWS
(54)
Principles of Operation
• A prefix sum of each warp's cache footprint is used to select the warps that can be issued (see the sketch below)

EffCacheSize = kAssocFactor × TotalNumLines

• Scaling back from a fully associative cache
• kAssocFactor is empirically determined
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
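A sketch of that issue test; the per-warp footprints are assumed to arrive in priority order, and all names are illustrative.

#include <vector>

// Warps are visited in priority order; a running (prefix) sum of predicted
// footprints gates issue against EffCacheSize = kAssocFactor * TotalNumLines.
std::vector<int> selectIssuable(const std::vector<int>& footprintLines,
                                float kAssocFactor, int totalNumLines) {
    float effCacheSize = kAssocFactor * (float)totalNumLines;
    std::vector<int> issuable;
    float prefix = 0.0f;
    for (int w = 0; w < (int)footprintLines.size(); ++w) {
        prefix += (float)footprintLines[w];
        if (prefix > effCacheSize) break;  // this warp and the rest are de-scheduled
        issuable.push_back(w);
    }
    return issuable;
}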
(55)
Principles of Operation (2)
• Profile static load instructions
- Are they divergent?
- Loop repetition ID
o Assume all loads with the same base address and an offset within the cache-line access are repeated each iteration
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(56)
Prediction Mechanisms
• Profiled Divergence-Aware Scheduling (DAWS)
- Uses offline profile results to dynamically determine de-scheduling decisions
• Detected Divergence-Aware Scheduling (DAWS)
- Behaviors derived at run time drive de-scheduling decisions
o Loops that exhibit intra-warp locality
o Static loads are characterized as divergent or convergent
(57)
Extensions for DAWS
[SIMT pipeline diagram: I-Fetch, Decode, I-Buffer, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback, with added accumulation logic for the footprint prediction]
(58)
Operation: Tracking
[Figure: a footprint table with one entry per warp issue slot; entries are created/removed at loop begin/end and hold profile-based information and the #active lanes; this is the basis for throttling]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(59)
Operation: Prediction
Sum the contributions of the static loads in this loop:
- Add #active lanes' worth of cache lines for divergent loads
- Add 2 for converged loads
- Count loads in the same equivalence class only once (unless divergent)
• Generally only consider de-scheduling warps in loops
- Since most of the activity is there
• Can be extended to non-loop regions by associating non-loop code with the next loop
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
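The counting rule above as a sketch; the load descriptor and class-ID bookkeeping are assumptions.

#include <set>
#include <vector>

struct StaticLoad { int classId; bool divergent; };

// Predicted footprint (in cache lines) for one warp in one loop, per the
// slide: divergent loads add one line per active lane, converged loads add
// 2 lines, and each equivalence class is counted once unless divergent.
int predictFootprint(const std::vector<StaticLoad>& loopLoads, int activeLanes) {
    int lines = 0;
    std::set<int> countedClasses;
    for (const StaticLoad& ld : loopLoads) {
        if (ld.divergent) {
            lines += activeLanes;                        // one line per active lane
        } else if (countedClasses.insert(ld.classId).second) {
            lines += 2;                                  // converged: class counted once
        }
    }
    return lines;
}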
(60)
Operation: Nested Loops
for (i = 0; i < limitx; i++) {
  ...
  for (j = 0; j < limity; j++) {
    ...
  }
  ...
}
• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction: de-scheduling of warps is determined by inner-loop behaviors!
• Predictions are re-used from the inner-most loops, which is where most of the data re-use is found
• Note the attempt to tie footprints to code segments
(61)
Detected DAWS: Prediction
[Figure: a sampling warp for the loop (>2 active threads) fills PCLoad entries as it enters loads that exhibit locality]
• Detect both memory divergence and intra-loop repetition at run time
• Fill PCLoad entries based on run-time information
• Use profile information to start
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(62)
Detected DAWS: Classification
• Increment or decrement the counter depending on the #memory accesses for a load
• Create equivalence classes of loads (by checking the PCs)
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(63)
Performance
Little to no degradation
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(64)
Performance
[Chart: significant intra-warp locality; SPMV-scalar normalized to the best SPMV-vector]
Figure from T. Rogers, M. O'Connor, and T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(65)
Summary
• If we can characterize warp level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints
• Proactive scheme outperforms reactive management
• Understand interactions between memory divergence and control divergence
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al.
ASPLOS 2013
(67)
Goal
• Understand memory effects of scheduling from deeper within the memory hierarchy
• Minimize idle cycles induced by stalling warps waiting on memory references
(68)
Off-chip Bandwidth is Critical!
[Chart: percentage of total execution cycles wasted waiting for data to come back from DRAM across GPGPU applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, and others); Type-1 applications average 55%, overall average 32%]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(69)
Source of Idle Cycles
• Warps stalled waiting on memory references
- Cache miss
- Service at the memory controller
- Row buffer miss in DRAM
- Latency in the network (not addressed in this paper)
• The last warp effect
• The last CTA effect
• Lack of multiprogrammed execution
- One (small) kernel at a time
(70)
Impact of Idle Cycles
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
High-Level View of a GPU
[Diagram: threads grouped into warps within cooperative thread arrays (CTAs); SIMT cores, each with a scheduler, ALUs, and L1 caches, connected through an interconnect to the L2 cache and DRAM]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(72)
CTA-Assignment Policy (Example)
[Diagram: a multi-threaded CUDA kernel's CTAs (CTA-1 through CTA-4) distributed across two SIMT cores, each with a warp scheduler, ALUs, and L1 caches: CTA-1 and CTA-3 go to SIMT Core-1, CTA-2 and CTA-4 to SIMT Core-2]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(73)
Organizing CTAs Into Groups
• Set the minimum number of warps equal to the #pipeline stages
- Same philosophy as the two-level warp scheduler
• Use the same CTA grouping/numbering across SMs?
[Diagram: SIMT Core-1 (CTA-1, CTA-3) and SIMT Core-2 (CTA-2, CTA-4), each with a warp scheduler, ALUs, and L1 caches]
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(74)
Warp Scheduling Policy
• All launched warps on a SIMT core have equal priority
- Round-robin execution
• Problem: many warps stall at long-latency operations at roughly the same time
[Timeline: all warps compute with equal priority, all send memory requests at about the same time, and the SIMT core stalls]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(75)
Solution
• Form warp-groups (Narasiman, MICRO 2011)
• CTA-aware grouping
• Group switch is round-robin
[Timeline: one warp-group computes while another group's memory requests are in flight, so the SIMT core stalls less and cycles are saved]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(76)
Two Level Round Robin Scheduler
[Diagram: CTAs organized into four groups (Group 0 through Group 3); round-robin (RR) across groups and round-robin among the threads within a group]
• Agnostic to when pending misses are satisfied
(77)
Key Idea
• Executing Warps (EW), Stalled Warps (SW), Cache-Resident Warps (CW): what is the relationship among them?
• When EW ≠ CW, we get interference
• Principle: optimize reuse, i.e., seek to make EW = CW
(78)
Objective 1: Improve Cache Hit Rates
[Timeline: without switching, 4 CTAs access the cache in time T; switching back to CTA 1 when its data arrives means only 3 CTAs access the cache in time T. Fewer CTAs accessing the cache concurrently → less cache contention]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(79)
Reduction in L1 Miss Rates
• Limited benefits for cache-insensitive applications
• What is happening deeper in the memory system?
[Chart: normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and AVG under Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; average reductions of 8% and 18%]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(80)
The Off-Chip Memory Path
[Diagram: compute units CU 0 to CU 15 and CU 16 to CU 31 connected through six memory controllers (MCs) to off-chip GDDR5 channels]
• Access patterns?
• Ordering and buffering?
(81)
Inter-CTA Locality
[Diagram: two SIMT cores (each with a warp scheduler, ALUs, and L1 caches) holding CTA-1/CTA-3 and CTA-2/CTA-4, issuing requests to multiple DRAM channels]
How do CTAs interact at the MC and in DRAM?
(82)
Impact of the Memory Controller
• Memory scheduling policies
- Optimize BW vs. memory latency
• Impact of row buffer access locality
• Cache lines?
(83)
Row Buffer Locality
(84)
CTA Data Layout (A Simple Example)
[Figure: data matrix A(i,j) stored row-major in DRAM; successive row blocks map to Banks 1 through 4, and CTAs 1 through 4 tile the matrix, so consecutive CTAs tend to touch the same DRAM row]
Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(85)
Implications of High CTA-Row Sharing
[Diagram: warps of CTA-1 through CTA-4 on two SIMT cores send requests through the L2 cache to Bank-1/Row-1 and Bank-2/Row-2, leaving Banks 3 and 4 idle, because both cores use the same CTA prioritization order]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
[Figure: with high row locality but low bank-level parallelism, requests queue up behind two open rows; with lower row locality but higher bank-level parallelism, requests spread across the banks]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(87)
Some Additional Details
• Spread references from multiple CTAs (on multiple SMs) across row buffers in distinct banks
• Do not use the same CTA group prioritization across SMs
- Play the odds
• What happens with applications with unstructured, irregular memory access patterns?
Objective 2: Improving Bank-Level Parallelism
[Diagram: with different CTA prioritization orders on the two SIMT cores, requests through the L2 cache spread across Banks 1 through 4 (Rows 1 through 4)]
• 11% increase in bank-level parallelism
• 14% decrease in row buffer locality
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Objective 3: Recovering Row Locality
[Diagram: memory-side prefetching into the L2 cache turns the row locality lost to bank interleaving into L2 hits for the SIMT cores]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Memory-Side Prefetching
• Prefetch the so-far-unfetched cache lines in an already open row into the L2 cache, just before the row is closed
• What to prefetch?
- Sequentially prefetch the cache lines that were not accessed by demand requests
- Sophisticated schemes are left as future work
• When to prefetch?
- Opportunistic in nature
- Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical)
- Option 2: give prefetching more time; make demands wait if there are not many (demands are NOT always critical)
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
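A sketch of Option 1 (the interfaces are assumptions, not OWL's controller logic): when a row is about to close and no demand for another row is pending, stream the row's unfetched lines into L2.

#include <vector>

struct OpenRow {
    int bank, row;
    std::vector<bool> lineFetchedByDemand;  // per cache line in the row
};

// Called just before the controller closes row 'r'. Option 1 semantics:
// bail out the moment a demand request for another row is pending.
void prefetchBeforeClose(const OpenRow& r, bool demandPendingForOtherRow,
                         void (*prefetchToL2)(int bank, int row, int line)) {
    if (demandPendingForOtherRow) return;  // demands are always critical
    for (int line = 0; line < (int)r.lineFetchedByDemand.size(); ++line)
        if (!r.lineFetchedByDemand[line])
            prefetchToL2(r.bank, r.row, line);  // sequential, demand-skipped lines
}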
IPC Results (Normalized to Round-Robin)
[Chart: normalized IPC for SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, and AVG-T1; a 44% average improvement, within 11% of a perfect L2]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(92)
Summary
• Coordinated scheduling across SMs, CTAs, and warps
• Consideration of effects deeper in the memory system
• Coordinating warp residence in the core with the presence of corresponding lines in the cache