Software Logging under Speculative Parallelization
María Jesús Garzarán
M. Prvulovic, J. M. Llabería, V. Viñals, L. Rauchwerger, and J. Torrellas
U. de Zaragoza, U. Politècnica de Catalunya
U. of Illinois, Texas A&M U.
Roadmap of the Talk
Speculative tasks running in the same processor can create multiple versions of the same variable
– Stall the processor or redesign the caches
Alternative solution: Logs
Contribution:
Design, integration and evaluation of software logging on top of a speculation protocol
– Cheap, low overhead (10%)
Outline
Speculative Parallelization
Multiple Local Speculative Versions
Software Logging
Evaluation
Conclusions
Speculative Parallelization
Assume no dependences and execute tasks in parallel
Track data accesses
Detect violations
Squash offending tasks and restart them
Do I = 1 to N
  … = A(L(I)) + …
  A(K(I)) = …
EndDo
Task J:     … = A(4) + …     A(5) = ...
Task J+1:   … = A(2) + …     A(2) = ...
Task J+2:   … = A(5) + …     A(6) = ...
RAW: Task J writes A(5), Task J+2 reads A(5)
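The track/detect/squash cycle on this slide can be illustrated with a small Python sketch. It is not the protocol from the talk: the function name `find_raw_violations`, the task numbering (0, 1, 2 standing for J, J+1, J+2) and the out-of-order schedule are assumptions made only for illustration.

```python
def find_raw_violations(tasks, schedule):
    """Report loads that ran before an earlier task's store to the same element.

    tasks:    dict task_id -> (index read, index written); task ids follow
              sequential loop order.
    schedule: list of (task_id, op) in actual execution order,
              with op either "load" or "store".
    Returns a list of (reader_task, writer_task, index) tuples.
    """
    loads_seen = {}   # array index -> tasks that have already loaded it
    violations = []
    for task_id, op in schedule:
        read_idx, write_idx = tasks[task_id]
        if op == "load":
            loads_seen.setdefault(read_idx, []).append(task_id)
        else:
            # A later task that already loaded this element read a stale value.
            for reader in loads_seen.get(write_idx, []):
                if reader > task_id:
                    violations.append((reader, task_id, write_idx))
    return violations


# Task J reads A(4), writes A(5); J+1 reads A(2), writes A(2);
# J+2 reads A(5), writes A(6).
tasks = {0: (4, 5), 1: (2, 2), 2: (5, 6)}

# Assumed schedule in which Task J+2 loads A(5) before Task J stores it.
out_of_order = [(2, "load"), (0, "load"), (1, "load"),
                (1, "store"), (0, "store"), (2, "store")]
in_order = [(0, "load"), (0, "store"), (1, "load"),
            (1, "store"), (2, "load"), (2, "store")]

violations = find_raw_violations(tasks, out_of_order)
no_violations = find_raw_violations(tasks, in_order)
```

In this sketch the reader task in each reported tuple is the one that would be squashed and restarted.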
Speculative Parallelization
Speculative tasks cannot displace speculative data
State buffered until task becomes non-speculative
[Figure: tasks 3, 5, 6 and 4 running on nodes with cache and memory, connected by a network]
Several Tasks Share a Cache
Processors must hold speculative state of several tasks
Task-ID field to identify the owner [Cintra00][Stefan00]
[Figure: tasks 3, 5, 6, 4, 8 and 7 sharing the nodes' caches; each node has cache and memory, connected by a network]
Outline
Speculative Parallelization
Multiple Local Speculative Versions
Software Logging
Evaluation
Conclusions
Last and Non-Last Versions
Speculative tasks in the same processor write the same memory address
Task 5:
  store value1, 0x400      (non-last version)
Task 8:
  store value2, 0x400      (last version)
  ….
  load r4, 0x400           (needs last version)
Multiple Local Speculative Versions
To avoid the stall of the processor:
– Modify the cache
– Use Logs
Modify the cache
Cache keeps last and non-last versions (same Tag, but different task-ID)
– complexity and extra comparisons
– chances of displacement increase
– accessing last versions is as hard as accessing non-last versions
Cache:
Task-ID   Tag     Data
5         0x400   value1
8         0x400   value2
Logs
Cache keeps last versions
Logs hide away non-last versions
Log (in memory):
Task-ID   Addr    Data
5         0x400   value1
Cache:
Task-ID   Tag     Data
8         0x400   value2
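A minimal Python sketch of this split, assuming stores reach the cache in task order: the cache keeps one (last) version per address, and a store by a different task first moves the old version into the log. The class name `VersionedStore` and its methods are hypothetical and only model the idea on the slide.

```python
class VersionedStore:
    """Toy model: the cache holds last versions, the log holds non-last ones."""

    def __init__(self):
        self.cache = {}   # addr -> (task_id, value): last version only
        self.log = []     # (task_id, addr, value): non-last versions

    def store(self, task_id, addr, value):
        if addr in self.cache:
            old_task, old_value = self.cache[addr]
            if old_task != task_id:
                # The version written by an earlier task becomes non-last:
                # hide it away in the log before overwriting it in the cache.
                self.log.append((old_task, addr, old_value))
        self.cache[addr] = (task_id, value)

    def load(self, addr):
        # Loads are served by the last version in the cache.
        return self.cache[addr][1]


# The example from the slides: tasks 5 and 8 both write address 0x400.
s = VersionedStore()
s.store(5, 0x400, "value1")
s.store(8, 0x400, "value2")
```

With this layout a load of 0x400 needs no task-ID comparison among several cached copies, since only the last version is in the cache.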
Logs
Collect the state that a task made stale
Useful:
– Free up space when the task commits
– To recover in case of squashes
[Figure: the cache holds the last versions; the Undo Log holds the non-last versions, grouped by Task 5, Task 8 and Task 10]
Speculative protocol
Speculative protocol using Hw logs was proposed:
[Zhang99] Y. Zhang. "Hardware for Speculative Parallelization in DSM Multiprocessors". Ph.D. thesis, University of Illinois, May 1999
Use Sw Logs on top of a speculative protocol:
– Task-ID: per memory word in the local memory
– ISA: new ld/st instructions
Outline
Speculative Parallelization
Multiple Local Speculative Versions
Software Logging
Evaluation
Conclusions
Software Logs
A compiler instruments the application
– Insert: extra instructions before store operations
– Recycle: free up space when a task commits
Interrupt handlers
– Recovery: in case of an out-of-order RAW and squash
– Retrieval: in-order RAW and the version is in the log
Software Data Structures
Logs are allocated locally before speculation starts
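[Figure: three structures]
– Task Pointer Table: one entry per task with fields Valid, TaskID, Ovflw, End, Next
– Log Buffer: divided into sectors owned by tasks (i, j) or free; each record holds Vaddr, Owner Task-ID, Value
– Free Sector Stack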
Instructions to insert in log
Operation                                # Assembly Instruct
-------------------------------------------------------------
Check log overflow                       1
Log.Vaddr = addr of var                  2
Log.OwnerTask_ID = current Task_ID       2
Log.Value = value of var                 2
Increment log pointer                    1
Update Task_ID                           1
Original store
-------------------------------------------------------------
Total                                    9
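The sequence in the table can be mirrored step by step in Python. This is a sketch, not the instrumented code itself: memory, task-IDs and the log are modeled as a dict, a dict and a list, the function name `logged_store` is hypothetical, and an overflow raises an error here instead of allocating another sector.

```python
def logged_store(mem, task_ids, log, log_limit, addr, new_value, current_task):
    """Perform one instrumented store, following the steps in the table."""
    # Check log overflow.
    if len(log) >= log_limit:
        raise MemoryError("log sector full")
    # Fill the three fields of the log record; appending also plays the
    # role of incrementing the log pointer.
    log.append({
        "vaddr": addr,
        "owner_task_id": task_ids.get(addr),
        "value": mem.get(addr),
    })
    # Update Task_ID.
    task_ids[addr] = current_task
    # Original store.
    mem[addr] = new_value


# Task 8 overwrites the version that task 5 left at 0x400.
mem = {0x400: "value1"}
task_ids = {0x400: 5}
log = []
logged_store(mem, task_ids, log, log_limit=16,
             addr=0x400, new_value="value2", current_task=8)
```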
Reducing unnecessary logging
Create log entry: only 1st write to each variable
– Non-spec vars: easy to identify
– Spec vars: hard
  • Insert run-time check in all spec writes
  • If 1st, create log entry
==> Much reduced instrumentation overhead
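To see why the run-time check shrinks the log, the following Python sketch counts log entries for a made-up trace of speculative stores, with and without the first-write filter. The trace and the function name `count_log_entries` are assumptions; the filter compares the task-ID stored with the variable against the current task.

```python
def count_log_entries(trace, use_filter):
    """Count log entries created for a trace of speculative stores.

    trace: list of (task_id, addr) stores in program order.
    With use_filter, an entry is created only on a task's first write
    to an address, detected by comparing the stored task-ID.
    """
    task_ids = {}   # addr -> task-ID of the last writer
    entries = 0
    for task_id, addr in trace:
        first_write = task_ids.get(addr) != task_id
        if first_write or not use_filter:
            entries += 1
        task_ids[addr] = task_id
    return entries


# Assumed trace: task 8 writes 0x400 three times and 0x404 once,
# then task 9 writes 0x400.
trace = [(8, 0x400), (8, 0x400), (8, 0x404), (8, 0x400), (9, 0x400)]
unfiltered = count_log_entries(trace, use_filter=False)
filtered = count_log_entries(trace, use_filter=True)
```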
Outline
Speculative Parallelization
Multiple Local Speculative Versions
Software Logging
Evaluation
Conclusions
Simulation Environment
Execution-driven simulator
Scalable multiprocessor: 16 nodes
Detailed superscalar processor model
Processor: 4-issue, dynamic, 2K BTB
32 KB L1 2-way, 512 KB L2 4-way
Speculative protocol [Zhang99]
Applications
Applications dominated by non-analyzable loops (subscripted subscripts)
– P3m (NCSA)
– Tree (Univ. of Hawaii)
– Apsi (Specfp2000)
– Bdna (Perfect Club)
– Track (Perfect Club)
– Dsmc3d (HPF2)
Non-analyzable loops account for an average of 51.4% of sequential time
Non-analyzable loops and stores to instrument identified by the Polaris compiler
Performance Results
[Figure: Execution Time (normalized, 0 to 1) for P3m, Tree, Apsi and Bdna; bars NoLog, Sw and Hw for each application, broken down into Useful, Hazard, Sync and Memory Stall]
Values printed with the bars (NoLog, Sw, Hw): P3m 0.9, 3.5, 4.0; Tree 8.1, 14.6, 14.7; Apsi 3.5, 3.8, 4.4; Bdna 6.1, 7.0, 7.6
Sw only increases execution time by 10% over Hw
Sw reduces execution time by 36% over NoLog
Outline
Speculative Parallelization
Multiple Local Speculative Versions
Software Logging
Evaluation
Conclusions
Conclusions
Logs:
– Support multiple versions
– Minimize changes to cache
Software logging:
– No hardware support necessary
– Low time overhead (10% over HW)
Software logging: good solution for spec parallelization
Software Logging under Speculative Parallelization
María Jesús Garzarán ([email protected])
M. Prvulovic, J. M. Llabería, V. Viñals, L. Rauchwerger, and J. Torrellas
http://www.cps.unizar.es/deps/DIIS/gaz
http://iacoma.cs.uiuc.edu
How to access Task_ID (TID)
2 special instructions: lh_ts addr, sh_ts addr
where addr is the virtual address of the data, not of the TID, since TIDs do not have a virtual address
lh_ts: bring data from TID page into cache
sh_ts: update TID in cache
Dependence-checking HW reads/updates the TID pages in memory automatically
Implementation of lh_ts
2 possibilities:
– TLB has 2 physical addresses per entry: Vaddress → PaddressVar, PaddressTID
– TLB only has 1 physical address and there is a fixed offset between PaddressVar and PaddressTID
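The two options can be compared with a short Python sketch of the address computation. Everything numeric here is an assumption: the page size, the value of `TID_OFFSET`, the TLB contents, and the simplification that a TID sits at the same page offset as its data word. The function names are hypothetical.

```python
PAGE_SIZE = 4096            # assumed page size
TID_OFFSET = 0x1000_0000    # assumed fixed PaddressVar -> PaddressTID offset


def translate(tlb, vaddr):
    """Ordinary translation: virtual address -> PaddressVar."""
    vpn, page_off = divmod(vaddr, PAGE_SIZE)
    return tlb[vpn] * PAGE_SIZE + page_off


def tid_paddr_two_entries(tlb2, vaddr):
    """Option 1: each TLB entry holds (frame of data, frame of TID)."""
    vpn, page_off = divmod(vaddr, PAGE_SIZE)
    _frame_var, frame_tid = tlb2[vpn]
    return frame_tid * PAGE_SIZE + page_off


def tid_paddr_fixed_offset(tlb, vaddr):
    """Option 2: one physical address per entry plus a fixed offset."""
    return translate(tlb, vaddr) + TID_OFFSET


vaddr = 0x10 * PAGE_SIZE + 0x24
tlb = {0x10: 0x80}             # virtual page 0x10 -> frame 0x80
tlb2 = {0x10: (0x80, 0x81)}    # data frame 0x80, TID frame 0x81

paddr_var = translate(tlb, vaddr)
paddr_tid_1 = tid_paddr_two_entries(tlb2, vaddr)
paddr_tid_2 = tid_paddr_fixed_offset(tlb, vaddr)
```

Option 1 widens every TLB entry, while option 2 leaves the TLB unchanged but constrains where TID pages can be placed in physical memory.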
Hardware logging
It has hardware cost:
– FSM
– Extra protocol messages
– HW in caches to detect first writes
– ...
Need log physical address: complicates recovery
– Should not have changed the mapping of Vir to Phys
– Recovery needs to be done by privileged process
Instructions to insert in log
; r1 = upper limit of the sector
; r2 = address in memory to insert the log record
; offset(r3) = address of the variable to update
        bgt   r1, r2, insertion
        …                          ; allocate another sector
insertion:                         ; logging instr
        addu  r4, r3, offset       ; address of the variable
        sw    r4, 0(r2)            ; store in the log
        lh_ts r4, offset(r3)       ; load the task-ID
        sw    r4, 4(r2)            ; store in the log
        lw    r4, offset(r3)       ; load value of variable
        sw    r4, 8(r2)            ; store in the log
        addu  r2, r2, log_record_size
        sw    r5, offset(r3)       ; original store
Reducing unnecessary instrumentation
Not all the stores need to be instrumented
Instrument:
– first store of the non-speculative ones
– all speculative stores
  • Run time filtering of the first speculative store
[Figure: stores split into first stores and others, and into speculative and non-speculative, with the instrumented ones marked]
Filtering first speculative store
Using Task-ID
        lh_ts r6, offset(r3)       ; load task-ID
        beq   r6, r5, no_insert    ; first store?
        addu  r4, r3, offset       ; insert as usual (logging instr)
        sw    r4, 0(r2)
        ……...
        addu  r2, r2, log_record_size
        sh_ts r5, offset(r3)       ; store task-ID
no_insert:
        sw    r5, offset(r3)
Software handlers
Recovery: Out-of-order RAW
– Undo the modifications using data from the log
Retrieval: Some in-order RAWs
– The exposed load needs to dig the version out of the log
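The recovery handler's job can be sketched in Python as a reverse walk over the undo log. This is an illustration under assumptions, not the handler from the talk: the log is a single list with an explicit writer task per record (the talk groups records by task in sectors), and the function name `recover` is hypothetical.

```python
def recover(mem, task_ids, log, squashed_from):
    """Undo all writes by tasks with ID >= squashed_from.

    log records: (writer_task, vaddr, owner_task_id, old_value),
    in insertion order. Returns the log entries that survive.
    """
    kept = []
    # Walk newest to oldest so the oldest saved value is restored last.
    for writer, vaddr, owner, old_value in reversed(log):
        if writer >= squashed_from:
            mem[vaddr] = old_value
            task_ids[vaddr] = owner
        else:
            kept.append((writer, vaddr, owner, old_value))
    kept.reverse()
    return kept


# Task 5 then task 8 wrote 0x400; task 8 is squashed.
mem = {0x400: "value2"}
task_ids = {0x400: 8}
log = [(5, 0x400, 0, "value0"),    # task 5 overwrote the pre-speculation value
       (8, 0x400, 5, "value1")]    # task 8 overwrote task 5's version
log = recover(mem, task_ids, log, squashed_from=8)
```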
Stores can cause squashes
Stores can produce squashes of tasks that loaded a value prematurely
– out-of-order RAW
[Figure: tasks 3, 5, 6 and 4 on nodes with cache and memory, connected by a network; one task issues a load and another a store]
Support for multiple versions
[Figure: tasks 3, 5, 6, 4, 7 and 8 on nodes with cache and memory, connected by a network]
Logs help managing overflow area
Logs hide away past versions of vars
Overflow area and cache have the latest version
– The processor will request the latest version
[Figure: the cache and the overflow area hold the latest version; the Undo Log holds past versions, grouped by overwriting task (Task 4, Task 7, Task 10)]
Problem: Address time stamp in software
The time stamp is not mapped in virtual space
How to make the time stamp visible to the sw?
        lw    r3, addr_TS?         ; logging inst
        sw    r5, offset(r3)
Undo Log record: Vaddr, Version, Value
Problem: Address time stamp in software
OS copies
– Data page in even page
– Time stamp page in next odd page
        lh_ts r3, offset(r3)       ; logging inst
        sw    r5, offset(r3)
Undo Log record: Vaddr, Version, Value
Log Sizes
Appl      Log size/Task (KB)      # Tasks in Undo Log per Processor (Recycle)
          All       Filter        Maximum     Average
Apsi      184       40
Dsmc3d    56.7      18.2
P3m
Track     0.3       0.3           6           2
[Remaining table values, cell assignment not recoverable: 24; 1 1 1; 2100; 250; 4]
Logging under exposed loads
A local version can be killed with an exposed load
Hardware must detect it and send an interrupt
[Figure: tasks 3, 5, 6 and 4 on nodes with cache and memory, connected by a network; one task issues a load]
Loads find correct version
On an exposed load
– the speculation protocol finds the correct version
– provides it to the consumer task
[Figure: tasks 3, 5, 6 and 4 on nodes with cache and memory, connected by a network; one task issues a load]