Implementing, Simulating and Extending Hardware Transactional Memory
PhD Status Talk (Statusvortrag)
Supervisor: Prof. Christof Fetzer
Subject reviewer (Fachreferent): Prof. Hermann Härtig
Executive Summary
• Transactional Memory is a great tool for fast parallel code
• Expensive HW implementation -> restrict features, reuse components
• Keep OS-invisible to aid adoption
• Tight relationship between implementation details and provided features
=> Can still innovate in constrained space. I do it.
2013-03-27, Stephan Diestelhorst - HTM
Processing Trends
Jeff Preshing, A Look Back at Single-Threaded CPU Performance, Feb 2012
http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance
Processing Trends
Chuck Moore, Data Processing in Exascale-Class Computer Systems, The Salishan Conference on High Speed Computing, Apr 2011
http://www.lanl.gov/orgs/hpc/salishan/salishan2011/3moore.pdf
Processing Trends
• Single thread performance stalling in 2004
• Multi-core CPUs penetrating the market
– Top-notch smartphones with four CPU cores
– Enthusiast desktop machines with four to eight cores
• Not all problems are coarse-grained data parallel (so your cluster doesn't help) -> SMPs, synchronise
Synchronising SMPs
• Coarse locks
– Size of system limits performance
– Low contention data vs. high contention lock
• Fine-grained locking
– Instruction overhead
– Lock order
– Composability
• Lock-free data structures
– Overheads
– Complexity
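The "lock order" cost of fine-grained locking can be illustrated with a minimal Python sketch. The `Account` class, its `rank` field and `transfer()` are hypothetical names invented for this illustration and do not come from the talk. The point is that every thread has to acquire the per-object locks in one agreed global order, otherwise two opposite transfers can deadlock.

```python
import threading

class Account:
    def __init__(self, rank, balance):
        self.rank = rank              # hypothetical global rank that defines the lock order
        self.balance = balance
        self.lock = threading.Lock()  # one fine-grained lock per object

def transfer(src, dst, amount):
    # Assumes src is not dst. Always lock the lower-ranked account first,
    # so a -> b and b -> a transfers cannot wait on each other forever.
    first, second = (src, dst) if src.rank < dst.rank else (dst, src)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def run_demo(rounds=1000):
    a, b = Account(0, 100), Account(1, 100)

    def worker(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    threads = [threading.Thread(target=worker, args=(a, b)),
               threading.Thread(target=worker, args=(b, a))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

Note that `transfer()` only composes with other code that follows the same ranking, which is the composability problem named on the slide.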
Transactional Memory (1993)
• Speculative execution of transactions
• Tentative stores (data versioning)
• Monitor working set for concurrent conflicting accesses (conflict detection)
• Make tentative updates visible at once
Local to the core (point of coherence), fine-grained benefit w/o cost
Maurice Herlihy, and J. Eliot B. Moss. “Transactional memory: Architectural support for lock-free data structures.” Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, 1993
[Figure: two concurrent transactions. TX 1: Begin, Txload [foo], Txstore [bar], Commit. TX 2: Begin, Txstore [foo], Commit.]
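The four ingredients listed on this slide can be sketched as a toy software model. This is an illustration under stated assumptions, not the hardware design from the 1993 paper: the `Memory` and `Transaction` classes are hypothetical, conflicts are checked only at commit time for brevity, and unwritten locations read as 0.

```python
class Memory:
    """Shared memory plus the list of transactions currently in flight."""
    def __init__(self):
        self.cells = {}
        self.active = []

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.reads = set()    # monitored working set (loads)
        self.writes = {}      # tentative stores: addr -> value (data versioning)
        self.doomed = False   # set when a committing transaction conflicts with us
        mem.active.append(self)

    def txload(self, addr):
        self.reads.add(addr)
        # A transaction sees its own tentative stores first.
        return self.writes.get(addr, self.mem.cells.get(addr, 0))

    def txstore(self, addr, value):
        self.writes[addr] = value   # not visible to others until commit

    def commit(self):
        self.mem.active.remove(self)
        if self.doomed:
            return False            # speculation failed, tentative stores are dropped
        for other in self.mem.active:
            # Conflict detection: our writes overlap the other's working set.
            if (other.reads | set(other.writes)) & set(self.writes):
                other.doomed = True
        self.mem.cells.update(self.writes)  # make all updates visible at once
        return True

def demo():
    mem = Memory()
    t1, t2 = Transaction(mem), Transaction(mem)
    t1.txload("foo")
    t1.txstore("bar", 5)
    t2.txstore("foo", 7)
    ok2 = t2.commit()   # t2 commits first; t1 has read foo, so t1 is doomed
    ok1 = t1.commit()
    return ok1, ok2, dict(mem.cells)
```

In `demo()`, the second transaction commits and the first one aborts, so only the store to `foo` becomes visible.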
Roadmap
• Introduction
• Related Work
– HTM microarchitecture
  • Academia
  • Industry
– HTM ISA
– HTM++ / HTM--
– Simulation
• Contributions
• IS:
– Focussed on hardware / low-level software interaction
– TM in coherent system
• IS NOT:
– A full history of TM
– Computer architecture course
– Language integration of TM
– Distributed TM
Roadmap (system layers)
• Microarchitecture
• RTL
• Instruction Set
• Libraries
• Operating System
• Applications
HTM Microarchitecture - Basics
• Conflict detection
– Location, capacity
– Eager / lazy
• Data versioning
– Location, capacity
– Make visible
• Integration with baseline microarchitecture
[Figure: two concurrent transactions. TX 1: Begin, Txload [foo], Txstore [bar]=5, Commit. TX 2: Begin, Txstore [foo]=7, Commit. Conflict detection state: TX 1 tracks foo (txr) and bar (txw); TX 2 tracks foo (txw). Data versioning state: bar 0 -> 5, foo 0 -> 7. Labels: Eager, Lazy.]
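The difference between eager and lazy conflict detection is mainly a question of when the txr / txw sets are compared. The following Python sketch replays the two transactions from the figure under both policies. The `detect()` function and the trace encoding are assumptions made for this illustration, not part of any of the surveyed designs.

```python
def detect(trace, mode):
    """Replay (tx, op, addr) events and report the first detected conflict.

    mode="eager": compare on every load / store.
    mode="lazy":  compare only when a transaction commits.
    Returns (event index, set of transactions involved) or None.
    """
    reads, writes = {}, {}   # tx -> addresses, i.e. the txr / txw bits
    for i, (tx, op, addr) in enumerate(trace):
        reads.setdefault(tx, set())
        writes.setdefault(tx, set())
        others = [o for o in reads if o != tx]
        if op == "load":
            reads[tx].add(addr)
            check = mode == "eager"
            clash = [o for o in others if addr in writes[o]]
        elif op == "store":
            writes[tx].add(addr)
            check = mode == "eager"
            clash = [o for o in others if addr in (reads[o] | writes[o])]
        else:  # "commit"
            check = mode == "lazy"
            clash = [o for o in others if writes[tx] & (reads[o] | writes[o])]
            reads[tx], writes[tx] = set(), set()   # working set is released
        if check and clash:
            return i, {tx, *clash}
    return None

# The interleaving from the figure: TX 1 loads foo and stores bar,
# TX 2 stores foo and commits before TX 1 does.
TRACE = [(1, "load", "foo"), (1, "store", "bar"),
         (2, "store", "foo"), (2, "commit", None), (1, "commit", None)]
```

With this trace, eager detection flags the conflict at TX 2's store to foo, while lazy detection flags it one event later, at TX 2's commit.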
HTM Microarchitecture - Academia
• 2004 – Hammond, et al. – Transactional Memory Coherence and Consistency (TCC)
• 2005 – Ananian, et al. – Unbounded Transactional Memory (UTM)
• 2006 – Shriraman, et al. – An Integrated Hardware-Software Approach to Flexible Transactional Memory (RTM)
• 2007 – Yen, et al. – LogTM-SE: Decoupling Hardware Transactional Memory From Caches
Property | 1993 TM | 2004 TCC | 2005 UTM | 2006 RTM | 2007 LogTM-SE
CPU core | Emulated | Emulated | ? | In-order, Simics / GEMS | Out-of-order, Simics / GEMS
Coherence point | L1d | Local cache | ? | L1d | L1d
Conflict detection | TX$, eager | Local cache, lazy | TX bits for all mem, eager | L1d, eager / lazy | Signatures, eager
#TX reads | TX$ size | Local cache size | Infinite | L1d | Infinite
Buffering | TX$ | Write buffer | Mem | Inner $ | Mem
Undo | TX | Mem | Mem-log | Outer $ | Mem-log
#TX writes | TX$ size | Write buffer size | Infinite | L1d | Infinite
Abort | Holder wins, async | First commit wins, ordered | Older wins | SW policy | Older wins, SW handler
Changes to baseline system | (L1d), TX$ | Replace coherency / consistency | Reg rename, visible regs, LS, $, OS | Coherence states & protocol, L1d, visible regs, OS | Coherence protocol, directory, OS
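The "Signatures" entry for LogTM-SE refers to summarising a transaction's read and write sets in fixed-size hashed bit vectors instead of per-line cache bits. A minimal Python sketch of the idea follows. The `Signature` class, the 64-bit vector size, the 64-byte line size and the hash function are all assumptions chosen for illustration and are not taken from the paper.

```python
class Signature:
    """Fixed-size hashed bit vector that over-approximates a set of addresses."""
    def __init__(self, bits=64):
        self.bits = bits
        self.vector = 0

    def _index(self, addr):
        # Hypothetical hash: drop the 64-byte line offset, fold the rest into the vector.
        return (addr >> 6) % self.bits

    def insert(self, addr):
        self.vector |= 1 << self._index(addr)

    def may_contain(self, addr):
        # True can be a false positive; False is always exact.
        return bool((self.vector >> self._index(addr)) & 1)

sig = Signature()
sig.insert(0x1000)
same_line = sig.may_contain(0x1008)   # same cache line as 0x1000
aliased = sig.may_contain(0x2000)     # never inserted, but hashes to the same bit
other = sig.may_contain(0x1040)       # neighbouring line, different bit
```

The set can hold any number of addresses ("Infinite" in the table), at the price of false conflicts when unrelated addresses alias to the same bit.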
HTM Academia Conclusions
• Many proposals with new “widgets”
• “Easy to implement”
• Evaluation: simple in-order core & more realistic memory models
• Preserve protocols? Preserve components?
• Verification cost! (absence of)
Property | 2009 Azul | 2009 Rock | 2011 BG/Q | 2012 zEC12
CPU core | In-order, 54 cores x 16 modules | Semi out-of-order, check-pointed, 2T, 16 cores | In-order, PowerPC A2, 4T, 16 cores | Out-of-order, 6 cores x 6 x 4 modules
Coherence point | L1d | L2 | Shared L2 | L3
Conflict detection | L1d, eager? | L1d R, L2 W, eager | L2, eager / lazy | L1d, eager
#TX reads | 16 kB (L1d) | 32 kB (L1d) | 20 MB (L2) | 96 kB (L1d)
Buffering | L1d | Store buffer | L2 | Store buffer (WCC)
Undo | Outer cache | L2 | L2 (multi-vers) | L2
#TX writes | 16 kB (L1d) | 32 (StBuf) | 20 MB (L2) | 64 (x128 B, StB)
Abort | ? | Requester wins | SW handler | Req wins + NACK
Changes to baseline | L1d, core | L2, (new core) | L2, (bypass L1d), OS | L1d, StB, (OS)
HTM Industry Conclusion
• Baseline core and cache microarchitectures differ vastly
– Cache sizes, organisation
– Coherency point
• TM implementation and performance highly dependent
• Component reuse is key, adapt TM to baseline, not vice versa
HTM ISA
• General idea simple: TX.begin, TX.commit
• Decisions for aborts: visible [1] / invisible [2], sync / async [1]
• Register snapshotting: none (partial) [1], full [3], selective [2]
• Poke through HTM
– No [4]
– Default [1]
– Special case [5]
[1] Herlihy, Maurice, and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, 1993.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
[3] Ananian, C. Scott, et al. "Unbounded transactional memory." High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005.
[4] Intel Corp. "Intel® Architecture Instruction Set Extensions Programming Reference", chapter 8, 2012.
[5] Chaudhry, Shailender, et al. "Rock: A high-performance SPARC CMT processor." Micro, IEEE 29.2 (2009): 6-16.
HTM ISA
• OS interaction
– Survive faults, interrupts [1]
– Support syscalls [3,4]
– OS-support required
• Strong / weak isolation
• Capacity [1] / progress guarantees [2]
[1] Ananian, C. Scott, et al. "Unbounded transactional memory." High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on. IEEE, 2005.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
[3] Moravan, Michelle J., et al. "Supporting nested transactional memory in LogTM." ACM Sigplan Notices. Vol. 41. No. 11. ACM, 2006.
[4] Ramadan, Hany E., et al. "MetaTM/TxLinux: transactional memory for an operating system." ACM SIGARCH Computer Architecture News 35.2 (2007): 92-103.
HTM-- Simpler Hardware
• Reduce HW cost by offering less than full transactions
• AOU [1]
• HASTM [2]
• Similar: register checkpoints
[1] Spear, Michael F., et al. "Alert-on-update: a communication aid for shared memory multiprocessors." Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, 2007.
[2] Saha, Bratin, A-R. Adl-Tabatabai, and Quinn Jacobson. "Architectural support for software transactional memory." Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on. IEEE, 2006.
HTM++ Wider Applicability
• Amortise HW cost by offering more features
• Escape actions [1], Suspend / resume [2], Open nesting [3]
• Hardware lock-elision [4]
• InvisiFence [5]
[1] Moravan, Michelle J., et al. "Supporting nested transactional memory in LogTM." ACM Sigplan Notices. Vol. 41. No. 11. ACM, 2006.
[2] Zilles, Craig, and Lee Baugh. "Extending hardware transactional memory to support non-busy waiting and non-transactional actions." TRANSACT: First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing. 2006.
[3] Moss, J. Eliot B., and Antony L. Hosking. "Nested transactional memory: model and architecture sketches." Science of Computer Programming 63.2 (2006): 186-201.
[4] Rajwar, Ravi, and James R. Goodman. "Speculative lock elision: Enabling highly concurrent multithreaded execution." Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society, 2001.
[5] Blundell, Colin, Milo MK Martin, and Thomas F. Wenisch. "InvisiFence: performance-transparent memory ordering in conventional multiprocessors." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009.
Simulation Solutions
• Execution-driven, emulation [1]
• In-order / out-of-order cores, extended memory simulator (Simics + GEMS) SPARC [2]
• UVSIM, MIPS64, user-space [3]
• X86? Full-system? Detailed core model?
[1] Herlihy, Maurice, and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, 1993.
[2] Martin, Milo MK, et al. "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset." ACM SIGARCH Computer Architecture News 33.4 (2005): 92-99.
[3] Zhang, Lixin. "UVSIM User Manual." (2003).
Related Work Summary
• Understand realities of baseline microarchitecture
• Constrain TM implementation
• Can we still improve with constraints and all the previous work?
YES!
Contributions
• Simulator: PTLsim-ASF / Marss86-ASF
• HTM ISA: ASF 2.0, ASF++
• Solve new problems with HTM: Thread to thread communication
PTLsim-ASF / Marss86-ASF
PTLsim-ASF [1]
• Out-of-order, superscalar core
• AMD64 ISA + ASF
• Detailed pipeline implementation of transactional memory primitives
• First order cache model
• Multiple options for tracking working set
• Full system, switch paravirtualisation <-> simulation (Xen-based)
Marss86-ASF [2]
• Core derived from PTLsim-ASF
• Detailed cache model
– Controllers, directories, coherence states
– Bandwidth and latency limitations
• QEMU-based, switch emulation <-> simulation
[1] Yourst, Matt T. "PTLsim: A cycle accurate full system x86-64 microarchitectural simulator." Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE International Symposium on. IEEE, 2007.
[2] Patel, Avadh, et al. "MARSSx86: A full system simulator for x86 CPUs." Proceedings of the 2011 Design Automation Conference. 2011.
ASF - Advanced Synchronization Facility
• Extension to AMD64 ISA for HTM
• Key facts
– Non-transactional loads and stores
– Minimal capacity guarantee (four cache lines)
– Requester-wins conflict resolution, synchronous aborts
– Partial register snapshot (only rIP, rSP)
– OS-invisible / Hypervisor-invisible: syscalls, interrupts, exceptions abort
AMD "Advanced Synchronization Facility" Proposal, 2009
http://amddevcentral.com/Resources/archive/ASF/Pages/default.aspx
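The four-cache-line minimal capacity guarantee is counted in lines, not bytes, so the layout of the accessed data matters. A small Python sketch of that arithmetic follows. The 64-byte line size and the helper names `lines_touched()` and `within_guarantee()` are assumptions for illustration and are not part of the ASF proposal.

```python
LINE_BYTES = 64        # assumption: 64-byte cache lines
GUARANTEED_LINES = 4   # minimal capacity guarantee named on the slide

def lines_touched(accesses):
    """accesses: iterable of (address, size_in_bytes) with size >= 1.
    Returns the set of cache-line numbers the accesses fall into."""
    lines = set()
    for addr, size in accesses:
        first = addr // LINE_BYTES
        last = (addr + size - 1) // LINE_BYTES   # an access may straddle two lines
        lines.update(range(first, last + 1))
    return lines

def within_guarantee(accesses):
    return len(lines_touched(accesses)) <= GUARANTEED_LINES

packed = [(0, 8), (8, 8), (60, 8)]               # three accesses, two lines
spread = [(i * LINE_BYTES, 8) for i in range(5)]  # five accesses, five lines
```

Under these assumptions the packed accesses stay within the guarantee, while five 8-byte accesses spread over five lines do not, even though they touch only 40 bytes.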
Using ASF's Unique Features
• Accelerate / cooperate with tinySTM
Christie, Dave, et al. "Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack." Proceedings of the 5th European conference on Computer systems. ACM, 2010.
Using ASF's Unique Features
• Non-TX memory accesses for well-formed communication channels
• Implement RPC from hardware transactions
• And parallel transactional nesting
Liu, Yujie, Stephan Diestelhorst, and Michael Spear. "Delegation and nesting in best-effort hardware transactional memory." Proceedings of the 24th ACM symposium on Parallelism in algorithms and architectures. ACM, 2012.
ASF++ - Extensions to ASF
• Nested abort handling
• Supporting time-stamp accesses from transactions
• Non-atomic aborts
– Immediate aborts -> asynchronous
– Resurrect transactions: tolerate interrupts, syscalls
– Map alert-on-update: low latency thread to thread communication, acceleration of STM
• All without expensive hardware changes or additional OS-visible state
Open Issues
• Microarchitectural enhancements
– Tolerate read-after-write conflicts
– Enable local transactional stores that never escape the transaction
• Architectural enhancements: RISC-ified lock elision
• Simulation stability
Conclusion
• Transactional memory has high implementation cost and open design space
• I create & use detailed simulation to guide exploration of microarchitectural and ISA enhancements
• I provide features with lower implementation cost, better performance and wider applicability
• Acknowledgements: AMD (M. Hohmuth, D. Christie, M. Pohlack), TUD (T. Riegel, M. Nowack, JT. Wamhoff), VELOX (UniNE, BSC)
Between All and Nothing Aborts
• Basic ASF instructions
• All arch state available at abort
• Coexisting locks / transactions
• No extension of OS-visible state
Transactional Memory 1993
• RISC-ify Compare-And-Swap
• Enclose instructions inside transactions
• Execute in parallel, watch for conflicts
• New instructions
– Load-transactional
– Store-transactional
– Commit / Abort
– Validate
• New hardware
– Separate TX cache, neighbour to L1D
– Keeps undo / redo copy
– New bus transaction types
– Line holder wins
– Size guarantee: 10 – 100 locations
Herlihy, Maurice, and J. Eliot B. Moss. "Transactional memory: Architectural support for lock-free data structures." Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, 1993.
Azul's HTM
Baseline microarchitecture
• 54 x 16 = 864 coherent cores, in-order, 2 misses
• Private L1i / L1d @ 16 kB per core
• L2 shared by 9 cores @ 2 MB
– Task-level parallelism
• Need to tweak applications to scale
– Usually < 10 cores
– Tuning -> 50 cores
• Data contention low, but lock contention high (synchronized)
HTM microarchitecture
• All modifications to L1d + core, no tweaks of L2
• TXR & TXW bits in L1d
• No reg snapshot
• Size limit L1d size & associativity
Click, Cliff. "Azul's experiences with hardware transactional memory." HP Labs-Bay Area Workshop on Transactional Memory. 2009.
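When TXR / TXW bits live in the L1d, a transaction can overflow long before it touches 16 kB, because every tracked line must fit into its cache set. The Python sketch below shows this effect. The 4-way associativity and 64-byte line size are assumptions picked for illustration (the slide gives only the 16 kB size), and `first_overflow()` is a hypothetical helper.

```python
def first_overflow(addresses, cache_bytes=16 * 1024, ways=4, line_bytes=64):
    """Return the index of the first access whose line cannot be tracked
    because its cache set is already full, or None if everything fits."""
    num_sets = cache_bytes // (ways * line_bytes)
    sets = {}   # set index -> lines currently holding TXR / TXW bits
    for i, addr in enumerate(addresses):
        line = addr // line_bytes
        resident = sets.setdefault(line % num_sets, set())
        resident.add(line)
        if len(resident) > ways:
            return i   # a transactional line would have to be evicted: abort
    return None

# 256 consecutive lines fill the assumed cache exactly (16 kB).
sequential = [i * 64 for i in range(256)]
# Five lines at a 4096-byte stride all map to the same set.
strided = [i * 4096 for i in range(5)]
```

Under these assumptions the sequential pattern fits completely, while the strided pattern overflows on its fifth access after touching only five lines.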
Azul's HTM - Results
• 2x for some (Trade6)
• Most < 10% upside
• Heuristic (HTM vs lock) is tricky
• Small SW rewrites help massively
• Most TX fail for conflict, not capacity
• Shared counters etc. cause excessive conflicts
=> Need breadcrumbs, need failure reporting
Click, Cliff. "Azul's experiences with hardware transactional memory." HP Labs-Bay Area Workshop on Transactional Memory. 2009.
Oracle / SUN's Rock
Baseline
• 16 cores x 2 threads, semi out-of-order
• 4-way 32 kB L1d shared by 2 cores, S-bit for load ordering
• 8-way 2 MB L2, shared by 16 cores
• 8 MB L3 per memory controller
• Checkpointed core / register file
• Cache miss causes Execute Ahead phase
HTM
• 32 entry store buffer for TX stores, L2 tracks conflicts
• Existing L1d S-bits track TX read set
• Commit locks write-set, drains store buffer into L2
Chaudhry, Shailender, et al. "Rock: A high-performance SPARC CMT processor." Micro, IEEE 29.2 (2009): 6-16.
Dice, Dave, et al. "Early experience with a commercial hardware transactional memory implementation." ASPLOS (2009).
Oracle / SUN's Rock
• Problems with data dependent stores and branch prediction in RB-tree
IBM Blue Gene/Q
Baseline
• 16 PowerPC A2 cores x 4 threads, in-order
• 8-way 16 kB L1d per core
• 16 cores share L2 @ 32 MB
– 16 slices, 16 ways each
– Point of coherence
– Multi-version
HTM
• L2 buffers TX stores, both old / new in different ways
• L1 and core mostly unmodified
• L2 directory tracks ownership
• Short-TX: Push through L1, notify L2 with store for every load
• Long-TX:
– TLB maps versions to different physical addresses
– Flush L1 on TX start
• Small HW register checkpoint: SP, IP (and Global Offset Table)
• MMIO communication
Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).
IBM Blue Gene/Q - Results
• L1 causes significant overhead
• OS involvement
Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).
IBM Blue Gene/Q - Results
• Good scalability
• 20 MB working set too small in labyrinth
• Out of IDs in ssca2
Wang, Amy, et al. "Evaluation of Blue Gene/Q Hardware Support for Transactional Memories." PACT (2012).
IBM zEC12
Baseline
• Out-of-order core, three / seven wide, 6 cores x 20
• 6-way 96 kB L1d WT, 3 cyc
• 8-way 1 MB WT, 7 cyc
• 6 cores share L3 @ 48 MB
• 6 x 6 cores share 384 MB L4
• 6 x 6 x 4 = 144 cores in one coherent system
• Can reject probes
HTM
• Keep SMP protocol
• Extend fetch unit to track transaction begin / end
• TXR & TXW bits in L1
• Gate TX stores before L2 / L3
• Flexible
– restore / ignore registers
• Constrained mode
– Guaranteed progress
– Size limit
– Functionality limit
Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
IBM zEC12 - Results
• Scale beyond MCM size
• Constrained mode better @ high contention
• TX can cause more traffic than locks due to aborts, everything in cache at one time
Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
Progress Guarantees
• None, holder wins (TM)
• None, requester wins (Rock)
• Hardware timestamps (UTM)
• Dependency tracking [1]
• Hardware progress for restricted TX [2]
• Software abort control (RTM)
[1] Ramadan, Hany E., Christopher J. Rossbach, and Emmett Witchel. "Dependence-aware transactional memory for increased concurrency." Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2008.
[2] Jacobi, Christian, Timothy Slegel, and Dan Greiner. "Transactional Memory Architecture and Implementation for IBM System z." Proceedings of the 45th International Symposium on Microarchitecture (MICRO). 2012.
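The first three policies differ only in which side of a conflict has to abort. A minimal Python sketch follows, assuming each transaction carries a start timestamp. The `loser()` function, the policy names and the dictionary layout are hypothetical and chosen for illustration only.

```python
def loser(policy, holder, requester):
    """Return the transaction that aborts when `requester` touches a line
    that `holder` already has in its working set.
    Each transaction is a dict with a 'name' and a 'start' timestamp."""
    if policy == "holder_wins":      # as in the 1993 TM proposal
        return requester
    if policy == "requester_wins":   # as in Rock
        return holder
    if policy == "older_wins":       # timestamp based, as in UTM
        # The younger transaction (larger start time) aborts; on a tie
        # the requester aborts.
        return holder if holder["start"] > requester["start"] else requester
    raise ValueError(f"unknown policy: {policy}")

old_tx = {"name": "A", "start": 1}
young_tx = {"name": "B", "start": 2}
```

With holder-wins or requester-wins, two transactions can keep aborting each other, which is why the slide lists both as giving no guarantee. With timestamps, the oldest transaction in the system never loses a conflict.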