The IEEE Rebooting Computing Initiative: Lessons Learned and The Road Ahead
Tom Conte
Co-Chair, IEEE Rebooting Computing Initiative
Vice Chair, International Roadmap for Devices and Systems
Schools of CS & ECE, Georgia Institute of Technology
A history of modern computing: How we got here
1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360), Moore #1
1975: Moore’s Law update; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins
In 1995, wire delays impact pipelining: Superscalar begins
[Figure: processor performance compared with Moore’s law]
Source: Sanjay Patel, UIUC (used with permission)
We hid parallelism extraction with Superscalar Processor Microarchitectures
[Figure: superscalar pipeline. Instruction Fetch (Instruction Cache, Branch predictor), Decode & Dispatch, Schedule (reorder instructions, issue N independent instructions), Execute in parallel (multiple ALUs, register file, Data Cache)]
Very few of these “tricks” are energy efficient
Hidden by Dennard scaling, until that ended
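The "issue N independent instructions" step can be illustrated with a toy scheduler. This is a minimal sketch, not a model of any real processor: the `Instr` type, the register names, and the greedy policy are hypothetical, each register is assumed to be written only once (as if already renamed), and every instruction is assumed to take one cycle.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written by this instruction
    srcs: tuple    # registers read by this instruction


def schedule(program, width):
    """Greedy dataflow issue: each cycle, issue up to `width` pending
    instructions whose sources were all produced in earlier cycles."""
    written = {ins.dest for ins in program}  # registers produced inside the program
    done = set()                             # results available from earlier cycles
    pending = list(program)
    cycles = []
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == width:
                break
            # A source is ready if it is a program input or already computed.
            if all(s in done or s not in written for s in ins.srcs):
                issued.append(ins)
        if not issued:
            raise ValueError("no instruction can issue: dependency never satisfied")
        for ins in issued:
            pending.remove(ins)
        done.update(ins.dest for ins in issued)
        cycles.append(issued)
    return cycles


program = [
    Instr("r1", ("x", "y")),
    Instr("r2", ("x", "z")),
    Instr("r3", ("r1", "r2")),   # depends on r1 and r2
    Instr("r4", ("r3", "x")),    # depends on r3
    Instr("r5", ("y", "z")),     # independent, can be reordered earlier
]

for width in (1, 2):
    plan = schedule(program, width)
    print(width, [[ins.dest for ins in cycle] for cycle in plan])
```

With an issue width of 2, the independent instruction `r5` is pulled ahead of `r4`, which is the reordering the hardware performs without any change to the program.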
How we got here, part 2
1945: Von Neumann’s EDVAC (draft) report
1955: Manchester Transistor Computer, IBM 709T
1965: Software industry begins (IBM 360)
1975: Moore’s Law; Dennard’s geometric scaling rule
1985: “Killer micros”: HPC hitches a ride on Moore’s law
1995: Slowdown in CMOS wires: superscalar era begins
2005: The Power Wall: Single thread exponential scaling ends (Intel Prescott) …
Multicore era begins
Dilemma: Could not clock a single core aggressively AND continued to get more transistors per chip
Solution: Clock multiple cores conservatively
Why IEEE? Encompasses the whole computing stack
IEEE Rebooting Computing
Council on Electronic Design Automation
Circuits & Systems Society
Goal: Rethink Everything: Turing & Von Neumann to now
Co-Chairs: Elie Track, Tom Conte, Erik DeBenedictis, Dejan Milojicic, Bruce Kraemer
IEEE Rebooting Computing
Summit 1: 2013 Dec. 12-13 (summary online)
– Invitation only
– Three Pillars:
  – Energy Efficiency
  – Security
  – Applications/HCI
[Figure: Rebooting Computing supported by three pillars: Applications/HCI, Security, Energy Efficiency]
IEEE Rebooting Computing
Summit 2: 2014 May 14-16
– Engines of Computation
  – Adiabatic/Reversible Computing
  – Approximate Computing
  – Neuromorphic Computing
  – Augmentation of CMOS
[Figure: Rebooting Computing supported by three pillars (Applications/HCI, Security, Energy Efficiency), with the Engine Room added]
IEEE Rebooting Computing
Summit 3: 2014 Oct. 23-24
– Algorithms and Architectures
ITRS joins forces with RCI
[Figure: Rebooting Computing supported by three pillars (Applications/HCI, Security, Energy Efficiency), with the Engine Room and Algorithms & Architectures added]
IEEE Rebooting Computing
Summit 4: 2015 Dec. 10-11
Goal: coordinating efforts between:
– Industry (HP, Intel, NVIDIA)
– US: DOE, DARPA, IARPA, NSF
Goal 2: How to roadmap the future
[Figure: Rebooting Computing supported by three pillars (Applications/HCI, Security, Energy Efficiency), with the Engine Room and Algorithms & Architectures]
Moving forward…
RCI: “Software drives the computer industry”
Questions for the computer industry:
– How valuable is legacy software?
– What computing resources do the emerging applications need?
– How long, and how much investment, will it take to train a new generation of programmers?
Degrees of Pain Vs. Gain…
[Figure: Potential Approaches vs. Disruption in Computing Stack. Stack layers shown: logic, device, FU, Microarchitecture, ISA, Architecture, API, Language, Algorithm. Approaches: “More Moore” (Level 1), Hidden changes (Level 2), Architectural changes (Level 3), Non-von Neumann computing (Level 4). Legend: Level 1 = No Disruption through Level 4 = Total Disruption]
Level 1: More Moore
Software impact: Legacy code works without issue
New switch candidates:
– Logic examples: Tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM, …
More Moore: A better switch?
Courtesy Dimitri Nikonov and Ian Young
CMOS Device structure evolution – IRDS 2017 MM chapter
FinFET: still the leading device option until 2021
Lateral Gate-All-Around (LGAA) is expected to be introduced in 2021
Beyond 2024: 3D stacking needed for functional scaling
[Figure: device structure by node generation, showing FDSOI, FinFET, Lateral GAA (L-Nanowire, L-Nanosheet), Vertical GAA (VGAA), and Sequential 3D; column labels as extracted: “N7: 2017-2019”, “N7: 2019-2021”, “N5: 2021-2024”, “>N3: >2024”]
Level 1: More Moore
Software impact: Legacy code works without issue
New switch candidates:
– Logic examples: Tunneling FET, CNFET, superconducting electronics
– Memory examples: MRAM, memristor, PCM
Predictions:
– Industry will go to monolithic 3D
– Moore’s law* won’t end for a while (*if correctly defined)
Level 2: Not CMOS, but hidden
Software impact: Legacy code works, but may require performance tuning
Lessons learned from superscalar in 1995
Next: Microarchitectural changes to
– Use unreliable switch logic, and/or
– Use cryogenic superconducting
– Reversible computing
CPU Trends
Vdd hasn’t reduced much below 1 V because devices become unreliable
• Power ∝ $V^2 f$
• Therefore, reduce supply voltage.
• But…
Source: ITRS / Asif Khan, PhD Thesis, University of California, Berkeley, 2015
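The proportionality between power, supply voltage, and frequency can be made concrete with a small calculation. This is a rough sketch only: the function name, the constant `c`, and the operating points (1.0 V at 4 GHz versus two cores at 0.8 V and 2 GHz) are hypothetical values chosen for illustration, and static (leakage) power is ignored.

```python
def dynamic_power(cores, vdd, freq_ghz, c=1.0):
    """Relative dynamic power, using Power proportional to V^2 * f per core.

    `c` lumps together switched capacitance and activity (arbitrary units).
    """
    return cores * c * vdd ** 2 * freq_ghz


# Hypothetical operating points, for illustration only.
single = dynamic_power(cores=1, vdd=1.0, freq_ghz=4.0)   # one aggressively clocked core
dual = dynamic_power(cores=2, vdd=0.8, freq_ghz=2.0)     # two conservatively clocked cores

print(f"single core: {single:.2f} (relative units)")
print(f"two cores:   {dual:.2f} (relative units)")
print(f"ratio:       {dual / single:.2f}")
```

Both configurations offer the same aggregate clock rate (4 GHz in total), but the lower-voltage pair draws about two thirds of the power in this model. The benefit disappears if the voltage cannot be lowered, which is the reliability problem the slide points to.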
Computational Error Correction
• Traditional coding fixes errors in data stored or transmitted, not in computation
• Redundancy can be in space and/or time. Tradeoffs.
• What if there are errors in the control path?
  • Bypass logic, instruction decode
[Figure: Proof-of-concept RRNS Core]
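One way to detect errors in the computation itself, rather than only in stored data, is a redundant residue number system, which is assumed here to be what "RRNS" stands for. The sketch below is not the proof-of-concept core from the slide: the moduli, function names, and injected fault are hypothetical. Values are kept as residues, arithmetic is done independently per residue, and extra (redundant) moduli expose a corrupted residue because the decoded value falls outside the legitimate range.

```python
from math import prod

BASE = (3, 5, 7)           # non-redundant moduli: legitimate range is [0, 105)
REDUNDANT = (11, 13)       # extra moduli used only for checking (assumed values)
MODULI = BASE + REDUNDANT
LEGIT_RANGE = prod(BASE)


def encode(x):
    """Represent x by its residue for every modulus."""
    return tuple(x % m for m in MODULI)


def add(a, b):
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def mul(a, b):
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def decode(residues):
    """Chinese Remainder Theorem reconstruction over all moduli."""
    total = prod(MODULI)
    value = 0
    for r, m in zip(residues, MODULI):
        partial = total // m
        value += r * partial * pow(partial, -1, m)   # modular inverse (Python 3.8+)
    return value % total


def checked_decode(residues):
    """Return the value, or None if it lies outside the legitimate range."""
    value = decode(residues)
    return value if value < LEGIT_RANGE else None


product = mul(encode(6), encode(7))
print(checked_decode(product))            # result computed entirely in residue form

faulty = list(product)
faulty[1] = (faulty[1] + 1) % MODULI[1]   # inject a single-residue fault
print(checked_decode(tuple(faulty)))      # fault is flagged as None
```

The check only holds while the true result stays below the legitimate range (105 with these moduli), so a real design would size the moduli to the word width it needs.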
Superconducting: smaller, lower power, same performance
            | Supercomputer Titan at ORNL (#2 of Top500)             | Superconducting Supercomputer        | Ratio
Performance | 17.6 PFLOP/s (#2 in world*)                            | 20 PFLOP/s                           | ~1x
Memory      | 710 TB (0.04 B/FLOPS)                                  | 5 PB (0.25 B/FLOPS)                  | 7x
Power       | 8,200 kW avg. (not included: cooling, storage memory)  | 80 kW total power (includes cooling) | 0.01x
Space       | 4,350 ft2 (404 m2, not including cooling)              | ~200 ft2 (includes cooling)          | 0.05x
Cooling     | additional power, space and infrastructure required    | All cooling shown                    |
[Figure: same-scale comparison (2’ x 2’)]
Courtesy of M. Manheimer, IARPA Cryogenic Computing Complexity (C3) Program
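The ratios in the comparison can be rechecked directly from the quoted figures. The short sketch below uses only the numbers given for Titan and for the superconducting design; the dictionary keys are hypothetical names, 5 PB is taken as 5,000 TB, and the derived GFLOPS-per-watt figure is an illustration rather than a number from the source.

```python
titan = {"pflops": 17.6, "memory_tb": 710, "power_kw": 8200, "space_ft2": 4350}
superconducting = {"pflops": 20.0, "memory_tb": 5000, "power_kw": 80, "space_ft2": 200}

# Ratio of the superconducting design to Titan for each quantity.
ratios = {key: superconducting[key] / titan[key] for key in titan}
for key, value in ratios.items():
    print(f"{key:10s} {value:.3f}x")


def gflops_per_watt(system):
    # 1 PFLOP/s = 1e6 GFLOP/s and 1 kW = 1e3 W
    return system["pflops"] * 1e6 / (system["power_kw"] * 1e3)


print(f"Titan:           {gflops_per_watt(titan):.1f} GFLOPS/W")
print(f"Superconducting: {gflops_per_watt(superconducting):.1f} GFLOPS/W")
```

Because the Titan power figure excludes cooling while the superconducting figure includes it, the efficiency gap computed this way is, if anything, understated.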
Level 2: Not CMOS, but hidden
Software impact: Legacy code works, but may require performance tuning
Lessons learned from superscalar in 1995
Next: Microarchitectural changes to
– Use unreliable switch logic, and/or
– Use cryogenic superconducting
– Reversible computing
Potential to make exascale supercomputers orders of magnitude lower power
Key is co-design of devices and architectures
Level 3: Architectural changes
Software impact: new programming required
GPU already an example of this
– Inexpensive parallelism available, but need to reprogram to use it
Use special-purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing
And/or use memory-centric architectures (e.g., Emu, The Machine) to move the computation to the data
Accelerators (and reconfigurable)
Idea has been around for a long time
– IBM 7030 Project STRETCH attached stream processor (Harvest) in 1961
– Various FP accelerators for minicomputers in 70s/80s (FPS-164)
Speedup via “gate-level parallelism”
– Hardware duplication to support computation
Energy savings via elimination of instruction fetch & decode
Programming options: Compiler extraction, APIs, DSLs
Performance Trends in Machine Learning
[Figure: performance trend plot. Trendline: 1.9x per year]
From: IRDS Applications Benchmarking chapter
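To get a feel for what a 1.9x-per-year trendline implies, the sketch below compounds it over a decade and converts it to a doubling time. Only the 1.9x figure comes from the slide; the function names are hypothetical, and the comparison rate of 2x every two years is an assumption introduced for contrast.

```python
import math

ML_RATE_PER_YEAR = 1.9                  # trendline quoted on the slide
REFERENCE_RATE_PER_YEAR = 2 ** 0.5      # assumed comparison: 2x every two years


def growth(rate_per_year, years):
    """Cumulative improvement after compounding for `years`."""
    return rate_per_year ** years


def doubling_time(rate_per_year):
    """Years needed to double at a constant annual rate."""
    return math.log(2) / math.log(rate_per_year)


print(f"doubling time at 1.9x/year: {doubling_time(ML_RATE_PER_YEAR):.2f} years")
print(f"10-year gain at 1.9x/year:  {growth(ML_RATE_PER_YEAR, 10):.0f}x")
print(f"10-year gain at 2x/2 years: {growth(REFERENCE_RATE_PER_YEAR, 10):.0f}x")
```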
Approximate computing
Building acceptable systems out of unreliable/inaccurate hardware and software components
Many uses:
– Most start and/or end with human perception (images, video, control, etc.) or near-optimal search
[Figure: output accuracy vs. efficiency and performance]
Approximate computing challenges
Algorithms & programming languages
– Work continues here
Ensuring quality of output
– Step function: great… good… good-ish… ok… unacceptable
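The tradeoff between output quality and work done can be shown with loop perforation, one well-known approximate-computing technique (chosen here for illustration; the slides do not name a specific method). The sketch skips a fraction of loop iterations and then checks the result against a quality threshold. The data set, stride, and 1% threshold are all hypothetical.

```python
def mean_exact(values):
    """Reference result: visit every element."""
    return sum(values) / len(values)


def mean_perforated(values, stride):
    """Approximate result: visit only every `stride`-th element."""
    sampled = values[::stride]
    return sum(sampled) / len(sampled)


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)


data = [float(i) for i in range(1000)]   # hypothetical input
exact = mean_exact(data)
approx = mean_perforated(data, stride=4)  # roughly one quarter of the work
error = relative_error(approx, exact)

QUALITY_THRESHOLD = 0.01                  # assumed acceptable relative error
print(f"exact={exact}, approx={approx}, error={error:.4%}")
print("acceptable" if error <= QUALITY_THRESHOLD else "unacceptable")
```

The hard part flagged on the slide is the last line: deciding, per application, where output quality stops being acceptable as the stride grows.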
Level 3: Architectural changes
Software impact: new programming required
GPU already an example of this
– Inexpensive parallelism available, but need to reprogram to use it
Use special-purpose accelerators for critical kernels, digital neuromorphic, etc.
Approximate computing
And/or use memory-centric architectures (e.g., Emu, The Machine) to move the computation to the data
Architectures can be built now; software and programmers are the challenge
Level 4: Non-von Neumann
1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing
Native Neuromorphic
Direct analog (memristor, etc.) neuromorphic has orders of magnitude better energy efficiency than digital approaches
Virtuous cycle of neuroscience informing neuromorphic, and neuromorphic serving as a modeling platform to advance neuroscience
[Figure: cycle between neuromorphic computing and neuroscience research]
Neuromorphic challenges
Guarantees
– Quality of results, quality of service, reliability, etc.
Security concerns:
– For example: neuro used for authentication and intruder identity trained into network
  – Virtually impossible to detect tampering
Learning
– Supervised learning (today) has two phases: training and inferencing (use)
  – Training is highly computationally expensive
– Unsupervised learning is maturing
Quantum
Two varieties: gate-based and quantum annealing
Quantum annealing (e.g., D-Wave)
– Convergence time is a function of noise floor
– Classical annealing may be more power efficient
Gate-level quantum
– Many proposed qubit devices: quantum dots, transmon, ion trap, etc.
– Current coherence times: 10-100 µs
  – Need to be several orders of magnitude longer
– Solution: Redundancy: 1 virtual qubit = 1000 physical qubits
– Power needs per virtual qubit: ~10 kW
  – Most of the power is for waveform generators, interfacing
  – Cooling is a small percentage of the power
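The redundancy and power figures above scale linearly with the number of virtual qubits, which a few lines of arithmetic make explicit. This is a back-of-the-envelope sketch: the two constants are the slide's estimates, while the function name and the example machine size of 100 virtual qubits are hypothetical.

```python
PHYSICAL_PER_VIRTUAL = 1000   # slide estimate: 1 virtual qubit = 1000 physical qubits
KW_PER_VIRTUAL = 10.0         # slide estimate: ~10 kW per virtual qubit


def quantum_budget(virtual_qubits):
    """Return (physical qubits, power in kW) for a given number of virtual qubits."""
    physical = virtual_qubits * PHYSICAL_PER_VIRTUAL
    power_kw = virtual_qubits * KW_PER_VIRTUAL
    return physical, power_kw


physical, power_kw = quantum_budget(100)   # hypothetical machine size
print(f"{physical:,} physical qubits, {power_kw / 1000:.1f} MW")
```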
Is there a third way?
Non-neuromorphic, non-quantum, non-von-Neumann computing?
Potentials:
– Massive memorization (e.g., HPE The Machine)
– Analog(-ous) computing / Thermodynamic computing
Courtesy: Todd Hylton, UCSD
Level 4: Non-von Neumann
1. Quantum: gate-based or quantum annealing
2. Analog neuromorphic
3. Others: coupled oscillators, stateful devices (memristors, spintronics, etc.), analog computing
System software nonexistent
Very immature, risky technology
Large investments needed
IEEE Rebooting Computing: Summary
Levels of RC:
1. More Moore (New switch/3D)
2. Microarchitecture changes
3. Architecture changes
4. Non-von Neumann
Direct pain / gain tradeoffs
New software R&D desperately needed
IRDS: Applications-driven roadmapping is identifying needed devices
rebootingcomputing.ieee.org irds.ieee.org
icrc.ieee.org