Supercomputing and Mass Market Desktops
John Manferdelli
Microsoft Corporation
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this
summary.
Capabilities on the rise for the end user
Hardware Paradigm Shift - 2004
“… we see a very significant shift in what architectures will look like in the future
...
fundamentally the way we've begun to look at doing that is to move from
instruction level concurrency to … multiple cores per die. But we're going to
continue to go beyond there. And that just won't be in our server lines in the
future; this will permeate every architecture that we build. All will have massively
multicore implementations.”
Intel Developer Forum, Spring 2004
Pat Gelsinger
Chief Technology Officer, Senior Vice President
Intel Corporation
February 19, 2004
[Figure: Power density (W/cm2) of Intel processors, log scale from 1 to 10,000, plotted from '70 to '10. Processors shown: 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processors. Reference levels: Hot Plate, Nuclear Reactor, Rocket Nozzle, Sun's Surface.]
Intel Developer Forum, Spring 2004 - Pat Gelsinger
Today's Architecture: Memory access speed not
keeping up with CPU clock speeds
Modern Microprocessors - Jason Patterson
[Figure: Speed (MHz), log scale from 10 to 10,000, plotted from 1990 to 2004. Series: CPU Clock Speed, DRAM Access Speed.]
Today's Architecture: Heat becoming an unmanageable problem!
Intel Cancels Top-Speed Pentium 4 Chip
Thu Oct 14, 6:50 PM ET, Technology - Reuters
By Daniel Sorid
Intel …canceled plans to introduce its highest-speed desktop
computer chip, ending for now a 25-year run that has seen the
speeds of Intel's microprocessors increase by more than 750
times.
Memory Wall: ~90 cycles of the CPU clock to access main memory!
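To see why a ~90-cycle trip to main memory matters, a rough average-memory-access-time estimate helps. The sketch below takes the 90-cycle penalty from the slide; the cache hit time and the hit rates are illustrative assumptions, not figures from the presentation.

```python
# Rough average memory access time (AMAT) estimate, in CPU cycles.
# MEMORY_PENALTY_CYCLES comes from the slide (~90 cycles to main memory).
# CACHE_HIT_CYCLES and the hit rates used below are assumed for illustration.

MEMORY_PENALTY_CYCLES = 90  # from the slide
CACHE_HIT_CYCLES = 2        # assumption


def average_access_cycles(hit_rate: float) -> float:
    """AMAT = hit time + miss rate * miss penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return CACHE_HIT_CYCLES + miss_rate * MEMORY_PENALTY_CYCLES


if __name__ == "__main__":
    for hit_rate in (0.99, 0.95, 0.90):
        cycles = average_access_cycles(hit_rate)
        print(f"hit rate {hit_rate:.0%}: ~{cycles:.1f} cycles per access")
```

Under these assumptions, a cache hit rate falling from 99% to 90% raises the average cost from about 2.9 to about 11 cycles per access.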
Trends
[Figure: Projected scaling by process generation, log scale from 1 to 65536. X-axis: 2004 90nm, 2006 65nm, 2008 45nm, 2010 32nm, 2012 22nm, 2015 16nm. Series: max transistors (millions, 500 mm2 die); potential many-core peak parallel GOPS; multi-core single threaded GOPS; memory bandwidth, novel packaging (10s GB/s); memory bandwidth, DIMMs (10s GB/s). Annotations: max transistors ↑40%/year; single thread perf ↑10%/year; 32 LPIA cores @ 30 GFLOPS; Montecito; MCM; stacked dice DRAM; Parallelism Opportunity; Bandwidth Opportunity; 8X; 80X; 25X.]
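The two annual rates quoted on the Trends chart (transistors up 40% per year, single-thread performance up 10% per year) diverge quickly once compounded. The following is a small worked illustration over the 2004 to 2015 span of the chart; the function name is hypothetical and the calculation uses only those two rates.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Return the total growth factor after compounding annual_rate for `years` years."""
    return (1.0 + annual_rate) ** years


YEARS = 2015 - 2004  # span covered by the chart: 11 years

transistor_factor = compound_growth(0.40, YEARS)     # max transistors, +40%/year
single_thread_factor = compound_growth(0.10, YEARS)  # single-thread perf, +10%/year
gap = transistor_factor / single_thread_factor

if __name__ == "__main__":
    print(f"transistors:   ~{transistor_factor:.1f}x")
    print(f"single thread: ~{single_thread_factor:.2f}x")
    print(f"gap:           ~{gap:.1f}x")
```

Over 11 years the transistor budget grows roughly 40-fold while single-thread performance grows less than 3-fold; that widening gap is the capacity that only parallelism can use.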
[Figure: Three example many-core chip floorplans assembled from LPIA x86 cores, 1 MB cache blocks, DRAM controllers, PCIe controllers and NoC links. The Server floorplan also shows a custom acceleration block; the Desktop floorplan also shows OoO x86 cores.]
Ultra-Mobile: 40 mm2, 5 W, $50
Server: 350 mm2, 120 W, $2000
Desktop: 200 mm2, 100 W, $400
Lots more computing power
Los Alamos Computing Center? No, a single CPU chip.
(2008 45 nm process)
Many Core in a Nutshell
1. We believe user experiences will benefit from 100-fold improvements in computational power.
2. Because of physics, you won’t get this power from frequency scaling or from programmer-transparent hardware techniques like ILP. The only way is parallelism.
3. Chips with many heterogeneous cores can be manufactured now. Only with chips like this can such performance scaling continue.
4. There are difficult but solvable issues related to memory bandwidth and I/O that must be addressed for this to work well overall.
5. There are other benefits from such chips, like good power and manufacturing characteristics.
6. Application of such hardware does not appear to be limited by any intrinsic lack of parallel algorithms.
Manycore Induced Hard Problems
Constructing parallel applications
• Encapsulating parallelism in reusable components
• Integrating concurrency & coordination into existing programming languages
• Raising the semantic level to eliminate sequencing
• Reducing the complexity of debugging, tuning and testing
Executing fine-grain concurrent applications
• Managing large amounts of potential concurrency
• Supporting lightweight transactions
• Interoperating with legacy thread model and interfaces
• Evolving hardware to effectively support parallel programs
Coordination of system resources and services
• Assigning resources securely
• Hosting concurrent operating environments
• Managing resources cooperatively
• Providing concurrent system services
• Managing heterogeneous resources
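As one illustration of "encapsulating parallelism in reusable components", the sketch below hides all thread management behind a single function, so callers write ordinary sequential-looking code. The names `parallel_map` and `brightness` are hypothetical, and Python's standard `concurrent.futures` module stands in for whatever runtime a many-core platform would actually provide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(func: Callable[[T], R], items: Iterable[T], workers: int = 4) -> List[R]:
    """Apply func to every item concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map schedules the calls on worker threads and yields
        # results in the same order as the inputs.
        return list(pool.map(func, items))


def brightness(pixel):
    """Example per-item work: average of an (r, g, b) triple."""
    r, g, b = pixel
    return (r + g + b) // 3


if __name__ == "__main__":
    pixels = [(255, 0, 0), (10, 20, 30), (90, 90, 90)]
    print(parallel_map(brightness, pixels))  # [85, 20, 90]
```

The caller never touches a thread or a lock, which is the point of the component. In CPython, threads do not speed up CPU-bound work; `ProcessPoolExecutor` offers the same `map` interface when real multi-core execution is needed.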
Manycore Stack
Constructing Parallel Applications
Executing Fine-Grain Parallel Applications
Coordinating System Resources and Services
Applications
Libraries
Languages, Compilers and Tools
Concurrency Runtime
Partitioning Hypervisor
Hardware
OS Kernel
Developer Impact
New Languages: Safety & Portability, Streaming, Data Parallel, Functional Parallel
Existing Languages: Data Parallel Extensions, Transactions, API for messaging
Programming Abstractions: Asynchronous Agents, Concurrent Collections, Transactions
Infrastructure: Transaction & Parallel Safe, Numerical Libraries, Declarative Languages
Developer Tools: Visual Designers (Agents, Dataflow), Diagnostics (Debugging & Performance Tuning), Testing & Verification
Simulating The Physical World
Gameplay Simulation
Life Sciences
Multidisciplinary Research
New Materials, Technologies and Processes
Computer and Information Sciences
Math and Physical Science
Social Sciences
Earth Sciences
Computational Modeling
Sensors
Interpretation & Insight
Persist
Data Mining & Algorithms
A transformative example: Modeling of the instrumented world
First Responder Scenario (empowering users to make good decisions)
First responders need to quickly and safely experience unfamiliar areas under different disaster scenarios before they deploy into them.
For example, earthquakes in San Francisco.
Setup Environment
Explore Possibilities
Monitor and Take Action
Scenario Features Tech Domains
Modeling the instrumented world
Scenario: First responders need to quickly and safely experience unfamiliar areas under different disaster scenarios before they deploy into them.
Technology Domains: To realize this scenario, we need to compose these technologies and accelerate their performance at least 10-fold on the manycore client.
Technology domains (table columns): Machine Learning, Math, Physics, Statistics, Image Processing, 3D Modeling, Rendering, Database, Query engine, Index engine, Data compression, Motion Input
Stitching scenes together (2D, 3D) P P P P P
Object recognition in scene P P P P
Interaction (navigation) in scene P P P P P P P
Linking scene objects to live data P P P P P
Filling in missing information P P P P P P P
Image and video correction P P P P P P
Long running scene statistics P P P P P
Personalization of content P P P
Simulating physical events P P P P P
Real-time scene update P P P P P P P P P P
Tech Domains Platform
Technology Domains: To realize this scenario, we need to compose these technologies and accelerate their performance at least 10-fold on the manycore client.
Platform: To compose technology domains and accelerate their performance 10-fold, we need to build the following platform components.
Platform components (table columns): Integrating concurrency into existing languages; Encapsulating parallelism in reusable modules; Tools for debugging and profiling parallel code; Support for fine-grain data parallelism; Support for lightweight transactions; Optimization of available bandwidth; System-wide dynamic resource management; Connectivity of services; Managing manycore hardware platform
Machine Learning P P P P P P P P
Math P P P P
Physics P P P P P P P
Statistics P P P P P
Image Processing P P P P P P P P P
3D Modeling P P P P P P P P
Rendering P P P P P P P P P
Database P P P P P P P P P
Query engine P P P P P P P P P
Data compression P P P P
Gesture/Motion Input P P P P P P P
Application/Libraries Architecture
Applications: Personal Assistant, Information Management, Natural UI, Games & Entertainment, Program Development, Biology/Health care, Technical Computing, Business Intelligence, System Design, Robotics
Application Services: Speech Engine, Gaming (Physics/AI), Unified Communications, Database, Vision, Machine Learning, Semantic Processing
Domain Specific Libraries: Single domain physics, search algorithms, audio/video characterization and extraction, biology simulations, optimization and constraint resolution
Base Libraries: Common data structures and algorithms: trees, graphs, tables, sorting, traversal.
System Architecture
[Figure: System architecture block diagram. Labels: Hypervisor; CPUs, Memory, TPM; Disk, Net, Keyboard, Mouse, Screen; Trusted I/O; Device/Virtualization Service Partition; Virtual Device; Machine Management Partition; Paravirtualized OS Partition; Many-Core Compute Partition; Application; SysService API; SysService; ConcRT; Kernel; Device proxy.]
Resource Management between OS/Runtime and VM/OS (under Machine Management)
Isolation and cooperative device support provided by VM/OS
Scalable Services provided by OS (async, cancellable, thread affinity free)
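To make the "async, cancellable" service style concrete, here is a minimal sketch using Python's standard `asyncio` module. It is an analogy only: `read_sensor` and `call_with_cancel` are hypothetical names, and nothing here represents the actual OS service API described in the presentation.

```python
import asyncio


async def read_sensor(delay: float) -> str:
    """Hypothetical slow service call that tolerates being cancelled."""
    try:
        await asyncio.sleep(delay)  # stands in for real I/O
        return "reading"
    except asyncio.CancelledError:
        # Release any resources here, then let the cancellation propagate.
        raise


async def call_with_cancel(delay: float, give_up_after: float) -> str:
    """Start the service call without blocking, and cancel it if it is too slow."""
    task = asyncio.create_task(read_sensor(delay))
    done, _pending = await asyncio.wait({task}, timeout=give_up_after)
    if task in done:
        return task.result()
    task.cancel()
    try:
        await task  # wait for the cancellation to finish
    except asyncio.CancelledError:
        pass
    return "cancelled"


if __name__ == "__main__":
    print(asyncio.run(call_with_cancel(delay=0.01, give_up_after=1.0)))  # reading
    print(asyncio.run(call_with_cancel(delay=1.0, give_up_after=0.01)))  # cancelled
```

The caller is never blocked on the service and can withdraw a request it no longer needs, which keeps many concurrent requests from tying up execution resources.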