Feb 23, 2011
Gans Srinivasa
Intel Corp
Thanks to: Ravi Iyer, Scott Hahn, Bhushan Chitlur
Doug Carmean, Pranav Mehta, Alon Naveh, Henry Gabb
Power-Performance Uplift Using Hetero General compute & Domain Specific Areas
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,
EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice.
All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Intel, Intel Inside, Xeon, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2011 Intel Corporation.
Agenda
Power-Performance
Servers to Clients
Heterogeneous Architecture
The Future Challenges
Energy Efficient Computing: Overall Processor Power
Transition to Multicore
Source: Fred Pollack, Keynote – MICRO’32
Parallelization
UC PAR Lab Presentation – Krste Asanovic – May 24, 2010
Multicore Parts
Intel 6C
What is Next?
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. All products, dates, and figures are preliminary and are subject to change without any notice.
Power Constrained Era
Trends in “The Cloud”
Rapid Cloud Growth: Reduces the hassle for users and providers.
User needs only a browser: No worry about software installation, patches etc
HW & Service Providers: Focus on Total Cost of Ownership
SW Providers: Apps in a controlled environment and no shrink-wrap worries
Trend: Cloud needs more Energy efficiency, more communication bandwidth, lower TCO.
Today we have homogeneous multicore. Scaling will be limited by power and area.
Power efficient small cores have emerged and show potential for power efficient performance.
What could we do spanning Servers to Clients/Mobile Devices?
SW Service Providers: Search, Apps
WALL POWERED: 30–35% of data center operating cost is power
USERS: Less hassle on SW and HW front
Future workloads demand more performance and are widely different*
COMM: Untrusted (Internet), Trusted (Enterprise); 10GbE onwards
Mobile Client*: Dominant application platform
Thinner form factors and longer battery life: “whole day” streaming/media/audio workloads
* http://parlab.eecs.berkeley.edu/publication/137; http://parlab.eecs.berkeley.edu/
Current Data Center
[Figure: Projected Data Center Energy Use, in billions of kWh/year (0 to 140), for 2000 through 2011. Series: historical trends; current efficiency trends (forecast). Annotations: 0.8% of total U.S. electricity usage; 1.5% of total U.S. electricity usage; 2.9% of projected total U.S. electricity use.]
EPA study for Congress on Data Center Consumption
[Figure: EP Xeon Server Node Today. Labeled components: Disk, NIC, QPI, PCIe, DDR, Ethernet.]
[Figure: Data Center network spanning Layer 3 and Layer 2, with nodes labeled R, S, LB, and A. A single Layer 2 domain. A = Rack of 40 Servers.]
Source – NY Times, 06.14.2006
System: Acer AR160 F1 (Intel Xeon X5670, 2.93 GHz)
Form factor: 1U
CPU: Intel Xeon X5670, Six-Core, 2.93 GHz, 12 MB L3 cache
CPU frequency (MHz): 2933
CPUs enabled: 12 cores, 2 chips, 6 cores/chip
Hardware threads: 24 (2 / core)
Date: 3-Nov-10
Power: Line-powered
Many “modern day data centers” operate around 20% load.
Reducing idle to 50 W at 10¢/kWh for 10,000 nodes saves about ¼ million $/yr in energy cost. The potential is very high.
How can we go beyond this and achieve higher power-performance efficiency?
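The savings estimate above can be sanity-checked with simple arithmetic. The slide does not state the baseline idle power, so this minimal sketch works backwards: it solves for the per-node power reduction implied by the quoted savings. It assumes nodes are powered for all 8,760 hours of the year; the function names are illustrative, not from the presentation.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: nodes stay powered all year


def annual_savings_usd(watts_saved_per_node, nodes, usd_per_kwh):
    """Yearly energy-cost savings when every node draws this much less power."""
    kwh_saved = watts_saved_per_node / 1000 * HOURS_PER_YEAR * nodes
    return kwh_saved * usd_per_kwh


def watts_for_savings(target_usd, nodes, usd_per_kwh):
    """Per-node power reduction (W) needed to reach a yearly savings target."""
    return target_usd / (usd_per_kwh * HOURS_PER_YEAR * nodes) * 1000


# Figures from the slide: 10,000 nodes, 10 cents per kWh, about $250,000 per year.
implied_watts = watts_for_savings(250_000, nodes=10_000, usd_per_kwh=0.10)
savings_at_30w = annual_savings_usd(30, nodes=10_000, usd_per_kwh=0.10)
```

Under these assumptions the quoted ¼ million $/yr corresponds to shaving a little under 30 W of idle power from each node.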
[Figure: Average Active Power (W), 0 to 300 W, versus load from 0% to 100%]
http://www.spec.org/power_ssj2008/results/res2010q4/power_ssj2008-20101012-00300.html
Heterogeneous Space
Workloads are not homogeneous
Energy efficiency increasingly important
Frequency to Multicore play is running out of steam
What is Next?
Heterogeneous architectures (mix of different cores and accelerators)
Parameters       Core2 Quad** (Woodcrest)   Atom** (Diamondville)
Frequency        2 GHz                      1.6 GHz
L1 D-Cache*      32 KB 8-way                24 KB 6-way
L2 Cache*        512 KB 2-way               512 KB 8-way
DRAM Bandwidth   ~3400 MB/sec               ~3400 MB/sec
Big & Small cores Research
Recent measurements (SPECcpu and Bio workloads)
Comparison of Atom to Core 2 Duo
Observations: Varying perf difference (1.03x – 2.63x) across applications
Shows the use case for heterogeneity
Knowledge of performance/power difference can help in scheduling
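One way to picture how knowledge of the performance difference could guide scheduling is a simple ranking policy: give the scarce big cores to the applications that gain the most from them. This is a minimal sketch, not the scheduler used in the measurements; the application names and ratios are hypothetical values chosen within the 1.03x to 2.63x range reported above.

```python
def assign_cores(big_over_small, n_big):
    """Place the n_big apps with the highest big/small speedup on big cores.

    big_over_small maps an app name to its big-core speedup over a small core.
    """
    ranked = sorted(big_over_small, key=big_over_small.get, reverse=True)
    return {app: ("big" if i < n_big else "small") for i, app in enumerate(ranked)}


# Hypothetical per-application performance ratios (big core / small core).
ratios = {"app_a": 2.63, "app_b": 1.03, "app_c": 1.80, "app_d": 1.20}
placement = assign_cores(ratios, n_big=2)
```

Here `app_a` and `app_c` land on big cores, while `app_b` and `app_d`, which barely benefit from a big core, run on small ones.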
Validation: Small Core Perf = 50% of Big Core
[Figure: CPI (left axis, 0 to 10) and Performance Ratio (right axis, 0 to 3) per benchmark. Series: CPI Core2, CPI Atom, Performance ratio. Benchmark labels as extracted: eon, perlbmk, hmmer, perlbench, denbench, consumer, crafty, telecom, automotive, gap, sjeng, phylip2, vortex, gzip, gobmk, omnetpp, mcf, bzip2, parser, twolf, fasta1, bzip2, vpr, astar, networking, gcc, mcf, tigr.]
[Figure: DRAM and MCH connected to two Big cores and two Small cores, each shown with L1 and L2]
** To mimic a configuration where both cores are integrated into the same heterogeneous uncore architecture, we configured the core frequency, cache, and memory subsystem to be as close as possible, as shown in the table.
Hetero Core Solution Space
Multiple compute elements that are tightly coupled
– Big core, Atom, Smaller Cores and Hardware accelerators
– Sharing cache coherency and common memory
– Can be managed by a single OS or User level or Other means
Power is key
– Objective is to minimize energy per task
– Ideally, each task is performed in most efficient compute element
– Using a smaller core reduces power by an order of magnitude. Performance/Power improves.
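The energy-per-task objective can be made concrete with a small calculation, since energy is power multiplied by time. This is an illustrative sketch only: it combines the "order of magnitude" power reduction stated above with the "Small Core Perf = 50% Big Core" figure from the measurement slide, and the absolute power and time values are arbitrary placeholders.

```python
def energy_per_task(power_watts, task_seconds):
    """Energy in joules: average power multiplied by time to finish the task."""
    return power_watts * task_seconds


# Placeholder baseline for the big core (arbitrary units).
big_power, big_time = 10.0, 1.0

# Assumptions taken from the slides: 10x lower power, 50% of the performance.
small_power = big_power / 10
small_time = big_time / 0.5

big_energy = energy_per_task(big_power, big_time)
small_energy = energy_per_task(small_power, small_time)
energy_advantage = big_energy / small_energy
```

Under these assumptions the small core takes twice as long but uses one fifth of the energy per task, which is why placing each task on the most efficient element that still meets its performance needs matters.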
[Figure: compute elements labeled Big, Small, Tiny, Accel1, and Accel2]
With all big cores, scalability is a problem because of power wall
With all small cores, it is single-thread performance that suffers
With heterogeneous (big + small), we can strike a power/performance balance and
continue to scale
[Figure: Performance versus Apps range for a mix of Big, Small, and Tiny cores plus Accel3, over a common ISA with ISA extension A and ISA extension B]
Courtesy: Prof. Uri Weiser, Technion.
Heterogeneous Architecture Design Space
Performance asymmetry
– Cores have different performance and power
– E.g., asymmetric cache sizes, clock speeds, uarch
– Apps can run anywhere, but get different performance
Functional asymmetry
– Cores have different ISAs
– Difference can be in many dimensions
– Instructions, registers, data types, addressing modes, memory architecture, exception handling, I/O
– Can have various degrees of difference
Degree of functional asymmetry: Same ISA, Overlapping ISAs, Disjoint ISAs
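The three degrees of functional asymmetry can be expressed as a set comparison. The sketch below is a simplification that treats an ISA as a set of named features; the feature names are hypothetical, and a real comparison would also cover registers, data types, addressing modes, and the other dimensions listed above.

```python
def isa_relationship(isa_a, isa_b):
    """Classify two cores by how their instruction-set features relate."""
    a, b = set(isa_a), set(isa_b)
    if a == b:
        return "same ISA"
    if a & b:
        return "overlapping ISAs"
    return "disjoint ISAs"


# Hypothetical feature sets for three kinds of compute element.
big_core = {"base", "simd_ext"}
small_core = {"base"}
accelerator = {"accel_ops"}

big_vs_small = isa_relationship(big_core, small_core)
big_vs_accel = isa_relationship(big_core, accelerator)
```

A subset ISA, where the small core implements only part of what the big core does, falls under the overlapping case in this classification.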
Heterogeneous Architecture Space
Degree of asymmetry: Same ISA, Overlapping ISAs, Disjoint ISAs
– Homogeneous cores: same ISA, same uArch, shared resources (SMP)
– Same ISA, same uArch, platform resource differences (NUCA/NUMA)
– Cores with same ISA, same uArch executing at different speeds
– Cores with subset ISA, diff uArch executing at different speeds
– Cores with overlapping ISA, diff uArch (embedded HW accelerators, Xeon+MIC)
– Heterogeneous cores (Cell, IA/Gen): diff ISA, diff uArch
Software models: Traditional SMP OS; Hetero OS; Driver Model (e.g. OpenCL)
HeteroOS Research work @ Intel
Cost effective solution for ST and MT
Various scheduling options have been examined, and the work continues
Fault and migrate to support instruction-based asymmetry
Reference: Operating System Support for Overlapping-ISA Heterogeneous Multi-core Architectures – HPCA 2010
- Scott Hahn et al., Intel Corp.
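The fault-and-migrate idea can be illustrated with a toy simulation: a thread runs on a core until it reaches an instruction that core does not implement, the resulting fault is caught, and the thread is moved to a core that does implement it. This is a minimal sketch of the general idea, not the mechanism from the referenced paper; the `Core` class, the feature names, and the policy of staying on the big core after migrating are all assumptions.

```python
class UnsupportedInstruction(Exception):
    """Stands in for the hardware fault raised on an unimplemented instruction."""


class Core:
    def __init__(self, name, features):
        self.name = name
        self.features = frozenset(features)

    def execute(self, instr_feature):
        # A real core would trap; here we raise a Python exception instead.
        if instr_feature not in self.features:
            raise UnsupportedInstruction(instr_feature)


def run_with_fault_and_migrate(stream, start_core, fallback_core):
    """Run each instruction, migrating to fallback_core on the first fault."""
    core = start_core
    trace = []
    for feature in stream:
        try:
            core.execute(feature)
        except UnsupportedInstruction:
            core = fallback_core   # migrate the thread
            core.execute(feature)  # re-execute the faulting instruction
        trace.append(core.name)
    return trace


small = Core("small", {"base"})
big = Core("big", {"base", "ext"})
trace = run_with_fault_and_migrate(["base", "base", "ext", "base"], small, big)
```

In this sketch the thread stays on the big core once migrated. A fuller policy might move it back to a small core after some interval without extension instructions.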
[Figure: Speedup over stock Linux (0.8 to 1.7) for SPEC OMP2001, SPECjbb2005, Kernbench, and x264. Configurations: 4 big 4 small at 2.33 GHz, 2 GHz, 1.66 GHz, 1.33 GHz, and 1 GHz.]
[Figure: Normalized performance (0.6 to 1.6) for SPEC OMP2001, SPECjbb2005, Kernbench (parallel make), and x264 (parallel media conversion). Configurations: 2 big 2.66 GHz; 1 big 2.66 GHz with 4 small at 1 GHz, 1.33 GHz, 1.66 GHz, and 2 GHz.]
Performance comparisons for two big cores only vs. one
big core, and four small cores at different frequencies
Many Core and Multi-Core
In Intel® MIC, each core is smaller and lower power, has lower single thread
performance, but higher aggregate performance
Many core relies on a high degree of parallelism to compensate for the lower speed of
each individual core
Relatively few specialized applications today are highly parallel, but those applications
will benefit from Intel® MIC
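The trade-off between many slow cores and a few fast ones can be sketched with an Amdahl-style model. All the numbers below are hypothetical, chosen only to show the crossover: 6 cores at relative speed 1.0 stand in for a multi-core part, and 32 cores at relative speed 0.4 stand in for a many-core part. They are not measurements of any Intel product.

```python
def speedup(parallel_fraction, cores, per_core_perf):
    """Amdahl-style speedup relative to one core with per_core_perf == 1.0.

    The serial part runs on one core; the parallel part is split across all.
    """
    serial_time = (1 - parallel_fraction) / per_core_perf
    parallel_time = parallel_fraction / (per_core_perf * cores)
    return 1 / (serial_time + parallel_time)


# Hypothetical configurations (not measured data).
MULTI_CORE = {"cores": 6, "per_core_perf": 1.0}
MANY_CORE = {"cores": 32, "per_core_perf": 0.4}

for p in (0.50, 0.90, 0.99):
    multi = speedup(p, **MULTI_CORE)
    many = speedup(p, **MANY_CORE)
    winner = "many-core" if many > multi else "multi-core"
    print(f"parallel fraction {p:.2f}: {winner} is faster")
```

With these assumed parameters the many-core configuration only pulls ahead when nearly all of the work is parallel, which matches the point that highly parallel applications are the ones that benefit.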
Many Integrated Cores at 1-1.2 GHz
Multi-core Intel Xeon at 2.26-3.5 GHz
Die Size not to scale
Intel Confidential
Generalized Forms
[Figure: three configurations, each combining a Big Core, two Small Cores, and Fixed Function 1, 2, and 3 units, grouped into different coherence domains]
SW challenges
Should be tolerant of coherence domains
Scheduling policies
Hierarchical?
OS centered?
User level orchestrated?
Heterogeneous friendly
Big core / Little core
Fixed function
Migratory (similar)
Absence / Presence (dissimilar)
Let us turn this into our opportunity & Innovate
HW Challenges
“On Chip Diversity”: Various IP cores, MEMS, Sensors
– Interconnect fabrics for heterogeneous regions to communicate efficiently
– Along with existing cores – need continuity
– Cores communicating in a fault tolerant fashion
Network On Chip communication
Multiple Clock domains & Voltage islands
SOC and 3D chip/platform will be the norm
High level of integration
[Figure: Heterogeneous CMP + loosely/tightly coupled accelerators. Core1 through Core4, each with L1 and L2, plus Accelerator1 and Accelerator2, connected by a NoC.]
How Not To Do Hetero CMP
[Figure: Future Workload mapped across Big CPU, Small CPU, GPU, and Accel]
Let us get this mix right
More discussion and state of current work on Friday