Processor Technology Update Final draft - ARM … 5x Camera ... Xeon-E5 2650 V3 Cortex-A57...

Post on 27-May-2018


1

ARM Processor Technology Update

ARM Cortex®-A72 Processor: Taking Mobile Performance and Efficiency to New Levels

ARM Tech Forum, June 2015

Ian Smythe

Director of Marketing Programs

CPU Group

2

Processing Solutions

for Consumer

Markets

3

Accelerating the Pace of Innovation

Improvement from 2009 to 2014:
Display 5x
Camera 4x
Connectivity 20x
Sensors 3x
Video 34x
CPU 17x
GPU 40x
Memory Bandwidth 16x
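
As a rough reading of the multipliers on this slide, and assuming they describe growth spread evenly over the five years from 2009 to 2014 (the slide only gives the two endpoints), the equivalent yearly growth factor can be sketched as below. The function name `annual_growth` is illustrative and not from the presentation.

```python
# Multipliers as listed on the slide (2009 -> 2014).
GROWTH_2009_TO_2014 = {
    "Display": 5,
    "Camera": 4,
    "Connectivity": 20,
    "Sensors": 3,
    "Video": 34,
    "CPU": 17,
    "GPU": 40,
    "Memory Bandwidth": 16,
}

YEARS = 2014 - 2009  # assumption: growth compounds evenly over 5 years


def annual_growth(multiplier: float, years: int = YEARS) -> float:
    """Return the constant yearly factor that compounds to `multiplier`."""
    return multiplier ** (1.0 / years)


if __name__ == "__main__":
    for name, mult in GROWTH_2009_TO_2014.items():
        print(f"{name:17s} {mult:3d}x overall, about {annual_growth(mult):.2f}x per year")
```

Under this assumption, a 17x CPU gain over five years corresponds to somewhat less than a doubling each year.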

By GalaxyOptimus (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

By Creative Tools. Watermark removed by User:Ainali [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

4

ARM®v8-A Architecture: Mobile Leadership in 2015

Asus Pegasus X002
Huawei Honor 4X
Huawei Ascend Y550
Lenovo A858T
Lenovo Lemon K3
Lenovo Sisley S90
Lenovo Vibe X2 Pro
LG Flex 2
Galaxy S6 Edge
HTC Desire 820
Meizu M1 Note
Oppo R5
Oppo 1105
Samsung Galaxy A7
Samsung Galaxy Mega 2
Samsung Galaxy Note 4
Vivo X5Max
Xiaomi Redmi 2
HTC Desire 510

Just some of the ARMv8-A architecture-based phones announced so far

Unsubsidized price estimates* from $100 to $750

*Pricing information from www.gsmarena.com

5

Cortex®-A Processors: Scalable for Large Screen Devices

Scales efficiently to significantly higher performance in larger screen devices

Fits even more compute in a smaller footprint with less power

By Google (Open Source OS Screenshot) [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons

6

ARM big.LITTLE™: Must-Have for Longer Battery Life

Cortex-A72 as ‘big’ core increases performance and efficiency

Technology Evolution: big.LITTLE cluster switching, then big.LITTLE MP, then big.LITTLE with Intelligent Power Allocation

7

ARM Cortex-A72: Highest Performance ARM Cortex CPU

3.5x performance of Cortex-A15 in smartphone power envelope
Maximizes sustained device performance
75% less energy for same workloads, enabling slimmer and cooler devices

Compelling scalable solutions
Smartphones to large-screen compute solutions
16nm FF+ POP enables high frequency designs to 2.5GHz+

Designed with the system in mind
CoreLink CCI-500 interconnect
Mali-T880 GPU, V550 Video, DP550 Display
MMU-400, NIC-400, ELA-500

8

Cortex-A72: Increased Performance and Reduced Power

Compelling single-threaded performance
Large performance increase across all workloads including integer, memory-intensive, crypto, floating point, etc.
Baseline microarchitecture similar to Cortex-A57

Significant advancements in power efficiency
Re-optimized every logical block from Cortex-A57
Power reduction enables sustained operation at Fmax
Area reduction lowers costs and static power

Feature support for enterprise and mobile SoCs

9

Cortex-A72: Accelerating Usable Performance

Chart: Increase in sustained performance within smartphone power budget (premium devices; years shown: 2014, 2015, 2016)
Cortex-A15, 28nm, 1.6 GHz: baseline
Cortex-A57, 20nm, 2.0 GHz: 1.9x
Cortex-A57, 14/16nm, 2.3 GHz: 2.6x
Cortex-A72, 14/16nm, 2.5 GHz: 3.5x
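
To see how much of the 3.5x figure comes from each step, the chart values can be divided pairwise. This is a minimal sketch that assumes the three multipliers on the slide (1.9x, 2.6x, 3.5x) belong to the three later configurations in the order they are listed, with Cortex-A15 at 28nm as the 1.0x baseline; the helper name `step_gains` is illustrative.

```python
# Assumption: chart values map to the configurations in the order listed,
# with the first configuration acting as the 1.0x baseline.
SUSTAINED_PERF = [
    ("Cortex-A15 28nm 1.6 GHz", 1.0),
    ("Cortex-A57 20nm 2.0 GHz", 1.9),
    ("Cortex-A57 14/16nm 2.3 GHz", 2.6),
    ("Cortex-A72 14/16nm 2.5 GHz", 3.5),
]


def step_gains(points):
    """Return (label, gain over the previous entry) for every entry after the first."""
    return [
        (label, value / points[i][1])
        for i, (label, value) in enumerate(points[1:])
    ]


if __name__ == "__main__":
    for label, gain in step_gains(SUSTAINED_PERF):
        print(f"{label}: {gain:.2f}x over the previous configuration")
```

Read this way, the last step (Cortex-A57 to Cortex-A72 on the same 14/16nm process class) accounts for roughly a 1.35x gain on its own.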

10

Cortex-A72: Reducing Power Consumption

Chart: Energy consumed for same mobile workloads, comparing Cortex-A15, Cortex-A57 and Cortex-A72
Process labels: 28nm, 20nm, 16FF+ (target process) and 28nm, 28nm, 28nm (iso-process)
Frequency labels: 1.6GHz, 2.2GHz max, 2.5GHz max
75% less energy at target process
50% less energy

Combined with Cortex-A53:
Cortex-A72: 2GHz max, 1.1 GHz @ equivalent performance
Comparison point: 2GHz max, 1.3 GHz @ equivalent performance
At iso-process 40-60% further reductions on average across multiple workloads
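
The energy percentages on this slide translate directly into work done per unit of energy. The sketch below is plain arithmetic and covers CPU energy only; it assumes nothing about display, radio or memory power, so it is not a battery-life estimate for a whole device. The function names are illustrative.

```python
def energy_fraction(reduction_percent: float) -> float:
    """Fraction of the original energy still consumed after the stated reduction."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 - reduction_percent / 100.0


def work_per_joule_gain(reduction_percent: float) -> float:
    """How many times more of the same workload fits in the same CPU energy."""
    return 1.0 / energy_fraction(reduction_percent)


if __name__ == "__main__":
    for pct in (75, 50):
        print(f"{pct}% less energy -> {work_per_joule_gain(pct):.1f}x work per joule")
```

So "75% less energy" means the same workload uses a quarter of the CPU energy, or equivalently four times as much of that workload fits in the same CPU energy budget.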

11

Cortex-A72: More performance in constrained envelopes

Compelling Mobile SoCs for smartphone, tablet, and laptop form factors

Chart: Normalized Performance, Core-M 2 GHz (14FF) versus Cortex-A72 2.5 GHz est (16nm). Power annotations: 4W, <1W
Single-thread: Geekbench ST, SPECint, SPECfp
Multi-thread: Geekbench MT (4T), SPECintRate (4T)
Memory: STREAM Add, STREAM Copy, STREAM Scale, STREAM Triad

Intel workloads measured on Dell Venue Pro II. SPEC benchmarks measured using gcc compiler v4.9 with -O3 flag.
Cortex-A72 measured on RTL with realistic memory system with the same compiler settings.
Multi-threaded workloads use 2C4T Core-M CPU and estimated on 4C Cortex-A72 configuration w/2MB L2 cache.

12

big.LITTLE Technology: Right Core for the Right Task

Diagram: big Cluster and LITTLE Cluster, each with an L2 Cache, connected by a Cache Coherent Interconnect, with Interrupt Control

Architecturally Identical Processors
High performance tuned “big” cores
High efficiency tuned “LITTLE” cores

Hardware Coherency
Cache Coherent Interconnect (CCI)
L1 and L2 snooping between clusters

Seamless & Automatic Task Allocation
Global Task Scheduling (big.LITTLE MP)

Heterogeneous Computing
Up to 1.8x higher performance vs. LITTLE-only*
45% to 65% CPU power savings vs. big-only*

Chart: Relative big.LITTLE Power for two pairings (1, 2): Cortex-A57 with Cortex-A53, and Cortex-A15 with Cortex-A7. Annotation: 35%† lower power

* Measured across a set of common use-cases on a 4xCortex-A57.4xCortex-A53 big.LITTLE device
† Average power across high-end gaming and low-utilisation workloads
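
The idea behind "right core for the right task" can be illustrated with a toy placement rule: a task moves to a big core when its tracked load is high and back to a LITTLE core when it is low. This is only a sketch of the concept and is not the Global Task Scheduling implementation; the thresholds, the `Task` class and the `place` function are all hypothetical.

```python
from dataclasses import dataclass

UP_THRESHOLD = 0.7    # hypothetical: load above this migrates a task to a big core
DOWN_THRESHOLD = 0.3  # hypothetical: load below this migrates it back to LITTLE


@dataclass
class Task:
    name: str
    load: float            # tracked load, normalised to the range [0, 1]
    cluster: str = "LITTLE"


def place(task: Task) -> str:
    """Update and return the cluster for `task`, with hysteresis between thresholds."""
    if task.cluster == "LITTLE" and task.load > UP_THRESHOLD:
        task.cluster = "big"
    elif task.cluster == "big" and task.load < DOWN_THRESHOLD:
        task.cluster = "LITTLE"
    return task.cluster


if __name__ == "__main__":
    game = Task("game_render", load=0.9)
    print(place(game))      # heavy load: moves to big
    game.load = 0.5
    print(place(game))      # between thresholds: stays where it is
    game.load = 0.1
    print(place(game))      # light load: returns to LITTLE
```

Using two thresholds rather than one keeps a task whose load hovers near the boundary from bouncing between clusters on every decision.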

13

big.LITTLE: Optimizing for Power Efficiency

Measured Power and Performance during Web Browsing

The combination of High Performance ‘big’ and High Efficiency ‘LITTLE’ CPUs delivers optimal power efficiency and user experience within the thermal constraints

Chart: Power and Page Load Time for big.LITTLE*, LITTLE-only* and big-only* (lower is better). Legend: LITTLE Cluster, big Cluster

*Measurements taken from the same SoC

14

Three Ways big.LITTLE Has Improved in 2015

big.LITTLE Validation Suite
Validation suite simplifies tuning and shortens time to market
Diagram labels: Testcases, Report Generator

ARM Intelligent Power Allocation (IPA)
Native support for IPA, the new Linux Thermal Framework
Diagram labels: Traditional, IPA

ARMv8 Cortex-A CPUs
big.LITTLE devices in 2015 will achieve higher performance efficiency
Chart: AnTuTu HTML5, Epic Citadel, Vellamo HTML5, Quadrant CPU, AnTuTu CPU, Octane, AndEBench and WebXPRT on a 0 to 3 scale. Legend: big.LITTLE(ARMv8), big.LITTLE(ARMv7), Cortex-A57
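
The general idea of allocating a thermal power budget across several consumers can be sketched as follows. This is a simplified illustration and not the IPA code: it uses a proportional term only, and every name and constant in it (`power_budget`, `allocate`, `k_p`, the milliwatt figures) is a hypothetical value chosen for the example.

```python
def power_budget(temp_c: float, target_c: float, sustainable_mw: float, k_p: float) -> float:
    """Proportional controller: the budget shrinks as temperature exceeds the target."""
    headroom = target_c - temp_c
    return max(0.0, sustainable_mw + k_p * headroom)


def allocate(budget_mw: float, requests_mw: dict) -> dict:
    """Grant all requests if they fit, otherwise scale each one down proportionally."""
    total = sum(requests_mw.values())
    if total <= budget_mw:
        return dict(requests_mw)
    scale = budget_mw / total
    return {name: req * scale for name, req in requests_mw.items()}


if __name__ == "__main__":
    # Hypothetical numbers: 5 degrees over target, 2.5 W sustainable, 100 mW per degree.
    budget = power_budget(temp_c=80, target_c=75, sustainable_mw=2500, k_p=100)
    grants = allocate(budget, {"big": 3000, "LITTLE": 500, "GPU": 500})
    print(budget, grants)
```

The point of the sketch is the contrast with a fixed trip-point scheme: instead of capping one component when a temperature threshold is crossed, a shared budget is recomputed continuously and divided among the requesters.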

15

big.LITTLE Delivers a Richer User Experience

Single thread performance is crucial for gaming, video playback and web browsing applications

big.LITTLE software migrates latency sensitive threads to High Performance CPUs to reduce execution time and deliver an improved mobile user experience

Chart: big.LITTLE User Experience Improvement, Normalized Application Speed (Higher is Better), LITTLE-only (4L) versus big.LITTLE (1b+4L)
Applications: Angry Birds, Audio Player, Photo Editor, Facebook, Castle Master, Asphalt 8
Data labels: 125%, 138%, 159%, 157%, 140%, 119%

Chart: Web Page Load Time Performance, Normalised Time, for 4b+4L, 2b+4L, 1b+4L, 4L, 2L and 1L. Annotation: 40%

b: “big” High Performance CPU
L: “LITTLE” High Efficiency CPU

16

big.LITTLE with Cortex-A72 for Entry to Mid-range

Configurations with High Performance CPUs offer greater user experience and higher power efficiency benefits relative to LITTLE-only

Topologies with Cortex-A72 CPU as big core offer improved user experience at reduced area

Chart: Normalised User Experience for Angry Birds, Temple Run, Video Playback and Asphalt 8, comparing LITTLE-only (SMP) with big.LITTLE (1b+4L)

Diagram: from LITTLE-only to big.LITTLE with Cortex-A72, increasing in single thread performance, user experience and energy efficiency. Area labels: 1.09x, 1.0x, 1.3x, 0.98x

Cortex-A72 with 2MB L2 for 2 cores, 1MB L2 for 1 core
Cortex-A53 1MB L2 for MP4, 512kB L2 for 2 MP2 and Octa-LITTLE 2nd cluster

17

ARM at the Heart of the Wearables Market

Device classes: Standalone Devices, Companion Devices, Tethered, Embedded, Deeply Embedded

Software: Embedded OS, Rich OS

Always aware, lowest-power
High-efficiency performance, constrained power budget

Role: Peripheral, Autonomous Compute

18

Processing Solutions for

Networking and Infrastructure

Markets

19

Range of SoCs Addressing Infrastructure

Categories: Highly Accelerated, Balanced, Massively Multicore

Example SoCs: QorIQ LS2, ThunderX, Tile-MX 100, MPSoC, Opteron™ A1100, Stratix® 10, X-Gene™

One Size Does Not Fit All

20

Cortex-A57 Networking Solutions Gather Pace

ARMv8 SoCs in deployments now, many more coming.

Freescale LS2085 and LS2045
Cortex-A57 based 8-core and 4-core complex
SDN Switching, NFV Solutions
Networking applications: Enterprise Routing, Data Center Solutions, OpenFlow switching, Enterprise Switching, Security Appliances/IPS/IDS, DPI, ADC/WAN-Opt

HiSilicon 32-core
First 16nm FinFET ARMv8-A networking chip
32-core ARM Cortex-A57 SoC
Networking applications: Next Generation BTS, Core Routers, Virtualized appliances, SDN

AMD Hierofalcon, Seattle platforms

21

Enterprise Compute Requirements

Specialised Processing

L1, Content Delivery, Security

Diverse requirements

Trend: Advanced modulation schemes

Need: DSPs, Accelerators

Data Plane Processing

Throughput driven, IO intensive

Deterministic performance

Trend: Higher packet rates

Need: Small Cores at Maximum Efficiency

Control Plane Processing

Fast Event Processing

Complex signalling

Trend: Evolving Software

Need: Efficient, High Compute Performance

MAC Scheduling

Real Time, Latency Driven

Multiple core processing

Trend: More Complexity (LTE-A, 5G)

Need: High Compute, Low Latency Performance

High Bandwidth, Low Latency Interconnect

Wide Range of Implementations from Few to Many Coherent Devices

22

Extensible Architecture for Heterogeneous Multi-core Solutions

Block diagram: CoreLink™ CCN-512 Cache Coherent Network with Snoop Filter and 1-32MB L3 cache at the centre. Attached to it are Cortex-A72 and Cortex-A53 clusters (or other Cortex CPU or CHI masters), DSPs over ACE, four DMC-520 memory controllers (each x72 DDR4-3200), I/O Virtualisation through CoreLink MMU-500, a GIC-500, and NIC-400 Network Interconnects for peripherals (Flash, USB, SATA, SRAM, GPIO, AHB, PCIe, 10-40 GbE, DPI, Crypto)

Up to 4 cores per cluster
Up to 12 coherent clusters
Integrated L3 cache
Up to 24 I/O coherent interfaces for accelerators and I/O
Peripheral address space
Heterogeneous processors: CPU, GPU, DSP and accelerators
Virtualized Interrupts
Up to Quad channel DDR3/4 x72
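
The limits quoted for this interconnect imply an upper bound on core count and on theoretical memory bandwidth. The arithmetic below is a sketch under two stated assumptions: that "x72" means 64 data bits plus 8 ECC bits per channel, and that DDR4-3200 performs 3200 million transfers per second. The result is a peak figure, not a measured one.

```python
MAX_CORES_PER_CLUSTER = 4        # from the slide
MAX_COHERENT_CLUSTERS = 12       # from the slide
CHANNELS = 4                     # "up to quad channel"
TRANSFERS_PER_SECOND = 3200e6    # DDR4-3200
DATA_BITS_PER_CHANNEL = 64       # assumption: x72 = 64 data bits + 8 ECC bits


def max_cores() -> int:
    """Largest core count if every coherent cluster is fully populated."""
    return MAX_CORES_PER_CLUSTER * MAX_COHERENT_CLUSTERS


def peak_bandwidth_gb_s(channels: int = CHANNELS) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (data bits only, ECC excluded)."""
    bytes_per_transfer = DATA_BITS_PER_CHANNEL // 8
    return TRANSFERS_PER_SECOND * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    print(f"max cores: {max_cores()}")
    print(f"peak bandwidth: {peak_bandwidth_gb_s():.1f} GB/s")
```

Under these assumptions the configuration tops out at 48 cores sharing roughly 100 GB/s of peak DRAM bandwidth.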

23

Maximizing Throughput Density: per mm², per Watt

Chart: Relative performance (Spec2K6 rate), 20 Thread Workload
Xeon-E5 2650 V3 (10 cores 20 threads)
Cortex-A57 (20 cores 20 threads), POP Optimizations
Cortex-A72 (20 cores 20 threads), POP Optimizations
Xeon-E5 2660 V3 (10 cores 20 threads)
Frequency labels: 2.3 GHz, 2.7 GHz, 2.6 GHz, 2.5 GHz
Power labels: 105W, 105W, <30W, <30W

ARM Solution Benefits:
Less than 1/3rd the power for equivalent performance
Allows more specialized computing or significantly greater thread density in the same power budget

Comparison for equivalent number of threads
Platforms used:
Xeon-E5 2660 10C20T platform (measured)
Xeon-E5 2650 10C20T platform (measured)
Gcc compiler v4.9 with -O3 flag
Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache
Per-core measurements on RTL with relevant memory system
Gcc compiler v4.9 with -O3 flag
Scaled to 20T based on modelled and empirical results
Power estimated in 16nm based on ARM internal implementations for entire CPU + interconnect
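
The "less than 1/3rd the power" claim can be checked against the power labels on the chart. This sketch assumes the two 105W labels belong to the Xeon platforms and the two <30W labels to the ARM configurations, and it treats 30W as an upper bound, so the computed gain is a lower bound. The names are illustrative.

```python
XEON_POWER_W = 105.0
ARM_POWER_BOUND_W = 30.0  # slide says "<30W", so this is an upper bound


def power_ratio(arm_w: float = ARM_POWER_BOUND_W, xeon_w: float = XEON_POWER_W) -> float:
    """ARM power as a fraction of Xeon power."""
    return arm_w / xeon_w


def perf_per_watt_gain(relative_perf: float = 1.0,
                       arm_w: float = ARM_POWER_BOUND_W,
                       xeon_w: float = XEON_POWER_W) -> float:
    """Performance-per-watt advantage for a given ARM/Xeon performance ratio."""
    return relative_perf * xeon_w / arm_w


if __name__ == "__main__":
    print(f"power ratio: {power_ratio():.3f} (one third is {1 / 3:.3f})")
    print(f"perf/W gain at equal performance: at least {perf_per_watt_gain():.1f}x")
```

At equal throughput, 30W against 105W is about 0.29 of the power, which is consistent with the slide's "less than 1/3rd" wording.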

24

Cortex-A72: Ideal for Dense Compute Environments

Cortex-A72 is <20% size

Single Broadwell CPU core + 256K L2 (1): ~8mm²
Cortex-A72 MP4 + 2MB L2 (3): ~8mm²
Single Cortex-A72 core (2): ~1.15mm²

A quad core Cortex-A72 with 8x L2 cache RAM is the same size

(1) Source: Estimated from die-shot image provided by Intel at IDF 2014.
(2), (3) Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries
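
The area figures on this slide can be cross-checked with a few lines of arithmetic. The values are the approximate ones quoted on the slide; the constant and function names are illustrative.

```python
BROADWELL_CORE_PLUS_L2_MM2 = 8.0   # ~8mm2, single core + 256K L2 (slide estimate)
A72_CORE_MM2 = 1.15                # ~1.15mm2, single Cortex-A72 core
A72_QUAD_PLUS_L2_MM2 = 8.0         # ~8mm2, Cortex-A72 MP4 + 2MB L2

BROADWELL_L2_KB = 256
A72_QUAD_L2_KB = 2 * 1024


def core_area_ratio() -> float:
    """Single Cortex-A72 core area as a fraction of the Broadwell core + L2 area."""
    return A72_CORE_MM2 / BROADWELL_CORE_PLUS_L2_MM2


def l2_ratio() -> float:
    """How many times more L2 the quad-core Cortex-A72 configuration carries."""
    return A72_QUAD_L2_KB / BROADWELL_L2_KB


if __name__ == "__main__":
    print(f"core area ratio: {core_area_ratio():.1%}")
    print(f"L2 capacity ratio: {l2_ratio():.0f}x")
```

Both headline claims follow from the quoted numbers: 1.15 / 8 is about 14%, which is under 20%, and 2MB is eight times 256K.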

25

ARM Ecosystem

ARM

Scalable

ISA

This diagram is a sample representation of the ARM Partner Ecosystem for illustration purposes only

26

Summary

Mobile
Cortex-A72 delivers 3.5x performance of Cortex-A15 in the smartphone envelope
Compelling scalable solutions from smartphone to large-screen compute
Designed with the system in mind: CPU, CCI, GPU, Video, MMU, NIC, ELA
Wearables from Cortex-M to Cortex-A

Infrastructure
Cortex-A72 (and Cortex-A57) are ideal for dense, high-throughput computing
Small footprint for greater density on-die for larger core counts
Scalable configurations with larger (40+) core counts using ARM CoreLink CCN products
Deliver maximum throughput per mm², per watt and per chip
Enterprise ready feature set and ecosystem

27

Thank you