
Integrating CPU and GPU, The ARM Methodology

Date post: 04-Feb-2022
Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM; Ian Rickards, Senior Product Manager, ARM
Transcript
Page 1: Integrating CPU and GPU, The ARM Methodology

Integrating CPU and GPU, The ARM Methodology

Edvard Sørgård, Senior Principal Graphics Architect, ARM

Ian Rickards, Senior Product Manager, ARM

Page 2: Integrating CPU and GPU, The ARM Methodology

The ARM Business Model

Global leader in the development of semiconductor IP

R&D outsourcing for semiconductor companies

Innovative business model yields high margins

Upfront license fee – flexible licensing models

Ongoing royalties – typically based on a percentage of chip price

Technology reused across multiple applications

Long-term, secular growth markets

Approximately 850 licenses

Grows by 60-90 every year

More than 250 potential royalty payers

>8bn ARM-based chips in ’11

>25% CAGR over last 5 years

Page 3: Integrating CPU and GPU, The ARM Methodology

ARM Technology

Advanced consumer products are incorporating more and more ARM technology, from processor and multimedia IP to software

Processor IP: Design of the brain of the chip

Physical IP: Design of the building blocks of the chip

Software development tools

Page 4: Integrating CPU and GPU, The ARM Methodology

ARM® Mali™ GPU Momentum

#1 in Smart TVs (>70% market share)

#1 graphics in Android™ tablets (>50% market share)

>20% of Android™ 4 smartphones

Mali GPU shipments outpacing industry growth rate and gaining market share

Page 5: Integrating CPU and GPU, The ARM Methodology

Comprehensive GPU Compute Support

ARM’s best-in-class CPU know-how combined with expertise in graphics technology, enabling complex use-cases

Computational photography: Panorama stitching

Image recognition: Face, smile, landmark, context

Image improvement, stabilization, editing, filtering

Moving compute tasks onto the GPU enables lower power consumption and faster response than running them solely on the CPU

Page 6: Integrating CPU and GPU, The ARM Methodology

Mali GPU Compute: No FUD... Facts

Passed Khronos Conformance

OpenCL™ 1.1 Full Profile on Linux and Android™

Proven in Silicon

Samsung Exynos 5 Dual, implements Full Profile

OpenCL and Renderscript DDK available now

Mali-T604 shipping in real products

Google Chromebook

Google Nexus 10

InSignal Arndale Community Board

API exposed for developers

OpenCL on Linux for Arndale platform

Renderscript computation on Android for Nexus 10

Page 7: Integrating CPU and GPU, The ARM Methodology

Compute Use Case Example

ARM Seemore demo

OpenCL 1.1 FP accelerated world

Interactive items and lights

Bullet physics broad-phase fully OpenCL accelerated on GPU

Performance boost

GPU Kernel speedup >10x

But system speedup is less

ARM integration goal

Take the system cost out!
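The gap between the >10x kernel speedup and the smaller system speedup can be illustrated with an Amdahl-style estimate. The sketch below is not from the slides: the function name `system_speedup`, the accelerated fraction, and the overhead fraction are all hypothetical values chosen for illustration. The overhead term stands in for the "system cost" of sharing data between CPU and GPU (copies, cache maintenance, synchronization).

```python
def system_speedup(accelerated_fraction, kernel_speedup, overhead_fraction=0.0):
    """Amdahl-style estimate of whole-application speedup.

    accelerated_fraction: share of the original runtime spent in the kernel
                          that moves to the GPU (0.0 to 1.0).
    kernel_speedup:       how much faster that kernel runs on the GPU.
    overhead_fraction:    extra time added by data sharing, expressed as a
                          share of the original runtime (hypothetical knob).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be within [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    # New runtime, normalized so the original runtime is 1.0.
    new_time = (
        (1.0 - accelerated_fraction)             # part that stays on the CPU
        + accelerated_fraction / kernel_speedup  # part accelerated on the GPU
        + overhead_fraction                      # cost of sharing the data
    )
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical workload: 60% of the time is in the kernel, kernel runs 10x faster.
    print(f"no sharing cost:   {system_speedup(0.6, 10.0):.2f}x")
    print(f"with sharing cost: {system_speedup(0.6, 10.0, 0.1):.2f}x")
```

With these made-up inputs the system speedup is roughly 2.2x without sharing cost and roughly 1.8x with it, far below the 10x kernel figure. Reducing the overhead term is what "take the system cost out" would mean in this model.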

Page 8: Integrating CPU and GPU, The ARM Methodology

Integration: Coherency

SoCs are heterogeneous systems

But sharing data can still be costly

Cache flushes, locks and syncs reduce the heterogeneous benefit

HW coherency makes sharing data cheap and automatic

ARM is in leading position with full technology coverage

Cortex™ CPUs

Mali GPUs

CoreLink™ system IP

AMBA™ bus protocols

[Figure labels: Mali, CoreLink, Cortex, AMBA]

Page 9: Integrating CPU and GPU, The ARM Methodology

Integration: Address Space Alignment

The 32-bit address space is running out, even in mobile

Midgard architecture built for full 64-bit addresses

Embedded distributed Mali MMU for VA to PA/IPA translation

Mali-T604: 48-bit VA and 40-bit PA/IPA

Uses ARMv7 LPAE page table format, just like Cortex-A15 & Cortex-A7

Multiple simultaneous address spaces supported

Mali GPUs run many threads in parallel

Independent processes may execute on GPU simultaneously

Seamless process transitions ensure maximum utilization/efficiency
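The address widths quoted on this slide can be turned into sizes with simple arithmetic. The sketch below only computes how much memory each width can address. The helper names `address_space_bytes` and `human_size` are made up for illustration and are not part of any ARM tool.

```python
def address_space_bytes(bits):
    """Number of bytes addressable with the given address width."""
    if bits < 0:
        raise ValueError("bits must be non-negative")
    return 1 << bits


def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    index = 0
    # Divide by 1024 until the value fits the current unit.
    while num_bytes >= 1024 and index < len(units) - 1:
        num_bytes //= 1024
        index += 1
    return f"{num_bytes} {units[index]}"


if __name__ == "__main__":
    for label, bits in [
        ("32-bit address space", 32),
        ("Mali-T604 PA/IPA (40-bit)", 40),
        ("Mali-T604 VA (48-bit)", 48),
    ]:
        print(f"{label}: {human_size(address_space_bytes(bits))}")
```

A 32-bit space covers 4 GiB, the 40-bit PA/IPA range covers 1 TiB, and the 48-bit VA range covers 256 TiB.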

Page 10: Integrating CPU and GPU, The ARM Methodology

Fine-Tuned to Different Performance Points

LITTLE: Cortex-A7, Cortex-A53

Simple, in-order, 8-stage pipelines

Performance better than mainstream, high-volume smartphones (Cortex-A8 and Cortex-A9)

Most energy-efficient applications processor from ARM

big: Cortex-A15, Cortex-A57

Complex, out-of-order, multi-issue pipelines

Up to 2x the performance of today’s high-end smartphones

Highest performance in mobile power envelope

[Pipeline diagram labels: Queue, Issue, Integer]

Page 11: Integrating CPU and GPU, The ARM Methodology

ARM System Scalability

Introducing CCI-400 Cache Coherent Interconnect

Processor to Processor Coherency and I/O coherency

Memory and synchronization barriers

Virtualization support with distributed virtual memory signaling

[Block diagram: Mali-T624 (four GPU cores, Mali L2 cache), Quad Cortex-A15 MPCore (four A15 cores, processor coherency (SCU), up to 4MB L2 cache), Quad Cortex-A7 MPCore (four A7 cores, processor coherency (SCU), up to 4MB L2 cache), MMU-400, 128-bit AMBA 4 links, CoreLink CCI-400 Cache Coherent Interconnect]

Page 12: Integrating CPU and GPU, The ARM Methodology

Low to Medium Intensity Use Cases

DVFS profiles from leading Dual Cortex-A9 Smartphone

Demonstrates that many common applications require only low to moderate processing power

All of these use cases will run predominantly on the LITTLE cores

The small percentage of time spent in peaks at high MHz doesn’t necessarily require migration to big cores

The short term DVFS system response to any increase in average load is to go to the highest MHz

[Chart: share of time (0% to 100%) spent by CPU0 and CPU1 in each state for five workloads: Audio (MP3), Angry Birds, Camcorder, 720p Video, GT Racer. States: 1200000, 1000000, 800000, 500000, 200000, STANDBYWFI, CPU OFF, Cluster OFF. Annotations: 250MHz aggregate, 90MHz aggregate, 500MHz aggregate, 180MHz aggregate]
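The "MHz aggregate" annotations on this slide's chart are not defined in the transcript. One plausible reading, assumed here, is a residency-weighted average frequency summed over both CPUs. The sketch below computes that figure. The function name `aggregate_mhz`, the treatment of the frequency labels as kHz, and the residency numbers are all hypothetical and not taken from the slide.

```python
def aggregate_mhz(residency_by_cpu):
    """Residency-weighted frequency summed over all CPUs, in MHz.

    residency_by_cpu: one dict per CPU mapping a frequency in kHz to the
    fraction of time spent there. Idle and off states (STANDBYWFI, CPU OFF,
    Cluster OFF) are folded into the key 0, since they contribute no cycles.
    """
    total_mhz = 0.0
    for residency in residency_by_cpu:
        if abs(sum(residency.values()) - 1.0) > 1e-6:
            raise ValueError("residency fractions for each CPU must sum to 1")
        weighted_khz = sum(khz * fraction for khz, fraction in residency.items())
        total_mhz += weighted_khz / 1000.0
    return total_mhz


if __name__ == "__main__":
    # Hypothetical residency profile for a light workload on a dual-core system.
    cpu0 = {1200000: 0.05, 500000: 0.10, 200000: 0.35, 0: 0.50}
    cpu1 = {200000: 0.10, 0: 0.90}
    print(f"aggregate: {aggregate_mhz([cpu0, cpu1]):.0f} MHz")
```

Under this reading, a workload can touch the top frequency for a small share of the time and still average only a few hundred MHz in aggregate, which is the argument the slide makes for running it on the LITTLE cores.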

Page 13: Integrating CPU and GPU, The ARM Methodology

big and LITTLE CPU Performance

Cortex-A9 powers high end mobile devices today

Cortex-A7 delivers comparable performance at lower power and area

Cortex-A15 delivers significantly higher performance

[Chart: relative performance (axis 0 to 3) on Quadrant - CPU, Caffeinemark 3.0, BrowserMark, V8 Benchmark, Kraken and Geomean. Series: Cortex-A9 1.2GHz, Cortex-A7 1GHz actual, Cortex-A7 1.2GHz est., Cortex-A15 1.2GHz, Cortex-A15 1.6GHz est.]

Note: Cortex-A15 and Cortex-A7 results are from a test platform, with lower memory performance than production systems will deliver
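The chart on this slide includes a "Geomean" group alongside the individual benchmarks. As a reminder of how such a summary is usually computed, the sketch below takes the geometric mean of per-benchmark scores normalized to a baseline CPU. The scores are placeholder values for illustration, not readings from the chart, and `geomean` is a hypothetical helper name.

```python
import math


def geomean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("geomean needs at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geomean is only defined for positive values")
    # Average the logarithms, then exponentiate.
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    # Placeholder scores relative to a baseline CPU (baseline = 1.0 on every test).
    relative_scores = {
        "benchmark A": 1.6,
        "benchmark B": 2.0,
        "benchmark C": 2.5,
    }
    print(f"geomean: {geomean(relative_scores.values()):.2f}")
```

The geometric mean is the usual choice for summarizing normalized scores because the ranking of CPUs does not depend on which CPU is picked as the baseline.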

Page 14: Integrating CPU and GPU, The ARM Methodology

Summary

Getting the maximum efficiency out of modern SoCs is highly complex

Interactions between many subsystems need to be optimized

Requires new innovations and technology focus

ARM Cortex-A15 / coherent Mali / big.LITTLE enable highest performance and scalability from mobile through to console class gaming.

ARM continues to drive the development of better system integration

Cortex™ CPUs, Mali™ GPUs and CoreLink™ fabric leading the way

Future v8 AArch64 with multi-cluster for next-generation gaming

Thank you

