Measuring the Whole System · 2016-12-13

transcript

Measuring the Whole System: Holistic Profiling of CPU and GPU for Optimal Vision Applications on ARM Platforms

Tim Hartley

The Evolution of Mobile GPU Compute

Timeline (figure): 2007, 2009, 2010, 2012, 2013, 2014, 2015

OpenGL® ES 1.1: Fixed pipeline

OpenGL ES 2.0: Programmable pipeline

OpenCL™ Full Profile / RenderScript: Portable Heterogeneous Parallel Computation

OpenGL ES 3.1 Compute Shaders: GPU Compute within OpenGL ES API

GPUs shown on the timeline: ARM® Mali™-55, Mali-200, Mali-300, Mali-400 MP, Mali-450 MP, Mali-T600 Series, Mali-T700 & T800 Series

Measuring the Whole System

Computer vision will, for some time, succeed in using every drop of processing power we give it

And techniques in computer vision are still evolving rapidly

New, complex, sustained low-power use cases

Building computer vision applications is an ever more complex process

The availability of more processors and processor types makes this even more so

Capturing and analyzing accurate and effective measurements from platforms plays a vital role in achieving optimal performance

Modern Computer Vision Applications

Diagram labels: Vision Application; CPU Core (x2); NEON (x2)

Inside an ARM Mali Midgard Core

SIMD: Several components per operation

128-bit registers

VLIW: Several operations per instruction word

Some operations are “free”

Built-in function library

Accelerated in hardware

$T = \max(A_0, A_1, LS, Tex)$
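Reading the slide's expression as T = max(A0, A1, LS, Tex), the cycles a core spends on a workload are set by its busiest pipe. The sketch below works through that in C. It assumes A0 and A1 are the cycle counts of two arithmetic pipes, LS the load/store pipe and Tex the texture pipe. The struct and function names are hypothetical and are not taken from any ARM tool or driver.

```c
/* Hypothetical per-pipe cycle counts for one shader core.
   The field meanings are an assumption about the slide's symbols. */
typedef struct {
    unsigned long a0;   /* arithmetic pipe 0 (A0) */
    unsigned long a1;   /* arithmetic pipe 1 (A1) */
    unsigned long ls;   /* load/store pipe (LS) */
    unsigned long tex;  /* texture pipe (Tex) */
} pipe_cycles;

static unsigned long max_ul(unsigned long x, unsigned long y)
{
    return x > y ? x : y;
}

/* T = max(A0, A1, LS, Tex): the busiest pipe bounds the core's time. */
unsigned long core_cycles(const pipe_cycles *p)
{
    return max_ul(max_ul(p->a0, p->a1), max_ul(p->ls, p->tex));
}

/* Name of the pipe that determines T. Ties resolve in the order
   A0, A1, LS, Tex. */
const char *limiting_pipe(const pipe_cycles *p)
{
    unsigned long t = core_cycles(p);
    if (p->a0 == t) return "A0";
    if (p->a1 == t) return "A1";
    if (p->ls == t) return "LS";
    return "Tex";
}
```

With made-up counts of A0 = 120, A1 = 80, LS = 300 and Tex = 40, `core_cycles` gives 300 and `limiting_pipe` names the load/store pipe, which is then the one worth optimising first.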

Hardware Counters

Counters per core: active cycles, pipe activity, L1 cache

Counters for the GPU: active cycles, L2 caches

Accessed through DS-5 Streamline

Timeline of all hardware counters, and more

Explore the execution of the full application

Zoom in on details

DS-5 Streamline

Identify hotspots and system bottlenecks at a glance

Select from CPU/GPU counters

OS level and custom data sources

Accumulate counters, measure time and find instant hotspots

Select one or more tasks to isolate their contribution

Combined task switching trace and sample-based profile

Example: Complex Computer Vision Application

Lane and Car Detection

Streamline

Streamline: OpenCL Timeline

Optimisation Overview

(Flowchart; edge labels: kernel / memory, Arithmetic / Mem ops, Yes / No)

Limited by kernel execution time or mem management?

Limited by Arith ops or Mem ops?

Arithmetic branch, checks: Limited to 64 threads? Large no. of register bank conflicts? Large no. of instruction cache misses?

Arithmetic branch, remedies: Reduce register pressure. Simplify or shorten kernels. Vectorise the kernel if possible. Decrease the arith work if possible.

Mem ops branch, checks: High number of instruction re-issues? Limited to 64 threads? Large no. of instruction cache misses?

Mem ops branch, remedies: Reduce register pressure. Simplify or shorten kernels. Vectorise the LS operations if possible. Decrease mem accesses if possible.

Memory-related remedies: Ensure you are not copying memory unnecessarily. Improve memory access pattern to improve cache efficiency.

Limited by same factors? Reiterate, or done optimising.

Deriving Meaning from Hardware Counters

Counters on their own usually don’t mean a huge amount

Combining counters is more useful

Comparing values to determine limiting pipes

Calculating more meaningful values from multiple values

New graph traces can be added from these counters and become an integral part of the timeline

Custom Charts: Bringing Counters Together

100 * $MaliLoadStorePipeLSInstructions / $MaliLoadStorePipeLSInstructionIssues

100 * $MaliLoadStorePipeLSInstructionIssues / $MaliCoreCyclesTripipeCycles

100 * $MaliArithmeticPipeAInstructions / $MaliCoreCyclesTripipeCycles

100 * $MaliCoreCyclesTripipeCycles / $MaliJobManagerCyclesGPUCycles
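The same four ratios can be computed offline from exported counter values. This is a rough C sketch of that arithmetic only. The struct, its field names and the function names are hypothetical; the comments map each field to the Streamline variable used in the expressions above. Returning 0 for a zero denominator is a choice made here, not Streamline behaviour.

```c
#include <stdint.h>

/* Raw counter values sampled over one interval (hypothetical container). */
typedef struct {
    uint64_t ls_instructions;       /* $MaliLoadStorePipeLSInstructions */
    uint64_t ls_instruction_issues; /* $MaliLoadStorePipeLSInstructionIssues */
    uint64_t a_instructions;        /* $MaliArithmeticPipeAInstructions */
    uint64_t tripipe_cycles;        /* $MaliCoreCyclesTripipeCycles */
    uint64_t gpu_cycles;            /* $MaliJobManagerCyclesGPUCycles */
} mali_counters;

/* 100 * num / den, or 0 when the denominator counter did not advance. */
static double percent(uint64_t num, uint64_t den)
{
    return den == 0 ? 0.0 : 100.0 * (double)num / (double)den;
}

/* 100 * LSInstructions / LSInstructionIssues */
double ls_instr_per_issue_pct(const mali_counters *c)
{
    return percent(c->ls_instructions, c->ls_instruction_issues);
}

/* 100 * LSInstructionIssues / TripipeCycles */
double ls_issues_per_tripipe_cycle_pct(const mali_counters *c)
{
    return percent(c->ls_instruction_issues, c->tripipe_cycles);
}

/* 100 * AInstructions / TripipeCycles */
double a_instr_per_tripipe_cycle_pct(const mali_counters *c)
{
    return percent(c->a_instructions, c->tripipe_cycles);
}

/* 100 * TripipeCycles / GPUCycles */
double tripipe_per_gpu_cycle_pct(const mali_counters *c)
{
    return percent(c->tripipe_cycles, c->gpu_cycles);
}
```

Comparing the load/store and arithmetic results for the same interval is one way to see which pipe is closer to saturation.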

ALU Bound kernel

One load

One store

“n” ALU operations

__kernel void kernel_alu_bound(global float* arr, uint n)
{
    /* One load per work-item. */
    float value = arr[get_global_id(0)];

    /* "n" dependent ALU operations. */
    for (uint i = 0; i < n; i++) {
        value += sin(value);
    }

    /* One store per work-item. */
    arr[get_global_id(0)] = value;
}

L/S Bound kernel

One load

One store

No ALU operation

__kernel void kernel_memcpy(global float *a, global float *b)
{
    /* Each work-item copies its own float4: one vector load, one vector store. */
    float4 v = vload4(get_global_id(0), a);
    vstore4(v, get_global_id(0), b);
}

Cache misses

One byte read every 64 bytes

One byte written every 64 bytes

Really bad cache utilisation!

__kernel void kernel_cache_misses(global uchar *a, global uchar *b)
{
    /* A stride of 64 bytes: each work-item touches one byte in every 64. */
    b[64 * get_global_id(0)] = a[64 * get_global_id(0)];
}
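To put a number on "really bad", the sketch below estimates what fraction of the fetched cache lines a strided single-byte access pattern actually uses. It assumes a 64-byte cache line, which the slide does not state, and a buffer that starts on a line boundary. The function name is hypothetical.

```c
#include <stddef.h>

/* Fraction of fetched cache-line bytes that are actually used when
   `count` single-byte accesses are made `stride` bytes apart, starting
   at a line-aligned address. Returns 0 for degenerate inputs. */
double cache_line_utilisation(size_t count, size_t stride, size_t line_size)
{
    if (count == 0 || stride == 0 || line_size == 0) {
        return 0.0;
    }

    size_t lines = 0;
    size_t last_line = (size_t)-1;  /* sentinel: no line touched yet */

    /* Addresses increase monotonically, so a new line index means a
       line that has not been fetched before. */
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * stride) / line_size;
        if (line != last_line) {
            lines++;
            last_line = line;
        }
    }

    /* One useful byte per access, line_size bytes fetched per line. */
    return (double)count / (double)(lines * line_size);
}
```

Under these assumptions the kernel's stride of 64 gives a utilisation of 1/64, about 1.6%, while a contiguous byte copy (stride 1) gives 100%.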

What does good whole-system optimisation look like?

Conclusions

Tomorrow at the EVA Summit, 4pm:

“Understanding the Role of Integrated GPUs in Vision Applications”, Roberto Mijat

Computer vision applications need careful optimisation

Understanding your system as a whole is a vital first step

Understanding each individual processor core type is the next

Use tools to measure hardware counters across the entire platform

Whole-system views of the relative performance of heterogeneous architectures are invaluable

Allows you to decide where there is capacity to move workloads

And how to target optimisations by exposing the limiting component within individual cores

Ideally, use these tools throughout the development process, not just at the end

The Mali Ecosystem is making GPU Compute a reality today

ARM enables developers with platforms, drivers, tools and support

Industry leaders take advantage of ARM Mali GPU capabilities to innovate and deliver

Be one of them!

Ecosystem Resources

www.malideveloper.com

Download guides, papers, tools (including DS-5 Streamline), etc.

http://community.arm.com/welcome

Community forums, blogs and more

malidevelopers@arm.com

Graphics and GPU Compute developer support

http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/

A range of video and written tutorials for GPU Compute

http://malideveloper.arm.com/develop-for-mali/features/mali-t6xx-gpu-user-space-drivers/

ARM Mali-T600 series GPU user-space binary drivers available for download

Linaro BSP now available with Mali-T600 series GPU support
