
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel

Transcript
Page 1: "Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel

Copyright © 2016 Intel Corporation 1

Accelerating Deep Learning Using Altera FPGAs

Bill Jenkins
May 3, 2016

Page 2

Legal Notices and Disclaimers

• Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure.

• Tests document performance of components on a particular test, in specific systems. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

• Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

• All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

• Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

• The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

• Intel, the Intel logo, and Xeon and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

Page 3

Altera and Intel Enhance the FPGA Value Proposition

STRATEGIC RATIONALE

Accelerated FPGA investment
• Accelerated FPGA innovation from combined R&D scale
• Improved FPGA performance/power via early access and greater optimization of process node advancements
• New, breakthrough Data Center and IoT products harnessing combined FPGA + CPU expertise

Operational excellence
• Superior product design capabilities
• Continued excellence in customer service and support
• Increased resources bolster long-term innovation
• Focused, additive investments today

Page 4

What is Machine Learning?

• Extracting features from data in order to solve predictive problems
  • Image classification & detection
  • Image recognition/tagging
  • Network intrusion detection
  • Fraud / face detection
• The aim is programs that automatically learn to recognize complex patterns and make intelligent decisions based on insight generated from learning
• For accuracy, models must be trained, tested and calibrated to detect patterns using previous experience

Page 5

When to Apply Machine Learning

• Human expertise is absent
  • Navigating to Pluto
• Humans cannot explain their expertise
  • Speech recognition
• The solution changes over time
  • Tracking traffic
• The solution needs to be adapted to particular cases
  • Medical diagnosis
• The problem is vast relative to human reasoning capabilities
  • Ranking web pages on Google or Bing

Page 6

Value Proposition of Machine Learning

• Data is the problem: an increasing variety of things × volume × velocity = throughput (35 ZB/s)
• Separating signal from noise provides value: revenue growth, cost savings, increased margin

Page 7

Convolutional Neural Networks (CNN)

• A network of interconnected neurons, modeled after biological processes, for computing approximate functions
• Layers extract successively higher levels of features
• Often want a custom topology to meet specific application accuracy/throughput requirements

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, 1998.

Page 8

CNN Computation in One Slide

Inew(x, y) = Σ_{y′ = −1}^{1} Σ_{x′ = −1}^{1} Iold(x + x′, y + y′) × F(x′, y′)

• An input feature map (a set of 2D images) convolved with a filter (a 3D space) produces an output feature map
• Repeat for multiple filters to create multiple “layers” of the output feature map
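The per-pixel formula above can be checked with a few lines of NumPy; the function and array names here are illustrative, not from the presentation:

```python
import numpy as np

# Minimal sketch of the slide's per-pixel CNN computation:
#   Inew(x, y) = sum over x', y' in [-1, 1] of Iold(x + x', y + y') * F(x', y')
# i.e. one 3x3 convolution (cross-correlation) over a single 2D feature map.

def conv3x3(i_old: np.ndarray, f: np.ndarray) -> np.ndarray:
    """Apply a 3x3 filter f to the 2D image i_old (valid region only)."""
    h, w = i_old.shape
    i_new = np.zeros((h - 2, w - 2))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0.0
            for yp in (-1, 0, 1):        # y' in the slide's formula
                for xp in (-1, 0, 1):    # x' in the slide's formula
                    acc += i_old[y + yp, x + xp] * f[yp + 1, xp + 1]
            i_new[y - 1, x - 1] = acc
    return i_new

# A 3x3 averaging filter leaves a constant image unchanged:
image = np.ones((5, 5))
filt = np.full((3, 3), 1.0 / 9.0)
print(conv3x3(image, filt))  # every output element is 1.0
```

A real output feature map sums this over the full input depth and repeats it per filter, which is what the "3D space" filter on the slide refers to.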

Page 9

What’s in my FPGA?

• DSPs
  • Dedicated single-precision floating-point multiply-and-accumulate units
• Block RAMs
  • Small embedded memories that can be stitched together to form an arbitrary memory system
• Programmable interconnect
  • Programmable logic and routing that can build arbitrary topologies
• The result is a compute architecture with a high degree of customization

Page 10

Why an FPGA for CNN? (Arria 10)

• 1 TFLOP floating-point performance in a mid-range part
  • 35 W total device power
  • Use every DSP on every clock cycle: compute spatially
• 8 TB/s memory bandwidth to keep the state on chip!
  • Exceeds available external bandwidth by a factor of 50
  • Random access, low latency (2 clock cycles)
  • Place all data in on-chip memory: compute temporally
• Fine-grained, low-latency coupling between compute and memory (DSP multiply-accumulators paired with M20K block RAMs)

Page 11

CNNs on FPGAs — Scalable Architecture

Page 12

Market Demands Scalability for Machine Learning

Cloud Analytics:
• 1000s of classes
• Large workloads
• Highly efficient (performance / W)
• Varying accuracy
• Server form factor

Transportation Safety:
• < 10 classes
• Frame rate: 15–30 fps
• Power: 1–5 W
• Cost: low
• Varying accuracy
• Custom form factor

Page 13

Different Parallelism in CNN

Old approach:
• Parallelism across the “face” of the kernel window, and across multiple convolution stages
• Low hardware re-use

New approach:
• Parallelism in the depth of the kernel window and across output features
  • Defer complex spatial math to random access memory
• Re-use hardware to compute multiple layers

Page 14

Scalable CNN Computations — In One Slide

• Filters multiply-accumulate into the output feature map
• The “slide” of the window involves no data movement: it is just addressing an on-chip RAM!

Page 15

Scalable CNN Architecture on FPGA (1)

FPGA contents:
• Double-buffered on-chip RAM, filled from DDR
• Filters held in on-chip RAM
• An array of parallel convolution units (the “# of parallel convolutions” dimension in the diagram)

Page 16

Scalable CNN Architecture on FPGA (2)

Inputs:
• Array size (x, y)
• Clock rate
• External memory bandwidth
• Layer descriptions

Output: calculated throughput & resource utilization
• Given resource constraints, find the optimal architecture
• Example: AlexNet on an A10-115 is a 52x26 array for 800 img/s @ 350 MHz
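The tool flow described here (architecture parameters in, calculated throughput and utilization out) can be pictured as a small design-space search. The resource and performance model below is invented for illustration and is not the actual Intel PSG estimator; the DSP limit matches the Arria 10 GX 1150's 1,518 DSP blocks, and the per-image op count is roughly an AlexNet forward pass.

```python
# Toy design-space search in the spirit of the slide: enumerate candidate
# array sizes (x, y), reject those that exceed resource limits, and keep the
# one with the highest modeled throughput. The cost model (one DSP per PE,
# edge-proportional M20K buffering) is a made-up placeholder.

def search(dsp_limit, m20k_limit, clock_mhz, ops_per_image):
    best = (0, 0, 0.0)  # (x, y, images_per_second)
    for x in range(1, 65):
        for y in range(1, 65):
            dsps = x * y             # assume one DSP per processing element
            m20ks = 2 * (x + y)      # assume buffering scales with array edges
            if dsps > dsp_limit or m20ks > m20k_limit:
                continue
            ops_per_s = dsps * 2 * clock_mhz * 1e6  # multiply + add per cycle
            img_per_s = ops_per_s / ops_per_image
            if img_per_s > best[2]:
                best = (x, y, img_per_s)
    return best

x, y, rate = search(dsp_limit=1518, m20k_limit=2713, clock_mhz=350,
                    ops_per_image=1.4e9)
print(x, y, round(rate))
```

Under this toy model the search lands in the same hundreds-of-images-per-second range as the slide's AlexNet example, but the real estimator also has to model external memory bandwidth and per-layer tiling.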

Page 17

Scalable CNN Architecture on FPGA (3)

• The choice of parallelism has a large impact on the end compute architecture and the properties of the solution
• Defined a scalable approach to CNNs on the FPGA
  • Not tied to a specific FPGA device
  • Not tied to a specific CNN topology
• Design methodology:
  1. Fit the largest possible accelerator network on the FPGA (52x26 on Arria 10)
     • Limited by DSP block & M20K (RAM) resources
  2. Tile the network onto the available accelerator
     • Decompose the filter window into 1x1xW vectors for dot products
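The 1x1xW decomposition in step 2 can be sanity-checked numerically: summing dot products over width-W depth slices of the filter window gives exactly the same result as the full 3D multiply-accumulate. Shapes and names below are illustrative, not the IP's actual data layout:

```python
import numpy as np

# Sketch of the tiling idea: a KxKxD filter window is consumed as 1x1xW
# vectors, so the hardware only ever computes fixed-width dot products.

K, D, W = 3, 8, 4          # kernel size, input depth, vector width (W divides D)
rng = np.random.default_rng(0)
window = rng.random((K, K, D))   # one input patch
filt = rng.random((K, K, D))     # one filter

# Reference result: full 3D multiply-accumulate over the window.
reference = np.sum(window * filt)

# Decomposed result: stream 1x1xW vectors through a dot-product unit.
acc = 0.0
for y in range(K):
    for x in range(K):
        for d0 in range(0, D, W):
            acc += np.dot(window[y, x, d0:d0 + W], filt[y, x, d0:d0 + W])

print(np.isclose(acc, reference))  # True: same sum, in dot-product-sized pieces
```

Because each piece is a fixed-width dot product, the same hardware can be time-multiplexed across layers with different kernel sizes and depths, which is what makes the accelerator reusable.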

Page 18

AlexNet Competitive Analysis — Classification

System (Precision, Image, Speed)[1]                   Throughput   Est. Board Power   Throughput/Watt
Arria 10-115 (Current: FP32, full size, 275 MHz)      575 img/s    ~31 W              18.5 img/s/W
Arria 10-115 (Optimized: FP32, full size, 350 MHz)    750 img/s    ~36 W              20.8 img/s/W
Arria 10-115 (Estimate: FP16, full size, 350 MHz)     900 img/s    ~39 W              23.1 img/s/W
Arria 10-115 (Estimate: 21-bit, full size, 350 MHz)   1200 img/s   ~40 W              30 img/s/W
2x Arria 10-115 (Nallatech 510T board)                2400 img/s   ~75 W              32 img/s/W
cuDNN4 on NVIDIA Titan X                              3216 img/s   227 W              14.2 img/s/W

Titan X source: NVIDIA Corporation, “GPU-Based Deep Learning Inference: A Performance and Power Analysis,” November 2015.

• Further algorithmic optimization of the FPGA is possible
• Expect similar ratios for Stratix 10 vs. NVIDIA 14 nm Pascal
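The throughput-per-watt column follows directly from dividing the two preceding columns, which is easy to verify:

```python
# Sanity-check the img/s/W column of the AlexNet comparison table.
# Throughput and power figures are taken from the slide; board power
# values for the FPGA rows are estimates ("~") in the original.
systems = {
    "Arria 10-115 (FP32, 275 MHz)":    (575, 31),
    "Arria 10-115 (FP32, 350 MHz)":    (750, 36),
    "Arria 10-115 (FP16, 350 MHz)":    (900, 39),
    "Arria 10-115 (21-bit, 350 MHz)":  (1200, 40),
    "2x Arria 10-115 (Nallatech 510T)": (2400, 75),
    "cuDNN4 on NVIDIA Titan X":        (3216, 227),
}
for name, (img_per_s, watts) in systems.items():
    print(f"{name}: {img_per_s / watts:.1f} img/s/W")
```

Running this reproduces every efficiency figure in the table to one decimal place, including the roughly 2x advantage the dual-FPGA board shows over the Titan X.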

Page 19

Getting Started with CNNs on FPGAs

Two complementary paths lead to the high-performance machine learning desired:

Accelerate computation
• Scale & speed of devices
• Better compute architecture
• Math optimization (Winograd, FFT)
• Optimized RTL / HLD
(Current Intel PSG focus, original MSFT focus)

Tune the problem to the platform
• Simplify the network topology
• Reduce precision / use fixed point
• Create more local neuron structures
• Integrated training and classification
(Current i-Abra and partner focus)

The two paths are not mutually exclusive; combine them for an optimal solution.

Page 20

Overview: Design Flow Using CNN IP

Flow: Data Collection → Data Store → Choose Network → Train Network → Execution Engine (Altera CNN IP)

Improvement strategies (feedback into parameters, selection, and architecture):
• Collect more data
• Improve the network

Choose Network
• Use a framework (e.g., Caffe, Torch)
• Choose based on experience or the limits of the execution engine

Train Network
• An HPC workload
• Requires data to be pre-selected
• A weeks-to-months process

Execution Engine
• Implementation of the neural network
• Flexibility, performance & power dominate the choice

Page 21

Overview: Design Flow for CNN Using Partner

Flow: Data Collection → Data Store → Neural Pathways → Neural Synapse (Altera CNN IP), with feedback into parameters, selection, and architecture

Neural Pathways
• Integrated network selection and training
• Capable of acceleration in FPGA
• A minutes-to-hours process

Neural Synapse
• Implementation of a highly efficient neural network
• Built in FPGA fabric with OpenCL

Page 22

Join Us on Our Journey Together…

• New opportunities to increase the FPGA value proposition
• Accelerated FPGA investment driving product innovation to increase your performance and productivity
• Increased operational excellence to accelerate time-to-market
• An expanded product portfolio to arm you with new solutions for your most challenging applications
• Come join us at our booth to see a demo of machine learning on FPGAs

How can Intel + Altera help your business grow?

Page 23

Resources

• Altera website
  • Altera SDK for OpenCL page (www.altera.com/opencl)
  • Technical article “Efficient Implementation of Neural Network Systems Built on FPGAs, Programmed with OpenCL” (www.altera.com/deeplearning-tech-article)
  • GPU vs. FPGA overview online training (available mid-May)
  • CNN on FPGA whitepaper (available mid-May)
  • “Machine Learning on FPGAs” web page (available mid-May)
• Embedded Vision Alliance website
  • Technical article “OpenCL Streamlines FPGA Acceleration of Computer Vision”

Page 24

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

© Intel Corporation

Slide 18, Footnote 1. Configurations: AlexNet configurations on Arria 10-115 FPGAs optimized via IP; tested by Intel PSG. For more information go to https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/pt/arria-10-product-table.pdf

Page 25

Thank You

