
Using a Field Programmable Gate Array to Accelerate Application Performance

Date post: 08-Jan-2017
Upload: stanislas-odinot
Transcript

1

Using a Field Programmable Gate Array to Accelerate Application Performance

P. K. Gupta, Director of Cloud Platform Technology, Intel Corporation

DCWS008

2

Agenda

• Accelerators: Motivation and Use Cases

• Using Field Programmable Gate Array (FPGA) as an Accelerator

• Intel® Xeon® Processor + FPGA Accelerator Platform

• Hardware and Software Programming Interfaces

• Example Applications

4

Digital Services Economy

• Build out of the CLOUD: $120B³

• 50 Billion DEVICES¹

• New SERVICES: $450B²

1: Sources: AMS Research, Gartner, IDC, McKinsey Global Institute, and various other industry analysts and commentators

2: Source: IDC, 2013. 2016 value calculated based on reported CAGR ’13-’17

3: Source: IDATE DigiWorld, 2013

Digital Services Economy…

5

…Fueling Cloud Computing Growth

6

Cloud Economics

Amazon’s TCO Analysis¹

Workload Performance Metrics: Hadoop Queries, Storage Capacity, Web Transactions / Sec, VMs per System

1: Source: James Hamilton, Amazon* http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/

Performance / TCO is the key metric
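To make the metric concrete, the Python sketch below compares two configurations by throughput per TCO dollar. The `ServerConfig` class, the simplified TCO model (purchase price plus energy cost over an assumed three-year life at an assumed $0.10/kWh), and all numbers in the example are hypothetical. Real TCO analyses, such as the one cited above, also include power distribution, cooling, networking, and facilities costs.

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Hypothetical server configuration; all figures are illustrative."""
    name: str
    throughput: float              # workload metric, e.g. web transactions/sec
    capex: float                   # purchase price in dollars
    power_watts: float             # average power draw
    years: float = 3.0             # assumed service life
    dollars_per_kwh: float = 0.10  # assumed electricity price

    def tco(self) -> float:
        # Simplified TCO: purchase price plus energy cost over the service life.
        hours = self.years * 365 * 24
        energy_cost = self.power_watts / 1000 * hours * self.dollars_per_kwh
        return self.capex + energy_cost

    def perf_per_tco(self) -> float:
        return self.throughput / self.tco()


# Made-up numbers: the accelerator adds cost and power but adds more throughput.
baseline = ServerConfig("CPU only", throughput=10_000, capex=8_000, power_watts=400)
accelerated = ServerConfig("CPU + accelerator", throughput=25_000, capex=11_000, power_watts=480)

for cfg in (baseline, accelerated):
    print(f"{cfg.name}: TCO ${cfg.tco():,.0f}, {cfg.perf_per_tco():.2f} per $")
```

With these invented inputs, the accelerated configuration delivers about 1.8 times the throughput per TCO dollar even though it costs more to buy and run. The shape of the comparison is what matters here, not the values.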

7

Diverse Data Center Demands

Intel estimates; bubble size is relative CPU intensity

Accelerators can increase Performance at lower TCO for targeted workloads

9

Accelerator Architecture Landscape

[Figure: Fixed Function Accelerator, Reconfigurable Accelerator, and CPU positioned by Application Flexibility and Ease of Programming/Development]

10

Benefits of Reconfigurable Accelerators: Savings in Area/Power

• Can be configured to implement different functions efficiently

- Meeting performance goals for segment

- Saving area and power compared to multiple Fixed Functions

[Figure: Fixed Functions vs. Programmable Accelerator, compared on Performance, Cost, and Software]

11

Benefits of Reconfigurable Accelerators: Meeting Customer Needs for Differentiation

• Workload Optimized Silicon

• Pervasive Analytics & Insights

• Intelligent Resource Orchestration

• Dynamic Resource Pooling

Driving the Digital Service Economy

12

What is a Field Programmable Gate Array (FPGA)?

FPGAs (Field Programmable Gate Arrays) are semiconductor devices that can be programmed after manufacturing

• The desired functionality of the FPGA can be (re)programmed by downloading a configuration into the device

FPGAs offer several advantages over potential alternatives:

• Lower one-time development cost and faster time to market than custom-designed chips (ASICs)

• Ability to implement customer-specific functionality beyond what is available from application-specific standard products (ASSPs)

• Unlike both ASICs and ASSPs, customizable and reprogrammable after the device has been deployed in the field

[Figure: FPGA structure showing Logic Blocks, Interconnect Resources, and I/O Cells]
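Logic blocks in many FPGAs are built around small lookup tables (LUTs) whose contents come from the downloaded configuration. The toy Python model below is a hypothetical `LookupTable` class, not any vendor's architecture. It shows why loading new configuration bits changes the implemented function while the hardware structure stays the same.

```python
class LookupTable:
    """Toy model of a k-input FPGA lookup table (LUT); hypothetical, not a vendor design.

    The 2**k configuration bits form a truth table. Loading new bits
    'reprograms' the implemented function; the structure never changes.
    """

    def __init__(self, k: int) -> None:
        self.k = k
        self.config = [0] * (2 ** k)

    def configure(self, bits: list[int]) -> None:
        """Download a new configuration (one output bit per input combination)."""
        if len(bits) != 2 ** self.k:
            raise ValueError(f"expected {2 ** self.k} configuration bits")
        self.config = list(bits)

    def evaluate(self, *inputs: int) -> int:
        """Look up the output for the given input bits."""
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        # The inputs select a truth-table row; the first input is the least significant bit.
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.config[index]


lut = LookupTable(2)
lut.configure([0, 0, 0, 1])   # 2-input AND
print(lut.evaluate(1, 1))
lut.configure([0, 1, 1, 0])   # the same LUT reprogrammed as XOR
print(lut.evaluate(1, 1))
```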

14

Intel® Xeon® E5 + Field Programmable Gate Array Software Development Platform (SDP) Shipping Today

[Figure: Platform block diagram: an Intel Xeon Processor E5 Product Family socket linked to the FPGA over Intel QPI, with DDR3 memory channels, PCIe 3.0 x8 links, and DMI2]

Processor: Intel Xeon Processor E5

FPGA Module: Altera* Stratix* V

QPI Speed: 6.4 GT/s full width (target 8.0 GT/s at full width)

Memory to FPGA Module: 2 channels of DDR3 (up to 64 GB)

Expansion connector to FPGA Module: PCI Express® (PCIe) 3.0 x8 lanes; may be used for direct I/O, e.g., Ethernet

Features: Configuration Agent, Caching Agent, (optional) Memory Controller

Software: Accelerator Abstraction Layer (AAL) runtime, drivers, sample applications

Software Development for Accelerating Workloads using Intel® Xeon® processors and coherently attached FPGA in-socket

Intel® QuickPath Interconnect (Intel® QPI)
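The QPI speed in the table converts to link bandwidth with simple arithmetic. The sketch below assumes the usual full-width Intel QPI figure of 16 data bits (2 bytes) per transfer in each direction. The function name is illustrative.

```python
def qpi_bandwidth_gb_s(transfer_rate_gt_s: float, bytes_per_transfer: int = 2) -> tuple[float, float]:
    """Return (per-direction, both-directions) payload bandwidth in GB/s.

    Assumes a full-width link carries 2 bytes of data per transfer per direction.
    """
    per_direction = transfer_rate_gt_s * bytes_per_transfer
    return per_direction, 2 * per_direction


for rate in (6.4, 8.0):  # shipping speed and target speed from the table
    one_way, both = qpi_bandwidth_gb_s(rate)
    print(f"{rate} GT/s: {one_way:.1f} GB/s per direction, {both:.1f} GB/s total")
```

Under these assumptions, 6.4 GT/s corresponds to 12.8 GB/s per direction (25.6 GB/s for both directions combined). The 8.0 GT/s target corresponds to 16 GB/s per direction.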

15

System Logical View

• AFUs can access the coherent cache on the FPGA

• AFUs cannot implement a second-level cache

• The Intel® QuickPath Interconnect (Intel® QPI) IP participates in cache coherency with the processors

[Figure: Processor (Cores, LLC, DRAM) connected over Intel QPI to the FPGA (Intel QPI IP with Cache; AFUs attached through CCI; DDR DRAM); the Multi-processor Coherence Domain and the Cache Access Domain are marked]

16

Intel® Xeon® + Field Programmable Gate Array SDP: Intel® QuickPath Interconnect 1.1 RTL Microarchitecture

• PHY – Implements the Intel QPI PHY 1.1 (Analog/Digital)

• Intel QPI Link Layer – Provides flow control and reliable communication

• Intel QPI Protocol – Implements the Intel QPI Cache Agent + Configuration Agent

• Cache Controller – Determines cache hit/miss and generates Intel QPI protocol requests

• Cache Tag – Tracks the state of each cache line (MESI + internal states for tracking outstanding requests)

• Coherency Table – Programmable table that implements the coherency protocol rules

• System Protocol Layer (SPL2) – Implements address translation; can provide up to 2 GB of device virtual address space to the AFU. SPL2 cannot handle page faults.

• AFU – User-designed Accelerator Function Unit
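The Cache Tag bullet mentions MESI states plus internal states for outstanding requests. The Python model of a single line's tag below is a simplified, hypothetical sketch. A single `PENDING` state stands in for the internal states, request conflicts are ignored, and no actual Intel QPI message flows are modeled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"
    PENDING = "Pending"  # internal state: a request is outstanding on the link


class CacheLine:
    """Simplified tag state for one cache line (hypothetical model)."""

    def __init__(self) -> None:
        self.state = State.I
        self.pending_ownership = False
        self.writebacks = 0

    def local_read(self) -> bool:
        """Return True on a hit; on a miss, record an outstanding read."""
        if self.state in (State.M, State.E, State.S):
            return True
        if self.state is State.I:
            self.state = State.PENDING      # would send a read request
            self.pending_ownership = False
        return False

    def local_write(self) -> bool:
        """Return True if the write completes locally."""
        if self.state in (State.M, State.E):
            self.state = State.M            # E -> M needs no link traffic
            return True
        if self.state in (State.S, State.I):
            self.state = State.PENDING      # would send a request for ownership
            self.pending_ownership = True
        return False

    def fill(self, shared: bool) -> None:
        """The response for the outstanding request arrives."""
        if self.state is not State.PENDING:
            raise RuntimeError("no outstanding request")
        if self.pending_ownership:
            self.state = State.M            # the pending write can now complete
        else:
            self.state = State.S if shared else State.E

    def snoop(self, invalidate: bool) -> None:
        """Another agent reads the line (invalidate=False) or takes ownership (True)."""
        if self.state is State.PENDING:
            return  # conflict resolution omitted in this sketch
        if self.state is State.M:
            self.writebacks += 1            # dirty data written back / forwarded
        if invalidate:
            self.state = State.I
        elif self.state in (State.M, State.E):
            self.state = State.S
```

A read miss followed by an exclusive fill leaves the line in E. A later local write then moves it to M without any request, which is the case the Cache Controller can handle as a hit.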

[Figure: Intel QPI FPGA IP: QPI interface to pins; QPI PHY (Rx Align, Tx Align); QPI Link/Protocol Control (Rx Control, Tx Control); Cache Controller with Cache Data, Cache Tag, and Cache Table; SPL2 address translation; CCI-E and CCI-S Rx/Tx interfaces (640 bits) to the user Accelerator Function Unit (AFU)]

Intel® QuickPath Interconnect (Intel® QPI)
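The SPL2 description above gives up to 2 GB of device virtual address space and no page-fault handling. The Python sketch below models that behavior. The 2 MB page size, the class and method names, and the requirement that host software map every page before the AFU runs are all assumptions for illustration; this is not the documented SPL2 interface.

```python
PAGE_SIZE = 2 * 1024 ** 2        # assumption: 2 MB pages
MAX_DEVICE_VA = 2 * 1024 ** 3    # 2 GB device virtual address space (from the slide)


class TranslationError(Exception):
    """Raised where a CPU MMU would take a page fault; this translator cannot."""


class DeviceTranslator:
    """Hypothetical model of device-virtual to physical address translation."""

    def __init__(self) -> None:
        self.table: dict[int, int] = {}   # virtual page number -> physical page base

    def map_page(self, device_va: int, phys_base: int) -> None:
        """Host software pins a page and installs its mapping before the AFU runs."""
        if device_va % PAGE_SIZE or phys_base % PAGE_SIZE:
            raise ValueError("addresses must be page aligned")
        if not 0 <= device_va < MAX_DEVICE_VA:
            raise ValueError("outside the 2 GB device virtual address space")
        self.table[device_va // PAGE_SIZE] = phys_base

    def translate(self, device_va: int) -> int:
        """Translate an AFU address; unmapped addresses are errors, not faults."""
        if not 0 <= device_va < MAX_DEVICE_VA:
            raise TranslationError(f"{device_va:#x} is outside the device address space")
        vpn, offset = divmod(device_va, PAGE_SIZE)
        if vpn not in self.table:
            raise TranslationError(f"{device_va:#x} is not mapped (no page-fault handling)")
        return self.table[vpn] + offset


xlat = DeviceTranslator()
xlat.map_page(0, 0x4000_0000)
print(hex(xlat.translate(0x1234)))
```

Because there is no fault path, any buffer the AFU touches must be mapped up front. With the assumed 2 MB pages, the full 2 GB space needs at most 1024 table entries.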

18

Intel® Xeon® Processor + Field Programmable Gate Array Tool Flow

HDL Programming: C host code → SW Compiler → exe; HDL → Synthesis/Place-and-Route → bitstream; both target the Intel® Xeon® + FPGA AAL Shell

OpenCL™ Programming: Host code → SW Compiler → exe; Kernels → OpenCL Compiler → bitstream; both target the Intel Xeon + FPGA AAL Shell

FPGA: Field Programmable Gate Array; AAL: Accelerator Abstraction Layer

19

Programming Interfaces

[Figure: On the CPU, the Host Application uses the Accelerator Abstraction Layer (Service API, Virtual Memory API, Physical Memory API); the CPU connects over Intel QPI (Intel QPI/KTI Link, Protocol, & PHY) to the FPGA, where Addr Translation and the CCI¹ standard and CCI¹ extended interfaces serve the Accelerator Function Units (AFU)]

Standard Programming Interfaces: AAL and CCI. Programming interfaces will be forward compatible from SDP² to future MCP³ solutions

Simulation environment available for development of SW and RTL⁴

FPGA: Field Programmable Gate Array

Intel® QuickPath Interconnect (Intel® QPI)

1. Coherent Cache Interface 2. Software Development Platform 3. Multi-chip package 4. Register Transfer Level
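As a rough illustration of the host/AFU split, the Python sketch below mocks a host and an AFU that share one workspace in memory and coordinate through a start flag and a completion flag. This is one plausible pattern for a coherently attached accelerator, not the real AAL Service or memory APIs. `SharedWorkspace`, `afu_loop`, `run_offload`, and the byte-inversion "accelerated" function are all hypothetical, and a Python thread stands in for the FPGA.

```python
import threading


class SharedWorkspace:
    """Hypothetical buffer pair that host and accelerator both access in place."""

    def __init__(self, size: int) -> None:
        self.src = bytearray(size)
        self.dst = bytearray(size)
        self.start = threading.Event()   # stands in for a host-written 'go' flag
        self.done = threading.Event()    # stands in for an AFU-written completion flag


def afu_loop(ws: SharedWorkspace) -> None:
    """Mock AFU: wait for 'go', read src, write dst, then signal completion."""
    ws.start.wait()
    for i, b in enumerate(ws.src):
        ws.dst[i] = b ^ 0xFF             # placeholder for the accelerated function
    ws.done.set()


def run_offload(data: bytes) -> bytes:
    """Host side: fill the shared input, start the AFU, wait, read the shared output."""
    ws = SharedWorkspace(len(data))
    afu = threading.Thread(target=afu_loop, args=(ws,))
    afu.start()
    ws.src[:] = data                     # host writes input into the shared buffer
    ws.start.set()
    ws.done.wait()
    afu.join()
    return bytes(ws.dst)


print(run_offload(b"FPGA").hex())
```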

20

Programming Interfaces: OpenCL™

[Figure: The OpenCL Application runs OpenCL™ host code on the CPU through the OpenCL RunTime and the Accelerator Abstraction Layer (Service API, Virtual Memory API (VirtMem), Physical Memory API) with System Memory; OpenCL kernel code runs as OpenCL Kernels on the FPGA behind CFG, CCI Standard, and CCI Extended interfaces; CPU and FPGA are connected by Intel QPI/PCI Express®]

Unified application code abstracted from the hardware environment

Portable across generations and families of CPUs and FPGAs

FPGA: Field Programmable Gate Array

Intel® QuickPath Interconnect (Intel® QPI)

22

Example Usage: Deep Learning Framework for Visual Understanding

[Figure: CNN accelerator on the FPGA: cluster node device primitives; Processing Tiles 0, 1, … ‘n’, each with PEs, DMA, and Weights/Inputs/Outputs buffers; SRAM Controller (Read, Write, RegAccess); Control State Machine; IP Registers; CCI Interface]

CNN (Convolutional Neural Network) function accelerated on FPGA: Power-performance of CNN classification boosted up to 2.2X†

†Source: Intel measured (Intel® Xeon® processor E5-2699v3 results); Altera estimated (4x Arria-10 results). 2S Intel® Xeon® E5-2699v3 + 4x GX1150 PCI Express® cards. Most computations executed on Arria-10 FPGAs; the 2S Intel Xeon E5-2699v3 host is assumed to be near idle, doing misc. networking/housekeeping functions.

Arria-10 results estimated by Altera with an Altera custom classification network. 2x Intel Xeon E5-2699v3 power estimated @ 139 W while doing "housekeeping" for GX1150 cards, based on an Intel-measured microbenchmark. Sustaining ~2400 img/s requires an I/O bandwidth of ~500 MB/s, which can be supported by a 10GigE link and software stack
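The figure shows several processing tiles, each with its own PEs and weight, input, and output buffers. One common way to use such tiles is to give every tile the same input and a different subset of the weights. The Python sketch below does that under stated assumptions: one output channel per kernel, channels assigned round-robin, and no padding. `conv2d_valid` and `run_on_tiles` are hypothetical names, and the loop runs tiles one after another, whereas on the FPGA they would work in parallel.

```python
def conv2d_valid(image: list[list[float]], kernel: list[list[float]]) -> list[list[float]]:
    """Single-channel 2D convolution without padding (cross-correlation, as most CNN frameworks compute it)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


def run_on_tiles(image, kernels, num_tiles):
    """Hypothetical scheduler: assign output channels round-robin to processing tiles.

    Every tile reads the same input; each tile holds only its own kernels (weights).
    Tiles run sequentially here; in hardware they would run in parallel.
    """
    assignments = {t: list(range(t, len(kernels), num_tiles)) for t in range(num_tiles)}
    outputs = [None] * len(kernels)
    for channels in assignments.values():
        for ch in channels:
            outputs[ch] = conv2d_valid(image, kernels[ch])
    return outputs, assignments


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]], [[1]]]
outputs, assignments = run_on_tiles(image, kernels, num_tiles=2)
print(assignments)
print(outputs[0])
```

Splitting by output channel means each tile needs only its own weights resident in local buffers, while the input can be broadcast to all tiles.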

23

Example Usage: Genome Analysis Toolkit

• HaplotypeCaller (PairHMM)

• BWA-MEM (Smith-Waterman)

PairHMM function accelerated on FPGA: Power-performance of pHMM boosted up to 3.8X†

†pHMM algorithm performance is measured in millions of cell updates per second (MCUPS). Performance projections: CPU performance: 1 core of an Intel® Xeon® processor E5-2680v2 @ 2.8 GHz delivers 2101.1 MCUPS (measured); the estimated value assumes linear scaling to 10 cores on a Xeon E5-2680v2 @ 2.8 GHz & 115 W TDP. FPGA performance: 1 FPGA PE (Processing Engine) delivers 408.9 MCUPS @ 200 MHz (measured); the estimated value assumes linear scaling to 32 PEs and 90% frequency scaling on a Stratix-V A7 at 400 MHz, based on RTL synthesis results (35 W TDP). Intel estimated based on 1S Xeon E5-2680v2 + 1 Stratix-V A7 with QPI 1.1 @ 6.4 GT/s full width using Intel® QuickAssist FPGA System Release 3.3, ICC (CPU is essentially idle when the workload is offloaded to the FPGA)
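BWA-MEM's Smith-Waterman step and HaplotypeCaller's PairHMM both fill a dynamic-programming matrix, which is why the footnote counts cell updates. The Python sketch below is a plain Smith-Waterman local-alignment scorer with a linear gap penalty and illustrative scoring values; production aligners use affine gaps and banding. It also returns the number of cells updated, so MCUPS is that count divided by the elapsed seconds and by 10^6.

```python
import time


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> tuple[int, int]:
    """Return (best local-alignment score, number of DP cells updated).

    Linear gap penalty; the scoring values are illustrative, not BWA-MEM's defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]   # row 0 and column 0 stay 0 (local alignment)
    best = 0
    cells = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
            cells += 1
    return best, cells


start = time.perf_counter()
score, cells = smith_waterman("ACGTTGCA" * 40, "ACGTAGCA" * 40)
elapsed = time.perf_counter() - start
print(f"score={score}, cells={cells}, {cells / elapsed / 1e6:.2f} MCUPS")
```

Each cell depends only on its left, upper, and upper-left neighbors, so cells along an anti-diagonal are independent. That independence is what lets hardware with many processing engines update many cells per clock.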

24

Example Usage: Database Query Processing

[Figure: DB Application issues a Query (Select * from table where a<100) through a Network Router to NAS; Query to Disk; Compressed Data returned; Data Decompression + Query Execution]

Decompression function accelerated on FPGA: Power-performance of LZO Decompression boosted up to 1.9X†

†LZO decompression performance is measured in bytes decompressed per second. Performance projections for stream files of size 111 kB where the decompression matches are within range of the FPGA buffer, requiring no system memory R/W requests: FPGA performance (estimated): 0.48 clocks/byte per LZOD PE (Processing Engine) (resulting in 727 MB/s throughput @ 350 MHz) based on cycle-accurate RTL simulation measurements; assuming linear scaling to 20 LZOD PEs on Arria-10 1150 @ 350 MHz (60 W TDP) (CPU is essentially idle when the workload is offloaded to the FPGA). CPU performance: 4.5 clocks/byte measured on one thread of an E5-2699v3 using IPP 9.0.0 (resulting in 511 MB/s throughput @ 2.3 GHz); assuming linear scaling to 36 threads on 1S E5-2699v3 @ 2.3 GHz (145 W TDP)
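The footnote's projection can be reproduced with a few lines of Python. The sketch uses only figures from the footnote (clock rates, clocks per byte, unit counts, TDPs) and its linear-scaling assumption. The function names are illustrative.

```python
def throughput_mb_s(clock_hz: float, clocks_per_byte: float) -> float:
    """Single-engine throughput in MB/s (10**6 bytes per second)."""
    return clock_hz / clocks_per_byte / 1e6


def perf_per_watt(unit_mb_s: float, units: int, tdp_w: float) -> float:
    """Aggregate MB/s per watt, assuming linear scaling across units as the footnote does."""
    return unit_mb_s * units / tdp_w


# Per-engine throughput recomputed from clocks/byte.
fpga_pe = throughput_mb_s(350e6, 0.48)     # footnote quotes 727 MB/s
cpu_thread = throughput_mb_s(2.3e9, 4.5)   # footnote quotes 511 MB/s

# Scale the quoted per-unit figures to full devices and normalize by TDP.
fpga = perf_per_watt(727, units=20, tdp_w=60)    # 20 LZOD PEs on Arria-10 1150
cpu = perf_per_watt(511, units=36, tdp_w=145)    # 36 threads on 1S E5-2699v3

print(f"FPGA PE: {fpga_pe:.0f} MB/s, CPU thread: {cpu_thread:.0f} MB/s")
print(f"FPGA {fpga:.1f} MB/s/W vs CPU {cpu:.1f} MB/s/W -> {fpga / cpu:.1f}X")
```

The ratio comes out at about 1.9X, matching the slide. Recomputing the FPGA engine rate from 0.48 clocks/byte gives about 729 MB/s rather than the quoted 727 MB/s, a small difference that does not change the 1.9X result.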

25

Academic Research in FPGA Usages

Intel and Altera jointly launched the Hardware Accelerator Research Program

• Q1’15: Call for proposals “which will provide faculty with computer systems containing Intel microprocessors and an Altera* Stratix* V FPGA module that incorporates Intel® QuickAssist Technology in order to spur research in programming tools, operating systems, and innovative applications for accelerator-based computing systems”

• Q2’15: Proposals reviewed and selected

• Q3’15: Systems being shipped to universities

26

Intel® Xeon® + FPGA¹ in the Cloud Vision

[Figure: Cloud Users launch workloads through Orchestration Software in a Software Defined Infrastructure, which places each workload on a Resource Pool (Compute, Storage, Network) of Intel® Xeon® + FPGA systems and applies static/dynamic FPGA programming of workload accelerators from an IP Library of Intel Developed IP, 3rd party Developed IP, FPGA Vendor Developed IP, and End User Developed IP]

1: Field Programmable Gate Array (FPGA)
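The vision above combines workload placement with static or dynamic FPGA programming from an IP library. The Python sketch below shows one possible placement policy: reuse an idle node already programmed with the needed accelerator, otherwise program an idle node, preferring unprogrammed ones. The `Node` class, the `place` function, and the policy itself are hypothetical and are not part of any Intel orchestration software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical Xeon + FPGA node in the resource pool."""
    name: str
    loaded_ip: Optional[str] = None   # accelerator currently programmed into the FPGA
    busy: bool = False


def place(required_ip: str, nodes: list[Node], ip_library: set[str]) -> tuple[Node, bool]:
    """Choose a node for a workload that needs accelerator `required_ip`.

    Returns (node, reprogrammed). Prefers an idle node that already holds the IP
    (akin to static programming); otherwise programs an idle node from the IP
    library (dynamic programming), using unprogrammed nodes before evicting an IP.
    """
    if required_ip not in ip_library:
        raise KeyError(f"{required_ip!r} is not in the IP library")
    idle = [n for n in nodes if not n.busy]
    for node in idle:
        if node.loaded_ip == required_ip:
            node.busy = True
            return node, False
    if not idle:
        raise RuntimeError("no idle Xeon + FPGA node available")
    # False sorts before True, so unprogrammed nodes come first.
    node = sorted(idle, key=lambda n: n.loaded_ip is not None)[0]
    node.loaded_ip = required_ip        # stands in for downloading a new bitstream
    node.busy = True
    return node, True


pool = [Node("node0", loaded_ip="lzo-decompress"), Node("node1")]
library = {"lzo-decompress", "cnn-classify"}
print(place("cnn-classify", pool, library))    # programs the blank node1
print(place("lzo-decompress", pool, library))  # reuses node0 as-is
```

Preferring nodes that already hold the right IP avoids reconfiguration time; a real orchestrator would also weigh data locality and reconfiguration cost.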

27

Summary and Next Steps

• Intel® Xeon® Processor + FPGA platform is targeted for acceleration of various workloads in the data center

• Intel has launched the Hardware Accelerator Research Program for research in FPGA programming and applications

A PDF of this presentation is available from our Technical Session Catalog: www.intel.com/idfsessionsSF. This URL is also printed on the top of Session Agenda Pages in the Pocket Guide.

28

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

© 2015 Intel Corporation.

29

Risk Factors

The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields.
Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation. Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. 
Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.

Rev. 4/14/15

