© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
David Pellerin, Business Development Principal
December 15, 2016
Announcing Amazon EC2 F1 Instances with Custom FPGAs
Hardware-Accelerated Computing on AWS
Agenda
1. Accelerated Computing Concepts
2. Introducing F1 FPGA Instances
3. Examples of FPGA Use-Cases
4. FPGA Development Process
Accelerated Computing on EC2
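EC2 Compute Instance Types
[Figure: timeline of EC2 instance families from 2010 to 2016, grouped by category, with an "Announced" marker on the chart.]
General purpose: M3, M4, T2
Compute optimized: CC2, C3, C4, C5
Storage and IO optimized: HS1, I2, I3, D2
Memory optimized: R3, R4, X1
GPU and FPGA accelerated: CG1, G2, P2, F1 (2016 preview)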
P2: GPU-accelerated computing (NVIDIA Tesla GPU card)
• Enables a high degree of parallelism: each GPU has thousands of cores
• Consistent, well-documented set of APIs (CUDA, OpenACC, OpenCL)
• Supported by a wide variety of ISVs and open source frameworks
F1: FPGA-accelerated computing (Xilinx UltraScale+ FPGA)
• Massively parallel: each FPGA includes millions of parallel system logic cells
• Flexible: no fixed instruction set, can implement wide or narrow datapaths
• Programmable using available, cloud-based FPGA development tools
GPU and FPGA for Accelerated Computing
CPU: High speed, lower efficiency
GPU/FPGA: High throughput, higher efficiency
GPUs and FPGAs can provide massive parallelism and higher efficiency than CPUs for certain categories of applications
Accelerated Computing Concepts
More parallelism for higher throughput…
A GPU is effective at processing the same set of operations in parallel: single instruction, multiple data (SIMD). A GPU has a well-defined instruction set and fixed word sizes, for example integer values and half-, single-, or double-precision floating-point values.
An FPGA is effective at processing the same or different operations in parallel: multiple instructions, multiple data (MIMD). An FPGA does not have a predefined instruction set or a fixed data width.
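As a rough software analogy for the two execution models above, the sketch below uses plain Python lists. The function names `simd_apply` and `mimd_apply` are illustrative only and do not come from the talk. Python evaluates these loops one element at a time, whereas a GPU or FPGA would carry out the element operations concurrently in hardware.

```python
# Illustrative analogy only: the names are hypothetical, and Python runs
# these loops sequentially rather than in parallel hardware.

def simd_apply(op, data):
    """SIMD style: one operation applied to every data element."""
    return [op(x) for x in data]


def mimd_apply(ops, streams):
    """MIMD style: each data element gets its own, possibly different, operation."""
    if len(ops) != len(streams):
        raise ValueError("need exactly one operation per data stream")
    return [op(x) for op, x in zip(ops, streams)]


if __name__ == "__main__":
    # Same instruction (double) over all elements.
    print(simd_apply(lambda x: x * 2, [1, 2, 3, 4]))
    # A different instruction for each element.
    print(mimd_apply([lambda x: x + 1, lambda x: x * x, abs], [10, 5, -7]))
```

The only structural difference is whether there is one operation or one operation per data stream, which is the distinction the slide draws between GPUs and FPGAs.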
[Figure: block diagrams of a CPU (one core: control logic, ALUs, cache, DRAM), a GPU with DRAM, and an FPGA with DRAM.]
Each GPU in P2 has 2880 of these cores
Each FPGA in F1 has more than 2M of these cells
Parallel Processing in GPUs and FPGAs
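[Figure: FPGA fabric containing Block RAM, attached to external DRAM, overlaid with fragments of the HDL source that describes the logic.]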
module filter1 (clock, rst, strm_in, strm_out)
for (i=0; i<NUMUNITS; i=i+1)
always@(posedge clock)
integer i,j; //index for loops
tmp_kernel[j] = k[i*OFFSETX];
The FPGA handles compute-intensive, deeply pipelined, hardware-accelerated operations
The CPU handles the rest of the application
How FPGA Acceleration Works
[Figure: data flowing through a network of parallel and chained Process blocks.]
Hardware-Accelerated Computing
Building parallel systems for parallel problems
An FPGA is effective at processing data of many types in parallel, for example creating a complex pipeline of parallel, multistage operations on a video stream, or performing massive numbers of dependent or independent calculations for a complex financial model…
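To make the idea of a multistage pipeline concrete, here is a minimal clock-by-clock simulation in Python. The helper `run_pipeline` and the three example stages (`scale`, `offset`, `clamp`) are hypothetical and chosen only for illustration. In an FPGA each stage would be dedicated logic, and all stages would work on different data items during the same clock cycle; the simulation mimics that by updating one register per stage on every tick.

```python
# A sketch under stated assumptions: stage functions are made up, and None
# is reserved as the "empty slot" marker, so stages must not return None.

def run_pipeline(stages, inputs):
    """Simulate a hardware-style pipeline one clock tick at a time."""
    regs = [None] * len(stages)  # one output register per stage
    outputs = []
    # Feed every input, then add empty ticks so the pipeline can drain.
    feed = list(inputs) + [None] * len(stages)
    for item in feed:
        # Whatever sits in the last register has finished all stages.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update from the last stage back to the first, so each stage reads
        # the value its predecessor produced on the previous tick.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        regs[0] = stages[0](item) if item is not None else None
    return outputs


def scale(x):
    return x * 2


def offset(x):
    return x + 1


def clamp(x):
    return min(x, 255)


if __name__ == "__main__":
    print(run_pipeline([scale, offset, clamp], [1, 2, 200]))
```

Once the pipeline is full, one result leaves it on every tick even though each item passes through three stages, which is where the throughput gain comes from.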
An FPGA does not have an instruction set!
Data can be any bit-width (9-bit integer? No problem!)
Complex control logic (such as a state machine) is easy to implement in an FPGA
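The 9-bit integer remark can be illustrated in software by masking results to an arbitrary width. The sketch below is only an emulation with hypothetical helper names (`mask`, `add_wrapped`, `to_signed`); on an FPGA the adder would simply be built 9 bits wide, with no masking step.

```python
# Emulating an arbitrary-width datapath in software (illustrative names).

def mask(bits):
    """Bit mask with the lowest `bits` bits set."""
    return (1 << bits) - 1


def add_wrapped(a, b, bits=9):
    """Add two unsigned values as a `bits`-wide adder would, discarding overflow."""
    return (a + b) & mask(bits)


def to_signed(value, bits=9):
    """Interpret a `bits`-wide pattern as a two's-complement signed value."""
    value &= mask(bits)
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


if __name__ == "__main__":
    print(add_wrapped(511, 1))    # largest 9-bit value plus one wraps around
    print(add_wrapped(300, 300))
    print(to_signed(511))
```

Changing `bits` to any other width (7, 13, 48) works the same way, which mirrors the point that FPGA data widths are a design choice rather than a property of the device.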
Each FPGA in F1 has more than 2M of these cells
Parallel Processing in FPGAs
Introducing F1 FPGA Instances
Make FPGA acceleration available to a larger community of developers, and to millions of potential end-customers
Provide dedicated and large amounts of FPGA logic in a single EC2 instance, using multiple FPGAs
Simplify the development process by providing cloud-based FPGA development tools
Allow developers to focus on algorithm design, by abstracting FPGA I/O using well-defined interfaces
Provide access to a growing ecosystem of FPGA programming tools and applications
Provide a Marketplace for FPGA applications, providing more choice and easy access for all AWS customers
FPGA Acceleration in the AWS Cloud: Goals
New EC2 FPGA instance type for accelerated computing
Up to 8 Xilinx UltraScale+ 16nm VU9P FPGA devices in a single instance
The f1.16xlarge size provides:
8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic cells and over 5000 programmable DSP blocks
Each of the 8 FPGAs has 4 DDR-4 interfaces, with each interface accessing a 16GiB, 72-bit wide, ECC-protected memory
Instance Size | FPGAs | DDR-4 (GiB) | FPGA Link | FPGA Direct | vCPUs | Instance Memory (GiB) | NVMe Instance Storage (GB) | Network Bandwidth*
f1.2xlarge | 1 | 4 x 16 | - | - | 8 | 122 | 1 x 480 | 10 Gbps Peak
f1.16xlarge | 8 | 32 x 16 | Y | Y | 64 | 976 | 4 x 960 | 30 Gbps
*In a placement group
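The DDR-4 column follows from the per-FPGA figures on this slide (4 interfaces of 16 GiB each). The short calculation below reproduces it; the constant and function names are illustrative, and the totals are derived arithmetic rather than numbers stated in the talk.

```python
# Figures taken from the slide; names and the derived totals are illustrative.
DDR4_INTERFACES_PER_FPGA = 4
GIB_PER_INTERFACE = 16
FPGAS_PER_SIZE = {"f1.2xlarge": 1, "f1.16xlarge": 8}


def ddr4_summary(instance_size):
    """Return (number of DDR-4 interfaces, total FPGA-attached GiB)."""
    fpgas = FPGAS_PER_SIZE[instance_size]
    interfaces = fpgas * DDR4_INTERFACES_PER_FPGA
    return interfaces, interfaces * GIB_PER_INTERFACE


if __name__ == "__main__":
    for size in FPGAS_PER_SIZE:
        interfaces, total_gib = ddr4_summary(size)
        print(f"{size}: {interfaces} x {GIB_PER_INTERFACE} GiB = {total_gib} GiB")
```

This FPGA-attached memory is separate from the instance memory column, which is the host memory available to the CPU application.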
F1 FPGA Instance Types on AWS
System Logic Block: Each FPGA in F1 provides over 2M of these logic blocks
DSP (Math) Block: Each FPGA in F1 has more than 5000 of these blocks
I/O Blocks: Used to communicate externally, for example to DDR-4, PCIe, or ring
Block RAM: Each FPGA in F1 has over 60Mb of internal Block RAM, and over 230Mb of embedded UltraRAM
[Figure: FPGA floorplan showing Block RAM and I/O blocks connecting to four DDR-4 interfaces, PCIe, and FPGA Link.]
What’s Inside the F1 FPGA?
AWS FPGA Shell
FPGA I/O is provided using pre-configured, pre-tested, and secure I/O components, allowing FPGA developers to focus on their differentiating value
The FPGA Shell allows for faster coding of core acceleration functions by removing the need to develop I/O related FPGA hardware
[Figure: FPGA with Block RAM, four DDR-4 interfaces, FPGA Link, and PCIe.]
Abstracting FPGA I/O
[Figure: an EC2 F1 instance with an Amazon Machine Image (AMI) and an Amazon FPGA Image (AFI); the CPU application on F1 connects over PCIe to the FPGA, which has DDR controllers, DDR-4 attached memory, and FPGA Link.]
Launch instance and load AFI
An F1 instance can have any number of AFIs
An AFI can be loaded into the FPGA in less than 1 second
FPGA Acceleration Using F1
Example F1 Use-Cases
Highly Efficient
• Algorithms Implemented in Hardware
• Gate-Level Circuit Design
• No Instruction Set Overhead
Massively Parallel
• Massively Parallel Circuits
• Multiple Compute Engines
• Rapid FPGA Reconfigurability
FPGA
F1 for Genomics Processing
Speeds Analysis of Whole Human Genomes from Hours to Minutes
Unprecedented Low Cost for Compute and Compressed Storage
F1 for Financial Computing
Modeling Counterparty Risk (CVA) and Regulatory Capital Requirements
F1 for Video Processing
Next Generation Video Compression for Broadcast Quality 4K content
Successfully ported to F1 in just 3 weeks
F1 for Accelerated Analytics
Heterogeneous Compute Acceleration for Faster Data Discovery
FPGA Development Process
Development steps
1. Launch the AWS-provided FPGA Developer AMI, which includes all needed FPGA design and programming software, as well as the AWS FPGA Hardware Development Kit (HDK)
2. Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic
3. After successful simulation, use Vivado or SDAccel to synthesize and place/route the FPGA logic to create an FPGA Design Check Point (DCP), encrypt it, and generate an Amazon FPGA Image (AFI)
4. Launch an F1 instance and load the AFI to the FPGA, using AFI management tools provided by AWS
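The last two steps can be sketched as command lines. The talk does not name the AFI management tools, so the commands below are an assumption based on tooling AWS published after this presentation: the AWS CLI `aws ec2 create-fpga-image` command and the `fpga-load-local-image` utility from the AWS FPGA SDK. The bucket, key, and image identifiers are placeholders, and the sketch assumes the DCP tarball has already been uploaded to S3. The Python code only builds and prints the commands; it does not run them.

```python
import shlex

# Assumption: command names and flags come from later AWS tooling, not from
# this talk. All bucket names, keys, and IDs here are hypothetical.


def create_afi_command(name, bucket, dcp_key, logs_key):
    """Build the AWS CLI call that requests AFI generation from a DCP tarball in S3."""
    return [
        "aws", "ec2", "create-fpga-image",
        "--name", name,
        "--input-storage-location", f"Bucket={bucket},Key={dcp_key}",
        "--logs-storage-location", f"Bucket={bucket},Key={logs_key}",
    ]


def load_afi_command(slot, agfi_id):
    """Build the command that loads an AFI into one FPGA slot of an F1 instance."""
    return ["sudo", "fpga-load-local-image", "-S", str(slot), "-I", agfi_id]


if __name__ == "__main__":
    print(shlex.join(create_afi_command(
        "my-filter", "my-bucket", "dcp/my-filter.tar", "logs/")))
    print(shlex.join(load_afi_command(0, "agfi-EXAMPLE")))
```

Building the argument lists separately from execution keeps the sketch safe to run anywhere; in practice the first command would be issued from the development instance and the second on the F1 instance itself.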
Developing Applications for F1
[Figure: four-step development flow.]
1. FPGA logic design using Xilinx Vivado on a C4 or M4 instance
2. FPGA place-and-route using Xilinx Vivado on a C4 or M4 instance
3. Generate an Amazon FPGA Image (AFI)
4. Securely deploy the AFI on one or more F1 instances
Choose and launch the AWS-provided FPGA Developer AMI, which includes all needed FPGA design and programming software, as well as the AWS FPGA Hardware Development Kit (HDK)
Use Xilinx Vivado or SDAccel software and a hardware description language (Verilog, VHDL, or OpenCL) with the HDK to describe and simulate your custom FPGA logic
After successful simulation, use scripts provided with the HDK to encrypt, synthesize, and place/route the FPGA logic to create a final FPGA Design Check Point (DCP) and generate a secure, encrypted Amazon FPGA Image (AFI)
Launch an F1 instance and download the AFI to the FPGA, using AFI management tools provided by AWS
Generate an Amazon FPGA Image (AFI)
Deploy the AFI on one or more F1 instances
Amazon EC2 FPGA Deployment via Marketplace
[Figure: AWS Marketplace delivering an Amazon Machine Image (AMI) and an Amazon FPGA Image (AFI) to customers.]
The AFI is secured, encrypted, and dynamically loaded into the FPGA; it can't be copied or downloaded
Delivering FPGA Partner Solutions on AWS via AWS Marketplace
AWS Marketplace Benefits
• Streamlined delivery of FPGA-accelerated solutions: Offer software as a managed Amazon Machine Image (AMI) and one or more Amazon FPGA Images (AFI), with secure 1-click purchasing.
• Discover new customers: Allow customers to launch directly from AWS Marketplace, decreasing the length of sales cycles. Sellers can also offer free trials with no additional engineering effort.
• Simplified billing & payments: Customers pay for AWS Marketplace software as part of the regular AWS billing cycle. AWS manages the complexity of AMI and AFI security, metering, billing, payment collection, and financial reporting.
• Secure your FPGA-based products: FPGA custom logic is deployed to customers in a secure way, with no ability to view, copy, or edit the AFI logic.
• Provide Seamless Product Support: AWS Marketplace Product Support Connection makes it easy to support your customers on AWS Marketplace.
F1 Glossary
FPGA: A Field Programmable Gate Array is a device that consists of very large numbers of configurable logic and memory elements interconnected by configurable routing resources. FPGAs differ from CPUs and GPUs by having no fixed instruction set, and in their ability to implement operations and processes that are pipelined and parallelized in an almost unlimited number of ways, using arbitrarily sized bit-widths.
AFI (Amazon FPGA Image): a file containing the binary image for an FPGA bitstream. Loading an AFI onto an FPGA “programs” that device, within seconds, to perform one or more application-specific functions.
HDL (Hardware Description Language): a low-level programming language designed for describing logic functions for the purpose of simulation and for conversion (via synthesis) to an FPGA or ASIC.
Vivado and SDAccel: a set of design tools produced by Xilinx (provider of the F1 FPGA devices) for development of FPGA logic, pre-integrated and provided at no charge by AWS.
Verilog: a commonly used HDL for FPGA design and simulation, supported by Vivado.
VHDL: another commonly used HDL for FPGA design, also supported by Vivado.
OpenCL (Open Computing Language): a higher-level alternative to HDL programming based on the C language, and supported in the Xilinx SDAccel design tools. OpenCL can be used to target either FPGAs or GPUs.
HDK (Hardware Development Kit): a set of tools, documentation, and associated FPGA libraries provided by AWS to assist FPGA developers with more rapid FPGA development, in particular to simplify the use of I/O from the FPGA to the host EC2 instance via PCIe, from FPGA to memory, and from FPGA to FPGA.
AXI: an FPGA-internal bus format providing standardized interfaces for memory-mapped communications and for high-speed streaming data. AXI is used in the F1 HDK to define interfaces between AWS-provided interface logic and custom logic provided by FPGA developers.
Developer AMI: a preconfigured AMI, available in the AWS Marketplace, that includes all necessary software and libraries for FPGA development, including the Vivado software and the HDK libraries enabling HDL design and simulation.
Synthesis: the process, using software tools provided with Vivado, of converting an HDL or OpenCL application into a lower-level format (sometimes referred to as a “netlist”) representing the individual logic elements of the application, for example AND, OR, XOR gates, adders and multipliers, shift registers, etc. This “netlist” must be further processed, using place-and-route software, to create a downloadable bitstream.
Place-and-Route: the process, using software tools provided with Vivado, of mapping individual logic elements to precise locations in the target FPGA, and specifying their interconnections. Place-and-route is an iterative process that can require hours to complete for larger applications and larger FPGAs.
Bitstream: a binary format representing the synthesized, placed, and routed FPGA application ready for downloading to an FPGA.
Design Check Point (DCP): a binary file format containing the FPGA bitstream, ready for ingestion during the creation of an Amazon FPGA Image (AFI).
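To picture what the "netlist" in the Synthesis entry looks like, the sketch below represents a one-bit full adder as a list of AND, OR, and XOR gates and evaluates it in Python. The tuple layout and the names `GATES`, `FULL_ADDER`, and `evaluate` are invented for illustration; real netlists produced by Vivado use their own formats and FPGA-specific primitives.

```python
# A toy netlist: each entry is (output signal, gate type, input 1, input 2).
# The format is hypothetical and only meant to illustrate the concept.

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

FULL_ADDER = [
    ("a_xor_b", "XOR", "a", "b"),
    ("sum", "XOR", "a_xor_b", "cin"),
    ("carry1", "AND", "a", "b"),
    ("carry2", "AND", "a_xor_b", "cin"),
    ("cout", "OR", "carry1", "carry2"),
]


def evaluate(netlist, inputs):
    """Evaluate gates in listed order; each gate's inputs must already be defined."""
    signals = dict(inputs)
    for out, gate, in1, in2 in netlist:
        signals[out] = GATES[gate](signals[in1], signals[in2])
    return signals


if __name__ == "__main__":
    result = evaluate(FULL_ADDER, {"a": 1, "b": 1, "cin": 0})
    print(result["sum"], result["cout"])
```

Place-and-route would then decide where each of these gates lives on the device and how the named signals are wired between them.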
Additional Resources
AWS F1 details: https://aws.amazon.com/ec2/instance-types/f1/
AWS Marketplace: https://aws.amazon.com/marketplace/
AWS Educate: https://aws.amazon.com/education/awseducate/
Edico Genome: http://www.edicogenome.com/
NGCODEC: http://www.ngcodec.com
Maxeler: http://www.maxeler.com/
Ryft: https://www.ryft.com
Thank you!