API for Real-time/Dynamic Image and Video Processing

Presented By

Ivan Wong, VP and Co-founder


About CTAccel

Founded in March 2016

Based in Hong Kong & Shenzhen

Focus on FPGA-based accelerated computing for Data Center

US Patent

Core value of our company: Quality & Innovation

CTAccel Limited

CTAccel & Xilinx partnership

Partner of Xilinx Inc. since 2016

Delivered first Xilinx FPGA-based off-premise solution in 2016

Delivered first Xilinx FPGA-based cloud solution in 2017

Background

Challenges:

• Huge consumption of computational and storage resources

• Server and storage performance is not keeping pace

Market demand and challenges

More Data: Users generate more image and video data every day

Better Quality: Higher resolution in capture and display; users crave a better viewing experience

Faster Access: Customers demand instant access to the resource

Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)*

*Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/

Background

Constraints in existing solutions

CPU performance per core has not increased since 2005

Figure: Chip frequency graphed against year of introduction. Source: The Future of Computing Performance (2011)

Internet traffic increases by 26% annually; image and video contribute a large portion of internet data. Source: Cisco VNI Forecast Highlights Tool

How to achieve the fastest image processing with the least amount of computational resources?

Modern image and video formats continue to emerge

Lepton

HEIF

Professional image processing accelerator

OpenCV

ImageMagick

GraphicsMagick

Mozjpeg

BPG

WebP

JPEG Baseline/Progressive

Guetzli

Pixel Processing

Background

Where is the CTAccel image processor located in the customer's production environment?

Image Video AI

Scenarios: UGC, Web portals, News app, Cloud storage, Cloud album, Social network, E-commerce, …

CIP in Object-oriented storage

CIP software stack and accelerated functions

Accelerated Function Units

Decode: JPEG, WebP, Lepton, HEIF

Pixel Processing: Resizing, Crop, Rotate

Encode: JPEG, WebP, Lepton, HEIF

System Stack

/FFmpeg

Why Xilinx?

Compute Acceleration: Video & Image Processing, Data Analytics, Genomics, Machine Learning

Applications: Genomics, Financial Analytics, Video/Image Processing, Big Data Analytics, Machine Learning, Security

Software-Defined Development Environments

Solution Development Partners & Customers

Cloud, Enterprise to Edge-Computing

Increase Service Level

Product 1: JPEG 2 JPEG

Solution: CIP on FPGA performs JPEG decode, resize, and JPEG encode

End users: Smartphone/pad, camera, PC

Apps: Cloud album, social network, news app, …

Problems solved: Accelerate JPEG image thumbnail generation; reduce the number of image-processing servers

Key features: High throughput; low latency; low CPU utilization

Scenarios: Internet apps with JPEG, such as e-commerce platforms, cloud albums, social networks with pictures, web portals, and news applications

Values:

TCO reduction: Reduce image-processing servers; OPEX reduction

Improve customer experience: Accelerate the image rendering speed

JPEG 2 JPEG performance on cloud

Platforms compared: VU9P (HUAWEI fp1c.2xlarge) and 8 vCPU (HUAWEI fp1c.2xlarge)

Input: 10000 images, 4096x2160 (average size 803k)

Output size   Platform   Throughput (MB/s)   Latency (ms)   CPU utilization
640x480       VU9P       310.4               28             63%
640x480       8 vCPU     7.7                 1122           99%
2048x1080     VU9P       62.2                201            95%
2048x1080     8 vCPU     7                   1162           99%

Input: 10000 images, 1024x768 (average size 130k)

Output size   Platform   Throughput (MB/s)   Latency (ms)   CPU utilization
240x180       VU9P       94.9                4              37%
240x180       8 vCPU     13.3                48             83%
768x576       VU9P       50.8                23             70%
768x576       8 vCPU     8.5                 197            87%

Product 2: JPEG 2 WebP (M6) - Highest compression ratio

Solution: CIP on FPGA performs JPEG decode, resize, and WebP encode

Apps: E-commerce, social network, news app, …

Problems solved: WebP encoding on CPU consumes a vast amount of computational resources and takes a long time; CIP accelerates transcoding from JPEG to WebP

Key features: High throughput; low latency; low CPU utilization

Scenarios: Internet apps with WebP, such as e-commerce platforms, web portals, and news applications

Values:

TCO reduction: Save CDN traffic; OPEX reduction

JPEG to WebP performance

Input: 10000 images, 4096x2160

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7.2.1511

FPGA used: single Xilinx UltraScale+ (FPGA*1); dual Xilinx UltraScale+ (FPGA*2)

Metric              Output size   FPGA*1   FPGA*2   CPU
QPS                 640x480       384      721      99
QPS                 1024x768      286      470      56
QPS                 2048x1080     108      208      25
Latency (ms)        640x480       60.97    32.7     240.64
Latency (ms)        1024x768      82.79    49.94    422.73
Latency (ms)        2048x1080     220.07   113.03   941.62
CPU utilization     640x480       25%      44%      100%
CPU utilization     1024x768      23%      45%      100%
CPU utilization     2048x1080     21%      45%      100%
Throughput (MB/s)   640x480       318      599      82
Throughput (MB/s)   1024x768      237      390      46
Throughput (MB/s)   2048x1080     89       173      21

QPS (at 1024x768 output): Single FPGA is 5.11 times CPU; dual FPGA is 8.39 times CPU

Latency (at 1024x768 output): Single FPGA is 0.20 times CPU; dual FPGA is 0.12 times CPU
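The headline ratios on this slide (5.11x and 8.39x QPS, 0.20x and 0.12x latency) can be reproduced from the 1024x768 chart values (286, 470, and 56 QPS; 82.79, 49.94, and 422.73 ms). The sketch below is only a sanity check on the published numbers; the helper name `ratio_to_cpu` is illustrative and not part of any CTAccel API.

```python
# Figures for 1024x768 output, taken from the JPEG-to-WebP benchmark slide.
qps = {"FPGA*1": 286, "FPGA*2": 470, "CPU": 56}
latency_ms = {"FPGA*1": 82.79, "FPGA*2": 49.94, "CPU": 422.73}


def ratio_to_cpu(metric, decimals=2):
    """Divide each accelerator value by the CPU baseline (hypothetical helper)."""
    base = metric["CPU"]
    return {
        name: round(value / base, decimals)
        for name, value in metric.items()
        if name != "CPU"
    }


qps_ratio = ratio_to_cpu(qps)
latency_ratio = ratio_to_cpu(latency_ms)

print("QPS relative to CPU:", qps_ratio)
print("Latency relative to CPU:", latency_ratio)
```

The same helper can be pointed at the 640x480 or 2048x1080 rows to see how the advantage changes with output size.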

Modern image format: Lepton and its problem

Lepton is a new image compression format developed by Dropbox and made open source in 2016. Lepton compresses JPEG images losslessly, reducing file size by an average of 22%. It preserves the original JPEG file bit-for-bit, including all metadata. Lepton suits scenarios with massive image storage, where it can effectively reduce the storage cost.

Resolution   Subsampling   JPEG   Lepton   Compression ratio
1024x768     420           1.3G   1.1G     15%
1024x768     422           1.4G   1.1G     21%
1024x768     444           1.6G   1.3G     18%
4096x2160    420           8.3G   6.3G     24%
4096x2160    422           8.7G   6.6G     24%
4096x2160    444           9.5G   7.1G     25%
Average                                    21%
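The "compression ratio" column reads as space saved relative to the original JPEG size. The sketch below recomputes it from the rounded sizes in the table, assuming that interpretation. Because the sizes are rounded to one decimal, the recomputed values differ slightly from the listed ones.

```python
# (resolution, subsampling, JPEG size in GB, Lepton size in GB, listed saving in %)
rows = [
    ("1024x768", "420", 1.3, 1.1, 15),
    ("1024x768", "422", 1.4, 1.1, 21),
    ("1024x768", "444", 1.6, 1.3, 18),
    ("4096x2160", "420", 8.3, 6.3, 24),
    ("4096x2160", "422", 8.7, 6.6, 24),
    ("4096x2160", "444", 9.5, 7.1, 25),
]


def saving_percent(jpeg_gb, lepton_gb):
    """Space saved by Lepton as a percentage of the original JPEG size."""
    return (jpeg_gb - lepton_gb) / jpeg_gb * 100


for resolution, subsampling, jpeg_gb, lepton_gb, listed in rows:
    computed = saving_percent(jpeg_gb, lepton_gb)
    print(f"{resolution} {subsampling}: computed {computed:.1f}%, listed {listed}%")

# Average of the listed per-row savings, as reported in the table's last row.
average_listed = sum(row[4] for row in rows) / len(rows)
print(f"Average of listed savings: {average_listed:.1f}%")
```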

Problem

The downside of Lepton is that it requires heavy computation for both compression and decompression. A conventional x86 server with dual E5-2630 CPUs can only compress JPEG files into Lepton format at a rate of 20 megabytes per second.

Product 3: JPEG to Lepton

Solution: CIP on FPGA performs JPEG decode and Lepton encode

Apps: Cloud album, cloud storage, …

Problems solved: Lepton encoding on CPU consumes a vast amount of computational resources and takes a long time; CIP accelerates transcoding from JPEG to Lepton

Key features: High throughput; low latency; low CPU utilization

Scenarios: Scenarios with massive image storage, such as cloud albums and cloud storage

Values:

TCO reduction: Save CDN traffic; reduce image-storage servers; OPEX reduction

JPEG to Lepton performance

FPGA used: Virtex UltraScale+ FPGA, Alveo U200

Input images: 999 images, total file size 3.8GB

Test environment: CPU: Intel(R) Xeon(R) E5-2630v2 x2; RAM: 128GB; OS: CentOS Linux release 7.3.1611

Metric (24 threads)   U200     CPU
Throughput (MB/s)     108.2    25.7
QPS                   28.3     6.7
Latency (ms)          827.77   3492.08
CPU utilization       7%       100%

QPS: FPGA is 4.2 times CPU

Latency: FPGA is 0.24 times CPU
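The QPS and latency figures in this benchmark can be cross-checked with Little's law (mean requests in flight = request rate x mean latency). The sketch below assumes a closed-loop load generator in which each of the 24 threads keeps one request outstanding; the slides do not state this, so treat it as an assumption. It also derives the implied average image size from throughput and QPS.

```python
# Figures from the JPEG-to-Lepton benchmark slide (24 client threads).
THREADS = 24
results = {
    "U200": {"qps": 28.3, "latency_ms": 827.77, "throughput_mb_s": 108.2},
    "CPU": {"qps": 6.7, "latency_ms": 3492.08, "throughput_mb_s": 25.7},
}


def in_flight_requests(qps, latency_ms):
    """Little's law: mean concurrency = arrival rate x mean time in system."""
    return qps * latency_ms / 1000.0


def mb_per_image(throughput_mb_s, qps):
    """Average payload per request implied by throughput and request rate."""
    return throughput_mb_s / qps


for name, r in results.items():
    concurrency = in_flight_requests(r["qps"], r["latency_ms"])
    size = mb_per_image(r["throughput_mb_s"], r["qps"])
    print(f"{name}: ~{concurrency:.1f} requests in flight "
          f"(threads: {THREADS}), ~{size:.2f} MB per image")
```

Both platforms come out close to 24 requests in flight and roughly 3.8 MB per image, which matches the stated thread count and the 3.8GB / 999-image input set.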

Product 4: JPEG to HEIF

Solution: CIP on FPGA performs JPEG decode, resize, and HEIF encode

Apps: Cloud album, social network, news app, …

Problems solved: HEIC encoding on CPU consumes a vast amount of computational resources and takes a long time; poor customer experience

Key features: High throughput; low latency; low CPU utilization

Scenarios: Internet apps with HEIF, such as e-commerce platforms, web portals, and news applications

Values:

TCO reduction; OPEX reduction

JPEG to HEIF performance

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2680v4; RAM: 128GB; OS: CentOS Linux release 7.2.1511; Kernel version: 3.10.0-862.el7.x86_64

Input: Resolution: 1920*1080; VQ: preset=medium; Total files: 100

Output: Resolution: 1920*1080

Params               JPG         FPGA          CPU
Images               100         100           100
Total size (bytes)   173738658   135858024     133290287
Compression ratio                78.2%         76.72%
Param                            -slice_qp 5   crf=12 (10179.74 kb/s)
FPS                              200.12        20.02
PSNR                             54.51         57.06
VMAF                             97.25         97.31

QPS: FPGA*1 is 10 times that of CPU

Latency: FPGA*1 is 0.1 times that of CPU
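The compression ratios and the 10x figure in the comparison follow directly from the byte totals and frame rates in the table. The sketch below reproduces them, treating "compression ratio" as HEIF output size divided by JPEG input size, which is the reading that matches the listed 78.2% and 76.72%.

```python
# Byte totals and frame rates from the JPEG-to-HEIF comparison table.
JPEG_BYTES = 173_738_658
heif = {
    "FPGA": {"bytes": 135_858_024, "fps": 200.12},
    "CPU": {"bytes": 133_290_287, "fps": 20.02},
}


def size_ratio_percent(output_bytes, input_bytes=JPEG_BYTES):
    """HEIF output size as a percentage of the JPEG input size."""
    return output_bytes / input_bytes * 100


fpga_ratio = size_ratio_percent(heif["FPGA"]["bytes"])
cpu_ratio = size_ratio_percent(heif["CPU"]["bytes"])
fps_speedup = heif["FPGA"]["fps"] / heif["CPU"]["fps"]

print(f"FPGA output is {fpga_ratio:.2f}% of the JPEG input size")
print(f"CPU output is {cpu_ratio:.2f}% of the JPEG input size")
print(f"FPGA encodes {fps_speedup:.1f}x as many frames per second as the CPU")
```

Note that the two encoders run at different quality settings (the CPU run scores higher PSNR), so the size and speed comparison is not at exactly matched quality.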

Features of CIP

High performance: 5-10 times throughput improvement compared to CPU; 3 times latency reduction compared to CPU

Low power: 20W-40W per FPGA card; TCO reduction: improve computing density of the data center, reduce racks

Software compatible: Supports ImageMagick, OpenCV, GraphicsMagick, and Lepton; allows seamless migration from a software-based implementation to CIP

Ease of maintenance: Remote upgrading; PR (Partial Reconfiguration) technology allows fast and easy context switch in accelerator functionality without rebooting the server

Summary of CIP’s values

Case study 1

Scenario: Image thumbnail generation for cloud album

Customer: Leading mobile phone manufacturer in China

Solution: CIP on Xilinx FPGA performs JPEG decode and resize

Advantages: High performance; TCO reduction; ease of maintenance

· Reduce TCO by 25%

· 3x latency reduction

· Deterministic timing

· Smaller cluster of servers

· Reduce the total number of servers in the cluster by 50%

· Improve the throughput by 50%

· Better QoS

Feedback from customer:

· CIP has been deployed in the production environment for more than 16 months

· Maintaining stable operating status

Case study 2

Scenario: Transcoding from JPEG to WebP

Customer: Famous video portal website in China

Solution: CIP performs JPEG decode, resize, and WebP encode

Advantages:

High performance: One server with one CIP accelerator has computing capability equivalent to 3 servers without CIP

TCO reduction: Reduce TCO by at least 50%

Improve customer experience: Reduce latency by xx% (the specific value has not been announced by the customer); deterministic timing: the same low latency at full load and no load

Feedback from customer:

· CIP has passed the customer's test

· Test period: 2 months

· CIP will be deployed in the production environment in 2018Q4
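The two claims in this case study (one CIP server matches three servers without CIP; TCO falls by at least 50%) can be related with simple arithmetic. The sketch below is a rough model, not CTAccel's TCO method: it assumes TCO scales with server count, and `relative_cost` (the cost of one CIP-equipped server relative to one plain server) is a hypothetical parameter that the slides do not provide.

```python
import math

CONSOLIDATION = 3  # from the slide: 1 server with CIP ~ 3 servers without CIP


def servers_needed(plain_servers, consolidation=CONSOLIDATION):
    """CIP-equipped servers needed to replace a cluster of plain servers."""
    return math.ceil(plain_servers / consolidation)


def tco_reduction(relative_cost, consolidation=CONSOLIDATION):
    """Fractional TCO reduction under the per-server cost assumption.

    relative_cost is hypothetical: cost of one CIP server / cost of one plain server.
    """
    return 1 - relative_cost / consolidation


print("Servers to replace a 300-server cluster:", servers_needed(300))
for relative_cost in (1.0, 1.25, 1.5):
    print(f"relative cost {relative_cost:.2f}: "
          f"TCO reduction {tco_reduction(relative_cost):.0%}")
```

Under these assumptions, a 3:1 consolidation yields a TCO reduction of 50% or more as long as a CIP-equipped server costs no more than 1.5 times a plain server.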

Evolution planning

Accelerated image/video transcoding

- Image codec: JPEG, WebP, HEIF, Lepton, BPG, GIF, PNG

- Pixel processing: High quality resizing, high performance resizing, smart-crop, super-resolution

- Video codec: H264, H265, AV1, AVS3

Accelerated image/video analytics

- AI function products around picture and video content: content identification, face detection, pedestrian detection, etc.

- Functional services such as searching for pictures, searching for videos, etc.

Software as a Service

- Ultimately all CTAccel accelerated functions will be served by a unified SaaS framework

- Unified APIs across CSPs and on-premise

Plan & Development

JPEG Lepton

JPEG WebP

Smart-crop

Super-resolution

JPEG

JPEG HEIF

Ready to Market

Resize/Crop/Rotate

MPSoC - H265

MPSoC - H264

AV1

AVS3

CIP

Image Codecs

CVP

Video Codecs

Products

PNG

GIF

Video/ImageAnalytics

Scan for pornographic content

Video game recognition

AI assist security inspection

AVIF

Hardware platform support: Alveo U200, Alveo U50, Kintex® UltraScale™ 115, Zynq MPSoC

Partners

Cloud platform support

Product line of CTAccel

Image, Video, AI

High-quality FPGA-based accelerated computing solution provider

Higher performance density

Improve quality of service

TCO reduction

Xilinx MPSoC with VCU - H264/H265

Solution: Xilinx MPSoC with VCU performs H264/H265 decode, resize, and H264/H265 encode

End users: Smartphone, PC, TV

Apps: Video portal website, short video app, live broadcast, …

Problems solved: Accelerate video processing; improve customer experience

Key features: FPS; PSNR

Scenarios: Applications which require video transcoding, such as video portal websites and short video apps

Values:

TCO reduction: Reduce processing servers; OPEX reduction

Performance of H.264 and H.265 (MPSoC)

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7

H.264 Encode (FPS)

Resolution   CVP      X264 medium
1920*1080    307.2    118.96
1280*720     449.94   227.13
960*540      643.87   377.57
640*480      578      516.21

H.265 Encode (FPS)

Resolution   CVP      X265 medium
1920*1080    324.77   30.37
1280*720     478.07   63.89
960*540      712.39   100.34
640*480      636      150.97

Performance of H.264 and H.265 (MPSoC)

Input: Crowdrun 1080p, H264

Output: 720p, 540p, 480p

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2630v2 and 2 x Intel(R) Xeon(R) CPU E5-2680v4; RAM: 128GB; OS: CentOS Linux release 7

FPS

Encoder       Server: 2*E5-2630v2   Server: 2*E5-2680v4
CVP H.264     183.22                200.68
CVP H.265     200.42                236.41
X264 medium   85.89                 236.03
X265 medium   28.3                  91.09

Quality of H.264 and H.265 (MPSoC)

Input: High dynamic and high complexity video (Crowdrun, park_joy)

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7

Figure: H.265 VMAF versus bitrate (Mbps) for CVP H.265, X265 medium, and NVP4 H.265 medium

Figure: H.265 PSNR versus bitrate (Mbps) for CVP H.265, X265 medium, and NVP4 H.265 medium

Figure: H.264 VMAF versus bitrate (Mbps) for CVP H.264, X264 medium, and NVP4 H.264 medium

Figure: H.264 PSNR versus bitrate (Mbps) for CVP H.264, X264 medium, and NVP4 H.264 medium

I-Frame 2 JPEG

Solution: CIP + MPSoC performs H264/H265 decode, resize, and JPEG encode

End users: Smartphone, PC, TV

Apps: Video portal website, short video app, live broadcast, …

Problems solved: Accelerate "I-Frame 2 JPEG"; improve customer experience

Key features: High throughput; low latency; low CPU utilization

Scenarios: Applications which require "I-Frame 2 JPEG", such as video portal websites and short video apps

Values:

TCO reduction: Reduce processing servers; OPEX reduction

AI accelerator: face detection/recognition

Solution: Image decode and resize on FPGA, followed by face recognition on GPU/FPGA

End users: Smartphone, pad, PC

Apps: Safe City, Smart City, Intelligent Transportation, …

Problems solved: Accelerate face detection/recognition; improve customer experience; reduce TCO

Key features: High throughput; low latency; low CPU utilization

Scenarios: Smart City; Intelligent Transportation; Safe City; security systems

Values:

TCO reduction; OPEX reduction; improve the speed of face detection/recognition

AI accelerator: test scenarios

Test setup: FPGA #1, FPGA #2, FPGA #3; memory caching; GPU computing matrix (5 GPUs)

Test-1: Alexnet model

Test-2: nsfw model

Test-3: Inception V4 model

Test-4: ResNet-50 model

Test environment: CPU: 2 x Intel(R) Xeon(R) CPU E5-2690v4; RAM: 128GB; OS: CentOS Linux release 7.4; Kernel version: 3.10.0-514.2.2.el7.x86_64; Python version: 2.7

Test data

Image type      Resolution   Quantity
8 Mega Pixel    2647x3278    14708
16 Mega Pixel   5312x2988    9280
1080p           1920x1080    21400

Test result 1: Alexnet model

QPS: FPGA+GPU is 2.5 times faster than CPU+GPU

Latency: FPGA+GPU is 30% of CPU+GPU

CPU usage: FPGA+GPU is 20% of CPU+GPU

Figures: Latency (ms), CPU utilization (%), and throughput (MB/s) for 8M-pixel, 16M-pixel, and 1080p images, comparing 3 FPGA + 5 GPU with CPU + 5 GPU

Test result 2: nsfw model

QPS: FPGA+GPU is 3 times faster than CPU+GPU

Figures: Throughput (MB/s), latency (ms), and CPU utilization (%), comparing 3 FPGA + 5 GPU with CPU + 5 GPU

Test result 3: Inception V4 model

Figures: Throughput (MB/s), latency (ms), and CPU utilization (%)

Test result 4: ResNet-50 model

Figures: Throughput (MB/s), latency (ms), and CPU utilization (%)

Test result 5: Accuracy

Model         Category     Accuracy (top1)   Accuracy (top5)
Alexnet       TensorFlow   0.49              0.74
Alexnet       accel        0.49              0.73
Inceptionv4   TensorFlow   0.80              0.95
Inceptionv4   accel        0.79              0.95
ResNet50      TensorFlow   0.73              0.91
ResNet50      accel        0.72              0.90
nsfw          TensorFlow   0.75              N/A
nsfw          accel        0.75              N/A

For any product related enquiries:

Product website: www.ct-accel.com

Product hotline: +86-0755-88914045

Product enquiry email: [email protected]

Cloud partners: AWS, HUAWEI Cloud, Alibaba Cloud, Baidu Cloud, Tencent Cloud
