API for Real-Time/Dynamic Image and Video Processing
Presented by Ivan Wong, VP and Co-founder
About CTAccel
Founded in March 2016
Based in Hong Kong & Shenzhen
Focus on FPGA-based accelerated computing for the data center
US Patent
Core value of our company: Quality & Innovation
CTAccel Limited
CTAccel & Xilinx partnership
Partner of Xilinx, Inc. since 2016
Delivered first Xilinx FPGA-based Off-Premise solution in 2016
Delivered first Xilinx FPGA-based Cloud solution in 2017
Background
Market demand and challenges
More Data: Users generate more image and video data every day
Better Quality: Higher resolution in capture and display; users crave a better viewing experience
Faster Access: Customers demand instant access to resources
Challenges:
• Huge consumption of computational and storage resources
• Server and storage performance is not keeping pace
Data storage supply and demand worldwide, from 2009 to 2020 (in exabytes)*
*Source: https://www.statista.com/statistics/751749/worldwide-data-storage-capacity-and-demand/
Background: Constraints in existing solutions
CPU performance per core has not increased since 2005
[Figure: Chip frequency graphed against year of introduction. Source: The Future of Computing Performance (2011)]
Internet traffic increases by 26% annually; images and video contribute a large portion of Internet data. (Source: Cisco VNI Forecast Highlights Tool)
How to achieve the fastest image processing with the least amount of computational resources?
Background
Modern image and video formats continue to emerge
Formats and encoders: Lepton, HEIF, BPG, WebP, JPEG Baseline/Progressive, Mozjpeg, Guetzli
Libraries: OpenCV, ImageMagick, GraphicsMagick
Professional image processing accelerator
Pixel processing
Where is the CTAccel image processor (CIP) located in customers’ production environments?
Image Video AI
Scenarios: UGC, Web portals, News app, Cloud storage, Cloud album, Social network, E-commerce, ...
CIP in Object-oriented storage
CIP software stack and accelerated functions
Accelerated Function Units
Decode: JPEG, WebP, Lepton, HEIF
Pixel Processing: Resize, Crop, Rotate
Encode: JPEG, WebP, Lepton, HEIF
System Stack
/FFmpeg
Why Xilinx?
Compute Acceleration: Video & Image Processing, Data Analytics, Genomics, Machine Learning
Applications: Genomics, Financial Analytics, Video/Image Processing, Big Data Analytics, Machine Learning, Security
Software-Defined Development Environments
Solution Development Partners & Customers
Cloud, Enterprise to Edge Computing
Increase Service Level
Product 1: JPEG 2 JPEG
Problems solved: Accelerate JPEG image thumbnail generation and reduce the number of image-processing servers.
Solution (CIP): JPEG Decode -> Resize -> JPEG Encode, on FPGA
End-user: Smartphone/Pad, Camera, PC
App: Cloud album, Social network, News app, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios: Internet apps with JPEG
E-commerce platform
Cloud album
Social network with pictures
Web portals
News application
Values:
TCO reduction: Reduce image-processing servers; OPEX reduction
Improve customer experience: Accelerate the image rendering speed
JPEG 2 JPEG performance on cloud
Series: VU9P (HUAWEI fp1c.2xlarge) vs. 8vCPU (HUAWEI fp1c.2xlarge)
[Charts, input 10000 images at 4096x2160 (average size 803k): Throughput (MB/s), Latency (ms), and CPU Utilization at 640x480 and 2048x1080]
[Charts, input 10000 images at 1024x768 (average size 130k): Throughput (MB/s), Latency (ms), and CPU Utilization at 240x180 and 768x576]
Product 2: JPEG 2 WebP (M6), highest compression ratio
Problems solved: WebP encoding on CPU consumes a vast amount of computational resources and takes a long time.
Solution (CIP): Accelerate the transcoding from JPEG to WebP: JPEG Decode -> Resize -> WebP Encode, on FPGA
End-user: Smartphone/Pad, Camera, PC
App: E-commerce, Social network, News app, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios: Internet apps with WebP
E-commerce platform
Social network with pictures
Web portals
News application
Values:
TCO reduction: Save CDN traffic; Reduce image-processing servers; OPEX reduction
Improve customer experience: Accelerate the image rendering speed
JPEG to WebP performance
Input: 10000 images at 4096x2160
Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7.2.1511
FPGA used: Single Xilinx UltraScale+ (FPGA*1), Dual Xilinx UltraScale+ (FPGA*2)

Metric              Series   640x480   1024x768   2048x1080
QPS                 FPGA*1   384       286        108
QPS                 FPGA*2   721       470        208
QPS                 CPU      99        56         25
Latency (ms)        FPGA*1   60.97     82.79      220.07
Latency (ms)        FPGA*2   32.7      49.94      113.03
Latency (ms)        CPU      240.64    422.73     941.62
CPU Utilization     FPGA*1   25%       23%        21%
CPU Utilization     FPGA*2   44%       45%        45%
CPU Utilization     CPU      100%      100%       100%
Throughput (MB/s)   FPGA*1   318       237        89
Throughput (MB/s)   FPGA*2   599       390        173
Throughput (MB/s)   CPU      82        46         21

QPS: Single FPGA is 5.11 times that of CPU; dual FPGA is 8.39 times that of CPU
Latency: Single FPGA is 0.20 times that of CPU; dual FPGA is 0.12 times that of CPU
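The headline ratios on this slide can be rechecked from the chart values. The sketch below is only an illustration: the numbers are the QPS and latency bar values as read from the slide's charts, the assignment of each value to a series is an assumption based on the chart legend order, and the helper name `ratio_to_cpu` is hypothetical.

```python
# Bar values as read from the QPS and Latency charts on this slide.
# The mapping of values to series follows the legend order (an assumption).
RESOLUTIONS = ["640x480", "1024x768", "2048x1080"]
QPS = {
    "FPGA*1": [384, 286, 108],
    "FPGA*2": [721, 470, 208],
    "CPU": [99, 56, 25],
}
LATENCY_MS = {
    "FPGA*1": [60.97, 82.79, 220.07],
    "FPGA*2": [32.7, 49.94, 113.03],
    "CPU": [240.64, 422.73, 941.62],
}


def ratio_to_cpu(metric, series):
    """Return the per-resolution ratio of one series to the CPU baseline."""
    return [round(v / c, 2) for v, c in zip(metric[series], metric["CPU"])]


qps_single = ratio_to_cpu(QPS, "FPGA*1")
qps_dual = ratio_to_cpu(QPS, "FPGA*2")
lat_single = ratio_to_cpu(LATENCY_MS, "FPGA*1")
lat_dual = ratio_to_cpu(LATENCY_MS, "FPGA*2")

for i, res in enumerate(RESOLUTIONS):
    print(f"{res}: QPS x{qps_single[i]} / x{qps_dual[i]}, "
          f"latency x{lat_single[i]} / x{lat_dual[i]} (single / dual FPGA vs CPU)")
```

Under these assumptions the quoted 5.11x and 8.39x QPS figures, and the 0.20x and 0.12x latency figures, line up with the 1024x768 column; the other two resolutions give somewhat different ratios.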
Modern image format: Lepton and its problem
Lepton is a new image compression format developed by Dropbox and made open source in 2016. Lepton compresses JPEG images losslessly, reducing file size by an average of 22%. It preserves the original JPEG file bit for bit, including all metadata. Lepton suits scenarios with massive image storage, where it can effectively reduce storage cost.

Resolution   Subsampling   JPEG   Lepton   Compression Ratio
1024x768     420           1.3G   1.1G     15%
1024x768     422           1.4G   1.1G     21%
1024x768     444           1.6G   1.3G     18%
4096x2160    420           8.3G   6.3G     24%
4096x2160    422           8.7G   6.6G     24%
4096x2160    444           9.5G   7.1G     25%
Average                                    21%

Problem: The downside of Lepton is that it requires heavy computation for both compression and decompression. A conventional x86 server with dual E5-2630 CPUs can only compress JPEG files into Lepton format at a rate of 20 megabytes per second.
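As a worked check of the table, the "Compression Ratio" column reads as the share of the JPEG size that Lepton removes. The sketch below recomputes it from the listed sizes; since those sizes are rounded to one decimal, the recomputed values are only expected to land within about one percentage point of the slide's figures. The function name `saving_percent` is hypothetical.

```python
# (resolution, chroma subsampling, JPEG size in GB, Lepton size in GB,
#  compression ratio in percent as listed in the table)
ROWS = [
    ("1024x768", "420", 1.3, 1.1, 15),
    ("1024x768", "422", 1.4, 1.1, 21),
    ("1024x768", "444", 1.6, 1.3, 18),
    ("4096x2160", "420", 8.3, 6.3, 24),
    ("4096x2160", "422", 8.7, 6.6, 24),
    ("4096x2160", "444", 9.5, 7.1, 25),
]


def saving_percent(jpeg_gb, lepton_gb):
    """Share of the original JPEG size removed by Lepton, in percent."""
    return (1.0 - lepton_gb / jpeg_gb) * 100.0


computed = [saving_percent(j, l) for _, _, j, l, _ in ROWS]
listed = [shown for *_, shown in ROWS]
mean_listed = sum(listed) / len(listed)

for (res, sub, _, _, shown), calc in zip(ROWS, computed):
    print(f"{res} {sub}: {calc:.1f}% recomputed, {shown}% listed")
print(f"mean of listed ratios: {mean_listed:.1f}%")
```

The mean of the listed ratios rounds to the 21% average shown in the table.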
Product 3: JPEG to Lepton
Problems solved: Lepton encoding on CPU consumes a vast amount of computational resources and takes a long time.
Solution (CIP): Accelerate the transcoding from JPEG to Lepton: JPEG Decode -> Lepton Encode, on FPGA
End-user: Smartphone/Pad, Camera, PC
App: Cloud album, Cloud storage, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios: Scenarios with massive image storage
Cloud album
Cloud storage
Values:
TCO reduction: Save CDN traffic; Reduce image-processing servers; Reduce image-storage servers; OPEX reduction
JPEG to Lepton performance
FPGA used: Virtex UltraScale+ FPGA, Alveo U200
Input images: 999 images, total file size 3.8GB
Test environment: CPU: Intel(R) Xeon(R) E5-2630v2 x2; RAM: 128GB; OS: CentOS Linux release 7.3.1611

Metric (24 threads)   U200     CPU
Throughput (MB/s)     108.2    25.7
QPS                   28.3     6.7
Latency (ms)          827.77   3492.08
CPU Utilization       7%       100%

QPS: FPGA is 4.2 times that of CPU
Latency: FPGA is 0.24 times that of CPU
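The QPS and latency figures on this slide can be cross-checked against each other with Little's law (requests in flight = arrival rate x time in system). The sketch below assumes the benchmark ran as a closed loop with 24 concurrent workers, which is how the "24 threads" chart label is read here; the slide does not describe the load generator, and the names `RUNS` and `in_flight` are hypothetical.

```python
THREADS = 24  # taken from the "24 threads" chart label (interpretation assumed)

# QPS and latency as read from the slide's charts.
RUNS = {
    "U200": {"qps": 28.3, "latency_ms": 827.77},
    "CPU": {"qps": 6.7, "latency_ms": 3492.08},
}


def in_flight(qps, latency_ms):
    """Little's law: average number of requests in flight."""
    return qps * latency_ms / 1000.0


concurrency = {name: in_flight(**run) for name, run in RUNS.items()}
qps_speedup = RUNS["U200"]["qps"] / RUNS["CPU"]["qps"]
latency_ratio = RUNS["U200"]["latency_ms"] / RUNS["CPU"]["latency_ms"]

for name, value in concurrency.items():
    print(f"{name}: about {value:.1f} requests in flight (expected near {THREADS})")
print(f"QPS speedup: {qps_speedup:.1f}x, latency ratio: {latency_ratio:.2f}x")
```

Both configurations come out at roughly 23 to 24 requests in flight, which is consistent with the 24-thread label, and the two ratios reproduce the 4.2x and 0.24x figures quoted on the slide.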
Product 4: JPEG to HEIF
Problems solved: HEIC encoding on CPU consumes a vast amount of computational resources and takes a long time, leading to poor customer experience.
Solution (CIP): JPEG Decode -> Resize -> HEIF Encode, on FPGA
End-user: Smartphone/Pad, Camera, PC
App: Cloud album, Social network, News app, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios: Internet apps with HEIF
E-commerce platform
Social network with pictures
Web portals
News application
Values:
TCO reduction: Reduce image-processing servers; OPEX reduction
Improve customer experience: Accelerate the image rendering speed
JPEG to HEIF performance

Params               JPG         FPGA          CPU
Images               100         100           100
Total Size (Bytes)   173738658   135858024     133290287
Compression Ratio    -           78.2%         76.72%
Param                -           -slice_qp 5   crf=12 (10179.74 kb/s)
FPS                  -           200.12        20.02
PSNR                 -           54.51         57.06
VMAF                 -           97.25         97.31

QPS: FPGA*1 is 10 times that of CPU
Latency: FPGA*1 is 0.1 times that of CPU
Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2680v4; RAM: 128GB; OS: CentOS Linux release 7.2.1511; Kernel version: 3.10.0-862.el7.x86_64
Input: Resolution: 1920*1080; VQ: preset=medium; Total files: 100
Output: Resolution: 1920*1080
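The "Compression Ratio" row in this table reads as output size divided by the original JPEG size. A minimal sketch that recomputes it, along with the FPS ratio, from the table's own numbers (the variable names are hypothetical):

```python
# Values copied from the table on this slide.
JPEG_BYTES = 173738658
OUTPUT_BYTES = {"FPGA": 135858024, "CPU": 133290287}
FPS = {"FPGA": 200.12, "CPU": 20.02}

# Output size as a percentage of the original JPEG size.
size_ratio_percent = {
    name: round(size / JPEG_BYTES * 100, 2) for name, size in OUTPUT_BYTES.items()
}
fps_speedup = round(FPS["FPGA"] / FPS["CPU"], 1)

for name, pct in size_ratio_percent.items():
    print(f"{name}: HEIF output is {pct}% of the JPEG input size")
print(f"FPS: FPGA is {fps_speedup}x CPU")
```

The recomputed ratios match the 78.2% and 76.72% shown, and the FPS ratio matches the quoted 10x. Note that the two encoders were run with different quality parameters and reach slightly different PSNR, so the size comparison is approximate rather than at equal quality.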
Features of CIP
High performance: 5-10 times the throughput of CPU; 3 times lower latency than CPU
Low power: 20W-40W per FPGA card; TCO reduction: improves computing density of the data center and reduces racks
Software compatible: Supports ImageMagick, OpenCV, GraphicsMagick, Lepton; allows seamless migration from software-based implementations to CIP
Ease of maintenance: Remote upgrading; PR (Partial Reconfiguration) technology allows fast and easy context switching of accelerator functionality without rebooting the server
Summary of CIP’s values
Case study 1
Scenario: Image thumbnail generation for cloud album
Customer: Leading mobile phone manufacturer in China
Solution: JPEG decode -> Resize, on CIP (Xilinx FPGA)
Feedback from customer:
· CIP has been deployed in the production environment for more than 16 months
· Maintaining stable operating status
Advantages (High performance, TCO reduction, Ease of maintenance):
Reduce TCO by 25%
3x latency reduction
Deterministic timing
Smaller cluster of servers
Reduce the total number of servers in the cluster by 50%
Improve the throughput by 50%
Better QoS
Case study 2
Scenario: Transcoding from JPEG to WebP
Customer: Famous video portal website in China
Solution: JPEG decode -> Resize -> WebP encode, on CIP
Feedback from customer:
· CIP has passed the customer's test
· Test period: 2 months
· CIP will be deployed in the production environment in 2018 Q4
Advantages:
TCO reduction: Reduce TCO by at least 50%
Improve customer experience: Reduce latency by xx% (specific value not disclosed by the customer); deterministic timing: the same low latency at full load and no load
High performance: 1 server with one CIP accelerator has computing capability equivalent to 3 servers without CIP
Evolution planning
Accelerated image/video transcoding
- Image codec: JPEG, WebP, HEIF, Lepton, BPG, GIF, PNG
- Pixel processing: High-quality resizing, high-performance resizing, smart-crop, super-resolution
- Video codec: H.264, H.265, AV1, AVS3
Accelerated image/video analytics
- AI function products around picture and video content: content identification, face detection, pedestrian detection, etc.
- Functional services such as searching for pictures, searching for videos, etc.
Software as a Service
- Ultimately all CTAccel accelerated functions will be served by a unified SaaS framework
- Unified APIs across CSPs and on-premise
Products
Status legend: Ready to Market; Plan & Development
CIP (Image Codecs): JPEG, JPEG to WebP, JPEG to Lepton, JPEG to HEIF, Resize/Crop/Rotate, Smart-crop, Super-resolution, PNG, GIF, AVIF
CVP (Video Codecs): MPSoC - H.264, MPSoC - H.265, AV1, AVS3
Video/Image Analytics: Scan for pornographic content, Video game recognition, AI-assisted security inspection
Hardware platform support: Alveo U200, Alveo U50, Kintex® UltraScale™ 115, Zynq MPSoC
Cloud platform support: Partners
Product line of CTAccel
Image, Video, AI
Higher performance density
Improve quality of service
TCO reduction
High-quality FPGA-based accelerated computing solution provider
Xilinx MPSoC with VCU - H.264/H.265
Problems solved: Accelerate the speed of video processing and improve customer experience.
Solution (Xilinx MPSoC with VCU): H.264/H.265 Decode -> Resize -> H.264/H.265 Encode, on FPGA
End-user: Smartphone, PC, TV
App: Video portal website, Short video app, Live broadcast, ...
Key features:
FPS
PSNR
Scenarios: Applications which require video transcoding
Video portal websites
Short video apps
Values:
TCO reduction: Reduce processing servers; OPEX reduction
Improve customer experience
Performance of H.264 and H.265 (MPSoC)

H.264 Encode (FPS)   1920*1080   1280*720   960*540   640*480
CVP                  307.2       449.94     643.87    578
X264 medium          118.96      227.13     377.57    516.21

H.265 Encode (FPS)   1920*1080   1280*720   960*540   640*480
CVP                  324.77      478.07     712.39    636
X265 medium          30.37       63.89      100.34    150.97

Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7
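To see how the hardware advantage varies with resolution, the per-resolution FPS ratio of CVP to the software encoder can be computed from the chart values. This is a sketch only: the values are as read from the slide's two charts, and the assignment of each value to a series and resolution is an assumption based on the order in which they appear.

```python
RESOLUTIONS = ["1920*1080", "1280*720", "960*540", "640*480"]

# FPS values as read from the charts; "software" is x264/x265 at preset medium.
FPS = {
    "H.264": {
        "CVP": [307.2, 449.94, 643.87, 578],
        "software": [118.96, 227.13, 377.57, 516.21],
    },
    "H.265": {
        "CVP": [324.77, 478.07, 712.39, 636],
        "software": [30.37, 63.89, 100.34, 150.97],
    },
}

# Ratio of CVP FPS to software FPS at each resolution.
speedup = {
    codec: [round(hw / sw, 2) for hw, sw in zip(v["CVP"], v["software"])]
    for codec, v in FPS.items()
}

for codec, ratios in speedup.items():
    for res, ratio in zip(RESOLUTIONS, ratios):
        print(f"{codec} {res}: CVP is {ratio}x the software encoder")
```

Under these assumptions the gap is widest at 1080p and narrows as the resolution drops, and it is considerably larger for H.265 than for H.264.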
Performance of H.264 and H.265 (MPSoC)
Input: Crowdrun 1080p, H.264; Output: 720p, 540p, 480p

FPS                   CVP H.264   CVP H.265   X264 medium   X265 medium
Server: 2*E5-2630v2   183.22      200.42      85.89         28.3
Server: 2*E5-2680v4   200.68      236.41      236.03        91.09

Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2 & 2*Intel(R) Xeon(R) CPU E5-2680v4; RAM: 128GB; OS: CentOS Linux release 7
Quality of H.264 and H.265 (MPSoC)
[Charts: H.265 VMAF, H.265 PSNR, H.264 VMAF, and H.264 PSNR plotted against Bitrate (Mbps); series: CVP H.264/H.265, X264/X265 Medium, NVP4 H.264/H.265 Medium]
Input: High-dynamic, high-complexity video (Crowdrun, park_joy)
Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2630v2; RAM: 128GB; OS: CentOS Linux release 7
I-Frame 2 JPEG
Problems solved: Accelerate the speed of "I-Frame 2 JPEG" and improve customer experience.
Solution (CIP + MPSoC): H.264/H.265 Decode -> Resize -> JPEG Encode, on FPGA
End-user: Smartphone, PC, TV
App: Video portal website, Short video app, Live broadcast, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios: Applications which require "I-Frame 2 JPEG"
Video portal websites
Short video apps
Values:
TCO reduction: Reduce processing servers; OPEX reduction
Improve customer experience
AI accelerator: face detection/recognition
Problems solved: Improve the speed of face detection/recognition.
Solution: Accelerate face detection/recognition, improve customer experience, and reduce TCO: Image Decode (FPGA) -> Resize (FPGA) -> Face recognition (GPU/FPGA)
End-user: Smartphone, Pad, PC
App: Safe City, Smart City, Intelligent Transportation, ...
Key features:
High Throughput
Low Latency
Low CPU Utilization
Scenarios:
Smart City
Intelligent Transportation
Safe City
Security system
Values:
TCO reduction: Reduce image-processing servers; OPEX reduction
Improve customer experience
AI accelerator: test scenarios
Setup: FPGA #1, FPGA #2, FPGA #3; memory caching; GPU computing matrix (5 GPUs)
Test-1: Alexnet model
Test-2: nsfw model
Test-3: Inception V4 model
Test-4: ResNet-50 model
Test environment: CPU: 2*Intel(R) Xeon(R) CPU E5-2690v4 x 2; RAM: 128GB; OS: CentOS Linux release 7.4; Kernel version: 3.10.0-514.2.2.el7.x86_64; Python version: 2.7
Test data:

Image type      Resolution   Quantity
8 Mega Pixel    2647x3278    14708
16 Mega Pixel   5312x2988    9280
1080p           1920x1080    21400
Test result 1: Alexnet model
QPS: FPGA+GPU is 2.5 times faster than CPU+GPU
Latency: FPGA+GPU is 30% of CPU+GPU
CPU usage: FPGA+GPU is 20% of CPU+GPU
[Charts: Throughput (MB/s), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU]
Test result 2: nsfw model
QPS: FPGA+GPU is 3 times faster than CPU+GPU
Latency: FPGA+GPU is 30% of CPU+GPU
CPU usage: FPGA+GPU is 20% of CPU+GPU
[Charts: Throughput (MB/s), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU]
Test result 3: Inception V4 model
QPS: FPGA+GPU is 2 times faster than CPU+GPU
Latency: FPGA+GPU is 40% of CPU+GPU
CPU usage: FPGA+GPU is 30% of CPU+GPU
[Charts: Throughput (MB/s), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU]
Test result 4: ResNet-50 model
QPS: FPGA+GPU is 3 times faster than CPU+GPU
Latency: FPGA+GPU is 30% of CPU+GPU
CPU usage: FPGA+GPU is 20% of CPU+GPU
[Charts: Throughput (MB/s), Latency (ms), and CPU utilization (%) for 8M Pixels, 16M Pixels, and 1080p inputs; series: 3 FPGA + 5 GPU vs. CPU + 5 GPU]
Test result 5: Accuracy

Model         Category     Accuracy (top1)   Accuracy (top5)
Alexnet       TensorFlow   0.49              0.74
Alexnet       accel        0.49              0.73
Inceptionv4   TensorFlow   0.80              0.95
Inceptionv4   accel        0.79              0.95
ResNet50      TensorFlow   0.73              0.91
ResNet50      accel        0.72              0.90
nsfw          TensorFlow   0.75              N/A
nsfw          accel        0.75              N/A
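The accuracy table can be condensed into a per-model difference between the TensorFlow reference and the accelerated ("accel") run. A small sketch using only the table's values; the helper name `drop` is hypothetical, and N/A entries are carried through as `None`.

```python
# (model, top-1 TensorFlow, top-1 accel, top-5 TensorFlow, top-5 accel)
# None marks the N/A entries in the table.
ACCURACY = [
    ("Alexnet", 0.49, 0.49, 0.74, 0.73),
    ("Inceptionv4", 0.80, 0.79, 0.95, 0.95),
    ("ResNet50", 0.73, 0.72, 0.91, 0.90),
    ("nsfw", 0.75, 0.75, None, None),
]


def drop(reference, accelerated):
    """Accuracy lost by the accelerated run, or None if not reported."""
    if reference is None or accelerated is None:
        return None
    return round(reference - accelerated, 2)


drops = {
    model: (drop(t1_ref, t1_acc), drop(t5_ref, t5_acc))
    for model, t1_ref, t1_acc, t5_ref, t5_acc in ACCURACY
}
max_drop = max(d for pair in drops.values() for d in pair if d is not None)

for model, (top1, top5) in drops.items():
    print(f"{model}: top-1 drop {top1}, top-5 drop {top5}")
print(f"largest drop: {max_drop}")
```

Across all reported entries the accelerated run stays within 0.01 of the TensorFlow reference.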
For any product-related enquiries:
Product website: www.ct-accel.com
Product hotline: +86-0755-88914045
Product enquiry email: [email protected]
Cloud partners: AWS, HUAWEI Cloud, Alibaba Cloud, Baidu Cloud, Tencent Cloud