Qualcomm Technologies, Inc. All Rights Reserved
Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications
Lucian Codrescu, Sr. Director, Technology, Qualcomm Technologies, Inc.

Hexagon™ DSP processors in Snapdragon products
• aDSP: Real-time media & sensor processing
• mDSP: Dedicated modem processing
[Block diagram: Snapdragon 800. Blocks shown: four Krait CPUs with a 2MB L2, Adreno GPU, Hexagon aDSP, Hexagon mDSP, Modem, Audio, Sensors, Camera, Display, JPEG, Video, Connectivity, Misc., and Other, joined by the Multimedia Fabric, the System Fabric, and the Fabric & Memory Controller, with LPDDR3 memory.]

Expansion of Hexagon DSP use cases beyond audio
Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features
• Voice, audio: Hexagon V2/V3
• Image enhancement (camera, still, video): Hexagon V4 based products
• Computer vision & augmented reality: Hexagon V4 based products
• Video: Hexagon V5 based products
• Sensors: Hexagon V5 based products

The Hexagon DSP evolution
Generational improvements in performance and power efficiency driven by both architecture and implementation
Version   Process   Date
V1        65nm      Oct 2006
V2        65nm      Dec 2007
V3M       45nm      June 2009
V3C       45nm      Aug 2009
V3L       45nm      Nov 2009
V4M       28nm      Dec 2010
V4C       28nm      Dec 2010
V4L       28nm      Apr 2011
V5A       28nm      Dec 2012
V5H       28nm      Dec 2012

Key characteristics of modem & multimedia applications
Requirements
• Require fixed real-time performance level (fps, Mbit/sec, etc.)
• Extremely aggressive power & area targets
Characteristics
• Mix of signal processing & control code
− For modem, Qualcomm does not use a split CPU/DSP architecture; all processing is done on the Hexagon DSP
− Multimedia apps have significant control code in the RTOS & frameworks
• Heavy L2$ misses
− Multimedia is data intensive
− Modem is code intensive

Hexagon DSP blends features targeted to modem & multimedia
VLIW
• Need multi-issue to meet performance
• Low complexity for area & power
Innovate in ISA to maximize IPC
• More work/VLIW packet reduces energy/instruction
• Keep the pipelines full for MIPS/mm2
• Target both signal processing & control code
Multi-Threading
• To reduce L2$ miss penalty without the need for a large L2
• Increases instructions/VLIW packet because compiler doesn’t need to schedule latency

VLIW: Area & power efficient multi-issue
Instruction Unit
• Variable sized instruction packets (1 to 4 instructions per packet)
Data Units (Load/Store/ALU), two
• Dual 64-bit load/store units
• Also 32-bit ALU
Execution Units (64-bit Vector), two
• Dual 64-bit execution units
• Standard 8/16/32/64bit data types
• SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
• Up to 8 16b MAC/cycle
• 2 SP FMA/cycle
Register File (one per thread)
• Unified 32x32bit General Register File is best for compiler.
• No separate Address or Accum Regs
• Per-Thread
[Block diagram: the Instruction Unit, fed by the Instruction Cache, issues to the two Data Units and two Execution Units; the Data Units connect to the Data Cache, which is backed by the L2 Cache / TCM and Device DDR Memory.]
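The "up to 8 16b MAC/cycle" figure is consistent with two 64-bit vector execution units each holding four 16-bit lanes (2 x 64 / 16 = 8). The sketch below works through that arithmetic in plain C. It is a scalar reference under that assumption, not Hexagon intrinsics; the function and constant names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Figures taken from the slide: two 64-bit execution units, 16-bit operands. */
enum { EXEC_UNITS = 2, UNIT_WIDTH_BITS = 64, LANE_BITS = 16 };

/* 16-bit lanes available per cycle across both units: 2 * (64 / 16) = 8. */
int macs_per_cycle(void) {
    return EXEC_UNITS * (UNIT_WIDTH_BITS / LANE_BITS);
}

/* Scalar reference dot product of 16-bit samples with a 64-bit accumulator. */
int64_t dot16(const int16_t *a, const int16_t *b, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Lower bound on cycles for n MACs if every cycle retires the peak rate. */
size_t min_cycles(size_t n) {
    size_t peak = (size_t)macs_per_cycle();
    assert(peak > 0);
    return (n + peak - 1) / peak;
}
```

The bound from `min_cycles` is optimistic: it ignores load bandwidth, loop setup, and dependencies, all of which the real packet schedule has to accommodate.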

Maximizing the signal processing code work/packet
Example from inner loop of FFT: Executing 29 “simple RISC ops” in 1 cycle
{ R17:16 = MEMD(R0++M1)
  MEMD(R6++M1) = R25:24
  R20 = CMPY(R20, R8):<<1:rnd:sat
  R11:10 = VADDH(R11:10, R13:12)
}:endloop0
• MEMD load and MEMD store: 64-bit load and 64-bit store with post-update addressing
• CMPY: complex multiply with round and saturation
• VADDH: vector 4x16-bit add
• :endloop0: zero-overhead loops (dec count, compare, jump top)
[Figure: CMPY dataflow. The imaginary (I) and real (R) halves of Rs and Rt feed four multipliers with 32-bit products, each followed by a left shift of 0 or 1. The products are combined in two adders together with a 0x8000 rounding constant, each sum is saturated to 32 bits (Sat_32), and the high 16 bits of each sum become the I and R halves of Rd.]
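To make the arithmetic inside this packet concrete, here is a scalar C model of the complex multiply with `:<<1:rnd:sat` and of the 4x16-bit vector add. It is a sketch based on the dataflow in the slide (multiply, shift left by 1, add 0x8000, saturate to 32 bits, keep the high 16 bits). The type `cq15`, the function names, and the assumption that the non-saturating vector add wraps on overflow are illustrative, not taken from the Hexagon documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q15 complex type: 16-bit real and imaginary parts. */
typedef struct { int16_t re, im; } cq15;

/* Clamp a 64-bit intermediate to the signed 32-bit range (Sat_32). */
int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* High 16 bits of a 32-bit value, without right-shifting a negative number. */
int16_t high16(int32_t x) {
    int32_t low = x & 0xFFFF;              /* low half, always 0..65535 */
    return (int16_t)((x - low) / 65536);   /* exact division */
}

/* One complex multiply with <<1 scaling, rounding, and saturation. */
cq15 cmpy_s1_rnd_sat(cq15 a, cq15 b) {
    int64_t rr = (int64_t)a.re * b.re;
    int64_t ii = (int64_t)a.im * b.im;
    int64_t ri = (int64_t)a.re * b.im;
    int64_t ir = (int64_t)a.im * b.re;
    cq15 out;
    out.re = high16(sat32((rr - ii) * 2 + 0x8000));
    out.im = high16(sat32((ri + ir) * 2 + 0x8000));
    return out;
}

/* Four independent 16-bit adds, assumed here to wrap on overflow. */
void vaddh4(const int16_t x[4], const int16_t y[4], int16_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (int16_t)(uint16_t)((uint16_t)x[i] + (uint16_t)y[i]);
    }
}
```

Written this way, a single complex multiply already expands into four multiplies, shifts, adds, rounding, saturation, and extraction steps, which is why one packet can stand in for so many simple RISC operations.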

Maximizing the control code work/packet
Hexagon DSP ISA improves control code efficiency over traditional VLIW
Example C code
void example(int *ptr, int val) {
    if (ptr != 0) {
        *ptr = *ptr + val + 2;
    }
}
Traditional VLIW assembly code (5 packets)
p0 = cmp.eq(r0,#0)
{
  if (!p0) r2=memw(r0)
  if (p0) jumpr:nt r31
}
r2 = add(r2,#2)
r1 = add(r1,r2)
{
  memw(r0) = r1
  jumpr r31
}
Instr/Packet = 7 instr/5 packets = 1.4
Hexagon DSP: Dot-New Predication (4 packets)
{
  p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31
}
r2 = add(r2,#2)
r1 = add(r1,r2)
{
  memw(r0) = r1
  jumpr r31
}
Hexagon DSP: Compound ALU (3 packets)
{
  p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31
}
r1 = add(r1,add(r2,#2))
{
  memw(r0) = r1
  jumpr r31
}
Hexagon DSP: New-Value Store (2 packets)
{
  p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31
}
{
  r1 = add(r1,add(r2,#2))
  memw(r0) = r1.new
  jumpr r31
}
Instr/Packet = 7 instr/2 packets = 3.5
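The instructions-per-packet figures can be checked by counting the instructions in each packet of the four listings. The short C sketch below does that, assuming (as the deck states for its measurements) that a compound instruction such as `add(r1,add(r2,#2))` counts as 2. The array and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Instructions per packet for each listing; a compound ALU op counts as 2. */
const int traditional_vliw[] = {1, 2, 1, 1, 2};   /* 5 packets */
const int dot_new_predication[] = {3, 1, 1, 2};   /* 4 packets */
const int compound_alu[] = {3, 2, 2};             /* 3 packets */
const int new_value_store[] = {3, 4};             /* 2 packets */

/* Total instruction count over a list of packet sizes. */
int total_instructions(const int *packet_sizes, size_t packets) {
    int total = 0;
    for (size_t i = 0; i < packets; i++) {
        total += packet_sizes[i];
    }
    return total;
}

/* Average instructions per packet. */
double instr_per_packet(const int *packet_sizes, size_t packets) {
    assert(packets > 0);
    return (double)total_instructions(packet_sizes, packets) / (double)packets;
}
```

All four listings total 7 instructions, so the improvement from 1.4 to 3.5 comes entirely from packing the same work into fewer packets.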

High avg. instructions/packet for targeted use cases
Compound instructions count as 2
[Bar chart: average instructions per VLIW packet (y-axis, 0 to 5) for Audio, Computer Vision, Video, Imaging, and Control workloads.]
Source: Qualcomm internal measurements

Programmer’s view of Hexagon DSP HW multi-threading
• Hexagon V5 includes three hardware threads
• Architected to look like a multi-core with communication through shared memory
[Figure: Thread 0, Thread 1, and Thread 2, each shown with its own register file, two data units (DU), and two execution units (XU), all sharing the instruction cache, the data cache, and the L2 cache / TCM.]

Hexagon DSP V1-V4: Interleaved multi-threading
Simple round-robin thread scheduling
• Number of threads matches execution pipe depth (three threads, three execute stages)
• All instructions complete before next packet dispatch
• Compiler schedules for zero latency, which helps to increase instructions/VLIW packet
Pipeline timeline:
Thread 0 Dispatch   T0: { Ld Ld Add Cmp }
Thread 1 Dispatch   T1: { St Ld Mpy Add }   T0: { Ld Ld Add Cmp }
Thread 2 Dispatch   T2: { Ld Add Jump }     T1: { St Ld Mpy Add }   T0: { Ld Ld Add Cmp }
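The round-robin rule can be modeled in a few lines of C. This is a simplified sketch that assumes one packet dispatch per cycle and a packet that occupies exactly three execute stages. The names are illustrative and the model ignores stalls.

```c
#include <assert.h>

enum { NUM_THREADS = 3, EXEC_STAGES = 3 };

/* Interleaved multi-threading: cycle c always dispatches thread c mod 3. */
int imt_dispatch_thread(unsigned cycle) {
    return (int)(cycle % NUM_THREADS);
}

/* Cycle at which a packet dispatched at dispatch_cycle leaves the last execute stage. */
unsigned packet_done_cycle(unsigned dispatch_cycle) {
    return dispatch_cycle + EXEC_STAGES;
}

/* Cycle at which the same thread dispatches its next packet. */
unsigned next_dispatch_cycle(unsigned dispatch_cycle) {
    return dispatch_cycle + NUM_THREADS;
}

/* True when a thread's previous packet is finished before its next one starts. */
int sees_zero_latency(unsigned dispatch_cycle) {
    return packet_done_cycle(dispatch_cycle) <= next_dispatch_cycle(dispatch_cycle);
}
```

Because the thread count equals the execute depth in this model, `sees_zero_latency` holds for every dispatch, which is the property that lets the compiler schedule each thread as if results were available immediately.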

Hexagon DSP V5: Dynamic HW multi-threading
Recover some performance when threads idle or stalled
• Remove a thread from IMT rotation
− On L2 cache misses
− When in wait-for-interrupt or off mode
• Additional forwarding to support 2-cycle packets
• VLIW packets with dependencies between long latency instructions will stall
− But many VLIW packets with simple instructions can complete in 2 processor clocks
[Bar charts: Dhrystone DMIPS/MHz (y-axis, 0 to 5) and Coremarks/MHz (y-axis, 0 to 8), each comparing IMT and DMT.]
Source: Qualcomm internal measurements
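The idea of removing a stalled or idle thread from the rotation can be sketched as a round-robin pick that skips threads that are not ready. The C below is a minimal model under stated assumptions: the `thread_state` fields and function names are hypothetical, and the sketch does not model the 2-cycle packet timing or the extra forwarding paths.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_THREADS = 3 };

/* Hypothetical per-thread state for this sketch. */
typedef struct {
    bool l2_miss_pending;   /* waiting on an L2 cache miss */
    bool waiting_or_off;    /* wait-for-interrupt or off mode */
} thread_state;

/* A thread stays in the rotation only when neither condition holds. */
bool thread_ready(const thread_state *t) {
    return !t->l2_miss_pending && !t->waiting_or_off;
}

/* Pick the next ready thread after `last`, in round-robin order.
   Returns -1 when no thread can dispatch this cycle. */
int dmt_next_thread(const thread_state threads[NUM_THREADS], int last) {
    for (int step = 1; step <= NUM_THREADS; step++) {
        int candidate = (last + step) % NUM_THREADS;
        if (thread_ready(&threads[candidate])) {
            return candidate;
        }
    }
    return -1;
}
```

With all three threads ready this reduces to the fixed rotation of interleaved multi-threading; with fewer ready threads, the remaining ones are picked more often, which is where the recovered performance comes from.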

Hexagon DSP instructions per cycle
[Bar chart: average instructions per cycle (y-axis, 0 to 5), IPC_DMT versus IPC_IMT, for single-threaded apps and multi-threaded apps.]
Source: Qualcomm internal measurements

Hexagon DSP V5: Efficient Architecture
Highly efficient mobile application processor, designed for more performance per MHz
                                 Mobile Competitor   Qualcomm Hexagon V5 (1 thread)   Qualcomm Hexagon V5 (3 threads)
Clock Rate (MHz)                 430-520             100-267                          300-800
DSP Performance (BDTImark2000)   4730-5720           1810-4840                        5430-14520*
* - Projected best case score for 3-threads
[Bar chart: DSP performance per MHz (BDTImark2000™/MHz, y-axis 0 to 20) for Mobile Competitor, Qualcomm Hexagon V5 (1 thread), and Qualcomm Hexagon V5 (3 threads).]
Source: BDTI - For more detailed information see www.BDTI.com. All scores ©2013 BDTI
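As a worked example, the C sketch below divides each listed BDTImark2000 score by the clock rate listed for the same device, at both ends of each range. The pairing of the three number ranges with the three device labels follows their order in the slide text and is an assumption, as are the struct and function names. Whether the chart normalizes its bars in exactly this way cannot be recovered from the extracted text.

```c
#include <assert.h>
#include <stdio.h>

/* One device: clock range in MHz and the matching score range. */
typedef struct {
    const char *name;
    double mhz_lo, mhz_hi;
    double score_lo, score_hi;
} bdti_row;

/* Values copied from the slide; the name-to-range pairing is assumed. */
const bdti_row rows[3] = {
    {"Mobile Competitor",     430.0, 520.0, 4730.0,  5720.0},
    {"Hexagon V5, 1 thread",  100.0, 267.0, 1810.0,  4840.0},
    {"Hexagon V5, 3 threads", 300.0, 800.0, 5430.0, 14520.0},
};

/* Score divided by clock rate in MHz. */
double score_per_mhz(double score, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return score / clock_mhz;
}

/* Print the ratio at the low and high end of each range. */
void print_ratios(void) {
    for (int i = 0; i < 3; i++) {
        printf("%-22s %.1f to %.1f per MHz\n", rows[i].name,
               score_per_mhz(rows[i].score_lo, rows[i].mhz_lo),
               score_per_mhz(rows[i].score_hi, rows[i].mhz_hi));
    }
}
```

With these numbers the first device works out to about 11.0 per MHz at both ends of its range, and both Hexagon entries to about 18.1, because the three-thread clock range is roughly three times the single-thread range and the projected score scales the same way.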

Hexagon DSP Power Benefits

MP3 playback power for competitive smartphones
• Power measured at the battery for various phones
• Includes everything: DSP, CPU, memory, analog components, etc.
[Bar chart: power (lower is better) for Competitor A, the Qualcomm/Hexagon-based phone, and Competitors B through G.]
Source: Qualcomm internal measurements

Computer vision offload – ARM/neon to Hexagon DSP
Augmented Reality Java app finding objects in image using FastCV Feature Detect
Comparison of Feature Detect run on:
• App CPU (ARM/Neon)
• App DSP (Hexagon)
Hexagon DSP compared with App CPU: 32% less power*, 7% less time, 52% less CPU
[Figure: the Augmented Reality Java application calls Feature Detect through the FastCV call router, which dispatches to the Feature Detect function in one of three FastCV libraries: ARM only (App CPU), ARM/VeNum (App CPU, VeNum), or Hexagon (QDSP6) (App DSP).]
[Bar chart: detection time (%), total device power (%), and CPU utilization (%) for the two configurations.]
Source: Qualcomm internal measurements. * Power measured at the device battery
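The call-router arrangement in the figure, where the application makes one call and a router chooses the library that runs it, can be sketched with a function-pointer table. Everything in the C below is hypothetical: the back-end names, the function signature, and the stand-in detector are invented for illustration and are not the FastCV API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical back ends, mirroring the three libraries in the figure. */
typedef enum { BACKEND_ARM_ONLY, BACKEND_ARM_VENUM, BACKEND_HEXAGON, BACKEND_COUNT } backend;

/* Hypothetical signature: returns the number of features found in an image. */
typedef int (*feature_detect_fn)(const unsigned char *image, size_t pixels);

/* Stand-in detector: counts pixels above a threshold. A real library would
   run an actual feature detector tuned for its processor. */
static int count_bright_pixels(const unsigned char *image, size_t pixels) {
    int found = 0;
    for (size_t i = 0; i < pixels; i++) {
        if (image[i] > 200) found++;
    }
    return found;
}

/* In this sketch all three back ends share the stand-in implementation. */
static const feature_detect_fn router_table[BACKEND_COUNT] = {
    count_bright_pixels,   /* BACKEND_ARM_ONLY */
    count_bright_pixels,   /* BACKEND_ARM_VENUM */
    count_bright_pixels,   /* BACKEND_HEXAGON */
};

/* Single entry point for the application; the router picks the back end. */
int route_feature_detect(backend target, const unsigned char *image, size_t pixels) {
    assert(target >= 0 && target < BACKEND_COUNT);
    return router_table[target](image, pixels);
}
```

The point of the design is that the application code does not change when the work moves from the CPU to the DSP; only the router's choice of target does.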

Hexagon DSP power for different thread utilizations
• Excellent near-linear power scalability (as threads go idle, power used by the thread is nearly eliminated)
• Achieved through optimized clock tree design & clock gating
[Charts: Dhrystone power, IMT mode, and FIR power, IMT mode; each compares actual with ideal power on a 0% to 100% axis.]
Source: Qualcomm internal measurements

Hexagon DSP Software Development

Independent Algorithm Developers on Hexagon DSP
Converging markets require converged feature sets

Announcing the Hexagon DSP SDK
See the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)
Visit http://developer.qualcomm.com for more information.

Thank you
For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog
©2013 Qualcomm Technologies, Inc. Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other product and brand names may be trademarks or registered trademarks of their respective owners. Hexagon is a product of Qualcomm Technologies, Inc.