Multi-core SOC for Future Media Processing
Qin Xing, Yan Xiaolang
The Institute of VLSI Design, Zhejiang University
The Institute of VLSI Design, Zhejiang Univ. 223/4/19
Outline
Opportunities & challenges from media processing
Multimedia algorithm characteristics & mapping
Multi-core SOC architecture & technology
Benchmarking results
Project status
Future work
Opportunities
Video conference
IP-phone
Smart terminal
PDA
Video camera
HDTV
Set-top box
…
Challenges: multiple standards
[Figure: bit rate (Mbit/s, axis 0 to 6) against year (1994 to 2005). Encoder labels: 1st MPEG-2 Encoder, 2nd, 3rd, 4th and 5th Generation Encoder. Standards shown: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 part 10, AVS, WMV, VP3.]
Challenges: excellent hardware
Very high computation complexity
  H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
Multiple standards co-exist
  Demands of flexibility & programmability
Low power
Low cost
Best choice: Application-Specific Instruction-set Processor (ASIP)
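To put the 30 GOPS figure above in per-pixel terms, the short sketch below divides an operation rate by the pixel rate of the stated format. The function names are illustrative only, and GOPS is taken here to mean 10^9 operations per second.

```c
#include <stdint.h>

/* Pixels that must be processed per second for a given frame size and rate. */
uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}

/* Average operation budget per pixel for a given throughput (operations/s). */
double ops_per_pixel(double ops_per_second,
                     uint32_t width, uint32_t height, uint32_t fps)
{
    return ops_per_second / (double)pixels_per_second(width, height, fps);
}
```

For 720 x 576 at 30 frames/s the pixel rate is 12,441,600 pixels/s, so `ops_per_pixel(30e9, 720, 576, 30)` comes out at roughly 2,400 operations per pixel.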
Multimedia algorithm characteristics
Outer loop and inner loop
Outer loop:
  Interface (GUI)
  OS (Linux)
  Bit-stream parsing (pack/unpack, VLC, CABAC)
  Data transferring
Inner loop: regular algorithms (prediction, FIR, DCT, motion estimation)
[Figure: outer loop: interface, operating system, data transferring, bitstream parsing; inner loop: filtering, 2D transform, block add/sub.]
Multimedia algorithm mapping
Programmable and heterogeneous processors are the preferred choice for the implementation
General MCU (RISC core): outer loop
Enhanced DSP (EDSP, + bit-wise operations): outer loop
Vector processor (VP, VLIW + SIMD): inner loop
[Figure: General MCU, Enhanced DSP, Vector Processor]
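The mapping above can be sketched as a small lookup from task to processor. All type and task names below are hypothetical. The split of outer-loop tasks between the MCU and the EDSP is an assumption made for illustration (bit-stream parsing is placed on the EDSP because of its bit-wise operations); the slides only state that both handle the outer loop.

```c
/* Hypothetical labels for the three processor types named on the slide. */
typedef enum { CORE_MCU, CORE_EDSP, CORE_VP } core_t;

/* Hypothetical task labels taken from the outer-loop / inner-loop lists. */
typedef enum {
    TASK_GUI,
    TASK_OS,
    TASK_DATA_TRANSFER,
    TASK_BITSTREAM_PARSING,
    TASK_PREDICTION,
    TASK_FIR,
    TASK_DCT,
    TASK_MOTION_ESTIMATION
} task_t;

/* Outer-loop tasks go to the MCU or EDSP, inner-loop tasks to the VP. */
core_t core_for_task(task_t task)
{
    switch (task) {
    case TASK_GUI:
    case TASK_OS:
    case TASK_DATA_TRANSFER:
        return CORE_MCU;            /* assumed: control-oriented outer loop */
    case TASK_BITSTREAM_PARSING:
        return CORE_EDSP;           /* assumed: bit-wise outer loop */
    case TASK_PREDICTION:
    case TASK_FIR:
    case TASK_DCT:
    case TASK_MOTION_ESTIMATION:
        return CORE_VP;             /* regular inner-loop kernels */
    }
    return CORE_MCU;                /* fallback for out-of-range values */
}
```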
Multi-core SOC architecture: top level
[Block diagram. Blocks shown: CK520 with IM, DM and an AHB master; media processing kernel with EDSP (DM, IM), DMA, VP (IM, DM) and an AHB master; three AHB slaves; two I/F blocks; Mem Ctrl; AMBA AHB and AMBA APB buses.]
Inside the media processing kernel
[Block diagram. Labels shown: E-DP; V-DP1 to V-DP4; GDM; V-DM1 to V-DM4; GTM; 2D crossbar connection network; GAG1 to GAG4; EDSP control path; vector control path; DMA and off-chip memories.]
Technologies: application-specific instruction set
for (j = 0; j < BLOCK_SIZE; j++) {
    for (i = 0; i < BLOCK_SIZE; i++) {
        m5[i] = img->cof[i0][j0][i][j];
    }
    m6[0] = (m5[0] + m5[2]);
    m6[1] = (m5[0] - m5[2]);
    m6[2] = (m5[1] >> 1) - m5[3];
    m6[3] = m5[1] + (m5[3] >> 1);
}
__asm {
    mov       edx, mptr
    movdqu    xmm1, [edx]
    packssdw  xmm1, xmm1        // read m5[0] from memory to xmm1
}
__asm {
    movdqu    xmm4, [edx + 48]
    packssdw  xmm4, xmm4        // read m5[3] from memory
}
__asm {
    movq      xmm5, xmm1
    psubw     xmm1, xmm3        // m6[1] = (m5[0] - m5[2]);
    paddw     xmm3, xmm5        // m6[0] = (m5[0] + m5[2]);
    movq      xmm5, xmm2
    psraw     xmm2, 1
    psubw     xmm2, xmm4        // m6[2] = (m5[1] >> 1) - m5[3]
    psraw     xmm4, 1
    paddw     xmm4, xmm5        // m6[3] = m5[1] + (m5[3] >> 1)
}
Integer IDCT in H.264
Intel MMX: 13 cycles
Our IS: 6 cycles
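For reference, the butterfly in the C fragment above is the first half of one 1-D pass of the H.264 4x4 integer inverse transform. The sketch below completes it into a scalar 2-D transform. It is a plain-C illustration with hypothetical function names, not the slides' instruction-set implementation, and it assumes arithmetic right shift for negative values, as on common compilers.

```c
#define BLOCK_SIZE 4

/* One 1-D pass: the m6[] butterfly from the slide, then the output stage. */
static void butterfly4(const int in[BLOCK_SIZE], int out[BLOCK_SIZE])
{
    int m6[BLOCK_SIZE];

    m6[0] = in[0] + in[2];
    m6[1] = in[0] - in[2];
    m6[2] = (in[1] >> 1) - in[3];
    m6[3] = in[1] + (in[3] >> 1);

    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}

/* 2-D inverse transform: rows, then columns, then rounding by (x + 32) >> 6. */
void idct4x4(int coef[BLOCK_SIZE][BLOCK_SIZE], int res[BLOCK_SIZE][BLOCK_SIZE])
{
    int tmp[BLOCK_SIZE][BLOCK_SIZE];
    int col_in[BLOCK_SIZE], col_out[BLOCK_SIZE];
    int r, c;

    for (r = 0; r < BLOCK_SIZE; r++)          /* horizontal pass */
        butterfly4(coef[r], tmp[r]);

    for (c = 0; c < BLOCK_SIZE; c++) {        /* vertical pass */
        for (r = 0; r < BLOCK_SIZE; r++)
            col_in[r] = tmp[r][c];
        butterfly4(col_in, col_out);
        for (r = 0; r < BLOCK_SIZE; r++)
            res[r][c] = (col_out[r] + 32) >> 6;
    }
}
```

Each pass uses only additions, subtractions and shifts by one, which is why the transform suits packed 16-bit SIMD instructions such as the `paddw`, `psubw` and `psraw` shown above.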
Technologies: instruction merging
result = 0;
pres_y = dy == 1 ? y_pos : y_pos + 1;
pres_y = max(0, min(maxold_y, pres_y));             // load
for (x = -2; x < 4; x++)                            // control
{
    pres_x = max(0, min(maxold_x, x_pos + x));      // load
    result += imY[pres_y][pres_x] * COEF[x + 2];    // computation, permutation and load
}
result1 = max(0, min(255, (result + 16) / 32));     // computation
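6-tap sub-pixel interpolation
[Figure: breakdown by operation type. Categories: Control, Computation, Permutation, Load/Store; shares: 30%, 25%, 35%, 10%.]
[Figure: after merging. Categories: Computation, Control, and Ld/St and Perm. merged.]
Merging the load/store and permutation instructions cuts the execution time in half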
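The interpolation loop above can be written as a self-contained function for experimentation. This is a minimal sketch under stated assumptions: the image is a flat row-major `unsigned char` buffer, the `dy` row selection is left out, the function name `interp6_h` is hypothetical, and `COEF` is filled with the standard H.264 half-sample taps (1, -5, 20, 20, -5, 1), which the slide does not list.

```c
/* Clamp v into [lo, hi]; replaces the max(0, min(...)) pattern. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Assumed filter taps: the standard H.264 6-tap half-sample filter. */
static const int COEF[6] = { 1, -5, 20, 20, -5, 1 };

/* Horizontal half-sample value between columns x_pos and x_pos + 1 of row y_pos. */
int interp6_h(const unsigned char *img, int width, int height,
              int y_pos, int x_pos)
{
    int maxold_x = width - 1;
    int maxold_y = height - 1;
    int pres_y = clamp_int(y_pos, 0, maxold_y);        /* load address */
    int result = 0;
    int x;

    for (x = -2; x < 4; x++) {                         /* control */
        int pres_x = clamp_int(x_pos + x, 0, maxold_x);
        result += img[pres_y * width + pres_x] * COEF[x + 2];
    }
    return clamp_int((result + 16) / 32, 0, 255);      /* round and clip */
}
```

The taps sum to 32, so a flat region is reproduced unchanged after the division by 32. Every tap needs a clamped address, a load and a multiply-accumulate, which matches the per-line load, control and computation labels in the fragment above.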
Benchmarking results for CPU core
[Bar chart with labels CK520 and MIPS; axis ticks from 0 to 700 in steps of 100.]
Simulation results for DSP performance
Enhanced DSP

CAVLC (context adaptive variable length coding)
Sequence (CIF)    MIPS/frame (max)    MIPS/frame (average)
Foreman           0.147,832           0.029,898
Mobile            0.541,943           0.134,240

OGG (new audio standard)
Function          MIPS/frame
MDCT              6
De_VQ             2.5
Floor/Coupling    3.5
Simulation results for DSP performance
Vector processor

H.264 baseline decoder (sequences of 298 frames), MIPS @ 30 frames/s
Format    Sequence    Max      Average
QCIF      Foreman     28.1     12.7
QCIF      Akiyo       19.8     5.3
CIF       Foreman     116.3    52.3
CIF       Akiyo       92.9     22.8
Project status
Finished 2 versions of CPU core
Released DSP instruction set
Writing and verifying RTL of the enhanced DSP
Benchmarking vector processor
Developing software tools
Future work
Scheduling for task-level parallelism (TLP) between heterogeneous processors
Simulation/debugging tools for heterogeneous processors
Methodologies for design space exploration