Connex Technology Proprietary and Confidential
The CA1024: A Massively Parallel Processor for Cost-Effective HDTV
Company Background
• Fabless semiconductor company in Silicon Valley
• VC funded (Series A & B)
• In the product-development stage with 26+ employees
  – Deep experience with video algorithms, processor design, and digital-video system software
• Core asset: ConnexArray™ vector-processor architecture
  – Architecture verified in the CA4096 test chip
• Six patent applications on Connex vector-processor technology
  – 1 US patent granted, 3 US patents pending, 2 US provisional
  – Granted and pending patents also filed in China, Taiwan, Korea, the EEC, Japan, and Singapore
• Initial market focus on DTV
Presentation Agenda
• Why a massively parallel processor (MPP)?
• How is MPP integrated in an SoC?
• Processor performance
• Project status
Challenges
• HDTV codec & post-processing are computationally intensive
• Computation is dominated by data-parallel processes
• HDTV is a fast-evolving domain
• ASICs are a very costly solution
Our Solution: Integral Parallel Machine
• Data-parallel computation
• Time-parallel computation (supported by speculative parallelism)
• I/O process is transparent to the computational process
Key Technology
• Fully programmable solution for HDTV video encoding, decoding, and transcoding at the system and algorithm levels
  – Simple programming model
• Silicon-efficient architecture; die size competitive with similar-function ASICs
  – Re-use of transistors
  – Minimal dedicated hard-wired blocks
• Sufficient performance to enable multistandard, multichannel, high-definition DTV
  – Linearly scalable
The Connex Architecture
[Figure: Connex Machine block diagram: a Sequencer and an I/O Controller driving a Connex Array of cells indexed 0 to n and 0 to m; each cell has a 16-bit ALU, registers R0–R7 (including Connex, Aux, and I/O), Select and Index, and a 16-bit RAM with addresses 0–255]
• CA1024-PVP: m = n = 32 for a 1,024-PE Connex Machine
• Test chip: m = n = 64 for a 4,096-PE Connex Array; sequencer and I/O control in an FPGA
• 3.2 GByte/sec I/O channel in parallel with code running on the Connex Array
Connex Cell Architecture
• PE (Processing Element) has eight accumulator registers, including Connex, Aux, and I/O special-function registers
• Select flag enables or disables instruction processing
• Index is a unique cell number used to direct certain instructions
• Bidirectional 16-bit bus to 256 RAM locations
• Connex register includes connections for shifts to/from adjacent PEs
• Aux and I/O registers dedicated to specific instruction functions
[Figure: Connex cell: 16-bit ALU; registers R0–R7 (including Connex, Aux, and I/O); Select flag; Index; RAM with addresses 0–255]
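As a rough software model of the cell described above, the C sketch below represents one Connex cell as a struct. The type and function names (`ConnexCell`, `cell_load`, `cell_add`) are hypothetical, the register numbering is arbitrary, and only the Select-flag gating is taken from the slide; this is not the Connex instruction set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CELL_NUM_REGS  8    /* R0..R7 */
#define CELL_RAM_WORDS 256  /* RAM addresses 0..255 */

/* Hypothetical model of one Connex cell (not the real hardware interface). */
typedef struct {
    int16_t reg[CELL_NUM_REGS];   /* eight 16-bit accumulator registers */
    int16_t ram[CELL_RAM_WORDS];  /* local RAM reached over a 16-bit bus */
    bool select;                  /* Select flag: false = skip instructions */
    uint16_t index;               /* unique cell number */
} ConnexCell;

/* reg[r] = ram[addr], executed only when the cell is selected.
   A uint8_t address can only name locations 0..255. */
static void cell_load(ConnexCell *c, int r, uint8_t addr) {
    assert(r >= 0 && r < CELL_NUM_REGS);
    if (c->select)
        c->reg[r] = c->ram[addr];
}

/* reg[dst] = reg[dst] + reg[src], executed only when the cell is selected.
   The result is truncated to 16 bits (wraps on common compilers). */
static void cell_add(ConnexCell *c, int dst, int src) {
    assert(dst >= 0 && dst < CELL_NUM_REGS);
    assert(src >= 0 && src < CELL_NUM_REGS);
    if (c->select)
        c->reg[dst] = (int16_t)(c->reg[dst] + c->reg[src]);
}
```

Clearing `select` turns every later call into a no-op for that cell, which is how the model mirrors "Select flag enables or disables instruction processing".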
ConnexArray Structure
• Replicated Connex cells, each including a PE and local RAM
• Linear interconnect of neighbor registers
• Conditional execution based on the state of the select bit or the index value
• All selected cells execute the same instruction stream
[Figure: cells 0, 1, …, 1023 linked in a line, each with a 16-bit ALU, registers R0–R7, and RAM addresses 0–255; each cell's Select flag is shown as On or Off]
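The two array-level behaviors on this slide (one instruction stream broadcast to all selected cells, and a linear neighbor interconnect) can be emulated sequentially in C. The names below are hypothetical, and the model tracks only each cell's Connex register and Select flag. Whether an unselected cell still passes its value to its neighbor is an assumption of this sketch, not something the slide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CELLS 1024

/* Hypothetical model: the Connex register and Select flag of every cell. */
typedef struct {
    int16_t connex[NUM_CELLS];
    bool select[NUM_CELLS];
} ConnexArrayModel;

/* One broadcast instruction: every selected cell adds k to its Connex
   register; unselected cells keep their value. */
static void broadcast_add(ConnexArrayModel *a, int16_t k) {
    for (int i = 0; i < NUM_CELLS; i++)
        if (a->select[i])
            a->connex[i] = (int16_t)(a->connex[i] + k);
}

/* Shift along the linear interconnect: each selected cell i takes the
   Connex value its left neighbour (cell i - 1) held before the shift;
   cell 0 takes `fill`. The snapshot makes all cells move at once.
   Assumption: an unselected cell still exposes its value to its neighbour. */
static void shift_right(ConnexArrayModel *a, int16_t fill) {
    int16_t old[NUM_CELLS];
    for (int i = 0; i < NUM_CELLS; i++)
        old[i] = a->connex[i];
    for (int i = 0; i < NUM_CELLS; i++)
        if (a->select[i])
            a->connex[i] = (i == 0) ? fill : old[i - 1];
}
```

The loops only stand in for what the hardware does in a single step across all 1,024 cells.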
Connex Data-Array Structure
[Figure: data array drawn as lines 0–255 (Line m) by elements 0–1023 (Element n), holding 16-bit data operands]
• 256 lines with 1024 16-bit elements per line
• 1GByte data I/O in parallel with computation operations
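Taken together with the cell slide (256 RAM words per cell, 1,024 cells), the line/element picture suggests that a "line" is one RAM address read across all cells. The C sketch below encodes that reading as an assumption; `data_array`, `element_at`, and `data_array_bytes` are hypothetical names. It also shows the storage arithmetic: 256 × 1024 × 2 bytes = 524,288 bytes (512 KiB).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINES    256   /* RAM words per cell */
#define ELEMENTS 1024  /* cells, i.e. 16-bit elements per line */

/* Assumed layout: element n of line m is RAM word m of cell n, so one
   line is the same RAM address read across all 1024 cells. */
static int16_t data_array[LINES][ELEMENTS];

/* Address of element `elem` in line `line` (bounds-checked). */
static int16_t *element_at(unsigned line, unsigned elem) {
    assert(line < LINES && elem < ELEMENTS);
    return &data_array[line][elem];
}

/* Total on-array storage: 256 * 1024 * 2 bytes = 524,288 bytes (512 KiB). */
static size_t data_array_bytes(void) {
    return (size_t)LINES * ELEMENTS * sizeof(int16_t);
}
```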
Full-Line Operations: Operate on All Elements in Parallel
[Figure: Line k = Line i OP Line j (+, -, *, XOR, etc.) across elements 0–1023; lines 0–255]
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
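The two forms of full-line operation can be written as element-wise loops in C. This is only a sequential emulation with hypothetical names (`LineOp`, `line_op`, `line_op_scalar`): on the Connex Array each call corresponds to a single operation over all 1,024 elements, and the sketch does not model how many cycles an operation such as `*` actually takes.

```c
#include <assert.h>
#include <stdint.h>

#define ELEMENTS 1024

typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_XOR } LineOp;

/* One element-level operation; results are truncated to 16 bits. */
static int16_t apply_op(LineOp op, int16_t a, int16_t b) {
    switch (op) {
    case OP_ADD: return (int16_t)(a + b);
    case OP_SUB: return (int16_t)(a - b);
    case OP_MUL: return (int16_t)(a * b);  /* low 16 bits of the product */
    case OP_XOR: return (int16_t)(a ^ b);
    }
    return 0;
}

/* Line k = Line i OP Line j, element by element. */
static void line_op(int16_t *k, const int16_t *i, const int16_t *j, LineOp op) {
    for (int n = 0; n < ELEMENTS; n++)
        k[n] = apply_op(op, i[n], j[n]);
}

/* Line k = Line i OP scalar, with the scalar repeated for every element. */
static void line_op_scalar(int16_t *k, const int16_t *i, int16_t s, LineOp op) {
    for (int n = 0; n < ELEMENTS; n++)
        k[n] = apply_op(op, i[n], s);
}
```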
Columns Active Based On Repeating Patterns
[Figure: Line k = Line i OP Line j (+, -, *, XOR, etc.) computed only in the columns selected by a repeating pattern; elements 0–1023, lines 0–255]
Example: mark all odd columns active, or every third column, or every third and fourth column, and so on.
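Since each cell carries a unique Index, a repeating column pattern can be expressed as a condition on that index. The C sketch below assumes that reading. `select_pattern` and `masked_add` are hypothetical helpers, and columns are counted from 0. A column is active when `n % period` equals one of the listed phases, and the subsequent line operation writes only the active columns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ELEMENTS 1024

/* Hypothetical helper: column n is active when n % period equals one of
   the given phases (the selection is derived from each cell's Index). */
static void select_pattern(bool sel[ELEMENTS], unsigned period,
                           const unsigned *phases, int nphases) {
    assert(period > 0);
    for (unsigned n = 0; n < ELEMENTS; n++) {
        sel[n] = false;
        for (int p = 0; p < nphases; p++)
            if (n % period == phases[p])
                sel[n] = true;
    }
}

/* Line k = Line i + Line j, written only in the active columns. */
static void masked_add(int16_t *k, const int16_t *i, const int16_t *j,
                       const bool sel[ELEMENTS]) {
    for (int n = 0; n < ELEMENTS; n++)
        if (sel[n])
            k[n] = (int16_t)(i[n] + j[n]);
}
```

With 0-based columns, "all odd columns" is period 2 with phase 1, and "every third column" (2, 5, 8, …) is period 3 with phase 2.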
Columns Active Based On Results of Previous Operations
[Figure: Line k = Line i OP Line j (+, -, *, XOR, etc.) computed only in data-selected columns; elements 0–1023, lines 0–255]
Example: columns that appear random are marked active based on the data-dependent results of previous operations. This enables selective processing based on data content.
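A minimal sketch of data-dependent selection, assuming the selection is simply the per-column result of a comparison. The helpers `select_greater` and `clamp_selected` are hypothetical, and clamping stands in for whatever selective processing an application needs. Step one turns the outcome of an operation into the new selection; step two touches only the columns it marked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ELEMENTS 1024

/* Step 1: a comparison produces one flag per column, and those flags
   become the new selection. Returns how many columns are active. */
static int select_greater(bool sel[ELEMENTS], const int16_t *line,
                          int16_t threshold) {
    int active = 0;
    for (int n = 0; n < ELEMENTS; n++) {
        sel[n] = line[n] > threshold;
        active += sel[n] ? 1 : 0;
    }
    return active;
}

/* Step 2: only the selected columns are processed (here: clamped). */
static void clamp_selected(int16_t *line, const bool sel[ELEMENTS],
                           int16_t value) {
    for (int n = 0; n < ELEMENTS; n++)
        if (sel[n])
            line[n] = value;
}
```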
Outer-Loop Parallelism: Program in the Context of 128+ Data-Structure Instances
[Figure: lines i through j by elements 0–1023, tiled with 8x8 blocks (row and column indices 0–7) placed side by side]
Example: 8x8 DCT
Example: 128 sets of 8x8 blocks run in parallel in a 1024-cell array
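One way to read the 8x8 example: 128 blocks sit side by side, 8 columns each (1024 / 8 = 128), so one line holds the same row of every block. The C sketch below assumes that layout (`column_of` and `dct_stage1_vertical` are hypothetical names) and shows the first butterfly stage of a vertical 8-point DCT, sum[r] = x[r] + x[7 − r] and diff[r] = x[r] − x[7 − r] for r = 0..3. Each pass of the inner loop emulates one full-line operation, so all 128 blocks advance together; the remaining DCT stages are not shown.

```c
#include <assert.h>
#include <stdint.h>

#define ELEMENTS   1024
#define BLOCK      8
#define NUM_BLOCKS (ELEMENTS / BLOCK)  /* 128 blocks side by side */

/* Assumed layout: pixel (row r, col c) of block b sits in line r,
   element 8*b + c, so each line holds one row of all 128 blocks. */
static int column_of(int block, int col) {
    assert(block >= 0 && block < NUM_BLOCKS && col >= 0 && col < BLOCK);
    return block * BLOCK + col;
}

/* First butterfly stage of a vertical 8-point DCT on all blocks at once:
   sum[r] = x[r] + x[7-r], diff[r] = x[r] - x[7-r], for r = 0..3. */
static void dct_stage1_vertical(int16_t rows[BLOCK][ELEMENTS],
                                int16_t sum[BLOCK / 2][ELEMENTS],
                                int16_t diff[BLOCK / 2][ELEMENTS]) {
    for (int r = 0; r < BLOCK / 2; r++)
        for (int n = 0; n < ELEMENTS; n++) {  /* = one line operation */
            sum[r][n]  = (int16_t)(rows[r][n] + rows[BLOCK - 1 - r][n]);
            diff[r][n] = (int16_t)(rows[r][n] - rows[BLOCK - 1 - r][n]);
        }
}
```

In this layout a vertical pass needs only line operations; a horizontal pass would additionally need data movement between neighboring cells.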
I/O System
[Figure: I/O system block diagram: Connex Array and I/O Plane, IOC, IS, switch fabric, interrupts, and a DDR-DRAM controller with four DRAM devices]
Computation-Intensive Architecture
• All forms of parallelism are strongly segregated
  – Connex Array for data-parallel computation
  – Speculative Array for time-parallel computation
• The granularity fits the application domain
  – 16-bit processing elements
  – No MACs, no FPUs, no multipliers
High I/O Bandwidth
• External I/O: 3.2 GBytes/sec
  – Serial access and random access with similar performance
• Internal I/O: 400 GBytes/sec
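The internal figure can be sanity-checked with one multiplication, provided one assumes a 200 MHz clock and one 16-bit RAM access per cell per cycle. Neither value is stated on this slide, so this is a plausibility check only. Under those assumptions 1,024 cells move 409.6 GB/s, in line with the quoted 400 GBytes/sec and about 125 times the 3.2 GBytes/sec external rate.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTIONS, not from the slide: 200 MHz clock and one 16-bit
   RAM access per cell per cycle. */
#define CELLS            1024ULL
#define BYTES_PER_ACCESS 2ULL
#define CLOCK_HZ         200000000ULL

/* Aggregate cell-to-local-RAM bandwidth in bytes per second. */
static uint64_t internal_bytes_per_sec(void) {
    return CELLS * BYTES_PER_ACCESS * CLOCK_HZ;  /* 409,600,000,000 */
}
```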
Area & Power Efficiency
• 2 GOPS/mm² (peak performance)
• GOPS/Watt is 25–50 times that of a mature sequential technology
Programming Connex
• CPL (Connex Programming Language) is an extension of C with C/C++ syntax
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators are defined for features not available in C, e.g., operations on vectors and selections
• CPL uses sequential operators and control structures on vector and select datatypes
• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
• Hides the complexities of the parallel execution hardware
• Complete SDK
{
    ...
    const short OFFSET = 15;
    ...
    short vector x, y;
    short vector min, max;
    ...
    sel = all;
    x += OFFSET;
    ...
    min = (x < y) ? x : y;
    max = (x > y) ? x : y;
    ...
}
Vectors are arrays of scalar components.
Selections are arrays of Boolean values that dictate which vector components are active.
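For readers without the Connex SDK, the CPL fragment can be mimicked in plain C to show what each statement means component by component. This is not CPL. The types `short_vector` and `selection`, the helper names, the 1,024-component vector length, and the assumption that the current selection gates `min`/`max` as well as `+=` are all illustrative choices.

```c
#include <assert.h>
#include <stdbool.h>

#define VLEN 1024  /* assumption: one vector component per cell */

typedef struct { short v[VLEN]; } short_vector;  /* stands in for `short vector` */
typedef struct { bool on[VLEN]; } selection;     /* stands in for a CPL selection */

/* sel = all; */
static void select_all(selection *sel) {
    for (int n = 0; n < VLEN; n++)
        sel->on[n] = true;
}

/* x += OFFSET;  applied only to active components */
static void add_scalar(short_vector *x, short s, const selection *sel) {
    for (int n = 0; n < VLEN; n++)
        if (sel->on[n])
            x->v[n] = (short)(x->v[n] + s);
}

/* min = (x < y) ? x : y;  max = (x > y) ? x : y;  component-wise */
static void min_max(short_vector *vmin, short_vector *vmax,
                    const short_vector *x, const short_vector *y,
                    const selection *sel) {
    for (int n = 0; n < VLEN; n++) {
        if (!sel->on[n])
            continue;
        vmin->v[n] = x->v[n] < y->v[n] ? x->v[n] : y->v[n];
        vmax->v[n] = x->v[n] > y->v[n] ? x->v[n] : y->v[n];
    }
}
```

In CPL the loops disappear: the same sequential-looking statements operate on every active component at once.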
Performance
• DCT: 0.35 clock cycles per pixel
• SAD: 0.0025 clock cycles per pixel
H.264 Dual HD Stream Decoding
Clock Cycles per Macroblock
Stage                     Cycles/Macroblock
Dezigzagging                           37.3
Intra Prediction                       54.1
IT/IQ                                  97.3
Motion Compensation                   114.3
Deblocking Filter                      27.1
Total                                 337.8
Allowed clock cycles per macroblock (2-channel 1080i): 409 cycles
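The 409-cycle budget can be reproduced with simple arithmetic under assumptions that the slide does not state: a 200 MHz clock, a 1920 × 1088 coded frame (1080 lines padded to a multiple of 16, giving 120 × 68 = 8,160 macroblocks), and 30000/1001 frames per second for each of the two channels. The hypothetical helper below yields about 408.9 cycles, which rounds to 409. Against that budget, the table's total of 337.8 cycles leaves roughly 71 cycles of headroom per macroblock.

```c
#include <assert.h>

/* ASSUMPTIONS (not on the slide): 200 MHz clock, 1920 x 1088 coded
   frame, 30000/1001 frames per second per channel. */
static double cycles_per_macroblock(double clock_hz, int width, int height,
                                    double frames_per_sec, int channels) {
    assert(width >= 16 && height >= 16 && channels > 0 && frames_per_sec > 0.0);
    int mbs_per_frame = (width / 16) * (height / 16);  /* 120 * 68 = 8160 */
    double mbs_per_sec = mbs_per_frame * frames_per_sec * channels;
    return clock_hz / mbs_per_sec;
}
```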
H.264 CABAC (SA) Decoding
• Targeted profile and level: Main Profile, Level 4.1
• Bit rate per stream considered: 35 Mbps (45 Mbps maximum)
• Number of bins to decode using CABAC: 47M/sec
• Number of clock cycles per bin: 1 cycle
• Cycles to decode bins per stream: 50 MHz
• Typical bit rate expected for DVB: 10 Mbps
• Cycles to decode bins for a typical (DVB) stream: 15 MHz
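With one clock cycle per bin, the clock rate needed for CABAC equals the bin rate. The sketch below makes that explicit and, as an assumption not stated on the slide, scales the bin count linearly with bit rate from the 47M bins/sec at 35 Mbps reference point. The helper names are hypothetical. The result is 47 MHz for the 35 Mbps stream and about 13.4 MHz for a 10 Mbps DVB stream; the slide's 50 MHz and 15 MHz figures appear to round these up.

```c
#include <assert.h>

/* Clock rate (MHz) needed to decode a given bin rate. */
static double required_clock_mhz(double bins_per_sec, double cycles_per_bin) {
    assert(bins_per_sec >= 0.0 && cycles_per_bin > 0.0);
    return bins_per_sec * cycles_per_bin / 1e6;
}

/* ASSUMPTION: bins scale linearly with bit rate, using the slide's
   47M bins/s at 35 Mbps as the reference (about 1.34 bins per bit). */
static double bins_per_sec_at(double mbps) {
    const double bins_per_bit = 47e6 / 35e6;
    return mbps * 1e6 * bins_per_bit;
}
```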
[Figure: CA1024 SoC block diagram. A ConnexArray™ Programmable Media Processor (multi-codec processing, pre-analysis, 3D filter, scaling, graphics processing, video merge/blend, motion-adaptive de-interlacing) with its Instruction Sequencer and I/O Controller, connected through switch fabrics to the SA, Host CPU, Audio CPU, TS/Sec CPU, and Video CPU. External interfaces: Video In/Out (BT.656/1120), Audio In/Out (I2S, S/PDIF), Host I/F (PCI v2.2 or Generic), Ext. Bus (Flash), DDR-DRAM Ctrl (400 MHz data rate, 64-bit-wide DRAM), and Test/ICE (JTAG, GPIO, I2C)]
CA1024 Project Status
[Figure: CA1024 floorplan with blocks labeled ACF, MIPS (×4), SA, PCI, DDRC, and four CA256 arrays]
• TSMC 0.13 micron
• 676-pin PBGA
• Samples Q3 2006
In Summary
• Fully programmable processor
• Computation-intensive architecture
• High-bandwidth I/O
• Connex Programming Language & SDK
• Die-area and power-efficient architecture
Thank You!