Connex Technology Proprietary
and Confidential
The CA1024:
A fully programmable system-on-chip for cost-effective HDTV media processing
Lazar Bivolarski, Bogdan Mitu, Anand Sheel,
Gheorghe Stefan, Tom Thomson, Dan Tomescu
Connex Technology, Inc.
• Core asset: ConnexArray™, an efficient data-parallel architecture
– 200 MHz
– 200 GOPS (16-bit simple integer operations)
– 60 GOPS/Watt
– 3.2 GB/sec external; 400 GB/sec internal
• Application domain: HDTV
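The headline figures can be cross-checked with a little arithmetic. The sketch below is only a plausibility check, not a statement of how the numbers were obtained: it assumes one simple 16-bit operation and one 16-bit local-memory access per cell per cycle, and it takes the 1,024-cell count, the 64-bit DRAM width and the 400 MHz DDR data rate from other slides in this deck.

```python
# Back-of-envelope check of the headline figures.
CLOCK_HZ = 200e6      # 200 MHz clock (from the slide)
CELLS = 1024          # processing cells in the ConnexArray (architecture slide)
WORD_BYTES = 2        # each cell is 16 bits wide

# Assumption: every cell completes one simple 16-bit operation per cycle.
peak_gops = CELLS * CLOCK_HZ / 1e9

# Assumption: every cell accesses one 16-bit word of local memory per cycle.
internal_gb_per_s = CELLS * WORD_BYTES * CLOCK_HZ / 1e9

# Assumption: the external figure is a 64-bit DRAM interface at a 400 MHz data rate.
external_gb_per_s = (64 // 8) * 400e6 / 1e9

# Power implied by combining the stated 200 GOPS with the stated 60 GOPS/Watt.
implied_watts = 200 / 60

print(f"peak compute:       {peak_gops:.1f} GOPS")
print(f"internal bandwidth: {internal_gb_per_s:.1f} GB/sec")
print(f"external bandwidth: {external_gb_per_s:.1f} GB/sec")
print(f"implied power:      {implied_watts:.2f} W")
```

Under these assumptions the compute and internal-bandwidth results land slightly above the rounded 200 GOPS and 400 GB/sec on the slide, and the external bandwidth matches 3.2 GB/sec.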
Our Solution:
Integral Parallel Machine
• Data-parallel computation: ConnexArray
• Time-parallel computation (supported by speculative parallelism): Stream Accelerator
• I/O process is transparent to the main data-parallel computational process: I/O Plane & IOC
The Connex Architecture
[Block diagram: I/O Controller (4KB data & 4KB program memory), Connex Array, Sequencer (4KB data & 32Kb program memory)]
• Connex Array: 1,024 linearly connected 16-bit Processing Cells
• Sequencer: 32-bit stack machine with program memory and data memory; in each cycle (on a 2-stage pipe) it issues one 64-bit instruction for the Connex Array and a 24-bit instruction for itself
• IO Controller: 32-bit stack machine; controls a 3.2 GB/s IO channel
• Processing Cell: Integer unit, data memory and Boolean unit
• The I/O channel works in parallel with code running on the Connex Array
[Processing Cell detail: 16-bit ALU; registers R0 to R7; 16-bit RAM for data, addresses 0 to 255; Address, Boolean and Index blocks; Connex I/O and AUX connections]
Connex Array Structure
• Processing Cells are linearly connected using only the register R0
• The I/O Plane consists of all the R1 registers, supervised mainly by the IO Controller
• Conditional execution is based on the state of the Boolean unit
• In each cycle the Integer unit, Boolean unit and Data memory execute command fields from a 64-bit instruction issued by the Sequencer
• Vector reduction operations deliver scalar results to the TOS (top of stack) of the Sequencer, which receives data from the array of cells through a 3-stage pipe
[Diagram: cells 0 to 1023, each with a 16-bit ALU, registers R0 to R7, data memory addresses 0 to 255, and an on/off Boolean state]
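Two of the mechanisms on this slide, neighbour communication through R0 and vector reduction to a scalar, can be illustrated with a small software model. This is a sketch under assumptions, not the hardware behaviour: the function names are hypothetical, cell 0 is assumed to receive a fill value when shifting, and the reduction is assumed to skip inactive cells and is modelled with unbounded Python integers.

```python
def shift_from_left(r0, fill=0):
    """Each cell reads the R0 value of its left neighbour; cell 0 receives `fill`."""
    return [fill] + r0[:-1]


def reduce_sum(values, active):
    """Add up the values held by active cells into one scalar.

    In the deck's terms the scalar would arrive on the Sequencer's stack;
    skipping inactive cells is an assumption of this sketch.
    """
    return sum(v for v, on in zip(values, active) if on)


# Tiny 4-cell example: difference between each cell and its left neighbour.
x = [3, 5, 9, 10]
left = shift_from_left(x)                  # [0, 3, 5, 9]
diff = [a - b for a, b in zip(x, left)]    # [3, 2, 4, 1]
total = reduce_sum(diff, [True, True, False, True])
```

A linear chain only moves data one cell per step, so algorithms that need distant neighbours pay one shift per position; the reduction path is the only route from the cells back to the Sequencer described on the slide.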
I/O System
[Block diagram labels: I/O Plane, Connex Array, IOC, IS, Interrupts, Switch Fabric (128-bit word), DDR-DRAM Controller, four DRAM blocks]
Full Line Operations:
Operate On All Elements in Parallel
[Figure: data memory drawn as lines 0 to 255 across cells 0 to 1023; Line i and Line j are combined with +, -, *, XOR, etc. to give Line k]
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
16-bit data operand
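A full-line operation can be modelled in a few lines. The sketch below is illustrative only: `line_op` is a hypothetical helper, each line is a Python list with one entry per cell, and results are assumed to wrap to 16 unsigned bits (the slide states only that operands are 16 bits wide).

```python
import operator

N_CELLS = 1024
MASK16 = 0xFFFF  # assumption: results wrap to 16 bits


def line_op(op, line_i, line_j):
    """Line k = Line i OP Line j; an int for line_j is repeated for all elements."""
    if isinstance(line_j, int):
        line_j = [line_j] * len(line_i)
    return [op(a, b) & MASK16 for a, b in zip(line_i, line_j)]


line_i = list(range(N_CELLS))
line_j = [2] * N_CELLS

line_k = line_op(operator.add, line_i, line_j)   # vector OP vector
inverted = line_op(operator.xor, line_i, MASK16)  # vector OP scalar
```

In this model the cost of `line_op` is one step regardless of how many cells the line spans, which is the point of the slide; the Python loop is only a stand-in for what all cells do at once.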
Columns Active Based On
Repeating Patterns
[Figure: lines 0 to 255 across cells 0 to 1023; Line k = Line i OP Line j (+, -, *, XOR, etc.) is applied only in the active columns]
Example: Mark all odd columns active. Or mark every third column active.
Or mark every third and fourth column active, etc.
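Repeating activity patterns are easy to express as Boolean masks over the column index. The following is a minimal sketch with hypothetical helper names; it assumes that inactive columns keep whatever Line k already held, which the slide does not state explicitly.

```python
def every_nth(n_cells, period, phase=0):
    """Mark columns whose index satisfies index % period == phase."""
    return [(c % period) == phase for c in range(n_cells)]


def masked_add(line_k, line_i, line_j, active):
    """Line k = Line i + Line j in active columns; other columns keep Line k."""
    return [a + b if on else k
            for k, a, b, on in zip(line_k, line_i, line_j, active)]


odd = every_nth(8, 2, phase=1)      # all odd columns
third = every_nth(8, 3, phase=2)    # every third column (columns 2, 5, ...)
either = [a or b for a, b in zip(odd, third)]  # one way to combine two patterns

result = masked_add([0] * 8, [1, 2, 3, 4, 5, 6, 7, 8], [10] * 8, odd)
```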
Columns Active Based On Results
of Previous Operations
[Figure: lines 0 to 255 across cells 0 to 1023; Line k = Line i OP Line j (+, -, *, XOR, etc.) is applied only in the marked columns]
Example: An apparently random set of columns is marked active, based on
data-dependent results of previous operations.
This enables selective processing based on data content.
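As an illustration of data-dependent selection, the sketch below marks a column active only where two lines differ by more than a threshold, then smooths just those columns. The filter, the threshold and the function names are invented for the example and are not taken from any Connex codec.

```python
def select_where_differs(line_a, line_b, threshold):
    """Mark columns where |a - b| exceeds the threshold (a data-dependent mask)."""
    return [abs(a - b) > threshold for a, b in zip(line_a, line_b)]


def average_where(line_a, line_b, active):
    """Replace a by the integer average of a and b, but only in active columns."""
    return [(a + b) // 2 if on else a
            for a, b, on in zip(line_a, line_b, active)]


line_a = [10, 50, 30]
line_b = [12, 10, 31]

active = select_where_differs(line_a, line_b, threshold=4)
smoothed = average_where(line_a, line_b, active)
```

Every column evaluates the same comparison and the same update instruction; the mask alone decides which columns commit a result, so no per-column branching is needed.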
Outer-Loop Parallelism:
Program in context of 128+ data-structure instances
Example: 8x8 DCT
[Figure: 8x8 blocks placed side by side across cells 0 to 1023, occupying Line i to Line j of the 256-line data memory]
Example: 128 sets of 8x8 run in parallel in a 1024-cell array
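The 128-block figure follows from 1024 cells divided by 8 columns per block. The sketch below shows one possible layout and a toy per-block computation; the layout (block b in cells 8b to 8b+7, block row r in line r) is an assumption, and the column-sum loop is a stand-in for a real transform, not a DCT.

```python
N_CELLS = 1024
BLOCK = 8
N_BLOCKS = N_CELLS // BLOCK  # 128 blocks side by side


def cell_of(block, col):
    """Cell index holding column `col` of block `block` (assumed layout)."""
    return block * BLOCK + col


def pack(blocks):
    """blocks[b][r][c] -> lines[r][cell]: row r of every block shares one line."""
    return [[blocks[cell // BLOCK][r][cell % BLOCK] for cell in range(N_CELLS)]
            for r in range(BLOCK)]


def column_sums(lines):
    """Sum the 8 rows with line additions; returns the sums and the step count."""
    acc = list(lines[0])
    steps = 0
    for row in lines[1:]:
        acc = [a + b for a, b in zip(acc, row)]  # one full-line operation
        steps += 1
    return acc, steps


# Block b is filled with the constant b, so each of its column sums is 8 * b.
blocks = [[[b] * BLOCK for _ in range(BLOCK)] for b in range(N_BLOCKS)]
lines = pack(blocks)
sums, steps = column_sums(lines)
```

The step count depends only on the block height: seven line additions serve all 128 blocks, which is the sense in which the program is written for one block and runs in the context of many.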
Fine-Grain Parallelism and Time
Distributed Processing
[Figure: data memory (lines 0 to 255, cells 0 to 1023) divided into regions labelled Input/Output, Motion Comp, Prediction, IDCT/IQ, De-zigzag and Deblocking, operating on 16x16 macroblocks; caption: Fine-Grained Processing]
The fine-grain parallelism allows different algorithms to be
applied at the same time for increased parallelism
Local Memory Mapping Based on
Data Dependency
[Figure: numbered 16x16 blocks of the input frame (In Frame) mapped into local memory lines 0 to 255 across cells 0 to 1023 (Local Mapping)]
Local data dependency remapping and processing of multiple
neighboring blocks enable a high degree of parallelism
Programming Connex
• CPL (Connex Programming Language) is an extension of C
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators are defined for features not available in C, e.g. operations on vectors and selections
• CPL uses sequential operators and control structures on vector and select data types
• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines
{ ...
const short OFFSET = 15;
...
short vector x, y;
short vector min, max;
...
sel = all;
x += OFFSET;
...
min = x;
max = x;
min = (min > y)? y; /* min = min(x, y) */
max = (max < y)? y; /* max = max(x, y) */
...
}
Vectors are arrays of scalar components.
Selections are arrays of Boolean values that
dictate what vector components are active.
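For readers without a CPL toolchain, the listing above can be mimicked in plain Python. This is a sketch of one reading of the CPL semantics, inferred from the listing's own comments: `(cond)? y` is taken to mean "assign y in the components where cond holds and leave the others unchanged". The helper `select_assign` and the sample data are hypothetical.

```python
OFFSET = 15


def select_assign(dest, condition, source):
    """dest = (condition)? source; components where the condition is false keep dest."""
    return [s if c else d for d, c, s in zip(dest, condition, source)]


x = [1, 40, 7, 22]
y = [20, 30, 7, 90]

# sel = all; x += OFFSET;  (every component is active)
x = [v + OFFSET for v in x]

vmin = list(x)
vmax = list(x)
# min = (min > y)? y;
vmin = select_assign(vmin, [a > b for a, b in zip(vmin, y)], y)
# max = (max < y)? y;
vmax = select_assign(vmax, [a < b for a, b in zip(vmax, y)], y)
```

With this reading the two conditional assignments produce the component-wise minimum and maximum of x and y, matching the comments in the CPL listing.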
ConnexArray™ Programmable Media Processor (CA1024)
[Block diagram labels: Configurable Switch Fabric; Stream Accelerator; Instruction Sequencer; I/O Sequencer; Host CPU, Audio CPU, TS/Sec CPU, Video CPU; DDR-DRAM Ctrl (400 MHz data rate) to 64-bit wide DRAM; Host I/F (PCI v2.2 or generic); Ext. Bus (Flash); Video In and Video Out ports (BT.656/1120); Audio In and Audio Out ports (1x I2S, 4x I2S); S/PDIF; EJTAG, GPIO, I2C; Test, ICE]
ConnexArray functions: Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Video Merge/Blend, Motion Adaptive De-interlacing
The main strategic decisions in
defining Connex Architecture
• Simple architecture:
– nothing spectacular at the circuit level
– no technological challenges
• Fully programmable (no pieces of hardware to solve critical problems)
• Tuned to the application domain (HDTV)
• Programming language able to hide the structural details (because they are simple)
– Efficient compiler
– Cycle accurate simulator
• Imaginative algorithms to adapt the architecture to the application domain
What differentiates Connex from
other Parallel Architectures
• All forms of parallelism are strongly segregated
– ConnexArray for data-parallel computation
– Stream Accelerator for time-parallel (speculative) computation
• The granularity perfectly fits the application domain
– small and simple 16-bit processing elements
– enough local data memory (256 16-bit words)
– no MACs, no FPUs, no multipliers…
• The simplest interconnection network allowed by the locality of the parallel computation
• “Smart” IO process, able to save computation or to be supported by additional computation in IO-bound applications
Performance
• > 2 GOPS/mm² (peak performance)
• 60 GOPS/Watt
• Dot Product: 28 cycles (16-bit, 1K-component vectors)
• DCT: 0.35 clock cycles per pixel
• SAD: 0.0025 clock cycles per pixel
• Decoding an H.264 dual HD stream uses 83% of the ConnexArray computational power
Performance Comparisons
16-bit Fixed-Point Sum of Absolute Differences (16x16 SAD, Motion Estimation)
[Bar chart: SAD/MHz, axis ticks 1,000; 10,000; 100,000; 1,000,000; devices: Analog Devices BF651, TI C64xx, Equator BSP16-500, Connex CA1024; annotations: 25X, 50X, 100X]
16-bit Fixed-Point Discrete Cosine Transform (8x8 DCT, Image Compression)
[Bar chart: DCT/MHz, axis ticks 1,000; 10,000; 100,000; 1,000,000; devices: Analog Devices BF651, TI C64xx, Tensilica VectraDSP, Connex CA1024; annotations: 20X, 70X, 100X]
ConnexArray Performance Decoder
VC-1 Dual HD Stream
Stage                           Clock Cycles/Macro-Block
IT/IQ                           106.7
Deringing Filter                14.3
Loop Filter                     15.4
Motion Vector Compensation      35.3
Motion Vector Reconstruction    20
Overlap Transform               20.8
DC Prediction                   16.3
AC Prediction                   23.3
Dezigzagging                    24.7
Total                           276.8 (67%)

Allowed clock cycles/macro-block (2 channel, 1080i): 409 clocks/MB
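The 409-cycle allowance can be reproduced from the clock rate and the macroblock rate. The sketch below assumes a 1920x1080 picture coded as 16x16 macroblocks at 30 frames per second per channel (60 fields for 1080i) and the 200 MHz clock quoted elsewhere in the deck; the slide itself does not spell out these inputs.

```python
import math

CLOCK_HZ = 200e6  # clock rate quoted in the deck


def macroblocks_per_second(width, height, frames_per_second, mb_size=16):
    """Macroblock rate for one channel; partial macroblocks are rounded up."""
    per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    return per_frame * frames_per_second


def cycles_per_macroblock(clock_hz, channels, mb_rate):
    """Clock cycles available to process one macroblock."""
    return clock_hz / (channels * mb_rate)


mb_rate = macroblocks_per_second(1920, 1080, 30)   # assumed 1080i parameters
budget = cycles_per_macroblock(CLOCK_HZ, 2, mb_rate)

# Per-stage VC-1 figures copied from the table on this slide.
VC1_STAGES = [106.7, 14.3, 15.4, 35.3, 20, 20.8, 16.3, 23.3, 24.7]
vc1_total = sum(VC1_STAGES)
vc1_share = vc1_total / 409   # fraction of the stated allowance
```

Under these assumptions the budget comes out at roughly 408.5 cycles, in line with the 409 on the slide, and the per-stage figures add up to the stated total of 276.8, about two thirds of the allowance.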
CA1024 Project Status
[Die floorplan labels: four CA256 blocks, four MIPS blocks, SA, PCI, DDRC, ACF, WOA]
• TSMC 0.13 micron
• 200 MHz clock rate
• Standard ASIC flow
• 676-pin PBGA
• Samples Q4 2006
Thank You!
Back-up slides
Connex Value Proposition
• Fully programmable solution for HDTV video encoding, decoding, trans-coding and post-processing
• Silicon-efficient architecture with die size competitive with similar-function ASICs
• High performance, enabling multi-standard, multi-channel HDTV
ConnexArray Performance Decoder
H.264 Dual HD Stream
Stage                  Clock Cycles/Macroblock
IT/IQ                  97.3
Deblocking Filter      27.1
Motion Compensation    114.3
Intra Prediction       54.1
Dezigzagging           37.3
Total                  337.8

Allowed clock cycles/macroblock (2 channel, 1080i): 409 clocks/MB
StreamAccelerator performing H.264 CABAC
Decoding
• Targeted profile and level: Main Profile, Level 4.1
• Bit-rate/stream considered: 25 Mbps
• Number of bins to decode using CABAC: 35M/sec
• Number of clock cycles per bin: < 2 cycles
• Cycles to decode bins/stream: 70M
• Typical bit-rate expected for DVB: 10 Mbps
• Cycles to decode bins for typical stream (DVB): 30M
• Available cycles/stream: 100M
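The CABAC budget above is simple arithmetic and can be retraced as follows. Two assumptions are made that the slide does not state: the 100M available cycles are taken to be a 200 MHz clock shared by two streams, and the bin rate is taken to scale linearly with bit-rate when going from 25 Mbps to 10 Mbps.

```python
CLOCK_HZ = 200e6           # assumption: Stream Accelerator runs at the chip clock
STREAMS = 2                # assumption: dual-stream decoding shares the cycles

BINS_PER_SEC = 35e6        # from the slide, at 25 Mbps
MAX_CYCLES_PER_BIN = 2     # slide states "< 2 cycles", so this is an upper bound

available = CLOCK_HZ / STREAMS                    # cycles per second per stream
worst_case = BINS_PER_SEC * MAX_CYCLES_PER_BIN    # upper bound at 25 Mbps
headroom = available - worst_case

# Assumption: bins per second scale linearly with the bit-rate.
dvb_bins = BINS_PER_SEC * 10e6 / 25e6
dvb_cycles = dvb_bins * MAX_CYCLES_PER_BIN

print(f"available:  {available / 1e6:.0f}M cycles/sec/stream")
print(f"worst case: {worst_case / 1e6:.0f}M cycles/sec")
print(f"DVB stream: {dvb_cycles / 1e6:.0f}M cycles/sec")
```

Linear scaling gives 28M cycles for the DVB case, which the slide appears to round to 30M; both cases stay under the 100M cycles available per stream.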
Device Cost Comparison
[Chart: Relative Pad Limited Die Size, vertical axis 0.00 to 1.60; devices: BCM7041 and BCM7038 (Broadcom), EM8634 (Sigma), STI7100 (ST), Xilleon226 (ATI), CX24176 (Conexant), CA1024 (Connex)]
Assumptions:
1) Die size is pad limited
2) Staggered, minimum pitch pads
3) All devices are in 130nm process