HC18.S5T2: The CA1024 - A Fully Programmable System-On-Chip …

Connex Technology Proprietary and Confidential

The CA1024: A fully programmable system-on-chip for cost-effective HDTV media processing

Lazar Bivolarski, Bogdan Mitu, Anand Sheel, Gheorghe Stefan, Tom Thomson, Dan Tomescu


Connex Technology, Inc.

• Core asset: ConnexArray™, an efficient data-parallel architecture
  – 200 MHz
  – 200 GOPS (16-bit simple integer operations)
  – 60 GOPS/Watt
  – 3.2 GB/sec external; 400 GB/sec internal
• Application domain: HDTV
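The headline figures can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope model: it assumes 1,024 cells (the count given for the Connex Array elsewhere in the deck) and, as our own assumption, one 16-bit operation and one 16-bit local-memory access per cell per clock cycle.

```python
# Back-of-the-envelope check of the peak figures quoted on the slide.
# Assumptions (not stated on the slide): 1,024 cells, one 16-bit operation
# and one 16-bit local-memory access per cell per clock cycle.
CELLS = 1024
CLOCK_HZ = 200e6
BYTES_PER_WORD = 2

peak_gops = CELLS * CLOCK_HZ / 1e9
internal_gb_per_s = CELLS * BYTES_PER_WORD * CLOCK_HZ / 1e9
implied_watts = 200 / 60  # quoted 200 GOPS divided by quoted 60 GOPS/Watt

print(f"peak: {peak_gops:.1f} GOPS")
print(f"internal bandwidth: {internal_gb_per_s:.1f} GB/s")
print(f"implied power: {implied_watts:.2f} W")
```

Under these assumptions the model gives roughly 205 GOPS and 410 GB/s, in line with the rounded 200 GOPS and 400 GB/sec on the slide.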


Our Solution: Integral Parallel Machine

• Data-parallel computation: ConnexArray
• Time-parallel computation (supported by speculative parallelism): Stream Accelerator
• I/O process is transparent to the main data-parallel computational process: I/O Plan & IOC


The Connex Architecture

Connex Array: 1,024 linearly connected 16-bit Processing Cells

Sequencer: 32-bit stack machine & program memory & data memory; issues in each cycle (on a 2-stage pipe) one 64-bit instruction for the Connex Array and a 24-bit instruction for itself

IO Controller: 32-bit stack machine; controls a 3.2 GB/s IO channel

Processing Cell: Integer unit & data memory & Boolean unit

I/O channel works in parallel with code running on the Connex Array

[Figure: block diagram. Labels: I/O Controller (4KB data & 4KB program memory); Sequencer (4KB data & 32Kb program memory); Connex Array; per cell: 16-bit ALU, registers R0 to R7, Boolean, Index, Address, AUX, 16-bit RAM for data (addresses 0 to 255), Connex I/O]


Connex Array Structure

• Processing Cells are linearly connected using only the register R0
• IO Plan consists of all R1s, supervised mainly by the IO Controller
• Conditional execution based on the state of the Boolean unit
• Integer unit, Boolean unit and Data memory execute in each cycle command fields from a 64-bit instruction issued by the Sequencer
• Vector reduction operations with scalar results in the TOS of the Sequencer (receiving data from the array of cells through a 3-stage pipe)

[Figure: cells 0 to 1023, each with a 16-bit ALU, registers R0 to R7, a data memory addressed 0 to 255, and an on/off Boolean state]
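Two of the mechanisms listed on this slide, the linear R0 connection and the vector reduction to a scalar, can be modelled in a few lines. This is a toy model only: the shift direction, the fill value at the array edge, and the function names are assumptions, and the 3-stage pipeline latency is not modelled.

```python
def shift_right(r0, fill=0):
    """Move every cell's R0 value to its right-hand neighbour.

    Assumption: the leftmost cell receives `fill`; the slide does not say
    what happens at the ends of the linear chain.
    """
    return [fill] + r0[:-1]


def reduce_sum(line, active):
    """Add up the values held by the active cells and return one scalar,
    standing in for a reduction result delivered to the Sequencer."""
    return sum(value for value, on in zip(line, active) if on)


r0 = [10, 20, 30, 40]
active = [True, False, True, False]    # state of each cell's Boolean unit

shifted = shift_right(r0)
total_active = reduce_sum(r0, active)
print(shifted, total_active)
```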


I/O System

[Figure: I/O system block diagram. Labels: I/O Plane, Connex Array, IOC, Switch Fabric (128-bit word), IS, Interrupts, DDR-DRAM Controller, DRAM (x4)]


Full Line Operations: Operate On All Elements in Parallel

[Figure: array of columns 0 to 1023 by lines 0 to 255; Line i OP Line j = Line k, where OP is +, -, *, XOR, etc.; 16-bit data operand]

Line k = Line i OP Line j

Line k = Line i OP scalar value (repeated for all elements)
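The two forms of full-line operation can be sketched as element-wise operations over 1,024-element lists. This is an illustrative model, not Connex code; the helper name `line_op` and the unsigned 16-bit wrap-around are assumptions.

```python
import operator

CELLS = 1024      # one element per processing cell
MASK16 = 0xFFFF   # assumption: results wrap to unsigned 16 bits


def line_op(op, line_i, operand):
    """Apply `op` element-wise; a scalar operand is repeated for all elements."""
    if isinstance(operand, int):
        operand = [operand] * len(line_i)
    return [op(a, b) & MASK16 for a, b in zip(line_i, operand)]


line_i = list(range(CELLS))
line_j = [3] * CELLS

line_k = line_op(operator.add, line_i, line_j)    # Line k = Line i + Line j
line_x = line_op(operator.xor, line_i, 0x00FF)    # Line k = Line i XOR scalar
print(line_k[:4], line_x[:4])
```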


Columns Active Based On Repeating Patterns

[Figure: array of columns 0 to 1023 by lines 0 to 255; Line i OP Line j = Line k, where OP is +, -, *, XOR, etc., applied only in the active columns]

Example: Mark all odd columns active. Or mark every third column active. Or mark every third and fourth column active, etc.
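A pattern-based selection can be modelled as a Boolean list built from the column index. The sketch below is hypothetical: the helper names are ours, and "every third and fourth column" is read as the third and fourth column in each group of four, which is only one possible reading of the slide.

```python
CELLS = 1024


def pattern_selection(period, active_offsets, cells=CELLS):
    """Column c is active when c % period is one of active_offsets."""
    return [(c % period) in active_offsets for c in range(cells)]


def masked_add(line_i, line_j, line_k, selection):
    """Line k = Line i + Line j in active columns; inactive columns keep Line k."""
    return [(a + b) & 0xFFFF if on else k
            for a, b, k, on in zip(line_i, line_j, line_k, selection)]


odd_columns = pattern_selection(2, {1})
every_third = pattern_selection(3, {0})
third_and_fourth = pattern_selection(4, {2, 3})

result = masked_add([1] * 4, [2] * 4, [9] * 4, odd_columns[:4])
print(result)
```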


Columns Active Based On Results of Previous Operations

[Figure: array of columns 0 to 1023 by lines 0 to 255; Line i OP Line j = Line k, where OP is +, -, *, XOR, etc., applied only in the marked columns]

Example: Apparently random columns are active (marked), based on data-dependent results of previous operations. This enables selective processing based on data content.
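Data-dependent selection can be illustrated with a clamp: a comparison marks the columns that exceed a limit, and a later operation touches only those columns. This is a sketch with made-up data and hypothetical helper names, not Connex code.

```python
def select_where(predicate, line_a, line_b):
    """Build a selection from a per-column comparison of two lines."""
    return [predicate(a, b) for a, b in zip(line_a, line_b)]


def masked_assign(dest, src, selection):
    """Copy src into dest only in the columns marked active."""
    return [s if on else d for d, s, on in zip(dest, src, selection)]


pixels = [12, 300, 45, 999, 255, 256]   # illustrative data
limit = [255] * len(pixels)

too_big = select_where(lambda a, b: a > b, pixels, limit)
clamped = masked_assign(pixels, limit, too_big)
print(too_big, clamped)
```

Which columns end up active depends only on the data, which is why the active set looks random when drawn as a figure.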


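Outer-Loop Parallelism: Program in context of 128+ data-structure instances

Example: 8x8 DCT

[Figure: array of columns 0 to 1023 by lines 0 to 255; 8x8 blocks placed side by side across Line i to Line j, each block spanning columns 0 to 7 and lines 0 to 7 of its own region]

Example: 128 sets of 8x8 run in parallel in a 1024-cell array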
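One layout consistent with the figure places the 128 blocks side by side, so block b occupies cells 8b to 8b+7 and eight consecutive lines of local memory. The mapping below is an assumption drawn from that reading; the function name and the `base_line` parameter are hypothetical.

```python
CELLS = 1024
BLOCK = 8
BLOCKS_PER_BATCH = CELLS // BLOCK   # 128 blocks fit across the array


def cell_and_line(block, row, col, base_line=0):
    """Map element (row, col) of 8x8 block `block` to (cell, line).

    Assumed layout: blocks sit side by side across the cells, and each
    block row lives in its own line of the cells' local memory.
    """
    if not (0 <= block < BLOCKS_PER_BATCH and 0 <= row < BLOCK and 0 <= col < BLOCK):
        raise ValueError("index out of range")
    return block * BLOCK + col, base_line + row


print(cell_and_line(0, 0, 0), cell_and_line(127, 7, 7))
```

With this layout a single instruction stream written for one 8x8 block processes all 128 blocks at once, which is the outer-loop parallelism the slide describes.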


Fine-Grain Parallelism and Time Distributed Processing

[Figure: array of columns 0 to 1023 by lines 0 to 255, divided into regions labelled Input/Output, Motion Comp, Prediction, IDCT/IQ, De-zigzag, and Deblocking (Fine-Grained Processing); 16x16 macroblock]

The Fine-Grain Parallelism allows different algorithms to be applied at the same time for increased parallelism


Local Memory Mapping Based on Data Dependency

[Figure: numbered 16x16 blocks "In Frame" remapped ("Local Mapping") onto the array of columns 0 to 1023 by lines 0 to 255]

Local data dependency remapping and processing of multiple neighboring blocks enable a high degree of parallelism


Programming Connex

• CPL (Connex Programming Language) is an extension of C
• Code that operates on scalar data is written in regular C notation
• Connex-specific operators are defined for features not available in C, e.g. operations on vectors, selections
• CPL uses sequential operators and control structures on vector and select data-types
• Using CPL, the Connex Machine is programmed the same way as conventional sequential machines

    { ...
      const short OFFSET = 15;
      ...
      short vector x, y;
      short vector min, max;
      ...
      sel = all;
      x += OFFSET;
      ...
      min = x;
      max = x;
      min = (min > y)? y;   /* min = min(x, y) */
      max = (max < y)? y;   /* max = max(x, y) */
      ...
    }

Vectors are arrays of scalar components.

Selections are arrays of Boolean values that dictate what vector components are active.
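The CPL fragment can be emulated in ordinary Python to make its semantics concrete. This is an interpretation, not a CPL reference: it assumes that `dest = (cond)? src;` updates only the components where the condition holds and leaves the others unchanged, and the helper `where_assign` and the sample values are ours.

```python
OFFSET = 15


def where_assign(dest, cond, src):
    """Emulate `dest = (cond)? src;` (assumed semantics): update only the
    components where cond is true, leave the rest untouched."""
    return [s if c else d for d, c, s in zip(dest, cond, src)]


x = [1, -20, 7, 100]    # illustrative vector contents
y = [10, 0, 30, 50]

x = [xi + OFFSET for xi in x]        # x += OFFSET;  with sel = all
vmin, vmax = list(x), list(x)        # min = x;  max = x;
vmin = where_assign(vmin, [m > yi for m, yi in zip(vmin, y)], y)
vmax = where_assign(vmax, [m < yi for m, yi in zip(vmax, y)], y)
print(vmin, vmax)
```

Each comparison yields a selection, and the assignment then acts only on the selected components, so no per-element loop or branch appears in the CPL source.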


CA1024

[Figure: CA1024 block diagram. Internal blocks: ConnexArray™ Programmable Media Processor (Multi-Codec Processing, Pre-Analysis, 3D Filter, Scaling, Video Merge/Blend, Motion Adaptive De-interlacing); Instruction Sequencer; I/O Sequencer; StreamAccelerator; Host CPU; Audio CPU; TS/Sec CPU; Video CPU; Configurable Switch Fabric; DDR-DRAM Ctrl (400 MHz Data Rate). External interface labels: 64-bit Wide DRAM; HOST I/F; PCI v2.2 or Generic; Ext. Bus; Flash; Audio In; Audio Out; 1x-I2S; 4xI2S; S/PDIF; Video In; Video Out; BT.656/1120; Test; ICE; EJTAG; GPIO; I2C]


The main strategic decisions in defining the Connex Architecture

• Simple architecture:
  – nothing spectacular at the circuit level
  – no technological challenges
• Fully programmable (no pieces of hardware to solve critical problems)
• Tuned to the application domain (HDTV)
• Programming language able to hide the structural details (because they are simple)
  – Efficient compiler
  – Cycle accurate simulator
• Imaginative algorithms to adapt the architecture to the application domain


What differentiates Connex from other Parallel Architectures

• All forms of parallelism are strongly segregated
  – ConnexArray for data-parallel computation
  – Stream Accelerator for time-parallel (speculative) computation
• The granularity perfectly fits the application domain
  – small, simple 16-bit processing elements
  – enough local data memory (256 16-bit words)
  – no MACs, no FPUs, no multipliers…
• The simplest interconnection network allowed by the parallel computational locality
• “Smart” IO process able to save computation or supported by additional computation for IO-bound applications


Performance

• > 2 GOPS/mm² (peak performance)
• 60 GOPS/Watt
• Dot Product: 28 cycles (16-bit, 1K-component vectors)
• DCT: 0.35 clock cycle per pixel
• SAD: 0.0025 clock cycle per pixel
• Decodes an H.264 dual HD stream using 83% of the ConnexArray computational power


Performance Comparisons

[Figure: bar chart, 16-bit Fixed-Point Sum of Absolute Differences (16X16 SAD - Motion Estimation). Y-axis: SAD/MHz, log scale from 1,000 to 1,000,000. Devices: Analog Devices BF651, TI C64xx, Equator BSP16-500, Connex CA1024. Ratio labels: 25X, 50X, 100X]

[Figure: bar chart, 16-bit Fixed-Point Discrete Cosine Transform (8X8 DCT - Image Compression). Y-axis: DCT/MHz, log scale from 1,000 to 1,000,000. Devices: Analog Devices BF651, TI C64xx, Tensilica VectraDSP, Connex CA1024. Ratio labels: 20X, 70X, 100X]


ConnexArray Decoder Performance: VC-1 Dual HD Stream

Stage                            Clock Cycles/Macro-Block
IT/IQ                            106.7
Deringing Filter                 14.3
Loop Filter                      15.4
Motion Vector Compensation       35.3
Motion Vector Reconstruction     20
Overlap Transform                20.8
DC Prediction                    16.3
AC Prediction                    23.3
Dezigzagging                     24.7
Total                            276.8 (67%)

Allowed clock cycles/macro-block (2 channel, 1080i): 409 Clocks/MB
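The allowed budget of about 409 cycles per macro-block can be reproduced from the clock rate and the macro-block rate. The sketch below is a rough check under stated assumptions: a 200 MHz clock (quoted elsewhere in the deck), 1920x1080 frames padded to whole 16x16 macro-blocks, and 1080i treated as 30 frames per second. The stage costs are copied from the slide.

```python
import math

CLOCK_HZ = 200e6            # clock rate quoted elsewhere in the deck
WIDTH, HEIGHT = 1920, 1080  # assumption: full-HD frame size
MB_SIZE = 16
FRAMES_PER_S = 30           # assumption: 1080i carried as 30 frames/s
CHANNELS = 2

# Stage costs in clock cycles per macro-block, as listed on the slide.
VC1_STAGES = {
    "IT/IQ": 106.7,
    "Deringing Filter": 14.3,
    "Loop Filter": 15.4,
    "Motion Vector Compensation": 35.3,
    "Motion Vector Reconstruction": 20,
    "Overlap Transform": 20.8,
    "DC Prediction": 16.3,
    "AC Prediction": 23.3,
    "Dezigzagging": 24.7,
}

mb_per_frame = math.ceil(WIDTH / MB_SIZE) * math.ceil(HEIGHT / MB_SIZE)
mb_per_s = mb_per_frame * FRAMES_PER_S * CHANNELS
budget = CLOCK_HZ / mb_per_s
total = sum(VC1_STAGES.values())

print(f"budget: {budget:.1f} cycles/macro-block")
print(f"VC-1 total: {total:.1f} cycles, {total / 409:.1%} of the quoted 409")
```

Under these assumptions the budget comes out at about 408.5 cycles, close to the 409 quoted, and the stage costs add up to the 276.8 total, roughly two thirds of the budget.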


CA1024 Project Status

[Figure: chip floorplan. Labels include CA256 (x4), MIPS (x4), SA, PCI, DDRC, ACF]

• TSMC 0.13 micron
• 200 MHz clock rate
• Standard ASIC flow
• 676-pin PBGA
• Samples Q4 2006


Thank You!


Back-up slides


Connex Value Proposition

• Fully programmable solution for HDTV video encoding, decoding, trans-coding and post-processing
• Silicon-efficient architecture with die size competitive with similar-function ASICs
• High performance enabling multi-standard, multi-channel HDTV


ConnexArray Decoder Performance: H.264 Dual HD Stream

Stage                    Clock Cycles/Macroblock
IT/IQ                    97.3
Deblocking Filter        27.1
Motion Compensation      114.3
Intra Prediction         54.1
Dezigzagging             37.3
Total                    337.8

Allowed clock cycles/macroblock (2 channel, 1080i): 409 Clks/MB


StreamAccelerator performing H.264 CABAC Decoding

• Targeted profile and level: Main Profile, Level 4.1
• Bit-rate/stream considered: 25 Mbps
• Number of bins to decode using CABAC: 35M/sec
• Number of clock cycles per bin: < 2 cycles
• Cycles to decode bins/stream: 70M
• Typical bit-rate expected for DVB: 10 Mbps
• Cycles to decode bins for typical stream (DVB): 30M
• Available cycles/stream: 100M
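The cycle figures on this slide follow from simple multiplication. The sketch below reproduces them under two assumptions of ours: the available 100M cycles per stream is a 200 MHz clock shared by two streams, and the number of bins scales linearly with bit-rate. The slide gives fewer than 2 cycles per bin, so 2 is used as an upper bound.

```python
CLOCK_HZ = 200e6                 # assumption: 200 MHz clock
STREAMS = 2                      # assumption: two streams share the clock
BINS_PER_S_AT_25MBPS = 35e6      # from the slide
CYCLES_PER_BIN = 2               # slide says "< 2", so this is an upper bound

available_per_stream = CLOCK_HZ / STREAMS
worst_case_cycles = BINS_PER_S_AT_25MBPS * CYCLES_PER_BIN

# Assumption: the bin count scales linearly with bit-rate (10 Mbps vs 25 Mbps).
dvb_cycles = worst_case_cycles * 10 / 25

print(f"available: {available_per_stream / 1e6:.0f}M cycles/s per stream")
print(f"25 Mbps stream: {worst_case_cycles / 1e6:.0f}M cycles/s")
print(f"10 Mbps stream: {dvb_cycles / 1e6:.0f}M cycles/s")
```

The linear scaling gives 28M cycles for the 10 Mbps case, which the slide rounds to 30M; both cases fit within the 100M cycles available per stream.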


Device Cost Comparison

[Figure: bar chart of Relative Pad Limited Die Size (y-axis from 0.00 to 1.60) for BCM7041 and BCM7038 (Broadcom), EM8634 (Sigma), STI7100 (ST), Xilleon226 (ATI), CX24176 (Conexant), and CA1024 (Connex)]

Assumptions:
1) Die size is pad limited
2) Staggered, minimum pitch pads
3) All devices are in 130nm process

