Post on 21-Dec-2015

Adaptive System on a Chip (aSoC) for Low-Power Signal Processing

Andrew Laffely, Jian Liang, Prashant Jain, Ning Weng, Wayne Burleson, Russell Tessier

Department of Electrical and Computer Engineering

University of Massachusetts, Amherst

{alaffely, jliang, pjain, nweng, burleson, tessier}@ecs.umass.edu

This material is based upon work supported by the National Science Foundation under Grant No. 9988238. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Overview

• Motivation
  • Video Processing
• Architecture
• Dynamic Power Management
  • Core, Interconnect, and Clock

Problem

• Wireless video processing requires
  • High throughput
  • Low power
  • Flexibility

System on a Chip Solutions

• Take advantage of parallelism
  • Possible improved performance
• Allow use and reuse of existing integrated components
• If
  • The application can be partitioned
  • The appropriate architecture is used

Proposed Architecture: aSoC

• High throughput
  • Heterogeneous processor elements
    • Use the right tool for the job
  • Fast and predictable interconnect
• Flexible
  • Runtime reconfiguration of cores and interconnect
• Power consumption
  • Implement power-saving features in both cores and interconnect
  • Use reconfiguration to dynamically control power consumption

aSoC: adaptive System on a Chip

• Tiled SoC architecture

[Figure: tile array with cores labeled DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, and Motion Estimation and Compensation]

aSoC: adaptive System on a Chip

• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
  • Pick and place cores that best perform the given application
  • Increase performance
  • Save power
• Cores may be any number of tiles in size

aSoC: adaptive System on a Chip

• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with an interconnect mesh
  • Restricted to near-neighbor communications
  • Creates pipeline
  • Decreases cycle time

aSoC: adaptive System on a Chip

• Tiled SoC architecture
• Supports the use of independently developed heterogeneous cores
• Connected with a fixed interconnect mesh
• Using a communication interface (CI) to manage data
  • Network port (Coreport) for each core
  • Each CI uses a memory and FSM to repetitively process a predefined schedule of communications
  • Crossbar

Stream Control

• Instruction memory
  • Holds the predetermined schedule of communications
• PC
  • Selects and synchronizes the communications
• Decoder
  • Sets crossbar
• Controller
  • Sets PC
  • Interprets incoming configuration commands
• Crossbar
  • Any input to any set of outputs

[Figure: communication interface block diagram; labels: Inputs and Outputs (North, South, East, West, Core), Decoder/Controller, PC, Instruction Memory, Local Config]
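The stream-control blocks on this slide can be mimicked in a few lines of Python. This is a minimal sketch, not the aSoC hardware design: the class name `CommunicationInterface`, the port names, and the dictionary-based instruction format are all assumptions made for illustration. Each instruction maps an input port to a set of output ports (the crossbar setting), and the PC wraps around so the schedule repeats.

```python
# Hypothetical model of one communication interface (CI); names and the
# instruction encoding are illustrative assumptions, not the aSoC design.
PORTS = ("north", "south", "east", "west", "core")


class CommunicationInterface:
    def __init__(self, schedule):
        # Instruction memory: one crossbar setting per schedule cycle.
        # Each setting maps an input port to a tuple of output ports.
        for setting in schedule:
            for src, dsts in setting.items():
                assert src in PORTS and all(d in PORTS for d in dsts)
        self.instruction_memory = list(schedule)
        self.pc = 0  # program counter selecting the current instruction

    def step(self, inputs):
        """Route this cycle's input words, then advance the PC."""
        setting = self.instruction_memory[self.pc]  # decoder sets the crossbar
        outputs = {}
        for src, dsts in setting.items():
            if src in inputs:
                for dst in dsts:  # any input to any set of outputs
                    outputs[dst] = inputs[src]
        # Wrap around so the predefined schedule is processed repeatedly.
        self.pc = (self.pc + 1) % len(self.instruction_memory)
        return outputs

    def set_pc(self, value):
        """Controller action: a configuration command moves the PC."""
        self.pc = value % len(self.instruction_memory)


# Two-cycle schedule: cycle 0 sends the core's word east,
# cycle 1 forwards a word from the west to both the core and the east.
ci = CommunicationInterface([
    {"core": ("east",)},
    {"west": ("core", "east")},
])
first = ci.step({"core": 0xA1})
second = ci.step({"west": 0xB2})
third = ci.step({"core": 0xC3})  # PC has wrapped back to cycle 0
```

Because the routing is read from memory rather than negotiated at run time, every word's path and arrival cycle is known before the application starts.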

Example: Communication

[Figure: three adjacent tiles holding Core A, Core B, and Core C; label "Stream A-D"]

• A given application requires periodic communications from Core A to Core C
• aSoC uses a prescheduled communication STREAM
  • Core A places the data in a dedicated STREAM between the two tiles
  • Core C pulls the data from that STREAM
  • The tile-to-tile communication uses 3 cycles

Example: Stream

[Figure: tiles A, B, and C in a row, each annotated with its schedule entry]

• Step 1 (tile A): Core to East
• Step 2 (tile B): West to East
• Step 3 (tile C): West to Core
• LoopBack (the schedule repeats)
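The three schedule entries in this example (Core to East, West to East, West to Core) can be traced with a short Python sketch. The register names and the one-hop-per-cycle timing are assumptions chosen to match the stated 3-cycle transfer, not details taken from the aSoC implementation.

```python
# Illustrative sketch of the A -> B -> C stream; register names and the
# one-hop-per-cycle timing are assumptions, not taken from the aSoC design.

def run_stream(word):
    """Move one word from Core A to Core C and return a per-cycle trace."""
    link_ab = None    # interconnect register between tile A and tile B
    link_bc = None    # interconnect register between tile B and tile C
    core_c_in = None  # coreport input register of Core C
    trace = []

    # Cycle 1, tile A: Core to East
    link_ab = word
    trace.append((1, "A: Core to East", link_ab))

    # Cycle 2, tile B: West to East
    link_bc, link_ab = link_ab, None
    trace.append((2, "B: West to East", link_bc))

    # Cycle 3, tile C: West to Core
    core_c_in, link_bc = link_bc, None
    trace.append((3, "C: West to Core", core_c_in))

    return core_c_in, trace


received, trace = run_stream(42)
for cycle, action, value in trace:
    print(f"cycle {cycle}: {action:16s} carries {value}")
```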

Static Scheduled Communications

• Creates system scalability by “eliminating” network congestion
• Many interconnect segments managed with time division multiplexing
  • Lots of bandwidth
• Improves SoC performance by up to a factor of 8

Power Consumption?

• Provide reconfiguration methods for cores and CI

• Develop programmable clocking systems at each tile

Power Aware Core

• Custom motion estimation core
• Choose search method
  • Full search: 960-600 mW (bit width and pel sub-sampling)
  • Spiral search: 76 mW
  • Three step search: 25 mW

Data taken with Synopsys™ Power Compiler at the RTL level
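One plausible reason the search methods differ so much in power is the number of candidate blocks each one evaluates. The sketch below is a textbook-style full search and three step search over a ±7 pel window that counts sum-of-absolute-differences (SAD) evaluations. It is not the authors' core: the 8x8 block size, the window, and the synthetic frames are assumptions. The spiral search is not shown.

```python
# Textbook-style block matching, written for illustration only; block size,
# search range, and the synthetic test frames are assumptions.

def sad(ref, cur, bx, by, dx, dy, n):
    """Sum of absolute differences between a current block and a shifted reference block."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(ref, cur, bx, by, n=8, r=7):
    """Evaluate every displacement in [-r, r] x [-r, r]."""
    best, evals = None, 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cost = sad(ref, cur, bx, by, dx, dy, n)
            evals += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1:], evals


def three_step_search(ref, cur, bx, by, n=8):
    """Evaluate a 3x3 pattern with step 4, then 2, then 1 around the best point so far."""
    cx, cy = 0, 0
    best_cost = sad(ref, cur, bx, by, 0, 0, n)
    evals = 1
    for step in (4, 2, 1):
        centre = (cx, cy)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # centre was already evaluated
                cand = (centre[0] + dx, centre[1] + dy)
                cost = sad(ref, cur, bx, by, cand[0], cand[1], n)
                evals += 1
                if cost < best_cost:
                    best_cost, cx, cy = cost, cand[0], cand[1]
    return (cx, cy), evals


SIZE = 32
ref = [[x + 64 * y for x in range(SIZE)] for y in range(SIZE)]


def clamp(v):
    return max(0, min(SIZE - 1, v))


# Current frame: the reference frame displaced by (dx, dy) = (3, -2).
cur = [[ref[clamp(y - 2)][clamp(x + 3)] for x in range(SIZE)] for y in range(SIZE)]

fs_vector, fs_evals = full_search(ref, cur, 12, 12)
tss_vector, tss_evals = three_step_search(ref, cur, 12, 12)
print("full search:", fs_vector, fs_evals, "SAD evaluations")
print("three step search:", tss_vector, tss_evals, "SAD evaluations")
```

The three step search visits far fewer candidates, but it is a heuristic and may settle on a different vector than the full search, as it does on this synthetic frame.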

aSoC Support

• Multiple streams in and out through dedicated coreports
  • Easy to manage on both sides of the port
• Schedule configuration streams in with the data
  • Stream A: Input Frame
  • Stream B: Configuration (choose search mode and size)
  • Stream C: Motion Vectors

[Figure: Motion Estimation core with coreports in1, in2, out1, and out2; labels: Stream A, Stream B, Stream C]

Reconfigurable Interconnect

• P-frame
• I-frame

[Figure: P-frame path with labels Input Frame, ME, MC, subtract (-), add (+), DCT; I-frame path with labels Input Frame, DCT]

aSoC Support

• Lumped ME, MC and Summation into one double core

[Figure: tiles labeled DCT and Motion Estimation & Compensation]

aSoC Support: P-Frame

[Figure: DCT core and Motion Estimation & Compensation core; labels: Input Frame (Stream A), Difference Frame (Stream B)]

aSoC Support: Schedule Change

[Figure sequence: DCT core and Motion Estimation & Compensation core, with an instruction memory holding Schedule 1 and Schedule 2 and a PC. Labels shown on each build:]

• Build 1: Input Frame (Stream A), Difference Frame (Stream B), Configuration Streams (C & D)
• Builds 2-3: Input Frame (Stream A), Difference Frame (Stream B), Configuration (Stream C); Schedule 1, Schedule 2, PC
• Build 4: Input Frame (Stream A), Configuration (Stream D); Schedule 1, Schedule 2, PC
• Build 5: Input Frame (Stream A’), Configuration (Stream D); Schedule 1, Schedule 2, PC

aSoC Support: I-Frame

[Figure: DCT core and Motion Estimation & Compensation core; labels: Input Frame (Stream A’), OFF]
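The schedule change between P-frame and I-frame operation can be pictured as two schedules stored in one instruction memory, with a configuration command redirecting the PC. The sketch below is hypothetical: the class `ScheduledTile`, the back-to-back memory layout, the command format, and the entry strings are assumptions for illustration only.

```python
# Hypothetical sketch: two schedules stored back to back in one instruction
# memory; a configuration command redirects the PC. Addresses, entry names,
# and the command format are assumptions for illustration.

class ScheduledTile:
    def __init__(self, schedules):
        self.memory = []  # flat instruction memory
        self.base = {}    # schedule name -> (start address, length)
        for name, entries in schedules.items():
            self.base[name] = (len(self.memory), len(entries))
            self.memory.extend(entries)
        self.active = next(iter(schedules))
        self.pc = self.base[self.active][0]

    def configure(self, name):
        """Configuration-stream command: jump the PC to another schedule."""
        self.active = name
        self.pc = self.base[name][0]

    def step(self):
        """Return the current entry, then advance the PC within the active schedule."""
        entry = self.memory[self.pc]
        start, length = self.base[self.active]
        self.pc = start + (self.pc - start + 1) % length
        return entry


tile = ScheduledTile({
    "p_frame": ["stream A -> ME&C", "stream B -> DCT"],
    "i_frame": ["stream A' -> DCT"],
})
p_trace = [tile.step() for _ in range(3)]
tile.configure("i_frame")  # for example, sent once per frame
i_trace = [tile.step() for _ in range(2)]
```

Switching schedules only moves the PC, so no instruction memory has to be rewritten at the frame boundary.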

Operating Frequency?

• Interconnect synchronized
  • H-tree clock distribution
• Core frequencies depend on critical path
  • Tile provides clock reference
  • Coreport provides asynchronous boundary
• Dynamic core configuration requires dynamic clock configuration
  • aSoC clock reference provides multiples of interconnect clock (… 4x, 2x, 1x, 0.5x, 0.25x, …)
  • Configured through the tile controller
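As a rough illustration of choosing a clock multiple, the sketch below picks the slowest power-of-two multiple of the interconnect clock that still meets a core's required frequency. The selection policy, the function name `clock_multiplier`, and the example frequencies are assumptions. The slides state only that power-of-two multiples are available and that the tile controller configures them.

```python
# Assumed policy: run each core at the slowest available multiple of the
# interconnect clock that still meets its required frequency.

def clock_multiplier(required_mhz, interconnect_mhz):
    """Return the smallest power-of-two multiple m with interconnect_mhz * m >= required_mhz."""
    if required_mhz <= 0 or interconnect_mhz <= 0:
        raise ValueError("frequencies must be positive")
    mult = 1.0
    # Speed up (2x, 4x, ...) until the requirement is met.
    while interconnect_mhz * mult < required_mhz:
        mult *= 2.0
    # Slow down (0.5x, 0.25x, ...) while the requirement is still met.
    while interconnect_mhz * (mult / 2.0) >= required_mhz:
        mult /= 2.0
    return mult


# Illustrative numbers only: a 10 MHz interconnect clock.
fast_core = clock_multiplier(25.0, 10.0)   # needs 25 MHz
slow_core = clock_multiplier(3.0, 10.0)    # needs 3 MHz
equal_core = clock_multiplier(10.0, 10.0)  # needs exactly the interconnect rate
```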

Mixed vs. Fixed Core Frequencies

• Cores not designed with clock gating
• Core power from Synopsys RTL simulation
• Interconnect from SPICE
• Assumes 10 cycle schedule, 4 pixels/word

                         Optimal independent frequencies    Fixed worst case, 105 MHz
Core: Mode               Frequency (MHz)    Power (mW)      Power (mW)
ME: Full Search          105                973             973
ME: Spiral               9.9                76              659
ME: Three Step Search    2.75               25              580
DCT                      9.6                54              349
Interconnect             6.34               0.14            0.81
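To see what the table implies at the system level, the figures can be summed per motion estimation mode. The power values below are copied from the table. Grouping one ME mode with the DCT and the interconnect into a single "system" is an assumption made for this sketch, not a configuration reported by the authors.

```python
# Values copied from the table above; grouping one ME mode with the DCT and
# interconnect into a single "system" is an assumption made for illustration.
POWER_MW = {
    # name: (power at optimal independent frequency, power at fixed 105 MHz)
    "ME: Full Search": (973.0, 973.0),
    "ME: Spiral": (76.0, 659.0),
    "ME: Three Step Search": (25.0, 580.0),
    "DCT": (54.0, 349.0),
    "Interconnect": (0.14, 0.81),
}


def system_power(me_mode):
    """Sum power for one ME mode plus DCT and interconnect under both clocking schemes."""
    parts = (me_mode, "DCT", "Interconnect")
    optimal = sum(POWER_MW[p][0] for p in parts)
    fixed = sum(POWER_MW[p][1] for p in parts)
    return optimal, fixed


for mode in ("ME: Full Search", "ME: Spiral", "ME: Three Step Search"):
    optimal, fixed = system_power(mode)
    print(f"{mode:22s} optimal {optimal:8.2f} mW  fixed {fixed:8.2f} mW  "
          f"ratio {fixed / optimal:.2f}x")
```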

Current Density and Clocking

• Red: fixed worst case clocking

• Short spikes of high current

• Green: optimal independent clocking

• Slow and low

• Optimal clocking eliminates current spikes (improved battery life)

[Figure: current versus time between Process Start and Deadline for ME: Full Search, ME: Spiral, ME: Three Step Search, and DCT]

Configuration Overhead

• Configuration adds up to 2 streams per tile
  • Only 2 required for data
• Total BW = 5 × T × N
  • 5 streams/(cycle, tile)
  • T tiles
  • N cycles in schedule
• Single tile can support up to 50 different streams in 10 cycle schedule

[Figure: DCT core; labels: Input Frame (Stream B), Transform Frame (Stream D), Configuration Streams]
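The bandwidth formula on this slide is simple enough to check directly. The sketch below encodes Total BW = 5 × T × N and reproduces the single-tile figure of 50 streams for a 10 cycle schedule. The function name and the 16-tile example are assumptions for illustration.

```python
# Stream-slot count from the slide's formula: 5 streams per cycle per tile.
STREAMS_PER_CYCLE_PER_TILE = 5


def total_stream_slots(tiles, cycles):
    """Total BW = 5 x T x N, counted in stream slots per schedule."""
    if tiles < 0 or cycles < 0:
        raise ValueError("tiles and cycles must be non-negative")
    return STREAMS_PER_CYCLE_PER_TILE * tiles * cycles


single_tile = total_stream_slots(tiles=1, cycles=10)   # the slide's example
array_total = total_stream_slots(tiles=16, cycles=10)  # hypothetical 16-tile array
```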

Configuration Power Overhead

• Configuration streams used infrequently
  • Once per macroblock or once per frame
• Architecture disables unused streams
  • Data valid bit already used for flow control

• Only 4-9% of interconnect power is due to configuration streams

Conclusion

• aSoC supports dynamic power management with reconfiguration
  • Cores
  • Interconnect
  • Clocks
• Low configuration overhead in both
  • Communication bandwidth
  • Power

Future Work

• Add reconfigurable voltage supplies at each tile

• Finish test chip
• Import larger applications

Questions

aSoC: adaptive System on a Chip

[Figure: aSoC tile array (DCT, VLE, Memory, Viterbi, FIR, Encrypt, Control, Motion Estimation and Compensation) with callouts for Cores, Interconnect, Interface, and Tile]

Partitioning

• Automated partitioning is a nontrivial problem
• For small signal processing systems, user-defined partitioning may be possible
• Key: Perfectly partitioning the system may not be possible
  • How can the SoC mitigate the penalty?