V8.8.2016
LambdaFabric for ML Acceleration
Modular ASIC Design Based on Interconnects, Memory, CPUs, and Peripherals
The KnuEdge Team
Dan Goldin | CEO
Martin Seyer | COO
Kate Dilligan | EVP
Doug Palmer | CTO
Steve Cumings | CMO
Kelly Smales | CFO
David Bradley | Chief Research
David Eames | Chief Engineer
Raghuram Tupuri | VP Engineering
Investors & Advisors available upon request
2016 © KnuEdge™ Proprietary - Protected by NDA | 2/17/2017
The Goal: Scaling
The Neurobiological Computer
Neurobiological Systems: Flexible and Scalable
Animal | # Neurons
Flatworm | 302
Medicinal leech | 10,000
Pond snail | 11,000
Sea slug | 18,000
Fruit fly | 100,000
Lobster | 100,000
Ant | 250,000
Honey bee | 960,000
Cockroach | 1,000,000
Frog | 16,000,000
Mouse | 75,000,000
Bat | 110,000,000
Octopus | 300,000,000
Human | 100,000,000,000
Elephant | 200,000,000,000
Current-generation machine learning capabilities vs. where KnuEdge wants to be in 2021: MindScale.
Take away: Scalable
MindScale Requirements

Property | Specification
Scatter/gather | > 1 peta transfers/s
N/S edge processing | 100 racks of standard x86 server processing
Power per LFE "rack" | 125 kW
Compactness | Comm paths < 50 m
Cost / ToC | $500M / $100M per year
Streams / rate | ~100 billion streams, average data rate ~1 KB/s
Storage | NVM, MicroBlade form factor
MicroBlade modules | 500K chips; heavy lifting via reconfigurable logic; 64 GB, 10 tera-ops, 1.2 TB/s I/O, ~20 pJ/op
Latency | < 250 ns PtP, fetch ~400 ns
Communications | Short: LVDS; medium: mm optical; long: DWDM; energy ~20 pJ/bit
Fault tolerance | 1 processor failure per hour
Scraping | Assumption: 90% of scraping performed at the edge
Vertical | Multi-story concept

Overriding rule: no sharp departures from existing technologies.
5.5D stacked architecture
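A quick back-of-envelope check ties several rows of the table together. The assumption that every chip runs at its full 10 tera-ops at ~20 pJ/op and every stream sustains its ~1 KB/s average rate is mine, not the deck's:

```python
# Back-of-envelope check of the MindScale targets from the table above.
# Assumption (not from the deck): all 500K chips run flat out, and every
# stream sustains its ~1 KB/s average rate.

CHIPS = 500_000
OPS_PER_CHIP = 10e12          # 10 tera-ops/s per chip
ENERGY_PER_OP = 20e-12        # ~20 pJ/op
STREAMS = 100e9               # ~100 billion streams
STREAM_RATE = 1e3             # ~1 KB/s per stream
RACK_POWER = 125e3            # 125 kW per LFE "rack"

aggregate_stream_bw = STREAMS * STREAM_RATE            # bytes/s
compute_power_per_chip = OPS_PER_CHIP * ENERGY_PER_OP  # watts
total_compute_power = CHIPS * compute_power_per_chip   # watts
racks_needed = total_compute_power / RACK_POWER

print(f"aggregate stream bandwidth: {aggregate_stream_bw / 1e12:.0f} TB/s")
print(f"compute power per chip:     {compute_power_per_chip:.0f} W")
print(f"total compute power:        {total_compute_power / 1e6:.0f} MW")
print(f"racks at 125 kW:            {racks_needed:.0f}")
```

Under those assumptions the stream aggregate alone is ~100 TB/s and the compute floor is hundreds of 125 kW racks, which is why the table pairs the power budget with the multi-story "vertical" concept.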
Problems with Existing Architectures for ML
Current Generation ASICs are a Problem
• Memory gets larger but not faster
• Logic gets faster but spends more time waiting for memory
• Logic gets more energy efficient but memory transport does not
Today’s processors: Time & Energy Dominated by Fetch
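A toy latency model makes the "dominated by fetch" claim concrete. The cycle time, DRAM latency, and miss rate below are illustrative assumptions, not figures from this deck:

```python
# Toy model of a fetch-dominated processor. All three constants are
# illustrative assumptions, chosen only to show the shape of the problem.
CYCLE_NS = 0.3         # ~3 GHz core, one instruction per cycle when fed
DRAM_LATENCY_NS = 100  # a plausible off-chip memory access time
MISS_RATE = 0.02       # fraction of instructions that miss all caches

avg_ns_per_inst = CYCLE_NS + MISS_RATE * DRAM_LATENCY_NS
fetch_fraction = (MISS_RATE * DRAM_LATENCY_NS) / avg_ns_per_inst

print(f"average ns/instruction: {avg_ns_per_inst:.2f}")
print(f"fraction of time waiting on memory: {fetch_fraction:.0%}")
```

Even a 2% miss rate leaves the core spending the large majority of its time waiting on memory, which is the gap the memory-proximate design on the following slides targets.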
LambdaFabric
Rethinking Computing: Silicon & Software
Taking Lessons From Nature
Structure of the Mammalian Nervous System
• Extremely complex and diverse
• Thousands or millions of neuron types
• Complex connection topology, like the internet
Take away: Heterogeneous and sparsely connected
Solution: Wetware to Silicon

Nature's Design Principles
• Neuron body contains "context"
• Communication via synapses
• Axonal connections define geometry

KnuEdge Design Principles
• Scalable heterogeneous parallelism
• Proximity of memory and processing
• Support for complex connections
• Communications-driven architecture
• All information is "pushed"
• Must operate with noise and failures
[Figure: a directed graph with nodes F0-F3 and edge weights W1,0, W0,2, W2,1, W3,1, W2,3, W0,3, shown alongside its adjacency-matrix view. Panel titles: Neurobiology | Random Access Operations | ASIC.]
Take away: Be flexible.
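The directed-graph model above can be sketched as a sparse adjacency structure with push-style propagation, matching the "all information is pushed" principle. The node names, weights, and update rule are illustrative assumptions, not the actual KnuEdge programming model:

```python
# Sketch of a sparse weighted directed graph (like the F0-F3 figure) with
# "push" semantics: each node pushes its activation along outgoing edges
# and receivers accumulate weighted contributions. Weights are made up.

weights = {
    0: {2: 0.5, 3: -0.2},   # loosely mirrors W0,2 and W0,3 in the figure
    1: {0: 0.8, 2: 0.3},
    2: {1: 1.0, 3: 0.4},
    3: {1: -0.6},
}

def push(activations, weights):
    """One push step: senders drive their outgoing edges; no node ever
    fetches, it only accumulates what arrives."""
    incoming = {n: 0.0 for n in activations}
    for src, edges in weights.items():
        for dst, w in edges.items():
            incoming[dst] += w * activations[src]
    return incoming

acts = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(push(acts, weights))  # only F2 and F3 receive input from F0
```

The sparse dict-of-dicts layout is the point: storage and traffic scale with the edges that exist, not with a dense all-to-all matrix.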
Two Rules of Thumb:
1. You can’t bring the desktop processor to the cloud …
… you must bring the cloud to the processor
2. 80% of machine learning is digital signal processing
Asynchronous Cloud Processor Data Plane
[Diagram: an L1 router fanning out to 32 L2 routers; > 8 GB of highly segmented memory; 16 MAC blocks; a supervisor root of trust; multiple memory blocks.]
tDSP Processor core:
• 256 registers
• 4 Packet I/O engines
• Single cycle sleep state
• HW Event synchronization
• Independent clock domains
High-speed Serial:
• 28.5 Gbps
• FEC
• 802.3 PHY
• 10GBASE-KR compatible
Layer 1 Router
• Ports: 49
• Rate: 8 GB/s/port
• Size: 2 mm²
• Latency: 7 ns
All communications via flit packets:
• Physical & virtual addressing
• Operation codes
Flit links:
• 64b/256b parallel
• Clock domain crossing
• Interposer or die crossing
Flit Communications
Advantages:
• Only a single modality of operation
• Can operate from registers instead of SRAM
• Superior for real-time transport
• "Cut-through" operations prioritize important traffic
• No bus contention
• All comms are bidirectional (a core can receive and transmit at the same time)
• Flit headers carry VLIW packet execution codes (POPs)
Normal packet queuing: the next packet must wait for the previous packet to drain from the queue, up to 60,000 clocks.
Flit packet queuing: the next packet waits only 4 clocks for the previous packet to leave the queue.

Packet format: OPC (opcode) | ADR (address) | SIZ (size) | POP (VLIW execution codes) | V | payload (0 to 32 QWDS)
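The named header fields can be sketched as a pack/unpack pair. The field widths, byte order, and ordering below are assumptions for illustration; the real Hermosa flit encoding is not given in this deck:

```python
# Hypothetical flit layout built from the fields named on this slide:
# OPC (opcode), ADR (address), SIZ (size), POP (VLIW execution code),
# and a payload of 0-32 quadwords. All widths/ordering are assumptions.
import struct

HEADER = struct.Struct("<BHBI")  # opc:8, adr:16, siz:8, pop:32 (illustrative)

def make_flit(opc, adr, siz, pop, payload_qwords):
    assert 0 <= len(payload_qwords) <= 32, "payload is 0 to 32 QWDS"
    assert siz == len(payload_qwords), "SIZ counts payload quadwords"
    return HEADER.pack(opc, adr, siz, pop) + struct.pack(
        f"<{len(payload_qwords)}Q", *payload_qwords)

def parse_flit(data):
    opc, adr, siz, pop = HEADER.unpack_from(data)
    payload = struct.unpack_from(f"<{siz}Q", data, HEADER.size)
    return opc, adr, siz, pop, list(payload)

flit = make_flit(opc=0x01, adr=0x2A, siz=2, pop=0xDEAD, payload_qwords=[7, 9])
print(parse_flit(flit))  # (1, 42, 2, 57005, [7, 9])
```

Carrying the execution code (POP) in the header is what lets a receiver act on a flit as it cuts through, without first spooling a long packet into SRAM.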
Flexible Communication Topology
• LambdaFabric network is of arbitrary topology
• ARTs in the routers direct packets
• Guaranteed C2C delivery
• Multi-dimensional grids eliminate hops
• Automatic topology discovery
• Up to 500,000 devices in a block
Linked Hermosa chips create a LambdaFabric
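Why multi-dimensional grids cut hop counts: the diameter (worst-case hop distance) of an n-node mesh shrinks sharply as dimensionality grows. A minimal sketch, assuming plain square/cubic meshes rather than whatever topology a real deployment chooses:

```python
# Diameter of a dims-dimensional mesh with ~n_nodes nodes. Illustrative:
# the slide says the LambdaFabric topology is arbitrary; a mesh is just
# the simplest case that shows the effect of added dimensions.

def grid_diameter(n_nodes, dims):
    """Worst-case hop count between corners of a dims-D mesh."""
    side = round(n_nodes ** (1 / dims))  # nodes per edge
    return dims * (side - 1)

N = 500_000  # "up to 500,000 devices in a block"
for d in (1, 2, 3):
    print(f"{d}D mesh of ~{N} nodes: diameter ~ {grid_diameter(N, d)} hops")
```

Going from a 1D chain to a 3D mesh collapses the worst case from hundreds of thousands of hops to a few hundred, which is what makes a 500,000-device block reachable at low latency.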
Performance: Image Segmentation
• Image segmentation using k-means clustering
• Object/boundary detection
• The above image is a frame from a processed video.
• Each outlined region is a superpixel which is determined by image
segmentation using k-means clustering.
• The superpixels clearly show the boundaries of the various objects.
Source: http://www.robots.ox.ac.uk/~victor/gslicr/index.html
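The clustering step behind those superpixels can be sketched with a minimal k-means on pixel colors. This pure-Python toy is illustrative only; it is not the gSLICr GPU implementation at the linked URL:

```python
# Minimal k-means color clustering, the core step of the segmentation
# described above. Deterministic "first k pixels" init keeps the toy
# reproducible; real implementations use smarter seeding.

def kmeans(pixels, k, iters=10):
    centers = list(pixels[:k])  # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # assign each pixel to its nearest center (squared RGB distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# Two obvious color groups: near-black and near-white pixels.
pixels = [(10, 12, 9), (250, 248, 252), (8, 11, 14), (247, 251, 249)]
print(kmeans(pixels, k=2))  # one dark center, one bright center
```

Per-pixel nearest-center assignment is embarrassingly parallel, which is why this workload maps well onto a fabric of many small cores.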
LambdaFabric™: Scalable Across Resources
Cluster → Chip → Board → Rack → Multi-Rack
• Scale-invariant network architecture
• Low latency to everywhere
• High bandwidth
• Multi-dimensional connectivity
• Scalable up to 512,000 chips
Latency across scales: 6 ns | 62 ns | 247 ns | 437 ns
Low-Latency, High-Throughput, Low-Power Computing Fabric
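One hedged way to read the latency ladder uses the 7 ns L1 router latency from the earlier slide. The mapping of the four numbers to fabric levels and the neglect of wire flight time are my assumptions:

```python
# Rough reading of the latency ladder: divide each figure by the 7 ns
# L1 router latency quoted earlier to estimate router traversals.
# The level labels are assumed; wire and serialization time are ignored.
ROUTER_NS = 7
for level, ns in [("cluster", 6), ("chip", 62), ("board", 247), ("rack", 437)]:
    print(f"{level}: {ns} ns ≈ {ns / ROUTER_NS:.1f} router traversals")
```

Under that reading, even rack scale stays within a few dozen router hops, consistent with the "low latency to everywhere" bullet.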
[Diagram: a compute cluster of eight tDSP cores (tDSP 0-7) plus an AIP, sharing 2 MB of memory.]
LambdaFabric: Concluding Remarks
• A computing “cloud” consisting of CPUs, memory, and communications
• Essentially a Cloud-on-a-Chip that scales to 512K chips
• Packet/router based scalable architecture
• Optimized for scatter/gather (random) operations
• A flat-model (layers 2,3) design
• Highly-efficient, low-latency communications
• “Push architecture” like data-flow
• Asynchronous design
• Hierarchical control plane
Optimized for heterogeneous recurrent networks,
DSP, CFD, and general graph processing.
Scaling 1:10⁶