V8.8.2016
LambdaFabric for ML Acceleration
Modular ASIC Design Based on Interconnects, Memory, CPUs, and Peripherals
The KnuEdge Team
Dan Goldin | CEO
Martin Seyer | COO
Kate Dilligan | EVP
Doug Palmer | CTO
Steve Cumings | CMO
Kelly Smales | CFO
David Bradley | Chief Research
David Eames | Chief Engineer
Raghuram Tupuri | VP Engineering
Investors & Advisors available upon request
2016 © KnuEdge™ Proprietary - Protected by NDA | 2/17/2017
The Goal: Scaling
The Neurobiological Computer
Neurobiological Systems: Flexible and Scalable
Animal | # Neurons
Flatworm | 302
Medicinal leech | 10,000
Pond snail | 11,000
Sea slug | 18,000
Fruit fly | 100,000
Lobster | 100,000
Ant | 250,000
Honey bee | 960,000
Cockroach | 1,000,000
Frog | 16,000,000
Mouse | 75,000,000
Bat | 110,000,000
Octopus | 300,000,000
Human | 100,000,000,000
Elephant | 200,000,000,000
Current-generation machine learning capabilities vs. where KnuEdge wants to be in 2021: MindScale.
Take away: Scalable
MindScale Requirements

Property | Specification
Scatter/gather | > 1 peta transfers/s
N/S edge processing | 100 racks of standard x86 server processing
Power per LFE "rack" | 125 kW
Compactness | Comm paths < 50 m
Cost / ToC | $500M / $100M per year
Streams / rate | ~100 billion streams, average data rate ~1 KB/s
Storage | NVM, MicroBlade form factor
MicroBlade modules | 500K chips; heavy lifting via reconfigurable logic; 64 GB, 10 tera-ops, 1.2 TB/s I/O, ~20 pJ/op
Latency | < 250 ns PtP, fetch ~400 ns
Communications | Short: LVDS; medium: mm optical; long: DWDM; energy ~20 pJ/bit
Fault tolerance | 1 processor failure per hour
Scraping | Assumption: 90% of scraping performed at the edge
Vertical | Multi-story concept

Overriding rule: no sharp departures from existing technologies.
5.5D stacked architecture
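A quick back-of-envelope check ties several rows of the table together. The assumption that every chip runs at its full 10 tera-ops at ~20 pJ/op and every stream sustains its ~1 KB/s average rate is mine, not the deck's:

```python
# Back-of-envelope check of the MindScale targets from the table above.
# Assumption (not from the deck): all 500K chips run flat out, and every
# stream sustains its ~1 KB/s average rate.

CHIPS = 500_000
OPS_PER_CHIP = 10e12          # 10 tera-ops/s per chip
ENERGY_PER_OP = 20e-12        # ~20 pJ/op
STREAMS = 100e9               # ~100 billion streams
STREAM_RATE = 1e3             # ~1 KB/s per stream
RACK_POWER = 125e3            # 125 kW per LFE "rack"

aggregate_stream_bw = STREAMS * STREAM_RATE            # bytes/s
compute_power_per_chip = OPS_PER_CHIP * ENERGY_PER_OP  # watts
total_compute_power = CHIPS * compute_power_per_chip   # watts
racks_needed = total_compute_power / RACK_POWER

print(f"aggregate stream bandwidth: {aggregate_stream_bw / 1e12:.0f} TB/s")
print(f"compute power per chip:     {compute_power_per_chip:.0f} W")
print(f"total compute power:        {total_compute_power / 1e6:.0f} MW")
print(f"racks at 125 kW:            {racks_needed:.0f}")
```

Under those assumptions the stream aggregate alone is ~100 TB/s and the compute floor is hundreds of 125 kW racks, which is why the table pairs the power budget with the multi-story "vertical" concept.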
Problems with Existing Architectures for ML
Current Generation ASICs are a Problem
• Memory gets larger but not faster
• Logic gets faster but spends more time waiting for memory
• Logic gets more energy efficient but memory transport does not
Today’s processors: Time & Energy Dominated by Fetch
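A toy latency model makes the "dominated by fetch" claim concrete. The cycle time, DRAM latency, and miss rate below are illustrative assumptions, not figures from this deck:

```python
# Toy model of a fetch-dominated processor. All three constants are
# illustrative assumptions, chosen only to show the shape of the problem.
CYCLE_NS = 0.3         # ~3 GHz core, one instruction per cycle when fed
DRAM_LATENCY_NS = 100  # a plausible off-chip memory access time
MISS_RATE = 0.02       # fraction of instructions that miss all caches

avg_ns_per_inst = CYCLE_NS + MISS_RATE * DRAM_LATENCY_NS
fetch_fraction = (MISS_RATE * DRAM_LATENCY_NS) / avg_ns_per_inst

print(f"average ns/instruction: {avg_ns_per_inst:.2f}")
print(f"fraction of time waiting on memory: {fetch_fraction:.0%}")
```

Even a 2% miss rate leaves the core spending the large majority of its time waiting on memory, which is the gap the memory-proximate design on the following slides targets.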
LambdaFabric
Rethinking Computing: Silicon & Software
Taking Lessons From Nature
Structure of the Mammalian Nervous System
• Extremely complex and diverse
• Thousands or millions of neuron types
• Complex connection topology, like the internet
Take away: Heterogeneous and sparsely connected
Solution: Wetware to Silicon

Nature's Design Principles
• Neuron body contains "context"
• Communication via synapses
• Axonal connections define geometry

KnuEdge Design Principles
• Scalable heterogeneous parallelism
• Proximity of memory and processing
• Support for complex connections
• Communications-driven architecture
• All information is "pushed"
• Must operate with noise and failures
[Figure: a directed graph with nodes F0-F3 and edge weights W1,0, W0,2, W2,1, W3,1, W2,3, W0,3, shown alongside its adjacency-matrix view. Panel titles: Neurobiology | Random Access Operations | ASIC.]
Take away: Be flexible.
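The directed-graph model above can be sketched as a sparse adjacency structure with push-style propagation, matching the "all information is pushed" principle. The node names, weights, and update rule are illustrative assumptions, not the actual KnuEdge programming model:

```python
# Sketch of a sparse weighted directed graph (like the F0-F3 figure) with
# "push" semantics: each node pushes its activation along outgoing edges
# and receivers accumulate weighted contributions. Weights are made up.

weights = {
    0: {2: 0.5, 3: -0.2},   # loosely mirrors W0,2 and W0,3 in the figure
    1: {0: 0.8, 2: 0.3},
    2: {1: 1.0, 3: 0.4},
    3: {1: -0.6},
}

def push(activations, weights):
    """One push step: senders drive their outgoing edges; no node ever
    fetches, it only accumulates what arrives."""
    incoming = {n: 0.0 for n in activations}
    for src, edges in weights.items():
        for dst, w in edges.items():
            incoming[dst] += w * activations[src]
    return incoming

acts = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
print(push(acts, weights))  # only F2 and F3 receive input from F0
```

The sparse dict-of-dicts layout is the point: storage and traffic scale with the edges that exist, not with a dense all-to-all matrix.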
Two Rules of Thumb:
1. You can’t bring the desktop processor to the cloud …
… you must bring the cloud to the processor
2. 80% of machine learning is digital signal processing
Asynchronous Cloud Processor Data Plane
[Diagram: an L1 router fanning out to 32 L2 routers; > 8 GB of highly segmented memory; 16 MAC blocks; a supervisor root of trust; multiple memory blocks.]
tDSP Processor core:
• 256 registers
• 4 Packet I/O engines
• Single cycle sleep state
• HW Event synchronization
• Independent clock domains
High-speed Serial:
• 28.5 Gbps
• FEC
• 802.3 PHY
• 10GBASE-KR compatible
Layer 1 Router
• Ports: 49
• Rate: 8 GB/s/port
• Size: 2 mm²
• Latency: 7 ns
All communications via flit packets:
• Physical & virtual addressing
• Operation codes
Flit links:
• 64b/256b parallel
• Clock domain crossing
• Interposer or die crossing
Flit Communications
Advantages:
• Only a single modality of operation
• Can operate from registers instead of SRAM
• Superior for real-time transport
• "Cut-through" operations prioritize important traffic
• No bus contention
• All comms are bidirectional (a core can receive and transmit at the same time)
• Flit headers carry VLIW packet execution codes (POPs)
Normal packet queuing: the next packet must wait for the previous packet to drain from the queue, up to 60,000 clocks.
Flit packet queuing: the next packet waits only 4 clocks for the previous packet to leave the queue.

Packet format: OPC (opcode) | ADR (address) | SIZ (size) | POP (VLIW execution codes) | V | payload (0 to 32 QWDS)
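The named header fields can be sketched as a pack/unpack pair. The field widths, byte order, and ordering below are assumptions for illustration; the real Hermosa flit encoding is not given in this deck:

```python
# Hypothetical flit layout built from the fields named on this slide:
# OPC (opcode), ADR (address), SIZ (size), POP (VLIW execution code),
# and a payload of 0-32 quadwords. All widths/ordering are assumptions.
import struct

HEADER = struct.Struct("<BHBI")  # opc:8, adr:16, siz:8, pop:32 (illustrative)

def make_flit(opc, adr, siz, pop, payload_qwords):
    assert 0 <= len(payload_qwords) <= 32, "payload is 0 to 32 QWDS"
    assert siz == len(payload_qwords), "SIZ counts payload quadwords"
    return HEADER.pack(opc, adr, siz, pop) + struct.pack(
        f"<{len(payload_qwords)}Q", *payload_qwords)

def parse_flit(data):
    opc, adr, siz, pop = HEADER.unpack_from(data)
    payload = struct.unpack_from(f"<{siz}Q", data, HEADER.size)
    return opc, adr, siz, pop, list(payload)

flit = make_flit(opc=0x01, adr=0x2A, siz=2, pop=0xDEAD, payload_qwords=[7, 9])
print(parse_flit(flit))  # (1, 42, 2, 57005, [7, 9])
```

Carrying the execution code (POP) in the header is what lets a receiver act on a flit as it cuts through, without first spooling a long packet into SRAM.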
Flexible Communication Topology
• LambdaFabric network is of arbitrary topology
• ARTs in the routers direct packets
• Guaranteed C2C delivery
• Multi-dimensional grids eliminate hops
• Automatic topology discovery
• Up to 500,000 devices in a block
Linked Hermosa chips create a LambdaFabric
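Why multi-dimensional grids cut hop counts: the diameter (worst-case hop distance) of an n-node mesh shrinks sharply as dimensionality grows. A minimal sketch, assuming plain square/cubic meshes rather than whatever topology a real deployment chooses:

```python
# Diameter of a dims-dimensional mesh with ~n_nodes nodes. Illustrative:
# the slide says the LambdaFabric topology is arbitrary; a mesh is just
# the simplest case that shows the effect of added dimensions.

def grid_diameter(n_nodes, dims):
    """Worst-case hop count between corners of a dims-D mesh."""
    side = round(n_nodes ** (1 / dims))  # nodes per edge
    return dims * (side - 1)

N = 500_000  # "up to 500,000 devices in a block"
for d in (1, 2, 3):
    print(f"{d}D mesh of ~{N} nodes: diameter ~ {grid_diameter(N, d)} hops")
```

Going from a 1D chain to a 3D mesh collapses the worst case from hundreds of thousands of hops to a few hundred, which is what makes a 500,000-device block reachable at low latency.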
Performance: Image Segmentation
• Image segmentation using k-means clustering
• Object/boundary detection
• The above image is a frame from a processed video.
• Each outlined region is a superpixel which is determined by image
segmentation using k-means clustering.
• The superpixels clearly show the boundaries of the various objects.
Source: http://www.robots.ox.ac.uk/~victor/gslicr/index.html
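The clustering step behind those superpixels can be sketched with a minimal k-means on pixel colors. This pure-Python toy is illustrative only; it is not the gSLICr GPU implementation at the linked URL:

```python
# Minimal k-means color clustering, the core step of the segmentation
# described above. Deterministic "first k pixels" init keeps the toy
# reproducible; real implementations use smarter seeding.

def kmeans(pixels, k, iters=10):
    centers = list(pixels[:k])  # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # assign each pixel to its nearest center (squared RGB distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # move each center to the mean of its cluster
        centers = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# Two obvious color groups: near-black and near-white pixels.
pixels = [(10, 12, 9), (250, 248, 252), (8, 11, 14), (247, 251, 249)]
print(kmeans(pixels, k=2))  # one dark center, one bright center
```

Per-pixel nearest-center assignment is embarrassingly parallel, which is why this workload maps well onto a fabric of many small cores.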
LambdaFabric™: Scalable Across Resources
Cluster → Chip → Board → Rack → Multi-Rack
• Scale-invariant network architecture
• Low latency to everywhere
• High bandwidth
• Multi-dimensional connectivity
• Scalable up to 512,000 chips
Latency across scales: 6 ns | 62 ns | 247 ns | 437 ns
Low-Latency, High-Throughput, Low-Power Computing Fabric
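One hedged way to read the latency ladder uses the 7 ns L1 router latency from the earlier slide. The mapping of the four numbers to fabric levels and the neglect of wire flight time are my assumptions:

```python
# Rough reading of the latency ladder: divide each figure by the 7 ns
# L1 router latency quoted earlier to estimate router traversals.
# The level labels are assumed; wire and serialization time are ignored.
ROUTER_NS = 7
for level, ns in [("cluster", 6), ("chip", 62), ("board", 247), ("rack", 437)]:
    print(f"{level}: {ns} ns ≈ {ns / ROUTER_NS:.1f} router traversals")
```

Under that reading, even rack scale stays within a few dozen router hops, consistent with the "low latency to everywhere" bullet.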
[Diagram: a compute cluster of eight tDSP cores (tDSP 0-7) plus an AIP, sharing 2 MB of memory.]
LambdaFabric: Concluding Remarks
• A computing “cloud” consisting of CPUs, memory, and communications
• Essentially a Cloud-on-a-Chip that scales to 512K chips
• Packet/router based scalable architecture
• Optimized for scatter/gather (random) operations
• A flat-model (layers 2,3) design
• Highly-efficient, low-latency communications
• “Push architecture” like data-flow
• Asynchronous design
• Hierarchical control plane
Optimized for heterogeneous recurrent networks,
DSP, CFD, and general graph processing.
Scaling 1:10⁶