Post on 13-Apr-2017
transcript
V8.8.2016
Enabling Computation for Machine Learning Algorithms Inspired by Neurobiology
Taking Lessons From Nature
Presentation to CSRC Colloquium, SDSU
January 27, 2017
Doug Bergman, PhD, Staff Scientist / Mathemaperson
KnuEdge
Contents
• Neurobiological inspiration for machine learning algorithms
• Problems with existing machine learning tools and hardware
• Design principles
• Hermosa processor design
• Developer tools and Programming
• Events
© KnuEdge™ 2016. All Rights Reserved. 2
The Neurobiological Computer
Neurobiological Systems: Flexible and Scalable
Animal  # Neurons  (10^x scale)
Roundworm 302 (2)
Medicinal leech 10,000 (4)
Pond snail 11,000 (4)
Sea slug 18,000 (4)
Lobster 100,000 (5)
Fruit Fly 250,000 (5)
Ant 250,000 (5)
Honey bee 960,000 (6)
Cockroach 1,000,000 (6)
Frog 16,000,000 (7)
Mouse 71,000,000 (8)
Finch 131,000,000 (8)
Octopus 500,000,000 (9)
Human 100,000,000,000 (11)
Elephant 200,000,000,000 (11)
Current generation machine learning capabilities
Where KnuEdge wants to be in 2021: MindScale.
Source: Wikipedia
https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
BatVision
• SONAR or optical “blind” vision
• 25,000 bats in one cave can avoid collisions
• Can tell a moth from a fly at 20 m
• Outgoing pulses
– Frequency: up to 120 kHz, f ~ 60% (decay)
– Pulse width: 2 msec
– Pulse repetition rate of ~130 Hz, increasing with proximity
– Sound amplitude of 130 decibels (as loud as a jet plane)
• Incoming reception
– Time resolution of 2 μsec
• Signal processing
– ~10 million neurons
Structure of the Mammalian Nervous System
• Extremely complex and diverse
• Thousands or millions of neuron types
• Complex topology of connections, like the internet
Take away: Heterogeneous and sparsely connected
Natural Pattern Recognition
• Recognition is immediate and automatic
• Only a few unlabeled samples needed
• Inspiring development of new machine learning models
• Perfected through a billion years of natural selection
• Heterogeneity of neuron types
• Sparse, “plastic” evolving connections
Artificial Pattern Recognition: Neural Networks
• Lots of costly computation required
• Hundreds or thousands of labeled samples to train
• Largely variations on homogeneous Multilayered Perceptron models
• Relatively easy to develop
– Import library into script, train with data, run
• Build models without sophisticated theory; choose parameters arbitrarily
“Throw mud on wall, see what sticks”
This has been the thrust of development, in part because these single-instruction, multiple-data (SIMD) models map well to existing hardware, e.g., GPUs.
Nature-Inspired Learning Models
• Heterogeneity of neuron types
• Hebbian Learning: neurons interconnect sparsely after repeated sympathetic firing
• Plasticity and pruning: connections may change over time
• Recurrent models
• Sequential “Deep Learning” models allow for training on unlabeled or partially-labeled data
This field has been slower to develop, due in part to the unavailability of Multiple-Instruction, Multiple-Data (MIMD) processing.
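The Hebbian learning and pruning bullets above can be illustrated with a minimal sketch. It is not the speaker's model: the update rule (co-firing strengthens a connection, with a small decay), the learning rate, the pruning threshold, and the firing patterns are all assumptions chosen for illustration.

```python
def hebbian_step(weights, activity, rate=0.1, decay=0.01):
    """One Hebbian update: strengthen w[i][j] when units i and j fire together."""
    n = len(activity)
    return [
        [
            (1.0 - decay) * weights[i][j]
            + (rate * activity[i] * activity[j] if i != j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def prune(weights, threshold):
    """Remove weak connections so the network stays sparse."""
    return [[w if w >= threshold else 0.0 for w in row] for row in weights]


# Hypothetical firing patterns: units 0 and 1 fire together, as do units 2 and 3.
patterns = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

weights = [[0.0] * 4 for _ in range(4)]
for _ in range(20):
    for pattern in patterns:
        weights = hebbian_step(weights, pattern)
weights = prune(weights, threshold=0.5)

# Only the pairs that fired together remain connected.
connections = [(i, j) for i in range(4) for j in range(4) if weights[i][j] > 0.0]
```

After training, `connections` holds only the links between units that repeatedly fired together; every other entry stays at zero, which is the sparse connectivity the slide describes.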
Raudies, Zilli, Hasselmo
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093250
Piekniewski, Laurent, Petre, Richert, Fisher, Hylton
https://arxiv.org/pdf/1607.06854v3.pdf
Problem: Software
• Brittleness/Lack of Security
• Multiple modalities of communications
• Poor scatter/gather communications
• Power consumption
• Ease of use
• Lack of skilled manpower
[Figure: Increasing Complexity of Applications (Lines of Code). Values shown: 24 million, 1.8 million, 90 million, 2.0 million]
Answer: Learning, based on neurobiological principles, could be preferable to programming.
Current Generation Chips are a Problem
• Memory gets larger but not faster
• Logic gets faster but spends more time waiting for memory
• Logic gets more energy efficient but memory transport does not
Today’s processors: Time & Energy Dominated by Fetch
Hardware Solution: Wetware to Silicon
Nature’s Design Principles
• Neuron body contains “context”
• Communication via synapses
• Axonal connections define geometry
Intellisis Design Goals
• Scalable heterogeneous parallelism
• Proximity of memory and processing
• Support for complex connections
• Communications driven architecture
• All information is “pushed”
• Must operate with noise and failures
[Figure: panels labeled Neurobiology, Directed Graph, Random Access Operations, and ASIC. The directed graph has nodes F0 through F3 and weighted edges W1,0, W0,2, W2,1, W3,1, W2,3, and W0,3.]
Take away: Be flexible.
Nodal Processing
• Data flow from inputs (raw data) to outputs (calculated results)
• Finite-state processing: “Lambda” push model
• Parallel processing kernels occur independent of other activities and flows
• Nodes can be subnetworks themselves
• New nodes can be inserted at any time
Fig 5. Heterogeneous network with heterogeneous nodes and subnets.
[Figure: lettered nodes grouped into subnets (Net A, Net B, Net C, Net D, Net X) between Inputs and Outputs, forming a processing pipeline and a classifier]
Hermosa Processing
[Figure: Hermosa Processing Unit Model. Inputs (connections) i = 0, 1, 2, 3, ..., n-1 feed a processing element (neuron) fj, which produces outputs (activations).]
[Figure: Hermosa Connection Processing, with a router between elements labeled J and k]
Node Processing
• Each node accepts n inputs. When all n inputs are processed, a single output is generated.
• Specially designed for coupled differential equation solving
Edge Processing
• There are many more software connections than hardware connections.
• A router performs the task of connecting neurons.
[Figure: directed graph with vertices F1 through F8]
Directed Graph Representation of Processing
Each line is called an “edge” and each circle is called a “vertex”
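The node behavior described on this slide (a node accepts n inputs and generates a single output once all n have been processed) can be sketched as a small push-driven object. This is an illustrative sketch only: the class name `Node`, the weighted-sum kernel, and the rectifier activation are assumptions and do not describe the Hermosa hardware.

```python
class Node:
    """Collects pushed inputs and emits one output once all n have arrived."""

    def __init__(self, weights, activation):
        self.weights = dict(weights)   # input index -> connection weight
        self.activation = activation
        self.pending = {}              # inputs received in the current round

    def push(self, index, value):
        """Deliver one input; return the output when the last input arrives, else None."""
        if index not in self.weights:
            raise KeyError(f"unknown input {index}")
        self.pending[index] = value
        if len(self.pending) < len(self.weights):
            return None                # still waiting for other inputs
        total = sum(self.weights[i] * v for i, v in self.pending.items())
        self.pending = {}              # reset state for the next round
        return self.activation(total)


# Hypothetical node with three inputs; inputs may arrive in any order.
node = Node({0: 0.5, 1: -1.0, 2: 2.0}, activation=lambda x: max(0.0, x))
outputs = [node.push(2, 1.0), node.push(0, 4.0), node.push(1, 1.5)]
```

The first two pushes return `None`; only the third, which completes the set of inputs, produces an output.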
Lambda Architecture Model
• The fundamental architecture block is a cluster, a physical analog of a network node or subgraph
• Arbitrarily scalable and nestable
• Each cluster contains data storage and specialized processors
– Co-location minimizes latency and energy consumption
• Processes are pipelined by message passing among clusters
• Clusters retain finite state information; are free to perform other tasks while waiting for responses
• Internal handling of processing within clusters minimizes traffic congestion
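The pipelining-by-message-passing idea above can be sketched in software. The following is a toy model under stated assumptions: the `Cluster` class, its inbox, and the round-robin scheduler are hypothetical and only mimic the pattern of clusters that hold local state, push results downstream, and never block waiting for a reply.

```python
from collections import deque


class Cluster:
    """A cluster with local state, a kernel, and an inbox of pushed messages."""

    def __init__(self, name, kernel, downstream=None):
        self.name = name
        self.kernel = kernel
        self.downstream = downstream
        self.inbox = deque()
        self.state = {"handled": 0}    # finite state retained between messages
        self.results = []

    def step(self):
        """Handle one pending message, if any; return True when work was done."""
        if not self.inbox:
            return False               # nothing to do, free for other tasks
        value = self.kernel(self.inbox.popleft())
        self.state["handled"] += 1
        if self.downstream is not None:
            self.downstream.inbox.append(value)   # push, never wait for a reply
        else:
            self.results.append(value)
        return True


# Hypothetical three-stage pipeline: add one, square, subtract one.
sink = Cluster("cluster2", kernel=lambda x: x - 1)
middle = Cluster("cluster1", kernel=lambda x: x * x, downstream=sink)
source = Cluster("cluster0", kernel=lambda x: x + 1, downstream=middle)
source.inbox.extend([1, 2, 3])

clusters = [source, middle, sink]
while any([cluster.step() for cluster in clusters]):
    pass                               # keep stepping until every inbox is empty
```

Each cluster touches only its own inbox and state, so the stages overlap naturally: while `sink` finishes the first value, `source` is already working on a later one.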
[Figure: Cluster 0, Cluster 1, Cluster 2, Cluster 3]
Another application: Graph Analytics
In addition to Machine Learning, the Hermosa processor and Lambda fabric architecture will handle graph processing elegantly, using data in compressed format
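The slide's figure labels mention vertex pointers and edge successor data, which suggests a compressed adjacency layout. The sketch below shows one common layout of that kind (compressed sparse row); whether Hermosa uses exactly this format is an assumption, and the function names and example edges are hypothetical.

```python
def to_compressed(num_vertices, edges):
    """Build vertex pointers and a packed successor array from (src, dst) pairs."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    pointers = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        pointers[v + 1] = pointers[v] + counts[v]
    successors = [0] * len(edges)
    cursor = pointers[:-1].copy()      # next free slot for each vertex
    for src, dst in edges:
        successors[cursor[src]] = dst
        cursor[src] += 1
    return pointers, successors


def successors_of(pointers, successors, v):
    """Successors of v are the slice between two adjacent vertex pointers."""
    return successors[pointers[v]:pointers[v + 1]]


# Hypothetical four-vertex graph.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)]
pointers, successors = to_compressed(4, edges)
out_of_zero = successors_of(pointers, successors, 0)
```

The pointer array needs one entry per vertex plus one, and the successor array one entry per edge, so storage grows with the number of edges rather than with the square of the number of vertices.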
[Figure: N memory clusters (Memory Cluster 0 through Memory Cluster N-1) connected by a router that carries packets. Each cluster has a Manager and Workers 0 through 6, with vertex pointers and working data in local memory and edge successor data in auxiliary memory.]
Graph Finite-State Operations
• Graph Partitions = Data Objects = Memory Clusters
• Partitions have local states that track states of primitive functions
Example: Find Minimum Distance from Vertex 0 to Vertex 3
$D^{\min}_{0,3} = \min\left(D^{\min}_{0,1} + D_{1,3},\; D^{\min}_{0,2} + D_{2,3}\right)$
[Figure: graph from vertex 0 to vertex 3 through vertices 1 and 2, with edges labeled D^min_{0,1}, D^min_{0,2}, D_{1,3}, and D_{2,3}]
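The minimum-distance example can be read as repeated relaxation: each vertex keeps a local "best distance so far" state and updates it whenever a shorter route is pushed to it. The sketch below is a plain Bellman-Ford style loop, not the Hermosa implementation; the edge weights are invented for illustration, and it assumes non-negative distances so that the loop terminates.

```python
import math


def min_distances(num_vertices, edges, source):
    """Relax edges until no tracked minimum changes (Bellman-Ford style)."""
    best = [math.inf] * num_vertices   # local state: best known distance per vertex
    best[source] = 0.0
    changed = True
    while changed:
        changed = False
        for (u, v), distance in edges.items():
            if best[u] + distance < best[v]:
                best[v] = best[u] + distance
                changed = True
    return best


# Hypothetical distances for the four-vertex example graph.
edges = {(0, 1): 2.0, (0, 2): 5.0, (1, 3): 6.0, (2, 3): 1.0}
best = min_distances(4, edges, source=0)
```

With these weights, the distance to vertex 3 is the smaller of the route through vertex 1 (2 + 6) and the route through vertex 2 (5 + 1).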
Asynchronous Cloud Processor Data Plane
[Figure: data plane block diagram with an L1 router, a row of L2 blocks, more than 8 Gbytes of highly segmented memory in multiple memory blocks, a row of MAC blocks, and a Supervisor root of trust]
tDSP Processor core:
• 256 registers
• 4 Packet I/O engines
• Single cycle sleep state
• HW Event synchronization
• Independent clock domains
High-speed Serial:
• 28.5 Gbps
• FEC
• 802.3 PHY
• 10GBASE-KR compatible
Layer 1 Router
• Ports: 49
• Rate: 8 GB/s/port
• Size: 2 mm²
• Latency: 7 ns
All communications via flit packets:
• Physical & virtual addressing
• Operation codes
Flit links:
• 64b/256b parallel
• Clock domain crossing
• Interposer or die crossing
“Hermosa” ASIC
[Figure: chip layout with a 24.6 mm dimension marked, showing Clusters, Central Router, and SERDES]
LambdaFabric™: Scalable Across Resources
Cluster Chip Board Rack Multi-Rack
Scale invariant network architecture
Low latency to everywhere
High bandwidth
Multi-dimensional connectivity
Scalable up to 512,000 chips
6ns latency 62ns 247ns 437ns
Low-Latency, High-Throughput, Low-Power Computing Fabric
[Figure: cluster with eight tDSPs (tDSP 0 through tDSP 7), an AIP, and 2 MB of memory]
“Tiny DSP” Tuned to Stream Operations
Cluster and tDSP Processor
• 256 registers (128 GP and 128 special)
• 1 GHz clock (gated)
• Harvard program/data memory separation
• Fixed 32 bit instruction word.
• Storage merged into shared register file
• Built-in synch w/ event flags
• Scatter/gather engines
• Three states: Sleep, run, or single step
• Single instruction packet launch
• 2K instruction store
• Multiple state machines EPIC-style
• No interrupts
• Everything is addressed including registers
Instruction format fields: Opcode, Op1, Op2, Op3
Fits in about 4 microns², about the size of a neuron.
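A fixed 32-bit instruction word with an opcode and three operands, on a machine with 256 addressable registers, is consistent with four 8-bit fields. The slide does not give the field widths or their order, so the layout below (opcode in the top byte, then Op1, Op2, Op3) is purely an assumption used to illustrate how such a word could be packed and unpacked.

```python
FIELD_BITS = 8                         # assumed: 256 registers -> 8 bits per field
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode(opcode, op1, op2, op3):
    """Pack four 8-bit fields into one 32-bit word (assumed field order)."""
    fields = (opcode, op1, op2, op3)
    if any(not 0 <= field <= FIELD_MASK for field in fields):
        raise ValueError("each field must fit in 8 bits")
    word = 0
    for field in fields:
        word = (word << FIELD_BITS) | field
    return word


def decode(word):
    """Unpack a 32-bit word into (opcode, op1, op2, op3)."""
    return tuple((word >> shift) & FIELD_MASK for shift in (24, 16, 8, 0))


# Hypothetical instruction: opcode 0x12 operating on registers 3, 200, and 255.
word = encode(0x12, 3, 200, 255)
fields = decode(word)
```

Because every operand is a register address of the same width, any register, including the special ones, can appear in any operand slot.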
[Figure: device and cluster block diagram showing SRAM, tDSPs, Controller, Dispatcher, Packet Router, Feeder, AIP, DRAM, and links to other routers]
“Flit” Packet Communications
Advantages:
• Only a single modality of operation
• Can operate from registers instead of SRAM
• Superior for real-time transport
• “Cut-through” operations prioritize important traffic
• No bus contention
• All comms are bi-directional (core can receive and transmit at the same time)
• Flit headers carry VLIW packet execution codes (POP)
Normal Packet Queuing: the next packet must wait a long period, up to 60,000 clocks, before the previous packet leaves the queue.
Flit Packet Queuing: the next packet must wait only 4 clocks before the previous packet leaves the queue.
Packet Format fields: OPC, ADR, SIZ, V, POP, V, Payload (0 to 32 QWDS)
Hermosa Control-Plane Organization
• Router Arbitration
• Synchronization
• Flags (Notifications)
• Mutexes (Ownership of a resource)
• Shared Memory Control
• Vector memory
• Matrix memory
• Layer Processing
• Feeders == Programmable cache controllers
• Error handling
• SAM – System Activity Monitoring
• Traffic flows
[Figure: Hermosa Hierarchical Control and Data Distribution Scheme, showing HRouter1, HRouter2, Ctrl2, Supv, and tDSP blocks, separate control and data paths, and a link to the next device]
Each control element includes local memory and control registers.
Control Plane/Data Plane separation is central to Software Defined Networks
H1000 Event Flag Mapping
[Figure: mapping among four external device inputs, the DCR (Device Event Register), the CCR (Cluster Event Register), the tDSP m Event Register, and the tDSP latch and control logic. Flags shown: EVFG0 to EVFG3, EVFD0 to EVFD3, EVFC0 to EVFC3, EVFMBX, EVFIOR, EVFDMA, EVFFDR, and EVFU0 to EVFU15.]
Mask against all event flags via WFLOR or WFLAND functions. If true, then activate the tDSP. An event flag remains latched until cleared by the tDSP.
• AIP Synch Function sets EVFCn event flags.
• Supervisor sets EVFDn event flags.
• Host machine sets EVFGn event flags.
• EVFCn and EVFUn event flags are settable by the tDSPs directly.
• EVFCn event flags are shared by all tDSPs in a cluster.
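The masking step can be illustrated with plain bit operations. The sketch assumes that WFLOR wakes the processor when any masked flag is latched and WFLAND when all masked flags are latched; the slide names the two functions but does not define them, so those semantics, the bit positions, and the `EventRegister` class are assumptions.

```python
# Hypothetical bit positions; the slide names the flags but not their layout.
EVFG0, EVFD0, EVFC0, EVFMBX = (1 << bit for bit in range(4))


class EventRegister:
    """Latches event flags until the owning processor clears them."""

    def __init__(self):
        self.latched = 0

    def set(self, flags):
        self.latched |= flags

    def clear(self, flags):
        self.latched &= ~flags

    def wait_or(self, mask):
        """WFLOR-like test (assumed): true when any masked flag is latched."""
        return (self.latched & mask) != 0

    def wait_and(self, mask):
        """WFLAND-like test (assumed): true when every masked flag is latched."""
        return (self.latched & mask) == mask


reg = EventRegister()
mask = EVFC0 | EVFMBX

reg.set(EVFMBX)                        # e.g. a mailbox event arrives
any_ready = reg.wait_or(mask)
all_ready = reg.wait_and(mask)

reg.set(EVFC0)                         # the cluster-level flag arrives later
all_ready_later = reg.wait_and(mask)

reg.clear(mask)                        # flags stay latched until cleared
after_clear = reg.wait_or(mask)
```

An OR-style wait suits "wake on whichever event comes first", while an AND-style wait acts as a barrier that releases only when every awaited event has occurred.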
KNUPATH Hermosa Programming Model
A range of programming interfaces to balance performance and productivity
[Figure: three interfaces arranged along an axis from Performance to Abstraction: Hermosa Assembly Language, KNUPATH Performance Interface (KPI), KNUPATH Network Interface (KNI)]
KNUPATH Network Interface (KNI)
• Implicit message passing model
• Dataflow programming model based on Kahn Process Networks (KPN)
• C++ libraries and compiler extensions are used to support user-defined kernels and networks
• Host target allows development of KNI networks and kernels without Hermosa hardware
KNUPATH Performance Interface (KPI)
• Explicit message passing model
• Hermosa kernel functions library for MPI-like message passing and synchronization
• C/C++ kernel development using the clang/LLVM compiler toolchain with a Hermosa target
• Host/device accelerator model, similar to CUDA/OpenCL workflows
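To make the Kahn Process Network idea concrete, here is a small sketch using Python threads and FIFO queues. It does not use or imitate the KNI API, which is a C++ interface not shown in these slides; the process functions, the `DONE` marker, and the network wiring are all hypothetical. The defining KPN properties it does show are independent processes, unbounded FIFO channels, and blocking reads.

```python
import queue
import threading

DONE = object()                        # end-of-stream marker (an assumption)


def producer(out_ch, values):
    for value in values:
        out_ch.put(value)
    out_ch.put(DONE)


def scale(in_ch, out_ch, factor):
    while True:
        token = in_ch.get()            # blocking read, as in a KPN
        if token is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(token * factor)


def adder(left_ch, right_ch, out_ch):
    while True:
        a, b = left_ch.get(), right_ch.get()
        if a is DONE or b is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(a + b)


def run_network(xs, ys):
    """Wire up: xs -> scale(x10) -> adder <- ys, then collect the sums."""
    x_ch, y_ch, scaled_ch, sum_ch = (queue.Queue() for _ in range(4))
    processes = [
        threading.Thread(target=producer, args=(x_ch, xs)),
        threading.Thread(target=producer, args=(y_ch, ys)),
        threading.Thread(target=scale, args=(x_ch, scaled_ch, 10)),
        threading.Thread(target=adder, args=(scaled_ch, y_ch, sum_ch)),
    ]
    for process in processes:
        process.start()
    results = []
    while (token := sum_ch.get()) is not DONE:
        results.append(token)
    for process in processes:
        process.join()
    return results


results = run_network([1, 2, 3], [4, 5, 6])
```

Because each process only blocks on reads from its own input channels, the output sequence is the same no matter how the threads are scheduled, which is the determinism property that makes this model attractive for parallel hardware.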
KnuEdge Web Service
• Hermosa developer boards available for development of parallel computation on a queued server
• Available to anyone, free of charge
• Training materials available
• User agreement required
• Contact jotchis@knupath.com for details
Upcoming Events
Date: Thursday, Feb. 16, 2017
Event: Machine Learning Society Meetup: Processing Hardware for Deep Learning. Panelist: D. Palmer (KnuEdge CTO)
Information: At ScaleMatrix, 5775 Kearny Villa Road, San Diego
https://www.meetup.com/machine-learning-society/events/237055385/
www.mlsociety.com
Date: April 26-28, 2017
Event: Workshop on Sparse and Heterogeneous Neural Networks. Hosted by California Institute for Telecommunications and Information Technology (Calit2). Sponsored by KnuEdge. Participants Wanted: Submit Your Abstract!
Information: At Calit2, Atkinson Hall, UCSD campus
https://www.knuedge.com/about-us/events/hnnworkshop/
Date: TBA
Event: KnuEdge Developer Users’ Group
Information: Stay tuned!
Questions?