Bergman Enabling Computation for neuro ML external

V8.8.2016

Enabling Computation for Machine Learning Algorithms Inspired by Neurobiology

Taking Lessons From Nature

Presentation to CSRC Colloquium, SDSU

January 27, 2017

Doug Bergman, PhD, Staff Scientist / Mathemaperson

KnuEdge

Contents

• Neurobiological inspiration for machine learning algorithms

• Problems with existing machine learning tools and hardware

• Design principles

• Hermosa processor design

• Developer tools and Programming

• Events

The Neurobiological Computer

Neurobiological Systems: Flexible and Scalable

Animal            # Neurons          (10^x scale)
Roundworm         302                (2)
Medicinal leech   10,000             (4)
Pond snail        11,000             (4)
Sea slug          18,000             (4)
Lobster           100,000            (5)
Fruit fly         250,000            (5)
Ant               250,000            (5)
Honey bee         960,000            (6)
Cockroach         1,000,000          (6)
Frog              16,000,000         (7)
Mouse             71,000,000         (8)
Finch             131,000,000        (8)
Octopus           500,000,000        (9)
Human             100,000,000,000    (11)
Elephant          200,000,000,000    (11)

[Figure annotations: "Current generation machine learning capabilities"; "Where KnuEdge wants to be in 2021: MindScale."]

Source: Wikipedia

https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons

Bat Vision

• SONAR or optical “blind” vision

• 25,000 bats in one cave can avoid collisions

• Can tell a moth from a fly at 20 m

• Outgoing pulses

– Frequency: up to 120 kHz, f ~ 60% (decay)

– Pulse width: 2 msec

– Pulse rep rate of ~130 Hz – increases with proximity

– Sound amplitude of 130 decibels (jet plane loud)

• Incoming reception

– Time resolution of 2 μsecs

• Signal processing

– ~10 million neurons
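The timing figures on this slide translate into distances with a little arithmetic. The sketch below is an illustration only: it assumes a speed of sound of 343 m/s (dry air at about 20 °C, a value not given on the slide) and the usual two-way echo path. The function names are made up for this example.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C


def range_resolution_m(time_resolution_s: float) -> float:
    """Smallest range difference resolvable from echo timing (two-way path)."""
    return SPEED_OF_SOUND_M_S * time_resolution_s / 2.0


def unambiguous_range_m(pulse_rate_hz: float) -> float:
    """Farthest target whose echo returns before the next pulse goes out."""
    return SPEED_OF_SOUND_M_S / (2.0 * pulse_rate_hz)


def echo_delay_s(distance_m: float) -> float:
    """Round-trip time for a pulse to reach a target and come back."""
    return 2.0 * distance_m / SPEED_OF_SOUND_M_S


resolution_m = range_resolution_m(2e-6)    # 2 microsecond time resolution
near_range_m = unambiguous_range_m(130.0)  # ~130 Hz pulse repetition rate
far_delay_s = echo_delay_s(20.0)           # moth-vs-fly distance of 20 m

print(f"range resolution: {resolution_m * 1000:.3f} mm")
print(f"unambiguous range at 130 Hz: {near_range_m:.2f} m")
print(f"echo delay at 20 m: {far_delay_s * 1000:.1f} ms")
```

Under these assumptions a 130 Hz repetition rate only covers targets within about 1.3 m, which fits the slide's note that the rate increases with proximity.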

Structure of the Mammalian Nervous System

• Extremely complex and diverse

• Thousands or millions of neuron types

• Complex topology of connections, like the internet

Take away: Heterogeneous and sparsely connected

Natural Pattern Recognition

• Recognition is immediate and automatic

• Only a few unlabeled samples needed

• Inspiring development of new machine learning models

• Perfected through a billion years of natural selection

• Heterogeneity of neuron types

• Sparse, “plastic” evolving connections

Artificial Pattern Recognition: Neural Networks

• Lots of costly computation required

• Hundreds or thousands of labeled samples to train

• Largely variations on homogeneous Multilayered Perceptron models

• Relatively easy to develop

– Import library into script, train with data, run

• Build models without sophisticated theory; choose parameters arbitrarily

“Throw mud on wall, see what sticks”

This has been the thrust of development in part because these single-instruction, multiple-data (SIMD) models map well to existing hardware, e.g. GPUs.

Nature-Inspired Learning Models

• Heterogeneity of neuron types

• Hebbian Learning: neurons interconnect sparsely after repeated sympathetic firing

• Plasticity and pruning: connections may change over time

• Recurrent models

• Sequential “Deep Learning” models allow for training on unlabeled or partially-labeled data

The field has been slower to develop, due in part to the unavailability of multiple-instruction, multiple-data (MIMD) processing.

Raudies, Zilli, Hasselmo

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093250

Piekniewski, Laurent, Petre, Richert, Fisher, Hylton

https://arxiv.org/pdf/1607.06854v3.pdf
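The Hebbian learning and pruning bullets on this slide can be illustrated with a few lines of plain Python. This is a minimal sketch, not the models from the cited papers. The learning rate, decay term, and pruning threshold are assumptions chosen for the example.

```python
def hebbian_step(weights, pre, post, lr=0.1, decay=0.01):
    """One Hebbian update: w[i][j] grows when post[i] and pre[j] fire together.

    The small decay term (an assumption of this sketch) keeps weights bounded.
    """
    return [
        [w + lr * post[i] * pre[j] - decay * w for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]


def prune(weights, threshold):
    """Remove weak connections, leaving a sparse weight matrix."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


# One output neuron, two input neurons. Only input 0 fires with the output.
w = [[0.0, 0.0]]
for _ in range(20):
    w = hebbian_step(w, pre=[1.0, 0.0], post=[1.0])
w = prune(w, threshold=0.05)

print(w)  # the connection from input 0 has grown; input 1 stays at 0.0
```

Repeated co-firing strengthens only the connection that took part in it, so the resulting matrix is sparse without any labels being involved.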

Problem: Software

• Brittleness/Lack of Security

• Multiple modalities of communications

• Poor scatter/gather communications

• Power consumption

• Ease of use

• Lack of skilled manpower

[Figure: Increasing Complexity of Applications (Lines of Code); values shown: 24 million, 1.8 million, 90 million, 2.0 million]

Answer: Learning, based on neurobiological principles, could be preferable to programming.

Current Generation Chips are a Problem

• Memory gets larger but not faster

• Logic gets faster but spends more time waiting for memory

• Logic gets more energy efficient but memory transport does not

Today’s processors: Time & Energy Dominated by Fetch

Hardware Solution: Wetware to Silicon

Nature’s Design Principles

• Neuron body contains “context”

• Communication via synapses

• Axonal connections define geometry

Intellisis Design Goals

• Scalable heterogeneous parallelism

• Proximity of memory and processing

• Support for complex connections

• Communications driven architecture

• All information is “pushed”

• Must operate with noise and failures

[Figure: panels labeled Neurobiology, Directed Graph, Random Access Operations, ASIC; the directed graph shows edge weights W2,3 and W0,3 and index pairs 3,2; 1,2; 3,0; 2,0]

Take away: Be flexible.

Nodal Processing

• Data flow from inputs (raw data) to outputs (calculated results)

• Finite-state processing: “Lambda” push model

• Parallel processing kernels occur independent of other activities and flows

• Nodes can be subnetworks themselves

• New nodes can be inserted at any time

Fig 5. Heterogeneous network with heterogeneous nodes and subnets. [Figure labels: Net C, Net D, Processing Pipeline, Classifier]
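The nodal processing bullets above (push-style data flow, nodes that are themselves subnetworks, nodes inserted at any time) can be sketched in a few classes. The class and method names below are hypothetical and only illustrate the idea. They are not the Hermosa or KNUPATH API.

```python
class Node:
    """A processing node: applies fn to each pushed value, then pushes the result on."""

    def __init__(self, name, fn=lambda v: v):
        self.name = name
        self.fn = fn
        self.downstream = []

    def tail(self):
        """The node whose outputs leave this unit (itself, for a plain node)."""
        return self

    def connect(self, other):
        self.tail().downstream.append(other)
        return other

    def insert_after(self, new_node):
        """Splice new_node between this unit and its current consumers."""
        tail = self.tail()
        new_node.tail().downstream.extend(tail.downstream)
        tail.downstream = [new_node]

    def push(self, value):
        result = self.fn(value)
        for node in self.downstream:
            node.push(result)


class Subnet(Node):
    """A node that is itself a network, defined by an entry and an exit node."""

    def __init__(self, name, entry, exit_node):
        super().__init__(name)
        self.entry = entry
        self.exit_node = exit_node

    def tail(self):
        return self.exit_node.tail()

    def push(self, value):
        self.entry.push(value)


class Sink(Node):
    """Collects whatever reaches the end of the pipeline."""

    def __init__(self, name):
        super().__init__(name)
        self.received = []

    def push(self, value):
        self.received.append(value)


# Build source -> Net C (scale -> offset) -> results
scale = Node("scale", lambda v: v * 2)
offset = Node("offset", lambda v: v + 1)
scale.connect(offset)
net_c = Subnet("Net C", scale, offset)

source = Node("source")
results = Sink("results")
source.connect(net_c)
net_c.connect(results)

source.push(3)  # raw data flows to a calculated result

# A new node can be inserted while the network is live.
net_c.insert_after(Node("classifier", lambda v: "high" if v > 5 else "low"))
source.push(1)

print(results.received)
```

Nothing downstream ever asks for data. Every value is pushed forward, which is the point of the “Lambda” push model on the slide.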

Hermosa Processing

Hermosa Processing Unit Model

[Figure: Inputs (connections), Processing (neuron), Outputs (activations); a summation with upper limit i = n-1. Further labels: Hermosa Connection Processing, Router]

Node Processing: Each node accepts n inputs. When all n inputs are processed, a single output is generated. Specially designed for coupled differential equation solving.

Edge Processing: There are many more software connections than hardware connections. A router performs the task of connecting neurons.

Directed Graph Representation of Processing: Each line is called an “edge” and each circle is called a “vertex”.
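The node rule on this slide (accept n inputs, emit one output once all n have been processed) can be sketched as a small accumulator. The weighted sum over i = 0 .. n-1 and the tanh activation are assumptions of this example. The slide does not say which function the processing unit applies.

```python
import math


class AccumulatingNode:
    """Collects n inputs, then emits a single activation (hypothetical model)."""

    def __init__(self, weights):
        self.weights = list(weights)  # one weight per incoming connection
        self.pending = {}             # input index -> value received so far

    def receive(self, index, value):
        """Store one input; return the output once all n have arrived, else None."""
        if not 0 <= index < len(self.weights):
            raise IndexError(f"no connection with index {index}")
        self.pending[index] = value
        if len(self.pending) < len(self.weights):
            return None  # still waiting on other connections
        total = sum(self.weights[i] * self.pending[i]
                    for i in range(len(self.weights)))
        self.pending.clear()          # ready for the next round of inputs
        return math.tanh(total)       # assumed activation function


node = AccumulatingNode([0.5, -1.0, 2.0])
first = node.receive(0, 1.0)   # None: 1 of 3 inputs seen
second = node.receive(2, 0.5)  # None: inputs may arrive in any order
output = node.receive(1, 0.5)  # all 3 seen: tanh(0.5 - 0.5 + 1.0)
print(first, second, output)
```

Inputs can arrive in any order and at any time, so the node needs no global clock or scheduler. It reacts only to arriving messages.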

Lambda Architecture Model

• The fundamental architecture block is a cluster, a physical analog of a network node or subgraph

• Arbitrarily scalable and nestable

• Each cluster contains data storage and specialized processors

Co-location minimizes latency and energy consumption

• Processes are pipelined by message passing among clusters

• Clusters retain finite state information; are free to perform other tasks while waiting for responses

• Internal handling of processing within clusters minimizes traffic congestion

[Figure: Cluster 0, Cluster 1, Cluster 2, Cluster 3]

Another application: Graph Analytics

In addition to Machine Learning, the Hermosa processor and Lambda fabric architecture will handle graph processing elegantly, using data in compressed format

[Figure: Memory Cluster 0, Memory Cluster 1, through Memory Cluster N-1, joined by a Router. Each memory cluster shows a Manager and Workers (Worker 0, Worker 3, Worker 4), with Vertex Pointers and Working Data in Local Memory and Edge Successor Data in Auxiliary Memory.]

Graph Finite-State Operations

• Graph Partitions = Data Objects = Memory Clusters

• Partitions have local states that track states of primitive functions

Example: Find Minimum Distance from Vertex 0 to Vertex 3

$$D_{\min\,0,3} = \min\bigl(D_{\min\,0,1} + W_{1,3},\; D_{\min\,0,2} + W_{2,3}\bigr)$$
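The minimum-distance example can be sketched as a push-style relaxation in which each vertex keeps a local state (its best known distance) and forwards improvements to its successors. The edge weights below are made-up values for illustration, and the single-process queue stands in for messages between memory clusters.

```python
from collections import deque


def min_distance(edges, source, target):
    """Push distance updates along edges until no local state changes.

    edges maps a vertex to a list of (successor, weight) pairs.
    Weights are assumed non-negative in this sketch.
    """
    best = {source: 0.0}      # local state: best known distance per vertex
    queue = deque([source])   # vertices whose state changed and must re-push
    while queue:
        u = queue.popleft()
        for v, weight in edges.get(u, []):
            candidate = best[u] + weight
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                queue.append(v)
    return best.get(target, float("inf"))


# Hypothetical weights on the four-vertex graph: 0->1, 0->2, 1->3, 2->3
edges = {
    0: [(1, 4.0), (2, 1.0)],
    1: [(3, 1.0)],
    2: [(3, 5.0)],
}
distance = min_distance(edges, 0, 3)
print(distance)  # min(4 + 1, 1 + 5)
```

Each update touches only one vertex's state and its outgoing edges, which is why the work divides naturally across partitions held in separate memory clusters.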

Asynchronous Cloud Processor Data Plane

[Block diagram: L1 Router; 32 L2 blocks; 16 MAC blocks; > 8 Gbytes Highly Segmented Memory; Supervisor Root of Trust; Multiple Memory Blocks]

tDSP Processor core:

• 256 registers

• 4 Packet I/O engines

• Single cycle sleep state

• HW Event synchronization

• Independent clock domains

High-speed Serial:

• 28.5 Gbps

• FEC

• 802.3 PHY

• 10GBASE-KR compatible

Layer 1 Router

• Ports: 49

• Rate: 8 GB/s/port

• Size: 2 mm²

• Latency: 7 ns

All communications via flit packets:

• Physical & virtual addressing

• Operation codes

Flit links:

• 64b/256b parallel

• Clock domain crossing

• Interposer or die crossing

“Hermosa” ASIC

[Figure labels: 24.6 mm; Clusters; Central Router; SERDES]

LambdaFabric™: Scalable Across Resources

Low-Latency, High-Throughput, Low-Power Computing Fabric

Scale invariant network architecture

Low latency to everywhere

High bandwidth

Multi-dimensional connectivity

Scalable up to 512,000 chips

[Figure: scale levels Cluster, Chip, Board, Rack, Multi-Rack, with Memory; latencies shown: 6 ns, 62 ns, 247 ns, 437 ns]

“Tiny DSP” Tuned to Stream Operations

Cluster and tDSP Processor

• 256 registers (128 GP and 128 special)

• 1 GHz clock (gated)

• Harvard program/data memory separation

• Fixed 32-bit instruction word

• Storage merged into shared register file

• Built-in synch w/ event flags

• Scatter/gather engines

• Three states: Sleep, run, or single step

• Single instruction packet launch

• 2K instruction store

• Multiple state machines EPIC-style

• No interrupts

• Everything is addressed, including registers

Instruction format fields: Opcode, Op1, Op2, Op3

Fits in about 4 microns² – about the size of a neuron.

[Figure: cluster and device block diagram. Labels: tDSP (four), Controller, Dispatcher, Packet, Router, Feeder, To other Routers (twice), Device, Cluster]

“Flit” Packet Communications

Advantages:

• Only a single modality of operation

• Can operate from registers instead of SRAM

• Superior for real-time transport

• “Cut-through” operations prioritize important traffic

• No bus contention

• All comms are bi-directional (core can receive and transmit at the same time)

• Flit headers carry VLIW packet execution codes (POP)

Normal Packet Queuing: The next packet along must wait for a long period before the previous packet leaves the queue, up to 60,000 clocks.

Flit Packet Queuing: The next packet along must wait for only 4 clocks before the previous packet leaves the queue.

Packet Format: Payload (0 to 32 QWDS)
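The short, bounded payload is what keeps queue waits small: a long message travels as many small packets, so other traffic can interleave between them. The sketch below splits a message into packets of at most 32 payload words. The header fields and the word-addressed destination are hypothetical, since the slides state only that headers carry addressing and operation codes.

```python
from dataclasses import dataclass
from typing import List

MAX_PAYLOAD_QWORDS = 32  # from the slide: payload is 0 to 32 QWDS


@dataclass
class FlitPacket:
    address: int        # hypothetical header field: destination address
    opcode: int         # hypothetical header field: operation code
    payload: List[int]  # 0 to 32 payload words

    def __post_init__(self):
        if len(self.payload) > MAX_PAYLOAD_QWORDS:
            raise ValueError("payload exceeds 32 QWDS")


def segment(address: int, opcode: int, words: List[int]) -> List[FlitPacket]:
    """Split a message into packets carrying at most 32 payload words each.

    Each packet targets address + offset, assuming a word-addressed
    destination (an assumption made for this illustration).
    """
    if not words:
        return [FlitPacket(address, opcode, [])]  # header-only packet
    return [
        FlitPacket(address + start, opcode, words[start:start + MAX_PAYLOAD_QWORDS])
        for start in range(0, len(words), MAX_PAYLOAD_QWORDS)
    ]


packets = segment(0x100, opcode=1, words=list(range(70)))
print([len(p.payload) for p in packets])
print([hex(p.address) for p in packets])
```

A 70-word message becomes three packets, and a higher-priority packet can be queued between any two of them instead of waiting for the whole message.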

Hermosa Control-Plane Organization

• Router Arbitration

• Synchronization

• Flags (Notifications)

• Mutexes (Ownership of a resource)

• Shared Memory Control

• Vector memory

• Matrix memory

• Layer Processing

• Feeders == Programmable cache controllers

• Error handling

• SAM – System Activity Monitoring

• Traffic flows

[Figure: Hermosa Hierarchical Control and Data Distribution Scheme. Labels: HRouter1; HRouter2 (three); Ctrl2 (three); To Next Device; Control Path; Data Path]

Each control element includes local memory and control registers.

Control Plane/Data Plane separation is central to Software Defined Networks

H1000 Event Flag Mapping

[Figure: event flag registers. Labels: 4 External Device Inputs; tDSP m Event Register; CCR: Cluster Event Register; DCR: Device Event Register; CCR: tDSP Latch & Control. Flags shown: EVFG0 to EVFG3, EVFD0 to EVFD3, EVFC0 to EVFC3, EVFU0 to EVFU15, EVFMBX, EVFIOR, EVFDMA, EVFDMA2, EVFFDR, EVFFDR3]

Mask against all event flags via WFLOR or WFLAND functions. If true, then activate tDSP. Event flag will remain latched until cleared by tDSP.

• AIP Synch Function sets EVFCn event flags.

• Supervisor sets EVFDn event flags.

• Host machine sets EVFGn event flags.

• EVFCn & EVFUn event flags are settable by the tDSPs directly.

• EVFCn event flags are shared by all tDSPs in a cluster.
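The mask-and-wake rule on this slide can be sketched with plain bit operations. The functions below are modeled on the WFLOR and WFLAND names, with semantics assumed from those names (wake on any masked flag versus wake on all masked flags). The bit positions are hypothetical and do not reflect the actual H1000 register layout.

```python
# Hypothetical bit positions, chosen only for this illustration.
EVFG0 = 1 << 0   # set by the host machine
EVFD0 = 1 << 4   # set by the supervisor
EVFC0 = 1 << 8   # shared by all tDSPs in a cluster


def wfl_or(flags: int, mask: int) -> bool:
    """Assumed WFLOR behavior: true if ANY masked event flag is set."""
    return (flags & mask) != 0


def wfl_and(flags: int, mask: int) -> bool:
    """Assumed WFLAND behavior: true if ALL masked event flags are set."""
    return (flags & mask) == mask


class EventRegister:
    """Flags stay latched after being set, until explicitly cleared."""

    def __init__(self):
        self.flags = 0

    def set(self, flag: int) -> None:
        self.flags |= flag

    def clear(self, flag: int) -> None:
        self.flags &= ~flag


reg = EventRegister()
mask = EVFG0 | EVFC0

reg.set(EVFG0)
any_after_one = wfl_or(reg.flags, mask)    # one of two flags is enough
all_after_one = wfl_and(reg.flags, mask)   # still waiting for EVFC0

reg.set(EVFC0)
all_after_two = wfl_and(reg.flags, mask)   # both latched: core would wake

reg.clear(EVFG0)
all_after_clear = wfl_and(reg.flags, mask)
print(any_after_one, all_after_one, all_after_two, all_after_clear)
```

Because flags latch, an event that fires before the core starts waiting is not lost, which matters for a design with no interrupts.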

KNUPATH Hermosa Programming Model

A range of programming interfaces to balance performance and productivity

KNUPATH Network Interface

• Implicit message passing model

• Dataflow programming model based on Kahn Process Networks (KPN)

• C++ libraries and compiler extensions are used to support user-defined kernels and networks

• Host target allows development of KNI networks and kernels without Hermosa hardware

KNUPATH Performance Interface

• Explicit message passing model

• Hermosa kernel functions library for MPI-like message passing and synchronization

• C/C++ kernel development using the clang/LLVM compiler toolchain with a Hermosa target

• Host/device accelerator model, similar to CUDA/OpenCL workflows

Hermosa Assembly Language
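The Kahn Process Network model named on this slide can be sketched with Python threads and FIFO queues. This is a generic illustration of KPN semantics (independent processes, unbounded FIFO channels, blocking reads) and not the KNUPATH C++ interface. The process names and the end-of-stream marker are assumptions of the sketch.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(out_q, values):
    """Process with no inputs: writes each value to its output channel."""
    for v in values:
        out_q.put(v)
    out_q.put(DONE)


def square(in_q, out_q):
    """Kernel: blocks on its input channel, writes the square downstream."""
    while True:
        v = in_q.get()  # blocking read, the defining rule of a KPN
        if v is DONE:
            out_q.put(DONE)
            return
        out_q.put(v * v)


def sink(in_q, results):
    """Process with no outputs: collects what arrives."""
    while True:
        v = in_q.get()
        if v is DONE:
            return
        results.append(v)


def run_network(values):
    a, b = queue.Queue(), queue.Queue()  # unbounded FIFO channels
    results = []
    threads = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=square, args=(a, b)),
        threading.Thread(target=sink, args=(b, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_network([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The result is the same however the threads are scheduled, because each process only communicates through FIFO channels with blocking reads. That determinism is what makes the model a natural fit for implicit message passing.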

KnuEdge Web Service

• Hermosa developer boards available for development of parallel computation on a queued server

• Available to anyone, free of charge

• Training materials available

• User agreement required

• Contact jotchis@knupath.com for details

Upcoming Events

Date: Thursday, Feb. 16, 2017
Event: Machine Learning Society Meetup: Processing Hardware for Deep Learning
Panelist: D. Palmer (KnuEdge CTO)
Information: At ScaleMatrix, 5775 Kearny Villa Road, San Diego
https://www.meetup.com/machine-learning-society/events/237055385/
www.mlsociety.com

Date: April 26-28, 2017
Event: Workshop on Sparse and Heterogeneous Neural Networks
Information: Hosted by California Institute for Telecommunications and Information Technology (Calit2)
