Date post: 14-Jul-2020
Heterogeneous Embedded Computer Architectures and Programming Paradigms for Enabling Internet of Things (IoT)

Charles Liu, Ph.D., Professor, Department of Electrical Engineering and Computer Engineering, California State University, Los Angeles

From <https://mobilunity.com/blog/iot-developer-salary-rates/>

From <https://blogs-images.forbes.com/louiscolumbus/files/2016/11/IHS.jpg>

From <https://blogs-images.forbes.com/louiscolumbus/files/2016/07/Internet-of-Things-Forecast.jpg>

From <https://www.qualcomm.com/news/onq/2017/08/16/we-are-making-device-ai-ubiquitous>

Edge Computing

The ability to do advanced on-device processing and analytics is referred to as “edge computing.” Edge computing is a counterpart to cloud computing.

Edge computing opens new possibilities in IoT applications, in particular machine learning for tasks such as object detection, face recognition, language processing, and obstacle avoidance.

Instead of sending streams of images/videos to the cloud for processing, in-situ pre-processing is performed on the device.

Advantages: savings in network and computing resources, reduced latency, and improved security and privacy (personally identifiable information vs. demographic information).

[E.g.] Proactive in-car service: a natural language interface using edge computing allows smart speakers to react more quickly by interpreting voice instructions locally.
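
As one illustration of in-situ pre-processing, the sketch below shrinks a grayscale frame on the device by averaging each 2x2 block, so that only a quarter of the bytes would need to leave the device. The function name `downsample_2x2` and the choice of averaging are assumptions made for this example, not something taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-device pre-processing step: average each 2x2 block of a
 * row-major grayscale frame. Assumes width and height are even.
 * Returns the number of bytes written to out (width * height / 4). */
size_t downsample_2x2(const uint8_t *in, size_t width, size_t height,
                      uint8_t *out) {
    assert(width % 2 == 0 && height % 2 == 0);
    size_t out_w = width / 2;
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            unsigned sum = in[y * width + x] + in[y * width + x + 1] +
                           in[(y + 1) * width + x] +
                           in[(y + 1) * width + x + 1];
            /* rounded mean of the four pixels */
            out[(y / 2) * out_w + (x / 2)] = (uint8_t)((sum + 2) / 4);
        }
    }
    return out_w * (height / 2);
}
```

A real pipeline would more likely send extracted features or detection results than a smaller image; the point here is only that the reduction happens before anything is transmitted.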

Heterogeneous computer architectures are adopted for edge computing: diverse engines such as CPUs, GPUs, and DSPs are integrated in IoT devices so that different workloads are assigned to the most efficient compute engine, thus improving performance and power efficiency.

[E.g.] The Hexagon DSP with Qualcomm Hexagon Vector eXtensions on Snapdragon 835 has been shown to offer a 25X improvement in energy efficiency and an 8X improvement in performance when compared against running the same workloads (GoogleNet Inception Network) on the Qualcomm Kryo CPU.

From <https://www.qualcomm.com/news/onq/2017/08/16/we-are-making-device-ai-ubiquitous>
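
A rough worked reading of the two Hexagon figures, under the assumption (the source does not define the terms) that "energy efficiency" means energy per inference and "performance" means time per inference: since energy is average power times time, the two ratios imply a ratio of average power.

```latex
E = P \cdot t
\quad\Longrightarrow\quad
\frac{P_{\mathrm{CPU}}}{P_{\mathrm{DSP}}}
  = \frac{E_{\mathrm{CPU}} / E_{\mathrm{DSP}}}{t_{\mathrm{CPU}} / t_{\mathrm{DSP}}}
  = \frac{25}{8} \approx 3.1
```

Under that reading, the DSP would finish the workload 8 times sooner while drawing roughly a third of the average power.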

Convolutional Neural Network (CNN) Implementation on Altera FPGA using OpenCL

https://www.youtube.com/watch?v=78Qd5t-Mn0s

Heterogeneous Computer Architectures

• CPUs
• GPUs
• Vector Processors
• Image/Signal Processors
• FPGAs

Suppose you want to add two vectors of numbers. There are many ways to spell this; each spelling corresponds to a programming paradigm.

C uses a loop spelling

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling

a=b+c;

SIMD uses a "short vector" spelling

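#include <arm_neon.h>

// Adds four 32-bit elements per iteration. Assumes n is a multiple of 4.
void add(uint32_t *a, const uint32_t *b, const uint32_t *c, int n) {
  for (int i = 0; i < n; i += 4) {
    // compute a[i], a[i+1], a[i+2], a[i+3]
    uint32x4_t b4 = vld1q_u32(b + i);
    uint32x4_t c4 = vld1q_u32(c + i);
    uint32x4_t a4 = vaddq_u32(b4, c4);
    vst1q_u32(a + i, a4);
  }
}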
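
The short-vector loop above only covers lengths that are a multiple of four. A common way to handle other lengths is a vector main loop followed by a scalar tail. The sketch below shows that structure in portable C, with an inner four-lane loop standing in for the NEON load/add/store sequence so that it runs on any machine; the name `add_with_tail` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Portable sketch of the short-vector pattern with tail handling:
 * full groups of four elements in the main loop, then a scalar loop
 * for the 0 to 3 leftover elements. */
void add_with_tail(uint32_t *a, const uint32_t *b, const uint32_t *c, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        /* stands in for one vld1q_u32 / vaddq_u32 / vst1q_u32 sequence */
        for (int lane = 0; lane < 4; ++lane)
            a[i + lane] = b[i + lane] + c[i + lane];
    }
    for (; i < n; ++i) /* scalar tail */
        a[i] = b[i] + c[i];
}
```

With n = 7, the main loop handles elements 0 to 3 and the tail handles elements 4 to 6.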

SIMT uses a "scalar" spelling

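// Each thread computes one element. Assumes the launch creates exactly
// one thread per element, since there is no bounds check.
__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = b[i] + c[i];  // no loop!
}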
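
To see where the loop went in the SIMT spelling, the following plain C sketch emulates a one-dimensional kernel launch sequentially: the per-thread body is the same scalar code, and the launcher supplies the loops over blocks and threads that the GPU hardware would otherwise provide. The `idx1` type, `add_thread`, `launch_add`, and the added bounds check on `n` are all assumptions of this illustration, not CUDA APIs.

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in index variables (x only). */
typedef struct { int x; } idx1;

/* Per-thread body: the same scalar code as the kernel, plus a bounds
 * check so that a grid larger than n is harmless. */
static void add_thread(idx1 blockIdx, idx1 blockDim, idx1 threadIdx,
                       float *a, const float *b, const float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + c[i];
}

/* Sequential emulation of a launch: these are the loops the kernel
 * author did not have to write. */
void launch_add(int num_blocks, int threads_per_block,
                float *a, const float *b, const float *c, int n) {
    assert(num_blocks > 0 && threads_per_block > 0);
    idx1 blockDim = { threads_per_block };
    for (int bx = 0; bx < num_blocks; ++bx) {
        for (int tx = 0; tx < threads_per_block; ++tx) {
            idx1 blockIdx = { bx };
            idx1 threadIdx = { tx };
            add_thread(blockIdx, blockDim, threadIdx, a, b, c, n);
        }
    }
}
```

On a GPU the iterations run in parallel rather than in this fixed order, which is why the kernel body must not depend on the order in which threads execute.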

NXP S32V

NXP S32V block diagram: APEX2, GPU, A53 core complex, M4, ISP, ENC-DEC, internal SRAM, external DDR/QSPI

Programming Paradigms

https://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_480_Fermi/

NVIDIA GPU: GeForce GTX 480 (Fermi), a Single Instruction Multiple Thread (SIMT) architecture

CUDA and OpenCL terminology correspondence.

The SMs (streaming multiprocessors) schedule and execute threads in lockstep groups of 32 threads called warps.
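
The warp grouping can be made concrete with a little index arithmetic. The helper names below are made up for this sketch, and it assumes a one-dimensional thread block; only the warp size of 32 comes from the slide.

```c
#include <assert.h>

#define WARP_SIZE 32 /* lockstep group size stated above */

/* Number of warps needed for a block; a partially filled warp still
 * occupies a whole warp. */
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp index and lane of a thread within its one-dimensional block. */
int warp_of(int thread_idx) { return thread_idx / WARP_SIZE; }
int lane_of(int thread_idx) { return thread_idx % WARP_SIZE; }
```

For example, a block of 100 threads occupies 4 warps, and the last warp has only 4 active lanes, which is one reason block sizes are usually chosen as multiples of 32.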

https://www.bdti.com/InsideDSP/2015/06/30/CogniVue

CogniVue's APEX vision processor core architectures

Field-Programmable Gate Array (FPGA) is another choice!

Heterogeneous MPSoC: Xilinx's 16nm UltraScale+

SoC Design Challenges

Data streaming

General algorithm execution model

Pipelining

Pipelining

Data Mapping and Remapping

• Coalescing global memory accesses to minimize the number of memory transactions
• Improving memory locality of next-stage thread access
• Improving memory locality of inter-thread accesses
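
The effect of coalescing can be sketched with a simple counting model: if each of the 32 threads in a warp reads one 4-byte element, how many memory segments does the warp touch for a given stride between neighbouring threads? The 128-byte segment size and the function name `segments_touched` are assumptions of this model, not figures from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32
#define SEGMENT_BYTES 128 /* assumed transaction size, not from the slides */

/* Count the distinct memory segments touched when thread `lane` of a warp
 * reads the 4-byte element at index lane * stride. */
int segments_touched(size_t stride) {
    assert(stride > 0);
    int count = 0;
    size_t last = (size_t)-1; /* sentinel: no segment seen yet */
    for (size_t lane = 0; lane < WARP_SIZE; ++lane) {
        size_t segment = (lane * stride * 4) / SEGMENT_BYTES;
        /* addresses increase with lane, so a new segment shows up as a change */
        if (segment != last) {
            ++count;
            last = segment;
        }
    }
    return count;
}
```

In this model a unit stride needs a single segment for the whole warp, while a stride of 32 elements needs 32, one per thread. Remapping data so that neighbouring threads read neighbouring elements is what moves an access pattern from the second case toward the first.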

Memory Hierarchy

https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html

Convolutional Neural Network

https://www.embedded.com/print/4017551

OpenCL Memory model
