Page 1: Streaming Similarity Search on FPGA based on Dynamic Time ... · Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Streaming

Department of Electronic Engineering, Tsinghua University

Nano-scale Integrated Circuit and System Lab.

Streaming Similarity Search on FPGA

based on Dynamic Time Warping

Yu WANG

Associate Prof.,

Head, Research Institution of Circuits and Systems,

E.E. Dept, Tsinghua University, Beijing, China

http://nics.ee.tsinghua.edu.cn/people/wangyu/

Joint work by Tsinghua Univ. and IBM China Research Lab

Based on a submitted paper to FPGA 2013

Page 2

Outline

Background and Motivation

Why we need streaming similarity search

Recent achievements and problems to solve

Subsequence Similarity Search on FPGA

Algorithms

Hardware Architectures

Results

Conclusion and future work

Page 3

Alberto Sangiovanni-Vincentelli (Tuesday noon @ ICCAD 2012)

ICCAD at 30 Years: Where We Have Been, Where We Are Going

Page 4

Internet of Things

Nowadays

Independent Applications

Traditional Database Techniques

Small Scale “small IOTs”

Monitoring only

Future

Fully connected, and correlated Applications

Advanced IT techniques

Large Scale, and large Volume Data (“big IOTs”)

Different real-time or non-real-time applications

BIG DATA (time- and space-correlated streaming data): Volume, Variety, Velocity

IoT data management system (IBM RODB©): collection, publishing, processing, storage, and query for BIG DATA

Page 5

RODB

Different Applications

Realtime Oriented DataBase

Application Specific Data Management Middleware (Collection, Publish, Processing, Storage, and Query )

Page 6

Data format from IoT (CPS, SoS, etc.)

Format of Data

Numerical data streams from various sensors (time series)

Multi-media data and sensor data

Industries in Smarter Planet

Petro E&U

Mineral

Chemistry

Steel Manufactory

Smart building Smart City Environment monitoring

Retail Logistic

Healthcare

RFID

Transportation

Page 7

Mining Task Dependency (Not Complete)

Similarity Search

Correlation Discovery

Classification Clustering

Motif Discovery

Novelty/Anomaly detection

Rule Discovery

Segmentation

Visualization

Data Privacy

Prediction

Burst Detection

No history data involved

May have real-time requirements

History data analyses

Page 8

“Similarity” Search

Finite field subsequence exact search

Object: string

e.g. find “pattern” in “we have a pattern here” with KMP

Finite field subsequence similarity search

Object: DNA chain, protein sequence

e.g. find a subsequence similar to “ATGAG” in a DNA chain “ATGACTGAG…” with Smith-Waterman

Infinite field subsequence similarity search

Object: time series data

e.g. next slide

Page 9

Streaming Subsequence Similarity Search

Time series (electrocardiogram) & pattern (query)

Pick out subsequences with a sliding window (N subsequences in total)

Compare each subsequence with the pattern, under a certain distance measure, to judge whether they are similar

[Figure: example time series (electrocardiogram) and query pattern]

Simple DATA representation Tuple [Sensor, Time, Value]

Time Complexity O(N*O(distance))
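
The sliding-window procedure above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the Euclidean placeholder distance, and the toy series are hypothetical and not from the talk; any distance measure with the same signature (DTW, for instance) could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder distance (an assumption); swap in any other measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sliding_search(
    series: Sequence[float],
    pattern: Sequence[float],
    threshold: float,
    distance: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
) -> List[Tuple[int, float]]:
    """Return (start index, distance) for every window within the threshold."""
    m = len(pattern)
    hits = []
    # One window per start position: N = len(series) - m + 1 distance calls.
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = distance(window, pattern)
        if d <= threshold:
            hits.append((start, d))
    return hits


series = [0, 0, 1, 2, 1, 0, 0, 1, 2, 1, 0]
pattern = [1, 2, 1]
print(sliding_search(series, pattern, threshold=0.0))
```

The loop calls the distance once per window, which is where the O(N*O(distance)) cost comes from.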

Page 10

Distance Measure

Dynamic Time Warping

$P = p_1, p_2, p_3, \dots, p_M$; $S = s_1, s_2, s_3, \dots, s_M$

$$\mathrm{DTW}(S, P) = D(M, M)$$

$$D(i, j) = \mathrm{dist}(s_i, p_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$

$$D(0, 0) = 0; \quad D(i, 0) = D(0, j) = \infty, \quad 1 \le i \le M,\ 1 \le j \le M$$

DTW is widely regarded as one of the best distance measures in most domains. It allows shrinking, stretching, warping, and even sequences of different lengths.

Distance complexity analysis: O(M*M)

Step 1: Calculate the distance between each pair of points

Step 2: Find the shortest accumulated path
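
The recurrence and the two steps above can be written as a short dynamic program. The following is a minimal sketch, assuming the point distance dist is the absolute difference (the slides do not fix it at this point); the function name and test inputs are illustrative only.

```python
import math
from typing import Sequence


def dtw(s: Sequence[float], p: Sequence[float]) -> float:
    """DTW distance between s and p with |a - b| as the point distance."""
    n, m = len(s), len(p)
    # D[0][0] = 0, all other border cells are infinite.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - p[j - 1])            # Step 1: point distance
            D[i][j] = cost + min(D[i - 1][j],           # Step 2: cheapest path
                                 D[i][j - 1],
                                 D[i - 1][j - 1])
    return D[n][m]


print(dtw([0, 1, 2, 1, 0], [0, 1, 1, 2, 1, 0]))  # stretched copy of the same shape
print(dtw([0, 1, 2], [0, 2, 2]))
```

The double loop fills every cell once, which is the O(M*M) cost per subsequence noted above.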

Page 11

Challenges for Streaming Similarity Computing

Challenges (Velocity, Volume, Variety)

Real-time analysis

• Both on the sensor side and in the cloud

Large volume of streaming data to be compared

• Cannot afford to store the data on the sensors

• Millions of sensors may be on the edges

Various patterns

• People may want to search for different patterns on different or the same datasets

Previous Work

Software preprocessing to reduce the number of real DTW computations

Parallel hardware: mostly task-level parallelism; little has been done on fine-grained parallelism

Page 12

Related Work -- Software

1000+ papers on software speedup techniques:

1. Y. Sakurai et al. proposed a computation-reuse algorithm called SPRING [3]

Only one tuple differs between two neighboring subsequences

Merge N M-by-M matrices into a single N-by-M matrix; N paths grow at the same time

It reduces the time complexity from O(N*M*M) to O(N*M)

The whole sequence can’t be normalized in streaming

Example M-by-M matrix (one of N):

26 19 23 16 12
18 19 19 7 5
15 22 18 3 5
14 21 13 3 5
12 13 7 2 4

Merged N-by-M matrix; each cell shows the accumulated distance with the start time of its path in parentheses:

26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8)
18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8)
15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8)
14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8)
12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8)
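
The merged-matrix idea can be sketched as a streaming update that keeps only one column of accumulated distances and start times per arriving tuple. This is a hedged illustration, not the authors' code: it assumes the absolute difference as point distance, a bottom row D(t, 0) = 0 so that a new path may start at every tuple, and an arbitrary tie-breaking order (so start times can differ from a figure when distances tie). The stream values and the pattern are the ones from the P1..P7 example table in the DTW hardware slide of this deck.

```python
import math
from typing import Iterable, Iterator, Sequence, Tuple


def spring(stream: Iterable[float],
           pattern: Sequence[float]) -> Iterator[Tuple[int, float, int]]:
    """Yield (end_time, distance, start_time) per arriving tuple (1-based times)."""
    m = len(pattern)
    prev_d = [0.0] + [math.inf] * m   # column for time 0
    prev_s = [0] * (m + 1)
    for t, x in enumerate(stream, start=1):
        cur_d = [0.0] + [math.inf] * m    # D(t, 0) = 0: a path may start here
        cur_s = [t] + [0] * m             # Sp(t, 0) = t
        for j in range(1, m + 1):
            # Pick the cheapest predecessor and inherit its start time.
            best_d, best_s = min(
                (cur_d[j - 1], cur_s[j - 1]),
                (prev_d[j], prev_s[j]),
                (prev_d[j - 1], prev_s[j - 1]),
                key=lambda cell: cell[0],
            )
            cur_d[j] = abs(x - pattern[j - 1]) + best_d
            cur_s[j] = best_s
        yield t, cur_d[m], cur_s[m]
        prev_d, prev_s = cur_d, cur_s


stream_values = [8, 1, 4, 9, 7, 9, 6, 0, 8, 9, 6, 7, 7, 3]
pattern = [0, 5, 9, 10, 9, 5, 0]
for end, dist, start in spring(stream_values, pattern):
    print(f"end={end:2d} distance={dist:4.0f} start={start}")
```

Each tuple costs O(M) work, so N tuples cost O(N*M) in total, compared with O(N*M*M) when every subsequence gets its own matrix.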

Page 13

Related Work -- Software

2. Lower bound: A. Fu, E. Keogh et al. estimated a lower bound of the DTW distance in a cheap way, called LB_Keogh [1].

It constrains the warping path to deviate no more than R*M cells from the diagonal. It generates an upper envelope and a lower envelope, and LB_Keogh is defined as the sum of the distances to the envelope of the subsequence points that fall outside it.

If the lower bound distance exceeds the threshold, the DTW distance will also exceed the threshold, and the subsequence can be pruned.

Page 14

Related Work -- Software

3. S. H. Lim et al. used indexing techniques to speed up the search [11]

Build a lookup table for different patterns; the subsequence search speed then equals the table lookup speed, which is very fast

The lookup table construction cost is even larger than that of DTW, so it is only suitable for frequent queries on the same sequence.

A streaming sequence, which may be infinitely long, cannot be indexed.

4. There are also other techniques, such as early abandoning.

All of the software techniques above can be seen as preprocessing techniques that aim to reduce the number of DTW calculations, rather than to accelerate DTW itself

Page 15

Related Work -- Parallel Hardware

Several works try to exploit parallel hardware, such as multi-cores [8], computer clusters [6], and GPUs [4], to speed up the search.

All these works allocate subsequences starting from different positions of the whole sequence to different processing units, which can be seen as coarse-grained parallelism.

[4] also uses threads to generate the warping matrices in parallel, but performs the path search serially, which can be seen as partial fine-grained parallelism.

This leads to a heavy data-transfer burden, as one subsequence may consist of many tuples. The partial fine-grained parallel approach of [4] even needs to transfer a whole matrix between threads.

Page 16

Related Work -- Parallel Hardware

The first and only work using FPGA [2] is generated by a C-to-VHDL tool called ROCCC

From the reported performance, we think the tool exploits the fine-grained parallelism inside DTW.

It does not exploit the coarse-grained parallelism.

The lack of insight into the FPGA limits scalability and flexibility:

• It cannot support patterns longer than 128.

• It cannot support on-line updating of patterns of different lengths. For example, if a new pattern of length 127 is wanted, the system must be recompiled and the FPGA re-downloaded, which may cost several hours.

Page 17

Problems we try to solve

Problems

Software can’t accelerate DTW itself

Coarse-grained parallelism may lead to a heavy burden on bandwidth

Fine-grained parallelism requires hard-wired synchronization

FPGA lacks the flexibility of software

Solutions

Turn to parallel hardware to accelerate DTW

Choose and modify streaming parallel algorithms (SPRING) to reduce bandwidth

Use FPGA with a flexible structure for fine-grained parallelism

Page 18

Outline

Background and Motivation

Why we need streaming similarity search

Recent achievements and problems to solve

Subsequence Similarity Search on FPGA

Algorithms

Hardware Architectures

Results

Conclusion and future work

Page 19

Algorithms

Normalization

Enable multiple DTW

Hybrid lower bound

Good Preprocessing to leave very few real DTW

Multiple DTW

Coarse-Grain and Fine-Grain Parallelism

Algorithm Framework

Normalizer -> Hybrid Lower Bound -> Multiple DTW

Page 20

Normalization

Assumption: the offset and the amplitude can be treated as approximately time-invariant over a slightly longer window of length M+C, where M is the length of the pattern and C is a constant.

[Figure: four example time series, panels a to d]
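
Under the assumption above, each tuple can be normalized with the mean and standard deviation of a recent window instead of the whole (unbounded) stream. The sketch below is illustrative only: the function name is hypothetical, the window parameter stands in for a length such as M+C, and normalizing the newest tuple against the trailing window is an assumed choice, not necessarily the alignment used in the hardware.

```python
import math
from collections import deque
from typing import Iterable, Iterator


def sliding_znorm(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield (x - mean) / std for each new tuple once the window is full."""
    buf = deque()
    total = 0.0      # running sum of the window
    total_sq = 0.0   # running sum of squares of the window
    for x in stream:
        buf.append(x)
        total += x
        total_sq += x * x
        if len(buf) > window:
            old = buf.popleft()
            total -= old
            total_sq -= old * old
        if len(buf) == window:
            mean = total / window
            var = max(total_sq / window - mean * mean, 0.0)
            std = math.sqrt(var)
            yield (x - mean) / std if std > 0 else 0.0


raw = [1, 2, 3, 4, 5, 6]
shifted = [10 * v + 100 for v in raw]   # same shape, different offset and amplitude
print(list(sliding_znorm(raw, 3)))
print(list(sliding_znorm(shifted, 3)))
```

Both printed lists agree up to rounding, which is the point of normalizing: offset and amplitude differences between sensors drop out before the distance is computed.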

Page 21

Hybrid Lower Bound

LB_partial DTW

Stable but time-consuming

LB_Keogh / reversed LB_Keogh

Efficient but unstable

Degrades significantly when R increases

$$U_i = \max\{P_{i-R}, P_{i-R+1}, \dots, P_{i+R-1}, P_{i+R}\}$$

$$L_i = \min\{P_{i-R}, P_{i-R+1}, \dots, P_{i+R-1}, P_{i+R}\}$$

$$D_i = \begin{cases} S_i - U_i & \text{if } S_i > U_i \\ L_i - S_i & \text{if } L_i > S_i \\ 0 & \text{otherwise} \end{cases}$$

$$LB(P_{1,Y}, S_{1,Y}) = \sum_{i=1}^{Y} D_i$$

This can be seen as a combination of the early abandoning technique and the lower bounding technique
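
The envelope formulas above translate almost directly into code. The sketch below is a hedged illustration: the function name is hypothetical, R is taken as a number of cells, and the envelope window is clipped at the ends of the pattern (the slides do not say how borders are handled).

```python
from typing import Sequence


def lb_keogh(pattern: Sequence[float], subseq: Sequence[float], r: int) -> float:
    """Sum of the distances from subseq points to the envelope of pattern."""
    n = len(pattern)
    total = 0.0
    for i, s in enumerate(subseq[:n]):
        lo, hi = max(0, i - r), min(n, i + r + 1)   # clipped window (assumption)
        upper = max(pattern[lo:hi])                 # U_i
        lower = min(pattern[lo:hi])                 # L_i
        if s > upper:
            total += s - upper
        elif s < lower:
            total += lower - s
    return total


pattern = [0, 5, 9, 10, 9, 5, 0]
subseq = [8, 1, 4, 9, 7, 9, 6]
threshold = 4.0   # hypothetical pruning threshold
bound = lb_keogh(pattern, subseq, r=1)
print(bound, "prune" if bound > threshold else "run DTW")
```

Only subsequences whose bound stays at or below the threshold need the full DTW computation, which is how the lower bound reduces the number of DTW calls.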

Page 22
Page 23

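Multiple DTW – Modified SPRING

SPRING:

$$\mathrm{DTW}(S_{s,e}, P) = D(e, M)$$

$$D(i, j) = \mathrm{dist}(s_i, p_j) + \min \begin{cases} D(i-1, j) & \text{if } i-R < Sp(i-1, j)+j < i+R, \text{ else } \infty \\ D(i-1, j-1) & \text{if } i-R < Sp(i-1, j-1)+j < i+R, \text{ else } \infty \\ D(i, j-1) & \text{if } i-R < Sp(i, j-1)+j < i+R, \text{ else } \infty \end{cases}$$

$$D(i, 0) = \begin{cases} 0 & \text{if } valid(i) = 1 \\ \infty & \text{if } valid(i) = 0 \end{cases} \qquad D(0, j) = \infty, \quad 1 \le i \le N,\ 1 \le j \le M$$

$$Sp(i, j) = \begin{cases} Sp(i-1, j) & \text{if } D(i-1, j) \text{ is the minimum} \\ Sp(i-1, j-1) & \text{if } D(i-1, j-1) \text{ is the minimum} \\ Sp(i, j-1) & \text{if } D(i, j-1) \text{ is the minimum} \end{cases}$$

$$Sp(i, 0) = i; \quad Sp(0, j) = 0, \quad 1 \le i \le N,\ 1 \le j \le M$$

Example M-by-M matrix:

26 19 23 16 12
18 19 19 7 5
15 22 18 3 5
14 21 13 3 5
12 13 7 2 4

Merged matrix; each cell shows the accumulated distance with the start time of its path in parentheses:

26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8)
18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8)
15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8)
14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8)
12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8)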

Page 24

Implementation on FPGA

Hardware Framework

Four Loops to Guarantee Streaming

Two-Phase Precision Reduction

Support for Multi FPGAs

[Figure: hardware framework. Blocks: PCIE, FIFO, Norm, LowerBound, Join, DTW, Buffer FIFO. High-precision domain tuples: value (32 bit), time (32 bit), valid (1 bit), flag (1 bit). Low-precision domain tuples: value (16 bit), time (16 bit); LB (16 bit), time (16 bit); DTW distance (16 bit), time (16 bit), valid (1 bit)]

Page 25

Implementation on FPGA

Normalizer

Tuple = (Tuple - mean)/std

Updating mean & std; pipeline latency: K cycles

[Figure: normalizer. Blocks: shifter (Tuple1 ... Tuple2*M+1 ... Tuple2*M+K+1), Mean, Std; ports: Tuple in, Tuple out]

Hybrid lower bound

[Figure: hybrid lower bound. Blocks: LB_pDTW, LB_Keogh, Reversed LB_Keogh, envelope, shifter, adder, Max; input: Tuple in; output: lower bound distance]

Page 26

Implementation on FPGA

DTW

[Figure: DTW module. Blocks: Tuple router, processing elements PE1, PE2, ..., PEW-1, PEW, Subsequence FIFO, Pattern RAM, FIFO, Result router; signals: tuple, tuple enable, pattern, P valid, valid, busy, result, DTW distance, INF]

[Figure: single PE. Inputs: tuple, P in, D in, PrevAcc D/start time, CurrAcc D/start time; operators: |-| (single distance), Min, +; outputs: P out, D out]

Example; each cell shows the accumulated distance with the start time in parentheses:

P7=0 INF 26(1) 19(1) 23(2) 16(2) 12(2) 14(2) 12(2) 6(2) 14(2) 17(8) 11(8) 12(8) 14(8) 12(8)
P6=5 INF 18(1) 19(1) 19(2) 7(2) 5(2) 9(2) 6(2) 11(2) 10(8) 8(8) 5(8) 7(8) 9(8) 11(8)
P5=9 INF 15(1) 22(1) 18(2) 3(2) 5(2) 5(2) 8(2) 17(2) 7(8) 4(8) 7(8) 9(8) 11(8) 17(8)
P4=10 INF 14(1) 21(1) 13(2) 3(2) 5(2) 5(2) 8(2) 17(2) 6(8) 4(8) 7(8) 9(8) 11(8) 17(11)
P3=9 INF 12(1) 13(2) 7(2) 2(2) 4(2) 4(2) 7(2) 14(8) 4(8) 3(8) 6(8) 8(8) 10(11) 11(14)
P2=5 INF 11(1) 5(2) 2(2) 6(2) 8(2) 11(5) 7(7) 5(8) 3(8) 7(8) 7(8) 8(11) 9(13) 5(14)
P1=0 INF 8(1) 1(2) 4(3) 9(4) 7(5) 9(6) 6(7) 0(8) 8(9) 9(10) 6(11) 7(12) 7(13) 3(14)
value 8 1 4 9 7 9 6 0 8 9 6 7 7 3
time 1 2 3 4 5 6 7 8 9 10 11 12 13 14
PE PE1 PE2 PE3 PE4 PE5 PE6 PE7 PE1 PE2 PE3 PE4 PE5 PE6 PE7

$$D(i, j) = \mathrm{dist}(s_i, p_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$
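
The recurrence explains why fine-grained parallelism is possible: D(i, j) depends only on cells whose index sum i + j is smaller by one or two, so all cells on the same anti-diagonal are independent of each other. The sketch below is a software illustration of that dependency order only, not a model of the PE array in the figure; it assumes the absolute difference as point distance, and the function name is hypothetical.

```python
import math
from typing import Sequence


def dtw_wavefront(s: Sequence[float], p: Sequence[float]) -> float:
    """Fill the DTW matrix anti-diagonal by anti-diagonal."""
    n, m = len(s), len(p)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for k in range(2, n + m + 1):            # anti-diagonal: i + j == k
        i_lo, i_hi = max(1, k - m), min(n, k - 1)
        # Every cell in this inner loop reads only diagonals k-1 and k-2,
        # so in hardware each one could be computed by a separate PE at once.
        for i in range(i_lo, i_hi + 1):
            j = k - i
            D[i][j] = abs(s[i - 1] - p[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]
            )
    return D[n][m]


print(dtw_wavefront([0, 1, 2, 1, 0], [0, 1, 1, 2, 1, 0]))
print(dtw_wavefront([0, 1, 2], [0, 2, 2]))
```

With enough parallel units, the n*m cells are covered in n + m - 1 wavefront steps instead of n*m sequential ones.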

Page 27

Experimental Setup

CPU: Intel i7-930 + 16 GB RAM + Windows 7

FPGA: Altera Stratix4s530

Combinational ALUTs: 362,568/424,960 (85%)

Dedicated logic registers: 230,160/424,960 (54%)

Memory bits: 1,902,512/21,233,664 (9%)

Fmax: 167.8 MHz

X = 10, Y = 502, PE number W = 512

Page 28

Experimental Results

Dataset 1: medical data

This dataset has about 8G points, and we need to find a pattern of length 421 with R = 5%

Page 29

Experimental Results

Dataset 2: speech recognition

We downloaded the CMU_ARCTIC speech synthesis databases and constructed a 1-minute speech (1 million points) by splicing together the first 21 of the 1132 utterances

[Figure: Time taken to search a speech dataset. x-axis: pattern length (128 to 16384); y-axis: time in seconds (log scale, 10^-3 to 10^5). Series: R = 0.05, R = 0.1, R = 0.2, R = 0.3, R = 0.4, R = 0.5; our work: R = 0.05; our work: R = 0.5]

Page 30

Experimental Results

FPGA and GPU:

Compared with GPU and software: we use the computation-reuse technique to exploit the coarse-grained parallelism, and the fine-grained parallelism can only be exploited by the FPGA.

Compared with the previous FPGA work: we use both the lower bound technique and the computation-reuse technique.

Page 31

Conclusions and Future Work

Conclusions

IoT systems produce a lot of time series data

Sensors and computing clusters (the cloud) have different requirements on tasks, so the problem is how to design a proper data management system that helps people use these data

For similarity search, which is a basic task for understanding and analyzing streaming time series data, we proposed an FPGA acceleration architecture.

Future Work

Explore the system architecture for time series data analysis to support the IoT data management system

• Find the system architecture patterns, and design the AS-system.

Page 32

Similarity Search

Correlation Discovery

Classification Clustering

Motif Discovery

Novelty/Anomaly detection

Rule Discovery

Segmentation

Visualization

Data Privacy

Prediction

Burst Detection

No history data involved

May have real-time requirements

History data analyses

Future Work

Page 33

Reference

1. T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. SIGKDD, 2012.

2. D. Sart, A. Mueen, W. Najjar, V. Niennattrakul, and E. Keogh. Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs. ICDM, 2010.

3. Y. Sakurai, C. Faloutsos, and M. Yamamuro. Stream Monitoring under the Time Warping Distance. ICDE, 2007.

4. Y. Zhang, K. Adl, and J. Glass. Fast spoken query detection using lower-bound Dynamic Time Warping on Graphical Processing Units. ICASSP, 2012, 5173-5176.

5. H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1, 2, 2008, 1542-52.

6. S. Srikanthan, A. Kumar, and R. Gupta. Implementing the dynamic time warping algorithm in multithreaded environments for real time and unsupervised pattern discovery. IEEE ICCCT, 2011, 394-398.

7. M. Grimaldi, D. Albanese, G. Jurman, and C. Furlanello. Mining Very Large Databases of Time-Series: Speeding up Dynamic Time Warping using GPGPU. NIPS Workshop, 2009.

8. N. Takhashi, T. Yoshihisa, Y. Sakurai, and M. Kanazawa. A Parallelized Data Stream Processing System using Dynamic Time Warping Distance. CISIS, 2009.

9. A. Fu, E. Keogh, L. Lau, C. Ratanamahatana, and R. Wong. Scaling and time warping in time series querying. VLDB J. 17, 4, 2008, 899-921.

Page 34

Department of Electronic Engineering, Tsinghua University

Nano-scale Integrated Circuit and System Lab.

Thank you!

For other domain-specific accelerations, such as graph-theoretic algorithms, sparse matrix decomposition, search apps, and video apps, please refer to my webpage:

http://nics.ee.tsinghua.edu.cn/people/wangyu/

Page 35

John D. Davis

Researcher, Microsoft Research Silicon Valley

In collaboration with Chuck Thacker, Eric Chung, Srinidhi Kestur, Lintao Zhang, Fang Yu, Zhangxi Tan, & Ollie Williams

Page 36

Doubling of transistors every 18-24 months

2X Compute Capability & Efficiency

Innovative Applications

Miniaturization

Lowered Costs

etc…

Page 37

Microprocessor Trends

[Figure: clock frequency (MHz, log scale from 10 to 10000) versus year, 1982 to 2014; annotations: 15 GHz Processor (100 Watts); “The Multicore Revolution”]

Source: http://cpudb.stanford.edu

Page 38

Circa 2005 Multicore Trends

[Figure: core count x frequency (n x GHz, log scale from 1 to 100) versus year, 2004 to 2012; annotations: 16 cores @ 3.6GHz (100W); The Power Wall]

Page 39

Hardware Specialization

Capabilities (Battery Life, Performance, Killer Apps)

Past, Today, Future

No longer can rely on general-purpose hardware improvements to enable more capabilities

Page 40

Source: Ning Zhang and Bob Brodersen, ISSCC data

10-100X Gap in Efficiency Between General Purpose Processors and Dedicated Hardware

Page 41

Motivation

HW Accelerators & Goals

Parallel SAT Solver

Matrix-Vector Multiplication Engine

Conclusions

Page 42

HW accelerators Applications (PSAT)

Libraries (MVM)

Language

Common computation architectures to broaden accelerator utility.

Customized memory architecture

Compressed data representations

Precision vs. energy efficiency

Page 43

Transistors are abundant, power is scarce

Utilize abundant silicon for FPGA fabrics

Energy efficient and post-silicon flexibility

Challenges

FPGAs incur large reconfiguration overheads

Must provide significant advantages over other architectures (many-core, GPGPU)

What are the right applications?

Exploration enabled by the BEE3

Page 44

We built it!

Vehicle for research in computer system architecture

“BEE3”: Berkeley Emulation Engine, version 3

4 FPGAs (3 types of FPGAs): logic-focused, DSP-focused or embedded processor-focused

64 GB DDR2 DRAM: 2 DRAM channels per FPGA, 2 DIMMs per channel

FPGA ring interconnect

Plenty of I/O to connect to the BEE3: 10 GbE, 1 GbE, PCI-Express, QSH

Page 45

Two design styles

Directly translate SW → HW (generally FSMs)

Easy to debug and compare to SW system

Composable building blocks

Leverage domain (App + HW) expertise

Target FPGA hard macros

General requirements for FPGAs

No reconfiguration

Generalized solution → library-like functionality

Page 46

Page 47

Determine whether a given boolean formula can be true

3SAT is the first known NP-complete problem

Often used to prove that other problems are NP-complete

Applications of SAT:

Formal verification of circuit design

Cryptography attacks

Solve other NP-complete problems

SAT solvers can take a long time: sometimes hours, days, or even weeks

[Example formula: two clauses in conjunctive normal form]

Page 48

Software Solver

Loop {

Branch decision: set a variable

Deduce: loop through all related clauses, obtain inferred variables

Conflict analysis: backtrack, or finish

}

10% time; 90% time (>1000 CPU cycles/inference)

CPU and FPGA connected over FSB/HT/PCIe

Example clause over X1, X5, X9:

X1=0 (decision), X5=0 => X9=0
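
The deduce step in the loop above is commonly implemented as unit propagation: once all but one literal of a clause are false, the remaining literal is forced. The sketch below is a plain software illustration under assumptions, not the solver from the talk. Clauses are lists of signed integers (DIMACS-style), and the example clause (X1 OR X5 OR NOT X9) uses an assumed polarity chosen so that it reproduces the inference X1=0, X5=0 => X9=0; the literal signs in the original slide are not recoverable.

```python
from typing import Dict, List, Optional


def unit_propagate(clauses: List[List[int]],
                   assignment: Dict[int, bool]) -> Optional[Dict[int, bool]]:
    """Extend the assignment with all forced literals; return None on conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var, want = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == want:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None               # every literal is false: conflict
            if len(unassigned) == 1:      # unit clause: the last literal is forced
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment


clauses = [[1, 5, -9]]                    # (X1 or X5 or not X9), assumed polarity
print(unit_propagate(clauses, {1: False, 5: False}))
```

Each pass scans every clause, which hints at why this step dominates software solving time and why it is a natural target for parallel inference hardware.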

Page 49

Previous work

Map clauses to logic directly

Hours of reconfiguration time

This is no longer only a logic problem! It’s an architecture problem!

Design for FPGA fabric

Push computation close to storage (state)

Transform logic problem into a memory indexing problem

Page 50

1: CPU communication module

2: Implication queue

3: Parallel inference engines

4: Inference multiplexer

5: Conflict inference detection

Page 51

An application-specific architecture

Reprogram memories for a new instance

Avoid global signal wires; pipeline carefully

Support tens of thousands of variables and clauses per FPGA

Learned clause support

BCP 5~16 times faster than the conventional software-based approach

Page 52

Page 53

Page 54

𝒚 = 𝑨𝒙

Matrix-Vector-Multiply is a critical HPC kernel

10s of papers published per year on this topic

Existing works on GPU/CPU/FPGA

Performance sensitive to matrix sparsity and formats

Processor-centric data formats

High power consumption (GPU/CPU)

FPGA opportunities

Exploit custom variable-length formats

Low power, large memory configurations

Efficient, robust resource utilization

Page 55

Build single FPGA bitfile library for 𝒚 = 𝑨𝒙

Handle large-scale inputs (>GB)

Avoid costly run-time reconfiguration

Exploit bit-level manipulation

Dense and sparse inputs

Process multiple sparse matrix formats: COO, CSR, Dense, DIA, ELL, etc.

Page 56

𝒚 = 𝑨𝒙

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} A_{00} & A_{01} & A_{02} & A_{03} \\ A_{10} & A_{11} & A_{12} & A_{13} \\ A_{20} & A_{21} & A_{22} & A_{23} \\ A_{30} & A_{31} & A_{32} & A_{33} \\ A_{40} & A_{41} & A_{42} & A_{43} \end{bmatrix} \times \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

Page 57

[Figure: matrix-vector multiplication engine. Blocks: Matrix Memory (A), Vector Memory (x), Vector Memory (y), Tiled DMA Engine, Universal Format Decoder, Gaxpy Control, Gaxpy PIPE 0 to Gaxpy PIPE 3 with PEs operating on rows]


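\[
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
A_{00} & 0 & 0 & A_{03} \\
0 & 0 & A_{12} & 0 \\
0 & A_{21} & 0 & 0 \\
0 & 0 & A_{32} & 0 \\
0 & A_{41} & 0 & A_{43}
\end{bmatrix}
\times
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
\]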

𝒚 = 𝑨𝒙


[Figure: block diagram of the sparse matrix-vector architecture. Labeled blocks: Universal Format Decoder, Gaxpy Control, Tiled DMA Engine, Vector Memory (y), and Gaxpy PIPE 0 to Gaxpy PIPE 3, each built from PEs with its own Private Cache (x) and fed with Rows. Arrows are labeled "A data streams".]


Matrix:

1 0 4 0
3 7 0 0
0 0 2 9
5 8 0 7

Nonzeros per row:

1 4
3 7
2 9
5 8 7

Data Array: 1 4 3 7 2 9 5 8 7


COO Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × (𝟒𝑩 + 𝟒𝑩)

Data Array:   1 4 3 7 2 9 5 8 7

Row Index:    0 0 1 1 2 2 3 3 3

Column Index: 0 2 0 1 2 3 0 1 3
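
As a minimal sketch of how the COO arrays above are consumed, the following Python computes y = Ax directly from the three arrays and evaluates the overhead formula from the slide. The function names and the input vector `x` are hypothetical and chosen only for illustration; the 4-byte index size is the one used in the slide's formula.

```python
# COO arrays copied from the slide (4x4 example matrix).
data = [1, 4, 3, 7, 2, 9, 5, 8, 7]
row_index = [0, 0, 1, 1, 2, 2, 3, 3, 3]
col_index = [0, 2, 0, 1, 2, 3, 0, 1, 3]


def coo_matvec(data, row_index, col_index, x, n_rows):
    """Compute y = A x from COO triples (value, row, column)."""
    y = [0] * n_rows
    for v, r, c in zip(data, row_index, col_index):
        y[r] += v * x[c]
    return y


def coo_overhead_bytes(nonzeros, index_bytes=4):
    """Slide formula: Nonzeros x (4B + 4B), one row and one column index each."""
    return nonzeros * (index_bytes + index_bytes)


x = [1, 1, 1, 1]  # hypothetical input vector
y = coo_matvec(data, row_index, col_index, x, n_rows=4)
overhead = coo_overhead_bytes(len(data))
```

With the all-ones vector each entry of `y` is simply the sum of one matrix row, which makes the result easy to check by hand against the example matrix.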


CSR Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × 𝟒𝑩 + 𝑹𝒐𝒘𝒔 × 𝟒𝑩

Data Array:   1 4 3 7 2 9 5 8 7

Row Pointer:  0 2 4 6 9

Column Index: 0 2 0 1 2 3 0 1 3
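
The CSR arrays above can be derived from the dense example matrix and used for y = Ax roughly as follows. This is an illustrative sketch, not the decoder from the talk; the function names and the test vector are assumptions.

```python
def dense_to_csr(A):
    """Build CSR arrays (data, column index, row pointer) from a dense matrix."""
    data, col_index, row_ptr = [], [], [0]
    for row in A:
        for c, v in enumerate(row):
            if v != 0:
                data.append(v)
                col_index.append(c)
        # row_ptr[r + 1] marks where row r ends in data/col_index
        row_ptr.append(len(data))
    return data, col_index, row_ptr


def csr_matvec(data, col_index, row_ptr, x):
    """Compute y = A x, one row at a time, using the row pointer ranges."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += data[k] * x[col_index[k]]
        y.append(acc)
    return y


# Example matrix from the slides.
A = [[1, 0, 4, 0],
     [3, 7, 0, 0],
     [0, 0, 2, 9],
     [5, 8, 0, 7]]
data, col_index, row_ptr = dense_to_csr(A)
y = csr_matvec(data, col_index, row_ptr, [1, 2, 3, 4])  # hypothetical x
```

Note that the row pointer produced here has one entry more than the number of rows, matching the five values shown on the slide for a four-row matrix.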


Data w/ Padding:

1 4 *
3 7 *
2 9 *
5 8 7

Column Metadata:

0 2 *
0 1 *
2 3 *
0 1 3

ELL Overhead = 𝟒𝑩 × 𝑹𝒐𝒘𝒔 × 𝒌 + 𝑫𝒂𝒕𝒂𝑷𝒂𝒅 × 𝟖𝑩

k=3
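
A small sketch of the ELL layout, assuming that padding slots (shown as `*` on the slide) are represented by `None` and that k is the largest nonzero count of any row. The helper names are hypothetical; the overhead function just evaluates the slide's formula.

```python
PAD = None  # stands in for the '*' padding slots on the slide


def dense_to_ell(A):
    """Return (values, cols, k): every row is padded to k entries."""
    k = max(sum(1 for v in row if v != 0) for row in A)
    values, cols = [], []
    for row in A:
        nz = [(c, v) for c, v in enumerate(row) if v != 0]
        pad = k - len(nz)
        values.append([v for _, v in nz] + [PAD] * pad)
        cols.append([c for c, _ in nz] + [PAD] * pad)
    return values, cols, k


def ell_matvec(values, cols, x):
    """Compute y = A x, skipping padded slots."""
    return [sum(v * x[c] for v, c in zip(vrow, crow) if v is not PAD)
            for vrow, crow in zip(values, cols)]


def ell_overhead_bytes(rows, k, data_pad):
    """Slide formula: 4B x Rows x k + DataPad x 8B."""
    return 4 * rows * k + data_pad * 8


A = [[1, 0, 4, 0],
     [3, 7, 0, 0],
     [0, 0, 2, 9],
     [5, 8, 0, 7]]
values, cols, k = dense_to_ell(A)
y = ell_matvec(values, cols, [1, 1, 1, 1])  # hypothetical x
data_pad = sum(row.count(PAD) for row in values)
overhead = ell_overhead_bytes(len(A), k, data_pad)
```

For the example matrix three slots are padding, so the cost of ELL grows with the gap between the longest row and the typical row.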


Bit Vector (BV)

Data Array: 1 4 3 7 2 9 5 8 7

Bit Vector: 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1

BV Overhead = 𝑹𝒐𝒘𝒔 × 𝑪𝒐𝒍𝒔 × 𝟏𝒃𝒊𝒕
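
The bit-vector format can be sketched as one presence bit per matrix position in row-major order plus the packed data array. The code below is an illustration under that reading of the slide; the function names are hypothetical.

```python
def dense_to_bv(A):
    """Return (data, bits): bits[i] is 1 where the row-major entry is nonzero."""
    data, bits = [], []
    for row in A:
        for v in row:
            bits.append(1 if v != 0 else 0)
            if v != 0:
                data.append(v)
    return data, bits


def bv_to_dense(data, bits, n_cols):
    """Rebuild the dense matrix by consuming one data value per set bit."""
    values = iter(data)
    flat = [next(values) if b else 0 for b in bits]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


A = [[1, 0, 4, 0],
     [3, 7, 0, 0],
     [0, 0, 2, 9],
     [5, 8, 0, 7]]
data, bits = dense_to_bv(A)
overhead_bits = len(A) * len(A[0])  # slide formula: Rows x Cols x 1 bit
restored = bv_to_dense(data, bits, n_cols=4)
```

The round trip through `bv_to_dense` shows that the data array and the bit vector together are enough to recover every position without explicit indices.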


Compressed BV (CBV)

Data Array: 1 4 3 7 2 9 5 8 7

Bit Vector (BV): 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1

Compressed BV (CBV): 1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1

32-bit “zero” fields

CBV Overhead = 𝑵𝒐𝒏𝒛𝒆𝒓𝒐𝒔 × 𝟏𝒃𝒊𝒕 + 𝒁𝒆𝒓𝒐𝑪𝒍𝒖𝒔𝒕𝒆𝒓𝒔 × 𝟑𝟐𝒃𝒊𝒕
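
The run-length step that turns a bit vector into a CBV can be sketched as below. Each run of zeros becomes a `(0, run_length)` pair and each nonzero stays a single `1`, following the notation on the slide. The function names are hypothetical, and the overhead function simply evaluates the slide's formula rather than modelling an actual bit layout.

```python
def compress_bv(bits):
    """Run-length encode zero clusters: 1 stays 1, a run of n zeros becomes (0, n)."""
    out, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(1)
    if run:  # flush a trailing zero cluster
        out.append((0, run))
    return out


def cbv_overhead_bits(cbv, zero_field_bits=32):
    """Slide formula: Nonzeros x 1 bit + ZeroClusters x 32 bit."""
    nonzeros = sum(1 for token in cbv if token == 1)
    zero_clusters = sum(1 for token in cbv if token != 1)
    return nonzeros + zero_clusters * zero_field_bits


bits = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
cbv = compress_bv(bits)
overhead = cbv_overhead_bits(cbv)
```

On this tiny 16-position example the fixed 32-bit zero fields cost more than the plain bit vector; the format only pays off when zero clusters are long, which is the usual case for large sparse matrices.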


Compressed Variable BV (CVBV)

Data Array: 1 4 3 7 2 9 5 8 7

Bit Vector (BV): 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1

CVBV: 1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1

4-bit header + {4,8,…,32}-bit zero field

CVBV overhead is input-dependent
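
To see why the CVBV overhead depends on the input, the sketch below estimates it for a given bit vector. The slide gives only the field sizes, so two things here are assumptions made for illustration: each zero cluster is charged a 4-bit header plus the smallest field from {4, 8, ..., 32} bits that can hold its run length, and each nonzero is charged 1 bit as in CBV. The function names are hypothetical.

```python
def zero_field_bits(run_length):
    """Smallest field width in {4, 8, ..., 32} bits that can hold run_length (assumed rule)."""
    for width in range(4, 33, 4):
        if run_length < (1 << width):
            return width
    raise ValueError("run too long for a 32-bit field")


def zero_runs(bits):
    """Lengths of the maximal runs of zeros in the bit vector."""
    runs, run = [], 0
    for b in bits:
        if b == 0:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs


def cvbv_overhead_bits(bits, header_bits=4):
    """Estimated overhead: 1 bit per nonzero + (header + variable field) per zero cluster."""
    nonzeros = sum(bits)
    return nonzeros + sum(header_bits + zero_field_bits(r) for r in zero_runs(bits))


bits = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
runs = zero_runs(bits)
overhead = cvbv_overhead_bits(bits)
```

Under these assumptions short zero clusters cost 8 bits instead of the fixed 32 bits of CBV, so the saving depends on how run lengths are distributed in the input.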


[Figure: block diagram of the architecture with the Universal Format/CVBV Decoder. Labeled blocks: Gaxpy Control, Vector Memory (y), and Gaxpy PIPE 0 to Gaxpy PIPE 3, each with its own private cache (x) and fed with Rows. Arrows are labeled "A data streams".]


Specify matrix format descriptors: fixed/variable length, padding, index/ptr, etc.

Translates row/column into sequence #

Generate *BV (reduce storage/BW)

Generate modified COO (consumed by PEs): row index, nonzero count per row, col indices
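
The last step above can be pictured as turning a sparse format into one record per row. The sketch below does this for CSR input; the slide lists only the three fields, so the tuple layout `(row index, nonzero count, column indices)` and the function name are assumptions for illustration, not the decoder's actual output format.

```python
def csr_to_row_records(col_index, row_ptr):
    """Emit one (row index, nonzero count, column indices) record per row."""
    records = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        records.append((r, end - start, col_index[start:end]))
    return records


# CSR arrays of the 4x4 example matrix from the earlier slides.
col_index = [0, 2, 0, 1, 2, 3, 0, 1, 3]
row_ptr = [0, 2, 4, 6, 9]
records = csr_to_row_records(col_index, row_ptr)
```

Grouping the column indices by row in this way means a consumer needs only the record and the matching slice of the data array to accumulate one output element.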


Design             PEs   LUT (% area)   RAM (% area)   DSP (% area)   GFLOPs (peak)   GFLOPs (off-chip)   BW (% peak)
Dense V5-LX155T    16    72%            86%            88%            3.1             0.92                64.7
Dense V6-LX240T    32    71%            63%            56%            6.4             1.14                80
Dense+Sparse V5    16    74%            87%            91%            3.1             -                   -

Sparse inputs (each cell: GFLOPS / BW used)

Input      V5-LX155T (Ours)   HC-1 (32 PE) [1]   Tesla S1070 [2]
dw8192     0.10 / 10.3%       1.7 / 13.2%        0.5 / 3.1%
t2d_q9     0.15 / 14.4%       2.5 / 19.3%        0.9 / 5.7%
epb1       0.17 / 17.1%       2.6 / 20.2%        0.8 / 4.9%
raefsky1   0.20 / 18.5%       3.9 / 29.0%        2.6 / 15.3%
psmigr_2   0.20 / 18.6%       3.9 / 29.6%        2.8 / 16.7%
torso2     0.04 / 4.0%        1.2 / 9.1%         3.0 / 18.3%

[1] Nagar et al., A Sparse Matrix Personality for the Convey HC-1, FCCM’11

[2] Bell et al., Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors, SC’09


We defined CVBV/CBV sparse format

25% reduction in storage/bandwidth compared to well-known CSR

Exploits bit-level manipulation of FPGA

Single bit file for dense AND sparse MVM

Universal matrix format decoder

DMA and caches for memory management

Stall-free accumulator

Scalable design, implemented on multiple platforms


Demonstrated HW as SW library replacement

Bottom up approach (Time consuming/not scalable)

Pros: Common computation architecture and input insensitive

Customized memory architecture

Compressed data representations

Other energy efficiency tools to exploit

Cons: Time consuming: requires designers


Moving beyond manycore and GPUs

Need tools to automate HW/SW co-design

Granularity?

Algorithm specification?

HW building blocks?

IP Integration issues

Future FPGA Architecture?

More custom building blocks?

Software/OS support

Fast communication and synchronization

Accelerators as 1st class building blocks




[Figure: block diagram of the inference-engine architecture, in four sections.

CPU Communication: Communication Buffer TX (To CPU) and Communication Buffer RX (From CPU), BRAM 36*1K.

Dispatch Unit: Demux FIFO, Overflow Buffer (DRAM), 16-Entry FIFOs (distributed RAM), Demux Bus (ibus.vhd).

Parallel Inference Engine Clusters (16 x 4, engines #1 to #64): each Inference Engine contains Clause Index Walk (ieng.vhdl), Literal Value Inference (ieng.vhdl), Walk Table RAM, 18 * 2K BRAM (ieng_ram.vhdl), and Clause Status Table RAM, 18 * 1K BRAM (ieng_ram.vhdl). An Inference Result Multiplexer (ibus.vhd) collects the results.

Conflict Detection: Conflict Inference Detection, 2-stage Pipeline (conflict_detect.vhdl); Global Variable Status, 2 * 8K bits BRAM (xN) (Conflict_detect.vhdl); Literal to Variable (External RAM).

Other annotations: 36 x 2 bits, 18 * 4 bits, 36 / 36*2 bit (Enqueue / to CPU), 18 * 1K FIFO (x4), 36*1K; signals Enqueue, New Decision/Undo, Undo; example assignments X1=1, X3=1, X5=0.]


[Figure: bar chart of converted CPU cycles (y-axis, 0 to 14000) for 3-SAT, 4-SAT, 5-SAT and 6-SAT. Series: CPU, FPGA (HT), FPGA (PCIe).]

BCP is 6.7 to 38.6 times faster than the conventional software-based approach


BCP is 5 to 16 times faster than the conventional software-based approach

[Figure: bar chart of converted CPU cycles per implication (y-axis, 0 to 3500). Series: Software, FPGA (tree in BRAM), FPGA (tree in BRAM and distributed RAM).]

