Date post: 29-Aug-2020
Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor. Tuan Tu Tran, Simon Schindel, Yongchao Liu, Bertil Schmidt. Institut für Informatik, Johannes Gutenberg – University of Mainz, Germany
Page 1

Bit-Parallel Approximate Pattern Matching on the Xeon Phi Coprocessor

Tuan Tu Tran, Simon Schindel, Yongchao Liu, Bertil Schmidt

Institut für Informatik

Johannes Gutenberg – University of Mainz

Germany

Page 2

Outline

• Introduction

• Bit-parallel matching on the Xeon Phi Coprocessor

  – Vectorization with 512-bit wide vector registers

  – Data-parallelism on the many-core coprocessor

• Performance Evaluation

• Conclusions and Perspectives

SBAC-PAD 2014

Page 4

Approximate Pattern Matching

“Given a pattern P of length m, a text T of length n over an alphabet Σ, and a constant k, find all substrings of T whose edit distance to P is at most k.”

The Levenshtein edit distance: substitution, deletion and insertion

Example: T = ACTGCAT, P = CTGA, k = 1
Matched with:
• 1 deletion: CTG (in ACTGCAT)
• 1 substitution: CTGC (in ACTGCAT)
• 1 insertion: CTGCA (in ACTGCAT)
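The problem statement can be checked on the slide's example with a plain dynamic-programming reference, independent of any bit-parallel trick. The sketch below is not from the slides: the function name `approx_match_ends` and the 63-character pattern limit are assumptions made for illustration. It reports every 1-based text position at which some substring within edit distance k of P ends.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reference helper (not from the slides).
 * Writes to ends[] the 1-based positions j of T at which a substring
 * within edit distance k of P ends; returns how many were written.
 * Patterns longer than 63 characters are rejected to avoid allocation. */
size_t approx_match_ends(const char *T, const char *P, int k,
                         size_t *ends, size_t max_ends)
{
    size_t n = strlen(T), m = strlen(P), count = 0;
    int col[64];                      /* one column of the DP matrix */

    if (m > 63) return 0;
    for (size_t i = 0; i <= m; ++i) col[i] = (int)i;

    for (size_t j = 1; j <= n; ++j) {
        int diag = col[0];            /* D[i-1][j-1] for i = 1 */
        col[0] = 0;                   /* a match may start anywhere in T */
        for (size_t i = 1; i <= m; ++i) {
            int up_left = diag;
            diag = col[i];            /* keep D[i][j-1] for the next row */
            int best = up_left + (P[i - 1] != T[j - 1]); /* match or substitution */
            if (diag + 1 < best) best = diag + 1;        /* extra text character */
            if (col[i - 1] + 1 < best) best = col[i - 1] + 1; /* skipped pattern character */
            col[i] = best;
        }
        if (col[m] <= k && count < max_ends) ends[count++] = j;
    }
    return count;
}
```

Called with T = "ACTGCAT", P = "CTGA" and k = 1, this sketch reports three end positions (4, 5 and 6), one for each of the three matches listed in the example.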

Page 5

Bit-parallel Approximate Pattern Matching with the prefix automaton

Bit-parallelism:

• Encodes calculated values into a machine word, seen as a bit array;

• Allows for simultaneous updates of multiple values by a single bit operation;

• Directly simulates all states of the NFA (Wu-Manber algorithm [Wu and Manber, 1992]);

• Limited by the size of machine words.
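To illustrate how a single bit operation updates many states at once, here is a minimal sketch of the exact-matching Shift-And scheme on one 64-bit word. It is a simplified relative of the Wu-Manber algorithm named above (no errors allowed), and the function name `shift_and_count` is an assumption for this example. The 64-character limit shows the "size of machine words" restriction directly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts exact occurrences of P in T with Shift-And.
 * Bit i of `state` is set when P[0..i] matches the text ending at the
 * current character. Returns -1 if P does not fit in one 64-bit word. */
long shift_and_count(const char *T, const char *P)
{
    size_t m = strlen(P);
    uint64_t mask[256] = {0};   /* mask[c] has bit i set iff P[i] == c */
    uint64_t state = 0, accept;
    long count = 0;

    if (m == 0 || m > 64) return -1;
    for (size_t i = 0; i < m; ++i)
        mask[(unsigned char)P[i]] |= (uint64_t)1 << i;
    accept = (uint64_t)1 << (m - 1);

    for (const char *t = T; *t; ++t) {
        /* one shift, one OR and one AND advance all m automaton states */
        state = ((state << 1) | 1) & mask[(unsigned char)*t];
        if (state & accept) ++count;
    }
    return count;
}
```

For instance, `shift_and_count("ACTGCATCTG", "CTG")` finds the two occurrences of CTG while touching each text character once.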

Page 6

Xeon Phi Architecture

• A coprocessor running Linux

• Connected to a Host CPU via PCIe

• Runs in either “native” or “offload” mode

• 1 GHz clock

• 8 GB GDDR5 RAM

• 60 cores (4 threads per core)

Page 7

Xeon Phi Architecture

• Cores interconnected by a high-speed bidirectional ring;

• 512-KB L2-Cache per core

– High-speed access to all other L2 caches;

– Cache coherent across the entire processor;

• Four hardware threads per core

• 512-bit wide vector registers in addition to 64-bit x86

– 16 x 32-bit Integer or Single Precision Floating Point values

– 8 x 64-bit Integer or Double Precision Floating Point values

• Vectorize-and-scale approach to achieve high performance

Intel Xeon Phi architecture (image courtesy of Intel Corporation)

Page 8

Motivations and Related work

Motivations:
• Usage of Xeon Phi vector registers → matching with patterns longer than a machine word;
• Parallelization on the massive number of cores → approximate pattern matching on large texts.

Related work:
• Usage of CPU vector registers for bit-parallel matching algorithms: [Külekci, 2009], [Faro and Külekci, 2012], [Fredriksson, 2003]
• Implementation of the Wu-Manber algorithm on GPU: [Li et al., 2011], [Tran et al., 2012]
• Implementation of the Myers bit-parallel pattern matching algorithm on GPU: [Chacón et al., 2014]

Page 10

Notations

• A text T of length n
• A pattern P of length m
• An alphabet Σ
• A maximal edit distance k
• A pattern bitmask B:
  – |Σ| rows
  – B[a][i + 1] = 1 if and only if p_i = a (a ∈ Σ)
• A bit array R:
  – k + 1 rows
  – Representation of the matching NFA
  – Once R_{i,j} (0 ≤ i ≤ k; 1 < j ≤ m) is active, the prefix p_1 p_2 … p_j is recognized with i errors

Page 11

The Wu-Manber algorithm

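Initialization:

$$R_j^{[0]} = 0^{m-j}\,1^{j} \qquad (0 \le j \le k)$$

For each $t = T_i$ $(1 \le i \le n)$:

$$R_0^{[i]} = \bigl((R_0^{[i-1]} \ll 1) \mid 0^{m-1}1\bigr) \mathbin{\&} B[t]$$

$$R_j^{[i]} = \underbrace{\bigl((R_j^{[i-1]} \ll 1) \mathbin{\&} B[t]\bigr)}_{\text{match}} \mid \underbrace{R_{j-1}^{[i-1]}}_{\text{insertion}} \mid \underbrace{(R_{j-1}^{[i-1]} \ll 1)}_{\text{substitution}} \mid \underbrace{(R_{j-1}^{[i]} \ll 1)}_{\text{deletion}} \qquad (1 \le j \le k)$$

Check for a match at position $i$: $R_k^{[i]} \mathbin{\&} 10^{m-1} \neq 0$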

Computational complexity: Ο(n.w)
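The update step of the Wu-Manber algorithm can be sketched on a single 64-bit word as follows. This is an illustrative sketch, not the authors' code: the name `wu_manber_count`, the limit `WM_MAX_K`, and the 64-character pattern limit are assumptions. One further assumption is labeled in the code: the lowest bit is set after every shift so that a match with errors can begin at any text position.

```c
#include <stdint.h>
#include <string.h>

#define WM_MAX_K 16   /* hypothetical limit chosen for this sketch */

/* Hypothetical helper: counts the text positions at which P matches with
 * at most k errors. Returns -1 if the inputs exceed the sketch's limits. */
long wu_manber_count(const char *T, const char *P, int k)
{
    size_t m = strlen(P);
    uint64_t B[256] = {0};          /* B[c] has bit i set iff P[i] == c */
    uint64_t R[WM_MAX_K + 1];       /* R[j]: states reachable with <= j errors */
    uint64_t accept;
    long count = 0;

    if (m == 0 || m > 64 || k < 0 || k > WM_MAX_K || (size_t)k >= m)
        return -1;

    for (size_t i = 0; i < m; ++i)
        B[(unsigned char)P[i]] |= (uint64_t)1 << i;
    accept = (uint64_t)1 << (m - 1);

    /* R[j] starts with its j lowest bits set (j deleted pattern characters) */
    for (int j = 0; j <= k; ++j)
        R[j] = ((uint64_t)1 << j) - 1;

    for (const char *t = T; *t; ++t) {
        uint64_t b = B[(unsigned char)*t];
        uint64_t old_prev = R[0];   /* R[j-1] before this text character */
        R[0] = ((R[0] << 1) | 1) & b;
        for (int j = 1; j <= k; ++j) {
            uint64_t old_cur = R[j];
            /* assumption of this sketch: every shift also sets bit 0 */
            R[j] = (((old_cur << 1) | 1) & b)   /* match */
                 | old_prev                     /* insertion */
                 | ((old_prev << 1) | 1)        /* substitution */
                 | ((R[j - 1] << 1) | 1);       /* deletion, uses updated R[j-1] */
            old_prev = old_cur;
        }
        if (R[k] & accept) ++count;
    }
    return count;
}
```

On the running example (T = "ACTGCAT", P = "CTGA", k = 1) the sketch reports three matching end positions, and with k = 0 it reports none, since CTGA does not occur exactly in the text.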

Page 12

Example

ACTGCAT
CTGA (deletion)

ACTGCAT
CTGCA (insertion)

ACTGCAT
CTGA (substitution)

Page 13

Extended version of the Wu-Manber algorithm

• Vectorization of the bit-wise operations:
  – AND
  – OR
  – SHIFT_LEFT

• Efficient check for a match

[Equations: the slide repeats the Wu-Manber initialization and update recurrences for R (match, insertion, substitution, deletion) and the final check for a match at position i.]

Page 14

Vectorization with 512-bit wide registers

• Use of a union: flexible switching between the intrinsic data type (__m512i) and an array of 16 integers

#define REG_NUM 16
union m512 {
    __m512i m512;
    unsigned int v[REG_NUM] __attribute__((aligned(64)));
};

• Intrinsic bit-wise functions: the elements within the vector are processed independently
  – Bit-wise AND: _mm512_and_epi32
  – Bit-wise OR: _mm512_or_epi32
  – Bit-wise SHIFT_LEFT (A <<= 1):
    · The leftmost bit of v(i+1) becomes the rightmost bit of v(i)
    · Combination of 4 intrinsic functions: _mm512_alignr_epi32, _mm512_srli_epi32, _mm512_slli_epi32, _mm512_mask_or_epi32

[Figure: the 512-bit shift A <<= 1 on the elements v0, v1, …, v15. _mm512_alignr_epi32 builds B from v1, v2, …, v15; _mm512_srli_epi32 shifts the elements of B right by 31; _mm512_slli_epi32 shifts the elements of A left by 1; _mm512_mask_or_epi32 combines both with OR.]
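The carry propagation behind the 512-bit SHIFT_LEFT can be sketched in portable C, without the intrinsics, so that it compiles on any machine. The function name `shl1_512` is an assumption; element 0 is taken as the most significant element, following the slide's statement that the leftmost bit of v(i+1) becomes the rightmost bit of v(i). The comments only indicate which intrinsic each step roughly corresponds to.

```c
#include <stdint.h>

#define REG_NUM 16

/* Hypothetical portable stand-in for the four-intrinsic combination:
 * shifts the 512-bit value held in a[0..15] left by one bit, where a[0]
 * is the most significant 32-bit element. */
void shl1_512(uint32_t a[REG_NUM])
{
    uint32_t carry[REG_NUM];
    int i;

    /* take each element's right-hand neighbour and keep only its top bit
     * (roughly the alignr step followed by the right shift by 31) */
    for (i = 0; i < REG_NUM - 1; ++i)
        carry[i] = a[i + 1] >> 31;
    carry[REG_NUM - 1] = 0;          /* nothing enters the last element */

    /* shift every element left by 1 and OR in the incoming bit
     * (roughly the left shift step followed by the OR step) */
    for (i = 0; i < REG_NUM; ++i)
        a[i] = (a[i] << 1) | carry[i];
}
```

With only the top bit of a[15] set, one call moves that bit into the lowest position of a[14]; the top bit of a[0] is shifted out and lost.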

Page 15

Auto-vectorization

• Uses an array of 16 uints to simulate a 512-bit machine word;

• Uses for-loops with directives:
  – simd assert
  – vector aligned
  – ivdep

…
/* save the leftmost bit of each element */
#pragma ivdep
#pragma simd assert
for (i = 1; i < REG_NUM; ++i) B[i-1] = (A[i] >> 31);
/* shift A left by 1 position */
#pragma vector aligned
#pragma simd assert
for (i = 0; i < REG_NUM; ++i) A[i] <<= 1;
…

Page 16

Data parallelism on the many-core coprocessor

• Given: a pattern P, a collection of texts {Ti}

• The searches for P in any two texts Ti and Tj (i ≠ j) can be performed in parallel

• Three multi-threaded versions by OpenMP

• wmIntr: Xeon Phi, intrinsic data and functions

• wmAutoVec: Xeon Phi, array of 16 uints, automatic vectorization

• wmHost: multicore CPU, array of 16 uints
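The data-parallel scheme, one independent search per text, can be sketched with a single OpenMP loop. Everything named here is hypothetical: `match_collection` is an illustrative driver, and `count_occurrences` is a deliberately simple exact-matching stand-in for the per-text matcher, used only to keep the sketch short. Without OpenMP support the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-text matcher: counts exact,
 * possibly overlapping occurrences of P in T. */
static long count_occurrences(const char *T, const char *P)
{
    long c = 0;
    if (*P == '\0') return 0;
    for (const char *p = strstr(T, P); p != NULL; p = strstr(p + 1, P))
        ++c;
    return c;
}

/* Hypothetical driver: searches P in every text independently and
 * returns the total number of hits; hits[i] receives the count for
 * texts[i]. */
long match_collection(const char *const *texts, size_t n_texts,
                      const char *P, long *hits)
{
    long total = 0;
    long i;

    /* each iteration touches only its own text and its own hits[i] */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (i = 0; i < (long)n_texts; ++i) {
        hits[i] = count_occurrences(texts[i], P);
        total += hits[i];
    }
    return total;
}
```

Dynamic scheduling is chosen in this sketch because texts may differ in length; with equally sized texts a static schedule would serve just as well.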

Page 18

Testing Environment and Data

• Intel Xeon Phi 5110P
  – 60 cores x 1.053 GHz
  – 8 GB RAM
• Host:
  – Intel Xeon E5-2670: 8 cores x 2.6 GHz
  – 64 GB RAM
• Compiler: Intel icc with -O3 option
• Data:
  – Human chromosome 21 (chr21)
  – Texts: 32x or 128x of chr21 (1.1 GB or 4.3 GB)
  – Pattern of length 511, extracted from chr21

• Serial time to evaluate speedups: wmHost with one thread

Page 19

Scalability with the number of cores

• Scales well with the number of cores of the Xeon Phi

• wmIntr is superior to wmAutoVec

[Figures: wmIntr and wmAutoVec; the numbers of threads are multiples of 59]

Page 20

Scalability with the Levenshtein distance

Shows the advantage of using the intrinsic SIMD data types and functions of the Xeon Phi

Page 21

Scalability of wmIntr

Linear increase with:
• The Levenshtein distance
• The size of input texts

Page 22

Comparisons to related work

• Our work: approximate matching with patterns longer than common machine words (32 or 64 bits), using the Wu-Manber algorithm.
• [Külekci, 2009], [Faro and Külekci, 2012]: exact matching.
• [Fredriksson, 2003], [Chacón et al., 2014]: the Myers algorithm
  – Independent of the maximal edit distance (k);
  – Not easy to extend to matching with wildcards and regular expressions.
• [Li et al., 2011], [Tran et al., 2012]: focus on pattern lengths smaller than common machine words.

Our performance is therefore not directly comparable with that of the related work above.

Page 24

Conclusions and Perspectives

Conclusions
• Simulation of long machine words on the Intel Xeon Phi architecture;
• Extended implementation of the Wu-Manber algorithm;
• Multi-threaded versions of bit-parallel approximate pattern matching:
  – Long patterns
  – High Levenshtein distance
  – Large target texts

• The source code can be downloaded at: http://xbitpar.sourceforge.net/

Page 25

Conclusions and Perspectives (cont.)

Perspectives
• Matching with wildcards and regular expressions
• Mapping onto CUDA-enabled GPUs (SIMD feature of a “warp”)
• Preprocessing step in bioinformatics sequencing applications
  – Fast filtering
  – Seeding
• Other bit-parallel matching algorithms, such as the Myers algorithm.

• Other bit-parallel applications, such as finding the longest common subsequence (LCS).

Page 26

Thank you for your attention!

Page 27

References

[Wu and Manber, 1992] S. Wu and U. Manber, “Fast Text Searching Allowing Errors,” Communications of the ACM, vol. 35, no. 10, pp. 83–91, 1992.

[Li et al., 2011] H. Li, B. Ni, M. H. Wong, and K.-S. Leung, “A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching,” in SASP, 2011, pp. 74–77.

[Külekci, 2009] M. O. Külekci, “Filter Based Fast Matching of Long Patterns by Using SIMD Instructions,” in Stringology, 2009, pp. 118–128.

[Faro and Külekci, 2012] S. Faro and M. O. Külekci, “Fast Multiple String Matching Using Streaming SIMD Extensions Technology,” in Proceedings of the 19th International Conference on String Processing and Information Retrieval, ser. SPIRE’12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 217–228.

[Fredriksson, 2003] K. Fredriksson, “Row-wise Tiling for the Myers’ Bit-Parallel Approximate String Matching Algorithm,” in String Processing and Information Retrieval, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2003, vol. 2857, pp. 66–79.

[Tran et al., 2012] T. T. Tran, M. Giraud, and J.-S. Varré, “Bit-Parallel Multiple Pattern Matching,” in Parallel Processing and Applied Mathematics, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, vol. 7204, pp. 292–301.

[Chacón et al., 2014] A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure, “Thread-cooperative, Bit-parallel Computation of Levenshtein Distance on GPU,” in Proceedings of the 28th ACM International Conference on Supercomputing, ser. ICS ’14. New York, NY, USA: ACM, 2014, pp. 103–112.
