Low-Power Vectorial VLIW Architecture for Maximum Parallelism Exploitation of Dynamic Programming Algorithms
Miguel Tairum Cruz
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Nuno Filipe Valentim Roma
Dr. Pedro Filipe Zeferino Tomás
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Nuno Filipe Valentim Roma
Members of the Committee: Dr. João Paulo de Castro Canas Ferreira
October 2014
Acknowledgments
The work presented herein was partially supported by national funds through Fundação para a
Ciência e Tecnologia (FCT) under project Threads (ref. PTDC/EEA-ELC/117329/2010).
First and foremost, I would like to thank my parents and closest friends for their continued support
and motivation. I owe a huge debt of gratitude to my supervisors, Professors Nuno Roma and Pedro
Tomás, for their continued support, guidance and motivation. A very special thanks goes to my colleague
Nuno Neves from the INESC-ID Signal Processing Systems group, for without his work mine would
not have been possible. I would also like to thank my colleague João Luís Furtado from IST for his help
and insight into my work.
Abstract
Dynamic Programming algorithms are widely used in many areas to divide a complex problem into
several simpler, but mutually dependent, sub-problems. Typical approaches exploit data level parallelism
by relying on specialized vector instructions. However, the fully-parallelizable scheme is often
not compliant with the memory organization of general purpose processors, leading to sub-optimal
parallelism and worse performance. The proposed architecture exploits both data and instruction level
parallelism, by statically scheduling a bundle of instructions to several different vector execution units.
This achieves better performance than vector-only architectures, with lower hardware requirements
and thus lower power consumption. Accordingly, performance and energy efficiency metrics were used
to benchmark the proposed architecture against a dual-issue, low-power ARM Cortex-A9, a multiple-
issue, out-of-order, high-performance Intel Core i7 and a dedicated ASIP architecture. In a fair
comparison where all processors compute 128-bit vectors (or equivalent), the results show that the proposed
architecture can achieve up to 5.53x, 1.12x and 2.35x better performance-energy efficiency than the
ARM Cortex-A9, the Intel i7 and the dedicated ASIP, respectively, and a performance improvement of
up to 4.34x, 5.01x and 1.12x over the ARM, the dedicated ASIP and the Intel i7, respectively, for
the evaluated algorithm implementations.
Keywords
Dynamic Programming, Data Level Parallelism, Instruction Level Parallelism, VLIW, Low-power
Resumo
Os algoritmos de programação dinâmica são bastante usados em várias áreas, dividindo um problema
complexo em múltiplos sub-problemas mais simples, com várias dependências entre si. As abordagens
típicas exploram o paralelismo dos dados através de instruções vetoriais. No entanto, nos
processadores de uso geral, devido à organização da memória existente, não é possível paralelizar
completamente estes problemas eficientemente, resultando em piores desempenhos. A arquitetura
proposta explora tanto a paralelização dos dados como das instruções, agendando estaticamente um
conjunto de instruções para várias unidades de execução diferentes. Isto permite alcançar um melhor
desempenho que as arquiteturas vetoriais, reduzindo os requisitos de hardware e levando a um menor
consumo de energia. Foram utilizadas métricas de desempenho e eficiência energética a fim de referenciar
a arquitetura proposta contra um ARM Cortex-A9 (com duplo-agendamento de instruções e baixo
consumo), um Intel Core i7 (com agendamento múltiplo e alto desempenho) e uma arquitetura ASIP
dedicada. Através de uma comparação justa com vetores de 128 bits, os resultados obtidos mostram
que a arquitetura proposta consegue alcançar uma relação de desempenho e eficiência energética
até 5,53x, 1,12x e 2,35x melhor que o ARM Cortex-A9, o Intel i7 e o ASIP dedicado, respetivamente.
Em termos de desempenho, a arquitetura proposta atinge resultados 4,34x, 5,01x e 1,12x superiores
aos do ARM, do ASIP dedicado e do Intel i7, respetivamente, para as implementações dos algoritmos
avaliados.
Palavras Chave
Programação dinâmica, Paralelização de dados, Paralelização de instruções, VLIW, Baixo consumo
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Dynamic Programming 5
2.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3.A Profile Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3.B Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Comparison between profile HMMs and single alignment algorithms . . . . . . . . 15
2.3 Implementation of DP Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Data Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Instruction Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 State of the Art Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3.A Programmable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3.B Non-Programmable Architectures . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed VLIW Architecture 23
3.1 Architecture Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Register Banks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Functional Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Instruction Set Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 DP Algorithm Implementations 41
4.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Viterbi (Profile HMMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Prototyping and Evaluation 53
5.1 Hardware Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Reference State-of-the-art Architectures . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 Application Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2.A Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2.B Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3.A Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3.B Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Performance and Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2 Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6 Conclusions and Future Work 69
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A Proposed Architecture Instruction Set 77
B Viterbi Pseudo-code 79
List of Figures
2.1 Example of the NW algorithm and its respective traceback phase . . . . . . . . . . . . . . 8
2.2 Example of the SW algorithm and its respective traceback phase . . . . . . . . . . . . . . 9
2.3 Example of a Consensus Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Example of the construction of a Profile HMM . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 HMM for the optimal gapped global alignment . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Profile HMM for unihit local alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Profile HMM for multihit local alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Trellis diagram for a sequence of three observations in the Viterbi algorithm . . . . . . . . 14
2.9 Example of DP Cell parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Comparison between a substitution score matrix with and without query profiling . . . . . 19
2.11 State of the art on SIMD implementations of the SW algorithm . . . . . . . . . . . . . . . 19
3.1 FU access comparison between Vector and VLIW architectures . . . . . . . . . . . . . . . 25
3.2 Example of two iterations of a DP algorithm in the proposed architecture . . . . . . . . . . 26
3.3 Execution Units with the respective independent register banks . . . . . . . . . . . . . . . 27
3.4 Register banks depicting the sniffing mechanism and the shared memory registers . . . . 28
3.5 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 4-Stage pipeline structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Processor DLP and ILP scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8 FU conflict control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 Instruction words for the bundle and the composing units . . . . . . . . . . . . . . . . . . 33
3.10 Register Window example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 AXI Interconnection scheme between the RAM and the local fast memory in the proposed
architecture core and the GPP in the PS. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 AXI Interconnection scheme between the instruction memory in the proposed architecture
core and the GPP in the PS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.13 Interface scheme for the proposed architecture core . . . . . . . . . . . . . . . . . . . . . 39
4.1 SW processing scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Inner loop operations for the SW algorithm in the proposed architecture . . . . . . . . . . 44
4.3 Inner loop and outer loop operations for the SW algorithm in the proposed architecture . . 45
4.4 Critical section example for the SW algorithm implementation in the proposed architecture 46
4.5 Comparison of example profiles for the HMMER platform and the proposed architecture . 48
4.6 Inner loop operations for the Viterbi algorithm in the proposed architecture . . . . . . . . . 49
4.7 Outer loop operations of the Viterbi algorithm in the proposed architecture . . . . . . . . . 50
4.8 Critical section example for the Viterbi algorithm implementation in the proposed architecture 51
5.1 Hardware scalability of the proposed architecture . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Stripped pattern processing scheme and correspondent dependencies . . . . . . . . . . . 57
5.3 Comparison of the average number of clock cycles for the SW algorithm implementation . 61
5.4 Performance evaluation results for the SW algorithm implementation . . . . . . . . . . . . 62
5.5 Comparison of the average number of clock cycles for the Viterbi algorithm implementation 63
5.6 Performance evaluation results for the Viterbi algorithm implementation . . . . . . . . . . 64
5.7 Performance and energy evaluation results obtained for the SW algorithm implementation 65
5.8 Performance and energy evaluation results obtained for the Viterbi algorithm implementation 66
A.1 Full implemented instruction set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
B.1 Complete pseudo-code for the Viterbi implementation in the proposed architecture. . . . . 80
List of Tables
3.1 Abridged implemented instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1 Hardware resources, operating frequency and power estimation for the proposed archi-
tecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Average number of clock cycles for the SW algorithm when implemented in the considered
execution platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Average number of clock cycles for the Viterbi algorithm when implemented in the consid-
ered execution platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Operation frequency and power estimation for all evaluation architectures . . . . . . . . . 65
List of Acronyms
AMBA Advanced Microcontroller Bus Architecture
ASIC Application-Specific Integrated Circuit
ASIP Application-Specific Instruction-set Processor
AVX Advanced Vector Extension
AXI Advanced eXtensible Interface
DLP Data Level Parallelism
DP Dynamic Programming
DSU Data Stream Unit
FPGA Field-Programmable Gate Array
FU Functional Unit
GPP General Purpose Processor
GPU Graphics Processing Unit
HMM Hidden Markov Model
ILP Instruction Level Parallelism
IPC Instructions Per Cycle
ISA Instruction Set Architecture
MIMD Multiple-Instruction Multiple-Data
NW Needleman-Wunsch
PE Processing Element
PL Programmable Logic
PS Processing System
SIMD Single-Instruction Multiple-Data
SOC System On Chip
SSE Streaming SIMD Extension
SW Smith-Waterman
TLP Thread Level Parallelism
VLIW Very Long Instruction Word
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Dynamic Programming (DP) is a common methodology for solving complex problems, by dividing
them into smaller sub-problems that are simpler to solve. DP is applied in a vast set of application
domains: in bioinformatics, for sequence alignment problems; in Hidden Markov Models (HMMs), to
find a sequence of hidden states in the model; and in parsing algorithms, to name a few.
Due to this problem partitioning, DP methods can exploit Data Level Parallelism (DLP) to concur-
rently compute several sub-problems and thus increase the performance of the algorithms, as long as
those sub-problems remain independent from one another. This parallelism paradigm is commonly exploited
through vectorial extensions to the instruction set, which are offered by most of today's General Purpose
Processor (GPP) architectures.
Over the years, processors also started to introduce Instruction Level Parallelism (ILP) with
pipelined and superscalar architectures. However, ILP extraction in DP algorithms represents a harder
optimization problem than DLP extraction, and thus other strategies were explored.
With the appearance of multi-processor architectures, a new type of parallelization gained the spot-
light: Thread Level Parallelism (TLP). With it, greater performance can be achieved by executing multiple
threads and/or programs at the same time, in a single or in multiple cooperating processors. The latter
configuration usually requires less energy and power per processing unit than a single high-end GPP.
This attribute is particularly desirable, since DP applications often require high performance in a
low-power environment.
More dedicated architectures, often implemented as Application-Specific Integrated Circuits (ASICs),
can also better meet the application requirements, when compared to GPPs. However, they are not
flexible enough to support algorithmic changes and are much more expensive. Implementations in
Field-Programmable Gate Arrays (FPGAs) (e.g. Application-Specific Instruction-set Processors (ASIPs))
arise as an intermediate solution, by representing a trade-off between the flexibility of the overall less
expensive GPPs and the higher performance of the ASICs, filling the architectural spectrum between
the two [1].
1.1 Motivation
DP applications tend to be very computationally demanding, involving large datasets and often requiring
high performance in a low-power environment. In bioinformatics, applications related to biological
sequence processing operate on sequence banks that grow larger at a very fast rate [2]. As an example,
the GenBank release from February 2013 contains 150×10⁹ base pairs from over 260,000 formally
described species [3]. The optimal solutions for most of these problems are obtained by applying DP
methods, such as the Smith-Waterman (SW) algorithm [4] or the Needleman-Wunsch (NW) algorithm [5],
for local and global sequence alignment, respectively. Given the implied computational demands and
large data sets, these algorithms often require long runtimes in GPPs, frequently leading to the
adoption of less precise, heuristic-based solutions, such as BLAST [6] and FASTA [7]. However, these
alternatives are less accurate and thus not optimal.
In the field of HMMs, the same paradigm is also observed. The Viterbi algorithm [8], a DP algorithm
used to find the most probable state sequence in an HMM, presents high computational
demands in GPPs. Therefore, heuristic implementations such as HMMER [9] are usually preferred.
The main problem with the current DP implementations in GPPs is that they are constrained by the
existing Instruction Set Architecture (ISA), limiting the range of optimization methods. Looking back at the
work done throughout the years on the SW sequence alignment algorithm, some algorithmic
changes were initially proposed by Gotoh [10] and later by Wozniak [11], who proposed a DLP scheme.
Further down the line, Farrar [12] and Rognes [13] improved on Wozniak's work, optimizing the DLP.
The most recent addition for GPPs was made by Rognes [14], introducing multi-core processing to the
algorithm.
Due to their implementations on GPPs, none of these works focused on ILP extraction which, when
accompanied by DLP, can provide an increase in both performance and energy efficiency. The configurability
of FPGAs can thus provide the perfect environment to envisage such an architecture. Furthermore, this
architecture would not be limited to an existing ISA, granting the potential to devise a processor much
more efficient than GPPs, based on different architectural paradigms (e.g., Very Long Instruction Word
(VLIW)), with added support for different families of DP algorithms.
1.2 Objectives
This thesis aims at the development of a novel programmable processor architecture to be implemented
on an FPGA (or as an ASIC). The processor should have an ISA particularly optimized to compute different
families of DP algorithms, with a focus on sequence alignment algorithms such as the SW and the Viterbi
algorithm for profile HMMs.
The objective is to design the architecture from scratch, exploiting both DLP and ILP to ensure maximum
performance with minimal power consumption, and to design its ISA so as to guarantee DP
compatibility together with high programmability (i.e., supporting general instructions, similarly to GPP
instruction sets), in order to target low-power systems like biochips (for bioinformatic DP algorithms) or
other embedded systems. It is also expected to tackle the bottlenecks of the currently available
implementations, not only avoiding them but finding better solutions to them, and to implement the
architecture.
A thorough performance and energy-efficiency evaluation will be conducted, comparing the proposed
architecture to a set of state-of-the-art architectures from different domains: i) a mobile, low-power GPP;
ii) a high-performance GPP; and iii) a programmable ASIP.
Hence, this work hopes to fill the gap between GPPs and dedicated architectures, by providing an
alternative that is more flexible than most existing dedicated implementations, has a low power
consumption, and is still faster than most GPP software implementations.
1.3 Main Contributions
Based on the evaluated sequence alignment DP algorithms, the SW [4] and Viterbi [8], and a careful
analysis of the state-of-the-art implementations of both algorithms (Farrar [12] and Rognes [13] for the
SW algorithm and HMMER [9] for the Viterbi algorithm), a novel VLIW architecture was developed,
providing a versatile and low-power platform for DP algorithms.
The ISA was designed to exploit both DLP and ILP, not only to increase the performance
of the implemented algorithms, but also to reduce the hardware requirements and guarantee a better
hardware usage, leading to a low power consumption. Furthermore, the adoption of a VLIW architecture
allowed the addition of a special unit that seamlessly accesses the memory in parallel with the algorithm
computations, while fully exploiting DLP, thus eliminating the memory limitations often present in GPP
implementations of these types of algorithms, such as Wozniak's [11] SW implementation.
This culminated in a low power architecture with several independent execution units, working at an
operating frequency of 98.5 MHz, while still providing high performance computing for DP algorithms.
As a result of the developed research, a manuscript reporting the main contributions has already
been published at the HPCS 2014 international conference:
• Miguel Tairum Cruz, Pedro Tomás and Nuno Roma. Low-Power Vectorial VLIW Architecture for
Maximum Parallelism Exploitation of Dynamic Programming Algorithms, In International Confer-
ence on High Performance Computing & Simulation (HPCS 2014), pp. 88-95, Bologna, Italy, July
2014.
1.4 Dissertation Outline
This document is structured in six chapters. The current chapter, Chapter 1, introduces the developed
work. Chapter 2 reviews the DP paradigm, using existing DP algorithms as examples, as well
as their corresponding implementations in state-of-the-art architectures; it also presents the parallelism
paradigms that are widely used in DP implementations. Chapter 3 describes the proposed architecture,
starting by mapping the DP requirements and then fully depicting the complete resulting design.
Chapter 4 details the DP algorithm implementations in the proposed architecture, focusing on the
evaluated SW and Viterbi algorithms. Chapter 5 presents the prototyping of the proposed architecture,
as well as the evaluations conducted with it and with the reference state-of-the-art architectures; it
details the algorithm implementations for the remaining architectures and comments on the obtained
test results. Finally, Chapter 6 concludes the thesis, discussing the obtained results and providing an
analysis of the devised architecture, addressing its advantages and drawbacks, as well as the open
research directions that can be pursued on top of the proposed architecture.
2 Dynamic Programming
Contents
2.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Comparison between profile HMMs and single alignment algorithms . . . . . . . . 15
2.3 Implementation of DP Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Data Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Instruction Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 State of the Art Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Dynamic Programming Algorithms
DP is an algorithmic methodology for solving complex problems, by dividing them into smaller sub-
problems that are simpler to solve. If these sub-problems are solvable and the optimal solution for
each sub-problem is found, the solution for the main problem can be realized through the sequence
of solutions of its sub-problems. This property is known as the optimal substructure property [15], and
problems that present it can be solved by DP. Another property that a problem must present to be solvable
by a DP approach is that the space of sub-problems must be "small", in the sense that a recursive algorithm
for the problem solves the same sub-problems over and over, rather than always generating new sub-
problems. Contrary to plain recursive solutions, DP takes advantage of these overlapping sub-problems by
solving each sub-problem only once and then storing its solution. If the solution is later required, it
can be looked up instead of recomputed. DP thus uses additional memory to save computation time,
resulting in a time-memory tradeoff, where the savings often allow an exponential-time solution
to be transformed into a polynomial-time one.
There are usually two equivalent DP approaches that can be implemented: top-down with memoization
and bottom-up. The first uses a recursive method, storing the intermediate result of each sub-problem
and returning the saved value when it is required again (memoization), thus saving further computations
at the given recursive level. The second orders the sub-problems by size and solves them smallest first.
Each sub-problem is solved only once, with the guarantee that all the prerequisite (and smaller)
sub-problems have already been solved. These two approaches yield algorithms with the same asymptotic
running times, with the bottom-up approach often having much better constant factors, since it has less
procedure-call overhead.
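As a short illustration of these two approaches, the classic Fibonacci recurrence (an example outside the scope of the alignment algorithms discussed here, used only for clarity) can be written both ways in Python:

```python
from functools import lru_cache

# Top-down with memoization: recursive, but each sub-problem is
# solved once and its result cached for later lookups.
@lru_cache(maxsize=None)
def fib_top_down(n: int) -> int:
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

# Bottom-up: solve sub-problems in increasing size order, so every
# prerequisite is already available when needed (no call overhead).
def fib_bottom_up(n: int) -> int:
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev
```

Both versions turn the exponential-time naive recursion into a linear-time computation, at the cost of the extra memory used by the cache (top-down) or by the carried partial results (bottom-up).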
DP algorithms are often represented in matrix form, where each cell corresponds to a sub-problem
depending on the adjacent cells (sub-problem dependencies). This results in a final matrix where the
last cell can only be computed after all the previous cells have been computed (optimal substructure
property). This representation allows for multiple independent cells to be processed in parallel, thus
increasing the performance.
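The matrix-form parallelism described above can be made concrete with a small scheduling sketch. Assuming the typical left, top and top-left cell dependencies, all cells on the same anti-diagonal are mutually independent and can be computed concurrently (the helper name `antidiagonal_schedule` is hypothetical, for illustration only):

```python
def antidiagonal_schedule(n: int, m: int):
    """Group the cells of an n-by-m DP matrix by anti-diagonal.

    With left, top and top-left dependencies, every cell of one
    anti-diagonal depends only on cells of the two previous
    anti-diagonals, so each group can be processed in parallel.
    """
    return [
        [(i, k - i) for i in range(max(0, k - m + 1), min(n, k + 1))]
        for k in range(n + m - 1)
    ]
```

For a 2x3 matrix this yields the wavefront [(0,0)], [(0,1),(1,0)], [(0,2),(1,1)], [(1,2)]: the amount of available parallelism grows and then shrinks as the wavefront sweeps the matrix.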
These algorithms are used in a wide variety of problems (matrix chain multiplication, sequence
alignment, optimal binary search trees, shortest paths, among others), as long as those problems present
optimal substructure and overlapping sub-problems. The following sections detail the specific DP problems
and respective algorithms that were studied and used throughout the work.
2.2 Sequence Alignment
Bioinformatic applications have an essential role in molecular biology and related fields. Sequence
alignment algorithms, like the NW [5] or the SW [4], use DP methods to search for similarities between
DNA or protein sequences within large databases (e.g., GenBank/EMBL/DDBJ [2]).
Depending on the type of alignment that is required, two DP-based algorithms can be used: the
SW, which outputs a local alignment; and the NW, which outputs a global alignment for any
two given sequences. A local alignment represents a region of greater similarity between the compared
sequences and is preferred when the query sequence (sequence to compare to a database) is smaller
than the database sequence. The global alignment method, on the other hand, spans the entire query
sequence in attempt to align every symbol in the sequence with the whole database sequence. This is
useful when comparing sequences of about the same size, that are known to be similar (DNA or protein
sequences with similar functions).
Besides the NW and the SW, there are also other sequence alignment algorithms based on HMMs.
HMMs are stochastic models of processes in which the future states depend only on the present state,
and not on the complete sequence of states that preceded it. In addition, the states are hidden from
the observer, who only has information about the observed outputs that were generated by the hidden
sequence of states.
In particular, the Viterbi algorithm [8] is a DP algorithm used to solve HMM problems, returning the
most probable state sequence that originated the observed sequence of outputs. Although belonging to a
different family of DP algorithms, Viterbi shares many properties with the sequence alignment algorithms
mentioned before (NW and SW) [16].
Although the alignment algorithms mentioned above produce optimal alignments (global or
local), there are other commonly used tools in the field based on faster heuristic approaches (instead of
DP approaches) with reduced complexity, implemented in GPPs. Some examples are the BLAST [6],
FASTA [7] and HMMER ([16], [9]) tools. However, these tools can only guarantee a good approximate
alignment, not always the best one, often requiring a later pass of a more complex DP algorithm (like
SW or Viterbi) for better results.
2.2.1 Needleman-Wunsch Algorithm
The NW algorithm [5] is a DP algorithm for computing the global alignment between a query sequence
and a database reference sequence. The resulting score represents the best alignment between the compared
sequences (a query sequence Q of size n and a database sequence D of size m) and is based on
a substitution score matrix Sm (which defines the scores given to substitution mutations), a gap penalty
α (a negative score given to an insertion or deletion mutation) and a recurrence relation
that computes the resulting score matrix H (see equation (2.2)). This algorithm takes O(nm) time to
complete.
H(i, 0) = α · i
H(0, j) = α · j        (2.1)

H(i, j) = max{ H(i−1, j−1) + Sm(qi, dj),  H(i−1, j) + α,  H(i, j−1) + α }        (2.2)
From the equations above, it can be seen that each cell in the resulting H matrix has three depen-
dencies in its computation: the cell at its left position (horizontal dependency); the cell at its top position
(vertical dependency); and the cell at its top-left position (diagonal dependency). The scores given by
the vertical and horizontal dependencies are penalized by the gap cost α and correspond to an insertion
or deletion in the alignment. The score given by the diagonal dependency is incremented by the substitution
score Sm(qi, dj) and corresponds to a match or mismatch in the alignment. The maximum of these
three values will be the final cell value. Figures 2.1 (a) and (b) show an example of the NW
algorithm for two small DNA sequences.
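A minimal sketch of the score-matrix computation in equations (2.1)-(2.2); the sequences, gap penalty and match/mismatch scores follow the example of figure 2.1 and are purely illustrative:

```python
# Minimal NW score-matrix fill (equations (2.1)-(2.2)).
# Gap penalty alpha = -1, match = 2, mismatch = -1, as in figure 2.1.
def nw_matrix(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # H(i,0) = alpha * i
        H[i][0] = alpha * i
    for j in range(m + 1):          # H(0,j) = alpha * j
        H[0][j] = alpha * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,    # diagonal: match/mismatch
                          H[i - 1][j] + alpha,    # vertical: gap
                          H[i][j - 1] + alpha)    # horizontal: gap
    return H

H = nw_matrix("ACC", "CACT")   # H[3][4] holds the global alignment score
```

With the sequences of figure 2.1 the final cell H[3][4] evaluates to 2, matching the alignment [ -ACC : CACT ] (one gap, two matches, one mismatch).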
(a) First iteration of the NW algorithm. (b) Last iteration of the NW algorithm. (c) First iteration of the traceback phase. (d) Last iteration of the traceback phase.
Figure 2.1: Example of the NW algorithm ((a) and (b)) and its respective traceback phase ((c) and (d)), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -ACC : CACT ].
After the H matrix is computed, the last cell entry (Hn,m) holds the maximum score among all
possible alignments. To obtain the actual alignment, a traceback algorithm starting at this maximum
score cell is performed (see figures 2.1 (c) and (d)). This traceback algorithm compares the three de-
pendencies of the cell currently being visited, to determine which one of them was the source of the current
cell result. The chosen cell then becomes part of the alignment sequence and the traceback algorithm
repeats this process for the chosen cell. When the first cell of the H matrix (H0,0) is reached, the
traceback ends.
Different alignment sequences can be found whenever there is more than one possible cell to choose
from during the traceback. This happens when, during the score computation in the NW algorithm, there
is more than one maximum result in the main recursion, i.e., the cell has more than one source.
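The traceback phase just described can be sketched as follows (a self-contained illustration that re-fills the H matrix with the same illustrative scores as figure 2.1; when a cell has more than one source, this sketch simply prefers the diagonal one):

```python
# NW traceback sketch: walk back from H(n,m), choosing at each step the
# dependency that produced the current cell, until H(0,0) is reached.
def nw_traceback(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    # Re-fill the score matrix (equations (2.1)-(2.2)).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        H[i][0] = alpha * i
    for j in range(m + 1):
        H[0][j] = alpha * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + alpha,
                          H[i][j - 1] + alpha)
    aligned_q, aligned_d = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and query[i - 1] == db[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            aligned_q.append(query[i - 1]); aligned_d.append(db[j - 1])
            i, j = i - 1, j - 1                      # diagonal source
        elif i > 0 and H[i][j] == H[i - 1][j] + alpha:
            aligned_q.append(query[i - 1]); aligned_d.append('-')
            i -= 1                                   # vertical source: gap in D
        else:
            aligned_q.append('-'); aligned_d.append(db[j - 1])
            j -= 1                                   # horizontal source: gap in Q
    return ''.join(reversed(aligned_q)), ''.join(reversed(aligned_d))
```

For the sequences of figure 2.1 this recovers the alignment [ -ACC : CACT ].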
The NW algorithm is rarely used when the sequences under comparison have different sizes, since
the resulting alignment would be dominated by gaps. This happens because the global alignment tries
to align whole sequences, while the local alignment only tries to align similar regions, thus performing
much better with different sequence sizes. Since different-sized sequences are commonly used, the NW
algorithm does not get as much exposure as the SW algorithm, resulting in fewer implementations for it.
2.2.2 Smith-Waterman Algorithm
The SW algorithm is a DP algorithm for computing the optimal local alignment score between a
query and a reference sequence. The resulting score represents the degree of similarity between the
sequences and, similarly to the NW algorithm, it is based on a substitution score matrix and a gap-
penalty function. The algorithm was proposed by Smith and Waterman [4] and was later improved by
Gotoh [10] to support affine gap penalties, having an O(nm) time complexity, where n and m are the
sizes of the query (Q) and reference (D) sequences, respectively.
Given a substitution score matrix Sm, a negative gap-open penalty α and a negative gap extension
penalty β, the score matrix H can be computed by the following recursive relations:
Hi,j = max { 0 ; Ei,j ; Fi,j ; Hi−1,j−1 + Sm(qi, dj) }        (2.3)

H(i, 0) = H(0, j) = 0
The terms Ei,j and Fi,j are defined in equations (2.4) and (2.5), respectively. Ei,j corresponds to the
scores ending with a gap in the reference sequence (horizontal dependency), while Fi,j corresponds
to the scores ending with a gap in the query sequence (vertical dependency). Accordingly, Hi,j repre-
sents the local alignment score involving the first i symbols of Q and the first j symbols of D (diagonal
dependency).
Ei,j = max { Ei,j−1 + β ; Hi,j−1 + α }        (2.4)

E(i, 0) = E(0, j) = 0

Fi,j = max { Fi−1,j + β ; Hi−1,j + α }        (2.5)

F(i, 0) = F(0, j) = 0
These relations are very similar to the NW algorithm. In fact, each cell still has the three dependen-
cies in its computation (horizontal, vertical and diagonal) with the horizontal and vertical dependencies
representing insertions or deletions in the alignment, and the diagonal dependency representing a match
or mismatch between the sequence symbols.
The only major difference is that the H cell values cannot go below zero. As a result, the
maximum cell value in the H matrix is not necessarily at the last position (Hn,m), as in the NW
algorithm. Figures 2.2 (a) and (b) show an example of the SW algorithm for two small DNA sequences.
(a) First iteration of the SW algorithm. (b) Last iteration of the SW algorithm. (c) First iteration of the traceback phase. (d) Last iteration of the traceback phase.
Figure 2.2: Example of the SW algorithm ((a) and (b)) and its respective traceback phase ((c) and (d)), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -AC : CAC ].
The traceback phase (see figures 2.2 (c) and (d)) then starts at the H matrix cell with the
highest score value, and continues along the sources of each considered cell until it
reaches a zero-valued cell, instead of stopping only at the first position of the H matrix (H0,0), as in the
NW algorithm.
These two differences, in the score computation and in the traceback parts of the algorithm, result
in the most similar region between the two compared sequences, i.e., the local alignment.
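The affine-gap recursion of equations (2.3)-(2.5) can be sketched as follows (a minimal illustration; the gap-open/extend penalties and match/mismatch scores are illustrative assumptions, not values from the text):

```python
# SW local alignment score with affine gaps (equations (2.3)-(2.5)).
# alpha = gap-open penalty, beta = gap-extend penalty (both negative).
def sw_score(query, db, alpha=-2, beta=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # H(i,0) = H(0,j) = 0
    E = [[0] * (m + 1) for _ in range(n + 1)]   # gaps in the reference
    F = [[0] * (m + 1) for _ in range(n + 1)]   # gaps in the query
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + beta, H[i][j - 1] + alpha)
            F[i][j] = max(F[i - 1][j] + beta, H[i - 1][j] + alpha)
            s = match if query[i - 1] == db[j - 1] else mismatch
            # H is clamped at zero, the defining difference from NW.
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + s)
            best = max(best, H[i][j])           # maximum may be anywhere
    return best
```

Note the two SW-specific traits: the `max(0, ...)` clamp and the fact that the best score is tracked over the whole matrix rather than read from Hn,m.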
2.2.3 Hidden Markov Models
A Markov model is a stochastic model where the future states of a process depend only on
the present state, and not on the complete sequence of states that preceded it. This particular property
can be expressed by equation (2.6) for a given state sequence {w1, w2, ..., wn} [18].
P(w1, ..., wn) = ∏_{i=1}^{n} P(wi | wi−1)        (2.6)
A particular Markov model is the Hidden Markov Model (HMM) [19] where some (or all) states are
hidden from the observer. In a HMM, the observer has only the information regarding the sequence of
outputs that were generated by a hidden sequence of states.
An alternative mathematical expression for the HMM can be deduced by applying Bayes’ rule for a
given state sequence {w1, w2, ..., wn} and an output (observations) sequence {u1, u2, ..., un}:
P(w1, ..., wn | u1, ..., un) = P(u1, ..., un | w1, ..., wn) P(w1, ..., wn) / P(u1, ..., un)        (2.7)

P(u1, ..., un | w1, ..., wn) = ∏_{i=1}^{n} P(ui | wi)        (2.8)
where P(w1, ..., wn) is the probability of a given state sequence, P(u1, ..., un) is the prior probability of
seeing a particular sequence of outputs, P(u1, ..., un | w1, ..., wn) is the probability of observing the outputs
for a particular state sequence, and P(w1, ..., wn | u1, ..., un) is the probability of the state sequence given
the observed outputs, which is the quantity the HMM aims to find.
Two different tasks, with different outputs, can be performed on HMMs: decoding and generation.
The first outputs the path of states that is most likely to have generated a given output sequence, together
with its corresponding probability. The second outputs the likelihood of a given sequence being
generated by the model. The decoding task is computed by the Viterbi algorithm [8], which computes the
most probable state to generate each new output observation, for all the available states. The generation
task is computed by a similar algorithm, the Forward algorithm, which calculates a progressive sum of
the probabilities of all previous state paths for each new observation, resulting in a final probability
that is the sum of the final probabilities of all states.
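The Forward recursion just described can be sketched as follows (a hedged illustration: like Viterbi, but summing over predecessor states instead of taking the maximum; the parameter layout follows the conventions listed later for Viterbi's inputs, and any concrete values are illustrative):

```python
# Forward algorithm sketch: total likelihood of an observation sequence.
# pi[k] = initial probability of state k; T[x][k] = transition x -> k;
# Q[y][k] = probability of emitting output y from state k.
def forward(obs, pi, T, Q):
    states = range(len(pi))
    v = [pi[k] * Q[obs[0]][k] for k in states]   # first observation
    for y in obs[1:]:
        # Progressive sum over all previous state paths (not a max).
        v = [Q[y][k] * sum(T[x][k] * v[x] for x in states) for k in states]
    return sum(v)                                 # sum over all final states
```

For a fully uniform 2-state model every 2-symbol sequence has likelihood 0.25, which is a quick sanity check of the recursion.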
Although the above description of HMMs refers to a single alignment process (one sequence
against another), the algorithms mentioned above are used in real applications for searching similar
sequences in a database, and thus require a method to search and compare a group of sequences
against a database (instead of only one). This is achieved by creating alignment profiles, which
highlight the common features of a family's sequences and effectively model an entire sequence family (see
figure 2.3). These profiles are usually generated by an initial multiple alignment, followed by a proba-
bilistic breakdown of the elements present in each position.
With alignment profiles, a query can now be compared against a family of sequences (a profile), thus
greatly reducing the computational cost. Furthermore, a profile gives a more accurate representation of
the defining characteristics of a family, by weighing the elements in proportion to their actual frequency
(and thus importance) in the underlying family.
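A minimal sketch of how such a profile can be derived from a multiple alignment (the counting scheme is illustrative; the input rows are the aligned sequences shown in figure 2.3):

```python
# Build a per-column symbol-count profile and its consensus line from a
# multiple alignment (the sequences of figure 2.3).
alignment = ["ATCCAGCT", "GGGCAACT", "ATGGATCT",
             "AAGCAACC", "ATGCCATT", "ATGGCACT"]
symbols = "ACGT"

# profile[s][c] = how many sequences carry symbol s at column c.
profile = {s: [col.count(s) for col in zip(*alignment)] for s in symbols}

# Consensus: the most frequent symbol of each column.
consensus = "".join(max(symbols, key=col.count) for col in zip(*alignment))
```

Dividing each count by the number of sequences turns these counts into the relative frequencies used as emission probabilities in a Profile HMM.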
Alignment:
    A T C C A G C T
    G G G C A A C T
    A T G G A T C T
    A A G C A A C C
    A T G C C A T T
    A T G G C A C T

Profile:
    A  5 1 0 0 5 5 0 0
    C  0 0 1 4 2 0 6 1
    G  1 1 6 3 0 1 0 0
    T  1 5 0 0 0 1 1 6

Consensus: A T G C A A C T

Figure 2.3: Example of a Consensus Profile, derived from a multiple alignment of a family of similar sequences.
Since the Viterbi algorithm gives the most likely path of states to generate a given sequence and
its corresponding probability, it is suitable for computing sequence alignment problems. The Forward
algorithm, however, can only indicate the likelihood of the query sequence belonging to a family of se-
quences. For this reason, and given that both algorithms are very similar, our work will focus on the Viterbi
algorithm.
2.2.3.A Profile Hidden Markov Models
As previously mentioned, HMMs can be used to statistically model the distribution of sequence elements
in a profile, by taking the probability of each element in each position of the family's sequences as
the emission probability of each state. Thus, a Profile HMM can be used to compute the probability of
database sequences being generated by a given query, i.e., to align the query sequence to a database.
The construction of the Profile HMM starts by modeling a global alignment (with no gaps)
as a succession of consecutive match states, where each state corresponds to a column of the
profile sequences. Each match state is also accompanied by emission probabilities, since match states
emit the alignment symbols. These probabilities are derived from the relative frequencies of the symbols in
the family's sequences, at each column (see figure 2.4 (a)).
Then, insertion states are added to the model to represent gaps, i.e., portions of sequences that
do not match anything in the previous model (with only the match states). Since insertions can occur
at any point in the model, there is an Insert state for every pair of Match states (see figure 2.4 (b)).
(a) Example of a HMM composed solely of matching states, allowing for ungapped global alignment. (b) Example of a HMM that allows arbitrary insertions.
Figure 2.4: Example of the construction of a Profile HMM, starting with the match states (a) and with the addition of insert states (b), with the respective state connections.
Furthermore, in order to support the affine gap model, the insert states must also have a self-loop to allow
for long inserted regions. The probability of entering an Insert state for the first time and the probability of
staying in it can also be different and, since they are arbitrary, they are usually set equal to the background
probabilities of the profile.
Finally, Deletion states are added to the model. These states represent portions of the profile that
are not matched by the sequence, and thus do not emit any output symbol. Naturally, the Delete - Delete
state transition corresponds to gap-extend costs, thus completing the Profile HMM model [20] (see figure
2.5).
Figure 2.5: HMM for the optimal gapped global alignment (additional transitions from insert states to delete states, and vice-versa, are included for the sake of correctness, although these transitions are usually very improbable and have a negligible effect).
The profile HMMs can also be extended to support local alignment. This can be done by adding
two special flanking states that delimit the sub-region of the local alignment (States B and E), and two
self-looping flanking states (states N and C) that precede or follow the flanking states [19] (see figure
2.6). The flanking regions correspond to the unmatched regions of the aligning sequence and so, in
order to capture a local alignment, it is only required to add these two regions as self-looping states
with transitions from and to each match state. These new states also emit tokens with a probability
distribution, which can be set to the background random distribution of the profile.
Finally, in order to support multihit alignments, i.e., multiple local alignments, another special state
is required (state J), which connects the flanking states B and E. This new state has jump and loop
probabilities, in order to cover the unmatched region between two local alignments (see figure 2.7).
Figure 2.6: Profile HMM for unihit local alignment.
Figure 2.7: Profile HMM for multihit local alignment.
2.2.3.B Viterbi Algorithm
As previously stated, Viterbi’s algorithm [8] is a DP algorithm that finds the most likely sequence path
of hidden states in a HMM (or a Profile HMM), for a given sequence of observed outputs. The required
inputs for the algorithm are:
• State Space (S): Vector with all the possible states.
• Observation Space (O): Vector with all the observed outputs.
• Observation Sequence (Yi): Vector with the sequence of observed outputs.
• Initial Probabilities (πi): Vector of initial state probabilities.
• Transition probabilities (Ti,j): Matrix of transition probabilities from state i to state j.
• Emission probabilities (Qi,j): Matrix with the probabilities of observing the output i given the state j.
Given the inputs above, the algorithm can compute the most probable state sequence {x1, ..., xT }
that originated the observed outputs {y1, ..., yT }, by using the following recursive relations:
Vt,k = Qyt,k × πk                                    if t = 1
Vt,k = Qyt,k × max_{x∈S}(Tx,k × Vt−1,x)              if t ≠ 1        (2.9)
where Vt,k is the probability of the most likely hidden state sequence, responsible for the first t observa-
tions, having k as its final state.
From the relations in equation 2.9, it can be seen that during the first iteration of the algorithm (t = 1),
the probability of the first hidden state only depends on the initial probabilities and on the emission matrix,
since there are no previous states yet.
For the remaining iterations (t ≠ 1), the computation for every hidden state k (which will become
xt after the computation) requires, for every state x in S, the probability of its preceding
state xt−1, as well as the emission probability Q of observing the output given the state k, and the
transition probability T of passing from the preceding state xt−1 to the one currently being computed (k).
This yields a time complexity of O(TS2). Figure 2.8 shows an example of the Viterbi algorithm for a
HMM with 2 states.
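A minimal sketch of equation (2.9), including the back pointers needed for the traceback discussed below; T, Q and π follow the input conventions listed above (Tx,k from state x to state k, Qy,k for output y given state k), and any concrete model values are illustrative:

```python
# Viterbi sketch (equation (2.9)) with back pointers for the traceback.
def viterbi(obs, pi, T, Q):
    S = range(len(pi))
    V = [[Q[obs[0]][k] * pi[k] for k in S]]      # t = 1: initial * emission
    back = [[0] * len(pi)]
    for y in obs[1:]:                            # t != 1
        prev = V[-1]
        row, ptr = [], []
        for k in S:
            # Most probable predecessor of state k.
            x = max(S, key=lambda x: T[x][k] * prev[x])
            row.append(Q[y][k] * T[x][k] * prev[x])
            ptr.append(x)
        V.append(row)
        back.append(ptr)
    # Traceback: follow the stored back pointers from the best final state.
    k = max(S, key=lambda k: V[-1][k])
    path = [k]
    for t in range(len(obs) - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return list(reversed(path)), max(V[-1])
```

For a deterministic toy model (state 0 always emits output 0, state 1 always emits output 1, and the two states strictly alternate), the observation sequence [0, 1, 0] decodes to the path [0, 1, 0] with probability 1.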
Figure 2.8: Trellis diagram for a sequence of three observations in the Viterbi algorithm. Each state has the corresponding probability value. The T and Q values are not depicted. The red path corresponds to the most probable state sequence.
From a DP perspective, the main problem can be seen as finding the full sequence of hidden states,
while the sub-problems are the computations of the probabilities of every hidden
state along the sequence, since they only depend on their previous state.
Since the algorithm only stores information regarding the previous state, a back pointer to the pre-
ceding state should also be kept, in order to retrieve the final state sequence path at the end of the
computation (traceback).
2.2.4 Comparison between profile HMMs and single alignment algorithms
The Viterbi algorithm, used for profile HMMs, is very similar to the SW algorithm, presented previ-
ously. In fact, when using profile HMMs for solving sequence alignment problems, both algorithms have
the same recursive dependencies, with little differences. The following equations represent the applica-
tion of the Viterbi algorithm using a notation similar to the one that was adopted for the SW algorithm.
Mi,j = log eMj(xi) + max { Bi−1 + log tB,Mj ;
                           Mi−1,j−1 + log tMj−1,Mj ;
                           Ii−1,j−1 + log tIj−1,Mj ;
                           Di−1,j−1 + log tDj−1,Mj }        (2.10)

Di,j = max { Mi,j−1 + log tMj−1,Dj ; Di,j−1 + log tDj−1,Dj }        (2.11)

Ii,j = max { Mi−1,j + log tMj,Ij ; Ii−1,j + log tIj,Ij }        (2.12)
where the M, D and I state values correspond to the H, E and F values in the SW equations,
respectively. The B state corresponds to the special state that was previously explained, and is
omitted here for comparison purposes.
The above equations are represented in log-space, not only to eliminate the multiplications, but also
to provide better accuracy. The terms log eMj(xi) and log tXj,Yj correspond to the emission scores and
to the transition scores from Xj to Yj, respectively, which are pre-computed scores already present in the
profile. The transition values can be compared to the gap scores in the SW algorithm, but with an added
delay, since they depend on the position-specific transition states, requiring a previous look-up. Additionally,
the M emission values roughly correspond to the substitution score matrix in the SW algorithm, varying
according to the current Match state Mj and sequence symbol. Thus, they are already in a model-
specific profile and can be re-used between sequences, essentially acting like a Query-Specific Profile
for the SW algorithm.
Apart from the differences stated above, the M state requires the Ii−1,j−1 and Di−1,j−1 states,
instead of the I and D states computed in the current iteration, as happens for SW in equation
(2.3). This increases the amount of registers and memory required for the dependency values,
since all dependencies must now be stored to be used in a later iteration. This leads to a
reorganization of the algorithm's computations, with delayed loads and stores of the M state values,
where the dependencies required for the M state are only updated after the new M state values are
computed.
2.3 Implementation of DP Algorithms
When solving a problem using a DP-based approach, it is first necessary to decompose it into a set
of smaller sub-problems. This translates into computing the value of each cell in an n-dimensional matrix
by relying on the values of pre-computed adjacent cells. Given the usually large size of these matrices,
it is of the utmost importance to implement additional methods to parallelize and speed up DP algorithms.
Since the sub-problems are usually independent, DLP is often employed in order to maximize the
number of independent cells (sub-problems) computed in each iteration of the algorithms. Additionally,
ILP can also be used together with DLP in order to minimize the hardware impact brought by DLP, achieving
better performance with lower power usage. In current architectures, however, it is often not possible
to reconcile both parallelism paradigms to their full extent, given memory access constraints or other pre-
existing structural architectural designs.
The following sections will cover both the DLP and ILP paradigms, as well as some state-of-the-art
architectures, both programmable and dedicated.
2.3.1 Data Level Parallelism
As previously mentioned, DLP is widely exploited in DP algorithms, since most of the data to compute
is independent. This means that, at any given time during the computation, it is possible to operate on
different data elements simultaneously, i.e., to operate over a vector of elements. Given the matrix-like
representation of DP algorithms, this translates into a vector of cells. Using a 2D matrix as an example, and
given the three data dependencies (vertical, horizontal and diagonal) found in the sequence alignment
algorithms previously mentioned, it is possible to see that the only vector composed solely of
independent cells is the one formed by the cells along the anti-diagonal, as depicted in figure 2.9. Any other
vector composition would result in data hazards, as the dependencies required for a given cell would
not be calculated before that cell. Since the sub-problems in a DP algorithm usually have the same set
of operations applied to them, each data vector only requires one set of operations applied to it in
order to calculate all of its cells. Ideally, this results in a speedup equal to the number of
cells in each vector, in comparison to a single-data architecture.
Figure 2.9: Example of DP Cell parallelism.
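The anti-diagonal (wavefront) scheme just described can be sketched as follows, here for the NW recursion: every cell on diagonal d = i + j depends only on diagonals d−1 and d−2, so the whole diagonal can be updated as one vector (the scoring parameters are illustrative):

```python
import numpy as np

# Anti-diagonal (wavefront) NW fill: each diagonal is one vector update.
def nw_antidiagonal(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    q = np.frombuffer(query.encode(), dtype=np.uint8)
    d_seq = np.frombuffer(db.encode(), dtype=np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int64)
    H[:, 0] = alpha * np.arange(n + 1)          # H(i,0) = alpha * i
    H[0, :] = alpha * np.arange(m + 1)          # H(0,j) = alpha * j
    for d in range(2, n + m + 1):
        # All cells (i, j) with i + j = d are mutually independent.
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        s = np.where(q[i - 1] == d_seq[j - 1], match, mismatch)
        H[i, j] = np.maximum(H[i - 1, j - 1] + s,
                             np.maximum(H[i - 1, j] + alpha,
                                        H[i, j - 1] + alpha))
    return H
```

The result matches the scalar cell-by-cell fill exactly; only the schedule changes, which is why this ordering maps well onto SIMD units.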
Most current processors already include vectorial instruction sets, like Streaming SIMD Extension
(SSE) or Advanced Vector Extension (AVX), making DLP an easy and viable option. For this reason,
and given the performance boost it gives, DLP is used in most GPP architectures for DP algorithms, as
well as in dedicated architectures.
2.3.2 Instruction Level Parallelism
When compared to DLP, ILP does not have as great an impact on DP algorithms. Since the different sub-
problems often require the same set of operations to be solved, there is no particular need for different
operations to be concurrently computed, as this may even lead to race conditions and eventually to data
or structural hazards. However, if there is a guarantee of no race conditions (and thus no hazards)
while computing different cells at different operation steps of the computation, ILP can potentially reduce
the hardware required by a DLP-only solution. In fact, vectorial DP algorithms do not make the
best use of the available hardware, since, at any given time, only one type of operation is being
computed on a given vector. With the addition of ILP it is possible to have, at any given time, different
subsets of cells computing different operations. This way, the length and number of functional units
in the architecture can be reduced, promoting a better hardware usage, since more functional units
will be working at the same time. The solution just described is the one implemented by VLIW
architectures, where a larger single instruction is issued, dispatching different operations to different
data elements in parallel; in these architectures, the work of extracting the ILP is the compiler's
responsibility. Although this solution seems attractive from a hardware requirements point of view,
VLIW architectures are not that common. Most current processors have, however, different ILP
mechanisms: instruction pipelining, out-of-order execution and branch prediction are just some of the
methods that are frequently present in most processors, and which are used alongside the DLP
extensions to maximize the performance of DP algorithms.
2.3.3 State of the Art Architectures
Hardware architectures can be divided into programmable and non-programmable architectures.
Programmable architectures present greater flexibility when compared with non-programmable ones, since
they are easily adapted to new types of problems and algorithms. They are commonly found in GPPs,
which are used for a wide spectrum of applications.
Non-programmable architectures, on the other hand, are usually implemented in dedicated hardware,
such as ASICs or FPGAs, and are commonly used to solve a specific task or family of similar tasks. This
type of architecture is mainly designed for speed and optimization, often resulting in high performance
with low power consumption, but also with higher complexity and implementation costs than
programmable architectures.
In particular, FPGAs are regarded as a hardware alternative to GPPs and ASICs, balancing
the flexibility often found in GPPs with the performance often found in more dedicated architectures.
They allow reconfigurable designs with a shorter design cycle, although achieving neither the high perfor-
mance offered by an ASIC nor the programmability of the GPP. Still, when compared to the other
architectures, the FPGA can offer a better performance-programmability trade-off, depending on the
applications to be implemented.
Due to their large computational times, DP algorithms require fast implementations in order to keep
up with the growing size of the sequences being considered in sequence alignment problems. Although
the commonly used (but sub-optimal) tools for these types of problems (BLAST [6], FASTA [7],
HMMER [9]) are implemented in GPPs (due to their flexibility), many dedicated architectures have arisen,
bringing faster algorithm computations to the table.
The following sections give an overview of several programmable and non-programmable
architectures for the implementation of the sequence alignment DP algorithms that were presented in the
previous sections.
2.3.3.A Programmable Architectures
Vector architectures exploit data level parallelism by implementing high-level operations that work on
linear arrays of data instead of individual data items. The vector elements do not have dependencies
between them, ensuring that no data hazards occur.
Nowadays, most commercial processors have support for vector instructions, like the SSE or AVX ex-
tensions in Intel processors [21], containing dedicated registers and functional units for those particular
instructions. These instructions are classified as Single-Instruction Multiple-Data (SIMD) instructions.
Due to their ability to exploit parallelism, this type of architecture is often used for DP algorithm
implementations, which usually require a large quantity of parallelizable computations.
Smith-Waterman
By reviewing the implementation of the SW algorithm presented before, it is possible to observe that, for
the computation of the final score matrix H, the only cells that do not have dependencies between them
are the ones along the anti-diagonal (see figure 2.11(a)). This allows an inner-loop parallel processing
of vectors composed of the anti-diagonal values of the H matrix, and was first proposed by Wozniak
[11]. Although the loops are fully parallelizable, this parallelization scheme has the drawback of difficult
memory access patterns, introducing large overheads in data manipulation when implemented on GPPs.
Rognes and Seeberg [13] improved on Wozniak's work by pre-computing a query profile (figure 2.10
(a)) once for the entire database sequence. This query profile indexes a modified substitution score ma-
trix by the query sequence position and the database sequence symbol, instead of the original indexing by
the query sequence symbol and the database sequence symbol (figure 2.10 (b)). For a given database
symbol, the resulting scores for matching it against all the query sequence symbols are stored sequentially in
one column of the matrix, with the other columns corresponding to the other database symbols.
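The query-profile idea can be sketched as follows (an illustrative reconstruction: the 2/−1 match/mismatch scores mirror figure 2.10 and are assumptions, not values prescribed by [13]):

```python
# Query-profile construction in the spirit of Rognes and Seeberg [13]:
# re-index the substitution matrix by (database symbol, query position),
# so each database symbol selects one contiguous column of scores.
symbols = "ACGT"
subst = {a: {b: (2 if a == b else -1) for b in symbols} for a in symbols}

def query_profile(query):
    # profile[d][i] = score of matching database symbol d with query[i];
    # built once per query, then reused for the whole database scan.
    return {d: [subst[q][d] for q in query] for d in symbols}

prof = query_profile("ACC")   # prof["C"] is one contiguous score column
```

During the scan, each database symbol then costs a single lookup of a contiguous score vector, instead of one substitution-matrix lookup per query position.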
When implemented with Intel's AVX/SSE instructions, the vector elements are composed of cells
parallel to the query sequence, instead of cells along the anti-diagonals (see figure 2.11(b)). This vali-
dates the use of the query profile, but has the disadvantage of introducing data dependencies between
the cells of the vector. It also introduces conditional branches in the inner loop for the computation of the
F term (see equation (2.5)) when data dependencies occur. The SWAT optimization [22] of this procedure
tries to minimize the impact caused by these inter-vector dependencies. This optimization assumes that
the E and F terms are often equal to zero, hence not contributing to the score value H. In fact, it was
demonstrated that, as long as H is not larger than the threshold α + β (respectively, the gap-open and
gap-extension penalties), E and F will remain zero along the column and row of the matrix, eliminating
(a) Query-profiled substitution score matrix:

        C   A   C   T    (database sequence)
    A  -1   2  -1  -1
    C   2  -1   2  -1
    C   2  -1   2  -1
    (query sequence)

(b) Substitution score matrix:

        A   C   T   G    (database symbols)
    A   2  -1  -1  -1
    C  -1   2  -1  -1
    T  -1  -1   2  -1
    G  -1  -1  -1   2
    (query symbols)

Figure 2.10: Comparison between a substitution score matrix with and without query profiling for the DNA sequences [ACC] and [CACT]. Note that the depicted DNA sequences may be composed of 4 different symbols (A, C, T and G).
data hazards in the parallel computation of the vector elements. When this condition does not hold, data
dependencies may arise and the affected cells require a more time-consuming computation process.
Farrar [12] tackles this problem by also organizing the SIMD registers in parallel to the query se-
quence (as Rognes does), but accessing them in a striped pattern (see figure 2.11(c)). This modified
access pattern moves the conditional branches of the vertical dependencies to a lazy loop, executed
outside the inner loop of the algorithm. This way, the conditional branches only have to be taken into
account once for every database symbol. After the completion of the inner loop, a first pass is made to
check the values of F, for each of the query segments, against the values of H for the given database
symbol. A second pass - the lazy loop - is only needed when the values of F are greater than the values
of H.
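The striped layout can be sketched as follows (an illustrative reconstruction of the index mapping, with p standing for the SIMD vector width; segment sizing details in the real SSE implementation may differ):

```python
# Striped index layout in the spirit of Farrar [12]: a query of length n
# is split into p equal-length segments, and vector i holds position i of
# every segment, so vertical dependencies only surface between segments.
def striped_order(n, p):
    t = -(-n // p)                # segment length, ceil(n / p)
    return [[k * t + i for k in range(p) if k * t + i < n]
            for i in range(t)]

order = striped_order(8, 4)       # 4-wide vectors over an 8-symbol query
```

For an 8-symbol query and 4-wide vectors this yields the two vectors [0, 2, 4, 6] and [1, 3, 5, 7]: consecutive query positions land in consecutive vectors, which is what pushes the F-correction out of the inner loop.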
(a) Wozniak [11] (b) Rognes [13] (c) Farrar [12] (d) Rognes [14]
Figure 2.11: State of the art SIMD implementations of the SW algorithm (extracted from [23]).
More recently, Rognes [14] presented a modified implementation of the SW algorithm that exploits
the multithreading capabilities of multicore processors, by comparing several database sequences with a
single query sequence in parallel (see figure 2.11(d)). This implementation achieves higher speedups,
but diverges from the previous implementations in that it does not solve the same single-reference
single-query alignment problem, solving instead multi-reference single-query alignment problems.
The implementation in [23] also takes a different approach, optimizing Farrar's striped pat-
tern scheme by expanding the processor's ISA.
At this point, it is important to note that the NW algorithm can use the same SIMD architectures as
the SW algorithm presented above. This is possible since both algorithms are based on very similar
recursions. In fact, as stated in a previous section, the only difference in the main recursion is that
the computed scores in the SW algorithm cannot go below 0, unlike in the NW algorithm. Some
Multiple-Instruction Multiple-Data (MIMD) architectures were also exploited in [24] and [25].
Viterbi
The Viterbi algorithm also takes advantage of SIMD architectures, since the involved procedure computes
multiple cells (states) that only depend on the previous iteration. Since it shares many similarities with the
SW and NW algorithms, most of their optimizations can also be used for this algorithm. In fact, the
commonly used tool for HMM problems, HMMER [9], uses the SSE instruction set extension and adopts
Farrar's striped pattern [12].
HMMER mainly uses the multihit model of the profile HMM (see figure 2.7), corresponding to a
local alignment, just like the SW algorithm. As a result of applying Farrar's method, the only major
differences to the SW algorithm lie in the treatment of the Delete (D) states. In the Viterbi algorithm, these
D state values correspond to a dependency in a previous column, while in SW they correspond to a
dependency in the same column (vertical dependency). This requires more memory to store all the
additional D values, but comes with the advantage of a simplified lazy loop, since the HMM topology
allows the lazy-F loop to correct only the D values of the current iteration, given that both the Match (M) and
Insert (I) state values are computed with the D values of a previous iteration.
One particularity of the main recursion of the Viterbi algorithm that has not yet been addressed in this overview is the fact that it uses multiplication operations (see equation (2.9)). Since the probability values used in the computations are below 1, and given the large number of computations, this can result in very small numbers and, in heavier computations, it can even result in numerical underflow. To guarantee better accuracy, the probabilities used in the Viterbi algorithm are therefore usually converted to logarithmic probabilities. This also replaces the multiplications by sums, resulting in a more efficient computation, since sum operations are frequently much faster to compute than multiplications (as seen in equations (2.10), (2.11) and (2.12)). The Viterbi implementation in the HMMER tool [9] uses this transformation, as do the majority of other implementations of the algorithm, such as Intel's evaluation in [26].
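As a minimal sketch of this log-space transformation (the toy two-state HMM and its probability tables below are hypothetical, chosen only to keep the example self-contained):

```python
import math

# Toy HMM (hypothetical values, for illustration only): all probabilities are
# converted to log space once, so the recursion uses sums instead of products.
states = ["A", "B"]
log_trans = {("A", "A"): math.log(0.9), ("A", "B"): math.log(0.1),
             ("B", "A"): math.log(0.2), ("B", "B"): math.log(0.8)}
log_emit = {("A", "x"): math.log(0.7), ("A", "y"): math.log(0.3),
            ("B", "x"): math.log(0.4), ("B", "y"): math.log(0.6)}
log_init = {"A": math.log(0.5), "B": math.log(0.5)}

def viterbi_log(observations):
    """Log-space Viterbi: max over sums of log-probabilities (no underflow)."""
    prev = {s: log_init[s] + log_emit[(s, observations[0])] for s in states}
    for obs in observations[1:]:
        curr = {}
        for s in states:
            # The multiplication of the original recursion becomes a sum here.
            curr[s] = max(prev[p] + log_trans[(p, s)] for p in states) \
                      + log_emit[(s, obs)]
        prev = curr
    return max(prev.values())  # log-probability of the best state path

score = viterbi_log(["x", "y", "x"])
```

Since the logarithm is monotonic, taking the maximum over summed log-probabilities selects the same path as taking it over multiplied probabilities, while keeping the values in a numerically safe range.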
2.3.3.B Non-Programmable Architectures
Despite being very flexible, the previous GPP implementations can hardly exploit the hardware to its full potential, since they are limited to the processor's ISA. ASICs do not have this problem and can often achieve a much lower computational time ([23], [27]), although they still lack flexibility and come at a higher cost. As a result, more flexible solutions for the implementation of specialized architectures are often pursued, such as FPGAs, which can adapt to the targeted applications while still providing good performance, since they are reconfigurable ([28], [29]).
Most non-programmable dedicated architectures that have been proposed to implement DP algorithms are based on ASIC and FPGA implementations. Usually, they consist of a systolic array structure, composed of multiple Processing Elements (PEs), that performs the necessary computations for many cells in parallel. The scalability of these implementations is often associated with an increase in the number of PEs present in the array, resulting in very dense arrays with very high throughput.
Smith-Waterman
For the SW algorithm, the implementations usually parallelize the cell computations along the anti-diagonal, in order to avoid the stalls brought by data hazards. The implementation in [30] uses this scheme, while also exploring the optimization of the traceback phase and the evaluation of the n-best alignments of a given sequence pair. This parallelization scheme is also used in [31] for the NW algorithm, where the algorithm is implemented both on an FPGA and on a Graphics Processing Unit (GPU) for comparison. In [32], both the NW and SW algorithms are implemented, and a common architecture is used to measure the performance of both.
Viterbi
Some dedicated architectures have also been proposed to implement and optimize certain HMMER procedures that compute the Viterbi algorithm. The FPGA implementation in [33] presents a systolic array architecture with 4-stage pipelined PEs, performing multiple searches in parallel. A Viterbi implementation based on Rognes' inter-sequence SIMD parallelisation [14] is also proposed in [34], with an additional exploitation of cache locality for an even higher throughput.
The implementation in [29] presents a parallelization scheme based on a polyhedral model using different linear space-time mappings. This last architecture is composed of a linear array of PEs that also compute multiple instances of the Viterbi algorithm in parallel.
2.4 Summary
DP can be used to solve complex problems by partitioning them into simpler and smaller sub-problems. These sub-problems, often independent of one another, enable the exploitation of data parallelism, by computing several sub-problems at the same time. This greatly accelerates the execution of the algorithms that implement such methods, like the SW algorithm for sequence alignment problems or the Viterbi algorithm for HMM and Profile HMM problems.
Given the current support in most everyday processors, DLP can be easily exploited in most architectures through SIMD extensions, and thus most state-of-the-art architectures for DP algorithms make use of this paradigm. Additionally, it was seen that ILP can be used in conjunction with DLP, in order to reduce the hardware requirements and to maximize their utilization. However, the ideal solution would fall in the field of VLIW architectures, which lack broad compatibility with current compilers and are thus usually avoided. Other ILP mechanisms, like out-of-order execution or multiple issue, are instead present in most modern GPPs and, although not designed specifically for DP problems, can help boost the performance of those algorithms.
When looking for a compromise between DLP and ILP, the design of a programmable processor architecture with an optimized ISA, capable of computing different families of DP algorithms, would come close to the flexibility of a common GPP for this type of algorithm, while also maintaining a high level of optimization and, consequently, faster processing speeds. Furthermore, when implemented in an FPGA device, no restrictions would apply (besides the limitations of the FPGA hardware resources), permitting the use and study of different architectures for the processor, such as the VLIW paradigm, which is rarely pursued in common GPPs given the lack of software optimization and compilation support.
The architecture that will be proposed in this thesis aims at just that: to provide a middle ground for DP algorithm computation, where high performance can be achieved with a higher degree of programmability than usually found in high-performance GPP architectures. Furthermore, this architecture should be scalable enough to later accommodate new instructions and other dedicated/optimized PEs, in order to expand the supported families of DP algorithms while maintaining good performance.
3 Proposed VLIW Architecture
Contents
3.1 Architecture Requirements
3.2 Proposed Architecture
3.2.1 Register Banks
3.2.2 Functional Units
3.2.3 Memories
3.2.4 Instruction Set Architecture
3.3 Interface
3.4 Summary
3.1 Architecture Requirements
The proposed architecture targets the simultaneous exploitation of the DLP and ILP paradigms, in order to position itself as a faster solution than current DP-solving architectures and to support a broader range of algorithms.
As shown in the previous chapter, DP problems can be translated into an n-dimensional matrix, where each sub-problem corresponds to a cell in the matrix, with adjacent cells as prerequisite sub-problems. In a 2D matrix, this typically results in horizontal, vertical and diagonal data dependencies from the left, top and top-left cells, respectively, for each cell computation. Accordingly, to maximize the processing efficiency and to minimize the number of dependencies, cell computations should be performed in parallel along the anti-diagonal (see figure 2.9).
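As a minimal sketch of this traversal order, the following fragment fills a 2D DP matrix one anti-diagonal at a time; the `score` callback and its signature are hypothetical stand-ins for any SW/NW-style recursion:

```python
# Anti-diagonal (wavefront) traversal of a 2D DP matrix: every cell on the
# same anti-diagonal depends only on cells of the two previous diagonals
# (left, top, top-left), so all of its cells can be computed in parallel.
def wavefront_fill(rows, cols, score):
    """score(left, top, diag, i, j) computes one cell (hypothetical helper)."""
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for d in range(2, rows + cols + 1):          # anti-diagonal index i + j
        cells = [(i, d - i) for i in range(1, rows + 1) if 1 <= d - i <= cols]
        # The cells below are mutually independent: a vector unit (or several
        # VLIW execution units) could compute this whole list in one step.
        for i, j in cells:
            H[i][j] = score(H[i][j - 1], H[i - 1][j], H[i - 1][j - 1], i, j)
    return H
```

The inner list is exactly the set of cells a hardware wavefront would process concurrently; the sequential `for` over it merely emulates that parallel step in software.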
However, exploiting the DLP along the anti-diagonal brings two problems: a harder memory organization/access pattern (visible in Wozniak's [11] implementation of the SW algorithm) and larger hardware requirements. While the former can be solved by implementing specialized memory-access units to gather cell values from non-adjacent memory positions, the latter requires the consideration of a different type of parallelism. In fact, vector-only solutions will always result in low Functional Unit (FU) usage. For example, consider that vector processing is used to compute the value of N cells in parallel, which requires a total of M vector instructions. Assuming no inherent data dependencies, and that only one of these operations is a square root, a utilization of 1/M is expected for all N parallel square-root FUs. Naturally, it is possible to reduce the number of FUs (hardware requirements) by serializing the operation over the different vector elements. However, this solution trades performance for hardware requirements, and is hence not ideal.
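This back-of-the-envelope utilization argument can be made concrete (the parameter values below are arbitrary examples, not measurements):

```python
# Vector-only case: N parallel square-root FUs exist for the whole iteration
# of M vector instructions, but are busy only during the sqrt instruction(s),
# so their expected utilization is sqrt_ops / M, independently of N.
def sqrt_fu_utilization(n_lanes, m_instructions, sqrt_ops=1):
    busy_fu_cycles = sqrt_ops * n_lanes          # cycles where sqrt FUs work
    total_fu_cycles = m_instructions * n_lanes   # cycles the FUs are provisioned
    return busy_fu_cycles / total_fu_cycles

util = sqrt_fu_utilization(n_lanes=8, m_instructions=10)  # 1/10 = 0.1
```

Note that widening the vector (larger N) does not improve the ratio: it only multiplies both the provisioned and the busy FU-cycles, which is exactly why a different form of parallelism is needed.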
The alternative is to also explore ILP, by assigning different operations to be executed on different parts of the vector. This results in multiple independent units, each computing its own vector operation on a given part of the vector, which not only increases the potential for additional parallelism, but also reduces the hardware requirements and increases the utilization rate. This is the paradigm used by VLIW architectures, and it can be seen side by side with a vector-only approach, as illustrated in figure 3.1. Although in this figure it takes one more clock cycle for the VLIW architecture to compute the 2 instructions for all 4 elements, it must be taken into account that the example only depicts two instructions. In fact, the ILP introduced here will only have an impact on the initialization of the algorithm, resulting, for the stationary phase of the algorithm, in the same number of clock cycles as the vector-only approach. The use of ILP is also supported by the set of common steps usually included in a DP algorithm. Usually, these steps consist of dependency loads, followed by cell computations, and finalized by the storing of the results. Assigning these different steps of the algorithm to different cells in the matrix (along the anti-diagonal) validates the ILP (different instructions operating over different cells) while also maintaining data coherence, given the independence between the cells. The only control requirement is to guarantee that the cell dependencies are always computed in advance of the cells that require them, in order to avoid data hazards.
Figure 3.1: Comparison between a Vector architecture composed of 4 elements and an equivalent VLIW architecture with 4 units composed of 1 element each. Both examples compute a square root operation, followed by a sum operation. In the Vectorial approach, two clock cycles are required to compute the two instructions, which use 4 FUs each (colored FUs). The VLIW approach takes 3 clock cycles to compute the two instructions, but only requires a maximum of two FUs per instruction (colored FUs). This is achieved by delaying the two last units, in order to reduce the number of FUs.

To efficiently support DP algorithms and to simultaneously explore DLP and ILP, the proposed architecture must then comply with the following requisites: independent execution units to compute independent instructions in parallel, issued from an instruction bundle; and a Data Stream Unit (DSU) to access
the memory concurrently with the execution units (to reduce the latency brought by non-adjacent memory accesses). Each execution unit will then be assigned a different vector of cells and an independent register bank, in order to operate independently of the other units. This also enables the exploitation of memoization, a technique where the current algorithm iteration re-uses the results obtained in the previous iteration, which are stored in the register bank. This technique is especially useful in DP algorithms, where the sub-problems only depend on the results of previous sub-problems, which are represented by the results of the previous iteration (or series of iterations). It also reduces the number of required memory accesses and thus increases the performance. However, there is still the need to ensure the communication between the different execution units, either because a data dependency resides in a different unit, or simply because sharing values between units would greatly benefit an algorithm. Using the memory to share values would prove inefficient. Accordingly, this will be achieved with the addition of a small group of shared registers and with sniffing mechanisms between a small subset of registers in the register bank of each execution unit.
Finally, given the amount of data that is required to be loaded and stored from memory, the architecture will also include a RAM memory, as well as a smaller local memory to store constants that are often used during algorithm computations. The existence of these two memories can further help to reduce memory congestion, especially in memory-heavy algorithms, like Viterbi's.
3.2 Proposed Architecture
As previously mentioned, there are several ways to explore ILP alongside DLP. In the proposed architecture, static ILP is explored, since it requires less control hardware, thus achieving better energy efficiency. This is accomplished by issuing an instruction bundle composed of several different instructions, each operating over a vector of independent elements (DLP) in a different execution unit. This way, instead of using a single large vector computing the same instruction (as is typical in vector architectures), the architecture has several smaller vectors, each effectively computing a different instruction.
In DP algorithms, this corresponds to the parallel processing of cells that are in different steps of the algorithm, in order to maximize the parallelism and thus reduce the required hardware. This parallelism must be exploited cautiously, since it is prone to introduce data races. However, only two conditions are required to avoid them: all cells currently being processed must be independent; and if there are cells being processed in advance (computing a later instruction of the algorithm), there must never be dependencies on the cells that are still in earlier processing steps of the algorithm. Using the anti-diagonal parallelism as an example, this second condition is met by ensuring that the cells being processed at the bottom-left section of the anti-diagonal are in advance with respect to the cells at the top-right section, since the dependencies propagate from left to right and from top to bottom (see figure 3.2).
Figure 3.2: Example of two iterations of a DP algorithm with 3 instructions per iteration and 4 cells being processed along the anti-diagonal. Each cell is processed in a different execution unit with a different reference symbol, represented by the columns. Each group of 3 instructions corresponds to one cell computation, for a given query-reference pair. The dependencies between symbols are represented by the arrows, while the rows represent clock cycles. Unit 0 is the most advanced unit.
To compute each instruction that integrates the instruction bundle, the architecture provides independent execution units. Each of these units operates on a different vector of cells and has its own register bank to locally store all the intermediary results generated by the algorithms (useful for memoization in DP algorithms), as well as any other value or dependency required in the immediate computations, thus reducing memory access operations and improving the processing performance, while maintaining an organized data structure (see figure 3.3). However, there will be situations where an execution unit requires data values from a different unit (e.g., the execution units that are in advance regarding the computation steps of an algorithm will generate dependencies that are required by a different, delayed unit). Thus, a sniffing mechanism, operating on a small subset of registers in each register bank, is implemented. These special registers sniff registers in an adjacent unit, to ensure the commitment of the required dependencies, and access them as if they were in the same register bank, thus maintaining the independence of the register banks while keeping data coherence and avoiding unnecessary memory accesses.
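A toy software model of the sniffing mechanism (class and field names are hypothetical) might look like:

```python
# Toy model of register sniffing: writes to a sniffed register in one bank
# are mirrored into the adjacent unit's bank, so the neighbour reads the
# value as if it were local, with no memory access involved.
class RegisterBank:
    def __init__(self, size=28, sniffed=()):
        self.regs = [0] * size
        self.sniffed = set(sniffed)   # indices mirrored to the right neighbour
        self.right = None             # adjacent bank that sniffs this one

    def write(self, idx, value):
        self.regs[idx] = value
        if self.right is not None and idx in self.sniffed:
            self.right.regs[idx] = value  # mirror only to the adjacent unit

    def read(self, idx):
        return self.regs[idx]

banks = [RegisterBank(sniffed={0, 1}) for _ in range(4)]
for left, right in zip(banks, banks[1:]):
    left.right = right
banks[0].write(0, 42)   # unit 0 commits a boundary dependency; unit 1 sees it
```

Note that the mirror touches only the immediate neighbour, matching the left-to-right, nearest-neighbour dependency pattern described above.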
Figure 3.3: Execution units with the respective independent register banks.
In addition to sniffing, there is a second way to share data between execution units (including non-adjacent ones) without resorting to memory. This is especially important in DP algorithms, since there are often dependency values that are used to compute all cells during a given number of iterations, before being updated and the process repeated. Given the constant need to load these values from memory, and given the independence between execution units, this would lead to redundant memory accesses, as different execution units would have to fetch the same values from memory. This problem is solved by adding a set of shared memory registers to each execution unit. These registers can be accessed by all execution units and can thus be used to store dependencies that are required by several units, removing the redundant memory accesses (see figure 3.4).
Figure 3.4: Register banks with the sniffing mechanism and the additional shared memory registers in each execution unit (with the respective connection to the memory).

Although the register banks are mainly used to store dependencies between iterations, DP algorithms often still need to store some of their dependencies in memory, especially when they are required in a much later iteration of the algorithm. In order to minimize the impact of the resulting memory accesses on the processing performance, a DSU is used to perform the necessary memory loads and stores in parallel with the execution units. This way, while the execution units are computing the main steps of an algorithm, the DSU is simultaneously pre-storing or pre-loading cell values that will only be required at a later iteration. Here, contrary to common VLIW architectures, all the existing units (both the execution units and the DSU) can access the memory, requiring an access priority list to avoid conflicts. Since the DSU's main function is memory access, it has top priority over all the other units.
Figure 3.5: Proposed architecture.
To further ease the memory access delay problem, a local fast (scratchpad) memory is also included, which is used to store constant values required by several DP algorithms. These constant values are pre-fetched at the beginning of the computation by the DSU, and can only be accessed by the execution units (with an access priority list to decide between them).

Figure 3.6: 4-stage pipeline structure.
These specifications result in the architecture presented in figure 3.5. The architecture also presents a 4-stage pipeline: a FETCH stage, where the next instruction is loaded from the instruction memory; a DECODE stage, where the fetched instructions are decoded in all units; an EXECUTE stage, where the FUs and memory operate the instructions; and a WRITE-BACK stage, where the results are written to the register banks. The pipeline, illustrated in figure 3.6, also includes stalling and data forwarding mechanisms to prevent hazards and to minimize the number of stalls on the processor, respectively.
Figure 3.7: Processor scalability: this example doubles the number of processed cells by doubling the vector size from n to 2n (DLP scalability, to the left) and by doubling the number of available units from 4 to 8 (ILP scalability).
By considering the set of characteristics listed above, the proposed architecture can also be easily scaled in two distinct ways (see figure 3.7): by increasing the length of each execution unit, and thus the vector length (DLP); and by increasing the number of execution units, and thus the number of parallel instructions (ILP). The first solution mainly requires an increase of the vector size processed in the functional units, while the second requires an increase in the number of functional units. Both solutions can be applied together, in order to provide a better balance between the two parallelism paradigms.
3.2.1 Register Banks
Each execution unit has its own private register bank of 28 registers, as well as a small set of 4 shared memory registers, for a total of 32 registers (illustrated in figure 3.5). Although the presence of private registers in each execution unit results in reduced register access times, a better structural organization and thus better performance, it is advantageous to be able to share values between units without resorting to copying the value to a shared register. Specifically, using the 2D matrix that represents the processing pattern of many DP algorithms as an example, horizontal and diagonal dependencies between the cells at the edges of the execution units would require, in every iteration, values to be passed from the adjacent unit to the one that requires those dependency values. Given that these dependencies occur very frequently (every iteration), the delay caused by copying the values would be very significant. To circumvent this, the previously mentioned sniffing mechanism is used (see figures 3.4 and 3.5).
This mechanism affects a very small number of private registers (in the 2D example, only two registers in each register bank would require sniffing), and it consists of mirroring those registers to the adjacent execution unit's register bank. Accordingly, whenever an update is made to the registers being sniffed, that same update is reproduced in the adjacent execution unit. Given that the typical dependencies in DP algorithms follow a top-down and left-to-right pattern, the sniffing mechanism is only required from one unit to the one at its right, with the last unit (the one that computes the left-most cells) not requiring sniffing.
The existence of a sniffing mechanism does not exclude, however, the need for shared memory registers. These registers are mainly used by the DSU to communicate with the memory. This separates the parallel memory operations handled by the DSU from the intermediary results of the main algorithm computations, issued by the execution units, thus avoiding register access conflicts and data hazards. To further avoid conflicts, the sharing privileges between execution units only cover read accesses, with writing being exclusive to the register's owner unit and to the DSU. In case of a writing conflict between the DSU and one execution unit, a priority list is used, with the DSU having top priority. These memory registers also serve the purpose of reducing the number of memory accesses in situations where a dependency value, loaded by one execution unit, is required by other units. Instead of being retrieved multiple times from memory into multiple execution units, these dependencies can be loaded into only one unit and then used by all units or, if the dependency requires constant updating, it can even be shifted to the other units' memory registers, with the help of the DSU.
Figure 3.8: FU conflict control. During the first clock cycle, 4 execution units try to compute 3 sum operations and 1 comparison. Since there are only 2 FUs capable of performing sums, the 3rd unit holds its instruction and the previous pipeline stages are stalled. In the second clock cycle, the FUs are free to execute the 3rd unit's sum operation (the remaining execution units do not compute any instruction), finalizing all the instructions in the bundle and resuming the normal processor operation.

The proposed architecture also supports different word sizes, with multiple words being stored in each register whenever the word size is a sub-multiple of the maximum register width. Accordingly, if the word size is half the maximum value, each register stores two different words; if the word size is a quarter of the maximum, each register stores four different words, and so on. This design paradigm allows for different accuracy ranges, improving algorithm performance when higher accuracy is not required (more cells computed simultaneously, with the same hardware resources), while still supporting problems that require a higher level of precision.
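The sub-word storage scheme can be sketched as follows (a software model with an assumed 32-bit maximum register width; the lane-wise add mirrors what a vector FU would do in hardware):

```python
# Sub-word packing sketch: when the word size is a sub-multiple of the
# register width, several small words share one register and are operated
# on as independent vector lanes.
MAX_WIDTH = 32  # assumed maximum register width, in bits

def pack(words, word_size):
    """Pack several small words into one register-sized integer."""
    assert MAX_WIDTH % word_size == 0 and len(words) == MAX_WIDTH // word_size
    reg = 0
    for lane, w in enumerate(words):
        reg |= (w & ((1 << word_size) - 1)) << (lane * word_size)
    return reg

def lanewise_add(a, b, word_size):
    """Add two packed registers lane by lane (no carry across lanes)."""
    mask = (1 << word_size) - 1
    result = 0
    for lane in range(MAX_WIDTH // word_size):
        shift = lane * word_size
        s = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask  # wrap in lane
        result |= s << shift
    return result
```

Halving the word size doubles the number of lanes per register, which is precisely the accuracy-for-throughput trade-off described above.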
3.2.2 Functional Units
The FUs present in the architecture are shared between all the execution units. This is done in order to optimize the resource usage and to reduce the hardware requirements. However, this design option also requires a conflict control mechanism, to manage those situations where multiple execution units try to access more FUs than those available. Therefore, whenever an execution unit tries to access a busy FU, a stall is generated and the instruction is held until the required FU is free, taking such an instruction additional clock cycles to compute (see figure 3.8).
The order in which each execution unit is assigned an FU follows a priority list, where the units that process the left-most cells have a higher priority than those that process the right-most cells. The probability of conflicts could be reduced by adding more FUs, at the cost of an increase in both hardware and power requirements. An optimal solution uses the minimum number of FUs that does not cause conflicts, leading to a better resources/usage ratio. In order to tune the most suitable number of available FUs, the DP algorithms must be characterized both in terms of the amount and the type of operations. Considering that most DP algorithms use simple operations, like sums or subtractions, shifts and logic operations, multiple units of these types are required. By looking at the subset of algorithms considered in this work, the set of available FUs in the devised architecture is: Sum, Maximum, Shift, Logic (AND, OR, XOR) and Comparison units.
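As a toy illustration of this conflict control (the FU counts and unit numbering are hypothetical, mirroring the scenario of figure 3.8):

```python
from collections import Counter

# Toy model of FU conflict control: each cycle, units request an FU type in
# priority order; requests beyond the available FU count are held until the
# next cycle, generating a stall, as in figure 3.8.
AVAILABLE = {"SUM": 2, "CMP": 2}   # hypothetical FU counts

def schedule(requests):
    """Return per-cycle lists of served (unit, op) requests until all issue."""
    pending, cycles = list(requests), []
    while pending:
        used, served, held = Counter(), [], []
        for unit, op in pending:              # list is already priority-ordered
            if used[op] < AVAILABLE[op]:
                used[op] += 1
                served.append((unit, op))
            else:
                held.append((unit, op))       # FU busy: hold the instruction
        cycles.append(served)
        pending = held
    return cycles
```

Running the figure 3.8 scenario (units 0-2 requesting sums, unit 3 a comparison) yields two cycles: the first serves two sums and the comparison, the second serves the held third sum.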
Bearing in mind the adopted 4-stage pipeline, the registers whose values are required during the EXECUTE stage (where the FUs operate) may not have been updated yet, since the WRITE-BACK stage of the pipeline only occurs after the EXECUTE stage. When this occurs, a data forwarding mechanism pushes the yet-to-be-updated value from a later pipeline stage back to the entrance of the FUs, instead of using the current register value. This mechanism is also implemented for memory accesses, where a data vector that has not yet been written to memory is forwarded to the entrance of the FUs.
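A minimal sketch of the forwarding decision, assuming hypothetical `(dest_reg, value)` pairs for the instructions still in flight in the EXECUTE and WRITE-BACK stages:

```python
# Toy model of EXEC/WB forwarding: if a source register is the destination of
# an instruction still in a later pipeline stage, the in-flight value is used
# instead of the stale contents of the register file.
def read_operand(reg, regfile, exec_result, wb_result):
    """exec_result / wb_result are (dest_reg, value) pairs or None."""
    for in_flight in (exec_result, wb_result):   # nearest (youngest) stage wins
        if in_flight is not None and in_flight[0] == reg:
            return in_flight[1]
    return regfile[reg]
```

Checking the EXECUTE stage before WRITE-BACK ensures that the most recent pending write takes precedence when both stages target the same register.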
The FUs are also prepared to operate with different numbers of vector elements. As previously stated, the considered register structure supports multiple word sizes, storing more or fewer words depending on the defined word size. Therefore, if the vector is composed of only one large word, the FUs behave like scalar units, each operating over a single data element. If the vector is composed of several smaller words, the FUs behave like vector units, each operating over multiple data elements.
3.2.3 Memories
There are 3 different memories in the devised architecture: an instruction memory, which stores all the instructions to be computed by the processor; a dual-port RAM memory, which stores values required throughout the algorithm computations; and a local fast memory, which serves the purpose of storing constant values to be used during the algorithm computations.
The instruction memory consists of a read-only memory where the instructions to be computed by the processor are stored. A program counter controls the issuing of instructions, updating the memory address accordingly (either by incrementing or by branching), while the memory constantly provides a new instruction at every cycle. This occurs in the first stage of the pipeline, the FETCH stage. If a stall is generated by the processor in a later pipeline stage, the program counter stalls the current instruction, not updating the memory address until the stall is resolved.
The two remaining memories, the dual-port RAM memory and the local fast memory, have two independent ports for write-only and read-only operations. They serve different purposes: the former is a larger memory for storing large data sets, as well as intermediary values of the algorithm computations (that cannot be stored in the register banks); the latter is a smaller memory that only stores constant values used throughout the algorithm computations. Furthermore, the RAM memory is accessed by both the DSU and the execution units for write and read operations (with the DSU having a higher priority over the execution units), while the local fast memory can only be written by the DSU and only be read by the execution units. This way, during the algorithm computations, it is possible to have both the DSU loading values from the RAM memory to the memory registers, and the execution units computing their algorithm iterations by loading values from the local fast memory, thus minimizing the delay introduced by concurrent memory accesses, while also promoting a better data organization by separating the constant data values of the algorithms from the constantly changing intermediary results.
The data width of both the RAM and the local memory corresponds to the maximum register width, with multiple words being stored or loaded in one data vector if the word length is set to a sub-multiple of the maximum data width, as previously explained in the register bank section. These memories have an access latency of 2 clock cycles: one cycle to index the correct address, and another cycle to load the value at the previously indexed address. As will be seen, there are two instructions to compute both steps of a memory load, enabling the parallel usage of a memory load instruction between two units: one indexing an address and the other effectively loading a data vector, achieving a throughput of 1 clock cycle with a latency of 2 clock cycles when loading values from memory.
3.2.4 Instruction Set Architecture
The ISA of the proposed architecture is structured around a large bundled instruction, composed of several smaller instructions, each one computed in its respective unit. The instruction bundle is thus divided into several execution units and one DSU, as can be seen in figure 3.9(a). Each execution unit instruction is encoded with 32 bits (see figure 3.9(b)), while the DSU instruction has a variable width, depending on the number of execution units that are present (see figure 3.9(c)).
163 - 131 130 - 99 98 - 64 63 - 67 66 - 35 34 - 0
Unit n Unit n-1 … Unit 1 Unit 0 Data Stream
(32) (32) (32) (32) (32) (36)
(a) Instruction Bundle.
Bits:  31 | 30 | 29 - 25 | 24 | 23 - 19 | 18 | 17 - 13 | 12 - 6 | 5 - 0
Field: WE | Td | Rd      | Ta | Ra      | Tb | Rb      | Opcode | OpControl
Width: (1)| (1)| (5)     | (1)| (5)     | (1)| (5)     | (7)    | (6)
(b) Execution unit instruction.
Bits:  35      | 34         | 33 - 32   | 31      | 30  | 29 - 28 | 27 - 18 | 17 - 16 | 15     | 14    | 13 - 12 | 11 - 2 | 1 - 0
Field: ShiftEN | Left/Right | ShiftAddr | localWE | MWE | Unit    | Madd    | Radd    | AddrEN | regWE | Unit    | Madd   | Radd
Width: (1)     | (1)        | (2)       | (1)     | (1) | (2)     | (10)    | (2)     | (1)    | (1)   | (2)     | (10)   | (2)
       Shift Bits           |           Memory Write Bits                   |             Memory Load Bits
(c) DSU instruction for an architecture with 4 execution units.
Figure 3.9: Instruction words for the bundle and the composing units.
As can be seen in figure 3.9(b), the encoding of the execution unit's instructions comprises the
common register address fields Ra, Rb and Rd (which correspond to the first and second operand
addresses, and to the destination address, respectively), the WE field, which indicates when a register
should write a given value, and the operation encoding fields, namely the Opcode field, which selects
between the different types of instructions (arithmetic/logical, control/branch and memory access), and
the OpControl field, which identifies a certain modifier to the instructions (e.g., usage of immediate
values in arithmetic/logical operations, or of inequality comparisons in control operations). Three more
special control fields are also present in this encoding: Td, Ta and Tb. The Td field enables a broadcast
write (enabling a 3-way register write, relevant for DP algorithms that have up to 3 dependencies),
while bits Ta and Tb are used to specify which part of the data is to be loaded or written to registers,
for memory instructions that operate with divisible parts of data.
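As an illustration, the field layout of figure 3.9(b) can be packed and unpacked as follows (the Python helpers are ours, not part of the architecture's toolchain; only the field names and widths come from the figure):

```python
# Pack/unpack helpers for the 32-bit execution-unit word of figure 3.9(b).
# Field names and widths follow the figure; everything else is illustrative.
FIELDS = [              # (name, width), most-significant field first
    ("WE", 1), ("Td", 1), ("Rd", 5), ("Ta", 1), ("Ra", 5),
    ("Tb", 1), ("Rb", 5), ("Opcode", 7), ("OpControl", 6),
]

def encode(**values):
    """Pack named fields into one 32-bit instruction word."""
    word, bit = 0, 32
    for name, width in FIELDS:
        bit -= width
        value = values.get(name, 0)
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << bit
    return word

def decode(word):
    """Unpack a 32-bit instruction word into its named fields."""
    fields, bit = {}, 32
    for name, width in FIELDS:
        bit -= width
        fields[name] = (word >> bit) & ((1 << width) - 1)
    return fields

word = encode(WE=1, Td=1, Rd=3, Ra=7, Rb=2, Opcode=0x12)
fields = decode(word)
assert fields["Rd"] == 3 and fields["Opcode"] == 0x12 and fields["Td"] == 1
```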
The summarized implemented instruction set can be seen in Table 3.1 and the full instruction set in
Appendix A.1. The instruction set presents commonly used arithmetic and logic instructions
(e.g. addition, subtraction and logic OR, AND and XOR with their respective immediate counterparts)
as well as a special Maximum and Move (MAXMOV) instruction. This instruction performs the maximum
operation while moving a register in parallel, proving useful in DP algorithms that present dependencies
for the iteration after the next (e.g. diagonal dependencies in the SW algorithm). The SUM, SUB and MAX
instructions are the only ones that can be modified by the Td field, to enable the broadcast write, as
can be seen in Appendix A.1. The MAX and MAXMOV instructions can also be modified by the OpControl.
When the bit 5 of OpControl is active, both instructions perform gap register comparisons for the SW
algorithm. The MAX instruction also concatenates the result of the maximum operation with a different
register (e.g. concatenation with a sniffing register).
The memory instructions are responsible for the load and store operations on the RAM and local
fast memories. The load operations require a previous indexation, encoded by the instructions INDEX
MADDR (for the RAM memory) and INDEX SPADDR (for the local fast memory). The different sized loads
and stores are controlled by the Ta and Tb instruction fields (see Appendix A.1). The INDEX SPADDR
instruction can also be modified by the OpControl in order to perform a comparison between two values
to find the correct address to index (requiring a comparison FU). This is useful for alignment algorithms,
which present substitution scores dependent on the aligning symbols.
The control instructions consist solely in delayed branches, where the instruction following the branch
is still computed before branching. The instruction set is thus composed by a simple branch instruction
as well as common conditional branches (e.g. not equal, less than, greater than) and their immediate
counterparts.
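The delay-slot behaviour can be illustrated with a toy interpreter (the mini ISA below, with SET/BRD/HALT, is invented purely for this example and is not the architecture's ISA):

```python
# Illustrative model of a delayed branch: the instruction in the delay
# slot (immediately after the branch) always executes before the jump.

def run(program):
    trace, pc, branch_target = [], 0, None
    while True:
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if branch_target is not None:        # delay slot just executed,
            next_pc, branch_target = branch_target, None   # now jump
        if op == "BRD":                      # branch takes effect only
            branch_target = args[0]          # after one more instruction
        elif op == "HALT":
            return trace
        pc = next_pc

prog = [("BRD", 3), ("SET",), ("SET",), ("HALT",)]
# executes 0 (branch), then 1 (delay slot), then jumps to 3
assert run(prog) == [0, 1, 3]
```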
The DSU has a different instruction format than the execution units, depicted in figure 3.9(c).
This instruction format also exploits ILP, since it encodes three distinct and parallel operations: a
memory load/register write (bits 15 - 0); a memory write (bits 31 - 16); and a register shift (bits 35 - 32).
It is worth noting that the DSU instruction’s length will depend on the number of execution units that are
present in the architecture. In fact, the depicted figure 3.9(c) shows the case where 4 execution units
are present, including 2 bits required to address each one of the 4 units in both Unit fields. The memory
load operations are responsible for loading a value from memory (addressed by Madd) to one of the
memory registers (addressed by the two-bit field Radd, since there are only 4 memory registers per
execution unit) in one of the existent execution units (field Unit). Since the load instructions require a
preliminary indexation before the actual load, the bits AddrEN and regWE identify the index operation
and the load operation, respectively. In order to optimize the throughput of memory load instructions,
these two bits also enable simultaneous indexation and load operations. In such a situation, both the
Unit and Radd fields identify the register to store the data loaded from memory, while the Madd field
identifies the new memory address to be indexed for a later load operation. The memory write operations
are very similar to the load operations, with the exception of only requiring one enable flag (MWE) for
allowing writing access to the memory. There is also a localWE field that chooses between the RAM
memory and the local fast memory, since the DSU is the only unit that can also write to the local memory.
The register-shift operation is responsible for creating, along with a memory read or write, a register
Table 3.1: Abridged implemented instruction set. The full instruction set is depicted in Appendix A.1.

INSTRUCTION                                    MNEMONIC
Arithmetic and Logic Instructions
  Add, Subtraction                             SUM, SUB
  Maximum, Maximum and Move                    MAX, MAXMOV
  Comparison                                   CMP
  Arithmetic and Logic Right and Left Shift    SRA, SRL, SLA, SLL
  Logic OR, AND, XOR                           OR, AND, XOR
Memory Instructions
  Index memory address                         INDEX MADDR
  Load Byte, Half-word, Data                   LB, LH, LD
  Index local memory address                   INDEX SPADDR
  Local Memory Load                            SPAD LD
  Store Byte, Half-word, Data                  SB, SH, SD
Control Instructions
  Delayed Branch                               BRD
  Delayed Branch Equal, Not Equal              BEQD, BNED
  Delayed Branch Less Than, Less Than Equal    BLTD, BLTED
  Delayed Branch Greater Than                  BGTD
window mechanism integrating all the memory registers. This mechanism is depicted in figure 3.10 and
can reduce the impact of memory accesses, by pre-loading a data value that will be required in future
iterations of the computation or by pre-storing a value to be later used in future iterations. These memory
accesses are done in parallel to the computations, without overwriting any values that have yet to be
used, in order to prevent any data hazards. The registers to be shifted are chosen by the ShiftAddr
bit-mask, from one of the periphery execution units (unit 0 or unit n) to the other, with the direction being
chosen by the Left/Right field. An enable flag (ShiftEN) activates the shift operation.
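The shift described above can be sketched in a few lines (illustrative Python, assuming 4 memory registers per unit and helper names of our own): one chosen register is shifted from each unit to its neighbour while the left-most unit receives a fresh value from memory, matching the example of figure 3.10.

```python
# Register-window shift sketch: the register selected by shift_addr is
# shifted from each unit to its right-hand neighbour, while unit 0
# receives a new value loaded from memory.

def shift_window(units, shift_addr, new_value):
    """units: list of per-unit memory-register lists (unit 0 first).
    Returns the value shifted out of the last unit."""
    carried = new_value                      # value entering unit 0
    for regs in units:                       # left-to-right shift
        regs[shift_addr], carried = carried, regs[shift_addr]
    return carried                           # value leaving unit n

units = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
out = shift_window(units, shift_addr=2, new_value=99)
assert [u[2] for u in units] == [99, 2, 6, 10]   # third registers shifted
assert out == 14                                 # value leaving the window
```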
It is also important to notice that the shift operation implemented by the DSU operates independently
of the FUs. Moreover, given the higher priority of the DSU over the execution units, the
shift operation will always overwrite the target memory register if an execution unit tries to update
that register during the same clock cycle. For this reason, the memory registers should mainly be used
by execution units for accessing the stored values and not for updating them, as that is the DSU's main
functionality.
As previously mentioned, more execution units can easily be added to the architecture by
widening the instruction bundle and adding register banks to the new units. Furthermore, these execution
units can also be expanded to accommodate more words, by increasing their vector width. This
would also require some modifications in the FUs and in the memory accesses, in order to maintain
compatibility. The former scalability solution is better suited for algorithms that require many instructions
per iteration, while the latter is better used by algorithms that require fewer instructions and work with
higher volumes of data.
Figure 3.10: Register window example. In this example, the third register in each array is shifted to the register of the array on its right, while the left-most array is loaded with a new value from memory.
3.3 Interface
The proposed architecture is envisaged to act as an accelerator element highly interconnected with
an off-the-shelf GPP, where the non-regular and less complex parts of the algorithms (e.g. control and
management structures) will be executed. Accordingly, it was decided to extend the design of the proposed
architecture to its interface with the outside world. In particular, an interfacing structure is
envisaged that aims to be suited to implementations supported either in ASIC or FPGA technologies.
Naturally, a greater emphasis will be given to FPGA-based implementations, due to their greater availability
in the lab.
System On Chip (SOC) processing structures are usually formed by heterogeneous aggregates of
processing elements. In particular, they commonly include a set of GPP elements and several accel-
erating processing structures. The GPP elements typically comprehend a processor/microcontroller,
together with the cache, the RAM and all the corresponding interconnections and input/output periph-
eral ports. A popular example of such a SOC structure based on FPGA technology is the Xilinx Zynq
FPGA, comprehending a Processing System (PS) and Programmable Logic (PL) sections. The latter
section is frequently used to create custom designs and integrate them with the processor in the PS.
The proposed architecture is then particularly suited to be integrated as a core located in the PL section
of the FPGA.
This section presents an interfacing structure for the proposed VLIW processor based
on the Advanced Microcontroller Bus Architecture (AMBA), according to its Advanced eXtensible Interface
(AXI). These specifications are adopted by some FPGA vendors (e.g. Xilinx) and are considered to
be the de-facto standard for 32-bit embedded processors, due to being well documented and royalty free.
After analyzing the proposed architecture, previously presented in this chapter, three main structures
were identified as requiring communications with the GPP element: i) the instruction memory, ii) the
RAM memory and iii) the local fast memory. The GPP only requires write access to all these memories,
since they are only used by the VLIW core.
When integrated with the GPP, all the data to be computed in the proposed architecture core
is stored in the system's RAM, requiring it to be loaded to the memories inside the core. The GPP is
thus responsible for selecting and sending the correct data to the correct memories, depending on the algorithm
that is being processed. Ideally, the data is transferred in parallel to the algorithm computations, with a
controller unit monitoring the data transfer to guarantee coherence. However, the VLIW core memories
only have 2 access ports (a write-only and a load-only port, as previously detailed) and, with the exception
of the instruction memory, both ports are already used by the core, preventing parallel access by
the GPP, due to structural conflicts. To solve this, a multiplexer at the entrance of the write ports for the
RAM and local fast memory is required. This multiplexer thus chooses between the proposed core or the
GPP for writing access. The multiplexer selection is done by an additional control unit, located outside
the proposed core and inside the PL (see figure 3.11). This control unit must then be able to recognize
the current algorithm phase to switch the multiplexer accordingly and to enable the memory writes. This
can either be done by also sending the instructions that are being processed by the proposed core to
the control unit, or by using a feedback system, where the VLIW core communicates the current state of
the operations.
Figure 3.11: AXI interconnection scheme between the RAM and the local fast memory in the proposed architecture core and the GPP in the PS.
As opposed to these two memories, the instruction memory has only one port being used to load
the instructions to the different units, inside the proposed core. By connecting the remaining free port to
the GPP, we can seamlessly transfer the new instructions to the VLIW core at the same time that other
instructions are decoded in the core, without the need for a multiplexer (see figure 3.12). However, given
the difference between the data transfer frequency of the GPP to the VLIW core and the core’s operating
frequency, structural hazards can occur, and thus a control unit is required. Accordingly, this unit must
be able to monitor the memory, ensuring a correct data transfer. Therefore, the control unit requires the
knowledge of the current instruction being computed in the VLIW core (similarly to the control units for
the other two memories), as well as the control of the memory port signals, in order to appropriately
enable the writing access and choose the addresses for the data transfers.
Figure 3.12: AXI interconnection scheme between the instruction memory in the proposed architecture core and the GPP in the PS.
In order to connect the memories inside the VLIW core to the GPP, AXI controllers are required.
These units provide the interface to connect the memories to a central AXI Interconnect, which in turn
completes the communication bridge to the GPP in the PS. Figure 3.13 depicts the full interfacing
structure scheme.
The AXI follows a handshake process to transfer the address, control and data information,
where the master (GPP) asserts and holds a VALID signal when data is available to transfer, and the
slaves (memories inside the VLIW core) respond with a READY signal when they are able to accept the
data. When both signals are active, the transfer occurs. The AXI supports data bursts, which are necessary
for the memories in the VLIW core. The instruction memory requires multiple instructions to be
stored prior to the start of the algorithm, which must be sent in long bursts to reduce the stall time.
Similarly, the RAM and local fast memory will also require long bursts of data, in order to prolong the
algorithm computations without stalling the core, since their write ports can only be accessed either by
the core or the GPP at a given time.

Figure 3.13: Interface scheme for the proposed architecture core.

Using the SW algorithm as an example, and due to the large
length of the reference and query sequences, the RAM memory can only store a limited number of
sequences. In sequence alignment algorithms, it is common to perform multiple query alignments to
the same reference sequence. Therefore, every time that a fixed number of query sequences is aligned
to the reference sequence, a new set of queries must be sent from GPP. During this time, the VLIW
core will be stalled until all the new queries are stored in the RAM memory for the new alignments. By
maximizing the burst length of query sequences, the time that the core is stalled can be minimized, thus
increasing performance.
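The handshake described above can be modelled in a few lines (a behavioural sketch only, not the full AXI protocol; the per-cycle signal lists stand in for the actual wires):

```python
# Simplified VALID/READY handshake model: a beat is transferred only on
# cycles where both signals are high.

def transfer(valid_per_cycle, ready_per_cycle, data):
    """Return the cycle numbers at which each beat is accepted."""
    accepted, it = [], iter(data)
    for cycle, (valid, ready) in enumerate(zip(valid_per_cycle,
                                               ready_per_cycle)):
        if valid and ready:                 # handshake completes
            try:
                next(it)
                accepted.append(cycle)
            except StopIteration:           # burst already finished
                break
    return accepted

# Master holds VALID; slave is only READY on cycles 1, 2 and 4.
beats = transfer([1, 1, 1, 1, 1], [0, 1, 1, 0, 1], ["d0", "d1", "d2"])
assert beats == [1, 2, 4]
```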
An important problem that was not yet addressed is the number of input/output pins in the proposed
core. The number of pins could significantly reduce the operating frequency of the core, due to an
increase in routing complexity. In order to address this problem, it is necessary to know the width of the
data being transferred to and from the proposed core, and how to reduce those widths.
The instruction length for the VLIW core varies with the total number of units (execution units and
DSU) that are present. The encoding corresponding to each execution unit has a length of 32 bits,
and the DSU has a length varying with the number of execution units present. As an example, with 4
execution units and one DSU, the full instruction length would be 164 bits. Adding a 32-bit RAM
and local fast memory on top of that, the required total number of bits to be transferred to the core would
rise to 228 bits. This excludes the outputs of the proposed core that are necessary to send information
to the control units, as well as the algorithm results back to the GPP. In order to reduce these input
widths, the transferred data should be shortened and sent in more frequent and smaller bursts. For the
instructions, the adopted width should match the width of each unit. Therefore, each instruction sent
from the GPP would be divided by the number of units present in the core. For the previous example
with 4 execution units and 1 DSU, one 36-bit and four 32-bit data transfers (five data
transfers in total) would be required for the full instruction to be available in the proposed core.
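Such a split can be sketched as follows (illustrative Python; the field widths follow figure 3.9(a), while the least-significant-first transfer order is an assumption of ours):

```python
# Splitting one 164-bit instruction bundle into the five narrower
# transfers discussed above: one 36-bit Data Stream word plus four
# 32-bit execution-unit words.

WIDTHS = [36, 32, 32, 32, 32]          # DSU word first, then units 0..3

def split_bundle(bundle):
    """Slice a 164-bit integer into the five transfer words."""
    parts = []
    for width in WIDTHS:
        parts.append(bundle & ((1 << width) - 1))
        bundle >>= width
    return parts

def join_bundle(parts):
    """Reassemble the transfer words into the full bundle."""
    bundle, shift = 0, 0
    for width, part in zip(WIDTHS, parts):
        bundle |= part << shift
        shift += width
    return bundle

bundle = (0x12345678 << 36) | 0xABCDEF123          # arbitrary test value
parts = split_bundle(bundle)
assert len(parts) == 5 and join_bundle(parts) == bundle
```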
For the remaining memories, a similar solution can be used. Since the proposed core allows the word
size to be a multiple of a maximum data width, the inputs for these memories could have the same width
as the word size, with multiple word-sized transfers being required for the full data to be transferred. The
same can be applied to the solution output, which also has the same width as these memories.
Finally, the control signals sent by the VLIW core to the previously introduced control units should
only consist of small flags and thus should not require any additional modifications.
3.4 Summary
This chapter listed all the necessary requirements for the proposed architecture, and gave a detailed
description of all the architecture structures, including an interfacing structure proposal.
Exploiting both DLP and ILP, the resulting architecture consists of a VLIW architecture with multiple
execution units and a DSU. Each execution unit is responsible for the operation of an independent data
vector, while the DSU takes care of parallel memory accesses. In order to enable communication
between the execution units, shared register sets and sniffing mechanisms are implemented in the register
banks. Additionally, the existence of two distinct memories (RAM and local fast memory) helps reduce
the conflicts between the units when accessing the memory, reducing delays and promoting
a better structural organization. All these characteristics not only result in an optimized processor
for DP algorithms, but also in a programmable architecture with potential for broader compatibility.
The interfacing structure to connect the proposed architecture to a GPP is discussed in the last
section of the chapter. Although some techniques and considerations are taken for this interface, the
proposed interface was not implemented in our work, due to time constraints.
4 DP Algorithm Implementations

Contents
4.1 Smith-Waterman ............ 42
4.2 Viterbi (Profile HMMs) .... 46
4.3 Summary ................... 51
This chapter describes the two algorithm implementations made for the proposed architecture: the
SW and the Viterbi algorithms. It focuses on the processing scheme used by the algorithms, as well
as the necessary instructions to compute them in the proposed architecture, together with any special
mechanisms and considerations used.
The considered architecture for the implementations consists of 4 execution units and 1 DSU with
32-bit vectors. The SW implementation will use 8-bit words, processing 4 words (cells) per execution
unit, while the Viterbi implementation will use 16-bit words, processing 2 words (cells) per execution
unit.
4.1 Smith-Waterman
As explained in the second chapter, the SW algorithm computes the local alignment between a query
and a reference sequence. With the help of a substitution score matrix and gap penalty scores (affine
model) that indicate, respectively, the weight of matches/mismatches and insertions/deletions in the
alignment, the algorithm fills the resulting score matrix, from the upper left to the bottom right. This filling
operation respects the three dependencies that are present in the computations of every cell: the left,
top and top-left cell dependencies, resulting in parallelism extraction along the anti-diagonal, as it was
previously seen.
In addition to the anti-diagonal parallelism extraction, the algorithm will also follow a processing along
the query sequence (see figure 4.1). This processing scheme results in two distinct algorithm loops: an
inner loop, where a small reference sub-sequence is compared against the full query sequence; and an
outer loop, where a new reference sub-sequence is loaded, restarting the inner loop.
Although the processor is configurable to admit other setups, the described implementation uses, in
each of its 4 execution units, 32-bit vectors, each composed of 4 8-bit words, resulting in 16 8-bit cells
being simultaneously computed in all units.
Figure 4.1: SW processing scheme along the query sequence, extracting parallelism along the anti-diagonal.
Revisiting the SW main equations, it is possible to observe that, in order to compute the result for
cell (i, j), the negative gap values (β or α) are added to the vertical (cell (i−1, j), eq. (4.3)) and horizontal
(cell (i, j−1), eq. (4.2)) dependencies, and the substitution score is added to the diagonal dependency (cell
(i−1, j−1), eq. (4.1)).
Assuming that both sequences and the substitution matrix are already stored in memory, and the
gap and dependency values are already stored in the register banks, the required algorithmic steps in
the inner loop of the algorithm can be broken down into the following: an indexation and the respective loads
of the query symbols and substitution scores; the 3 dependency sums with the substitution and gap
scores; and two maximum evaluations in order to find the final cell result.
H_{i,j} = max( 0, E_{i,j}, F_{i,j}, H_{i−1,j−1} + Sm(q_i, d_j) )    (4.1)

E_{i,j} = max( E_{i,j−1} + β, H_{i,j−1} + α )    (4.2)

F_{i,j} = max( F_{i−1,j} + β, H_{i−1,j} + α )    (4.3)
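Equations (4.1) to (4.3) can be transcribed directly into a scalar cell update; the Python sketch below is ours (the architecture computes these steps with SUM, MAXMOV and MAX instructions), with alpha the gap-initialization and beta the gap-extension penalty, both negative:

```python
# Direct transcription of the SW affine-gap recurrences (4.1)-(4.3).
# H_diag/H_up/H_left are H(i-1,j-1), H(i-1,j), H(i,j-1); E_left and
# F_up are E(i,j-1) and F(i-1,j); score is Sm(q_i, d_j).

def sw_cell(H_diag, H_up, H_left, E_left, F_up, score, alpha, beta):
    E = max(E_left + beta, H_left + alpha)       # eq. (4.2)
    F = max(F_up + beta, H_up + alpha)           # eq. (4.3)
    H = max(0, E, F, H_diag + score)             # eq. (4.1)
    return H, E, F

# One cell with a match (score +2) and gap penalties alpha=-3, beta=-1:
H, E, F = sw_cell(H_diag=4, H_up=3, H_left=5, E_left=1,
                  F_up=0, score=2, alpha=-3, beta=-1)
assert (H, E, F) == (6, 2, 0)   # diagonal dependency wins: 4 + 2 = 6
```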
Due to its length, the query sequence is stored in the RAM, while the substitution score matrix is
stored in the local fast memory. Therefore, the query sequence memory accesses can be performed by
the DSU, allowing the execution units to load the substitution symbols in parallel, taking a total of two
clock cycles per iteration (see clock cycles 1 and 2 for Unit 0 in figure 4.2). Since the substitution
score load requires the current query symbol in order to load the correct value (by performing a comparison
between the query symbol and the reference symbol), the query symbol that is being loaded in parallel
will be used in the next iteration, with the current query symbol being already present in the register file.
Following the substitution score load, the 3 main sums can now be computed, since all the 3 depen-
dencies and gap scores are already stored in the register banks. These sums can be encoded to one
single sum instruction if the Td flag is activated, as it was seen in the architecture’s instruction set in the
previous chapter. Therefore, these 3 sums will only take 1 clock cycle to compute (see clock cycle 3 for
Unit 0 in figure 4.2).
Finally, the maximum operations will find the final result, which corresponds to the maximum value
of the three previous sum results. Two maximum instructions are necessary, thus taking 2 clock cycles
to finish (see clock cycles 4 and 5 for Unit 0 in figure 4.2). At the same time, the query symbols in each
execution unit (which are stored in the memory registers) are shifted to the adjacent unit, in order to be
reused during the next iteration. This can be done since the parallelism along the anti-diagonal and the
processing along the query sequence are exploited. The query symbol pre-loading during a previous
clock cycle, together with the symbol shifting, corresponds to a register window scheme, similar to the
one depicted in figure 3.10.
After the final cell value is computed, the inner loop restarts. The table in figure 4.2 details the inner
loop for an example with 4 execution units and 1 DSU.
The ILP is exploited in the SW implementation by having an offset of one instruction computation
between adjacent execution units. Due to the processing along the query sequence, the most advanced
Cycle | Data Stream Unit                                        | Unit 0                 | Unit 1                 | Unit 2                 | Unit 3
1     | Index crit. dep. (Unit 0)                               | INDEX SPADDR (i+3,j)   |                        |                        |
2     | Load crit. dep. (Unit 0) / Index crit. gap (Unit 0)     | SPAD LD                | INDEX SPADDR (i+2,j)   |                        |
3     | Load crit. gap (Unit 0) / Index query symbol (Unit 0)   | SUM (Td = 1)           | SPAD LD                | INDEX SPADDR (i+1,j)   |
4     | Store cell result (Unit 3) / Load query symbol (Unit 0) | MAXMOV                 | SUM (Td = 1)           | SPAD LD                | INDEX SPADDR (i,j)
5     | Store gap result (Unit 3) / Shift query symbols (u0 to u3) | MAX (OpControl(5) = 1) | MAXMOV              | SUM (Td = 1)           | SPAD LD
6     | …                                                       | …                      | MAX (OpControl(5) = 1) | MAXMOV                 | SUM (Td = 1)
7     | …                                                       |                        |                        | MAX (OpControl(5) = 1) | MAXMOV
8     | …                                                       |                        |                        |                        | MAX (OpControl(5) = 1)

Figure 4.2: Main iteration (inner loop) operations (with the respective clock cycles) for the SW algorithm in the proposed architecture. The example depicts the architecture with 4 execution units and 1 DSU.
unit will correspond to the unit that is aligning the latest query symbol. Also, given the anti-diagonal
parallelism and the dependency propagation from the top-left to the bottom-right, the most advanced unit
will also correspond to the left-most unit, as can be seen in figure 4.1. Accordingly, due to the anti-diagonal
parallelism and the number of existent units, this computational offset will not introduce any
conflicts, as was seen in the previous chapter (see figure 3.2).
The ILP exploitation greatly reduces the required number of FUs. From the table in figure 4.2, it is
possible to see that, with 4 execution units, there are never more than 1 SUM, 1 INDEX SPADDR, and 2
maximum instructions (MAXMOV and MAX) being computed during the same clock cycle. Therefore, the
SW algorithm implementation will only require 3 SUM/SUB units (since the sum instruction refers to a 3-
way broadcast sum), 2 MAXIMUM units and 1 COMPARISON unit (for the INDEX SPADDR instruction). If more
execution units were present, the required number of FUs would be higher, or it could remain the same
at the cost of adding stalls due to the rise of conflicts.
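The FU-sharing claim above can be checked with a small script (an illustrative model of ours; the stage list mirrors the Unit 0 column of figure 4.2, and each unit lags its left neighbour by one cycle):

```python
# With 4 units offset by one cycle through the 5-instruction inner loop,
# no instruction type is issued by more than the stated number of units
# in any steady-state cycle.

STAGES = ["INDEX SPADDR", "SPAD LD", "SUM", "MAXMOV", "MAX"]

def issued(cycle, n_units=4):
    """Instruction each unit issues at a given steady-state cycle."""
    # unit u lags unit u-1 by one cycle
    return [STAGES[(cycle - u) % len(STAGES)] for u in range(n_units)]

for c in range(len(STAGES)):
    ops = issued(c)
    assert ops.count("SUM") <= 1                      # 1 broadcast sum
    assert ops.count("INDEX SPADDR") <= 1             # 1 comparison FU
    assert ops.count("MAXMOV") + ops.count("MAX") <= 2  # 2 maximum FUs
```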
The outer loop of the SW algorithm consists only of the load of new reference symbols and occurs
every time the end of the query is reached by an execution unit. These symbols are stored in the memory
registers, and thus can be loaded in parallel by the DSU, similarly to the query symbols. The table in
figure 4.3 depicts the instructions in the DSU and execution units for the outer loop. As we can see from
figure 4.3, the outer loop will not introduce any additional clock cycles since it can be fully performed in
parallel by the DSU.
Due to the partitioning of the reference sequence, some problems will arise when solving cell de-
pendencies between execution units, specifically the horizontal and diagonal dependencies. Since the
processing scheme follows the query sequence, thus adopting a top-down anti-diagonal parallelism
approach, the computed cell values will be stored in the register banks and be used as vertical de-
pendencies during the next algorithm iteration. In the following iteration, the register with the vertical
dependency value is overwritten with the new values. The same happens for the diagonal and hori-
zontal dependencies. Inside the same unit, these dependencies are rapidly retrieved, since they are all
located in the same register bank. However, the dependencies between units require the use of sniffing
mechanisms. These mechanisms are used by a unit to access the dependency cells from the adjacent
execution unit to its left (unit in advance), in order to use them in the next iteration, as they were stored
in its own register bank. Contrary to the other dependencies, the diagonal dependency requires two
[The table in figure 4.3 lists, cycle by cycle, the DSU and execution-unit instructions across two algorithm iterations, including the outer-loop reference symbol indexations and loads performed by the DSU in parallel with the inner loop.]

Figure 4.3: Inner loop and outer loop operations (with the respective clock cycles) for the SW algorithm in the proposed architecture. The example depicts the architecture with 4 execution units and 1 DSU and 2 algorithm iterations. The outer loop for each execution unit is comprised of two DSU instructions.
registers in each unit. This is due to the fact that an anti-diagonal scheme is used, and therefore, the
computed cell value will only be used as a diagonal dependency two iterations after the current one (thus
being necessary to store the value to be used in the next iteration and two iterations after the current
iteration).
However, for the most advanced unit (which is aligning the left-most symbols of the reference sub-sequence),
the horizontal and diagonal dependencies cannot be retrieved from its adjacent unit, since
there is no adjacent unit in advance of it. These dependencies are computed in the previous reference
sub-sequence, and therefore should be stored in memory. In fact, as can be seen in the tables of
figures 4.2 and 4.3, the most delayed unit (which is aligning the right-most symbols of the reference sub-sequence)
will have its final cell values stored to memory by the DSU, in order to be retrieved later
on by the most advanced unit (with the help of the DSU).
These critical sections (see figure 4.4) only occur between the two execution units that are computing the edges of the reference sub-sequences, and do not introduce any additional clock cycles, since the memory loads and stores are done in parallel by the DSU. Therefore, their processing can be seen as a window register scheme, where the new reference symbols are loaded just before they are required. The sniffing mechanism cannot be applied in the critical sections due to the large length of the query sequence and the fact that the units are not adjacent. Since the processing follows the query sequence, the most delayed unit would need to store all of its computed cell values until the end of the query sequence, which is infeasible given the small number of available registers compared to the query length.
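Functionally, the window scheme amounts to a small producer/consumer buffer in memory. The sketch below is hypothetical (`store_edge` and `load_edge` are illustrative names); in the real design the DSU performs these accesses in parallel with the ALU work, so they cost no extra inner-loop cycles.

```python
class EdgeWindow:
    """Edge cells stored by the most delayed unit and loaded by the most
    advanced unit when it starts the next reference sub-sequence."""
    def __init__(self):
        self._mem = {}  # stands in for the scratchpad/RAM

    def store_edge(self, row, cell_value):
        self._mem[row] = cell_value  # issued as each row is finished

    def load_edge(self, row):
        return self._mem.pop(row)    # issued just before the row is needed
```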
Figure 4.4: Critical section between two sub-sequences of the reference sequence for an example case with 4 execution units. Each color/symbol represents a different iteration, with 4 iterations being depicted. The dependencies required by Unit 0 for the sub-sequence 1 must be retrieved from memory.

The affine gap model will also require a mechanism similar to the horizontal and vertical dependencies. Since this model takes into account two distinct gap values (an initialization value and an extension
value in case there are several gaps in a row), all execution units will have two registers in their register
bank with both gap values constantly stored. During the maximum operations of the SW algorithm, an auxiliary register records which dependency originated the maximum result. If it is a vertical or horizontal dependency, the auxiliary register compares the new result against its previously stored value to determine whether the gap is an extension or an initialization, updating its value accordingly. This way, during the sum operations of the following iteration, the correct gap value to be used is already stored in the register bank.
For the most advanced execution unit, the auxiliary register that indicates the type of gap for the horizontal dependencies belongs to the most delayed unit in a previous iteration. Given that the required value is computed in a former iteration of the algorithm, it is stored in memory to be later loaded by the most advanced execution unit, similarly to the horizontal dependency itself (see the DSU instructions in the table in figure 4.3). Also, just like the horizontal dependencies, adjacent units share these auxiliary gap registers by using the sniffing mechanism.
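This bookkeeping corresponds to Gotoh's affine-gap recurrence. In the hedged scalar sketch below (illustrative costs, not the thesis ISA), the E and F matrices play the role of the auxiliary registers that remember whether the next horizontal or vertical gap is an opening or an extension.

```python
def sw_affine(query, ref, match=2, mismatch=-1, gap_open=-3, gap_ext=-1):
    """Smith-Waterman with affine gaps (Gotoh's recurrence)."""
    n, m = len(query), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # best local score
    E = [[0] * (m + 1) for _ in range(n + 1)]  # horizontal-gap state
    F = [[0] * (m + 1) for _ in range(n + 1)]  # vertical-gap state
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # E/F remember whether the running gap is being opened or
            # extended, mirroring the auxiliary gap registers in the text.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_ext)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_ext)
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

For instance, aligning "ACGT" against "ACXGT" pays a single gap-open penalty rather than two independent gap costs.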
4.2 Viterbi (Profile HMMs)
The Viterbi algorithm can find the most likely path of hidden states in an HMM for a given sequence of observed outputs. As previously mentioned, this algorithm is well suited for solving sequence alignment problems, with the help of profile HMMs. These HMMs take into account a family of similar sequences (a profile), thus enabling an alignment between a query sequence and the whole family (which can be seen as the reference sequence) at once. They will also require additional states not
present in normal HMMs, as depicted in figure 2.7. These special states are specific to the multihit local alignment mode, which achieves several local alignments between the compared sequences. This model was chosen to facilitate the comparison with the GPP implementation of the Viterbi algorithm presented in the next chapter, as well as to enable a comparison with the previously explained SW algorithm.
The considered Viterbi algorithm implementation will follow the same anti-diagonal parallelism and
processing scheme along the query sequence as the SW algorithm (see figure 4.1). It will have 16-bit
words, resulting in 8 cells being computed at every iteration, two per execution unit. This translates into
8 query and reference symbols being compared every iteration.
This implementation will require, in addition to the query sequence, a profile corresponding to the
reference sequence, with transition and emission values between all the existing states, all stored in
memory. The profile should follow the optimizations made by the HMMER [9] application, since it will
be used as a comparative study in the following chapter. This optimized profile has the transition and
emission values aligned to the algorithm’s access pattern, resulting in faster accesses for these values.
However, the access pattern implemented by the HMMER application consists of a striped pattern along the query sequence, based on Farrar's [12] implementation of the SW algorithm (see figure 4.5(a)). As a result, the profile must be modified to adapt to the anti-diagonal access pattern used in the proposed architecture.
Given that two symbols are being compared in each unit, the profile should then group the emission and transition scores in pairs, so that both scores can be retrieved by a given unit with a single load instruction. In fact, after analyzing the profile in the HMMER tool, it was observed that each combination of query-reference symbols only requires a total of two different emission/transition scores, instead of a different score for every state. This occurs due to score overlapping between different states. Furthermore, given that two cells are computed in each unit, this results in 2 load instructions per unit, for a total of 8 load instructions at every iteration. Figure 4.5(b) depicts the emission/transition score pattern that should be used for the implemented architecture. It is important to notice that these memory accesses will not have any influence on the algorithm throughput, since they can be performed exclusively by the DSU, in parallel with the main algorithm operations (see Appendix B.1).
Furthermore, both the emission and transition scores, ordered according to the query sequence, have new values retrieved for each new reference symbol. Given the potentially large size of the query sequences, the storage of all emission and transition scores cannot be accommodated in the register banks of the proposed architecture. Hence, similarly to the SW algorithm, only the smaller required subset of scores is available at any given instant, with the rest being stored in memory. Effectively, for every sequence symbol being computed in the proposed architecture, there is a different set of emission and transition scores, of which a small subset must be retrieved at every iteration.
(a) Profile example for the HMMER [9] platform (left), with the respective striped pattern (right). Each cell in the transition costs matrix has the 4 costs for all the 4 cells computed in parallel. For each cell, only two different transition scores are used for all the 7 transitory states (represented in grey). The last row represents the transition scores that are necessary for the lazy loops resulting from the striped pattern.

(b) Profile example (with random costs) for the proposed architecture (left), with the respective anti-diagonal pattern (right). Each unit computes 2 cells, which results in 4 transition/emission costs necessary for each unit. The 2 costs per cell cover all the transitory states. The cells in each unit have their correspondent in the diagonal pattern matched by the colored circles.

Figure 4.5: Comparison of example profiles for the HMMER platform [9] (computing 4 cells in parallel) and the proposed architecture (computing 8 cells in parallel). The way that the scores are ordered according to the used processing pattern is highlighted in both examples.

The operations required to compute a pair of cells in one execution unit are listed in figure 4.6. Both the sequence and query symbols, as well as the respective emission/transition scores required for any
given iteration, are stored in the respective register banks, prior to any of the cell operations being computed. Just like in the SW algorithm, the main operations for the three main states (M, I and D) consist of sums/subtractions and maximum operations. The same also applies to the special states B, E and J.
Figure 4.6: Main iteration (inner loop) operations for the Viterbi algorithm in the proposed architecture. Only the execution unit instructions are depicted. For the full pseudo-code consult Appendix B.1.

The dependencies required for the M state will differ from the SW algorithm, since they will now require both the diagonal dependencies of the I and D states, whereas, in the SW, only the M diagonal dependency and the current I and D states were required. This will result in a delayed load/store
scheme by using additional registers to store the previous and current scores, similar to the solution
used for the diagonal dependencies in the SW implementation. The remaining dependencies for the
I and D states are implemented in the same way as their SW counterpart (see the equations (2.10),
(2.11) and (2.12) in chapter 2).
The special states B, E and J are also required to be updated at every iteration, since the B state
dependency is used in the computation of the M state score, while depending itself on the J state. In
turn, the J state depends on the E state (see figure 2.7).
The E score corresponds to the current maximum score for the corresponding sequence symbols in any execution unit. Accordingly, it has to be constantly updated at every iteration and propagated to the computation of the states J and B. In turn, the J state compares the cost of moving from the updated E state against the cost of remaining in the J state. Finally, the B score takes into account the newly updated J score and compares its cost to the cost of moving from state N to state B. This special state N is computed in the outer loop, since it only depends on the current reference sequence symbol. The different loop and move transition costs are constant throughout the algorithm computations and thus are pre-stored in the register banks for faster access. These special states therefore introduce additional sum and maximum operations in the inner loop of the algorithm, as can be seen in figure 4.6.
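The E, J and B update chain described above can be sketched functionally as follows. This is a hedged sketch: the transition costs tEJ, tJJ, tNB and tJB are illustrative placeholders, not the actual profile values used by the architecture.

```python
def update_special_states(E, J, N, m_scores, tEJ=-1, tJJ=-1, tNB=-2, tJB=-2):
    """One iteration of the E -> J -> B special-state update chain."""
    E = max([E] + m_scores)    # E: running maximum over the new M scores
    J = max(J + tJJ, E + tEJ)  # J: remain in J vs. move from the updated E
    B = max(N + tNB, J + tJB)  # B: enter from N (outer loop) or from J
    return E, J, B
```

Because J feeds B, and B feeds the next iteration's M computation, this chain must be completed inside every inner-loop iteration.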
The remaining special state C is only updated in the outer-loop, just like the N state. While the N
state only depends on the current reference sequence symbol, the C state corresponds to the maximum
cell value in the respective execution unit. Therefore, after all execution units reach the end of the
query sequence and before they start computing a new sub-sequence of the reference sequence, the
maximum C score must be found between all units. This is possible by storing the C scores in the
shared registers, which makes them available for all units. The final C score is then stored in the first
execution unit and a new sub-sequence of symbols can start its computation (see figure 4.7).
Figure 4.7: Outer loop pseudo-code of the Viterbi algorithm in the proposed architecture. Only the execution unit instructions are depicted; in the outer loop initialization, the vectors are set to -infinity, and OR and SUM instructions are used to avoid adding FUs. For the full pseudo-code consult Appendix B.1.
The fact that all execution units reach the end of the query sequence before starting the alignment of a new sub-sequence introduces a small delay that was nonexistent in the SW implementation, since there will now be 3 initialization and finalization iterations, at the beginning and at the end of the query sequence, respectively, for every new sub-sequence of the reference sequence, as can be seen in figure 4.8. Additionally, the processing scheme will also have critical sections just like those seen in the SW algorithm (see figure 4.8). To solve them, a similar register window scheme is used, where the dependencies generated in the last execution unit are stored in memory after they are computed, and the dependencies required by the first unit are loaded before they are needed. This is also complemented by the delayed load/store scheme mentioned above.
The ILP that is adopted in the Viterbi implementation also differs from the one observed for the SW algorithm. Previously, each execution unit was 1 instruction in advance of its adjacent unit, resulting in the most advanced unit being 4 instructions in advance of the most delayed unit. In the Viterbi implementation, the delay between instructions only occurs in pairs, with the first two units being 1 instruction in advance of the last two units. This can be seen in figures 4.6, 4.7 and in Appendix B.1 (where the instructions appear in pairs). This was done in order to keep the same number of FUs that were used for the SW algorithm implementation. If an identical ILP extraction was used, the required number of FUs would be greater, but it would come with an increase
in performance.

Figure 4.8: Critical section of the Viterbi implementation between two sub-sequences of the reference sequence for an example case with 4 execution units. Each anti-diagonal/color/symbol represents a different iteration, with 7 iterations being depicted (four in sub-sequence 0 and three in sub-sequence 1). Sub-sequence 1 can only start its computations after all units finish their computations in sub-sequence 0. The dependencies required by Unit 0 for the sub-sequence 1 must be retrieved from memory and are represented by the red arrows.
The implementation of Viterbi's algorithm in the proposed architecture thus results in a stationary phase (inner loop) comprising an average of 23 instructions for an execution unit to complete an iteration of the algorithm, with two cells being updated simultaneously in each unit. After all units reach the end of the query sequence, the outer loop takes 18 cycles until a new sub-sequence starts being aligned.
4.3 Summary
This chapter described the implementations of the SW and Viterbi algorithms in the proposed architecture.
These algorithms compute the sequence alignment of a reference sequence against a query sequence, exploiting anti-diagonal parallelism. This processing scheme avoids any dependency between the cells being processed, thus increasing performance. The algorithms also take advantage of the mechanisms available in the proposed processor, such as the sniffing mechanism, the shared registers and the DSU, which parallelizes memory accesses.
Finally, the pseudo-code for both algorithms is also presented in this chapter.
5 Prototyping and Evaluation

Contents
5.1 Hardware Prototype . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . 55
  5.2.1 Reference State-of-the-art Architectures . . . . . . . . . . 55
  5.2.2 Application Benchmark . . . . . . . . . . . . . . . . . . . 56
  5.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . 59
5.3 Performance and Energy Efficiency . . . . . . . . . . . . . . . . 64
  5.3.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . 65
  5.3.2 Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
This chapter details the reference state-of-the-art architectures that were used to evaluate the benchmarked algorithm applications: the SW and Viterbi algorithms. The implementation of these applications in the evaluated architectures is also detailed, together with the respective datasets.
A performance evaluation of the presented architectures is then conducted, followed by a performance and energy efficiency evaluation, completing the evaluation tests.
5.1 Hardware Prototype
The proposed architecture was prototyped in a Zynq SoC 7020 FPGA [35]. The implemented configuration issues, at each clock cycle, one bundle of instructions to four 32-bit execution units and one DSU, each using vectorial instructions to process multiple cells in parallel. This results in a 128-bit wide VLIW, allowing the computation of the 16 8-bit (for the SW algorithm) or 8 16-bit (for the Viterbi algorithm) cells in parallel that are used by the considered benchmark algorithms. The register banks and memories share the same width as the execution units, thus being composed of several cells in each register and memory block.
The synthesis and place-&-route of the architecture was performed by using the Xilinx ISE 14.4
tool. The reported amount of occupied resources are presented in table 5.1. As can be observed, the
proposed architecture uses 6% of the Slice Registers, 50% of the Slice LUTs, and 5% of the BRAMs
available on the Zynq SoC 7020, achieving a maximum post-route operating frequency of 98.5 MHz.
By using the Xilinx Power Estimation tool [36], we further estimated the power consumption of the
proposed processor. Assuming worst-case conditions for flip-flop and memory updates, it results in a
power consumption of 0.584 W.
Table 5.1: Hardware resources, operating frequency and power estimation for the proposed architecture.

Hardware Resources   Used    Total    Utilization
Slice Registers      7135    106400   6%
Slice LUTs           26725   53200    50%
36-bit Block RAMs    7       140      5%
Frequency            98.5 MHz
Power                0.584 W
The amount of used Slice LUTs corresponds to 50% of the total available LUTs, and thus will be the limiting factor of the processor scalability when increasing the number of execution units or the vector lengths. In fact, a scalability evaluation of the proposed architecture was performed, showcasing the hardware requirements. Such a study was conducted by changing the size of the vector in all execution units from 32 to 40 bits (increase in DLP), and by including an additional execution unit (increase in ILP). The increase of the vector width results in a 21.4% and 24.6% increase of slice registers and LUTs, respectively, while the addition of one execution unit results in an increase of 23.3% and 29.9% in slice registers and LUTs. The number of Block RAMs is only affected by changes of the vector width, increasing by one unit for every 16 bits added to the length of the vector.
Despite the increase in hardware resources, the estimated power drops to 0.504 W (13.7%) when the vector width increases to 40 bits, and to 0.563 W (4%) with the addition of an execution unit. This can be explained by the significant drop in the operating frequency in both situations. Figure 5.1 summarizes the hardware scalability results.
Configuration               Slice Registers   Slice LUTs   BRAMs   Frequency [MHz]   Power [W]
4 32-bit units (baseline)   7135              26725        7       98.5              0.584
4 40-bit units              8662              33300        7       66.4              0.504
5 32-bit units              8795              34726        7       74                0.563
FPGA Total                  106400            53200        140

Figure 5.1: Hardware scalability of the proposed architecture. The considered evaluations included increasing the width of the vectors, as well as increasing the number of execution units. The obtained hardware resources, operating frequency and power estimation are presented.
5.2 Performance Evaluation
This section details the reference state-of-the-art architectures and compares them to the proposed architecture, by using performance evaluation metrics. It also presents the application benchmarks and the respective datasets.
5.2.1 Reference State-of-the-art Architectures
The proposed architecture was evaluated against three distinct state-of-the-art architectures, representing three distinct domains: i) mobile and low-power GPPs; ii) high-performance GPPs; iii) programmable ASIPs.
ARM Cortex-A9: A low-power GPP running at an operating frequency of 533 MHz. It is integrated within the Zynq SoC 7020 FPGA (the same board used for the proposed architecture), constituting the PS of the SoC. Its architecture supports out-of-order execution, with dual instruction issue and 128-bit SIMD extensions. This allows issuing up to 2 instructions per clock cycle. In order to take full benefit of all vector capabilities of the ARM processor, the processor's SIMD extension (NEON intrinsics [37]) is used.
Intel Core i7 3820: A high-performance GPP, running at a maximum frequency of 3.6 GHz. This processor uses a complex control structure capable of multiple instruction issue with out-of-order and speculative execution (issuing up to 6 micro-ops per clock cycle [38]), achieving an average of 2 Instructions Per Cycle (IPC) for the evaluated algorithms and respective datasets. The SSE2 SIMD extension [38] was used with 128-bit wide vectors.
Bioblaze [23]: A dedicated ASIP, running at a frequency of 158 MHz. It uses a 128-bit adapted SIMD extension ISA, and it was implemented in the same Zynq FPGA, for a fair comparison.
The SW algorithm was implemented in all architectures, while the Viterbi algorithm was only imple-
mented in the first two.
5.2.2 Application Benchmark
The benchmark applications consist of the previously introduced DP algorithms: the SW and Viterbi algorithms. Both were implemented to solve sequence alignment problems between a query and a reference sequence.
5.2.2.A Smith-Waterman
As described in the previous section, the considered implementation of the SW algorithm uses 8 bits for all symbols and scores. Given that the vector lengths are dimensioned to a maximum width of 128 bits, this results in a total of 16 8-bit cells being processed in parallel (4 cells per execution unit in the proposed architecture).
The considered SW algorithm implementation was already detailed in chapter 4. To summarize, the algorithm is parallelized along the anti-diagonal (in order to avoid data dependencies) and processed along the query sequence, aligning smaller reference sub-sequences at a time. During the steady state of the algorithm, this processing scheme results in only 5 clock cycles per iteration of the DP scoring matrix. In practice, given the 16-cell parallelism, this results in 3.2 cells being computed each clock cycle. This is made possible by the DSU, which parallelizes the memory accesses, removing
their impact from the inner loop of the algorithm. Nevertheless, the considered processing scheme presents critical sections where an extra memory access is required to migrate to a new reference sub-sequence. However, these memory accesses can also be performed by the DSU, thus eliminating any performance impact caused by the outer loop.
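The steady-state figures quoted above can be cross-checked with a few lines of arithmetic. The MCUPS value is a derived ideal upper bound at the 98.5 MHz post-route frequency reported in section 5.1, ignoring start-up and outer-loop effects.

```python
cycles_per_iteration = 5   # SW inner-loop length in the proposed VLIW
cells_per_iteration = 16   # 4 execution units x 4 8-bit cells each
cells_per_cycle = cells_per_iteration / cycles_per_iteration  # = 3.2

# Ideal steady-state bound at the 98.5 MHz post-route frequency,
# in millions of cell updates per second (MCUPS):
peak_mcups = cells_per_cycle * 98.5
```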
For the remaining benchmarked architectures, the implemented SW algorithm follows Farrar's implementation [12], by using their SIMD ISA extensions with a vector length equivalent to that of the proposed architecture (128 bits), to guarantee a fair comparison. Furthermore, only one core of each architecture is used.

This implementation adopts a striped access pattern processing scheme, along the query sequence direction, where the computations are carried out in several separate F stripes that cover different parts of the query sequence. Accordingly, the query is divided into F p-length segments, where p is given by the number of vector elements that can be simultaneously accommodated in a SIMD register (see figure 5.2(a)). This results in a value of p equal to 16, for 8-bit data elements and 128-bit SIMD registers.

(a) Memory layout for the query profile. The vectors run parallel to the query sequence in a striped pattern.
(b) Data dependencies between the last F vector and the first.
Figure 5.2: Striped pattern processing scheme and correspondent dependencies (figures taken from [12]).
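For reference, the striped index mapping used by Farrar's scheme can be sketched as follows. This is a hedged sketch whose notation differs slightly from the text above: here p is the number of SIMD lanes, t = ceil(Q/p) is the segment length, and padding conventions vary between implementations.

```python
def striped_order(Q, p):
    """Query indices in striped processing order: vector v holds
    positions v, v + t, ..., v + (p - 1) * t, with t = ceil(Q / p);
    positions past the end of the query are padding (None)."""
    t = -(-Q // p)  # segment length, ceil(Q / p)
    return [[v + k * t if v + k * t < Q else None for k in range(p)]
            for v in range(t)]
```

With Q = 8 and p = 4 this yields [[0, 2, 4, 6], [1, 3, 5, 7]]: each 4-lane vector touches one query position per segment, which is what keeps the per-vector updates independent within a pass.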
However, the data elements in this processing scheme are not fully independent, since the F segments have vertical dependencies on each other (see figure 5.2(b)). Hence, after all segments are processed, a lazy loop is executed in order to verify if any data hazards have occurred. If a correction is needed, a second pass of the loop is required to correct the errors, before a new reference symbol is loaded for alignment. Although this loop is done in the outer loop of the algorithm (after the query sequence is fully swept), its performance impact is still very relevant, especially when compared to the anti-diagonal processing scheme, where no data dependencies occur and, therefore, no lazy loops are required.
The extended SIMD ISA offered by the Bioblaze ASIP is specially tailored for the SW algorithm. Therefore, it results in an accelerated version of the original Farrar implementation, since an efficient fine-grain parallelism exploitation can be extracted. However, it still remains the same algorithm, with its striped processing scheme and lazy loops.
Dataset
To benchmark the SW algorithm, a DNA dataset composed of several reference sequences (ranging from 128 to 16384 elements) and a set of query sequences with lengths ranging from 20 to 2276 elements was used. The reference sequences correspond to twenty indexed regions of the Homo sapiens breast cancer susceptibility gene 1 (BRCA1 gene, NC 000017.11). The query sequences were obtained from a set of 22 biomarkers for diagnosing breast cancer (DI183511.1 to DI183532.1) and a fragment, with 68 base pairs, of the BRCA1 gene with a mutation related to the presence of a Serous Papillary Adenocarcinoma (S78558.1).
5.2.2.B Viterbi
The considered implementation of the Viterbi algorithm adopts a representation with 16 bits for all
symbols and scores. The vector lengths are dimensioned to a maximum width of 128-bits, which results
in a total of 8 16-bit cells being processed in parallel (2 cells in each execution unit for the proposed
architecture), corresponding to double the cell size that was adopted in the SW algorithm. This is due to
the higher precision requirements of the Viterbi algorithm versus the SW.
The Viterbi algorithm implementation on the proposed architecture was already described in detail
in chapter 4. Just like the SW, the algorithm is parallelized along the anti-diagonal and along the query
sequence, partitioning the reference sequence into smaller sub-sequences. As a result, during the steady state of the algorithm, each iteration takes an average of 23 clock cycles, computing 8 cells (effectively taking 2.875 clock cycles per cell, given the 8-cell parallelism). This is made possible by the
DSU, which parallelizes the high number of memory accesses, removing their impact from the inner
loop of the algorithm. Additionally, the processing scheme presents critical sections whenever the end
of the query is reached. These critical sections will introduce a small computational delay (nonexistent
in the SW algorithm) in order to ensure the commitment of the data dependencies. Therefore, the outer
loop accounts for 3 additional inner loop iterations, or 69 clock cycles.
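These figures are internally consistent, as a quick check shows:

```python
inner_cycles = 23        # Viterbi inner-loop length (figure 4.6)
cells_per_iteration = 8  # 4 execution units x 2 16-bit cells each
cycles_per_cell = inner_cycles / cells_per_iteration  # = 2.875
outer_delay = 3 * inner_cycles  # 3 extra inner-loop iterations = 69 cycles
```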
For the remaining evaluation platforms, HMMER's [9] Viterbi implementation was used. This implementation follows a processing scheme very similar to Farrar's implementation of the SW algorithm, with the required modifications to suit the Viterbi algorithm. As such, the implementation follows the same striped access pattern processing scheme along the query sequence, where the computations are carried out in several separate F stripes that cover different parts of the query sequence.
Differently from the SW algorithm, all dependencies for the match states in Viterbi's algorithm depend on scores from previous row and column states (as seen in chapter 4). Therefore, HMMER's implementation uses a delayed load/store scheme that only stores the new values after the preemptive load of the previous values. Although this algorithm inherently has more instructions than the SW, this instruction reordering helps to minimize the number of required instructions, at the cost of more storage. Additionally, the lazy loops will still exist in the outer loop of the algorithm (whenever the end of the query is reached). However, unlike their SW counterparts, the lazy loops in Viterbi's algorithm are simpler, with a lower impact on the resulting performance.
Dataset
To evaluate Viterbi's algorithm implementation, a sample of 28 HMMs from the Dfam database of Homo sapiens DNA [39] was used. The adopted model lengths vary from 60 to 3000, increasing in steps of
roughly 100 model states. These models were created by the HMMER3.1b1 tool [9] and their complete
list is presented below (their length is prefixed to the model name):
M0063-U7 M0700-MER77B M1409-MLT1H-int M2204-CR1 Mam
M0101-HY3 M0804-LTR1E M1509-LTR104 Mam M2334-L1M2c 5end
M0200-MER107 M0900-MER4D1 M1597-Tigger6b M2434-L1MCa 5end
M0301-Eulor9A M1000-L1MEg2 5end M1727-L1P3 5end M2532-L1MC3 3end
M0401-MER121 M1106-L1MD2 3end M1817-REP522 M2629-L1MC4a 3end
M0500-LTR72B M1204-Charlie17b M1961-Charlie4 M2731-Tigger4
M0600-MER4A1 M1302-HSMAR2 M2101-L1MEg 5end M2858-Charlie12
A query sequence (generated by the HMMER tool) with a length of 10000 symbols was used to eval-
uate the alignment against all the above reference sequences. Additionally, in order to study the impact
of both the query and reference lengths in the algorithm performance, a sample of 17 generated query
sequences, with lengths ranging from 20 to 10000, was used to evaluate the algorithm’s performance in
the alignment against the longest reference sequence with a length of 2991 symbols.
5.2.3 Performance Evaluation
In the proposed architecture, both the RAM and the local fast memory are pre-loaded with the refer-
ence and query sequence (RAM), together with all the necessary constants and cost/score values (both
memories) required by the evaluated algorithms. Therefore, only the algorithm steps are accounted for
in the performed evaluations.
Accurate clock cycle measurements of the time required to execute each biological sequence analysis in the proposed platform were obtained using Xilinx ISim [40]. In the Bioblaze, the clock cycle measurements were obtained using Modelsim SE 10.0b [41]. In the ARM Cortex-A9 and the Intel Core i7, the system timing functions were used to determine the total execution time of the DNA sequence alignment. To improve the measurement accuracy, several repetitions of the same alignment were performed. The obtained values were subsequently divided by the number of repetitions and converted to clock cycles using the processor clock frequency.
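The conversion just described can be sketched as a small helper (an illustration under the stated assumptions; cycles are recovered from wall-clock time as seconds × clock frequency):

```python
def avg_cycles_per_run(total_seconds, repetitions, clock_hz):
    """Average clock cycles per alignment: divide the measured time by
    the repetition count, then convert to cycles with the clock rate."""
    return (total_seconds / repetitions) * clock_hz

# e.g. 2 s for 10 repetitions on a 100 MHz clock -> 20e6 cycles per run
print(avg_cycles_per_run(2.0, 10, 100e6))  # 20000000.0
```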
The performance evaluation then relies on two metrics: the number of Clock Cycles per Cell Update (CCPCU) and the number of Cell Updates Per Second (CUPS).
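Both metrics reduce to a one-line formula each; the sketch below (helper names are illustrative) also checks them against one value reported later, in Table 5.2 and figure 5.4(a), for the proposed architecture:

```python
def ccpcu(cycles, m, n):
    """Clock Cycles per Cell Update: c / (m * n); lower is better."""
    return cycles / (m * n)

def cups(m, n, seconds):
    """Cell Updates per Second: (m * n) / t; higher is better."""
    return (m * n) / seconds

# Sanity check against Table 5.2 / figure 5.4(a): 2.911e6 cycles for a
# 2276-symbol query against a 4092-element reference gives ~0.31 CCPCU.
print(round(ccpcu(2.911e6, 2276, 4092), 2))  # 0.31
```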
5.2.3.A Smith-Waterman
Table 5.2 depicts the average number of clock cycles to complete the DNA sequence alignment in all
evaluated architectures, for the previously presented dataset. The resulting clock cycle ratios between
the reference architectures and the proposed architecture can be observed in the respective columns
(relating the observed differences in terms of clock cycles), which account for the affine model of the
algorithm.
The charts in figure 5.3 were drawn in order to study how the number of clock cycles is affected by the
length of each sequence (both query and reference sequences). The plot in figure 5.3(a) represents the
number of clock cycles of the Bioblaze and the proposed architecture for an alignment between a fixed
Table 5.2: Average number of clock cycles for different DNA query sequences matched against a 4092-element reference sequence, using the SW algorithm and the considered execution platforms, with the respective clock cycle ratios.

Clock Cycles [×10^6]
Query Size                20     68     74     85     94    685   1861   2276
Proposed Architecture  0.026  0.087  0.095  0.109  0.120  0.876  2.380  2.911

Clock Cycles (c.c.) [×10^6]
Query  ARM Cortex-A9    c.c.  BioBlaze    c.c.  Intel Core   c.c.
Size   (NEON)          ratio  [23]       ratio  i7 3820     ratio
  20    1.154         44.990   0.307    11.969   0.384     14.769
  68    1.373         15.776   0.555     6.377   0.606      6.966
  74    1.339         14.139   0.543     5.734   0.429      4.516
  85    1.470         13.515   0.631     5.801   0.480      4.404
  94    1.373         11.415   0.627     5.213   0.487      4.058
 685    6.303          7.195   3.375     3.853   1.504      1.717
1861   16.262          6.833   8.848     3.718   3.530      1.483
2276   19.491          6.696  10.744     3.691   4.163      1.430
query sequence (with 64 symbols) and multiple references (ranging from 128 to 16384 symbols). The
plot in figure 5.3(b) represents the number of clock cycles of the same architectures for an alignment
between a fixed reference sequence (with 4096 symbols) and multiple queries (ranging from 20 to 2276 symbols).
Both graphics are accompanied by the respective clock cycle ratios between the reference architecture
and the proposed architecture.
As it can be observed, the variation of the length of both sequences does not have any significant
impact on the resulting gains obtained for the proposed architecture. In fact, the clock cycle ratio tends
to stabilize around 8.15 for large reference sequences aligned to a fixed query sequence composed of
64 symbols, and around 3.7 for large queries aligned to a fixed reference composed of 4096 symbols.
The results shown in figure 5.3(a) (where the query sequence is matched against increasingly long reference sequences) also demonstrate that the percentage of lazy loop occurrences remains almost constant throughout all the alignments, as supported by the speedup stabilization.
The execution times for the remaining architectures (ARM Cortex-A9 and Intel i7) are not depicted in these graphics, for clarity. In fact, since they run the same algorithm implementation as the BioBlaze (Farrar's implementation), they yield very similar instruction streams, resulting in similar plots when compared to the proposed architecture.
Figure 5.4 presents a more convenient performance metric: the clock cycles per cell update (CCPCU) (lower is better). These values were obtained by dividing the total number of clock cycles (c) by the product of the lengths of the reference and query sequences (m and n, respectively): c/(m × n). As it can be seen, the proposed architecture achieves a number of CCPCU 13.7x lower than the ARM Cortex-A9, even though the latter processor can issue two instructions per clock cycle. When compared with the Bioblaze and the Intel i7, a CCPCU 5.44x and 4.32x lower,
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Reference Length; series: BioBlaze, VLIW, C.C. Ratio]
(a) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed query sequence composed of 64 symbols and multiple reference sequences.
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Query Length; series: BioBlaze, VLIW, C.C. Ratio]
(b) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed reference sequence composed of 4096 symbols and multiple query sequences.
Figure 5.3: Comparison of the average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed VLIW architecture, with different query and reference widths.
respectively, is achieved. This proves that the SW algorithm has a much better raw performance in the
proposed architecture than in the other architectures, showcasing the advantages of a better data-level
parallelism along the anti-diagonal.
In addition to the CCPCU comparison, the attained raw throughput, measured in Cell Updates per Second (CUPS), was also assessed (see figure 5.4(b)). This metric accounts for the total number of
cells (given by the length of the query sequence (m) times the length of the reference sequence (n))
that are updated in a corresponding runtime (t), in seconds (accounting for the maximum operating frequency of each implementation platform): (m × n)/t. Therefore, the higher the CUPS, the better the
performance.
The analysis of the MCUPS metric demonstrates that, despite using a considerably lower operating
frequency than the other architectures, the proposed architecture achieves a throughput superior to both
the ARM (2.54x) and the Bioblaze (5.01x). However, as it would be expected, the Intel i7 achieves a
[Bar chart (CCPCU): ARM Cortex-A9 4.29, BioBlaze 1.70, Intel Core i7 3820 1.35, Proposed Architecture 0.31]
(a) Clock Cycles per Cell Update (CCPCU)
[Bar chart (MCUPS, log scale): ARM Cortex-A9 124.24, BioBlaze 62.94, Intel Core i7 3820 2274.07, Proposed Architecture 315.18]
(b) Mega Cell Updates per Second (MCUPS)
Figure 5.4: Performance evaluation results for the SW algorithm implementation in all evaluation architectures.
much superior throughput (7.2x over the proposed architecture) given its much higher operating fre-
quency (31.17x over the proposed architecture).
5.2.3.B Viterbi
The average number of clock cycles for the Viterbi algorithm to execute the DNA sequence alignment in the considered architectures is presented in table 5.3. This table depicts the results obtained for an
alignment between selected reference sequences from the dataset and a fixed query sequence with
a length of 10000 symbols. It also includes the respective clock cycle ratios between the reference
architectures and the proposed architecture (relating the observed differences in terms of clock cycles).
Table 5.3: Average number of clock cycles for different DNA reference sequences matched against a 10000-element query sequence using the Viterbi algorithm, when implemented in the considered execution platforms.

Clock Cycles (c.c.) [×10^6]
Reference  Proposed      ARM Cortex-A9    c.c.  Intel Core   c.c.
Size       Architecture  (NEON)          ratio  i7 3820     ratio
 200        6             130           22.624   46.05       7.68
 472       14             311           22.911   54.17       3.29
 900       26             565           21.822   95.17       3.66
1305       38             817           21.754  141.22       3.72
1727       50            1188           23.885  190.34       3.81
2204       63            1513           23.847  239.46       3.80
2532       73            1771           24.297  251.11       3.28
2991       86            2117           24.588  288.58       3.36
Similarly to what was done for the SW algorithm, figure 5.5 depicts additional plots representing the average number of clock cycles and the corresponding variation for the Viterbi algorithm implementation for several query-reference sets, when considering the proposed architecture and the ARM Cortex-A9. The Intel architecture is not presented, for clarity. Additionally, since it implements
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Reference Length; series: ARM, VLIW, C.C. Ratio]
(a) Average number of clock cycles for a fixed query sequence composed of 10000 symbols and multiple reference sequences.
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Query Length; series: VLIW, ARM, C.C. Ratio]
(b) Average number of clock cycles for a fixed reference sequence composed of 2991 symbols and multiple query sequences.
Figure 5.5: Comparison of the average number of clock cycles between the ARM Cortex-A9 and the proposed VLIW architecture, when executing the Viterbi algorithm with different query and reference widths.
the same algorithm as the ARM Cortex-A9, it yields very similar instructions, resulting in a very similar
plot (after accounting for the performance differences).
The plot in figure 5.5(a) refers to the average number of clock cycles of a fixed query sequence (composed of 10000 symbols) aligned against multiple references, while the plot in figure 5.5(b) refers to the average number of clock cycles of a fixed reference sequence (with a length of 2991 symbols) aligned against multiple query sequences. As it can be observed, the increase of the reference sequence
length leads to a very slow stabilization of the clock cycle ratio of the proposed architecture over the
ARM, reaching a value of 25. When varying the query sequence length, the clock cycle ratio stabilizes
very fast with the length increase, at a value of 24.6. Given the slow rate of the clock cycle ratio stabilization in the plot in figure 5.5(a), these results demonstrate that the impact caused by the critical sections in the outer loop of the algorithm implementation in the proposed architecture is negligible, when compared to the other architectures.
Figure 5.6(a) depicts the CCPCU metric. By following a trend entirely similar to the previously pre-
sented results, the proposed architecture achieves a number of CCPCU 23.4x lower than the ARM and
3.45x lower than the Intel i7.
[Bar chart (CCPCU, log scale): ARM Cortex-A9 67.48, Intel Core i7 3820 9.94, Proposed Architecture 2.88]
(a) Clock Cycles per Cell Update (CCPCU)
[Bar chart (MCUPS, log scale): ARM Cortex-A9 7.9, Intel Core i7 3820 308.85, Proposed Architecture 34.26]
(b) Mega Cell Updates per Second (MCUPS)
Figure 5.6: Performance evaluation results for the Viterbi algorithm implementation in all evaluated architectures.
Additionally, figure 5.6(b) presents the attained raw throughput, measured in CUPS. The graphic
shows a speedup of the proposed architecture of 4.34x over the ARM Cortex-A9. However, given Intel’s
considerably higher operating frequency, the proposed architecture loses to it, with the Intel having a
speedup of 9.01x over the proposed architecture.
5.3 Performance and Energy Efficiency
The considered architectures were also evaluated regarding their energy efficiency and performance-
energy efficiency. The adopted energy efficiency metric is the Cell Updates per Joule (CUPJ), given by
the total number of processed cells, divided by the total consumed energy. Naturally, the higher the
CUPJ, the better the energy efficiency.
The adopted performance-energy efficiency metric is given in Cell Updates per Joule-Second (CUPJS)
and it can be regarded as an inversion and normalization of the commonly used Energy-Delay Product
(EDP) metric. In fact, while the EDP is generally given by the product of the total energy consumption
and the corresponding runtime, the adopted CUPJS is obtained by inverting the EDP and multiplying it
by the total number of processed cells. Just like the previous metrics, the higher the CUPJS, the better
the performance-energy efficiency. It is important to notice that the architecture with the best results
in terms of the CUPJS metric is not necessarily the architecture with the highest performance and the
lowest power consumption. In fact, a given platform can have extremely high performance at a high energy cost, and still achieve a better performance-energy ratio than an architecture with lower performance and very low energy consumption. The final conclusion must always take the target application domain into account: it often imposes strict power requirements, and therefore does not always favor the architecture with the best performance-energy efficiency alone.
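Under these definitions, both energy metrics reduce to a few lines (a sketch; the helper names are illustrative). Note that CUPJS = cells / (E · t) is exactly the EDP inverted and scaled by the number of cells:

```python
def cupj(m, n, energy_joules):
    """Cell Updates per Joule: processed cells per joule consumed."""
    return (m * n) / energy_joules

def cupjs(m, n, energy_joules, seconds):
    """Cell Updates per Joule-Second: the Energy-Delay Product
    (EDP = E * t) inverted and multiplied by the number of cells."""
    return (m * n) / (energy_joules * seconds)

# Energy follows from average power: E = P * t. For example, a platform
# dissipating 0.5 W over a 2 s run consumes 1 J.
energy = 0.5 * 2.0
print(cupj(1000, 1000, energy))        # 1000000.0
print(cupjs(1000, 1000, energy, 2.0))  # 500000.0
```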
Table 5.4 depicts the operating frequency and power estimation for all evaluated architectures. The
power corresponding to the proposed architecture, the ARM and the Bioblaze, was estimated with the
Xilinx Power Estimation tool [36], assuming worst-case conditions for flip-flop and memory updates. The
Intel i7 power was estimated by measuring the Intel performance counters, available in the processor.
Table 5.4: Operating frequency and power estimation for all evaluated architectures.

           Proposed Architecture  ARM      BioBlaze  Intel i7
Frequency  98.5 MHz               533 MHz  158 MHz   3070 MHz
Power      0.584 W                0.95 W   0.3 W     38 W
5.3.1 Smith-Waterman
The obtained CUPJ and CUPJS results for the SW implementation are presented in figures 5.7 (a)
and (b), respectively.
[Bar chart (MCUPJ): ARM Cortex-A9 130.78, BioBlaze 331.93, Intel Core i7 3820 59.84, Proposed Architecture 539.69]
(a) Mega Cell Updates per Joule (MCUPJ)
[Bar chart (PCUPJS): ARM Cortex-A9 127.47, BioBlaze 175.64, Intel Core i7 3820 368.9, Proposed Architecture 412.43]
(b) Peta Cell Updates per Joule-Second (PCUPJS)
Figure 5.7: Performance and energy evaluation results obtained for the SW algorithm implementation in all the evaluated architectures.
The proposed architecture achieves an energy efficiency 4.13x, 1.63x and 9.02x greater than the
ARM Cortex-A9, the BioBlaze and the Intel i7, respectively. This was expected, given the lower power
consumption of the proposed architecture against the ARM and the Intel. In fact, although the Bioblaze
has a lower power consumption, the higher throughput of the proposed architecture results in a better
energy efficiency.
Regarding the performance-energy metric, the proposed architecture achieved results 3.24x and
2.35x greater than the ARM Cortex-A9 and the Bioblaze, respectively. These results were also expected,
given the higher raw throughput and better energy efficiency of the proposed architecture. Regarding
the Intel i7, and despite its very high computational performance, the proposed architecture, with its
low-power consumption, manages to achieve a performance-energy efficiency 1.12x higher, compensating
the lower performance with higher energy savings.
These results make the proposed architecture a well-suited candidate for mobile low-power environments. Furthermore, they demonstrate that the performance of a low-power architecture can be compared to that of a high-end GPP, when energy efficiency is also accounted for.
5.3.2 Viterbi
The obtained CUPJ and CUPJS results for the Viterbi implementation are represented in figures
5.8(a) and (b), respectively.
[Bar chart (MCUPJ, log scale): ARM Cortex-A9 8.31, Intel Core i7 3820 8.13, Proposed Architecture 58.66]
(a) Mega Cell Updates per Joule (MCUPJ)
[Bar chart (PCUPJS, log scale): ARM Cortex-A9 8.1, Intel Core i7 3820 50.1, Proposed Architecture 44.83]
(b) Peta Cell Updates per Joule-Second (PCUPJS)
Figure 5.8: Performance and energy evaluation results obtained for the Viterbi algorithm implementation in all the evaluated architectures.
As it can be seen, the proposed architecture presents an energy efficiency that is 7.06x and 7.22x
greater than the ARM Cortex-A9 and the Intel i7, respectively. These results were both expected given
the lower power consumption and high raw performance observed in the proposed architecture when
compared to the other evaluated architectures.
For the performance-energy metric, it is possible to observe a gain of about 5.53x over the ARM. However, when compared to the Intel, the proposed architecture has a worse performance-energy efficiency, with the Intel having a gain of 1.12x. This is mainly due to the much higher operating frequency of the Intel i7 processor (and thus its very high throughput), which, in the long run, compensates for its much higher energy consumption when compared with the proposed architecture.
However, as seen for the SW algorithm, the high energy efficiency of the proposed architecture makes it an excellent candidate for mobile low-power environments.
5.4 Summary
Two widely used DP algorithms (SW and Viterbi algorithms) were evaluated in the proposed archi-
tecture and in 3 alternative state-of-the-art architectures: the ARM Cortex-A9, representing a low-power
GPP; the Intel Core i7 3820, representing a high-performance GPP; and a low-power dedicated ASIP,
specifically tailored for the SW algorithm.
For both evaluated algorithm implementations, the proposed architecture manages to achieve a
better raw performance than all the reference architectures, with the exception of the Intel Core i7, given
its much higher operating frequency (31.17x the operating frequency of the proposed architecture). It
also achieved a better energy efficiency than all other architectures, validating the proposed architecture
for low-power embedded environments. A performance-energy metric was also evaluated, where the
proposed architecture managed to surpass all evaluated architectures with the exception of the Intel
Core i7 for the Viterbi implementation (although the Intel i7 only achieved a gain of 1.12x). This exception
can be explained by the fact that no optimized instructions (for the Viterbi algorithm) were added to the
ISA in the proposed architecture, as well as the difference in operating frequencies.
These results also demonstrate that the proposed programmable architecture, specially tailored for
DP algorithms, can compete with higher-end GPPs (performance wise), in low-power environments,
such as in embedded systems (e.g. biomarker detection SOCs).
6 Conclusions and Future Work
6.1 Conclusion
The proposed processor is based on a VLIW architecture, composed of several independent vector
execution units and a DSU to parallelize memory accesses. The architecture exploits DLP by computing
vector instructions in each execution unit, and exploits ILP by issuing a bundle of instructions to the
parallel execution units.
The custom ISA was specially adapted for DP algorithms, allowing a high level of parallelism, with
reduced and more efficient hardware requirements. Each execution unit has its own register bank,
assuring a convenient data organization and avoiding structural hazards. Additionally, the architecture
presents two distinct memories (a RAM and a local fast memory) and shared FUs, thus reducing the
impact caused by memory accesses and reducing the hardware requirements, respectively. It also
presents special mechanisms to access data on neighboring cells, such as sniffing and register window
mechanisms, together with shared memories, that further help reducing the number of clock cycles in
the algorithm implementations.
Two benchmark algorithms (the SW and the Viterbi algorithms) were implemented in the proposed
architecture and the reference state-of-the-art architectures, which consist of: i) a mobile low power
ARM Cortex-A9 GPP; ii) a high performance Intel Core i7 3820 GPP; iii) and a dedicated ASIP, the
Bioblaze. These algorithm implementations made use of the corresponding vector extensions in all
architectures, with 128-bit data vectors (8-bit words for the SW implementation and 16-bit words for the
Viterbi implementation).
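With 128-bit vectors, the data width directly sets the number of cells processed per vector instruction, a trivial check of the lane counts implied above:

```python
VECTOR_BITS = 128
lanes_sw = VECTOR_BITS // 8        # 16 parallel 8-bit cells (SW)
lanes_viterbi = VECTOR_BITS // 16  # 8 parallel 16-bit cells (Viterbi)
print(lanes_sw, lanes_viterbi)     # 16 8
```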
From the obtained performance results, the proposed architecture achieved a better throughput than
most architectures (maximum speedup of 5.01x over the Bioblaze, for the SW algorithm, and 4.34x
over the ARM Cortex-A9, for the Viterbi algorithm), only losing to the Intel i7 (in both algorithm imple-
mentations) due to its very high operating frequency (31.17x). However, the energy efficiency results
demonstrate that the proposed architecture is well suited for low power platforms. In fact, it achieved
an energy efficiency superior to all the reference architectures, reaching gains as high as 9.02x over
the Intel i7, for the SW algorithm implementation. When accounting for a performance-energy efficiency
metric, the proposed architecture still achieved better results than most reference architectures, only losing to the Intel i7 on the Viterbi implementation (which has 1.12x better performance-energy efficiency), due to its very high throughput, which compensates for its high power consumption.
Therefore, the presented results confirm that the devised architecture is a viable solution not only
for DP applications, but also for restricted low power environments, such as embedded systems. In
addition, the high performance-energy efficiency results demonstrate that it can even surpass state-of-
the-art GPPs, filling a gap in programmable low power and high performance architectures.
6.2 Future Work
The proposed VLIW architecture was implemented in a Zynq FPGA. In chapter 3, an interfacing
structure for the architecture was proposed, but it was not fully implemented. Completing the design and implementation of the interface would ensure that real-world applications could interact with the architecture, allowing the study of different integration technologies such as ASICs, FPGAs or low power
systems like biochips (for bioinformatic DP algorithms).
The scalability of the proposed architecture should also be further analyzed, in order to find the best
ratio between the vector width/number of execution units and the maximum performance and energy
efficiency.
Additionally, a broader set of algorithms, such as matrix chain multiplication and Dijkstra's shortest path, or even non-DP algorithms, should be considered and implemented in the architecture, since its ISA can be easily modified to extend the algorithm support. This study would consolidate the proposed design as a programmable and highly energy-efficient architecture.
Bibliography
[1] K. Shibu, Introduction to Embedded Systems, 1st Edition. McGraw-Hill Education, June 2009.
[2] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler, “Genbank,”
Nucleic acids research, vol. 28, no. 1, pp. 15–18, 2000.
[3] D. A. Benson, M. Cavanaugh, I. Karsch-Mizrachi, D. J. Lipman, and J. Ostell, “Genbank,”
Nucleic acids research, vol. 41, no. D1, pp. D36–D42, January 2013.
[4] T. F. Smith and M. S. Waterman, “Identification of Common Molecular Subsequences,” Journal of
molecular biology, vol. 147, no. 1, pp. 195–197, 1981.
[5] S. B. Needleman and C. D. Wunsch, “A general Method Applicable to the Search For Similarities
in the Amino Acid Sequence of Two Proteins,” Journal of molecular biology, vol. 48, no. 3, pp.
443–453, 1970.
[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic Local Alignment Search
Tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990.
[7] W. R. Pearson and D. J. Lipman, “Improved Tools For Biological Sequence Comparison,”
Proceedings of the National Academy of Sciences, vol. 85, no. 8, pp. 2444–2448, 1988.
[8] A. Viterbi, “Error Bounds For Convolutional Codes And an Asymptotically Optimum Decoding Algo-
rithm,” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[9] S. R. Eddy, “Accelerated Profile HMM Searches,” PLoS computational biology, vol. 7, no. 10, p.
e1002195, 2011.
[10] O. Gotoh, “An Improved Algorithm for Matching Biological Sequences,” Journal of molecular
biology, vol. 162, no. 3, pp. 705–708, 1982.
[11] A. Wozniak, “Using Video-Oriented Instructions to Speed Up Sequence Comparison,” Computer
applications in the biosciences: CABIOS, vol. 13, no. 2, pp. 145–150, 1997.
[12] M. Farrar, “Striped Smith–Waterman Speeds Database Searches Six Times Over Other SIMD Im-
plementations,” Bioinformatics, vol. 23, no. 2, pp. 156–161, 2007.
[13] T. Rognes and E. Seeberg, “Six-fold Speed-up of Smith–Waterman Sequence Database Searches
Using Parallel Processing on Common Microprocessors,” Bioinformatics, vol. 16, no. 8, pp. 699–
706, 2000.
[14] T. Rognes, “Faster Smith-Waterman Database Searches With Inter-Sequence SIMD Parallelisa-
tion,” BMC bioinformatics, vol. 12, no. 1, p. 221, 2011.
[15] C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen, Introduction to Algorithms. The MIT
press, 2001.
[16] S. R. Eddy, “Profile Hidden Markov Models,” Bioinformatics, vol. 14, no. 9, pp. 755–763, 1998.
[17] N. Casagrande, “B.A.B.A. - Basic-Algorithms-of-Bioinformatics Applet,” Last Accessed on 12
August, 2014. [Online]. Available: http://baba.sourceforge.net/
[18] E. Fosler-Lussier, “Markov Models and Hidden Markov Models: A Brief Tutorial,” International
Computer Science Institute Technical Report TR-98-041, 1998.
[19] R. Durbin, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-
bridge university press, 1998.
[20] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov Models in Compu-
tational Biology: Applications to Protein Modeling,” Journal of molecular biology, vol. 235, no. 5, pp.
1501–1531, 1994.
[21] Intel. (2013, Sep.) Intel R© 64 and IA-32 Architectures Software Developer’s Manual. Last Accessed
on September 10, 2014. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/
documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[22] P. Green. (1996) Swat Documentation. Last Accessed on September 10, 2014. [Online]. Available:
http://www.phrap.org/phredphrap/general.html
[23] N. Neves, N. Sebastiao, A. Patricio, D. Matos, P. Tomas, P. Flores, and N. Roma, “BioBlaze: Multi-
core SIMD ASIP for DNA Sequence Alignment,” in Application-Specific Systems, Architectures and
Processors (ASAP), 2013 IEEE 24th International Conference on. IEEE, 2013, pp. 241–244.
[24] W. Martins, J. del Cuvillo, F. Useche, K. B. Theobald, and G. R. Gao, “A Multithreaded Paral-
lel Implementation of a Dynamic Programming Algorithm For Sequence Comparison,” in Pacific
Symposium on Biocomputing, vol. 6, 2001, pp. 311–322.
[25] E. W. Edmiston, N. G. Core, J. H. Saltz, and R. M. Smith, “Parallel Processing of Biological Se-
quence Comparison Algorithms,” International Journal of Parallel Programming, vol. 17, no. 3, pp.
259–275, 1988.
[26] Intel. (2000) Using the Streaming SIMD Extensions 2 (SSE2) to Evaluate a Hidden Markov
Model with Viterbi Decoding. Last Accessed on September 10, 2014. [Online]. Available:
http://software.intel.com/sites/default/files/m/d/4/1/d/8/17679 ap-946 w hmm viterbi.pdf
[27] L. Jia, Y. Gao, J. Isoaho, and H. Tenhunen, “Design of a Super-pipelined Viterbi Decoder,” in Circuits
and Systems, 1999. ISCAS’99. Proceedings of the 1999 IEEE International Symposium on, vol. 1.
IEEE, 1999, pp. 133–136.
[28] N. Sebastiao, N. Roma, and P. Flores, “Scalable Accelerator Architecture for Local Alignment of
DNA Sequences,” 2010, unpublished.
[29] S. Derrien and P. Quinton, “Hardware Acceleration of HMMER on FPGAs,” Journal of Signal
Processing Systems, vol. 58, no. 1, pp. 53–67, 2010.
[30] N. Sebastiao, N. Roma, and P. Flores, “Integrated Hardware Architecture for Efficient Computation
of the n-Best Bio-Sequence Local Alignments in Embedded Platforms,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 20, no. 7, pp. 1262–1275, 2012.
[31] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, “Accelerating Compute-intensive Applications
With GPUs and FPGAs,” in Application Specific Processors, 2008. SASP 2008. Symposium on.
IEEE, 2008, pp. 101–107.
[32] K. Benkrid, Y. Liu, and A. Benkrid, “A Highly Parameterized and Efficient FPGA-based Skeleton
for Pairwise Biological Sequence Alignment,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 17, no. 4, pp. 561–570, 2009.
[33] A. C. Jacob, J. M. Lancaster, J. D. Buhler, and R. D. Chamberlain, “Preliminary Results in Accel-
erating Profile HMM Search on FPGAs,” in Parallel and Distributed Processing Symposium, 2007.
IPDPS 2007. IEEE International. IEEE, 2007, pp. 1–8.
[34] M. Ferreira, N. Roma, and L. M. Russo, “Cache-Oblivious Parallel SIMD Viterbi Decoding For Se-
quence Search in HMMER,” BMC Bioinformatics, vol. 15, no. 1, p. 165, 2014.
[35] Xilinx. (2013) Xilinx DS190 Zynq-7000 All Programmable SoC Overview. Last Accessed
on September 10, 2014. [Online]. Available: http://www.xilinx.com/support/documentation/
data sheets/ds190-Zynq-7000-Overview.pdf
[36] Xilinx. (2014) Power Estimator User Guide. Last Accessed on September 10,
2014. [Online]. Available: http://www.xilinx.com/support/documentation/sw manuals/xilinx2014 2/
ug440-xilinx-power-estimator.pdf
[37] ARM. (2014) ARM R© NEONTM Intrinsics Reference. Last Accessed on September 10,
2014. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A
arm neon intrinsics ref.pdf
[38] Intel. (2014) Intel R© 64 and IA-32 Architectures Software Developer’s Manual. Last Accessed
on September 10, 2014. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/
documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[39] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger, K. Hether-
ington, L. Holm, J. Mistry et al., “Pfam: The Protein Families Database,” Nucleic acids research, p.
gkt1223, 2014.
[40] Xilinx. (2012) ISim User Guide. Last Accessed on September 10, 2014. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/plugin_ism.pdf
[41] Mentor Graphics. (2012) ModelSim® User’s Manual. Last Accessed on September 10, 2014. [Online]. Available: http://www.microsemi.com/document-portal/doc_view/131619-modelsim-user
Appendix A: Proposed Architecture Instruction Set
Figure A.1: Full implemented instruction set.

Instruction                             Mnemonic        Opcode [12:6]   OpControl [5:0]

Arithmetic and Logic Instructions
No operation                            NOP             0000000         000000
Sum                                     SUM             0000010         000000
Sum with immediate                      SUMI            0000010         010000
Subtraction                             SUB             0000101         000000
Subtraction with immediate              SUBI            0000101         010000
Maximum                                 MAX             0000100         000000
Maximum and Move                        MAXMOV          0000011         100000
Comparison                              CMP             0010000         000000
Arithmetic Shift Right                  SRA             0001000         000000
Arithmetic Shift Left                   SLA             0001001         000000
Logic Shift Right                       SRL             0001010         000000
Logic Shift Left                        SLL             0001011         000000
Arithmetic Shift Right (immediate)      SRAI            0001000         010000
Arithmetic Shift Left (immediate)       SLAI            0001001         010000
Logic Shift Right (immediate)           SRLI            0001010         010000
Logic Shift Left (immediate)            SLLI            0001011         010000
Logic OR                                OR              0001100         000000
Logic AND                               AND             0001101         000000
Logic XOR                               XOR             0001110         000000
Logic OR with immediate                 ORI             0001100         010000
Logic AND with immediate                ANDI            0001101         010000
Logic XOR with immediate                XORI            0001110         010000

Memory Instructions
Index Memory address                    INDEX MADDR     1100101         000000
Load Byte                               LB              1100001         000000
Load Half-word                          LH              1100010         000000
Load Data                               LD              1100100         000000
Index local memory address              INDEX SPADDR    1101001         000000
Local memory Load                       SPAD LD         1101000         000000
Store Byte                              SB              1110001         000000
Store Half-word                         SH              1110010         000000
Store Data                              SD              1110100         000000

Control Instructions
Delayed Branch                          BRD             1010000         000000
Immediate Delayed Branch                BRID            1010000         010000
Delayed Branch Equal                    BEQD            1010001         000000
Immediate Delayed Branch Equal          BEQID           1010001         010000
Delayed Branch Not Equal                BNED            1010010         000000
Immediate Delayed Branch Not Equal      BNEID           1010010         010000
Delayed Branch Less than                BLTD            1010011         000000
Immediate Delayed Branch Less than      BLTID           1010011         010000
Delayed Branch Greater than             BGTD            1010100         000000
Immediate Delayed Branch Greater than   BGTID           1010100         010000

Semantics notes (entries recoverable from the figure):
- NOP: no operation.
- SUM/SUMI: Rd ← Ra + Rb (or Ra + Imm); in vector mode the addition is applied independently to each sub-word lane.
- SUB/SUBI, CMP: two's-complement subtraction, Rd ← Ra + ~Rb + 1; for CMP, the MSB is set to 0 if Ra ≥ Rb and to 1 otherwise.
- MAX: Rd ← Ra if Ra > Rb, else Rb (lane-wise in vector mode); for MAXMOV, the result is additionally concatenated to the sniff register.
- Shifts: Rd ← Ra shifted by Rb (or by the immediate); arithmetic shifts preserve the sign.
- LB/LH: load the addressed byte/half-word into the lane of Rd selected by Ta,Tb, with the remainder padded with zeroes; LD: Rd[31:0] ← M[IMR].
- SB/SH: store the byte/half-word lane of Rd selected by Ta,Tb to M[Ra+Rb]; SD: M[Ra+Rb] ← Rd[31:0].
- INDEX MADDR: IMR ← index(Ra, Rb); INDEX SPADDR: IMS ← index(Ra, Rb); SPAD LD: Rd ← Spad[IMS].
- Branches: delayed branches update PC ← PC + Rb (register form) or PC ← PC + Imm (immediate form) when the condition on Ra holds (equal, not equal, less than, or greater than zero).
Appendix B: Viterbi Pseudo-code
Figure B.1: Complete pseudo-code for the Viterbi implementation in the proposed architecture.

Cycle  Data Stream Unit                           Units 1-4
  1    INDEX *tsc++ | STORE xC
  2    LOAD *tsc++ | INDEX *tsc++
  3    LOAD *tsc++ | INDEX *tsc++                 OR dcv | OR dcv | SUM dcv | SUM dcv
  4    LOAD *tsc++ | INDEX *tsc++                 OR xEv | OR xEv | SUM xEv | SUM xEv
  5    LOAD *tsc++ | INDEX *tsc++                 OR mpv | OR mpv | SUM mpv | SUM mpv
  6    LOAD *tsc++ | INDEX *tsc++                 OR ipv | OR ipv | SUM ipv | SUM ipv
  7    LOAD *tsc++ | INDEX *tsc++                 OR dpv | OR dpv | SUM dpv | SUM dpv
  8    LOAD *tsc++ | INDEX *tsc++                 SUM xNv | SUM xNv
  9    LOAD *tsc++                                SUM xNv | SUM xNv
 10    INDEX xJ                                   SUM xBv | SUM xBv
 11    LOAD xJ (once for unit 0) | INDEX xB       SUM xBv | SUM xBv
 12    LOAD xB (once for unit 0)                  SUM mpv | SUM mpv
 13                                               MAX sv | MAX sv | SUM mpv | SUM mpv
 14    INDEX mpv                                  SUM ipv | SUM ipv | MAX sv | MAX sv
 15    LOAD mpv | INDEX dpv                       MAX sv | MAX sv | SUM ipv | SUM ipv
 16    LOAD dpv | INDEX ipv                       SUM dpv | SUM dpv | MAX sv | MAX sv
 17    LOAD ipv                                   MAX sv | MAX sv | SUM dpv | SUM dpv
 18                                               SUM sv | SUM sv | MAX sv | MAX sv
 19    STORE DMXo                                 MAX xEv | MAX xEv | SUM sv | SUM sv
 20    STORE MMXo                                 SUM dcv | SUM dcv | MAX xEv | MAX xEv
 21                                               SUM dcv | SUM dcv
 22                                               SUM mpv | SUM mpv
 23                                               SUM mpv | SUM mpv
 24    INDEX *tsc++                               SUM ipv | SUM ipv
 25    LOAD *tsc++ | INDEX *tsc++                 MAX sv | MAX sv | SUM ipv | SUM ipv
 26    LOAD *tsc++ | INDEX *tsc++                 SUM xJ | SUM xJ | MAX sv | MAX sv
 27    LOAD *tsc++ | INDEX *tsc++ | STORE IMXo    SUM xJ | SUM xJ
 28    LOAD *tsc++ | INDEX *tsc++                 MAX xJ | MAX xJ | SUM xJ | SUM xJ
 29    LOAD *tsc++ | INDEX *tsc++                 SUM xJ | SUM xJ
 30    LOAD *tsc++ | INDEX *tsc++                 SUM xB | SUM xB | MAX xJ | MAX xJ
 31    LOAD *tsc++ | INDEX *tsc++                 MAX xB | MAX xB | SUM xB | SUM xB
 32    LOAD *tsc++                                MAX xB | MAX xB
 33    INDEX xC                                   SUM xN | SUM xN
 34    LOAD xC                                    SUM xN | SUM xN
 35    STORE xJ (from u3)                         SUM xEv | SUM xEv
 36    STORE xB (from u3)                         SUM xEv | SUM xEv
 37                                               SUM xC | SUM xC
 38                                               MAX xC | MAX xC | SUM xC | SUM xC
 39                                               MAX xC | MAX xC
 40                                               MAX xC | MAX xC
 41                                               MAX xC

Sidebar labels in the figure mark the outer-loop and inner-loop regions of the schedule.

Legend (shadings in the figure):
- [DSU] Indexation and scores loading
- [Execution Units] Setting the vector registers to -infinity (OR and SUM instructions are used due to the FUs availability)
- One-time loads for the special states in unit 0
- Delayed load/store scheme
- Indexation and scores loading
- Special state computation and dependency stores