Low-Power Vectorial VLIW Architecture for Maximum Parallelism Exploitation of Dynamic Programming Algorithms
Miguel Tairum Cruz
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Dr. Nuno Filipe Valentim Roma
Dr. Pedro Filipe Zeferino Tomás
Examination Committee
Chairperson: Dr. Nuno Cavaco Gomes Horta
Supervisor: Dr. Nuno Filipe Valentim Roma
Members of the Committee: Dr. João Paulo de Castro Canas Ferreira
October 2014
Acknowledgments
The work presented herein was partially supported by national funds through Fundação para a
Ciência e Tecnologia (FCT) under project Threads (ref. PTDC/EEA-ELC/117329/2010).
First and foremost, I would like to thank my parents and closest friends for their continued support
and motivation. I owe a huge debt of gratitude to my supervisors, Professors Nuno Roma and Pedro
Tomás, for their continued support, guidance and motivation. A very special thanks goes to my colleague
Nuno Neves from the INESC-ID Signal Processing Systems group, for without his work mine would
not have been possible. I would also like to thank my colleague João Luís Furtado from IST for his help
and insight into my work.
Abstract
Dynamic Programming algorithms are widely used in many areas to divide a complex problem into
several simpler, but mutually dependent, sub-problems. Typical approaches exploit data level parallelism
by relying on specialized vector instructions. However, the fully-parallelizable scheme is often
not compliant with the memory organization of general purpose processors, leading to sub-optimal
parallelism and worse performance. The proposed architecture exploits both data and instruction level
parallelism, by statically scheduling a bundle of instructions to several different vector execution units.
This achieves better performance than vector-only architectures, with lower hardware requirements
and thus lower power consumption. Accordingly, performance and energy efficiency metrics were used
to benchmark the proposed architecture against a dual-issue, low-power ARM Cortex-A9, a multiple-
issue, out-of-order, high-performance Intel Core i7 and a dedicated ASIP architecture. In a fair
comparison where all processors compute 128-bit vectors (or equivalent), the results show that the proposed
architecture can achieve up to 5.53x, 1.12x and 2.35x better performance-energy efficiency than the
ARM Cortex-A9, the Intel i7 and the dedicated ASIP, respectively, and a performance improvement of
up to 4.34x, 5.01x and 1.12x over the ARM, the dedicated ASIP and the Intel i7, respectively, for
the evaluated algorithm implementations.
Keywords
Dynamic Programming, Data Level Parallelism, Instruction Level Parallelism, VLIW, Low-power
Resumo
Os algoritmos de programação dinâmica são bastante usados em várias áreas, dividindo um problema
complexo em múltiplos sub-problemas mais simples, com várias dependências entre si. As abordagens
típicas exploram o paralelismo dos dados através de instruções vetoriais. No entanto, nos
processadores de uso geral, devido à organização da memória existente, não é possível paralelizar
completamente estes problemas eficientemente, resultando em piores desempenhos. A arquitetura
proposta explora tanto a paralelização dos dados como das instruções, agendando estaticamente um
conjunto de instruções para várias unidades de execução diferentes. Isto permite alcançar um melhor
desempenho que as arquiteturas vetoriais, reduzindo os requisitos de hardware e levando a um menor
consumo de energia. Foram utilizadas métricas de desempenho e eficiência energética a fim de referenciar
a arquitetura proposta contra um ARM Cortex-A9 (com duplo-agendamento de instruções e baixo
consumo), um Intel Core i7 (com agendamento múltiplo e alto desempenho) e uma arquitetura ASIP
dedicada. Através de uma comparação justa com vetores de 128 bits, os resultados obtidos mostram
que a arquitetura proposta consegue alcançar uma relação de desempenho e eficiência energética
até 5,53x, 1,12x e 2,35x melhor que o ARM Cortex-A9, o Intel i7 e o ASIP dedicado, respetivamente.
Em termos de desempenho, a arquitetura proposta atinge resultados 4,34x, 5,01x e 1,12x superiores
aos do ARM, do ASIP dedicado e do Intel i7, respetivamente, para as implementações dos algoritmos
avaliados.
Palavras Chave
Programação dinâmica, Paralelização de dados, Paralelização de instruções, VLIW, Baixo consumo
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Dynamic Programming 5
2.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3.A Profile Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3.B Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Comparison between profile HMMs and single alignment algorithms . . . . . . . . 15
2.3 Implementation of DP Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Data Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Instruction Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 State of the Art Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3.A Programmable Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3.B Non-Programmable Architectures . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Proposed VLIW Architecture 23
3.1 Architecture Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Register Banks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2 Functional Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Instruction Set Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 DP Algorithm Implementations 41
4.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Viterbi (Profile HMMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Prototyping and Evaluation 53
5.1 Hardware Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Reference State-of-the-art Architectures . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 Application Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2.A Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2.2.B Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3.A Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.3.B Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Performance and Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2 Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6 Conclusions and Future Work 69
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A Proposed Architecture Instruction Set 77
B Viterbi Pseudo-code 79
List of Figures
2.1 Example of the NW algorithm and its respective traceback phase . . . . . . . . . . . . . . 8
2.2 Example of the SW algorithm and its respective traceback phase . . . . . . . . . . . . . . 9
2.3 Example of a Consensus Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Example of the construction of a Profile HMM . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 HMM for the optimal gapped global alignment . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Profile HMM for unihit local alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Profile HMM for multihit local alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Trellis diagram for a sequence of three observations in the Viterbi algorithm . . . . . . . . 14
2.9 Example of DP Cell parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Comparison between a substitution score matrix with and without query profiling . . . . . 19
2.11 State of the art on SIMD implementations of the SW algorithm . . . . . . . . . . . . . . . 19
3.1 FU access comparison between Vector and VLIW architectures . . . . . . . . . . . . . . . 25
3.2 Example of two iterations of a DP algorithm in the proposed architecture . . . . . . . . . . 26
3.3 Execution Units with the respective independent register banks . . . . . . . . . . . . . . . 27
3.4 Register banks depicting the sniffing mechanism and the shared memory registers . . . . 28
3.5 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 4-Stage pipeline structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Processor DLP and ILP scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8 FU conflict control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.9 Instruction words for the bundle and the composing units . . . . . . . . . . . . . . . . . . 33
3.10 Register Window example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 AXI Interconnection scheme between the RAM and the local fast memory in the proposed
architecture core and the GPP in the PS. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.12 AXI Interconnection scheme between the instruction memory in the proposed architecture
core and the GPP in the PS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.13 Interface scheme for the proposed architecture core . . . . . . . . . . . . . . . . . . . . . 39
4.1 SW processing scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Inner loop operations for the SW algorithm in the proposed architecture . . . . . . . . . . 44
4.3 Inner loop and outer loop operations for the SW algorithm in the proposed architecture . . 45
4.4 Critical section example for the SW algorithm implementation in the proposed architecture 46
4.5 Comparison of example profiles for the HMMER platform and the proposed architecture . 48
4.6 Inner loop operations for the Viterbi algorithm in the proposed architecture . . . . . . . . . 49
4.7 Outer loop operations of the Viterbi algorithm in the proposed architecture . . . . . . . . . 50
4.8 Critical section example for the Viterbi algorithm implementation in the proposed architecture 51
5.1 Hardware scalability of the proposed architecture . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Stripped pattern processing scheme and correspondent dependencies . . . . . . . . . . . 57
5.3 Comparison of the average number of clock cycles for the SW algorithm implementation . 61
5.4 Performance evaluation results for the SW algorithm implementation . . . . . . . . . . . . 62
5.5 Comparison of the average number of clock cycles for the Viterbi algorithm implementation 63
5.6 Performance evaluation results for the Viterbi algorithm implementation . . . . . . . . . . 64
5.7 Performance and energy evaluation results obtained for the SW algorithm implementation 65
5.8 Performance and energy evaluation results obtained for the Viterbi algorithm implementation 66
A.1 Full implemented instruction set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
B.1 Complete pseudo-code for the Viterbi implementation in the proposed architecture. . . . . 80
List of Tables
3.1 Abridged implemented instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1 Hardware resources, operating frequency and power estimation for the proposed archi-
tecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Average number of clock cycles for the SW algorithm when implemented in the considered
execution platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 Average number of clock cycles for the Viterbi algorithm when implemented in the consid-
ered execution platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Operation frequency and power estimation for all evaluation architectures . . . . . . . . . 65
List of Acronyms
AMBA Advanced Microcontroller Bus Architecture
ASIC Application-Specific Integrated Circuit
ASIP Application-Specific Instruction-set Processor
AVX Advanced Vector Extension
AXI Advanced eXtensible Interface
DLP Data Level Parallelism
DP Dynamic Programming
DSU Data Stream Unit
FPGA Field-Programmable Gate Array
FU Functional Unit
GPP General Purpose Processor
GPU Graphics Processing Unit
HMM Hidden Markov Model
ILP Instruction Level Parallelism
IPC Instructions Per Cycle
ISA Instruction Set Architecture
MIMD Multiple-Instruction Multiple-Data
NW Needleman-Wunsch
PE Processing Element
PL Programmable Logic
PS Processing System
SIMD Single-Instruction Multiple-Data
SOC System On Chip
SSE Streaming SIMD Extension
SW Smith-Waterman
TLP Thread Level Parallelism
VLIW Very Long Instruction Word
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Dynamic Programming (DP) is a common methodology for solving complex problems, by dividing
them into smaller sub-problems that are simpler to solve. DP is applied in a vast set of application
domains: in bioinformatics, for sequence alignment problems; in Hidden Markov Models (HMMs), to
find a sequence of hidden states in the model; and in parsing algorithms, to name a few.
Due to this problem partitioning, DP methods can exploit Data Level Parallelism (DLP) to concur-
rently compute several sub-problems and thus increase the performance of the algorithms, as long as
those sub-problems remain independent from one another. This parallelism paradigm is commonly exploited
through vectorial extensions to the instruction set, which are offered by most of today's General Purpose
Processor (GPP) architectures.
Over the years, processors also started to introduce Instruction Level Parallelism (ILP) with
pipelined and superscalar architectures. However, ILP extraction in DP algorithms represents a harder
optimization problem than DLP extraction, and thus other strategies were explored.
With the appearance of multi-processor architectures, a new type of parallelization gained the spot-
light: Thread Level Parallelism (TLP). With it, greater performance can be achieved by executing multiple
threads and/or programs at the same time, in a single or in multiple cooperating processors. The latter
configuration usually requires less energy and power per processing unit than a single high-end GPP.
This attribute is particularly desirable, since DP applications often require high performance in a
low-power environment.
More dedicated architectures, often implemented as Application-Specific Integrated Circuits (ASICs),
can also better meet the application requirements, when compared to GPPs. However, they are not
flexible enough to support algorithmic changes and are much more expensive. Implementations in
Field-Programmable Gate Arrays (FPGAs) (e.g. Application-Specific Instruction-set Processors (ASIPs))
arise as an intermediate solution, by representing a trade-off between the flexibility of the overall less
expensive GPPs and the higher performance of the ASICs, filling the architectural spectrum between
the two [1].
1.1 Motivation
DP applications tend to be very computationally demanding, involving large datasets and often requiring
high performance in a low-power environment. In bioinformatics, applications related to biological
sequence processing operate on sequence banks that grow larger at a very fast rate [2]. As an example,
the GenBank release from February 2013 contains 150×10⁹ base pairs from over 260,000 formally
described species [3]. The optimal solutions for most of these problems are obtained by applying DP
methods, such as the Smith-Waterman (SW) algorithm [4] or the Needleman-Wunsch (NW) algorithm [5],
for local and global sequence alignment, respectively. Given the implied computational demands and
large data sets, these algorithms often require long runtimes in GPPs, frequently leading to the
adoption of less precise, heuristic-based solutions, such as BLAST [6] and FASTA [7]. However, these
alternatives are less accurate and thus not optimal.
In the field of HMMs, the same paradigm is also observed. The Viterbi algorithm [8], a DP algorithm
used to find the most probable state sequence in an HMM, presents high computational
demands in GPPs. Therefore, heuristic implementations such as HMMER [9] are usually preferred.
The main problem with the current DP implementations in GPPs is that they are constrained by the
existing Instruction Set Architecture (ISA), limiting the range of optimization methods. Looking back at the
work done throughout the years on the SW sequence alignment algorithm, some algorithmic
changes were initially proposed by Gotoh [10] and later by Wozniak [11], who proposed a DLP scheme.
Further down the line, Farrar [12] and Rognes [13] improved on Wozniak's work, optimizing the DLP.
The most recent addition for GPPs was made by Rognes [14], introducing multi-core processing to the
algorithm.
Due to their implementations on GPPs, none of these works focused on ILP extraction which, when
accompanied by DLP, can provide an increase in both performance and energy efficiency. The configurability
of FPGAs can thus provide the perfect environment to envisage such an architecture. Furthermore, this
architecture would not be limited to an existing ISA, granting the potential to devise a processor much
more efficient than GPPs, based on different architectural paradigms (e.g., Very Long Instruction Word
(VLIW)), with added support for different families of DP algorithms.
1.2 Objectives
This thesis aims at the development of a novel programmable processor architecture to be implemented
on an FPGA (or as an ASIC). The processor should have an ISA particularly optimized to compute different
families of DP algorithms, with a focus on sequence alignment algorithms such as the SW and the Viterbi
algorithm for profile HMMs.
The objective is to design the architecture from scratch, exploiting both DLP and ILP to ensure maximum
performance with minimal power consumption, and to design its ISA so as to guarantee DP
compatibility together with high programmability (i.e., supporting general instructions, similarly to GPP
instruction sets), in order to target low-power systems like biochips (for bioinformatic DP algorithms) or
other embedded systems. It is also expected to tackle the bottlenecks of the currently available
implementations, not only avoiding them but finding better solutions to them, and to implement the
architecture.
A thorough performance and energy-efficiency evaluation will be conducted, comparing the proposed
architecture to a set of state-of-the-art architectures from different domains: i) a mobile, low-power GPP;
ii) a high-performance GPP; and iii) a programmable ASIP.
Hence, this work hopes to fill the gap between GPPs and dedicated architectures, by providing an
alternative that is more flexible than most existing dedicated implementations, has a low power
consumption, and is still faster than most GPP software implementations.
1.3 Main Contributions
Based on the evaluated sequence alignment DP algorithms, the SW [4] and Viterbi [8], and a careful
analysis of the state-of-the-art implementations of both algorithms (Farrar [12] and Rognes [13] for the
SW algorithm and HMMER [9] for the Viterbi algorithm), a novel VLIW architecture was developed,
providing a versatile and low-power platform for DP algorithms.
The ISA was designed to exploit both DLP and ILP, not only to increase the performance
of the implemented algorithms, but also to reduce the hardware requirements and guarantee a better
hardware usage, leading to a low power consumption. Furthermore, the adoption of a VLIW architecture
allowed the addition of a special unit that seamlessly accesses the memory in parallel with the algorithm
computations, while fully exploiting DLP, thus eliminating the memory limitations often present in GPP
implementations of these types of algorithms, such as Wozniak's [11] SW implementation.
This culminated in a low power architecture with several independent execution units, working at an
operating frequency of 98.5 MHz, while still providing high performance computing for DP algorithms.
As a result of the developed research, a manuscript reporting the main contributions has already
been published at the HPCS 2014 international conference:
• Miguel Tairum Cruz, Pedro Tomás and Nuno Roma. Low-Power Vectorial VLIW Architecture for
Maximum Parallelism Exploitation of Dynamic Programming Algorithms, In International Confer-
ence on High Performance Computing & Simulation (HPCS 2014), pp. 88-95, Bologna, Italy, July
2014.
1.4 Dissertation Outline
This document is structured in six chapters. The current chapter, Chapter 1, introduces the developed
work. Chapter 2 reviews the DP paradigm, using existing DP algorithms as examples, as well
as their corresponding implementations in state-of-the-art architectures; it also presents the parallelism
paradigms that are widely used in DP implementations. Chapter 3 describes the proposed architecture,
starting by mapping the DP requirements and then fully depicting the complete resulting design.
Chapter 4 details the DP algorithm implementations in the proposed architecture, focusing on the
evaluated SW and Viterbi algorithms. Chapter 5 presents the prototyping of the proposed architecture,
as well as the evaluations conducted with it and with the reference state-of-the-art architectures; it
details the algorithm implementations for the remaining architectures and comments on the obtained
test results. Finally, Chapter 6 concludes the thesis, discussing the obtained results and providing an
analysis of the devised architecture, addressing its advantages and drawbacks, as well as the open
research directions that can be pursued on top of the proposed architecture.
2 Dynamic Programming
Contents
2.1 Dynamic Programming Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Needleman-Wunsch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Smith-Waterman Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Comparison between profile HMMs and single alignment algorithms . . . . . . . . 15
2.3 Implementation of DP Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Data Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Instruction Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 State of the Art Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 Dynamic Programming Algorithms
DP is an algorithmic methodology for solving complex problems, by dividing them into smaller sub-
problems that are simpler to solve. If these sub-problems are solvable and the optimal solution for
each sub-problem is found, the solution for the main problem can be realized through the sequence
of solutions of its sub-problems. This property is known as the optimal substructure property [15], and
problems that present it can be solved by DP. Another property that a problem must present to be solvable
by a DP approach is that the space of sub-problems must be "small", in the sense that a recursive algorithm
for the problem solves the same sub-problems over and over, rather than always generating new sub-
problems. Contrary to plain recursive solutions, DP takes advantage of these overlapping sub-problems by
solving each sub-problem only once and then storing its solution. If the solution is later required, it
can be looked up instead of recomputed. DP thus uses additional memory to save computation time,
resulting in a time-memory tradeoff, where the savings often allow an exponential-time solution
to be transformed into a polynomial-time one.
There are usually two equivalent DP approaches that can be implemented: top-down with memoization
and bottom-up. The first uses a recursive method, storing the intermediate result of each sub-problem
and returning the saved value when it is required again (memoization), thus saving further computations
at the given recursive level. The second orders the sub-problems by size and solves them smallest first.
Each sub-problem is solved only once, with the guarantee that all the prerequisite (and smaller)
sub-problems have already been solved. These two approaches yield algorithms with the same asymptotic
running times, with the bottom-up approach often having much better constant factors, since it has less
procedure-call overhead.
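As a short illustration of these two approaches, the classic Fibonacci recurrence (an example outside the scope of the alignment algorithms discussed here, used only for clarity) can be written both ways in Python:

```python
from functools import lru_cache

# Top-down with memoization: recursive, but each sub-problem is
# solved once and its result cached for later lookups.
@lru_cache(maxsize=None)
def fib_top_down(n: int) -> int:
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

# Bottom-up: solve sub-problems in increasing size order, so every
# prerequisite is already available when needed (no call overhead).
def fib_bottom_up(n: int) -> int:
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev
```

Both versions turn the exponential-time naive recursion into a linear-time computation, at the cost of the extra memory used by the cache (top-down) or by the carried partial results (bottom-up).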
DP algorithms are often represented in matrix form, where each cell corresponds to a sub-problem
depending on the adjacent cells (sub-problem dependencies). This results in a final matrix where the
last cell can only be computed after all the previous cells have been computed (optimal substructure
property). This representation allows for multiple independent cells to be processed in parallel, thus
increasing the performance.
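The matrix-form parallelism described above can be made concrete with a small scheduling sketch. Assuming the typical left, top and top-left cell dependencies, all cells on the same anti-diagonal are mutually independent and can be computed concurrently (the helper name `antidiagonal_schedule` is hypothetical, for illustration only):

```python
def antidiagonal_schedule(n: int, m: int):
    """Group the cells of an n-by-m DP matrix by anti-diagonal.

    With left, top and top-left dependencies, every cell of one
    anti-diagonal depends only on cells of the two previous
    anti-diagonals, so each group can be processed in parallel.
    """
    return [
        [(i, k - i) for i in range(max(0, k - m + 1), min(n, k + 1))]
        for k in range(n + m - 1)
    ]
```

For a 2x3 matrix this yields the wavefront [(0,0)], [(0,1),(1,0)], [(0,2),(1,1)], [(1,2)]: the amount of available parallelism grows and then shrinks as the wavefront sweeps the matrix.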
These algorithms are used in a wide variety of problems (matrix chain multiplication, sequence
alignment, optimal binary search trees, shortest paths, among others), as long as those problems present
optimal substructure and overlapping sub-problems. The following sections detail the specific DP problems
and respective algorithms that were studied and used throughout the work.
2.2 Sequence Alignment
Bioinformatic applications have an essential role in molecular biology and related fields. Sequence
alignment algorithms, like the NW [5] or the SW [4], use DP methods to search for similarities between
DNA or protein sequences within large databases (e.g., GenBank/EMBL/DDBJ [2]).
Depending on the type of alignment that is required, two DP-based algorithms can be used: the
SW, which outputs a local alignment; and the NW, which outputs a global alignment for any
two given sequences. A local alignment represents a region of greater similarity between the compared
sequences and is preferred when the query sequence (sequence to compare to a database) is smaller
than the database sequence. The global alignment method, on the other hand, spans the entire query
sequence in attempt to align every symbol in the sequence with the whole database sequence. This is
useful when comparing sequences of about the same size, that are known to be similar (DNA or protein
sequences with similar functions).
Besides the NW and the SW, there are also other sequence alignment algorithms based on HMMs.
HMMs are stochastic models of processes in which the future states depend only on the present state,
and not on the complete sequence of states that preceded it. In addition, the states are hidden from
the observer, who only has information about the observed outputs that were generated by the hidden
sequence of states.
In particular, the Viterbi algorithm [8] is a DP algorithm used to solve HMM problems, returning the
most probable state sequence that originated the observed sequence of outputs. Although belonging to a
different family of DP algorithms, Viterbi shares many properties with the sequence alignment algorithms
mentioned before (NW and SW) [16].
Although the alignment algorithms mentioned above produce optimal alignments (global or
local), there are other commonly used tools in the field based on faster heuristic approaches (instead of
DP approaches) with reduced complexity, implemented in GPPs. Some examples are the BLAST [6],
FASTA [7] and HMMER ([16], [9]) tools. However, these tools can only guarantee a good approximate
alignment, not always the best one, often requiring a later pass of a more complex DP algorithm (like
SW or Viterbi) for better results.
2.2.1 Needleman-Wunsch Algorithm
The NW algorithm [5] is a DP algorithm for computing the global alignment between a query sequence
and a database reference sequence. The resulting score represents the best alignment between the compared
sequences (a query sequence Q of size n and a database sequence D of size m) and is based on
a substitution score matrix Sm (which defines the scores given to substitution mutations), a gap penalty
α (a negative score given to an insertion or deletion mutation) and a recurrence relation
that computes the resulting score matrix H (see equation (2.2)). This algorithm takes O(nm) time to
complete.
H(i, 0) = α · i
H(0, j) = α · j        (2.1)

H(i, j) = max{ H(i−1, j−1) + Sm(qi, dj),  H(i−1, j) + α,  H(i, j−1) + α }        (2.2)
From the equations above, it can be seen that each cell in the resulting H matrix has three depen-
dencies in its computation: the cell at its left position (horizontal dependency); the cell at its top position
(vertical dependency); and the cell at its top-left position (diagonal dependency). The scores given by
the vertical and horizontal dependencies are penalized by the gap cost α and correspond to an insertion
or deletion in the alignment. The score given by the diagonal dependency is incremented by the substitution
score Sm(qi, dj) and corresponds to a match or mismatch in the alignment. The maximum of these
three values will be the final cell value. Figures 2.1 (a) and (b) show an example of the NW
algorithm for two small DNA sequences.
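A minimal sketch of the score-matrix computation in equations (2.1)-(2.2); the sequences, gap penalty and match/mismatch scores follow the example of figure 2.1 and are purely illustrative:

```python
# Minimal NW score-matrix fill (equations (2.1)-(2.2)).
# Gap penalty alpha = -1, match = 2, mismatch = -1, as in figure 2.1.
def nw_matrix(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):          # H(i,0) = alpha * i
        H[i][0] = alpha * i
    for j in range(m + 1):          # H(0,j) = alpha * j
        H[0][j] = alpha * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,    # diagonal: match/mismatch
                          H[i - 1][j] + alpha,    # vertical: gap
                          H[i][j - 1] + alpha)    # horizontal: gap
    return H

H = nw_matrix("ACC", "CACT")   # H[3][4] holds the global alignment score
```

With the sequences of figure 2.1 the final cell H[3][4] evaluates to 2, matching the alignment [ -ACC : CACT ] (one gap, two matches, one mismatch).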
(a) First iteration of the NW algorithm. (b) Last iteration of the NW algorithm. (c) First iteration of the traceback phase. (d) Last iteration of the traceback phase.
Figure 2.1: Example of the NW algorithm ((a) and (b)) and its respective traceback phase ((c) and (d)), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -ACC : CACT ].
After the H matrix is computed, the last cell entry (Hn,m) holds the maximum score among all
possible alignments. To obtain the actual alignment, a traceback algorithm starting at this maximum
score cell is performed (see figures 2.1 (c) and (d)). This traceback algorithm compares the three de-
pendencies of the cell currently being visited, to determine which one of them was the source of the current
cell result. The chosen cell then becomes part of the alignment sequence and the traceback algorithm
repeats this process for the chosen cell. When the first cell of the H matrix (H0,0) is reached, the
traceback ends.
Different alignment sequences can be found whenever there is more than one possible cell to choose
from during the traceback. This happens when, during the score computation in the NW algorithm, there
is more than one maximum result in the main recursion, i.e., the cell has more than one source.
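The traceback phase just described can be sketched as follows (a self-contained illustration that re-fills the H matrix with the same illustrative scores as figure 2.1; when a cell has more than one source, this sketch simply prefers the diagonal one):

```python
# NW traceback sketch: walk back from H(n,m), choosing at each step the
# dependency that produced the current cell, until H(0,0) is reached.
def nw_traceback(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    # Re-fill the score matrix (equations (2.1)-(2.2)).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        H[i][0] = alpha * i
    for j in range(m + 1):
        H[0][j] = alpha * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + alpha,
                          H[i][j - 1] + alpha)
    aligned_q, aligned_d = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and query[i - 1] == db[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + s:
            aligned_q.append(query[i - 1]); aligned_d.append(db[j - 1])
            i, j = i - 1, j - 1                      # diagonal source
        elif i > 0 and H[i][j] == H[i - 1][j] + alpha:
            aligned_q.append(query[i - 1]); aligned_d.append('-')
            i -= 1                                   # vertical source: gap in D
        else:
            aligned_q.append('-'); aligned_d.append(db[j - 1])
            j -= 1                                   # horizontal source: gap in Q
    return ''.join(reversed(aligned_q)), ''.join(reversed(aligned_d))
```

For the sequences of figure 2.1 this recovers the alignment [ -ACC : CACT ].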
The NW algorithm is rarely used when the sequences under comparison have different sizes, since
the resulting alignment would be dominated by gaps. This happens because the global alignment tries
to align whole sequences, while the local alignment only tries to align similar regions, thus performing
much better with different sequence sizes. Since different-sized sequences are commonly used, the NW
algorithm does not get as much exposure as the SW algorithm, resulting in fewer implementations for it.
2.2.2 Smith-Waterman Algorithm
The SW algorithm is a DP algorithm for computing the optimal local alignment score between a
query and a reference sequence. The resulting score represents the degree of similarity between the
sequences and, similarly to the NW algorithm, it is based on a substitution score matrix and a gap-
penalty function. The algorithm was proposed by Smith and Waterman [4] and was later improved by
Gotoh [10] to support affine gap penalties, having an O(nm) time complexity, where n and m are the
sizes of the query (Q) and reference (D) sequences, respectively.
Given a substitution score matrix Sm, a negative gap-open penalty α and a negative gap extension
penalty β, the score matrix H can be computed by the following recursive relations:
Hi,j = max { 0 ; Ei,j ; Fi,j ; Hi−1,j−1 + Sm(qi, dj) }        (2.3)

H(i, 0) = H(0, j) = 0
The terms Ei,j and Fi,j are defined in equations (2.4) and (2.5), respectively. Ei,j corresponds to the
scores ending with a gap in the reference sequence (horizontal dependency), while Fi,j corresponds
to the scores ending with a gap in the query sequence (vertical dependency). Accordingly, Hi,j repre-
sents the local alignment score involving the first i symbols of Q and the first j symbols of D (diagonal
dependency).
Ei,j = max { Ei,j−1 + β ; Hi,j−1 + α }        (2.4)

E(i, 0) = E(0, j) = 0

Fi,j = max { Fi−1,j + β ; Hi−1,j + α }        (2.5)

F(i, 0) = F(0, j) = 0
These relations are very similar to the NW algorithm. In fact, each cell still has the three dependen-
cies in its computation (horizontal, vertical and diagonal) with the horizontal and vertical dependencies
representing insertions or deletions in the alignment, and the diagonal dependency representing a match
or mismatch between the sequence symbols.
The only major difference is that the H cell values cannot go below zero. As a result, the
maximum cell value in the H matrix is not necessarily at the last position (Hn,m), as in the NW
algorithm. Figures 2.2 (a) and (b) show an example of the SW algorithm for two small DNA sequences.
(a) First iteration of the SW algorithm. (b) Last iteration of the SW algorithm. (c) First iteration of the traceback phase. (d) Last iteration of the traceback phase.
Figure 2.2: Example of the SW algorithm ((a) and (b)) and its respective traceback phase ((c) and (d)), taken from the applet available in [17]. Two sequences (ACC and CACT) are compared, with a gap penalty of -1 and a matching score of 2 (mismatch of -1) for all symbols. The resulting alignment sequence is [ -AC : CAC ].
The traceback phase (see figures 2.2 (c) and (d)) then starts at the H matrix cell with the
highest score value, and continues along the sources of each considered cell until it
reaches a zero-valued cell, instead of stopping only at the first position of the H matrix (H0,0), as in the
NW algorithm.
These two differences, in the score computation and in the traceback parts of the algorithm, result
in the most similar region between the two compared sequences, i.e., the local alignment.
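The affine-gap recursion of equations (2.3)-(2.5) can be sketched as follows (a minimal illustration; the gap-open/extend penalties and match/mismatch scores are illustrative assumptions, not values from the text):

```python
# SW local alignment score with affine gaps (equations (2.3)-(2.5)).
# alpha = gap-open penalty, beta = gap-extend penalty (both negative).
def sw_score(query, db, alpha=-2, beta=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # H(i,0) = H(0,j) = 0
    E = [[0] * (m + 1) for _ in range(n + 1)]   # gaps in the reference
    F = [[0] * (m + 1) for _ in range(n + 1)]   # gaps in the query
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + beta, H[i][j - 1] + alpha)
            F[i][j] = max(F[i - 1][j] + beta, H[i - 1][j] + alpha)
            s = match if query[i - 1] == db[j - 1] else mismatch
            # H is clamped at zero, the defining difference from NW.
            H[i][j] = max(0, E[i][j], F[i][j], H[i - 1][j - 1] + s)
            best = max(best, H[i][j])           # maximum may be anywhere
    return best
```

Note the two SW-specific traits: the `max(0, ...)` clamp and the fact that the best score is tracked over the whole matrix rather than read from Hn,m.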
2.2.3 Hidden Markov Models
A Markov model is a stochastic model where the future states of a process depend only on
the present state, and not on the complete sequence of states that preceded it. This particular property
can be expressed by equation (2.6) for a given state sequence {w1, w2, ..., wn} [18].
P(w1, ..., wn) = ∏_{i=1}^{n} P(wi | wi−1)        (2.6)
A particular Markov model is the Hidden Markov Model (HMM) [19] where some (or all) states are
hidden from the observer. In a HMM, the observer has only the information regarding the sequence of
outputs that were generated by a hidden sequence of states.
An alternative mathematical expression for the HMM can be deduced by applying Bayes’ rule for a
given state sequence {w1, w2, ..., wn} and an output (observations) sequence {u1, u2, ..., un}:
P(w1, ..., wn | u1, ..., un) = P(u1, ..., un | w1, ..., wn) P(w1, ..., wn) / P(u1, ..., un)        (2.7)

P(u1, ..., un | w1, ..., wn) = ∏_{i=1}^{n} P(ui | wi)        (2.8)
where P(w1, ..., wn) is the probability of a given state sequence, P(u1, ..., un) is the prior probability of
seeing a particular sequence of outputs, P(u1, ..., un | w1, ..., wn) is the probability of observing the outputs
for a particular state sequence, and P(w1, ..., wn | u1, ..., un) is the probability of the state sequence given
the observed outputs, which is the quantity the HMM aims to find.
Two different tasks, with different outputs, can be performed on HMMs: decoding and generation.
The first outputs the path of states that is most likely to have generated a given output sequence, together
with its corresponding probability. The second outputs the likelihood of a given sequence being
generated by the model. The decoding task is computed by the Viterbi algorithm [8], which computes the
most probable state to generate each new output observation, for all the available states. The generation
task is computed by a similar algorithm, the Forward algorithm, which calculates a progressive sum of
the probabilities of all previous state paths for each new observation, resulting in a final probability
that is the sum of the final probabilities of all states.
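The Forward recursion just described can be sketched as follows (a hedged illustration: like Viterbi, but summing over predecessor states instead of taking the maximum; the parameter layout follows the conventions listed later for Viterbi's inputs, and any concrete values are illustrative):

```python
# Forward algorithm sketch: total likelihood of an observation sequence.
# pi[k] = initial probability of state k; T[x][k] = transition x -> k;
# Q[y][k] = probability of emitting output y from state k.
def forward(obs, pi, T, Q):
    states = range(len(pi))
    v = [pi[k] * Q[obs[0]][k] for k in states]   # first observation
    for y in obs[1:]:
        # Progressive sum over all previous state paths (not a max).
        v = [Q[y][k] * sum(T[x][k] * v[x] for x in states) for k in states]
    return sum(v)                                 # sum over all final states
```

For a fully uniform 2-state model every 2-symbol sequence has likelihood 0.25, which is a quick sanity check of the recursion.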
Although the above description of HMMs refers to a single alignment process (one sequence
against another), the algorithms mentioned above are used in real applications for searching similar
sequences in a database, and thus require a method to search and compare a group of sequences
against a database (instead of only one). This is achieved by creating alignment profiles, which
highlight the common features of a family's sequences and effectively model an entire sequence family (see
figure 2.3). These profiles are usually generated by an initial multiple alignment, followed by a proba-
bilistic breakdown of the elements present in each position.
With alignment profiles, a query can now be compared against a family of sequences (a profile), thus
greatly reducing the computational cost. Furthermore, a profile gives a more accurate representation of
the defining characteristics of a family, by weighing the elements in proportion to their actual frequency
(and thus importance) in the underlying family.
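A minimal sketch of how such a profile can be derived from a multiple alignment (the counting scheme is illustrative; the input rows are the aligned sequences shown in figure 2.3):

```python
# Build a per-column symbol-count profile and its consensus line from a
# multiple alignment (the sequences of figure 2.3).
alignment = ["ATCCAGCT", "GGGCAACT", "ATGGATCT",
             "AAGCAACC", "ATGCCATT", "ATGGCACT"]
symbols = "ACGT"

# profile[s][c] = how many sequences carry symbol s at column c.
profile = {s: [col.count(s) for col in zip(*alignment)] for s in symbols}

# Consensus: the most frequent symbol of each column.
consensus = "".join(max(symbols, key=col.count) for col in zip(*alignment))
```

Dividing each count by the number of sequences turns these counts into the relative frequencies used as emission probabilities in a Profile HMM.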
Alignment:
    A T C C A G C T
    G G G C A A C T
    A T G G A T C T
    A A G C A A C C
    A T G C C A T T
    A T G G C A C T

Profile:
    A  5 1 0 0 5 5 0 0
    C  0 0 1 4 2 0 6 1
    G  1 1 6 3 0 1 0 0
    T  1 5 0 0 0 1 1 6

Consensus: A T G C A A C T

Figure 2.3: Example of a Consensus Profile, derived from a multiple alignment of a family of similar sequences.
Since the Viterbi algorithm gives the most likely path of states to generate a given sequence and
its corresponding probability, it is suitable for computing sequence alignment problems. The Forward
algorithm, however, can only indicate the likelihood of the query sequence belonging to a family of se-
quences. For this reason, and given that both algorithms are very similar, our work will focus on the Viterbi
algorithm.
2.2.3.A Profile Hidden Markov Models
As previously mentioned, HMMs can be used to statistically model the distribution of sequence elements
in a profile, by taking the probability of each element in each position of the family's sequences as
the emission probability of each state. Thus, a Profile HMM can be used to compute the probability of
database sequences being generated by a given query, i.e., to align the query sequence to a database.
The construction of the Profile HMM starts by modeling a global alignment (with no gaps)
as a succession of consecutive match states, where each state corresponds to a column of the
profile sequences. Each match state is also accompanied by emission probabilities, since match states
emit the alignment symbols. These probabilities are derived from the relative frequencies of the symbols in
the family's sequences, at each column (see figure 2.4 (a)).
Then, insertion states are added to the model to represent gaps, i.e., portions of sequences that
do not match anything in the previous model (with only the match states). Since insertions can occur
at any point in the model, there is an Insert state for every pair of Match states (see figure 2.4 (b)).
(a) Example of a HMM composed solely of matching states, allowing for ungapped global alignment. (b) Example of a HMM that allows arbitrary insertions.
Figure 2.4: Example of the construction of a Profile HMM, starting with the match states (a) and with the addition of insert states (b), with the respective state connections.
Furthermore, in order to support the affine gap model, the insert states must also have a self-loop to allow
for long inserted regions. The probability of entering an Insert state for the first time and the probability of
staying in it can also be different and, since they are arbitrary, they are usually set equal to the background
probabilities of the profile.
Finally, Deletion states are added to the model. These states represent portions of the profile that
are not matched by the sequence, and thus do not emit any output symbol. Naturally, the Delete - Delete
state transition corresponds to gap-extend costs, thus completing the Profile HMM model [20] (see figure
2.5).
Figure 2.5: HMM for the optimal gapped global alignment (additional transitions from insert states to delete states, and vice-versa, are included for the sake of correctness, although these transitions are usually very improbable and have a negligible effect).
The profile HMMs can also be extended to support local alignment. This can be done by adding
two special flanking states that delimit the sub-region of the local alignment (States B and E), and two
self-looping flanking states (states N and C) that precede or follow the flanking states [19] (see figure
2.6). The flanking regions correspond to the unmatched regions of the aligning sequence and so, in
order to capture a local alignment, it is only required to add these two regions as self-looping states
with transitions from and to each match state. These new states also emit tokens with a probability
distribution, which can be set to the background random distribution of the profile.
Finally, in order to support multihit alignments, i.e., multiple local alignments, another special state
is required (state J), which connects the flanking states B and E. This new state has jump and loop
probabilities, in order to cover the unmatched region between two local alignments (see figure 2.7).
Figure 2.6: Profile HMM for unihit local alignment.
Figure 2.7: Profile HMM for multihit local alignment.
2.2.3.B Viterbi Algorithm
As previously stated, Viterbi’s algorithm [8] is a DP algorithm that finds the most likely sequence path
of hidden states in a HMM (or a Profile HMM), for a given sequence of observed outputs. The required
inputs for the algorithm are:
• State Space (S): Vector with all the possible states.
• Observation Space (O): Vector with all the observed outputs.
• Observation Sequence (Yi): Vector with the sequence of observed outputs.
• Initial Probabilities (πi): Vector of initial state probabilities.
• Transition probabilities (Ti,j): Matrix of transition probabilities from state i to state j.
• Emission probabilities (Qi,j): Matrix with the probabilities of observing the output i given the state j.
Given the inputs above, the algorithm can compute the most probable state sequence {x1, ..., xT }
that originated the observed outputs {y1, ..., yT }, by using the following recursive relations:
Vt,k = Qyt,k × πk                                    if t = 1
Vt,k = Qyt,k × max_{x∈S}(Tx,k × Vt−1,x)              if t ≠ 1        (2.9)
where Vt,k is the probability of the most likely hidden state sequence, responsible for the first t observa-
tions, having k as its final state.
From the relations in equation 2.9, it can be seen that during the first iteration of the algorithm (t = 1),
the probability of the first hidden state only depends on the initial probabilities and on the emission matrix,
since there are no previous states yet.
For the remaining iterations (t ≠ 1), the computation for every hidden state k (which will become
xt after the computation) requires, for every state x in S, the probability of its preceding
state xt−1, as well as the emission probability Q of observing the output given the state k, and the
transition probability T of passing from the preceding state xt−1 to the one currently being computed (k).
This yields a time complexity of O(TS2). Figure 2.8 shows an example of the Viterbi algorithm for a
HMM with 2 states.
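A minimal sketch of equation (2.9), including the back pointers needed for the traceback discussed below; T, Q and π follow the input conventions listed above (Tx,k from state x to state k, Qy,k for output y given state k), and any concrete model values are illustrative:

```python
# Viterbi sketch (equation (2.9)) with back pointers for the traceback.
def viterbi(obs, pi, T, Q):
    S = range(len(pi))
    V = [[Q[obs[0]][k] * pi[k] for k in S]]      # t = 1: initial * emission
    back = [[0] * len(pi)]
    for y in obs[1:]:                            # t != 1
        prev = V[-1]
        row, ptr = [], []
        for k in S:
            # Most probable predecessor of state k.
            x = max(S, key=lambda x: T[x][k] * prev[x])
            row.append(Q[y][k] * T[x][k] * prev[x])
            ptr.append(x)
        V.append(row)
        back.append(ptr)
    # Traceback: follow the stored back pointers from the best final state.
    k = max(S, key=lambda k: V[-1][k])
    path = [k]
    for t in range(len(obs) - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return list(reversed(path)), max(V[-1])
```

For a deterministic toy model (state 0 always emits output 0, state 1 always emits output 1, and the two states strictly alternate), the observation sequence [0, 1, 0] decodes to the path [0, 1, 0] with probability 1.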
Figure 2.8: Trellis diagram for a sequence of three observations in the Viterbi algorithm. Each state has the corresponding probability value. The T and Q values are not depicted. The red path corresponds to the most probable state sequence.
From a DP perspective, the main problem can be seen as finding the full sequence of hidden states,
while the sub-problems are the computations of the probabilities of every hidden
state along the sequence, since they only depend on their previous state.
Since the algorithm only stores information regarding the previous state, a back pointer to the pre-
ceding state should also be kept, in order to retrieve the final state sequence path at the end of the
computation (traceback).
2.2.4 Comparison between profile HMMs and single alignment algorithms
The Viterbi algorithm, used for profile HMMs, is very similar to the SW algorithm, presented previ-
ously. In fact, when using profile HMMs for solving sequence alignment problems, both algorithms have
the same recursive dependencies, with little differences. The following equations represent the applica-
tion of the Viterbi algorithm using a notation similar to the one that was adopted for the SW algorithm.
Mi,j = log eMj(xi) + max { Bi−1 + log tB,Mj ;
                           Mi−1,j−1 + log tMj−1,Mj ;
                           Ii−1,j−1 + log tIj−1,Mj ;
                           Di−1,j−1 + log tDj−1,Mj }        (2.10)

Di,j = max { Mi,j−1 + log tMj−1,Dj ; Di,j−1 + log tDj−1,Dj }        (2.11)

Ii,j = max { Mi−1,j + log tMj,Ij ; Ii−1,j + log tIj,Ij }        (2.12)
where the M, D and I state values correspond to the H, E and F values in the SW equations,
respectively. The B state corresponds to the special state that was previously explained, and is
omitted here for comparison purposes.
The above equations are represented in log-space, not only to eliminate the multiplications, but also
to provide better accuracy. The terms log eMj(xi) and log tXj,Yj correspond to the emission scores and
to the transition scores from Xj to Yj, respectively, which are pre-computed scores already present in the
profile. The transition values can be compared to the gap scores in the SW algorithm, but with an added
delay, since they depend on the position-specific transition states, requiring a previous look-up. Additionally,
the M emission values roughly correspond to the substitution score matrix in the SW algorithm, varying
according to the current Match state Mj and sequence symbol. Thus, they are already in a model-
specific profile and can be re-used between sequences, essentially acting like a Query-Specific Profile
for the SW algorithm.
Apart from the differences stated above, the M state requires the Ii−1,j−1 and Di−1,j−1 states,
instead of the I and D states computed in the current iteration, as happens for SW in equation
(2.3). This increases the amount of registers and memory required for the dependency values,
since all dependencies must now be stored to be used in a later iteration. This leads to a
reorganization of the algorithm's computations, with delayed loads and stores of the M state values,
where the dependencies required for the M state are only updated after the new M state values are
computed.
2.3 Implementation of DP Algorithms
When solving a problem using a DP-based approach, it is first necessary to decompose it into a set
of smaller sub-problems. This translates into computing the value of each cell in an n-dimensional matrix
by relying on the values of pre-computed adjacent cells. Given the usually large size of these matrices,
it is of the utmost importance to implement additional methods to parallelize and speed up DP algorithms.
Since the sub-problems are usually independent, DLP is often employed in order to maximize the
number of independent cells (sub-problems) computed in each iteration of the algorithms. Additionally,
ILP can also be used together with DLP in order to minimize the hardware impact brought by DLP, achieving
better performance with lower power usage. In current architectures, however, it is often not possible
to reconcile both parallelism paradigms to their full extent, given memory access constraints or other pre-
existing structural architectural designs.
The following sections will cover both the DLP and ILP paradigms, as well as some state-of-the-art
architectures, both programmable and dedicated.
2.3.1 Data Level Parallelism
As previously mentioned, DLP is widely exploited in DP algorithms, since most of the data to compute
is independent. This means that, at any given time during the computation, it is possible to operate on
different data elements simultaneously, i.e., to operate over a vector of elements. Given the matrix-like
representation of DP algorithms, this translates into a vector of cells. Using a 2D matrix as an example, and
given the three data dependencies (vertical, horizontal and diagonal) found in the sequence alignment
algorithms previously mentioned, it is possible to see that the only vector composed solely of
independent cells is the one formed by the cells along the anti-diagonal, as depicted in figure 2.9. Any other
vector composition would result in data hazards, as the dependencies required for a given cell would
not be calculated before that cell. Since the sub-problems in a DP algorithm usually have the same set
of operations applied to them, each data vector only requires one set of operations applied to it in
order to calculate all of its cells. Ideally, this results in a speedup equal to the number of
cells in each vector, in comparison to a single-data architecture.
Figure 2.9: Example of DP Cell parallelism.
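The anti-diagonal (wavefront) scheme just described can be sketched as follows, here for the NW recursion: every cell on diagonal d = i + j depends only on diagonals d−1 and d−2, so the whole diagonal can be updated as one vector (the scoring parameters are illustrative):

```python
import numpy as np

# Anti-diagonal (wavefront) NW fill: each diagonal is one vector update.
def nw_antidiagonal(query, db, alpha=-1, match=2, mismatch=-1):
    n, m = len(query), len(db)
    q = np.frombuffer(query.encode(), dtype=np.uint8)
    d_seq = np.frombuffer(db.encode(), dtype=np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int64)
    H[:, 0] = alpha * np.arange(n + 1)          # H(i,0) = alpha * i
    H[0, :] = alpha * np.arange(m + 1)          # H(0,j) = alpha * j
    for d in range(2, n + m + 1):
        # All cells (i, j) with i + j = d are mutually independent.
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        s = np.where(q[i - 1] == d_seq[j - 1], match, mismatch)
        H[i, j] = np.maximum(H[i - 1, j - 1] + s,
                             np.maximum(H[i - 1, j] + alpha,
                                        H[i, j - 1] + alpha))
    return H
```

The result matches the scalar cell-by-cell fill exactly; only the schedule changes, which is why this ordering maps well onto SIMD units.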
Most current processors already include vectorial instruction sets, like Streaming SIMD Extension
(SSE) or Advanced Vector Extension (AVX), making DLP an easy and viable option. For this reason,
and given the performance boost it gives, DLP is used in most GPP architectures for DP algorithms, as
well as in dedicated architectures.
2.3.2 Instruction Level Parallelism
When compared to DLP, ILP does not have as great an impact on DP algorithms. Since the different sub-
problems often require the same set of operations to be solved, there is no particular need for different
operations to be concurrently computed, as this may even lead to race conditions and eventually to data
or structural hazards. However, if there is a guarantee of no race conditions (and thus no hazards)
while computing different cells at different operation steps of the computation, ILP can potentially reduce
the hardware required by a DLP-only solution. In fact, vectorial DP algorithms do not make the
best use of the available hardware, since, at any given time, only one type of operation is being
computed on a given vector. With the addition of ILP it is possible to have, at any given time, different
subsets of cells computing different operations. This way, the length and number of functional units
in the architecture can be reduced, promoting a better hardware usage, since more functional units
will be working at the same time. The solution just described is the one implemented by VLIW
architectures, where a larger single instruction is issued, dispatching different operations to different
data elements in parallel; in these architectures, the work of extracting the ILP is the compiler's
responsibility. Although this solution seems attractive from a hardware requirements point of view,
VLIW architectures are not that common. Most current processors have, however, different ILP
mechanisms: instruction pipelining, out-of-order execution and branch prediction are just some of the
methods that are frequently present in most processors, and which are used alongside the DLP
extensions to maximize the performance of DP algorithms.
2.3.3 State of the Art Architectures
Hardware architectures can be divided into programmable and non-programmable architectures.
Programmable architectures present greater flexibility when compared with non-programmable ones, since
they are easily adapted to new types of problems and algorithms. They are commonly found in GPPs,
which are used for a wide spectrum of applications.
Non-programmable architectures, on the other hand, are usually implemented in dedicated hardware,
such as ASICs or FPGAs, and are commonly used to solve a specific task or family of similar tasks. This
type of architecture is mainly designed for speed and optimization, often resulting in high performance
with low power consumption, but also with higher complexity and implementation costs than
programmable architectures.
In particular, FPGAs are regarded as a hardware alternative to GPPs and ASICs, balancing
the flexibility often found in GPPs with the performance often found in more dedicated architectures.
They allow reconfigurable designs with a shorter design cycle, although achieving neither the high perfor-
mance offered by an ASIC nor the programmability of the GPP. Still, when compared to the other
architectures, the FPGA can offer a better performance-programmability trade-off, depending on the
applications to be implemented.
Due to their large computational times, DP algorithms require fast implementations in order to keep
up with the growing size of the sequences being considered in sequence alignment problems. Although
the commonly used (but sub-optimal) tools for these types of problems (BLAST [6], FASTA [7],
HMMER [9]) are implemented in GPPs (due to their flexibility), many dedicated architectures have arisen,
bringing faster algorithm computations to the table.
The following sections give an overview of several programmable and non-programmable
architectures for the implementation of the sequence alignment DP algorithms that were presented in the
previous sections.
2.3.3.A Programmable Architectures
Vector architectures exploit data level parallelism by implementing high-level operations that work on
linear arrays of data instead of individual data items. The vector elements do not have dependencies
between them, ensuring that no data hazards occur.
Nowadays, most commercial processors have support for vector instructions, like the SSE or AVX ex-
tensions in Intel processors [21], containing dedicated registers and functional units for those particular
instructions. These instructions are classified as Single-Instruction Multiple-Data (SIMD) instructions.
Due to their ability to exploit parallelism, this type of architecture is often used for DP algorithm
implementations, which usually require a large quantity of parallelizable computations.
Smith-Waterman
By reviewing the implementation of the SW algorithm presented before, it is possible to observe that, for
the computation of the final score matrix H, the only cells that do not have dependencies between them
are the ones along the anti-diagonal (see figure 2.11(a)). This allows an inner-loop parallel processing
of vectors composed of the anti-diagonal values of the H matrix, and was first proposed by Wozniak
[11]. Although the loops are fully parallelizable, this parallelization scheme has the drawback of difficult
memory access patterns, introducing large overheads in data manipulation when implemented on GPPs.
Rognes and Seeberg [13] improved on Wozniak's work by pre-computing a query profile (figure 2.10
(a)) once for the entire database sequence. This query profile indexes a modified substitution score ma-
trix by the query sequence position and the database sequence symbol, instead of the original indexing by
the query sequence symbol and the database sequence symbol (figure 2.10 (b)). For a given database
symbol, the resulting scores for matching it against all the query sequence symbols are stored sequentially in
one column of the matrix, with the other columns corresponding to the other database symbols.
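The query-profile idea can be sketched as follows (an illustrative reconstruction: the 2/−1 match/mismatch scores mirror figure 2.10 and are assumptions, not values prescribed by [13]):

```python
# Query-profile construction in the spirit of Rognes and Seeberg [13]:
# re-index the substitution matrix by (database symbol, query position),
# so each database symbol selects one contiguous column of scores.
symbols = "ACGT"
subst = {a: {b: (2 if a == b else -1) for b in symbols} for a in symbols}

def query_profile(query):
    # profile[d][i] = score of matching database symbol d with query[i];
    # built once per query, then reused for the whole database scan.
    return {d: [subst[q][d] for q in query] for d in symbols}

prof = query_profile("ACC")   # prof["C"] is one contiguous score column
```

During the scan, each database symbol then costs a single lookup of a contiguous score vector, instead of one substitution-matrix lookup per query position.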
When implemented with Intel's AVX/SSE instructions, the vector elements are composed of cells
parallel to the query sequence, instead of cells along the anti-diagonals (see figure 2.11(b)). This vali-
dates the use of the query profile, but has the disadvantage of introducing data dependencies between
the cells of the vector. It also introduces conditional branches in the inner loop for the computation of the
F term (see equation (2.5)) when data dependencies occur. The SWAT optimization [22] of this procedure
tries to minimize the impact caused by these inter-vector dependencies. This optimization assumes that
the E and F terms are often equal to zero, hence not contributing to the score value H. In fact, it was
demonstrated that, as long as H is not larger than the threshold α + β (respectively, the gap-open and
gap-extension penalties), E and F will remain zero along the column and row of the matrix, eliminating
(a) Query-profiled substitution score matrix:

        C   A   C   T    (database sequence)
    A  -1   2  -1  -1
    C   2  -1   2  -1
    C   2  -1   2  -1
    (query sequence)

(b) Substitution score matrix:

        A   C   T   G    (database symbols)
    A   2  -1  -1  -1
    C  -1   2  -1  -1
    T  -1  -1   2  -1
    G  -1  -1  -1   2
    (query symbols)

Figure 2.10: Comparison between a substitution score matrix with and without query profiling for the DNA sequences [ACC] and [CACT]. Note that the depicted DNA sequences may be composed of 4 different symbols (A, C, T and G).
data hazards in the parallel computation of the vector elements. When this condition does not hold, data
dependencies may arise and the affected cells require a more time-consuming computation process.
Farrar [12] tackles this problem by also organizing the SIMD registers in parallel to the query se-
quence (as Rognes does), but accessing them in a striped pattern (see figure 2.11(c)). This modified
access pattern moves the conditional branches of the vertical dependencies to a lazy loop, executed
outside the inner loop of the algorithm. This way, the conditional branches only have to be taken into
account once for every database symbol. After the completion of the inner loop, a first pass is made to
check the values of F, for each of the query segments, against the values of H for the given database
symbol. A second pass - the lazy loop - is only needed when the values of F are greater than the values
of H.
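The striped layout can be sketched as follows (an illustrative reconstruction of the index mapping, with p standing for the SIMD vector width; segment sizing details in the real SSE implementation may differ):

```python
# Striped index layout in the spirit of Farrar [12]: a query of length n
# is split into p equal-length segments, and vector i holds position i of
# every segment, so vertical dependencies only surface between segments.
def striped_order(n, p):
    t = -(-n // p)                # segment length, ceil(n / p)
    return [[k * t + i for k in range(p) if k * t + i < n]
            for i in range(t)]

order = striped_order(8, 4)       # 4-wide vectors over an 8-symbol query
```

For an 8-symbol query and 4-wide vectors this yields the two vectors [0, 2, 4, 6] and [1, 3, 5, 7]: consecutive query positions land in consecutive vectors, which is what pushes the F-correction out of the inner loop.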
(a) Wozniak [11] (b) Rognes [13] (c) Farrar [12] (d) Rognes [14]
Figure 2.11: State of the art SIMD implementations of the SW algorithm (extracted from [23]).
More recently, Rognes [14] presented a modified implementation of the SW algorithm that exploits
the multithreading capabilities of multicore processors, by comparing several database sequences with a
single query sequence in parallel (see figure 2.11(d)). This implementation achieves higher speedups,
but diverges from the previous implementations in that it does not solve the same single-reference
single-query alignment problem, solving instead multi-reference single-query alignment problems.
The implementation in [23] also takes a different approach, optimizing Farrar's striped pat-
tern scheme by expanding the processor's ISA.
At this point, it is important to note that the NW algorithm can use the same SIMD architectures as
the SW algorithm presented above. This is possible since both algorithms are based on very similar
recursions. In fact, as stated in a previous section, the only difference in the main recursion is that
the computed scores in the SW algorithm cannot go below 0, unlike in the NW algorithm. Some
Multiple-Instruction Multiple-Data (MIMD) architectures were also exploited in [24] and [25].
Viterbi
The Viterbi algorithm also takes advantage of SIMD architectures, since the involved procedure computes
multiple cells (states) that only depend on the previous iteration. Since it shares many similarities with the
SW and NW algorithms, most of their optimizations can also be used for this algorithm. In fact, the
commonly used tool for HMM problems, HMMER [9], uses the SSE instruction set extension and adopts
Farrar's striped pattern [12].
HMMER mainly uses the multihit model of the profile HMM (see figure 2.7), corresponding to a
local alignment, just like the SW algorithm. As a result of applying Farrar's method, the only major
differences to the SW algorithm lie in the treatment of the Delete (D) states. In the Viterbi algorithm, these
D state values correspond to a dependency in a previous column, while in SW they correspond to a
dependency in the same column (vertical dependency). This requires more memory to store all the
additional D values, but comes with the advantage of a simplified lazy loop, since the HMM topology
allows the lazy-F loop to correct only the D values of the current iteration, given that both the Match (M) and
Insert (I) state values are computed with the D values of a previous iteration.
One particularity of the main recursion of the Viterbi algorithm that has not yet been addressed in this overview is the fact that it uses multiplication operations (see equation (2.9)). Since the probability values used in the computations are below 1, and given the large number of computations, this can result in very small numbers and, in heavier computations, it can even result in numerical underflow. To guarantee better accuracy, the probabilities used in the Viterbi algorithm are therefore usually converted to logarithmic probabilities. This also replaces the multiplications by sums, resulting in a more efficient computation, since sum operations are frequently much faster to compute than multiplications (as seen in equations (2.10), (2.11) and (2.12)). The Viterbi implementation in the HMMER tool [9] uses this transformation, as do the majority of other implementations of the algorithm, such as Intel's evaluation in [26].
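As a minimal sketch of this log-space transformation (the toy two-state HMM and its probability tables below are hypothetical, chosen only to keep the example self-contained):

```python
import math

# Toy HMM (hypothetical values, for illustration only): all probabilities are
# converted to log space once, so the recursion uses sums instead of products.
states = ["A", "B"]
log_trans = {("A", "A"): math.log(0.9), ("A", "B"): math.log(0.1),
             ("B", "A"): math.log(0.2), ("B", "B"): math.log(0.8)}
log_emit = {("A", "x"): math.log(0.7), ("A", "y"): math.log(0.3),
            ("B", "x"): math.log(0.4), ("B", "y"): math.log(0.6)}
log_init = {"A": math.log(0.5), "B": math.log(0.5)}

def viterbi_log(observations):
    """Log-space Viterbi: max over sums of log-probabilities (no underflow)."""
    prev = {s: log_init[s] + log_emit[(s, observations[0])] for s in states}
    for obs in observations[1:]:
        curr = {}
        for s in states:
            # The multiplication of the original recursion becomes a sum here.
            curr[s] = max(prev[p] + log_trans[(p, s)] for p in states) \
                      + log_emit[(s, obs)]
        prev = curr
    return max(prev.values())  # log-probability of the best state path

score = viterbi_log(["x", "y", "x"])
```

Since the logarithm is monotonic, taking the maximum over summed log-probabilities selects the same path as taking it over multiplied probabilities, while keeping the values in a numerically safe range.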
2.3.3.B Non-Programmable Architectures
Despite being very flexible, the previous GPP implementations can hardly exploit the hardware to its full potential, since they are limited to the processor's ISA. ASICs do not have this problem and can often achieve a much lower computational time ([23], [27]), although they still lack flexibility and come at a higher cost. As a result, more flexible solutions for the implementation of specialized architectures are often pursued, such as FPGAs, which can adapt to the targeted applications while still providing good performance, since they are reconfigurable ([28], [29]).
Most non-programmable dedicated architectures that have been proposed to implement DP algorithms are based on ASIC and FPGA implementations. Usually, they consist of a systolic array structure, composed of multiple Processing Elements (PEs), that performs the necessary computations for many cells in parallel. The scalability of these implementations is often associated with an increase in the number of PEs present in the array, resulting in very dense arrays with very high throughput.
Smith-Waterman
For the SW algorithm, the implementations usually parallelize the cell computations along the anti-diagonal, in order to avoid the stalls brought by data hazards. The implementation in [30] uses this scheme, while also exploring the optimization of the traceback phase and the evaluation of the n-best alignments of a given sequence pair. This parallelization scheme is also used in [31] for the NW algorithm, where the algorithm is implemented both on an FPGA and on a Graphics Processing Unit (GPU) for comparison. In [32], both the NW and SW algorithms are implemented, and a common architecture is used to measure the performance of both.
Viterbi
Some dedicated architectures have also been proposed to implement and optimize certain HMMER procedures that compute the Viterbi algorithm. The FPGA implementation in [33] presents a systolic array architecture with 4-stage pipelined PEs, performing multiple searches in parallel. A Viterbi implementation based on Rognes' inter-sequence SIMD parallelisation [14] is also proposed in [34], with an additional exploitation of cache locality for an even higher throughput.
The implementation in [29] presents a parallelization scheme based on a polyhedral model using different linear space-time mappings. This last architecture is composed of a linear array of PEs that also compute multiple instances of the Viterbi algorithm in parallel.
2.4 Summary
DP can be used to solve complex problems by partitioning them into simpler and smaller sub-problems. These sub-problems, often independent of one another, enable the exploitation of data parallelism, by computing several sub-problems at the same time. This greatly accelerates the execution of the algorithms that implement such methods, like the SW algorithm for sequence alignment problems or the Viterbi algorithm for HMM and Profile HMM problems.
Given the current support in most everyday processors, DLP can be easily exploited in most architectures through SIMD extensions, and thus most state-of-the-art architectures for DP algorithms make use of this paradigm. Additionally, it was seen that ILP can be used in conjunction with DLP, in order to reduce the hardware requirements and to maximize their utilization. However, the ideal solution would fall in the field of VLIW architectures, which lack broad compatibility with current compilers and are thus usually avoided. Other ILP mechanisms, like out-of-order execution or multiple issue, are instead present in most modern GPPs and, although not designed specifically for DP problems, can help boost the performance of those algorithms.
When looking for a compromise between DLP and ILP, the design of a programmable processor architecture with an optimized ISA, capable of computing different families of DP algorithms, would come close to the flexibility of a common GPP for this type of algorithm, while also maintaining a high level of optimization and, consequently, faster processing speeds. Furthermore, when implemented in an FPGA device, no restrictions would apply (besides the limitations of the FPGA hardware resources), permitting the use and study of different architectures for the processor, such as the VLIW paradigm, which is rarely pursued in common GPPs given the lack of software optimization and compilation support.
The architecture that will be proposed in this thesis aims at just that: to provide a middle ground for DP algorithm computation, where high performance can be achieved with a higher degree of programmability than usually found in high-performance GPP architectures. Furthermore, this architecture should be scalable enough to later accommodate new instructions and other dedicated/optimized PEs, in order to expand the supported families of DP algorithms while maintaining good performance.
3 Proposed VLIW Architecture
Contents
3.1 Architecture Requirements
3.2 Proposed Architecture
3.2.1 Register Banks
3.2.2 Functional Units
3.2.3 Memories
3.2.4 Instruction Set Architecture
3.3 Interface
3.4 Summary
3.1 Architecture Requirements
The proposed architecture targets the simultaneous exploitation of the DLP and ILP paradigms, in order to position itself as a faster solution than current DP-solving architectures and to support a broader range of algorithms.
As shown in the previous chapter, DP problems can be translated into an n-dimensional matrix, where each sub-problem corresponds to a cell in the matrix, with adjacent cells as prerequisite sub-problems. In a 2D matrix, this typically results in horizontal, vertical and diagonal data dependencies from the left, top and top-left cells, respectively, for each cell computation. Accordingly, to maximize the processing efficiency and to minimize the number of dependencies, cell computations should be performed in parallel along the anti-diagonal (see figure 2.9).
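As a minimal sketch of this traversal order, the following fragment fills a 2D DP matrix one anti-diagonal at a time; the `score` callback and its signature are hypothetical stand-ins for any SW/NW-style recursion:

```python
# Anti-diagonal (wavefront) traversal of a 2D DP matrix: every cell on the
# same anti-diagonal depends only on cells of the two previous diagonals
# (left, top, top-left), so all of its cells can be computed in parallel.
def wavefront_fill(rows, cols, score):
    """score(left, top, diag, i, j) computes one cell (hypothetical helper)."""
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for d in range(2, rows + cols + 1):          # anti-diagonal index i + j
        cells = [(i, d - i) for i in range(1, rows + 1) if 1 <= d - i <= cols]
        # The cells below are mutually independent: a vector unit (or several
        # VLIW execution units) could compute this whole list in one step.
        for i, j in cells:
            H[i][j] = score(H[i][j - 1], H[i - 1][j], H[i - 1][j - 1], i, j)
    return H
```

The inner list is exactly the set of cells a hardware wavefront would process concurrently; the sequential `for` over it merely emulates that parallel step in software.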
However, exploiting the DLP along the anti-diagonal brings two problems: a harder memory organization/access pattern (visible in Wozniak's [11] implementation of the SW algorithm) and larger hardware requirements. While the former can be solved by implementing specialized memory-access units to gather cell values from non-adjacent memory positions, the latter requires the consideration of a different type of parallelism. In fact, vector-only solutions will always result in low Functional Unit (FU) usage. For example, consider that vector processing is used to compute the value of N cells in parallel, which requires a total of M vector instructions. Assuming no inherent data dependencies, and that only one of these operations is a square root, a utilization of 1/M is expected for all N parallel square-root FUs. Naturally, it is possible to reduce the number of FUs (hardware requirements) by serializing the operation over the different vector elements. However, this solution trades performance for hardware requirements, and is hence not ideal.
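This back-of-the-envelope utilization argument can be made concrete (the parameter values below are arbitrary examples, not measurements):

```python
# Vector-only case: N parallel square-root FUs exist for the whole iteration
# of M vector instructions, but are busy only during the sqrt instruction(s),
# so their expected utilization is sqrt_ops / M, independently of N.
def sqrt_fu_utilization(n_lanes, m_instructions, sqrt_ops=1):
    busy_fu_cycles = sqrt_ops * n_lanes          # cycles where sqrt FUs work
    total_fu_cycles = m_instructions * n_lanes   # cycles the FUs are provisioned
    return busy_fu_cycles / total_fu_cycles

util = sqrt_fu_utilization(n_lanes=8, m_instructions=10)  # 1/10 = 0.1
```

Note that widening the vector (larger N) does not improve the ratio: it only multiplies both the provisioned and the busy FU-cycles, which is exactly why a different form of parallelism is needed.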
The alternative is to also explore ILP, by assigning different operations to be executed on different parts of the vector. This results in multiple independent units, each computing its own vector operation on a given part of the vector, which not only increases the potential for additional parallelism, but also reduces the hardware requirements and increases the utilization rate. This is the paradigm used by VLIW architectures, and it can be seen side by side with a vector-only approach, as illustrated in figure 3.1. Although in this figure it takes one more clock cycle for the VLIW architecture to compute the 2 instructions for all 4 elements, it must be taken into account that the example only depicts two instructions. In fact, the ILP introduced here will only have an impact on the initialization of the algorithm, resulting, for the stationary phase of the algorithm, in the same number of clock cycles as the vector-only approach. The use of ILP is also supported by the set of common steps usually included in a DP algorithm. Usually, these steps consist of dependency loads, followed by cell computations, and finalized by the storing of the results. Assigning these different steps of the algorithm to different cells in the matrix (along the anti-diagonal) validates the ILP (different instructions operating over different cells) while also maintaining data coherence, given the independence between the cells. The only control requirement is to guarantee that the cell dependencies are always computed in advance of the cells that require them, in order to avoid data hazards.
Figure 3.1: Comparison between a Vector architecture composed of 4 elements and an equivalent VLIW architecture with 4 units composed of 1 element each. Both examples compute a square root operation, followed by a sum operation. In the Vectorial approach, two clock cycles are required to compute the two instructions, which use 4 FUs each (colored FUs). The VLIW approach takes 3 clock cycles to compute the two instructions, but only requires a maximum of two FUs per instruction (colored FUs). This is achieved by delaying the two last units, in order to reduce the number of FUs.

To efficiently support DP algorithms and to simultaneously explore DLP and ILP, the proposed architecture must then comply with the following requisites: independent execution units to compute independent instructions in parallel, issued from an instruction bundle; and a Data Stream Unit (DSU) to access
the memory concurrently with the execution units (to reduce the latency brought by non-adjacent memory accesses). Each execution unit will then be assigned a different vector of cells and an independent register bank, in order to operate independently of the other units. This also enables the exploitation of memoization, a technique where the current algorithm iteration re-uses the results obtained in the previous iteration, which are stored in the register bank. This technique is especially useful in DP algorithms, where the sub-problems only depend on the results of previous sub-problems, which are represented by the results of the previous iteration (or series of iterations). It also reduces the number of required memory accesses and thus increases the performance. However, there is still the need to ensure the communication between the different execution units, either because a data dependency resides in a different unit, or simply because sharing values between units would greatly benefit an algorithm. Using the memory to share values would prove inefficient. Accordingly, this will be achieved with the addition of a small group of shared registers and with sniffing mechanisms between a small subset of registers in the register bank of each execution unit.
Finally, given the amount of data that is required to be loaded and stored from memory, the architecture will also include a RAM memory, as well as a smaller local memory to store constants that are often used during algorithm computations. The existence of these two memories can further help to reduce memory congestion, especially in memory-heavy algorithms, like Viterbi's.
3.2 Proposed Architecture
As previously mentioned, there are several ways to explore ILP alongside DLP. In the proposed architecture, static ILP is explored, since it requires less control hardware, thus achieving better energy efficiency. This is accomplished by issuing an instruction bundle composed of several different instructions, each operating over a vector of independent elements (DLP) in a different execution unit. This way, instead of using a single large vector computing the same instruction (as is typical in vector architectures), the architecture has several smaller vectors, each effectively computing a different instruction.
In DP algorithms, this corresponds to the parallel processing of cells that are in different steps of the algorithm, in order to maximize the parallelism and thus reduce the required hardware. This parallelism must be exploited cautiously, since it is prone to introduce data races. However, only two conditions are required to avoid them: all cells currently being processed must be independent; and if there are cells being processed in advance (computing a later instruction of the algorithm), there must never be dependencies on the cells that are still in earlier processing steps of the algorithm. Using the anti-diagonal parallelism as an example, this second condition is met by ensuring that the cells being processed at the bottom-left section of the anti-diagonal are in advance with respect to the cells at the top-right section, since the dependencies propagate from left to right and from top to bottom (see figure 3.2).
Figure 3.2: Example of two iterations of a DP algorithm with 3 instructions per iteration and 4 cells being processed along the anti-diagonal. Each cell is processed in a different execution unit with a different reference symbol, represented by the columns. Each group of 3 instructions corresponds to one cell computation, for a given query-reference pair. The dependencies between symbols are represented by the arrows, while the rows represent clock cycles. Unit 0 is the most advanced unit.
To compute each instruction that integrates the instruction bundle, the architecture provides independent execution units. Each of these units operates on a different vector of cells and has its own register bank to locally store all the intermediary results generated by the algorithms (useful for memoization in DP algorithms), as well as any other value or dependency required in the immediate computations, thus reducing memory access operations and improving the processing performance, while maintaining an organized data structure (see figure 3.3). However, there will be situations where an execution unit requires data values from a different unit (e.g., the execution units that are in advance regarding the computation steps of an algorithm will generate dependencies that are required by a different, delayed unit). Thus, a sniffing mechanism, operating on a small subset of registers in each register bank, is implemented. These special registers sniff registers in an adjacent unit, to ensure the commitment of the required dependencies, and access them as if they were in the same register bank, thus maintaining the independence of the register banks while keeping data coherence and avoiding unnecessary memory accesses.
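A toy software model of the sniffing mechanism (class and field names are hypothetical) might look like:

```python
# Toy model of register sniffing: writes to a sniffed register in one bank
# are mirrored into the adjacent unit's bank, so the neighbour reads the
# value as if it were local, with no memory access involved.
class RegisterBank:
    def __init__(self, size=28, sniffed=()):
        self.regs = [0] * size
        self.sniffed = set(sniffed)   # indices mirrored to the right neighbour
        self.right = None             # adjacent bank that sniffs this one

    def write(self, idx, value):
        self.regs[idx] = value
        if self.right is not None and idx in self.sniffed:
            self.right.regs[idx] = value  # mirror only to the adjacent unit

    def read(self, idx):
        return self.regs[idx]

banks = [RegisterBank(sniffed={0, 1}) for _ in range(4)]
for left, right in zip(banks, banks[1:]):
    left.right = right
banks[0].write(0, 42)   # unit 0 commits a boundary dependency; unit 1 sees it
```

Note that the mirror touches only the immediate neighbour, matching the left-to-right, nearest-neighbour dependency pattern described above.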
Figure 3.3: Execution units with the respective independent register banks.
In addition to sniffing, there is a second way to share data between execution units (including non-adjacent ones) without resorting to memory. This is especially important in DP algorithms, since there are often dependency values that are used to compute all cells during a given number of iterations, before being updated and the process repeated. Given the constant need to load these values from memory, and given the independence between execution units, this would lead to redundant memory accesses, as different execution units would have to fetch the same values from memory. This problem is solved by adding a set of shared memory registers to each execution unit. These registers can be accessed by all execution units and can thus be used to store dependencies that are required by several units, removing the redundant memory accesses (see figure 3.4).
Figure 3.4: Register banks with the sniffing mechanism and the additional shared memory registers in each execution unit (with the respective connection to the memory).

Although the register banks are mainly used to store dependencies between iterations, DP algorithms often still need to store some of their dependencies in memory, especially when they are required in a much later iteration of the algorithm. In order to minimize the impact of the resulting memory accesses on the processing performance, a DSU is used to perform the necessary memory loads and stores in parallel with the execution units. This way, while the execution units are computing the main steps of an algorithm, the DSU is simultaneously pre-storing or pre-loading cell values that will only be required at a later iteration. Here, contrary to common VLIW architectures, all the existing units (both the execution units and the DSU) can access the memory, requiring an access priority list to avoid conflicts. Since the DSU's main function is memory access, it has top priority over all the other units.
Figure 3.5: Proposed architecture.
To further ease the memory access delay problem, a local fast (scratchpad) memory is also included, which is used to store constant values required by several DP algorithms. These constant values are pre-fetched at the beginning of the computation by the DSU, and can only be accessed by the execution units (with an access priority list to decide between them).

Figure 3.6: 4-stage pipeline structure.
These specifications result in the architecture presented in figure 3.5. The architecture also presents a 4-stage pipeline: a FETCH stage, where the next instruction is loaded from the instruction memory; a DECODE stage, where the fetched instructions are decoded in all units; an EXECUTE stage, where the FUs and memory operate the instructions; and a WRITE-BACK stage, where the results are written to the register banks. The pipeline, illustrated in figure 3.6, also includes stalling and data forwarding mechanisms to prevent hazards and to minimize the number of stalls on the processor, respectively.
Figure 3.7: Processor scalability: this example doubles the number of processed cells by doubling the vector size from n to 2n (DLP scalability, to the left) and by doubling the number of available units from 4 to 8 (ILP scalability).
By considering the set of characteristics listed above, the proposed architecture can also be easily scaled in two distinct ways (see figure 3.7): by increasing the length of each execution unit, and thus the vector length (DLP); and by increasing the number of execution units, and thus the number of parallel instructions (ILP). The first solution mainly requires an increase of the vector size processed in the functional units, while the second requires an increase in the number of functional units. Both solutions can be applied together, in order to provide a better balance between the two parallelism paradigms.
3.2.1 Register Banks
Each execution unit has its own private register bank of 28 registers, as well as a small set of 4 shared memory registers, for a total of 32 registers (illustrated in figure 3.5). Although the presence of private registers in each execution unit results in reduced register access times, a better structural organization and thus better performance, it is advantageous to be able to share values between units without resorting to copying the value to a shared register. Specifically, using the 2D matrix that represents the processing pattern of many DP algorithms as an example, horizontal and diagonal dependencies between the cells at the edges of the execution units would require, in every iteration, values to be passed from the adjacent unit to the one that requires those dependency values. Given that these dependencies occur very frequently (every iteration), the delay caused by copying the values would be very significant. To circumvent this, the previously mentioned sniffing mechanism is used (see figures 3.4 and 3.5).
This mechanism affects a very small number of private registers (in the 2D example, only two registers in each register bank would require sniffing), and it consists of mirroring those registers to the adjacent execution unit's register bank. Accordingly, whenever an update is made to the registers being sniffed, that same update is reproduced in the adjacent execution unit. Given that the typical dependencies in DP algorithms follow a top-down and left-to-right pattern, the sniffing mechanism is only required from one unit to the one at its right, with the last unit (the one that computes the left-most cells) not requiring sniffing.
The existence of a sniffing mechanism does not exclude, however, the need for shared memory registers. These registers are mainly used by the DSU to communicate with the memory. This separates the parallel memory operations handled by the DSU from the intermediary results of the main algorithm computations, issued by the execution units, thus avoiding register access conflicts and data hazards. To further avoid conflicts, the sharing privileges between execution units only cover read accesses, with writing being exclusive to the register's owner unit and to the DSU. In case of a writing conflict between the DSU and one execution unit, a priority list is used, with the DSU having top priority. These memory registers also serve the purpose of reducing the number of memory accesses in situations where a dependency value, loaded by one execution unit, is required by other units. Instead of being retrieved multiple times from memory into multiple execution units, these dependencies can be loaded into only one unit and then used by all units or, if the dependency requires constant updating, it can even be shifted to the other units' memory registers, with the help of the DSU.
Figure 3.8: FU conflict control. During the first clock cycle, 4 execution units try to compute 3 sum operations and 1 comparison. Since there are only 2 FUs capable of performing sums, the 3rd unit holds its instruction and the previous pipeline stages are stalled. In the second clock cycle, the FUs are free to execute the 3rd unit's sum operation (the remaining execution units do not compute any instruction), finalizing all the instructions in the bundle and resuming the normal processor operation.

The proposed architecture also supports different word sizes, with multiple words being stored in each register whenever the word size is a sub-multiple of the maximum register width. Accordingly, if the word size is half the maximum value, each register stores two different words; if the word size is a quarter of the maximum, each register stores four different words, and so on. This design paradigm allows for different accuracy ranges, improving algorithm performance when higher accuracy is not required (more cells computed simultaneously, with the same hardware resources), while still supporting problems that require a higher level of precision.
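The sub-word storage scheme can be sketched as follows (a software model with an assumed 32-bit maximum register width; the lane-wise add mirrors what a vector FU would do in hardware):

```python
# Sub-word packing sketch: when the word size is a sub-multiple of the
# register width, several small words share one register and are operated
# on as independent vector lanes.
MAX_WIDTH = 32  # assumed maximum register width, in bits

def pack(words, word_size):
    """Pack several small words into one register-sized integer."""
    assert MAX_WIDTH % word_size == 0 and len(words) == MAX_WIDTH // word_size
    reg = 0
    for lane, w in enumerate(words):
        reg |= (w & ((1 << word_size) - 1)) << (lane * word_size)
    return reg

def lanewise_add(a, b, word_size):
    """Add two packed registers lane by lane (no carry across lanes)."""
    mask = (1 << word_size) - 1
    result = 0
    for lane in range(MAX_WIDTH // word_size):
        shift = lane * word_size
        s = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask  # wrap in lane
        result |= s << shift
    return result
```

Halving the word size doubles the number of lanes per register, which is precisely the accuracy-for-throughput trade-off described above.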
3.2.2 Functional Units
The FUs present in the architecture are shared between all the execution units. This is done in order to optimize the resource usage and to reduce the hardware requirements. However, this design option also requires a conflict control mechanism, to manage those situations where multiple execution units try to access more FUs than those available. Therefore, whenever an execution unit tries to access a busy FU, a stall is generated and the instruction is held until the required FU is free, taking such an instruction additional clock cycles to compute (see figure 3.8).
The order in which each execution unit is assigned an FU follows a priority list, where the units that process the left-most cells have a higher priority than those that process the right-most cells. The probability of conflicts could be reduced by adding more FUs, at the cost of an increase in both hardware and power requirements. An optimal solution uses the minimum number of FUs that does not cause conflicts, leading to a better resources/usage ratio. In order to tune the most suitable number of available FUs, the DP algorithms must be characterized both in terms of the amount and the type of operations. Considering that most DP algorithms use simple operations, like sums or subtractions, shifts and logic operations, multiple units of these types are required. By looking at the subset of algorithms considered in this work, the set of available FUs in the devised architecture is: Sum, Maximum, Shift, Logic (AND, OR, XOR) and Comparison units.
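As a toy illustration of this conflict control (the FU counts and unit numbering are hypothetical, mirroring the scenario of figure 3.8):

```python
from collections import Counter

# Toy model of FU conflict control: each cycle, units request an FU type in
# priority order; requests beyond the available FU count are held until the
# next cycle, generating a stall, as in figure 3.8.
AVAILABLE = {"SUM": 2, "CMP": 2}   # hypothetical FU counts

def schedule(requests):
    """Return per-cycle lists of served (unit, op) requests until all issue."""
    pending, cycles = list(requests), []
    while pending:
        used, served, held = Counter(), [], []
        for unit, op in pending:              # list is already priority-ordered
            if used[op] < AVAILABLE[op]:
                used[op] += 1
                served.append((unit, op))
            else:
                held.append((unit, op))       # FU busy: hold the instruction
        cycles.append(served)
        pending = held
    return cycles
```

Running the figure 3.8 scenario (units 0-2 requesting sums, unit 3 a comparison) yields two cycles: the first serves two sums and the comparison, the second serves the held third sum.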
Bearing in mind the adopted 4-stage pipeline, the registers whose values are required during the EXECUTE stage (where the FUs operate) may not have been updated yet, since the WRITE-BACK stage of the pipeline only occurs after the EXECUTE stage. When this occurs, a data forwarding mechanism pushes the yet-to-be-updated value from a later pipeline stage back to the entrance of the FUs, instead of using the current register value. This mechanism is also implemented for memory accesses, where a data vector that has not yet been written to memory is forwarded to the entrance of the FUs.
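A minimal sketch of the forwarding decision, assuming hypothetical `(dest_reg, value)` pairs for the instructions still in flight in the EXECUTE and WRITE-BACK stages:

```python
# Toy model of EXEC/WB forwarding: if a source register is the destination of
# an instruction still in a later pipeline stage, the in-flight value is used
# instead of the stale contents of the register file.
def read_operand(reg, regfile, exec_result, wb_result):
    """exec_result / wb_result are (dest_reg, value) pairs or None."""
    for in_flight in (exec_result, wb_result):   # nearest (youngest) stage wins
        if in_flight is not None and in_flight[0] == reg:
            return in_flight[1]
    return regfile[reg]
```

Checking the EXECUTE stage before WRITE-BACK ensures that the most recent pending write takes precedence when both stages target the same register.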
The FUs are also prepared to operate with different numbers of vector elements. As previously stated, the considered register structure supports multiple word sizes, storing more or fewer words depending on the defined word size. Therefore, if the vector is composed of only one large word, the FUs behave like scalar units, each operating over a single data element. If the vector is composed of several smaller words, the FUs behave like vector units, each operating over multiple data elements.
3.2.3 Memories
There are 3 different memories in the devised architecture: an instruction memory, which stores all the instructions to be computed by the processor; a dual-port RAM memory, which stores values required throughout the algorithm computations; and a local fast memory, which serves the purpose of storing constant values to be used during the algorithm computations.
The instruction memory consists of a read-only memory where the instructions to be computed by the processor are stored. A program counter controls the issuing of instructions, updating the memory address accordingly (either by incrementing or by branching), while the memory constantly provides a new instruction at every cycle. This occurs in the first stage of the pipeline, the FETCH stage. If a stall is generated by the processor in a later pipeline stage, the program counter stalls the current instruction, not updating the memory address until the stall is resolved.
The two remaining memories, the dual-port RAM memory and the local fast memory, have two independent ports for write-only and read-only operations. They serve different purposes: the former is a larger memory for storing large data sets, as well as intermediary values of the algorithm computations (that cannot be stored in the register banks); the latter is a smaller memory that only stores constant values used throughout the algorithm computations. Furthermore, the RAM memory is accessed by both the DSU and the execution units for write and read operations (with the DSU having a higher priority over the execution units), while the local fast memory can only be written by the DSU and only be read by the execution units. This way, during the algorithm computations, it is possible to have both the DSU loading values from the RAM memory to the memory registers, and the execution units computing their algorithm iterations by loading values from the local fast memory, thus minimizing the delay introduced by concurrent memory accesses, while also promoting a better data organization by separating the constant data values of the algorithms from the constantly changing intermediary results.
The data width of both the RAM and the local memory corresponds to the maximum register width, with multiple words being stored or loaded in one data vector if the word length is set to a sub-multiple of the maximum data width, as previously explained in the register bank section. These memories have an access latency of 2 clock cycles: one cycle to index the correct address, and another cycle to load the value at the previously indexed address. As will be seen, there are two instructions to compute both steps of a memory load, enabling the parallel usage of a memory load instruction between two units: one indexing an address and the other effectively loading a data vector, achieving a throughput of 1 clock cycle with a latency of 2 clock cycles when loading values from memory.
3.2.4 Instruction Set Architecture
The ISA of the proposed architecture is structured around a large bundled instruction, composed of several smaller instructions, each one computed in its respective unit. The instruction bundle is thus divided into several execution units and one DSU, as can be seen in figure 3.9(a). Each execution unit instruction is encoded with 32 bits (see figure 3.9(b)), while the DSU instruction has a variable width, depending on the number of execution units that are present (see figure 3.9(c)).
163 - 131 130 - 99 98 - 64 63 - 67 66 - 35 34 - 0
Unit n Unit n-1 … Unit 1 Unit 0 Data Stream
(32) (32) (32) (32) (32) (36)
(a) Instruction Bundle.
Bits:  31 | 30 | 29 - 25 | 24 | 23 - 19 | 18 | 17 - 13 | 12 - 6 | 5 - 0
Field: WE | Td | Rd      | Ta | Ra      | Tb | Rb      | Opcode | OpControl
Width: (1)| (1)| (5)     | (1)| (5)     | (1)| (5)     | (7)    | (6)
(b) Execution unit instruction.
Bits:  35      | 34         | 33 - 32   | 31      | 30  | 29 - 28 | 27 - 18 | 17 - 16 | 15     | 14    | 13 - 12 | 11 - 2 | 1 - 0
Field: ShiftEN | Left/Right | ShiftAddr | localWE | MWE | Unit    | Madd    | Radd    | AddrEN | regWE | Unit    | Madd   | Radd
Width: (1)     | (1)        | (2)       | (1)     | (1) | (2)     | (10)    | (2)     | (1)    | (1)   | (2)     | (10)   | (2)
       Shift Bits           |           Memory Write Bits                   |             Memory Load Bits
(c) DSU instruction for an architecture with 4 execution units.
Figure 3.9: Instruction words for the bundle and the composing units.
As can be seen in figure 3.9(b), the encoding of the execution unit's instructions comprises the
common register address fields Ra, Rb and Rd (which correspond to the first and second operand
addresses, and to the destination address, respectively), the WE field, which indicates when a register
should write a given value, and the operation encoding fields, namely the Opcode field, which selects
between the different types of instructions (arithmetic/logical, control/branch and memory access), and
the OpControl field, which identifies a certain modifier to the instructions (e.g., usage of immediate
values in arithmetic/logical operations, or of inequality comparisons in control operations). Three more
special control fields are also present in this encoding: Td, Ta and Tb. The Td field enables a broadcast
write (enabling a 3-way register write, relevant for DP algorithms that have up to 3 dependencies),
while bits Ta and Tb are used to specify which part of the data is to be loaded or written to registers,
for memory instructions that operate with divisible parts of data.
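As an illustration, the field layout of figure 3.9(b) can be packed and unpacked as follows (the Python helpers are ours, not part of the architecture's toolchain; only the field names and widths come from the figure):

```python
# Pack/unpack helpers for the 32-bit execution-unit word of figure 3.9(b).
# Field names and widths follow the figure; everything else is illustrative.
FIELDS = [              # (name, width), most-significant field first
    ("WE", 1), ("Td", 1), ("Rd", 5), ("Ta", 1), ("Ra", 5),
    ("Tb", 1), ("Rb", 5), ("Opcode", 7), ("OpControl", 6),
]

def encode(**values):
    """Pack named fields into one 32-bit instruction word."""
    word, bit = 0, 32
    for name, width in FIELDS:
        bit -= width
        value = values.get(name, 0)
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << bit
    return word

def decode(word):
    """Unpack a 32-bit instruction word into its named fields."""
    fields, bit = {}, 32
    for name, width in FIELDS:
        bit -= width
        fields[name] = (word >> bit) & ((1 << width) - 1)
    return fields

word = encode(WE=1, Td=1, Rd=3, Ra=7, Rb=2, Opcode=0x12)
fields = decode(word)
assert fields["Rd"] == 3 and fields["Opcode"] == 0x12 and fields["Td"] == 1
```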
The summarized implemented instruction set can be seen in Table 3.1 and the full instruction set in
Appendix A.1. The instruction set presents commonly used arithmetic and logic instructions
(e.g. addition, subtraction and logic OR, AND and XOR with their respective immediate counterparts)
as well as a special Maximum and Move (MAXMOV) instruction. This instruction performs the maximum
operation while moving a register in parallel, proving useful in DP algorithms that present dependencies
for the iteration after the next (e.g. diagonal dependencies in the SW algorithm). The SUM, SUB and MAX
instructions are the only ones that can be modified by the Td field, to enable the broadcast write, as
can be seen in Appendix A.1. The MAX and MAXMOV instructions can also be modified by the OpControl.
When the bit 5 of OpControl is active, both instructions perform gap register comparisons for the SW
algorithm. The MAX instruction also concatenates the result of the maximum operation with a different
register (e.g. concatenation with a sniffing register).
The memory instructions are responsible for the load and store operations on the RAM and local
fast memories. The load operations require a previous indexation, encoded by the instructions INDEX
MADDR (for the RAM memory) and INDEX SPADDR (for the local fast memory). The different sized loads
and stores are controlled by the Ta and Tb instruction fields (see Appendix A.1). The INDEX SPADDR
instruction can also be modified by the OpControl in order to perform a comparison between two values
to find the correct address to index (requiring a comparison FU). This is useful for alignment algorithms,
which present substitution scores dependent on the aligning symbols.
The control instructions consist solely in delayed branches, where the instruction following the branch
is still computed before branching. The instruction set is thus composed by a simple branch instruction
as well as common conditional branches (e.g. not equal, less than, greater than) and their immediate
counterparts.
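The delay-slot behaviour can be illustrated with a toy interpreter (the mini ISA below, with SET/BRD/HALT, is invented purely for this example and is not the architecture's ISA):

```python
# Illustrative model of a delayed branch: the instruction in the delay
# slot (immediately after the branch) always executes before the jump.

def run(program):
    trace, pc, branch_target = [], 0, None
    while True:
        op, *args = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if branch_target is not None:        # delay slot just executed,
            next_pc, branch_target = branch_target, None   # now jump
        if op == "BRD":                      # branch takes effect only
            branch_target = args[0]          # after one more instruction
        elif op == "HALT":
            return trace
        pc = next_pc

prog = [("BRD", 3), ("SET",), ("SET",), ("HALT",)]
# executes 0 (branch), then 1 (delay slot), then jumps to 3
assert run(prog) == [0, 1, 3]
```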
The DSU has a different instruction format than the execution units, depicted in figure 3.9(c).
This instruction format also exploits ILP, since it encodes three distinct and parallel operations: a
memory load/register write (bits 15 - 0); a memory write (bits 31 - 16); and a register shift (bits 35 - 32).
It is worth noting that the DSU instruction’s length will depend on the number of execution units that are
present in the architecture. In fact, the depicted figure 3.9(c) shows the case where 4 execution units
are present, including 2 bits required to address each one of the 4 units in both Unit fields. The memory
load operations are responsible for loading a value from memory (addressed by Madd) to one of the
memory registers (addressed by the two-bit field Radd, since there are only 4 memory registers per
execution unit) in one of the existent execution units (field Unit). Since the load instructions require a
preliminary indexation before the actual load, the bits AddrEN and regWE identify the index operation
and the load operation, respectively. In order to optimize the throughput of memory load instructions,
these two bits also enable simultaneous indexation and load operations. In such a situation, both the
Unit and Radd fields identify the register to store the data loaded from memory, while the Madd field
identifies the new memory address to be indexed for a later load operation. The memory write operations
are very similar to the load operations, with the exception of only requiring one enable flag (MWE) for
allowing writing access to the memory. There is also a localWE field that chooses between the RAM
memory and the local fast memory, since the DSU is the only unit that can also write to the local memory.
The register-shift operation is responsible for creating, along with a memory read or write, a register
Table 3.1: Abridged implemented instruction set. The full instruction set is depicted in Appendix A.1.

INSTRUCTION                                    MNEMONIC
Arithmetic and Logic Instructions
  Add, Subtraction                             SUM, SUB
  Maximum, Maximum and Move                    MAX, MAXMOV
  Comparison                                   CMP
  Arithmetic and Logic Right and Left Shift    SRA, SRL, SLA, SLL
  Logic OR, AND, XOR                           OR, AND, XOR
Memory Instructions
  Index memory address                         INDEX MADDR
  Load Byte, Half-word, Data                   LB, LH, LD
  Index local memory address                   INDEX SPADDR
  Local Memory Load                            SPAD LD
  Store Byte, Half-word, Data                  SB, SH, SD
Control Instructions
  Delayed Branch                               BRD
  Delayed Branch Equal, Not Equal              BEQD, BNED
  Delayed Branch Less Than, Less Than Equal    BLTD, BLTED
  Delayed Branch Greater Than                  BGTD
window mechanism integrating all the memory registers. This mechanism is depicted in figure 3.10 and
can reduce the impact of memory accesses, by pre-loading a data value that will be required in future
iterations of the computation or by pre-storing a value to be later used in future iterations. These memory
accesses are done in parallel to the computations, without overwriting any values that have yet to be
used, in order to prevent any data hazards. The registers to be shifted are chosen by the ShiftAddr
bit-mask, from one of the periphery execution units (unit 0 or unit n) to the other, with the direction being
chosen by the Left/Right field. An enable flag (ShiftEN) activates the shift operation.
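The shift described above can be sketched in a few lines (illustrative Python, assuming 4 memory registers per unit and helper names of our own): one chosen register is shifted from each unit to its neighbour while the left-most unit receives a fresh value from memory, matching the example of figure 3.10.

```python
# Register-window shift sketch: the register selected by shift_addr is
# shifted from each unit to its right-hand neighbour, while unit 0
# receives a new value loaded from memory.

def shift_window(units, shift_addr, new_value):
    """units: list of per-unit memory-register lists (unit 0 first).
    Returns the value shifted out of the last unit."""
    carried = new_value                      # value entering unit 0
    for regs in units:                       # left-to-right shift
        regs[shift_addr], carried = carried, regs[shift_addr]
    return carried                           # value leaving unit n

units = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
out = shift_window(units, shift_addr=2, new_value=99)
assert [u[2] for u in units] == [99, 2, 6, 10]   # third registers shifted
assert out == 14                                 # value leaving the window
```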
It is also important to notice that the shift operation implemented by the DSU operates independently
of the FUs. Moreover, given the higher priority of the DSU over the execution units, the
shift operation will always overwrite the target memory register if an execution unit tries to update
that register during the same clock cycle. For this reason, the memory registers should mainly be used
by execution units for accessing the stored values and not for updating them, as that is the DSU's main
functionality.
As previously mentioned, more execution units can easily be added to the architecture by
widening the instruction bundle and adding register banks to the new units. Furthermore, these execution
units can also be expanded to accommodate more words, by increasing their vector width. This
would also require some modifications in the FUs and in the memory accesses, in order to maintain
compatibility. The former scalability solution is better suited for algorithms that require many instructions
per iteration, while the latter is better used by algorithms that require fewer instructions and work with
higher volumes of data.
Figure 3.10: Register window example. In this example, the third register in each array is shifted to the register of the array on its right, while the left-most array is loaded with a new value from memory.
3.3 Interface
The proposed architecture is envisaged to act as an accelerator element highly interconnected with
an off-the-shelf GPP, where the non-regular and less complex parts of the algorithms (e.g. control and
management structures) will be executed. Accordingly, it was decided to extend the design of the proposed
architecture to its interface with the outside world. In particular, an interfacing structure is
envisaged that aims to be suited to implementations supported either in ASIC or FPGA technologies.
Naturally, a greater emphasis will be given to FPGA-based implementations, due to their greater availability
in the lab.
System On Chip (SOC) processing structures are usually formed by heterogeneous aggregates of
processing elements. In particular, they commonly include a set of GPP elements and several accel-
erating processing structures. The GPP elements typically comprehend a processor/microcontroller,
together with the cache, the RAM and all the corresponding interconnections and input/output periph-
eral ports. A popular example of such a SOC structure based on FPGA technology is the Xilinx Zynq
FPGA, comprehending a Processing System (PS) and Programmable Logic (PL) sections. The latter
section is frequently used to create custom designs and integrate them with the processor in the PS.
The proposed architecture is then particularly suited to be integrated as a core located in the PL section
of the FPGA.
This section presents an interfacing structure for the proposed VLIW processor based
on the Advanced Microcontroller Bus Architecture (AMBA), according to its Advanced eXtensible Interface
(AXI). These specifications are adopted by some FPGA vendors (e.g. Xilinx) and are considered to
be the de-facto standard for 32-bit embedded processors, due to being well documented and royalty free.
After analyzing the proposed architecture, previously presented in this chapter, three main structures
were identified as requiring communications with the GPP element: i) the instruction memory, ii) the
RAM memory and iii) the local fast memory. The GPP only requires write access to all these memories,
since they are only used by the VLIW core.
When integrated with the GPP, all the data to be computed in the proposed architecture core
is stored in the system's RAM, requiring it to be loaded to the memories inside the core. The GPP is
thus responsible for selecting and sending the correct data to the correct memories, depending on the algorithm
that is being processed. Ideally, the data is transferred in parallel to the algorithm computations, with a
controller unit monitoring the data transfer to guarantee coherence. However, the VLIW core memories
only have 2 access ports (a write-only and a load-only port, as previously detailed) and, with the exception
of the instruction memory, both ports are already used by the core, preventing parallel access by
the GPP, due to structural conflicts. To solve this, a multiplexer at the entrance of the write ports for the
RAM and local fast memory is required. This multiplexer thus chooses between the proposed core or the
GPP for writing access. The multiplexer selection is done by an additional control unit, located outside
the proposed core and inside the PL (see figure 3.11). This control unit must then be able to recognize
the current algorithm phase to switch the multiplexer accordingly and to enable the memory writes. This
can either be done by also sending the instructions that are being processed by the proposed core to
the control unit, or by using a feedback system, where the VLIW core communicates the current state of
the operations.
Figure 3.11: AXI interconnection scheme between the RAM and the local fast memory in the proposed architecture core and the GPP in the PS.
As opposed to these two memories, the instruction memory has only one port being used to load
the instructions to the different units, inside the proposed core. By connecting the remaining free port to
the GPP, we can seamlessly transfer the new instructions to the VLIW core at the same time that other
instructions are decoded in the core, without the need for a multiplexer (see figure 3.12). However, given
the difference between the data transfer frequency of the GPP to the VLIW core and the core’s operating
frequency, structural hazards can occur, and thus a control unit is required. Accordingly, this unit must
be able to monitor the memory, ensuring a correct data transfer. Therefore, the control unit requires the
knowledge of the current instruction being computed in the VLIW core (similarly to the control units for
the other two memories), as well as the control of the memory port signals, in order to appropriately
enable the writing access and choose the addresses for the data transfers.
Figure 3.12: AXI interconnection scheme between the instruction memory in the proposed architecture core and the GPP in the PS.
In order to connect the memories inside the VLIW core to the GPP, AXI controllers are required.
These units provide the interface to connect the memories to a central AXI Interconnect, which in turn
completes the communication bridge to the GPP in the PS. Figure 3.13 depicts the full interfacing
structure scheme.
The AXI follows a handshake process to transfer the address, control and data information,
where the master (GPP) asserts and holds a VALID signal when data is available to transfer, and the
slaves (memories inside the VLIW core) respond with a READY signal when they are able to accept the
data. When both signals are active, the transfer occurs. The AXI supports data bursts, which are necessary
for the memories in the VLIW core. The instruction memory requires multiple instructions to be
stored prior to the start of the algorithm, which must be sent in long bursts to reduce the stall time.
Similarly, the RAM and local fast memory will also require long bursts of data, in order to prolong the
algorithm computations without stalling the core, since their write ports can only be accessed either by
the core or the GPP at a given time.

Figure 3.13: Interface scheme for the proposed architecture core.

Using the SW algorithm as an example, and due to the large
length of the reference and query sequences, the RAM memory can only store a limited number of
sequences. In sequence alignment algorithms, it is common to perform multiple query alignments to
the same reference sequence. Therefore, every time that a fixed number of query sequences is aligned
to the reference sequence, a new set of queries must be sent from GPP. During this time, the VLIW
core will be stalled until all the new queries are stored in the RAM memory for the new alignments. By
maximizing the burst length of query sequences, the time that the core is stalled can be minimized, thus
increasing performance.
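The handshake described above can be modelled in a few lines (a behavioural sketch only, not the full AXI protocol; the per-cycle signal lists stand in for the actual wires):

```python
# Simplified VALID/READY handshake model: a beat is transferred only on
# cycles where both signals are high.

def transfer(valid_per_cycle, ready_per_cycle, data):
    """Return the cycle numbers at which each beat is accepted."""
    accepted, it = [], iter(data)
    for cycle, (valid, ready) in enumerate(zip(valid_per_cycle,
                                               ready_per_cycle)):
        if valid and ready:                 # handshake completes
            try:
                next(it)
                accepted.append(cycle)
            except StopIteration:           # burst already finished
                break
    return accepted

# Master holds VALID; slave is only READY on cycles 1, 2 and 4.
beats = transfer([1, 1, 1, 1, 1], [0, 1, 1, 0, 1], ["d0", "d1", "d2"])
assert beats == [1, 2, 4]
```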
An important problem that was not yet addressed is the number of input/output pins in the proposed
core. The number of pins could significantly reduce the operating frequency of the core, due to an
increase in routing complexity. In order to address this problem, it is necessary to know the width of the
data being transferred to and from the proposed core, and how to reduce those widths.
The instruction length for the VLIW core varies with the total number of units (execution units and
DSU) that are present. The encoding corresponding to each execution unit has a length of 32 bits,
and the DSU has a length varying with the number of execution units present. As an example, with 4
execution units and one DSU, the full instruction length would be 164 bits. Adding a 32-bit RAM
and local fast memory on top of that, the required total number of bits to be transferred to the core would
rise to 228 bits. This excludes the outputs of the proposed core that are necessary to send information
to the control units, as well as the algorithm results back to the GPP. In order to reduce these input
widths, the transferred data should be shortened and sent in more frequent and smaller bursts. For the
instructions, the adopted width should match the width of each unit. Therefore, each instruction sent
from the GPP would be divided by the number of units present in the core. For the previous example
with 4 execution units and 1 DSU, one 36-bit and four 32-bit data transfers (five data
transfers in total) would be required for the full instruction to be available in the proposed core.
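Such a split can be sketched as follows (illustrative Python; the field widths follow figure 3.9(a), while the least-significant-first transfer order is an assumption of ours):

```python
# Splitting one 164-bit instruction bundle into the five narrower
# transfers discussed above: one 36-bit Data Stream word plus four
# 32-bit execution-unit words.

WIDTHS = [36, 32, 32, 32, 32]          # DSU word first, then units 0..3

def split_bundle(bundle):
    """Slice a 164-bit integer into the five transfer words."""
    parts = []
    for width in WIDTHS:
        parts.append(bundle & ((1 << width) - 1))
        bundle >>= width
    return parts

def join_bundle(parts):
    """Reassemble the transfer words into the full bundle."""
    bundle, shift = 0, 0
    for width, part in zip(WIDTHS, parts):
        bundle |= part << shift
        shift += width
    return bundle

bundle = (0x12345678 << 36) | 0xABCDEF123          # arbitrary test value
parts = split_bundle(bundle)
assert len(parts) == 5 and join_bundle(parts) == bundle
```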
For the remaining memories, a similar solution can be used. Since the proposed core allows the word
size to be a multiple of a maximum data width, the inputs for these memories could have the same width
as the word size, with multiple word-sized transfers being required for the full data to be transferred. The
same can be applied to the solution output, which also has the same width as these memories.
Finally, the control signals sent by the VLIW core to the previously introduced control units should
only consist of small flags and thus should not require any additional modifications.
3.4 Summary
This chapter listed all the necessary requirements for the proposed architecture, and gave a detailed
description of all the architecture structures, including an interfacing structure proposal.
Exploiting both DLP and ILP, the resulting architecture consists of a VLIW architecture with multiple
execution units and a DSU. Each execution unit is responsible for the operation of an independent data
vector, while the DSU takes care of parallel memory accesses. In order to enable communication
between the execution units, shared register sets and sniffing mechanisms are implemented in the register
banks. Additionally, the existence of two distinct memories (RAM and local fast memory) helps reduce
the conflicts between the units when accessing the memory, reducing delays and promoting
a better structural organization. All these characteristics not only result in an optimized processor
for DP algorithms, but also in a programmable architecture with potential for broader compatibility.
The interfacing structure to connect the proposed architecture to a GPP is discussed in the last
section of the chapter. Although some techniques and considerations are taken for this interface, the
proposed interface was not implemented in our work, due to time constraints.
4 DP Algorithm Implementations

Contents
4.1 Smith-Waterman ............ 42
4.2 Viterbi (Profile HMMs) .... 46
4.3 Summary ................... 51
This chapter describes the two algorithm implementations made for the proposed architecture: the
SW and the Viterbi algorithms. It focuses on the processing scheme used by the algorithms, as well
as the necessary instructions to compute them in the proposed architecture, together with any special
mechanisms and considerations used.
The considered architecture for the implementations consists of 4 execution units and 1 DSU with
32-bit vectors. The SW implementation will use 8-bit words, processing 4 words (cells) per execution
unit, while the Viterbi implementation will use 16-bit words, processing 2 words (cells) per execution
unit.
4.1 Smith-Waterman
As explained in the second chapter, the SW algorithm computes the local alignment between a query
and a reference sequence. With the help of a substitution score matrix and gap penalty scores (affine
model) that indicate, respectively, the weight of matches/mismatches and insertions/deletions in the
alignment, the algorithm fills the resulting score matrix, from the upper left to the bottom right. This filling
operation respects the three dependencies that are present in the computations of every cell: the left,
top and top-left cell dependencies, resulting in parallelism extraction along the anti-diagonal, as it was
previously seen.
In addition to the anti-diagonal parallelism extraction, the algorithm will also follow a processing along
the query sequence (see figure 4.1). This processing scheme results in two distinct algorithm loops: an
inner loop, where a small reference sub-sequence is compared against the full query sequence; and an
outer loop, where a new reference sub-sequence is loaded, restarting the inner loop.
Although the processor is configurable to admit other setups, the described implementation uses, in
each of its 4 execution units, 32-bit vectors, each composed of 4 8-bit words, resulting in 16 8-bit cells
being simultaneously computed in all units.
Figure 4.1: SW processing scheme along the query sequence, extracting parallelism along the anti-diagonal.
Revisiting the SW main equations, it is possible to observe that, in order to compute the result for
cell (i, j), the negative gap values (β or α) are added to the vertical (cell (i−1, j), eq. (4.3)) and horizontal
(cell (i, j−1), eq. (4.2)) dependencies, and the substitution score is added to the diagonal dependency (cell
(i−1, j−1), eq. (4.1)).
Assuming that both sequences and the substitution matrix are already stored in memory, and the
gap and dependency values are already stored in the register banks, the required algorithmic steps in
the inner loop of the algorithm can be broken down into the following: an indexation and the respective loads
of the query symbols and substitution scores; the 3 dependency sums with the substitution and gap
scores; and two maximum evaluations in order to find the final cell result.
H_{i,j} = max( 0, E_{i,j}, F_{i,j}, H_{i−1,j−1} + Sm(q_i, d_j) )    (4.1)

E_{i,j} = max( E_{i,j−1} + β, H_{i,j−1} + α )    (4.2)

F_{i,j} = max( F_{i−1,j} + β, H_{i−1,j} + α )    (4.3)
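Equations (4.1) to (4.3) can be transcribed directly into a scalar cell update; the Python sketch below is ours (the architecture computes these steps with SUM, MAXMOV and MAX instructions), with alpha the gap-initialization and beta the gap-extension penalty, both negative:

```python
# Direct transcription of the SW affine-gap recurrences (4.1)-(4.3).
# H_diag/H_up/H_left are H(i-1,j-1), H(i-1,j), H(i,j-1); E_left and
# F_up are E(i,j-1) and F(i-1,j); score is Sm(q_i, d_j).

def sw_cell(H_diag, H_up, H_left, E_left, F_up, score, alpha, beta):
    E = max(E_left + beta, H_left + alpha)       # eq. (4.2)
    F = max(F_up + beta, H_up + alpha)           # eq. (4.3)
    H = max(0, E, F, H_diag + score)             # eq. (4.1)
    return H, E, F

# One cell with a match (score +2) and gap penalties alpha=-3, beta=-1:
H, E, F = sw_cell(H_diag=4, H_up=3, H_left=5, E_left=1,
                  F_up=0, score=2, alpha=-3, beta=-1)
assert (H, E, F) == (6, 2, 0)   # diagonal dependency wins: 4 + 2 = 6
```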
Due to its length, the query sequence is stored in the RAM, while the substitution score matrix is
stored in the local fast memory. Therefore, the query sequence memory accesses can be performed by
the DSU, allowing the execution units to load the substitution symbols in parallel, taking a total of two
clock cycles per iteration (see clock cycles 1 and 2 for Unit 0 in figure 4.2). Since the substitution
score load requires the current query symbol in order to load the correct value (by performing a comparison
between the query symbol and the reference symbol), the query symbol that is being loaded in parallel
will be used in the next iteration, with the current query symbol being already present in the register file.
Following the substitution score load, the 3 main sums can now be computed, since all the 3 depen-
dencies and gap scores are already stored in the register banks. These sums can be encoded to one
single sum instruction if the Td flag is activated, as it was seen in the architecture’s instruction set in the
previous chapter. Therefore, these 3 sums will only take 1 clock cycle to compute (see clock cycle 3 for
Unit 0 in figure 4.2).
Finally, the maximum operations will find the final result, which corresponds to the maximum value
of the three previous sum results. Two maximum instructions are necessary, thus taking 2 clock cycles
to finish (see clock cycles 4 and 5 for Unit 0 in figure 4.2). At the same time, the query symbols in each
execution unit (which are stored in the memory registers) are shifted to the adjacent unit, in order to be
reused during the next iteration. This can be done since the parallelism along the anti-diagonal and the
processing along the query sequence are exploited. The query symbol pre-loading during a previous
clock cycle, together with the symbol shifting, corresponds to a register window scheme, similar to the
one depicted in figure 3.10.
After the final cell value is computed, the inner loop restarts. The table in figure 4.2 details the inner
loop for an example with 4 execution units and 1 DSU.
The ILP is exploited in the SW implementation by having an offset of one instruction computation
between adjacent execution units. Due to the processing along the query sequence, the most advanced
Cycle | Data Stream Unit                                        | Unit 0                 | Unit 1                 | Unit 2                 | Unit 3
1     | Index crit. dep. (Unit 0)                               | INDEX SPADDR (i+3,j)   |                        |                        |
2     | Load crit. dep. (Unit 0) / Index crit. gap (Unit 0)     | SPAD LD                | INDEX SPADDR (i+2,j)   |                        |
3     | Load crit. gap (Unit 0) / Index query symbol (Unit 0)   | SUM (Td = 1)           | SPAD LD                | INDEX SPADDR (i+1,j)   |
4     | Store cell result (Unit 3) / Load query symbol (Unit 0) | MAXMOV                 | SUM (Td = 1)           | SPAD LD                | INDEX SPADDR (i,j)
5     | Store gap result (Unit 3) / Shift query symbols (u0 to u3) | MAX (OpControl(5) = 1) | MAXMOV              | SUM (Td = 1)           | SPAD LD
6     | …                                                       | …                      | MAX (OpControl(5) = 1) | MAXMOV                 | SUM (Td = 1)
7     | …                                                       |                        |                        | MAX (OpControl(5) = 1) | MAXMOV
8     | …                                                       |                        |                        |                        | MAX (OpControl(5) = 1)

Figure 4.2: Main iteration (inner loop) operations (with the respective clock cycles) for the SW algorithm in the proposed architecture. The example depicts the architecture with 4 execution units and 1 DSU.
unit will correspond to the unit that is aligning the latest query symbol. Also, given the anti-diagonal
parallelism and the dependency propagation from the top-left to the bottom-right, the most advanced unit
will also correspond to the left-most unit, as can be seen in figure 4.1. Accordingly, due to the anti-diagonal
parallelism and the number of existent units, this computational offset will not introduce any
conflicts, as was seen in the previous chapter (see figure 3.2).
The ILP exploitation greatly reduces the required number of FUs. From the table in figure 4.2, it is
possible to see that, with 4 execution units, there are never more than 1 SUM, 1 INDEX SPADDR, and 2
maximum instructions (MAXMOV and MAX) being computed during the same clock cycle. Therefore, the
SW algorithm implementation will only require 3 SUM/SUB units (since the sum instruction refers to a 3-
way broadcast sum), 2 MAXIMUM units and 1 COMPARISON unit (for the INDEX SPADDR instruction). If more
execution units were present, the required number of FUs would be higher, or it could remain the same
at the cost of adding stalls due to the rise of conflicts.
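The FU-sharing claim above can be checked with a small script (an illustrative model of ours; the stage list mirrors the Unit 0 column of figure 4.2, and each unit lags its left neighbour by one cycle):

```python
# With 4 units offset by one cycle through the 5-instruction inner loop,
# no instruction type is issued by more than the stated number of units
# in any steady-state cycle.

STAGES = ["INDEX SPADDR", "SPAD LD", "SUM", "MAXMOV", "MAX"]

def issued(cycle, n_units=4):
    """Instruction each unit issues at a given steady-state cycle."""
    # unit u lags unit u-1 by one cycle
    return [STAGES[(cycle - u) % len(STAGES)] for u in range(n_units)]

for c in range(len(STAGES)):
    ops = issued(c)
    assert ops.count("SUM") <= 1                      # 1 broadcast sum
    assert ops.count("INDEX SPADDR") <= 1             # 1 comparison FU
    assert ops.count("MAXMOV") + ops.count("MAX") <= 2  # 2 maximum FUs
```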
The outer loop of the SW algorithm consists only of the load of new reference symbols and occurs
every time the end of the query is reached by an execution unit. These symbols are stored in the memory
registers, and thus can be loaded in parallel by the DSU, similarly to the query symbols. The table in
figure 4.3 depicts the instructions in the DSU and execution units for the outer loop. As we can see from
figure 4.3, the outer loop will not introduce any additional clock cycles since it can be fully performed in
parallel by the DSU.
Due to the partitioning of the reference sequence, some problems will arise when solving cell de-
pendencies between execution units, specifically the horizontal and diagonal dependencies. Since the
processing scheme follows the query sequence, thus adopting a top-down anti-diagonal parallelism
approach, the computed cell values will be stored in the register banks and be used as vertical de-
pendencies during the next algorithm iteration. In the following iteration, the register with the vertical
dependency value is overwritten with the new values. The same happens for the diagonal and hori-
zontal dependencies. Inside the same unit, these dependencies are rapidly retrieved, since they are all
located in the same register bank. However, the dependencies between units require the use of sniffing
mechanisms. These mechanisms are used by a unit to access the dependency cells from the adjacent
execution unit to its left (unit in advance), in order to use them in the next iteration, as they were stored
in its own register bank. Contrary to the other dependencies, the diagonal dependency requires two
[The table in figure 4.3 lists, cycle by cycle, the DSU and execution-unit instructions across two algorithm iterations, including the outer-loop reference symbol indexations and loads performed by the DSU in parallel with the inner loop.]

Figure 4.3: Inner loop and outer loop operations (with the respective clock cycles) for the SW algorithm in the proposed architecture. The example depicts the architecture with 4 execution units and 1 DSU and 2 algorithm iterations. The outer loop for each execution unit is comprised of two DSU instructions.
registers in each unit. This is due to the fact that an anti-diagonal scheme is used, and therefore, the
computed cell value will only be used as a diagonal dependency two iterations after the current one (thus
being necessary to store the value to be used in the next iteration and two iterations after the current
iteration).
However, for the most advanced unit (which is aligning the left-most symbols of the reference sub-sequence),
the horizontal and diagonal dependencies cannot be retrieved from its adjacent unit, since
there is no adjacent unit in advance of it. These dependencies are computed in the previous reference
sub-sequence, and therefore should be stored in memory. In fact, as can be seen in the tables of
figures 4.2 and 4.3, the most delayed unit (which is aligning the right-most symbols of the reference sub-sequence)
will have its final cell values stored to memory by the DSU, in order to be retrieved later
on by the most advanced unit (with the help of the DSU).
These critical sections (see figure 4.4) only occur between the two execution units that are computing the edges of the reference sub-sequences, and do not introduce any additional clock cycles, since the memory loads and stores are done in parallel by the DSU. Therefore, their processing can be seen as a window register scheme, where the new reference symbols are loaded just before they are required. The sniffing mechanism cannot be applied in the critical sections due to the large length of the query sequence and the fact that the units are not adjacent. Since the processing follows the query sequence, the most delayed unit would need to store all of its computed cell values until the end of the query sequence, which is infeasible given the small number of available registers compared to the query length.
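Functionally, the window scheme amounts to a small producer/consumer buffer in memory. The sketch below is hypothetical (`store_edge` and `load_edge` are illustrative names); in the real design the DSU performs these accesses in parallel with the ALU work, so they cost no extra inner-loop cycles.

```python
class EdgeWindow:
    """Edge cells stored by the most delayed unit and loaded by the most
    advanced unit when it starts the next reference sub-sequence."""
    def __init__(self):
        self._mem = {}  # stands in for the scratchpad/RAM

    def store_edge(self, row, cell_value):
        self._mem[row] = cell_value  # issued as each row is finished

    def load_edge(self, row):
        return self._mem.pop(row)    # issued just before the row is needed
```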
Figure 4.4: Critical section between two sub-sequences of the reference sequence for an example case with 4 execution units. Each color/symbol represents a different iteration, with 4 iterations being depicted. The dependencies required by Unit 0 for the sub-sequence 1 must be retrieved from memory.

The affine gap model will also require a mechanism similar to the horizontal and vertical dependencies. Since this model takes into account two distinct gap values (an initialization value and an extension
value in case there are several gaps in a row), all execution units will have two registers in their register
bank with both gap values constantly stored. During the maximum operations of the SW algorithm, an auxiliary register records which dependency originated the maximum result. If it is a vertical or horizontal dependency, the auxiliary register compares the new result against its previously stored value to determine whether the gap is an extension or an initialization, updating its value accordingly. This way, during the sum operations of the following iteration, the correct gap value to be used is already stored in the register bank.
For the most advanced execution unit, the auxiliary register that indicates the type of gap for the horizontal dependencies belongs to the most delayed unit in a previous iteration. Given that the required value is computed in a former iteration of the algorithm, it is stored in memory to be later loaded by the most advanced execution unit, similarly to the horizontal dependency itself (see the DSU instructions in the table in figure 4.3). Also, just like the horizontal dependencies, adjacent units share these auxiliary gap registers by using the sniffing mechanism.
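This bookkeeping corresponds to Gotoh's affine-gap recurrence. In the hedged scalar sketch below (illustrative costs, not the thesis ISA), the E and F matrices play the role of the auxiliary registers that remember whether the next horizontal or vertical gap is an opening or an extension.

```python
def sw_affine(query, ref, match=2, mismatch=-1, gap_open=-3, gap_ext=-1):
    """Smith-Waterman with affine gaps (Gotoh's recurrence)."""
    n, m = len(query), len(ref)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # best local score
    E = [[0] * (m + 1) for _ in range(n + 1)]  # horizontal-gap state
    F = [[0] * (m + 1) for _ in range(n + 1)]  # vertical-gap state
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # E/F remember whether the running gap is being opened or
            # extended, mirroring the auxiliary gap registers in the text.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_ext)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_ext)
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

For instance, aligning "ACGT" against "ACXGT" pays a single gap-open penalty rather than two independent gap costs.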
4.2 Viterbi (Profile HMMs)
The Viterbi algorithm can find the most likely path of hidden states in an HMM for a given sequence of observed outputs. As previously mentioned, this algorithm is well suited for solving sequence alignment problems, with the help of profile HMMs. These HMMs take into account a family of similar sequences (a profile), thus enabling an alignment between a query sequence and the whole family (which can be seen as the reference sequence) at once. They will also require additional states not
present in normal HMMs, as depicted in figure 2.7. These special states are specific to the multihit local alignment mode, which achieves several local alignments between the compared sequences. This model was chosen to facilitate the comparison with the GPP implementation of the Viterbi algorithm presented in the next chapter, as well as to enable a comparison with the previously explained SW algorithm.
The considered Viterbi algorithm implementation will follow the same anti-diagonal parallelism and
processing scheme along the query sequence as the SW algorithm (see figure 4.1). It will have 16-bit
words, resulting in 8 cells being computed at every iteration, two per execution unit. This translates into
8 query and reference symbols being compared every iteration.
This implementation will require, in addition to the query sequence, a profile corresponding to the
reference sequence, with transition and emission values between all the existing states, all stored in
memory. The profile should follow the optimizations made by the HMMER [9] application, since it will
be used as a comparative study in the following chapter. This optimized profile has the transition and
emission values aligned to the algorithm’s access pattern, resulting in faster accesses for these values.
However, the access pattern implemented by the HMMER application consists of a striped pattern along the query sequence, based on Farrar's [12] implementation of the SW algorithm (see figure 4.5(a)). As a result, the profile must be modified to adapt to the anti-diagonal access pattern used in the proposed architecture.
Given that two symbols are being compared in each unit, the profile should then group the emission and transition scores in pairs, so that both scores can be retrieved by a given unit with a single load instruction. In fact, after analyzing the profile in the HMMER tool, it was observed that each combination of query-reference symbols only requires a total of two different emission/transition scores, instead of a different score for every state. This occurs due to score overlapping between different states. Furthermore, given that two cells are computed in each unit, this results in 2 load instructions per unit, for a total of 8 load instructions at every iteration. Figure 4.5(b) depicts the emission/transition score pattern that should be used for the implemented architecture. It is important to notice that these memory accesses will not have any influence on the algorithm throughput, since they can be performed exclusively by the DSU, in parallel with the main algorithm operations (see Appendix B.1).
Furthermore, both the emission and transition scores, ordered according to the query sequence, have new values retrieved for each new reference symbol. Given the potentially large size of the query sequences, the storage of all emission and transition scores cannot be accommodated in the register banks of the proposed architecture. Hence, similarly to the SW algorithm, only the smaller required subset of scores is available at any given instant, with the rest being stored in memory. Effectively, for every sequence symbol being computed in the proposed architecture, there is a different set of emission and transition scores, of which a small subset must be retrieved at every iteration.
(a) Profile example for the HMMER [9] platform (left), with the respective striped pattern (right). Each cell in the transition costs matrix has the 4 costs for all the 4 cells computed in parallel. For each cell, only two different transition scores are used for all the 7 transitory states (represented in grey). The last row represents the transition scores that are necessary for the lazy loops resulting from the striped pattern.

(b) Profile example (with random costs) for the proposed architecture (left), with the respective anti-diagonal pattern (right). Each unit computes 2 cells, which results in 4 transition/emission costs necessary for each unit. The 2 costs per cell cover all the transitory states. The cells in each unit have their correspondent in the diagonal pattern matched by the colored circles.

Figure 4.5: Comparison of example profiles for the HMMER platform [9] (computing 4 cells in parallel) and the proposed architecture (computing 8 cells in parallel). The way that the scores are ordered according to the used processing pattern is highlighted in both examples.

The operations required to compute a pair of cells in one execution unit are listed in figure 4.6. Both the sequence and query symbols, as well as the respective emission/transition scores required for any
given iteration, are stored in the respective register banks, prior to any of the cell operations being computed. Just like in the SW algorithm, the main operations for the three main states (M, I and D) consist of sums/subtractions and maximum operations. The same also applies to the special states B, E and J.
Figure 4.6: Main iteration (inner loop) operations for the Viterbi algorithm in the proposed architecture. Only the execution unit instructions are depicted. For the full pseudo-code consult Appendix B.1.

The dependencies required for the M state will differ from the SW algorithm, since they will now require both the diagonal dependencies of the I and D states, whereas, in the SW, only the M diagonal dependency and the current I and D states were required. This will result in a delayed load/store
scheme by using additional registers to store the previous and current scores, similar to the solution
used for the diagonal dependencies in the SW implementation. The remaining dependencies for the
I and D states are implemented in the same way as their SW counterpart (see the equations (2.10),
(2.11) and (2.12) in chapter 2).
The special states B, E and J are also required to be updated at every iteration, since the B state
dependency is used in the computation of the M state score, while depending itself on the J state. In
turn, the J state depends on the E state (see figure 2.7).
The E score corresponds to the current maximum score for the corresponding sequence symbols in any execution unit. Accordingly, it has to be constantly updated at every iteration and propagated to the computation of the states J and B. In turn, the J state compares the cost of moving from the updated E state against the cost of remaining in the J state. Finally, the B score takes into account the newly updated J score and compares its cost to the cost of moving from state N to state B. This special state N is computed in the outer loop, since it only depends on the current reference sequence symbol. The different loop and move transition costs are constant throughout the algorithm computations and thus are pre-stored in the register banks for faster access. These special states therefore introduce additional sum and maximum operations in the inner loop of the algorithm, as can be seen in figure 4.6.
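The E, J and B update chain described above can be sketched functionally as follows. This is a hedged sketch: the transition costs tEJ, tJJ, tNB and tJB are illustrative placeholders, not the actual profile values used by the architecture.

```python
def update_special_states(E, J, N, m_scores, tEJ=-1, tJJ=-1, tNB=-2, tJB=-2):
    """One iteration of the E -> J -> B special-state update chain."""
    E = max([E] + m_scores)    # E: running maximum over the new M scores
    J = max(J + tJJ, E + tEJ)  # J: remain in J vs. move from the updated E
    B = max(N + tNB, J + tJB)  # B: enter from N (outer loop) or from J
    return E, J, B
```

Because J feeds B, and B feeds the next iteration's M computation, this chain must be completed inside every inner-loop iteration.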
The remaining special state C is only updated in the outer-loop, just like the N state. While the N
state only depends on the current reference sequence symbol, the C state corresponds to the maximum
cell value in the respective execution unit. Therefore, after all execution units reach the end of the
query sequence and before they start computing a new sub-sequence of the reference sequence, the
maximum C score must be found between all units. This is possible by storing the C scores in the
shared registers, which makes them available for all units. The final C score is then stored in the first
execution unit and a new sub-sequence of symbols can start its computation (see figure 4.7).
Figure 4.7: Outer loop pseudo-code of the Viterbi algorithm in the proposed architecture. Only the execution unit instructions are depicted; in the outer loop initialization, the vectors are set to -infinity, and OR and SUM instructions are used to avoid adding FUs. For the full pseudo-code consult Appendix B.1.
The fact that all execution units reach the end of the query sequence before starting the alignment of a new sub-sequence introduces a small delay that was nonexistent in the SW implementation, since there will now be 3 initialization and finalization iterations, at the beginning and at the end of the query sequence, respectively, for every new sub-sequence of the reference sequence, as can be seen in figure 4.8. Additionally, the processing scheme will also have critical sections just like those seen in the SW algorithm (see figure 4.8). To solve them, a similar register window scheme is used, where the dependencies generated in the last execution unit are stored in memory after they are computed, and the dependencies required by the first unit are loaded before they are needed. This is also complemented by the delayed load/store scheme mentioned above.
The ILP that is adopted in the Viterbi implementation also differs from the one observed for the SW algorithm. Previously, each execution unit was 1 instruction in advance of its adjacent unit, resulting in the most advanced unit being 4 instructions in advance of the most delayed unit. In the Viterbi implementation, the delay between instructions only occurs in pairs, with the first two units being 1 instruction in advance of the last two units. This can be seen in figures 4.6, 4.7 and in Appendix B.1 (where the instructions appear in pairs). This was done in order to keep the same number of FUs that were used for the SW algorithm implementation. If an identical ILP extraction was used, the required number of FUs would be greater, but it would come with an increase
in performance.

Figure 4.8: Critical section of the Viterbi implementation between two sub-sequences of the reference sequence for an example case with 4 execution units. Each anti-diagonal/color/symbol represents a different iteration, with 7 iterations being depicted (four in sub-sequence 0 and three in sub-sequence 1). Sub-sequence 1 can only start its computations after all units finish their computations in sub-sequence 0. The dependencies required by Unit 0 for the sub-sequence 1 must be retrieved from memory and are represented by the red arrows.
The implementation of Viterbi's algorithm in the proposed architecture thus results in a stationary phase (inner loop) comprising an average of 23 instructions for an execution unit to complete an iteration of the algorithm, with two cells being updated simultaneously in each unit. After all units reach the end of the query sequence, the outer loop takes 18 cycles until a new sub-sequence starts being aligned.
4.3 Summary
This chapter described the implementations of the SW and Viterbi algorithms in the proposed architecture.
These algorithms compute the sequence alignment of a reference sequence against a query sequence, exploiting anti-diagonal parallelism. This processing scheme avoids any dependency between the cells being processed, thus increasing performance. The algorithms also take advantage of the mechanisms available in the proposed processor, such as the sniffing mechanism, the shared registers and the DSU, which parallelizes memory accesses.
Finally, the pseudo-code for both algorithms is also presented in this chapter.
5 Prototyping and Evaluation

Contents
5.1 Hardware Prototype . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . 55
  5.2.1 Reference State-of-the-art Architectures . . . . . . . . . . 55
  5.2.2 Application Benchmark . . . . . . . . . . . . . . . . . . . 56
  5.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . 59
5.3 Performance and Energy Efficiency . . . . . . . . . . . . . . . . 64
  5.3.1 Smith-Waterman . . . . . . . . . . . . . . . . . . . . . . . 65
  5.3.2 Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
This chapter details the reference state-of-the-art architectures that were used to evaluate the benchmarked algorithm applications: the SW and Viterbi algorithms. The implementation of these applications in the evaluated architectures is also detailed, together with the respective datasets.
A performance evaluation of the presented architectures is then conducted, followed by a performance and energy efficiency evaluation, completing the evaluation tests.
5.1 Hardware Prototype
The proposed architecture was prototyped in a Zynq SoC 7020 FPGA [35]. The implemented configuration issues, at each clock cycle, one bundle of instructions to four 32-bit execution units and one DSU, each using vectorial instructions to process multiple cells in parallel. This results in a 128-bit wide VLIW, allowing the computation of the 16 8-bit (for the SW algorithm) or 8 16-bit (for the Viterbi algorithm) cells in parallel that are used by the considered benchmark algorithms. The register banks and memories share the same width as the execution units, thus being composed of several cells in each register and memory block.
The synthesis and place-&-route of the architecture was performed by using the Xilinx ISE 14.4
tool. The reported amount of occupied resources are presented in table 5.1. As can be observed, the
proposed architecture uses 6% of the Slice Registers, 50% of the Slice LUTs, and 5% of the BRAMs
available on the Zynq SoC 7020, achieving a maximum post-route operating frequency of 98.5 MHz.
By using the Xilinx Power Estimation tool [36], we further estimated the power consumption of the
proposed processor. Assuming worst-case conditions for flip-flop and memory updates, it results in a
power consumption of 0.584 W.
Table 5.1: Hardware resources, operating frequency and power estimation for the proposed architecture.

Hardware Resources   Used    Total    Utilization
Slice Registers      7135    106400   6%
Slice LUTs           26725   53200    50%
36-bit Block RAMs    7       140      5%
Frequency            98.5 MHz
Power                0.584 W
The amount of used Slice LUTs corresponds to 50% of the total available LUTs, and thus will be the limiting factor of the processor scalability when increasing the number of execution units or the vector lengths. In fact, a scalability evaluation of the proposed architecture was performed, showcasing the hardware requirements. Such a study was conducted by changing the size of the vector in all execution units from 32 to 40 bits (increase in DLP), and by including an additional execution unit (increase in ILP). The increase of the vector width results in a 21.4% and 24.6% increase of slice registers and LUTs, respectively, while the addition of one execution unit results in an increase of 23.3% and 29.9% in slice registers and LUTs. The number of Block RAMs is only affected by changes of the vector width, increasing by one unit for every 16 bits added to the length of the vector.
Despite the increase in hardware resources, the estimated power drops to 0.504 W (13.7%) when the vector width increases to 40 bits, and to 0.563 W (4%) with the addition of an execution unit. This can be explained by the significant drop in the operating frequency in both situations. Figure 5.1 summarizes the hardware scalability results.
Configuration               Slice Registers   Slice LUTs   BRAMs   Frequency [MHz]   Power [W]
4 32-bit units (baseline)   7135              26725        7       98.5              0.584
4 40-bit units              8662              33300        7       66.4              0.504
5 32-bit units              8795              34726        7       74                0.563
FPGA Total                  106400            53200        140

Figure 5.1: Hardware scalability of the proposed architecture. The considered evaluations included increasing the width of the vectors, as well as increasing the number of execution units. The obtained hardware resources, operating frequency and power estimation are presented.
5.2 Performance Evaluation
This section details the reference state-of-the-art architectures and compares them to the proposed architecture, by using performance evaluation metrics. It also presents the application benchmarks and the respective datasets.
5.2.1 Reference State-of-the-art Architectures
The proposed architecture was evaluated against three distinct state-of-the-art architectures, representing three distinct domains: i) mobile and low-power GPPs; ii) high-performance GPPs; iii) programmable ASIPs.
ARM Cortex-A9: A low-power GPP running at an operating frequency of 533 MHz. It is integrated within the Zynq SoC 7020 FPGA (the same board used for the proposed architecture), constituting the PS of the SoC. Its architecture supports out-of-order execution, with dual instruction issue and 128-bit SIMD extensions. This allows issuing up to 2 instructions per clock cycle. In order to take full benefit of all vector capabilities of the ARM processor, the processor's SIMD extension (NEON intrinsics [37]) is used.
Intel Core i7 3820: A high-performance GPP, running at a maximum frequency of 3.6 GHz. This processor uses a complex control structure capable of multiple instruction issue with out-of-order and speculative execution (issuing up to 6 micro-ops per clock cycle [38]), achieving an average of 2 Instructions Per Cycle (IPC) for the evaluated algorithms and respective datasets. The SSE2 SIMD extension [38] was used with 128-bit wide vectors.
Bioblaze [23]: A dedicated ASIP, running at a frequency of 158 MHz. It uses a 128-bit adapted SIMD extension ISA, and it was implemented in the same Zynq FPGA, for a fair comparison.
The SW algorithm was implemented in all architectures, while the Viterbi algorithm was only imple-
mented in the first two.
5.2.2 Application Benchmark
The benchmark applications consist of the previously introduced DP algorithms: the SW and Viterbi algorithms. Both were implemented to solve sequence alignment problems between a query and a reference sequence.
5.2.2.A Smith-Waterman
As described in the previous section, the considered implementation of the SW algorithm uses 8 bits for all symbols and scores. Given that the vector lengths are dimensioned to a maximum width of 128 bits, this results in a total of 16 8-bit cells being processed in parallel (4 cells per execution unit in the proposed architecture).
The considered SW algorithm implementation was already detailed in chapter 4. To summarize, the algorithm is parallelized along the anti-diagonal (in order to avoid data dependencies) and processed along the query sequence, aligning smaller reference sub-sequences at a time. During the steady state of the algorithm, this processing scheme results in only 5 clock cycles per iteration of the DP scoring matrix. In practice, given the 16-cell parallelism, this results in 3.2 cells being computed each clock cycle. This is made possible by the DSU, which parallelizes the memory accesses, removing
their impact from the inner loop of the algorithm. Nevertheless, the considered processing scheme presents critical sections where an extra memory access is required to migrate to a new reference sub-sequence. However, these memory accesses can also be performed by the DSU, thus eliminating any performance impact caused by the outer loop.
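The steady-state figures quoted above can be cross-checked with a few lines of arithmetic. The MCUPS value is a derived ideal upper bound at the 98.5 MHz post-route frequency reported in section 5.1, ignoring start-up and outer-loop effects.

```python
cycles_per_iteration = 5   # SW inner-loop length in the proposed VLIW
cells_per_iteration = 16   # 4 execution units x 4 8-bit cells each
cells_per_cycle = cells_per_iteration / cycles_per_iteration  # = 3.2

# Ideal steady-state bound at the 98.5 MHz post-route frequency,
# in millions of cell updates per second (MCUPS):
peak_mcups = cells_per_cycle * 98.5
```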
For the remaining benchmarked architectures, the implemented SW algorithm follows Farrar's implementation [12], by using their SIMD ISA extensions with a vector length equivalent to that of the proposed architecture (128 bits), to guarantee a fair comparison. Furthermore, only one core of each architecture is used.

This implementation adopts a striped access pattern processing scheme, along the query sequence direction, where the computations are carried out in several separate F stripes that cover different parts of the query sequence. Accordingly, the query is divided into F p-length segments, where p is given by the number of vector elements that can be simultaneously accommodated in a SIMD register (see figure 5.2(a)). This results in a value of p equal to 16, for 8-bit data elements and 128-bit SIMD registers.

(a) Memory layout for the query profile. The vectors run parallel to the query sequence in a striped pattern.
(b) Data dependencies between the last F vector and the first.
Figure 5.2: Striped pattern processing scheme and correspondent dependencies (figures taken from [12]).
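For reference, the striped index mapping used by Farrar's scheme can be sketched as follows. This is a hedged sketch whose notation differs slightly from the text above: here p is the number of SIMD lanes, t = ceil(Q/p) is the segment length, and padding conventions vary between implementations.

```python
def striped_order(Q, p):
    """Query indices in striped processing order: vector v holds
    positions v, v + t, ..., v + (p - 1) * t, with t = ceil(Q / p);
    positions past the end of the query are padding (None)."""
    t = -(-Q // p)  # segment length, ceil(Q / p)
    return [[v + k * t if v + k * t < Q else None for k in range(p)]
            for v in range(t)]
```

With Q = 8 and p = 4 this yields [[0, 2, 4, 6], [1, 3, 5, 7]]: each 4-lane vector touches one query position per segment, which is what keeps the per-vector updates independent within a pass.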
However, the data elements in this processing scheme are not fully independent, since the F segments have vertical dependencies on each other (see figure 5.2(b)). Hence, after all segments are processed, a lazy loop is executed in order to verify if any data hazards have occurred. If a correction is needed, a second pass of the loop is required to correct the errors, before a new reference symbol is loaded for alignment. Although this loop is done in the outer loop of the algorithm (after the query sequence is fully swept), its performance impact is still very relevant, especially when compared to the anti-diagonal processing scheme, where no data dependencies occur and, therefore, no lazy loops are required.
The extended SIMD ISA offered by the Bioblaze ASIP is specially tailored for the SW algorithm. Therefore, it results in an accelerated version of the original Farrar implementation, since an efficient fine-grain parallelism exploitation can be extracted. However, it still remains the same algorithm, with its striped processing scheme and lazy loops.
Dataset
To benchmark the SW algorithm, a DNA dataset composed of several reference sequences (ranging from 128 to 16384 elements) and a set of query sequences with lengths ranging from 20 to 2276 elements was used. The reference sequences correspond to twenty indexed regions of the Homo sapiens breast cancer susceptibility gene 1 (BRCA1 gene, NC 000017.11). The query sequences were obtained from a set of 22 biomarkers for diagnosing breast cancer (DI183511.1 to DI183532.1) and a fragment, with 68 base pairs, of the BRCA1 gene with a mutation related to the presence of a Serous Papillary Adenocarcinoma (S78558.1).
5.2.2.B Viterbi
The considered implementation of the Viterbi algorithm adopts a representation with 16 bits for all
symbols and scores. The vector lengths are dimensioned to a maximum width of 128-bits, which results
in a total of 8 16-bit cells being processed in parallel (2 cells in each execution unit for the proposed
architecture), corresponding to double the cell size that was adopted in the SW algorithm. This is due to
the higher precision requirements of the Viterbi algorithm versus the SW.
The Viterbi algorithm implementation on the proposed architecture was already described in detail
in chapter 4. Just like the SW, the algorithm is parallelized along the anti-diagonal and along the query
sequence, partitioning the reference sequence into smaller sub-sequences. As a result, during the steady state of the algorithm, each iteration takes an average of 23 clock cycles, computing 8 cells (effectively taking 2.875 clock cycles per cell, given the 8-cell parallelism). This is made possible by the
DSU, which parallelizes the high number of memory accesses, removing their impact from the inner
loop of the algorithm. Additionally, the processing scheme presents critical sections whenever the end
of the query is reached. These critical sections will introduce a small computational delay (nonexistent
in the SW algorithm) in order to ensure the commitment of the data dependencies. Therefore, the outer
loop accounts for 3 additional inner loop iterations, or 69 clock cycles.
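These figures are internally consistent, as a quick check shows:

```python
inner_cycles = 23        # Viterbi inner-loop length (figure 4.6)
cells_per_iteration = 8  # 4 execution units x 2 16-bit cells each
cycles_per_cell = inner_cycles / cells_per_iteration  # = 2.875
outer_delay = 3 * inner_cycles  # 3 extra inner-loop iterations = 69 cycles
```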
For the remaining evaluation platforms, HMMER's [9] Viterbi implementation was used. This implementation follows a processing scheme very similar to Farrar's implementation of the SW algorithm, with the required modifications to suit the Viterbi algorithm. As such, the implementation follows the same striped access pattern processing scheme along the query sequence, where the computations are carried out in several separate F stripes that cover different parts of the query sequence.
Differently from the SW algorithm, all dependencies for the match states in Viterbi's algorithm depend on scores from previous row and column states (as seen in chapter 4). Therefore, HMMER's implementation uses a delayed load/store scheme that only stores the new values after the preemptive load of the previous values. Although this algorithm inherently has more instructions than the SW, this instruction reordering helps to minimize the number of required instructions, at the cost of more storage. Additionally, the lazy loops will still exist in the outer loop of the algorithm (whenever the end of the query is reached). However, unlike their SW counterparts, the lazy loops in Viterbi's algorithm are simpler, with a lower impact on the resulting performance.
Dataset
To evaluate Viterbi's algorithm implementation, a sample of 28 HMMs from the Dfam database of Homo sapiens DNA [39] was used. The adopted model lengths vary from 60 to 3000, increasing in steps of
roughly 100 model states. These models were created by the HMMER3.1b1 tool [9] and their complete
list is presented below (their length is prefixed to the model name):
M0063-U7 M0700-MER77B M1409-MLT1H-int M2204-CR1 Mam
M0101-HY3 M0804-LTR1E M1509-LTR104 Mam M2334-L1M2c 5end
M0200-MER107 M0900-MER4D1 M1597-Tigger6b M2434-L1MCa 5end
M0301-Eulor9A M1000-L1MEg2 5end M1727-L1P3 5end M2532-L1MC3 3end
M0401-MER121 M1106-L1MD2 3end M1817-REP522 M2629-L1MC4a 3end
M0500-LTR72B M1204-Charlie17b M1961-Charlie4 M2731-Tigger4
M0600-MER4A1 M1302-HSMAR2 M2101-L1MEg 5end M2858-Charlie12
A query sequence (generated by the HMMER tool) with a length of 10000 symbols was used to eval-
uate the alignment against all the above reference sequences. Additionally, in order to study the impact
of both the query and reference lengths in the algorithm performance, a sample of 17 generated query
sequences, with lengths ranging from 20 to 10000, was used to evaluate the algorithm’s performance in
the alignment against the longest reference sequence with a length of 2991 symbols.
5.2.3 Performance Evaluation
In the proposed architecture, both the RAM and the local fast memory are pre-loaded with the refer-
ence and query sequence (RAM), together with all the necessary constants and cost/score values (both
memories) required by the evaluated algorithms. Therefore, only the algorithm steps are accounted for
in the performed evaluations.
Accurate clock cycle measurements of the time required to execute each biological sequence analysis in the proposed platform were obtained using Xilinx ISim [40]. In the Bioblaze, the clock cycle measurements were obtained using Modelsim SE 10.0b [41]. In the ARM Cortex-A9 and the Intel Core i7, the system timing functions were used to determine the total execution time of the DNA sequence alignment. To improve the measurement accuracy, several repetitions of the same alignment were performed. The obtained values were subsequently divided by the number of repetitions and converted to clock cycles using the processor clock frequency.
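The conversion just described can be sketched as a small helper (an illustration under the stated assumptions; cycles are recovered from wall-clock time as seconds × clock frequency):

```python
def avg_cycles_per_run(total_seconds, repetitions, clock_hz):
    """Average clock cycles per alignment: divide the measured time by
    the repetition count, then convert to cycles with the clock rate."""
    return (total_seconds / repetitions) * clock_hz

# e.g. 2 s for 10 repetitions on a 100 MHz clock -> 20e6 cycles per run
print(avg_cycles_per_run(2.0, 10, 100e6))  # 20000000.0
```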
The performance evaluation then relies on two metrics: the number of Clock Cycles per Cell Update (CCPCU) and the number of Cell Updates Per Second (CUPS).
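Both metrics reduce to a one-line formula each; the sketch below (helper names are illustrative) also checks them against one value reported later, in Table 5.2 and figure 5.4(a), for the proposed architecture:

```python
def ccpcu(cycles, m, n):
    """Clock Cycles per Cell Update: c / (m * n); lower is better."""
    return cycles / (m * n)

def cups(m, n, seconds):
    """Cell Updates per Second: (m * n) / t; higher is better."""
    return (m * n) / seconds

# Sanity check against Table 5.2 / figure 5.4(a): 2.911e6 cycles for a
# 2276-symbol query against a 4092-element reference gives ~0.31 CCPCU.
print(round(ccpcu(2.911e6, 2276, 4092), 2))  # 0.31
```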
5.2.3.A Smith-Waterman
Table 5.2 depicts the average number of clock cycles to complete the DNA sequence alignment in all
evaluated architectures, for the previously presented dataset. The resulting clock cycle ratios between
the reference architectures and the proposed architecture can be observed in the respective columns
(relating the observed differences in terms of clock cycles), which account for the affine model of the
algorithm.
The charts in figure 5.3 were drawn in order to study how the number of clock cycles is affected by the
length of each sequence (both query and reference sequences). The plot in figure 5.3(a) represents the
number of clock cycles of the Bioblaze and the proposed architecture for an alignment between a fixed
Table 5.2: Average number of clock cycles for different DNA query sequences matched against a 4092-element reference sequence, using the SW algorithm and the considered execution platforms, with the respective clock cycle ratios.

Clock Cycles [×10^6]
Query Size                20     68     74     85     94    685   1861   2276
Proposed Architecture  0.026  0.087  0.095  0.109  0.120  0.876  2.380  2.911

Clock Cycles (c.c.) [×10^6]
Query  ARM Cortex-A9    c.c.  BioBlaze    c.c.  Intel Core   c.c.
Size   (NEON)          ratio  [23]       ratio  i7 3820     ratio
  20    1.154         44.990   0.307    11.969   0.384     14.769
  68    1.373         15.776   0.555     6.377   0.606      6.966
  74    1.339         14.139   0.543     5.734   0.429      4.516
  85    1.470         13.515   0.631     5.801   0.480      4.404
  94    1.373         11.415   0.627     5.213   0.487      4.058
 685    6.303          7.195   3.375     3.853   1.504      1.717
1861   16.262          6.833   8.848     3.718   3.530      1.483
2276   19.491          6.696  10.744     3.691   4.163      1.430
query sequence (with 64 symbols) and multiple references (ranging from 128 to 16384 symbols). The
plot in figure 5.3(b) represents the number of clock cycles of the same architectures for an alignment
between a fixed reference sequence (with 4096 symbols) and multiple queries (ranging from 20 to 2276 symbols).
Both graphics are accompanied by the respective clock cycle ratios between the reference architecture
and the proposed architecture.
As it can be observed, the variation of the length of both sequences does not have any significant
impact on the resulting gains obtained for the proposed architecture. In fact, the clock cycle ratio tends
to stabilize around 8.15 for large reference sequences aligned to a fixed query sequence composed of
64 symbols, and around 3.7 for large queries aligned to a fixed reference composed of 4096 symbols.
The results shown in figure 5.3(a) (where the query sequence is matched against increasingly long reference sequences) also demonstrate that the percentage of lazy loop occurrences remains almost constant throughout all the alignments, as supported by the speedup stabilization.
The execution times for the remaining architectures (ARM Cortex-A9 and Intel i7) are not depicted in these graphics, for clarity. In fact, since they run the same algorithm implementation as the BioBlaze (Farrar's implementation), they yield very similar instruction streams, resulting in similar plots when compared to the proposed architecture.
Figure 5.4 presents a more convenient performance metric: the clock cycles per cell update (CCPCU) (lower is better). These values were obtained by dividing the total number of clock cycles (c) by the product of the lengths of the reference and query sequences (m and n, respectively): c/(m × n). As it can be seen, the proposed architecture achieves a number of CCPCU 13.7x lower than the ARM Cortex-A9, even though the latter processor can issue two instructions per clock cycle. When compared with the Bioblaze and the Intel i7, a CCPCU 5.44x and 4.32x lower,
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Reference Length; series: BioBlaze, VLIW, C.C. Ratio]
(a) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed query sequence composed of 64 symbols and multiple reference sequences.
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Query Length; series: BioBlaze, VLIW, C.C. Ratio]
(b) Average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed architecture, when considering a fixed reference sequence composed of 4096 symbols and multiple query sequences.
Figure 5.3: Comparison of the average number of clock cycles for the SW algorithm implementation using the Bioblaze and the proposed VLIW architecture, with different query and reference widths.
respectively, is achieved. This proves that the SW algorithm has a much better raw performance in the
proposed architecture than in the other architectures, showcasing the advantages of a better data-level
parallelism along the anti-diagonal.
In addition to the CCPCU comparison, the attained raw throughput, measured in Cell Updates per Second (CUPS), was also assessed (see figure 5.4(b)). This metric accounts for the total number of
cells (given by the length of the query sequence (m) times the length of the reference sequence (n))
that are updated in a corresponding runtime (t), in seconds (accounting for the maximum operating frequency of each implementation platform): (m × n)/t. Therefore, the higher the CUPS, the better the
performance.
The analysis of the MCUPS metric demonstrates that, despite using a considerably lower operating
frequency than the other architectures, the proposed architecture achieves a throughput superior to both
the ARM (2.54x) and the Bioblaze (5.01x). However, as it would be expected, the Intel i7 achieves a
[Bar chart (CCPCU): ARM Cortex-A9 4.29, BioBlaze 1.70, Intel Core i7 3820 1.35, Proposed Architecture 0.31]
(a) Clock Cycles per Cell Update (CCPCU)
[Bar chart (MCUPS, log scale): ARM Cortex-A9 124.24, BioBlaze 62.94, Intel Core i7 3820 2274.07, Proposed Architecture 315.18]
(b) Mega Cell Updates per Second (MCUPS)
Figure 5.4: Performance evaluation results for the SW algorithm implementation in all evaluation architectures.
much superior throughput (7.2x over the proposed architecture) given its much higher operating fre-
quency (31.17x over the proposed architecture).
5.2.3.B Viterbi
The average number of clock cycles for the Viterbi algorithm to execute the DNA sequence alignment in the considered architectures is presented in table 5.3. This table depicts the results obtained for an
alignment between selected reference sequences from the dataset and a fixed query sequence with
a length of 10000 symbols. It also includes the respective clock cycle ratios between the reference
architectures and the proposed architecture (relating the observed differences in terms of clock cycles).
Table 5.3: Average number of clock cycles for different DNA reference sequences matched against a 10000-element query sequence using the Viterbi algorithm, when implemented in the considered execution platforms.

Clock Cycles (c.c.) [×10^6]
Reference  Proposed      ARM Cortex-A9    c.c.  Intel Core   c.c.
Size       Architecture  (NEON)          ratio  i7 3820     ratio
 200        6             130           22.624   46.05       7.68
 472       14             311           22.911   54.17       3.29
 900       26             565           21.822   95.17       3.66
1305       38             817           21.754  141.22       3.72
1727       50            1188           23.885  190.34       3.81
2204       63            1513           23.847  239.46       3.80
2532       73            1771           24.297  251.11       3.28
2991       86            2117           24.588  288.58       3.36
Similarly to what was done for the SW algorithm, figure 5.5 depicts additional plots representing the average number of clock cycles and the corresponding variation for the Viterbi algorithm implementation for several query-reference sets, when considering the proposed architecture and the ARM Cortex-A9. The Intel architecture is not presented, for clarity. Additionally, since it implements
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Reference Length; series: ARM, VLIW, C.C. Ratio]
(a) Average number of clock cycles for a fixed query sequence composed of 10000 symbols and multiple reference sequences.
[Plot: Clock Cycles [×10^6] (log scale) and C.C. Ratio vs. Query Length; series: VLIW, ARM, C.C. Ratio]
(b) Average number of clock cycles for a fixed reference sequence composed of 2991 symbols and multiple query sequences.
Figure 5.5: Comparison of the average number of clock cycles between the ARM Cortex-A9 and the proposed VLIW architecture, when executing the Viterbi algorithm with different query and reference widths.
the same algorithm as the ARM Cortex-A9, it yields very similar instructions, resulting in a very similar
plot (after accounting for the performance differences).
The plot in figure 5.5(a) refers to the average number of clock cycles of a fixed query sequence (composed of 10000 symbols) aligned against multiple references, while the plot in figure 5.5(b) refers to the average number of clock cycles of a fixed reference sequence (with a length of 2991 symbols) aligned against multiple query sequences. As it can be observed, the increase of the reference sequence
length leads to a very slow stabilization of the clock cycle ratio of the proposed architecture over the
ARM, reaching a value of 25. When varying the query sequence length, the clock cycle ratio stabilizes
very fast with the length increase, at a value of 24.6. Given the slow rate of the clock cycle ratio stabilization in the plot in figure 5.5(a), these results demonstrate that the impact caused by the critical sections in the outer loop of the algorithm implementation in the proposed architecture is negligible, when compared to the other architectures.
Figure 5.6(a) depicts the CCPCU metric. By following a trend entirely similar to the previously pre-
sented results, the proposed architecture achieves a number of CCPCU 23.4x lower than the ARM and
3.45x lower than the Intel i7.
[Bar chart (CCPCU, log scale): ARM Cortex-A9 67.48, Intel Core i7 3820 9.94, Proposed Architecture 2.88]
(a) Clock Cycles per Cell Update (CCPCU)
[Bar chart (MCUPS, log scale): ARM Cortex-A9 7.9, Intel Core i7 3820 308.85, Proposed Architecture 34.26]
(b) Mega Cell Updates per Second (MCUPS)
Figure 5.6: Performance evaluation results for the Viterbi algorithm implementation in all evaluated architectures.
Additionally, figure 5.6(b) presents the attained raw throughput, measured in CUPS. The graphic
shows a speedup of the proposed architecture of 4.34x over the ARM Cortex-A9. However, given Intel’s
considerably higher operating frequency, the proposed architecture loses to it, with the Intel having a
speedup of 9.01x over the proposed architecture.
5.3 Performance and Energy Efficiency
The considered architectures were also evaluated regarding their energy efficiency and performance-
energy efficiency. The adopted energy efficiency metric is the Cell Updates per Joule (CUPJ), given by
the total number of processed cells, divided by the total consumed energy. Naturally, the higher the
CUPJ, the better the energy efficiency.
The adopted performance-energy efficiency metric is given in Cell Updates per Joule-Second (CUPJS)
and it can be regarded as an inversion and normalization of the commonly used Energy-Delay Product
(EDP) metric. In fact, while the EDP is generally given by the product of the total energy consumption
and the corresponding runtime, the adopted CUPJS is obtained by inverting the EDP and multiplying it
by the total number of processed cells. Just like the previous metrics, the higher the CUPJS, the better
the performance-energy efficiency. It is important to notice that the architecture with the best results
in terms of the CUPJS metric is not necessarily the architecture with the highest performance and the
lowest power consumption. In fact, a given platform can have extremely high performance at a high energy cost, and still achieve a better performance-energy ratio than an architecture with lower performance and very low energy consumption. The final conclusion must always take the target application domain into account: it often imposes strict power requirements, and therefore does not always favor the architecture with the best performance-energy efficiency alone.
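Under these definitions, both energy metrics reduce to a few lines (a sketch; the helper names are illustrative). Note that CUPJS = cells / (E · t) is exactly the EDP inverted and scaled by the number of cells:

```python
def cupj(m, n, energy_joules):
    """Cell Updates per Joule: processed cells per joule consumed."""
    return (m * n) / energy_joules

def cupjs(m, n, energy_joules, seconds):
    """Cell Updates per Joule-Second: the Energy-Delay Product
    (EDP = E * t) inverted and multiplied by the number of cells."""
    return (m * n) / (energy_joules * seconds)

# Energy follows from average power: E = P * t. For example, a platform
# dissipating 0.5 W over a 2 s run consumes 1 J.
energy = 0.5 * 2.0
print(cupj(1000, 1000, energy))        # 1000000.0
print(cupjs(1000, 1000, energy, 2.0))  # 500000.0
```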
Table 5.4 depicts the operating frequency and power estimation for all evaluated architectures. The
power corresponding to the proposed architecture, the ARM and the Bioblaze, was estimated with the
Xilinx Power Estimation tool [36], assuming worst-case conditions for flip-flop and memory updates. The
Intel i7 power was estimated by measuring the Intel performance counters, available in the processor.
Table 5.4: Operating frequency and power estimation for all evaluated architectures.

           Proposed Architecture  ARM      BioBlaze  Intel i7
Frequency  98.5 MHz               533 MHz  158 MHz   3070 MHz
Power      0.584 W                0.95 W   0.3 W     38 W
5.3.1 Smith-Waterman
The obtained CUPJ and CUPJS results for the SW implementation are presented in figures 5.7 (a)
and (b), respectively.
[Bar chart (MCUPJ): ARM Cortex-A9 130.78, BioBlaze 331.93, Intel Core i7 3820 59.84, Proposed Architecture 539.69]
(a) Mega Cell Updates per Joule (MCUPJ)
[Bar chart (PCUPJS): ARM Cortex-A9 127.47, BioBlaze 175.64, Intel Core i7 3820 368.9, Proposed Architecture 412.43]
(b) Peta Cell Updates per Joule-Second (PCUPJS)
Figure 5.7: Performance and energy evaluation results obtained for the SW algorithm implementation in all the evaluated architectures.
The proposed architecture achieves an energy efficiency 4.13x, 1.63x and 9.02x greater than the
ARM Cortex-A9, the BioBlaze and the Intel i7, respectively. This was expected, given the lower power
consumption of the proposed architecture against the ARM and the Intel. In fact, although the Bioblaze
has a lower power consumption, the higher throughput of the proposed architecture results in a better
energy efficiency.
Regarding the performance-energy metric, the proposed architecture achieved results 3.24x and
2.35x greater than the ARM Cortex-A9 and the Bioblaze, respectively. These results were also expected,
given the higher raw throughput and better energy efficiency of the proposed architecture. Regarding
the Intel i7, and despite its very high computational performance, the proposed architecture, with its
low-power consumption, manages to achieve a performance-energy efficiency 1.12x higher, compensating
the lower performance with higher energy savings.
These results make the proposed architecture a well-suited candidate for mobile low-power environments. Furthermore, they demonstrate that the performance of a low-power architecture can be compared to that of a high-end GPP, when energy efficiency is also accounted for.
5.3.2 Viterbi
The obtained CUPJ and CUPJS results for the Viterbi implementation are represented in figures
5.8(a) and (b), respectively.
[Bar chart (MCUPJ, log scale): ARM Cortex-A9 8.31, Intel Core i7 3820 8.13, Proposed Architecture 58.66]
(a) Mega Cell Updates per Joule (MCUPJ)
[Bar chart (PCUPJS, log scale): ARM Cortex-A9 8.1, Intel Core i7 3820 50.1, Proposed Architecture 44.83]
(b) Peta Cell Updates per Joule-Second (PCUPJS)
Figure 5.8: Performance and energy evaluation results obtained for the Viterbi algorithm implementation in all the evaluated architectures.
As it can be seen, the proposed architecture presents an energy efficiency that is 7.06x and 7.22x
greater than the ARM Cortex-A9 and the Intel i7, respectively. These results were both expected given
the lower power consumption and high raw performance observed in the proposed architecture when
compared to the other evaluated architectures.
For the performance-energy metric, it is possible to observe a gain of about 5.53x over the ARM. However, when compared to the Intel, the proposed architecture has a worse performance-energy efficiency, with the Intel having a gain of 1.12x. This is mainly due to the much higher operating frequency of the Intel i7 processor (and thus its very high throughput), which, in the long run, compensates for its much higher energy consumption when compared with the proposed architecture.
However, as seen for the SW algorithm, the high energy efficiency of the proposed architecture makes it an excellent candidate for mobile low-power environments.
5.4 Summary
Two widely used DP algorithms (SW and Viterbi algorithms) were evaluated in the proposed archi-
tecture and in 3 alternative state-of-the-art architectures: the ARM Cortex-A9, representing a low-power
GPP; the Intel Core i7 3820, representing a high-performance GPP; and a low-power dedicated ASIP,
specifically tailored for the SW algorithm.
For both evaluated algorithm implementations, the proposed architecture manages to achieve a
better raw performance than all the reference architectures, with the exception of the Intel Core i7, given
its much higher operating frequency (31.17x the operating frequency of the proposed architecture). It
also achieved a better energy efficiency than all other architectures, validating the proposed architecture
for low-power embedded environments. A performance-energy metric was also evaluated, where the
proposed architecture managed to surpass all evaluated architectures with the exception of the Intel
Core i7 for the Viterbi implementation (although the Intel i7 only achieved a gain of 1.12x). This exception
can be explained by the fact that no optimized instructions (for the Viterbi algorithm) were added to the
ISA in the proposed architecture, as well as the difference in operating frequencies.
These results also demonstrate that the proposed programmable architecture, specially tailored for
DP algorithms, can compete with higher-end GPPs (performance wise), in low-power environments,
such as in embedded systems (e.g. biomarker detection SOCs).
6 Conclusions and Future Work
6.1 Conclusion
The proposed processor is based on a VLIW architecture, composed of several independent vector
execution units and a DSU to parallelize memory accesses. The architecture exploits DLP by computing
vector instructions in each execution unit, and exploits ILP by issuing a bundle of instructions to the
parallel execution units.
The custom ISA was specially adapted for DP algorithms, allowing a high level of parallelism, with
reduced and more efficient hardware requirements. Each execution unit has its own register bank,
assuring a convenient data organization and avoiding structural hazards. Additionally, the architecture
presents two distinct memories (a RAM and a local fast memory) and shared FUs, thus reducing the
impact caused by memory accesses and reducing the hardware requirements, respectively. It also
presents special mechanisms to access data on neighboring cells, such as sniffing and register window
mechanisms, together with shared memories, that further help reducing the number of clock cycles in
the algorithm implementations.
Two benchmark algorithms (the SW and the Viterbi algorithms) were implemented in the proposed
architecture and the reference state-of-the-art architectures, which consist of: i) a mobile low power
ARM Cortex-A9 GPP; ii) a high performance Intel Core i7 3820 GPP; iii) and a dedicated ASIP, the
Bioblaze. These algorithm implementations made use of the corresponding vector extensions in all
architectures, with 128-bit data vectors (8-bit words for the SW implementation and 16-bit words for the
Viterbi implementation).
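With 128-bit vectors, the data width directly sets the number of cells processed per vector instruction, a trivial check of the lane counts implied above:

```python
VECTOR_BITS = 128
lanes_sw = VECTOR_BITS // 8        # 16 parallel 8-bit cells (SW)
lanes_viterbi = VECTOR_BITS // 16  # 8 parallel 16-bit cells (Viterbi)
print(lanes_sw, lanes_viterbi)     # 16 8
```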
From the obtained performance results, the proposed architecture achieved a better throughput than
most architectures (maximum speedup of 5.01x over the Bioblaze, for the SW algorithm, and 4.34x
over the ARM Cortex-A9, for the Viterbi algorithm), only losing to the Intel i7 (in both algorithm imple-
mentations) due to its very high operating frequency (31.17x). However, the energy efficiency results
demonstrate that the proposed architecture is well suited for low power platforms. In fact, it achieved
an energy efficiency superior to all the reference architectures, reaching gains as high as 9.02x over
the Intel i7, for the SW algorithm implementation. When accounting for a performance-energy efficiency
metric, the proposed architecture still achieved better results than most reference architectures, only losing to the Intel i7 on the Viterbi implementation (which has 1.12x better performance-energy efficiency), due to its very high throughput, which compensates for its high power consumption.
Therefore, the presented results confirm that the devised architecture is a viable solution not only
for DP applications, but also for restricted low power environments, such as embedded systems. In
addition, the high performance-energy efficiency results demonstrate that it can even surpass state-of-
the-art GPPs, filling a gap in programmable low power and high performance architectures.
6.2 Future Work
The proposed VLIW architecture was implemented in a Zynq FPGA. In chapter 3, an interfacing
structure for the architecture was proposed, but it was not fully implemented. Completing the design and implementation of the interface would ensure that real-world applications could interact with the architecture, allowing the study of different integration technologies such as ASICs, FPGAs or low power
systems like biochips (for bioinformatic DP algorithms).
The scalability of the proposed architecture should also be further analyzed, in order to find the best
ratio between the vector width/number of execution units and the maximum performance and energy
efficiency.
Additionally, a broader set of algorithms, such as matrix chain multiplication and Dijkstra's shortest path, or even non-DP algorithms, should be considered and implemented in the architecture, since its ISA can be easily modified to extend the algorithm support. This study would consolidate the proposed design as a programmable and highly energy-efficient architecture.
Bibliography
[1] K. Shibu, Introduction to Embedded Systems, 1st Edition. McGraw-Hill Education, June 2009.
[2] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler, “Genbank,”
Nucleic acids research, vol. 28, no. 1, pp. 15–18, 2000.
[3] D. A. Benson, M. Cavanaugh, I. Karsch-Mizrachi, D. J. Lipman, and J. Ostell, “Genbank,”
Nucleic acids research, vol. 41, no. D1, pp. D36–D42, January 2013.
[4] T. F. Smith and M. S. Waterman, “Identification of Common Molecular Subsequences,” Journal of
molecular biology, vol. 147, no. 1, pp. 195–197, 1981.
[5] S. B. Needleman and C. D. Wunsch, “A general Method Applicable to the Search For Similarities
in the Amino Acid Sequence of Two Proteins,” Journal of molecular biology, vol. 48, no. 3, pp.
443–453, 1970.
[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic Local Alignment Search
Tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990.
[7] W. R. Pearson and D. J. Lipman, “Improved Tools For Biological Sequence Comparison,”
Proceedings of the National Academy of Sciences, vol. 85, no. 8, pp. 2444–2448, 1988.
[8] A. Viterbi, “Error Bounds For Convolutional Codes And an Asymptotically Optimum Decoding Algo-
rithm,” IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[9] S. R. Eddy, “Accelerated Profile HMM Searches,” PLoS computational biology, vol. 7, no. 10, p.
e1002195, 2011.
[10] O. Gotoh, “An Improved Algorithm for Matching Biological Sequences,” Journal of molecular
biology, vol. 162, no. 3, pp. 705–708, 1982.
[11] A. Wozniak, “Using Video-Oriented Instructions to Speed Up Sequence Comparison,” Computer
applications in the biosciences: CABIOS, vol. 13, no. 2, pp. 145–150, 1997.
[12] M. Farrar, “Striped Smith–Waterman Speeds Database Searches Six Times Over Other SIMD Im-
plementations,” Bioinformatics, vol. 23, no. 2, pp. 156–161, 2007.
[13] T. Rognes and E. Seeberg, “Six-fold Speed-up of Smith–Waterman Sequence Database Searches
Using Parallel Processing on Common Microprocessors,” Bioinformatics, vol. 16, no. 8, pp. 699–
706, 2000.
[14] T. Rognes, “Faster Smith-Waterman Database Searches With Inter-Sequence SIMD Parallelisa-
tion,” BMC bioinformatics, vol. 12, no. 1, p. 221, 2011.
[15] C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen, Introduction to Algorithms. The MIT
press, 2001.
[16] S. R. Eddy, “Profile Hidden Markov Models,” Bioinformatics, vol. 14, no. 9, pp. 755–763, 1998.
[17] N. Casagrande, “B.A.B.A. - Basic-Algorithms-of-Bioinformatics Applet,” Last Accessed on 12
August, 2014. [Online]. Available: http://baba.sourceforge.net/
[18] E. Fosler-Lussier, “Markov Models and Hidden Markov Models: A Brief Tutorial,” International
Computer Science Institute Technical Report TR-98-041, 1998.
[19] R. Durbin, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-
bridge university press, 1998.
[20] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov Models in Compu-
tational Biology: Applications to Protein Modeling,” Journal of molecular biology, vol. 235, no. 5, pp.
1501–1531, 1994.
[21] Intel. (2013, Sep.) Intel R© 64 and IA-32 Architectures Software Developer’s Manual. Last Accessed
on September 10, 2014. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/
documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[22] P. Green. (1996) Swat Documentation. Last Accessed on September 10, 2014. [Online]. Available:
http://www.phrap.org/phredphrap/general.html
[23] N. Neves, N. Sebastiao, A. Patricio, D. Matos, P. Tomas, P. Flores, and N. Roma, “BioBlaze: Multi-
core SIMD ASIP for DNA Sequence Alignment,” in Application-Specific Systems, Architectures and
Processors (ASAP), 2013 IEEE 24th International Conference on. IEEE, 2013, pp. 241–244.
[24] W. Martins, J. del Cuvillo, F. Useche, K. B. Theobald, and G. R. Gao, “A Multithreaded Paral-
lel Implementation of a Dynamic Programming Algorithm For Sequence Comparison,” in Pacific
Symposium on Biocomputing, vol. 6, 2001, pp. 311–322.
[25] E. W. Edmiston, N. G. Core, J. H. Saltz, and R. M. Smith, “Parallel Processing of Biological Se-
quence Comparison Algorithms,” International Journal of Parallel Programming, vol. 17, no. 3, pp.
259–275, 1988.
[26] Intel. (2000) Using the Streaming SIMD Extensions 2 (SSE2) to Evaluate a Hidden Markov
Model with Viterbi Decoding. Last Accessed on September 10, 2014. [Online]. Available:
http://software.intel.com/sites/default/files/m/d/4/1/d/8/17679 ap-946 w hmm viterbi.pdf
[27] L. Jia, Y. Gao, J. Isoaho, and H. Tenhunen, “Design of a Super-pipelined Viterbi Decoder,” in Circuits
and Systems, 1999. ISCAS’99. Proceedings of the 1999 IEEE International Symposium on, vol. 1.
IEEE, 1999, pp. 133–136.
[28] N. Sebastiao, N. Roma, and P. Flores, “Scalable Accelerator Architecture for Local Alignment of
DNA Sequences,” 2010, unpublished.
[29] S. Derrien and P. Quinton, “Hardware Acceleration of HMMER on FPGAs,” Journal of Signal
Processing Systems, vol. 58, no. 1, pp. 53–67, 2010.
[30] N. Sebastiao, N. Roma, and P. Flores, “Integrated Hardware Architecture for Efficient Computation
of the n-Best Bio-Sequence Local Alignments in Embedded Platforms,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 20, no. 7, pp. 1262–1275, 2012.
[31] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach, “Accelerating Compute-intensive Applications
With GPUs and FPGAs,” in Application Specific Processors, 2008. SASP 2008. Symposium on.
IEEE, 2008, pp. 101–107.
[32] K. Benkrid, Y. Liu, and A. Benkrid, “A Highly Parameterized and Efficient FPGA-based Skeleton
for Pairwise Biological Sequence Alignment,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 17, no. 4, pp. 561–570, 2009.
[33] A. C. Jacob, J. M. Lancaster, J. D. Buhler, and R. D. Chamberlain, “Preliminary Results in Accel-
erating Profile HMM Search on FPGAs,” in Parallel and Distributed Processing Symposium, 2007.
IPDPS 2007. IEEE International. IEEE, 2007, pp. 1–8.
[34] M. Ferreira, N. Roma, and L. M. Russo, “Cache-Oblivious Parallel SIMD Viterbi Decoding For Se-
quence Search in HMMER,” BMC Bioinformatics, vol. 15, no. 1, p. 165, 2014.
[35] Xilinx. (2013) Xilinx DS190 Zynq-7000 All Programmable SoC Overview. Last Accessed
on September 10, 2014. [Online]. Available: http://www.xilinx.com/support/documentation/
data sheets/ds190-Zynq-7000-Overview.pdf
[36] Xilinx. (2014) Power Estimator User Guide. Last Accessed on September 10,
2014. [Online]. Available: http://www.xilinx.com/support/documentation/sw manuals/xilinx2014 2/
ug440-xilinx-power-estimator.pdf
[37] ARM. (2014) ARM R© NEONTM Intrinsics Reference. Last Accessed on September 10,
2014. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A
arm neon intrinsics ref.pdf
[38] Intel. (2014) Intel R© 64 and IA-32 Architectures Software Developer’s Manual. Last Accessed
on September 10, 2014. [Online]. Available: http://www.intel.com/content/dam/www/public/us/en/
documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
[39] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger, K. Hether-
ington, L. Holm, J. Mistry et al., “Pfam: The Protein Families Database,” Nucleic acids research, p.
gkt1223, 2014.
[40] Xilinx. (2012) ISim User Guide. Last Accessed on September 10, 2014. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/plugin_ism.pdf
[41] Mentor Graphics. (2012) ModelSim® User’s Manual. Last Accessed on September 10, 2014. [Online]. Available: http://www.microsemi.com/document-portal/doc_view/131619-modelsim-user
Appendix A: Proposed Architecture Instruction Set
Figure A.1: Full implemented instruction set.

Instruction                             Mnemonic        Opcode [12:6]   OpControl [5:0]

Arithmetic and Logic Instructions
No operation                            NOP             0000000         000000
Sum                                     SUM             0000010         000000
Sum with immediate                      SUMI            0000010         010000
Subtraction                             SUB             0000101         000000
Subtraction with immediate              SUBI            0000101         010000
Maximum                                 MAX             0000100         000000
Maximum and Move                        MAXMOV          0000011         100000
Comparison                              CMP             0010000         000000
Arithmetic Shift Right                  SRA             0001000         000000
Arithmetic Shift Left                   SLA             0001001         000000
Logic Shift Right                       SRL             0001010         000000
Logic Shift Left                        SLL             0001011         000000
Arithmetic Shift Right (immediate)      SRAI            0001000         010000
Arithmetic Shift Left (immediate)       SLAI            0001001         010000
Logic Shift Right (immediate)           SRLI            0001010         010000
Logic Shift Left (immediate)            SLLI            0001011         010000
Logic OR                                OR              0001100         000000
Logic AND                               AND             0001101         000000
Logic XOR                               XOR             0001110         000000
Logic OR with immediate                 ORI             0001100         010000
Logic AND with immediate                ANDI            0001101         010000
Logic XOR with immediate                XORI            0001110         010000

Memory Instructions
Index Memory address                    INDEX MADDR     1100101         000000
Load Byte                               LB              1100001         000000
Load Half-word                          LH              1100010         000000
Load Data                               LD              1100100         000000
Index local memory address              INDEX SPADDR    1101001         000000
Local memory Load                       SPAD LD         1101000         000000
Store Byte                              SB              1110001         000000
Store Half-word                         SH              1110010         000000
Store Data                              SD              1110100         000000

Control Instructions
Delayed Branch                          BRD             1010000         000000
Immediate Delayed Branch                BRID            1010000         010000
Delayed Branch Equal                    BEQD            1010001         000000
Immediate Delayed Branch Equal          BEQID           1010001         010000
Delayed Branch Not Equal                BNED            1010010         000000
Immediate Delayed Branch Not Equal      BNEID           1010010         010000
Delayed Branch Less than                BLTD            1010011         000000
Immediate Delayed Branch Less than      BLTID           1010011         010000
Delayed Branch Greater than             BGTD            1010100         000000
Immediate Delayed Branch Greater than   BGTID           1010100         010000

Semantics notes (entries recoverable from the figure):
- NOP: no operation.
- SUM/SUMI: Rd ← Ra + Rb (or Ra + Imm); in vector mode the addition is applied independently to each sub-word lane.
- SUB/SUBI, CMP: two's-complement subtraction, Rd ← Ra + ~Rb + 1; for CMP, the MSB is set to 0 if Ra ≥ Rb and to 1 otherwise.
- MAX: Rd ← Ra if Ra > Rb, else Rb (lane-wise in vector mode); for MAXMOV, the result is additionally concatenated to the sniff register.
- Shifts: Rd ← Ra shifted by Rb (or by the immediate); arithmetic shifts preserve the sign.
- LB/LH: load the addressed byte/half-word into the lane of Rd selected by Ta,Tb, with the remainder padded with zeroes; LD: Rd[31:0] ← M[IMR].
- SB/SH: store the byte/half-word lane of Rd selected by Ta,Tb to M[Ra+Rb]; SD: M[Ra+Rb] ← Rd[31:0].
- INDEX MADDR: IMR ← index(Ra, Rb); INDEX SPADDR: IMS ← index(Ra, Rb); SPAD LD: Rd ← Spad[IMS].
- Branches: delayed branches update PC ← PC + Rb (register form) or PC ← PC + Imm (immediate form) when the condition on Ra holds (equal, not equal, less than, or greater than zero).
Appendix B: Viterbi Pseudo-code
Figure B.1: Complete pseudo-code for the Viterbi implementation in the proposed architecture.

Cycle  Data Stream Unit                           Units 1-4
  1    INDEX *tsc++ | STORE xC
  2    LOAD *tsc++ | INDEX *tsc++
  3    LOAD *tsc++ | INDEX *tsc++                 OR dcv | OR dcv | SUM dcv | SUM dcv
  4    LOAD *tsc++ | INDEX *tsc++                 OR xEv | OR xEv | SUM xEv | SUM xEv
  5    LOAD *tsc++ | INDEX *tsc++                 OR mpv | OR mpv | SUM mpv | SUM mpv
  6    LOAD *tsc++ | INDEX *tsc++                 OR ipv | OR ipv | SUM ipv | SUM ipv
  7    LOAD *tsc++ | INDEX *tsc++                 OR dpv | OR dpv | SUM dpv | SUM dpv
  8    LOAD *tsc++ | INDEX *tsc++                 SUM xNv | SUM xNv
  9    LOAD *tsc++                                SUM xNv | SUM xNv
 10    INDEX xJ                                   SUM xBv | SUM xBv
 11    LOAD xJ (once for unit 0) | INDEX xB       SUM xBv | SUM xBv
 12    LOAD xB (once for unit 0)                  SUM mpv | SUM mpv
 13                                               MAX sv | MAX sv | SUM mpv | SUM mpv
 14    INDEX mpv                                  SUM ipv | SUM ipv | MAX sv | MAX sv
 15    LOAD mpv | INDEX dpv                       MAX sv | MAX sv | SUM ipv | SUM ipv
 16    LOAD dpv | INDEX ipv                       SUM dpv | SUM dpv | MAX sv | MAX sv
 17    LOAD ipv                                   MAX sv | MAX sv | SUM dpv | SUM dpv
 18                                               SUM sv | SUM sv | MAX sv | MAX sv
 19    STORE DMXo                                 MAX xEv | MAX xEv | SUM sv | SUM sv
 20    STORE MMXo                                 SUM dcv | SUM dcv | MAX xEv | MAX xEv
 21                                               SUM dcv | SUM dcv
 22                                               SUM mpv | SUM mpv
 23                                               SUM mpv | SUM mpv
 24    INDEX *tsc++                               SUM ipv | SUM ipv
 25    LOAD *tsc++ | INDEX *tsc++                 MAX sv | MAX sv | SUM ipv | SUM ipv
 26    LOAD *tsc++ | INDEX *tsc++                 SUM xJ | SUM xJ | MAX sv | MAX sv
 27    LOAD *tsc++ | INDEX *tsc++ | STORE IMXo    SUM xJ | SUM xJ
 28    LOAD *tsc++ | INDEX *tsc++                 MAX xJ | MAX xJ | SUM xJ | SUM xJ
 29    LOAD *tsc++ | INDEX *tsc++                 SUM xJ | SUM xJ
 30    LOAD *tsc++ | INDEX *tsc++                 SUM xB | SUM xB | MAX xJ | MAX xJ
 31    LOAD *tsc++ | INDEX *tsc++                 MAX xB | MAX xB | SUM xB | SUM xB
 32    LOAD *tsc++                                MAX xB | MAX xB
 33    INDEX xC                                   SUM xN | SUM xN
 34    LOAD xC                                    SUM xN | SUM xN
 35    STORE xJ (from u3)                         SUM xEv | SUM xEv
 36    STORE xB (from u3)                         SUM xEv | SUM xEv
 37                                               SUM xC | SUM xC
 38                                               MAX xC | MAX xC | SUM xC | SUM xC
 39                                               MAX xC | MAX xC
 40                                               MAX xC | MAX xC
 41                                               MAX xC

Sidebar labels in the figure mark the outer-loop and inner-loop regions of the schedule.

Legend (shadings in the figure):
- [DSU] Indexation and scores loading
- [Execution Units] Setting the vector registers to -infinity (OR and SUM instructions are used due to the FUs availability)
- One-time loads for the special states in unit 0
- Delayed load/store scheme
- Indexation and scores loading
- Special state computation and dependency stores