PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA
By
Yuan Zhang
A DISSERTATION
Submitted to Michigan State University
in partial fulfillment of the requirements for the degree of
Computer Science – Doctor of Philosophy
2013
ABSTRACT
PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA
By
Yuan Zhang
Sequence analysis is the process of analyzing DNA, RNA, or peptide sequences using a
wide range of methodologies in order to understand their functions, structures, or
evolutionary history. Next-generation sequencing (NGS) technologies generate large-scale
sequence data with high coverage and nucleotide-level resolution at low cost, benefiting a
variety of research areas such as gene expression profiling, metagenomic annotation, and
ncRNA identification. Functional analysis of NGS sequences has therefore become increasingly
important because it provides information such as the gene expression, protein composition,
and phylogenetic complexity of the species from which the sequences are generated. One basic
step in functional analysis is to classify genomic sequences into different functional
categories, such as protein families or protein domains (or domains for short), which are
independent functional units in a majority of annotated protein sequences.
The state-of-the-art method for protein domain analysis is based on comparative sequence
analysis, which classifies query sequences against annotated protein or domain databases.
There are two types of domain analysis methods: pairwise alignment and profile-based
similarity search. The first uses pairwise alignment tools such as BLAST to search query
genomic sequences against reference protein sequences in databases such as NCBI-nr. The
second uses profile HMM-based tools such as HMMER to classify query sequences into annotated
domain families such as those in Pfam. Compared to the first method, the profile HMM-based
method has a smaller search space and higher sensitivity in remote homolog detection.
Therefore, I focus on profile HMM-based protein domain analysis.
There are several challenges in protein domain analysis of NGS sequences. First, sequences
generated by some NGS platforms such as pyrosequencing have relatively high error rates,
making it difficult to classify the sequences into their native domain families. Second,
existing protein domain analysis tools have low sensitivity for short query sequences and
poorly conserved domain families. Third, the volume of NGS data is usually very large,
making it difficult to assemble short reads into longer contigs. In this work, I address
these three challenges using different methods. Specifically, we have proposed four
tools: HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME detects and corrects
frameshift errors in sequences generated by pyrosequencing technology, and thus accurately
classifies metagenomic sequences containing frameshift errors into their native protein
domain families. MetaDomain and SALT are both designed for short reads generated by NGS
technologies. MetaDomain uses relaxed position-specific score thresholds and alignment
positions to increase sensitivity while keeping the false positive rate low. SALT combines
position-specific score thresholds with graph algorithms and achieves higher accuracy than
MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has
lower memory usage, higher gene coverage, and a lower chimera rate than existing tools.
Finally, I conclude my work and briefly discuss future work.
ACKNOWLEDGMENTS
First and foremost, I would like to thank my adviser, Dr. Yanni Sun. Her decision to admit
me as her PhD student four years ago gave me the precious opportunity to study at
Michigan State University and led me into the world of bioinformatics. During these four
years under her guidance, I have made continuous progress in several aspects, including
reading research papers, proposing research topics, developing methods, designing
experiments, and writing papers. More importantly, I have gradually improved my ability to
conduct in-depth analysis of sophisticated research problems, both independently and
collaboratively, and to use scientific methodologies to solve challenging problems. She also
gave many suggestions on how to effectively present our work to an audience, especially to
people from other research areas. This ability is very important in that it will profoundly
determine my capability to collaborate in a team of people from different backgrounds.
I also want to thank the other committee members, Dr. C. Titus Brown, Dr. Pang-Ning
Tan, and Dr. James R. Cole. They gave many useful suggestions during the course of
my PhD program. I also thank my lab mates Rujira Achawanantakun, Jikai Lei, Cheng
Yuan, Prapaporn Techa-angkoon, and Jiao He. During these years, we have had productive
discussions and cooperation on various research topics, and I have received great help from
them. I would like to acknowledge my colleagues from BeachMint Inc. during my summer
internship, especially Douglas Cohen, Jeff Cooper, and Manunya Rozelle. With their help,
I learned how to apply theories and methods to solve challenging problems in industry. I
gratefully acknowledge other faculty and staff of the CSE department, especially Dr. Rong
Jin, Dr. Jin Chen, Linda Moore, and Norma Teague. I also owe many thanks to my friends
at MSU and in China. All of them have given me a lot of support and help during these years.
My final and most important acknowledgement must go to my family. They have always given
me persistent and determined love and support.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Next-generation sequencing technologies
  1.2 Protein domain analysis of NGS sequences
  1.3 Challenges with protein domain analysis of NGS sequences
Chapter 2 Protein domain classification for metagenomic sequences containing frameshift errors
  2.1 Background
  2.2 Related work
  2.3 Method
    2.3.1 Error models
    2.3.2 The augmented Viterbi algorithm for sequencing error correction
    2.3.3 Running time analysis
  2.4 Results
    2.4.1 Accuracy of HMM-FRAME
    2.4.2 Using HMM-FRAME in "Targeted Metagenomic"
      2.4.2.1 Protein domain analysis of nifH sequences
      2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes
    2.4.3 Protein domain classification in the deep mine data set
  2.5 Conclusion
Chapter 3 Profile HMM-based protein domain classification for short sequences
  3.1 Background
  3.2 Related Work
  3.3 Method
    3.3.1 Pipeline of MetaDomain
    3.3.2 The Viterbi algorithm
    3.3.3 Alignment Filtering
      3.3.3.1 Position specific threshold
      3.3.3.2 Alignment trimming
    3.3.4 Protein domain classification
  3.4 Results
    3.4.1 Identifying transcribed protein domains in transcriptome
      3.4.1.1 Performance of read classification
      3.4.1.2 Identifying transcribed domains in the transcriptome
    3.4.2 Protein domain analysis in a soil metagenomic data set
  3.5 Conclusion
Chapter 4 A Sensitive and accurate protein domain classification tool (SALT) for short reads based on profile HMMs and graph algorithms
  4.1 Background
  4.2 Related Work
  4.3 Methods
    4.3.1 Overview of SALT
    4.3.2 Stage 1: profile HMM-based filtration
      4.3.2.1 Position-specific score threshold
    4.3.3 Stage 2: contig generation
      4.3.3.1 Constructing a hit graph for a family
      4.3.3.2 Find the K longest paths
    4.3.4 Stage 3: E-value computation and contig selection
    4.3.5 Running time analysis
  4.4 Results
    4.4.1 Protein domain classification of very short reads
      4.4.1.1 Determining the true membership of reads
      4.4.1.2 Performance evaluation
    4.4.2 Protein domain classification of an RNA-Seq data of Arabidopsis
    4.4.3 Protein domain classification of a non-model organism
  4.5 Discussion
  4.6 Conclusion
Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SAT-Assembler) for NGS data
  5.1 Background
  5.2 Related Work
  5.3 Methods
    5.3.1 Overview of SAT-assembler
    5.3.2 Profile HMM-based homology search
    5.3.3 Alignment informed graph construction
    5.3.4 Pruning and optimization of overlap graphs
    5.3.5 Guided graph traversal using multiple information
    5.3.6 Contig scaffolding
    5.3.7 Running time analysis
  5.4 Results
    5.4.1 Gene assembly in an RNA-Seq data set of Arabidopsis
      5.4.1.1 Edge creation performance
      5.4.1.2 Performance comparison with other assembly tools
    5.4.2 Targeted gene assembly in a human gut metagenomic data set
  5.5 Conclusion
Chapter 6 Conclusion and future work
Bibliography
LIST OF TABLES
Table 2.1 Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.
Table 3.1 Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.
Table 4.1 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Burkholderia cenocepacia.
Table 4.2 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Arabidopsis.
Table 4.3 Classification results generated by different classifiers on the Radix balthica data set.
Table 4.4 Description of transcribed families uniquely identified by SALT.
Table 5.1 Edge creation performance of three strategies on the RNA-Seq data set of Arabidopsis.
Table 5.2 Performance comparison between different assembly tools on the RNA-Seq data set of Arabidopsis.
Table 5.3 Performance comparison between different assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set.
LIST OF FIGURES
Figure 2.1 Frameshifts cause short alignments with marginal scores.
Figure 2.2 Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)
Figure 2.3 Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.
Figure 2.4 Protein domain classification results for the black sample in the deep mine data set.
Figure 3.1 Change of the read classification sensitivity of HMMER over read length and the average sequence identity of domain families.
Figure 3.2 Histogram of the average pairwise sequence identity for 2558 domains.
Figure 3.3 Three types of alignment distributions.
Figure 3.4 Pipeline of MetaDomain.
Figure 3.5 Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate.
Figure 3.6 ROC curves of HMMER and MetaDomain.
Figure 3.7 Read length distribution in the soil data set.
Figure 3.8 Reads aligned by HMMER and MetaDomain.
Figure 3.9 The distributions of aligned reads for PF09703 by HMMER and MetaDomain.
Figure 4.1 Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two different genes. Both genes are transcribed and sequenced. Red lines: positive reads. Blue lines: negative reads.
Figure 4.2 The pipeline of SALT.
Figure 4.3 (A) Thirteen reads and their alignment layout w.r.t. the profile HMM represented by its matching states. The alignment scores are shown in the table. Blue reads: negative reads. Red reads: positive reads. (B) The constructed hit graph when k* = 4. For simplicity of explanation, mismatches are not allowed in this simple example (i.e., e = 0). Red nodes are created by positive reads. Blue nodes are created by negative reads. (C) The hit graph after removing transitive overlaps and adding the root node.
Figure 4.4 ROC curves of different classifiers. HHblits and SSAKE+HMMER are listed in separate embedded windows because their FP rates are orders of magnitude larger than others.
Figure 4.5 A Venn diagram of the transcribed families identified by different classifiers.
Figure 5.1 The pipeline of SAT-assembler. Reads of the same color belong to the same gene family. Reads from different genes of the same family are distinguished using different patterns. Reads shared by multiple genes from the same family have multiple patterns.
Figure 5.2 (A) Two reads a and b sequenced from different genes of the same family are aligned to the profile HMM of the family. Their sequence overlap is indicated in red. (B) Read a and read b have an alignment overlap of 66 and a sequence overlap of 25 (in bold). (C) The alignment between the translated peptides of a and b is 22 residues.
Figure 5.3 A graph containing reads from two different genes A and B. Nodes in red (v1, v4, and v7) and in blue (v2, v5, and v8) are from gene A and gene B respectively. Nodes in black (v3 and v6) are chimeric nodes because they are shared by the two genes. Arrows with solid lines are real edges. Arrows with dotted lines and dashed lines indicate paired-end reads and transitive edges between two nodes respectively.
Figure 5.4 Three contigs generated from a metagenomic data set. The green parts of the contigs are contained in the target gene and thus are gene segments. The blue parts of the contigs are not gene segments.
Figure 5.5 Chimera rate versus gene coverage when k-mer size or overlap threshold changes for different assembly tools. These values are average values of the assemblers' performance on 3,188 input families.
Chapter 1
Introduction
1.1 Next-generation sequencing technologies
In bioinformatics, sequencing means determining the primary structure of a biological
sequence. Before the advent of new sequencing technologies, Sanger sequencing was the main
method for sequencing DNA. However, this technique has several limitations. For example,
Sanger sequencing is not applicable to sequencing a small amount of DNA, making it expensive
and not accessible to most small labs. Also, the length of the DNA being sequenced is limited.
Next-generation sequencing (NGS) technologies were developed to meet the demand for low-cost
sequencing. These new technologies use massively parallel methods and can produce
large-scale sequence data at low cost, bringing large-scale sequencing within the reach of
many scientists. Moreover, each run generates much more sequence data, with high coverage
and nucleotide-level resolution.
1.2 Protein domain analysis of NGS sequences
Inferring functions from sequences is important in analyzing the different types of data
generated by NGS technologies. One basic step in functional analysis is to classify
NGS sequences into annotated functional categories, such as protein families or protein
domain families. Protein domain analysis has been widely used for functional annotation
of RNA-Seq data [1, 2, 3, 4]. In particular, quantifying the expression levels of protein
domains helps us understand how transcriptional changes of domains are associated with
sequencing conditions, sampling tissues, or experimental treatments in RNA-Seq data. For
example, computational domain analysis was applied to identify domains that play a role
in vernalization and efflux transporters in the gibberellin response in sugar beet [1].
Domain analysis is also frequently used to evaluate and compare the gene annotation quality
of different gene-finding tools [3] or to compare the domain composition of data sampled
using different techniques [4]. Protein domain analysis has also been used to understand the
phylogenetic complexity and biological functions of microbial communities, as well as their
interactions with the host [5, 6, 7]. For example, Ellrott et al. investigated the distribution
of protein families in the currently available human gut genomic and metagenomic data [8].
Schlüter et al. applied HMMER to understand the genetic diversity and composition of a
plasmid metagenome from a wastewater treatment plant [9]. The phylogenetic algorithm
CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify
the source organisms of environmental DNA fragments.
There are two major comparative methods for protein domain analysis. The first method
is based on pairwise sequence alignment tools such as the BLAST software suite [11]. Query
sequences are classified via comparison with annotated protein databases such as NCBI-nr
using BLASTX [11]. The second method is profile-based similarity search, which classifies
queries into characterized protein domain or family databases such as Pfam [12],
TIGRFAM [13], and FIGfams [14]. There also exist comprehensive protein domain search tools
such as InterProScan [15], which combines different sequence and profile-based domain
recognition methods from the InterPro [16] consortium member databases into one resource.
Although BLAST is one of the most efficient protein homology search tools, probabilistic
model-based methods have much better sensitivity for remote protein homology recognition.
Using profile hidden Markov models (HMMs) to represent a protein family
greatly improves homology search sensitivity between highly diverged sequences [17]. Thus
it is desirable to conduct protein domain classification using profile HMM-based tools such as
HMMER [18]. In conjunction with the fast-growing protein domain family database Pfam [12],
which contains over 10,000 annotated protein domain families, HMMER is able to classify
sequences into different domain families with high accuracy. In addition, the latest
implementation of the profile HMM-based domain classification tool, HMMER 3.0 [18], has
achieved speed comparable to BLAST, making it suitable for large-scale protein compositional
analysis. For convenience, we use HMMER to refer to HMMER 3.0 hereafter
unless otherwise specified.
1.3 Challenges with protein domain analysis of NGS sequences
Although profile HMM-based methods have been successfully applied to genome-wide domain
analysis, there are still many challenges in protein domain analysis of NGS sequences,
especially complex metagenomic data. First, sequences generated by some NGS platforms
such as pyrosequencing contain sequencing errors, including insertions or deletions of
nucleotides, especially in homopolymer regions. These errors create frameshifts during
translation, making it difficult to classify the derived peptide sequences into their native
families. Second, as the length of the query reads decreases, existing tools have low
sensitivity in classifying these short reads, especially for poorly conserved domain families.
Many sequencing technologies such as Illumina still generate short reads of 35 bp to 150 bp.
Moreover, protein sequences encoded in individual metagenomic sequence reads may share only
a small overlap with existing protein families. Therefore, a sizable portion of many data
sets still consists of short reads. Third, microbial communities usually contain a large
number of different microbial species, complicating the functional annotation of metagenomic
data.
In order to address these challenges and improve the performance of protein domain
classification, we have proposed three tools: HMM-FRAME, MetaDomain, and SALT.
HMM-FRAME is designed to accurately classify metagenomic sequences containing frameshift
errors. MetaDomain and SALT are designed to directly classify short reads into their native
protein domain families with better sensitivity than existing tools. Compared to MetaDomain,
SALT incorporates graph algorithms to improve the accuracy of protein domain
classification. In my future work, I will focus on accurate and scalable gene assembly from
complex metagenomic data.
Chapter 2
Protein domain classification for
metagenomic sequences containing
frameshift errors
2.1 Background
Culture-independent methods and high-throughput sequencing technologies now enable us
to obtain community random genomes (metagenomes) from different habitats such as arctic
soils and the mammalian gut. Currently, metagenomic annotation focuses on phylogenetic
complexity and protein composition analysis. An important component of protein composition
analysis is protein domain classification, which classifies a putative protein sequence into
annotated domain families and thus aids functional analysis. Profile HMM-based alignment
is the state-of-the-art method for protein domain classification because of its high sensitivity
in classifying remote homologs. In conjunction with the Pfam database, HMMER [18] can
accurately classify query protein sequences into existing domain families. In addition, the
latest version of HMMER achieves speed comparable to BLAST, making it applicable to
large-scale metagenomic data sets.
However, HMMER cannot optimally classify sequences containing frameshift errors. In
HMMER-based domain analysis, six-frame translations of a sequence read or a predicted gene
fragment are aligned with annotated protein domain families. One problem
with this method is that sequencing errors, including insertions or deletions of nucleotides,
create frameshifts during translation. As a result, the derived peptide sequences are likely
to generate alignments with marginal scores. Because HMMER uses alignment scores, E-values,
or lengths to determine family membership, these reads become unclassifiable or can be
falsely recognized as "novel" proteins during downstream analysis. Figure 2.1 illustrates how
insertion or deletion errors cause marginal alignment scores.
X1X2X3 X4X5X6 X7X8X9 X10X11X12 X13X14X15 X16X17X18 X19X20X21
X1X2X3 X4X5X6 X X7X8X9 X10X11X12 X13X14X15 Y X16X17X18 X19X20X21
aa11 aa12 aa13 aa14 aa15 aa16 aa17
aa11 aa12 aa23 aa24 aa25 aa36 aa37
Figure 2.1: Frameshifts cause short alignments with marginal scores
In Figure 2.1, Xi is the i-th base of a DNA sequence. Every codon is underscored. aaij
is the j-th amino acid of a peptide sequence derived under reading frame i; for example,
aa23 is the third amino acid under reading frame 2. The correct peptide sequence can be
derived from the error-free sequence (shown on the top of the figure) under reading frame 1.
Because of insertions of two nucleotides (bolded X and Y), the correct peptide sequence of
the erroneous read is the concatenation of three short peptide sequences derived
using different reading frames. Thus, each peptide sequence derived using one reading frame
can only generate short alignments with insignificant scores.
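The effect described above can be reproduced with a small Python sketch. It is an illustration only, not part of HMM-FRAME: the read `clean`, the inserted base, and the helper `translate` are hypothetical, and frames are numbered 0 to 2 here (the text counts reading frames from 1). Only the three forward frames are shown; a six-frame translation would add the reverse complement.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1, or 2); '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Hypothetical error-free read, and the same read with one base inserted
# after the second codon (an illustrative sequencing error).
clean = "ATGAAAGTTCTGGCTGCAGGTATCGTT"
erroneous = clean[:6] + "C" + clean[6:]

print(translate(clean))  # MKVLAAGIV
for frame in range(3):
    print(frame, translate(erroneous, frame))
# frame 0 -> MKRSGCRYR  (matches the true peptide only before the insertion)
# frame 1 -> *NVLAAGIV  (matches the true peptide only after the insertion)
```

No single frame of the erroneous read recovers the full peptide, so each frame can contribute only a short piece of the alignment.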
This problem is more serious in domain analysis of metagenomic data sets. Given the
high complexity of many metagenomic data sets, high-quality genome assembly is not always
available. Thus, protein annotation can only be conducted on short sequence reads. The
average read length varies from 25-35 to around 400 bases for the next-generation sequencing
methods currently in use. On average there is about one open reading frame per 1000 base
pairs in bacterial genomes. Depending on gene size, many gene fragments in metagenomic
sequence reads may share only a small overlap with existing domain families, generating
even shorter profile HMM alignments with significantly lower scores.
Although a number of tools [19, 20, 21, 22, 23, 24] exist for frameshift detection, they are
not designed for protein domain classification using profile HMMs. In addition, these tools
have not incorporated sequencing error patterns associated with next-generation sequencing
technologies. A clear disadvantage is that they do not distinguish between error rates inside
and outside homopolymer regions in pyrosequencing reads. The goal of this work is to design an
accurate profile HMM alignment method that can incorporate any given error pattern. Our
experiments show that our tool has high sensitivity (> 95%) in detecting sequencing errors
and a low false positive rate (~0.15%). By correcting insertion and deletion errors, it
can generate longer alignments with significantly higher alignment scores, and thus provide
more accurate protein domain classification.
2.2 Related work
A number of programs exist to handle frameshifts through DNA versus protein sequence
alignment. The simplest methods discard sequences that might contain frameshifts rather
than trying to correct them. For example, BLASTX provides insightful information about
whether a query DNA sequence contains frameshifts using six-frame translations. However,
it neither explicitly outputs positions of insertions or deletions that create frameshifts, nor
does it try to fix them by constructing an alignment from pieces obtained from different
reading frames. Other tools are available to detect and fix frameshift errors automatically.
Frame [19] uses BLASTX to compare all six reading frames of the query nucleotide sequence
against protein sequences. Then the aligned regions are combined for frameshift detection.
Guan et al. [20], Zhang et al. [21], and Halperin et al. [22] describe dynamic programming
algorithms for frameshift detection during pairwise DNA and protein sequence alignment.
Instead of using all reading frames of a DNA sequence to maximize the alignment score,
another group of tools [23, 24] translates a protein sequence back into DNA sequences and
formulates the alignment problem as a network matching problem. Frameshift detection has
also been applied to finding distant protein homologies where the divergence is the result of
frameshift mutations and substitutions [25, 26, 27].
Some gene-finding tools detect frameshifts. FrameD [28] relies on a directed acyclic
graph for gene prediction in the presence of frameshifts. Kislyuk et al. [29] apply an ab initio
method to detect possible frameshifts from coding potential generated by GeneMark [30].
GeneTack [31] and FragGeneScan [32] use hidden Markov models for ab initio frameshift
detection in gene finding.
Despite the extensive study of frameshift detection, the above programs are not designed
for protein family classification through DNA versus protein family alignment. Alternatively,
GeneWise [33], a widely used DNA versus protein alignment tool, allows comparison of
a DNA sequence with a profile HMM. Our algorithm differs from GeneWise by explicitly
incorporating a position-specific error model that is trained on data from different sequencing
platforms such as 454 GS FLX Titanium.
2.3 Method
The representative protein domain classification tool HMMER [18] classifies a query protein
sequence into a profile HMM-represented protein family using the Viterbi or the Forward
algorithm [17]. The Viterbi algorithm aligns a query protein sequence to a profile HMM
by searching for the most probable state path in the model. If the alignment score or
E-value meets the pre-defined threshold, the query is classified into the corresponding family.
The alignment generated by the Viterbi algorithm only accounts for the differences caused
by evolutionary divergence between a sequence and a protein family. In order to classify
error-containing sequences into their native families, the alignment algorithm must detect
the differences resulting from both evolution and sequencing errors.
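As a reminder of what "searching for the most probable state path" involves, the following Python sketch runs the Viterbi recursion in log space on a generic two-state HMM. It is a toy illustration under stated assumptions, not HMMER's implementation and not a full profile HMM: the states "M" and "I", the two-letter alphabet, and all probabilities are hypothetical.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best log-probability, most probable state path) for obs."""
    # score[s] = best log-probability of any state path ending in s so far
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit[s][symbol]
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=score.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return score[last], path[::-1]


def logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {key: math.log(value) for key, value in table.items()}


# Hypothetical model: "M" strongly prefers symbol A, "I" emits A and G equally.
states = ("M", "I")
log_start = logs({"M": 0.9, "I": 0.1})
log_trans = {"M": logs({"M": 0.8, "I": 0.2}), "I": logs({"M": 0.5, "I": 0.5})}
log_emit = {"M": logs({"A": 0.99, "G": 0.01}), "I": logs({"A": 0.5, "G": 0.5})}

score, path = viterbi("AAGA", states, log_start, log_trans, log_emit)
print(path)  # ['M', 'M', 'I', 'M']
```

A profile HMM adds position-specific match, insert, and delete states, but the recursion and the traceback follow the same pattern.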
In this section, we describe HMM-FRAME, the implementation of an augmented Viterbi
algorithm that searches for the optimal alignment between a DNA query and a profile HMM
by considering both evolutionary divergence and sequencing errors. HMM-FRAME differs
from HMMER in the following ways: 1) HMM-FRAME directly accepts a DNA sequence
as input, 2) HMM-FRAME accepts a sequencing error model as input, 3) HMM-FRAME
can detect and fix frameshifts caused by sequencing errors in the DNA sequence. The
output alignment indicates which bases are inserted or deleted due to evolutionary change
or sequencing error.
2.3.1 Error models
Here we describe the error models used in our experiments. Different sequencing technologies
may have different types of errors. For example, previous work [34, 35, 36] has shown that
insertions and deletions occur more often in homopolymer regions than in non-homopolymer
regions for pyrosequencing reads. Substitution errors occur more often than insertions or
deletions in Illumina sequencing reads. Because deletion or insertion errors cause frameshifts,
we focus on applying HMM-FRAME to pyrosequencing data sets.
In this work, we consider two error models. The first one is a published model trained
from GS20 sequencing reads [34]. The insertion and deletion error rates in non-homopolymer
and homopolymer regions are 0.0007 and 0.0044, respectively. The second error model is
computed on data from the FLX Titanium sequencing platform. We obtained a set of Titanium
sequence reads (Cole and Wang, unpublished) extracted from the region H of the 16S rRNA,
which were amplified from the Baylor mock community (22 strains, 24 sequences). Then
we computed error rates using insertions and deletions that were annotated by generating
careful Needleman-Wunsch alignments between the Titanium sequencing reads and the control
sequences. In total, 7,040 sequences passed the initial quality control of RDP [37] after
contamination and chimera detection. There were 1,721 insertion and deletion errors. Note
that PCR, which was used to generate the amplicons of the sample, can introduce errors.
However, because most of the errors introduced by PCR are substitution errors, we assumed
that the deletions and insertions were mainly sequencing errors. The derived error rates for
homopolymers of different sizes were: 1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688,
5: 0.0372, 6: 0.00167, 7: 0.143, where the first number is the size of homopolymer regions
(1 means non-homopolymer) and the second number is the rate of insertion and deletion
errors. If we sum the error rates for homopolymer regions of different sizes, the insertion
and deletion error rates for non-homopolymer and homopolymer regions were 0.0005 and
0.001, respectively. They are slightly smaller than the published GS20 error rates [34]. We
will compare their performance on a data set with annotated errors in Section 2.4.
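As an illustration of how such a length-dependent error model could be consulted, the following minimal Python sketch assigns each base of a read the indel error rate of the homopolymer run that contains it. The rate table repeats the Titanium values quoted above. The function names, and the choice to reuse the rate of length 7 for longer runs, are assumptions of this sketch and are not taken from the HMM-FRAME implementation.

```python
# Per-run-length indel error rates quoted in the text for the Titanium model
# (run length 1 means a non-homopolymer base).
TITANIUM_INDEL_RATE = {
    1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688,
    5: 0.0372, 6: 0.00167, 7: 0.143,
}


def homopolymer_run_lengths(seq):
    """Return, for every base, the length of the homopolymer run containing it."""
    lengths = [0] * len(seq)
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or when the base changes.
        if i == len(seq) or seq[i] != seq[start]:
            for k in range(start, i):
                lengths[k] = i - start
            start = i
    return lengths


def indel_error_rates(seq, table=TITANIUM_INDEL_RATE):
    """Look up an indel error rate per base.

    Assumption of this sketch: runs longer than the largest tabulated
    length reuse the rate of that largest length.
    """
    longest = max(table)
    return [table[min(n, longest)] for n in homopolymer_run_lengths(seq)]
```

For the read fragment "AACGGGT", the run lengths are 2, 2, 1, 3, 3, 3, 1, so the two A bases receive the rate for length 2 and the three G bases the rate for length 3.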
2.3.2 The augmented Viterbi algorithm for sequencing error correction
Let $\pi$ be a state path in a profile HMM $M$. Let $r$ be a set of insertion and deletion positions
in a DNA sequence $x$. The augmented Viterbi algorithm searches for the most probable path
$\pi^*$ and the most probable error position set $r^*$ such that

\[
(\pi^*, r^*) = \arg\max_{(\pi, r)} P(x, \pi, r).
\]

Intuitively, this algorithm searches for an optimal alignment between a DNA sequence and
a profile HMM by simultaneously considering 1) evolutionary divergence (i.e. the insertion,
deletion, and substitution of amino acids) and 2) sequencing errors (i.e. insertion and deletion
of nucleotides). To solve the above equation, we first divide the search space according to
different types of sequencing errors inside a codon and between two consecutive codons. For
each type of error, we search for the most probable state path.
Input: a DNA sequence $x$, a profile HMM $M$, and a sequencing error model. Notations of
$M$ and the error model will be described below.
Output: the optimal alignment between DNA sequence $x$ and $M$, as well as error positions
in $r$.
Algorithm: we first define notations that will be used in the dynamic programming equations.
• Notations about the profile HMM $M$: States $M_j$, $I_j$, and $D_j$ are matching,
insertion, and deletion states in $M$. $a_{s_1 s_2}$ is the transition probability from state $s_1$
to $s_2$. $e_s(T(x_{i-2}x_{i-1}x_i))$ is the emission probability for state $s$ to emit amino acid
$T(x_{i-2}x_{i-1}x_i)$, which is translated from the codon $x_{i-2}x_{i-1}x_i$. For a detailed
description of a profile HMM $M$, we refer the reader to the textbook [17] and the users'
guide of HMMER [18]. State $G_j$ is the only state that is not defined in profile HMMs
from HMMER 3.0. It encodes insertions of nucleotides between codons. $a_{M_j G_j}$ is the
transition probability from matching state $M_j$ to nucleotide insertion state $G_j$. It is
set to the insertion error probability. $a_{G_j G_j}$ is the self-transition probability for $G_j$,
encoding the probability of consecutive insertions. When consecutive insertion is not
allowed, it is set to 0. $a_{G_{j-1} M_j}$ is the transition probability from $G_{j-1}$ to the next
matching state $M_j$. When only one insertion error is allowed, it is set to 1.0.
• Notations about the sequencing error model: $p_I(x_i)$ is the probability that base $x_i$
is an insertion error. $p_D(x_i)$ is the probability that there is a deletion error after base
$x_i$.
• Subproblems and the recursive equations: Based on our analysis of error patterns, it
is very rare that there are consecutive insertions or deletions in a sequence read. Thus,
the following DP algorithm assumes that there is at most one insertion or deletion
inside a codon. The algorithm can be extended to handle all possible cases.
  - $V^M_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the
    submodel up to the matching state $M_j$, given that $x_i$ is the third base of a codon
    and this codon encodes an amino acid emitted by $M_j$.
  - $V^I_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the
    submodel up to the insertion state $I_j$, given that $T(x_{i-2}x_{i-1}x_i)$ is emitted by $I_j$.
  - $V^G_j(i)$ is the score of the best alignment ending in $x_i$ being emitted by state $G_j$,
    which encodes an insertion of nucleotides between codons.
  - $V^D_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the
    submodel up to the deletion state $D_j$.
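\[
V^M_j(i) = \max \left\{
\begin{array}{l}
\text{Case I: no sequencing error in the codon } x_{i-2}x_{i-1}x_i: \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^M_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^I_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^D_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times p_I(x_{i-3}) \times V^G_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\[4pt]
\text{Case II: nucleotide } x_{i-1} \text{ is an insertion:} \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^M_{j-1}(i-4) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^I_{j-1}(i-4) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^D_{j-1}(i-4) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^G_{j-1}(i-4) \times a_{G_{j-1}M_j}, \\[4pt]
\text{Case III: nucleotide } x_{i-2} \text{ is an insertion:} \\
\quad \text{repeat the above four equations for } e_{M_j}(T(x_{i-3}x_{i-1}x_i)), \\[4pt]
\text{Case IV: there is a deleted nucleotide (represented by } d \text{) between } x_{i-1} \text{ and } x_i: \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^M_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^I_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^D_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^G_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\[4pt]
\text{Case V: there is a deleted nucleotide between } x_{i-2} \text{ and } x_{i-1}: \\
\quad \text{repeat the above four equations for } e_{M_j}(T(d\,x_{i-1}x_i)).
\end{array}
\right.
\]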
In cases IV and V, we use d to represent the deleted bases. We choose d to maximize the
emission probability of T (xi−1d xi) (or T (d xi−1xi)) in the matching state Mj .
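\[
\begin{aligned}
V^I_j(i) &= \max\{\, e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^M_j(i-3) \times a_{M_j I_j},\;
                    e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^I_j(i-3) \times a_{I_j I_j} \,\} \\
V^G_j(i) &= \max\{\, p_I(x_i) \times V^M_j(i-1) \times a_{M_j G_j},\;
                    p_I(x_i) \times V^G_j(i-1) \times a_{G_j G_j} \,\} \\
V^D_j(i) &= \max\{\, V^M_{j-1}(i) \times a_{M_{j-1} D_j},\;
                    V^D_{j-1}(i) \times a_{D_{j-1} D_j} \,\}
\end{aligned}
\]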
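The choice of the deleted base d in cases IV and V can be sketched in a few lines of Python. The sketch below tries each of the four bases in the vacant codon position, translates the completed codon with the standard genetic code, and keeps the base whose amino acid has the largest emission probability in the matching state. The function name, the dictionary representation of the emission distribution, and the tie-breaking order are assumptions made for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def best_deleted_base(known, slot, emission):
    """Pick the base d that maximizes the emission probability of the codon.

    known    -- the two observed bases of the codon, in order
    slot     -- index (0, 1 or 2) at which the deleted base d is restored
    emission -- dict mapping amino acid letter to its emission probability
                in the matching state (hypothetical representation)
    Returns (d, amino_acid, probability); ties keep the first base in TCAG order.
    """
    best = None
    for d in BASES:
        codon = known[:slot] + d + known[slot:]
        aa = CODON_TABLE[codon]
        p = emission.get(aa, 0.0)  # stop codons and unlisted residues score 0
        if best is None or p > best[2]:
            best = (d, aa, p)
    return best
```

With `slot=1` the call corresponds to case IV (the codon x_{i-1} d x_i), and with `slot=0` to case V (the codon d x_{i-1} x_i).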
2.3.3 Running time analysis
The time complexity of the above dynamic programming algorithm is $O(\delta |x| |M|)$, where $|x|$
is the length of the input DNA sequence and $|M|$ is the number of states in $M$. $\delta$ is the number
of different types of errors inside a codon plus the case of insertions between two codons. In
our current implementation, $\delta = 26$, which renders a longer running time than the standard
Viterbi algorithm. Thus, it is not practical to compare millions of metagenomic sequence
reads to over 10,000 protein families in Pfam. Instead, we only run HMM-FRAME on
sequences that are likely to contain insertion or deletion errors. For large-scale applications,
we suggest applying HMMER, which is as fast as BLAST, to all input sequence reads using
a big E-value cutoff (such as 100). Alignments covering at least 80% of the translated DNA
sequence with significant E-values can be classified by HMMER in this step. Sequence reads
that do not yield any partial alignments are unlikely to be members of any protein family.
Thus, we only apply HMM-FRAME to reads yielding partial alignment with marginal scores
because these reads could potentially contain sequencing errors.
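The pre-screening strategy described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the dictionary layout of a HMMER hit, the function name, and the E-value significance threshold of 1e-3 are hypothetical, while the 80% coverage requirement is the one given in the text.

```python
def triage_read(hit, read_aa_len, evalue_sig=1e-3, min_coverage=0.8):
    """Decide how to route a read after a permissive HMMER pre-screen.

    hit          -- None if HMMER reported no alignment, otherwise a dict with
                    'evalue' and 'aligned_len' (in amino acids); this layout
                    is an assumption of the sketch
    read_aa_len  -- length of the translated read, in amino acids
    evalue_sig   -- hypothetical threshold for a "significant" E-value
    min_coverage -- fraction of the translated read that must be aligned
    """
    if hit is None:
        return "discard"      # no partial alignment: unlikely family member
    coverage = hit["aligned_len"] / read_aa_len
    if hit["evalue"] <= evalue_sig and coverage >= min_coverage:
        return "classified"   # HMMER alone is sufficient
    return "hmm-frame"        # partial or marginal hit: re-align with error model
```

Only the reads routed to "hmm-frame" incur the cost of the augmented Viterbi algorithm, which is the point of the two-stage design.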
2.4 Results
In this section, we compare the sensitivity and false positive rates (FP rates) of HMM-FRAME
with GeneWise [33] and FragGeneScan [32]. We then apply HMM-FRAME to
Targeted Metagenomics and a published metagenomic data set. Our experimental results
show that the length, scores, and E-values of profile HMM alignments are significantly
improved after error correction. As profile HMM-based alignment tools determine membership
by comparing E-value or length with user-defined thresholds, the improvement of these
parameters enables more error-containing sequences to be classified into their native families.
2.4.1 Accuracy of HMM-FRAME
In order to evaluate the accuracy of HMM-FRAME in detecting insertion and deletion errors,
we obtained a control data set with annotated error positions from RDP (Cole and Wang,
unpublished). In this data set, NifH gene families from the Desulfitobacterium hafniense
strain DCB-2, the Burkholderia xenovorans strain LB400, and the PCC 7120 strain of
Anabaena were amplified and then sequenced using 454 Titanium. The sequenced gene families
were aligned with the nifH genes in these three organisms using the Needleman-Wunsch
algorithm. Insertion and deletion errors were identified from the alignments. After
contamination and chimera screening, we had 18,900 sequences, of which 3,408 sequences contained
4,623 insertion or deletion errors. We conducted the protein domain analysis on the 18,900
sequences using HMM-FRAME under the two error models presented in the Method Section.
The input profile HMM was trained on 25 nifH genes obtained from RDP's functional
gene repository website [38].
We evaluated the performance of error-prediction tools using two types of sensitivity
and FP rates. Let $S^+$ be the set of error-containing sequences in the control data set. Let
$S$ be the set of predicted error-containing sequences. The sequence-level sensitivity and
FP rate are $\frac{|S \cap S^+|}{|S^+|}$ and $\frac{|S - S^+|}{|S|}$, respectively. Similarly, let $Q^+$ be the set of insertion and
deletion positions in error-containing sequences from the control data set. Let $Q$ be the set
of predicted error positions. The base-level sensitivity and FP rate are $\frac{|Q^+ \cap Q|}{|Q^+|}$
and $\frac{|Q - Q^+|}{|Q|}$, respectively.
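These two definitions translate directly into set operations. The short Python sketch below is an illustration only; the function name and the handling of empty sets are assumptions, and the same function serves both levels depending on whether the set elements are sequence identifiers or (sequence, position) pairs.

```python
def sensitivity_and_fp_rate(predicted, truth):
    """Return (sensitivity, FP rate) for two collections of identifiers.

    sensitivity = |predicted & truth| / |truth|
    FP rate     = |predicted - truth| / |predicted|
    Empty denominators yield 0.0 (an assumption of this sketch).
    """
    predicted, truth = set(predicted), set(truth)
    sensitivity = len(predicted & truth) / len(truth) if truth else 0.0
    fp_rate = len(predicted - truth) / len(predicted) if predicted else 0.0
    return sensitivity, fp_rate
```

For example, if four reads truly contain errors and a tool flags three reads, two of them correctly, the sequence-level sensitivity is 2/4 and the sequence-level FP rate is 1/3.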
Using the control data set, we first evaluated the performance of HMM-FRAME under
the published GS20 and our self-trained Titanium error models. Then we compared the
performance of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. Similar to
HMM-FRAME, GeneWise can directly compare DNA sequences with a profile HMM and can accept
user-defined error rates. We tested GeneWise using different parameters including error rates
and the alignment score thresholds (ranging from 0 to 20). The results with the best tradeoff
between sensitivity and FP rate were kept for comparison with HMM-FRAME.
FragGeneScan [32] is a newly developed gene prediction tool for short and error-prone
sequences. It predicts genes and identifies sequencing errors inside predicted genes. We
applied FragGeneScan on the above sequence set (all genes) and tested its sensitivity and
FP rate. FragGeneScan successfully recognized all input as protein-coding genes, rendering
a high gene-prediction sensitivity in this data set. However, FragGeneScan had higher FP
rates than HMM-FRAME in error detection. The results are summarized in Table 2.1.
Table 2.1: Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

            HMM-FRAME:   HMM-FRAME:     GeneWise   FragGeneScan
            GS20         self-trained
seq-sen     95.25%       90.6%          53.8%      83.04%
base-sen    85.08%       82.4%          -          53.39%
seq-FP      0.154%       0              0.001%     0.7%
base-FP     2.1%         0.003%         -          59.57%

Sensitivity and FP rate of each program when detecting annotated insertion and deletion
errors in nifH genes. seq-sen: sequence-level sensitivity. base-sen: base-level sensitivity.
seq-FP: sequence-level FP rate. base-FP: base-level FP rate. The score cutoff of GeneWise is
set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we
did not evaluate its performance at the base level.
As shown in Table 2.1, each tool has higher sensitivity and smaller FP rates in identifying
error-containing sequences than in locating error positions. HMM-FRAME has a better
tradeoff between sensitivity and FP rate than both GeneWise and FragGeneScan. Both GS20
and our self-trained Titanium error models have small FP rates in predicting error positions,
but GS20 has higher sensitivity. Thus, we plan to use GS20 in all further experiments.
2.4.2 Using HMM-FRAME in "Targeted Metagenomics"
In this section, we present the utility of HMM-FRAME in two applications of "Targeted
Metagenomics", where one or several gene families are amplified from environmental DNA
and these amplicons are sequenced using high-throughput sequencing platforms. One typical
application of Targeted Metagenomics is to sequence the amplicons of the 16S rRNA gene
for phylogenetic complexity analysis. Besides 16S rRNA, protein-coding genes that are
important to a particular habitat can be amplified and sequenced for targeted functional
analysis in metagenomic data sets. For example, Targeted Metagenomics of the nifH gene,
which encodes nitrogenase reductase, is important for analyzing microbial genomes sequenced
from soil. Although these sequences are sampled from one or several targeted gene families,
frameshift errors can cause short alignments with marginal scores between the input and the
targeted gene families. As a result, sequences lacking significant alignment length and scores
will be regarded as contaminants and be discarded. Thus, it is desirable to fix frameshift
errors to maximize the number of usable samples. Given a DNA read and a profile HMM
built from a set of known protein sequences, HMM-FRAME can be applied to detect and
correct frameshift errors in amplicon reads.
2.4.2.1 Protein domain analysis of nifH sequences
In the first experiment, we obtained 3,937 nifH sequences of an average length of 76 bases
generated by the 454 FLX sequencing technology. In order to discard contaminants that
originated from non-target genes, we aligned the 3,937 sequences with the nifH gene family,
which was built on a small set of 25 expert-verified full-length nifH protein reference
sequences from RDP's functional gene repository [38]. In the gene family building process, we
first applied ClustalW [39] to align the 25 reference sequences. Then we applied HMMER
3.0's hmmbuild program to derive a profile HMM from the multiple sequence alignment. Of
the 3,937 454 FLX sequences, 111 were found to be contaminants and were excluded from
further analysis. Of the remaining 3,826 sequences, HMM-FRAME detected 296 insertions
and deletions in 256 sequences. Thus, approximately 7% of the samples contained frameshift
errors. Of the 256 sequences containing insertion or deletion errors, 224 (87.5%) only
contained one insertion or deletion error. 24 (9.4%) sequences contained two errors, and eight
(3.1%) contained three errors. Of the 296 insertions or deletions, 224 (75.7%) were inside or
beside homopolymer regions.
Because protein domain classification tools compare alignment lengths, scores, and
E-values with pre-defined thresholds to determine a sequence's membership, the changes in
the alignments affect the final domain composition analysis. After error correction, profile
HMM-based alignment tools are expected to generate longer alignments with bigger scores
and smaller E-values. This gives error-containing sequences a better chance of being classified
into the correct families rather than being labeled contaminants.
In order to conduct a fair comparison on alignments before and after error correction,
we chose a third-party tool, HMMER, to generate alignments for original and corrected
sequences. The changes of alignments' E-values and lengths due to error correction are
presented in Figure 2.2. In this figure, the changes of alignments are presented for 256
sequences in which HMM-FRAME detects errors. "Original" refers to HMMER alignments
on sequences before error correction. "Corrected" refers to HMMER alignments on sequences
after error correction by HMM-FRAME. As a comparison, we also plot the length of the
original sequence reads (with the legend "sequence read"). They largely overlap with the
length of corrected alignments, indicating that complete sequence reads can be aligned with
the nifH profile HMM after error correction.
[Figure 2.2 plot. x-axis: NifH family reads; y-axis: lengths and E-values of pHMM alignments;
series: LOG(original E-value), LOG(corrected E-value), original length, corrected length,
sequence read length.]
Figure 2.2: Change of HMMER alignments' scores, lengths, and E-values (in log space)
before and after error correction for nifH sequences. (For interpretation of the references
to color in this and all other figures, the reader is referred to the electronic version of this
dissertation.)

In order to test whether the improvement was statistically significant, we conducted a
two-sample Kolmogorov-Smirnov test (K-S test) on the alignments' lengths and E-values
before and after error correction. The p-values for the alignments' length and E-value
distributions were 3.1037e-010 and 1.1802e-045, respectively. In particular, the comparison
between alignments' lengths and the sequence reads' lengths shows that most partial
alignments generated by error-containing sequences become complete alignments after error
correction. Thus, when comparatively longer alignments (e.g., 23 amino acids or 69 bases) are
required for domain classification, more sequence reads (213 more when the threshold
is 69 bases) will be classified into their native families.
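For readers who wish to reproduce this kind of before-and-after comparison, a two-sample K-S test can be sketched in pure Python as follows. The statistic is the largest gap between the two empirical distribution functions. The p-value uses the standard asymptotic Kolmogorov series, which is an approximation for small samples, so values from this sketch need not match those reported above. The function name and the shortcut for very small statistics are assumptions of the sketch.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b, terms=100):
    """Two-sample Kolmogorov-Smirnov test (asymptotic p-value).

    Returns (D, p), where D is the maximum absolute difference between the
    empirical CDFs of a and b, and p is approximated by the Kolmogorov series
    Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * lam^2).
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / n   # empirical CDF of a at x
        fb = bisect_right(b, x) / m   # empirical CDF of b at x
        d = max(d, abs(fa - fb))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        # The series converges slowly here and Q(lam) is within about 1e-5 of 1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return d, min(max(p, 0.0), 1.0)
```

A typical call would pass the list of alignment lengths before correction and the list after correction. Where SciPy is available, `scipy.stats.ks_2samp` provides a library implementation of the same test.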
2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes
In the second experiment, we obtained 2,486 pyrosequencing samples of an average length
of 224 bases from the bacterial aromatic dioxygenase genes in a soil sample [40]. Although
these pyrosequencing reads were sequenced from the 5' end of PCR amplicons of bacterial
aromatic dioxygenase genes, we were interested in classifying them into three sub-families
of dioxygenase genes: toluene/biphenyl, naphthalene, and benzoate [41]. Note that there
is another subfamily (phthalate). However, due to lack of training proteins for this family
(Dr. Iwai, personal communication), we only searched for members of three sub-families.
Three sets of reference protein sequences were extracted from Pfam [12] for toluene/biphenyl,
naphthalene, and benzoate [41]. Based on these training sets, we built three profile HMMs
using ClustalW and HMMER. Then we applied HMM-FRAME to align the 2,486 reads with
the three profile HMMs. HMM-FRAME detected 77 insertions and 52 deletions, which were
distributed in 121 sequences. Of the 121 error-containing sequences, 77 could not be classified
into any subfamily by HMMER under the E-value threshold 0.1. After error correction using
HMM-FRAME, these 77 sequences were classified into different families with an average
E-value of 3.3e-06, indicating that they were highly likely to be true members of the underlying
families. For other error-containing sequences, the profile HMM alignments' E-values and
lengths were significantly improved after error correction. The change is plotted in Figure 2.3.
In this figure, the data set was sequenced from bacterial aromatic dioxygenase genes in a soil
sample. All alignments are generated by HMMER for a fair comparison. "Original" refers
to HMMER alignments on sequences before error correction. "Corrected" refers to HMMER
alignments on sequences after error correction by HMM-FRAME.
We also applied a two-sample K-S test on the alignments' lengths and E-values before and
after error correction. The p-values for the length and E-value distributions were 8.0609e-011
and 1.9776e-040, respectively. The improved alignment lengths and E-values provide
stronger evidence for the membership of the input samples. In total, after error correction
by HMM-FRAME, we could classify 1,214 sequences into three subfamilies: 1,042 reads
were members of the naphthalene subfamily, 96 reads belonged to the benzoate subfamily,
and 76 reads belonged to the toluene/biphenyl subfamily. The remaining 1,272 reads could
potentially be members of the subfamily phthalate (Dr. Iwai, personal communication).
[Figure 2.3 plot. x-axis: soil sample sequence reads; y-axis: lengths and E-values of pHMM
alignments; series: original length, corrected length, LOG(original E-value),
LOG(corrected E-value).]
Figure 2.3: Change of HMMER alignments' lengths, scores, and E-values (in log space) before
and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.
2.4.3 Protein domain classification in the deep mine data set
In order to show the utility of HMM-FRAME in a metagenomic data set containing members
of multiple domain families, we applied HMM-FRAME to the first 454 sequencing project
for environment samples, which were sequenced from two sites in the Soudan Mine,
Minnesota, USA [42]. In this experiment, we downloaded the Black Sample from the paper's
supplementary data website. This data set contains 388,627 sequence reads with an average
length of 99 bases.
There were two steps in the annotation. First, we applied gene-prediction tools. Second,
we conducted the domain classification on predicted genes. A number of gene-prediction
tools are available for metagenomic data sets. However, not every tool can handle short
reads. Glimmer [43] did not output meaningful predictions when it was applied to this data
set. The sensitivity of Metagene [44] drops to 59% for 100-base sequences [45]. We thus
chose FragGeneScan, a newly developed gene-prediction tool for short reads. FragGeneScan
predicted 281,658 genes, of which 72,355 contained errors. For convenience in discussion, let
S be the set of genes predicted by FragGeneScan. Let S' be the raw read set corresponding
to genes in S. Thus, 72,355 sequences in S were different from their raw reads in S' because
FragGeneScan predicted and corrected errors in S'. We compared three domain classification
pipelines: 1) apply HMMER 3.0 on raw reads S', 2) apply FragGeneScan and then HMMER
on corrected reads S, and 3) apply HMM-FRAME on raw reads S'. We recorded how many
reads could be classified into one of the 2,558 Pfam domain families that contain the keyword
"bacteria". The numbers of classifiable reads for the three pipelines were: 13,544 for HMMER,
12,328 for FragGeneScan + HMMER, and 17,496 for HMM-FRAME. The classification
results have large overlaps, which are illustrated in Figure 2.4. In this figure, sequence
sets that can be classified by HMM-FRAME, HMMER, and FragGeneScan+HMMER are
represented by three sets A, B, and C: |A| = 17,496, |B| = 13,544, |C| = 12,328,
|B − C| = 2,224, |C − B| = 1,008, |C − A| = 4, and |A − (B ∪ C)| = 2,948.
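The reported set sizes can be cross-checked with a few lines of arithmetic. The Python sketch below only recombines the counts given in the text; the variable names are illustrative, and the derived quantities (such as the size of the intersection of B and C) are computed here rather than reported in the original.

```python
# Counts reported in the text (A: HMM-FRAME, B: HMMER, C: FragGeneScan+HMMER).
size_a, size_b, size_c = 17496, 13544, 12328
b_minus_c, c_minus_b = 2224, 1008
c_minus_a, a_only = 4, 2948          # |C - A| and |A - (B U C)|

# |B n C| can be derived from either side and should agree.
b_and_c_from_b = size_b - b_minus_c
b_and_c_from_c = size_c - c_minus_b

# |B U C| = |B| + |C - B|
b_or_c = size_b + c_minus_b

# Reads classified by HMM-FRAME and by at least one other pipeline.
a_and_b_or_c = size_a - a_only

# Reads classified by another pipeline but missed by HMM-FRAME.
missed_by_a = b_or_c - a_and_b_or_c
```

Both derivations of the intersection give 11,320 reads, and the number of reads missed by HMM-FRAME comes out as 4, equal to |C − A|. Under these counts, every read that HMMER classified but FragGeneScan+HMMER did not was therefore also classified by HMM-FRAME.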
In summary, HMM-FRAME was able to classify 2,948 more reads than the other two
annotation pipelines. HMM-FRAME found errors in all of these 2,948 reads. Thus, it is likely
that the other two pipelines failed to classify them because of frameshifts. HMM-FRAME failed
to classify four reads that can be aligned by FragGeneScan+HMMER. A closer examination
showed that FragGeneScan and HMM-FRAME output different error positions in these four
sequences.

[Figure 2.4 diagram. Three overlapping sets: A: HMM-FRAME; B: HMMER alone;
C: FragGeneScan+HMMER.]
Figure 2.4: Protein domain classification results for the black sample in the deep mine data
set.
The performance evaluation of FragGeneScan must consider both gene-prediction and
error-prediction. Of the 281,658 predicted genes, only 12,328 could be classified into existing
domain families. Further analysis is needed to examine whether other predictions are novel
genes or wrong predictions. It is worth noting that FragGeneScan could classify 1,008 more
sequences after its error correction than applying HMMER 3.0 alone on raw reads. However,
while 2,224 raw reads could be classified into existing domain families by HMMER 3.0,
they could not be aligned with any family after error correction by FragGeneScan. This
indicates that FragGeneScan might have over-predicted errors in the 2,224 sequences. This
is consistent with our observation that FragGeneScan has a high FP rate in the control data
set.
2.5 Conclusion
Despite the advances of high-throughput sequencing technologies, sequencing errors still
pose challenges for data annotation. In particular, our error model analysis shows that
454 FLX Titanium only slightly decreases the insertion and deletion error rates compared to
GS20. Thus, correcting frameshifts caused by insertion or deletion errors is still important for
metagenomic sequence annotation. In this work, we introduce a protein domain classification
tool HMM-FRAME, which can classify error-prone DNA sequence reads into protein domain
families. HMM-FRAME can accept any error model trained on data from high-throughput
sequencing technologies and thus achieve high detection sensitivity while maintaining a low
false positive rate.
Applying HMM-FRAME to a data set with annotated errors shows its high sensitivity
and accuracy in error detection. In particular, by fixing frameshift errors, we can obtain
significantly longer profile HMM alignments with smaller E-values. As alignments' lengths,
scores, and E-values are often used to determine family membership, improving them helps
to classify more sequences into the native domain families. In our experiments, sequences
that fail HMMER 3.0 under the default E-value or score threshold are classified into correct
domain families using HMM-FRAME. Thus, HMM-FRAME can be used as a complementary
tool to HMMER 3.0 on error-prone sequences.
Chapter 3
Profile HMM-based protein domain classification for short sequences
3.1 Background
With the advent of next-generation sequencing and culture-independent methods, an enor-
mous amount of metagenomic data have been sequenced from microbial communities from
different habitats. In order to understand the phylogenetic complexity and biological
functions of microbial communities, as well as their interactions with the host, automatic
annotation tools such as CAMERA [5], MG-RAST [6], and MEGAN [7] are being used for
annotating metagenomic data sets. As an important component of these metagenomic
annotation tools, protein homology search provides a basis for identifying putative genes and
assigning those genes to annotated functional categories (e.g. protein domain families).
Because of the high sensitivity of remote homology recognition, HMMER has been
successfully applied to genome-wide domain analysis. However, its sensitivity is significantly
limited by the short reads of metagenomic data sets and poorly conserved domains. In order
to investigate how read length and domain identity affect the sensitivity of HMMER, we
randomly sampled 200 peptides with lengths of 12, 20, and 28 amino acids from the seed
sequences of each of the 2,558 Pfam domains, which contain the word "Bacteria" in their
descriptions. The peptides were aligned with the domain families using HMMER. We used the
E-value cutoff 1000 in order to boost the sensitivity. For each domain, the read classification
sensitivity of HMMER is measured as the ratio of the number of aligned reads to the total
number of sampled reads. We sort all data points by domain identity in ascending order
and plot them in Figure 3.1. For domains with the same identity, their average sensitivity
is reported.
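The sampling experiment can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes that seed sequences are ungapped strings, that a seed sequence and then a start position are chosen uniformly at random, and that the alignment step is supplied as a hypothetical callback `is_aligned` standing in for a HMMER run at the chosen E-value cutoff.

```python
import random


def sample_peptides(seed_seqs, length, n=200, rng=None):
    """Draw n random substrings of the given length from a domain's seed sequences.

    Seed sequences shorter than the requested length are skipped. The uniform
    choice of sequence and start position is an assumption of this sketch.
    """
    rng = rng or random.Random()
    eligible = [s for s in seed_seqs if len(s) >= length]
    if not eligible:
        return []
    peptides = []
    for _ in range(n):
        s = rng.choice(eligible)
        start = rng.randrange(len(s) - length + 1)
        peptides.append(s[start:start + length])
    return peptides


def classification_sensitivity(peptides, is_aligned):
    """Fraction of sampled peptides for which is_aligned(peptide) is true."""
    if not peptides:
        return 0.0
    return sum(1 for p in peptides if is_aligned(p)) / len(peptides)
```

Running the two functions for lengths 12, 20, and 28 on each domain would give one sensitivity value per domain and read length, which is the quantity plotted against domain identity.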
[Figure 3.1 plot. x-axis: average sequence identity of domain; y-axis: sensitivity of HMMER 3;
series: read length 36 bp, read length 60 bp, read length 84 bp.]
Figure 3.1: Change of the read classification sensitivity of HMMER over read length and the
average sequence identity of domain families.
Figure 3.1 shows that the sensitivity of HMMER deteriorates with the decrease of the
query sequence length and domain identity. The sensitivity is decreased from 90% to 65-70%
when the lengths of reads change from 28 residues (i.e., 84 bp for corresponding DNA reads)
to 20 residues (i.e., 60 bp for DNA reads) for domains with identity around 40%.
Although next-generation sequencing technologies are producing longer reads and assem-
bly tools may be available to assemble short reads into longer contigs, there is still a need
for a protein domain analysis tool for short reads. First, many finished or on-going
metagenomic sequencing projects contain reads with lengths from 35 to around 400 bp depending
on the chosen sequencing technologies. In addition, peptide sequences encoded in individual
metagenomic sequence reads may share only small overlaps with existing domain families.
Thus, a sizable portion of many available data still contains short reads. Second, the sheer
amount of data and the complexity of many metagenomic data sets pose a great challenge
for assembly tools [46]. A large portion of short reads cannot be correctly assembled into
longer contigs. Third, many domain families exhibit low average sequence identity, which
poses a challenge for short and medium-sized reads. Figure 3.2 shows the histogram of pair-
wise sequence identity for domains related to bacteria. Of 2558 domains, there are about
43% domains with average identity no greater than 0.3. For these domains, the sensitivity
of HMMER is between 0.7 and 0.8 for reads of length 84 bp, between 0.4 and 0.6 for reads
of length 60 bp, and smaller than 0.1 for reads of length 36 bp. As a result, although a large
number of reads are sequenced from genes, which are highly compact in microbial genomes,
only a small percentage of the short reads can be classified into their native domains using
existing tools.
In this work, we introduce MetaDomain, a protein domain classification tool designed
for short reads in metagenomic data sets. MetaDomain provides a complementary protein
analysis tool to HMMER on assigning short reads into their native families.
[Figure 3.2 plot. x-axis: average sequence identity of domain; y-axis: number of domains.]
Figure 3.2: Histogram of the average pairwise sequence identity for 2558 domains
3.2 Related Work
Profile HMM-based protein homology search is widely used for mining microbial genomes.
Knowing the composition of different domain families encoded in a metagenomic data set
helps us understand which functions are important for a particular habitat. For example,
Ellrott et al. [8] investigated the distribution of protein families in the available human gut
genomic and metagenomic data. As the data set contains assembled contigs, using HMMER
is expected to achieve high sensitivity. Schlüter et al. [9] used HMMER to understand the
genetic diversity and composition of a plasmid metagenome from a wastewater treatment
plant. The reads have an average length of 104 bp, which is also adequate for HMMER to
achieve high sensitivity.
Besides providing a basis for functional profiling, profile HMM-based homology search
was also used for phylogenetic complexity analysis in metagenomic data. The phylogenetic
algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers
to identify the source organisms of environmental DNA fragments as short as 80 bp. As we
show in Figure 3.1, profile HMM-based tools have sensitivity of at least 0.9 in classifying
reads of 80 bp into domains with average sequence identity above 40%. However, for
poorly-conserved domains, a significant number of reads might be missed. A similar but faster tool
Treephyler [47] conducted community profiling in metagenomics and metatranscriptomics
based on Pfam domain assignments. Treephyler was applied to a data set with average read
length of 200 bp. It is unclear how shorter reads affect its performance.
In previous work, we designed HMM-FRAME [48], a tool that can identify and correct
frame-shift errors in pyrosequencing reads during protein domain classification using profile
HMM-based alignment. However, it was not specifically designed to handle short reads.
Finally, we note that the method used in MetaDomain shares a similar rationale to the
recent work by Weng et al. [49]. Weng et al. reported that taxonomic binning tools for
metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST
cut-offs. Thus, they re-analyzed the discarded reads using less stringent cut-offs. In order
to control the false positive matches introduced by the relaxed cut-offs, they used the
evolutionary conservation of adjacency between neighboring genes as an additional criterion.
3.3 Method
HMMER uses E-values as the discrimination threshold to determine the membership of a
query sequence. However, short reads may only generate low alignment scores and thus
insignificant E-values. In particular, the conservation across the entire length of a domain
family can be highly variable, posing a great challenge for classifying reads sequenced from
poorly conserved sub-regions. In order to increase the sensitivity of aligning remotely related
short reads, we propose position-specific score cutoffs, by which poorly conserved regions
allow more relaxed discrimination thresholds than well-conserved regions. However, the low
thresholds can easily incur random matches. In order to control the false positive rate,
we examine the position distribution of read alignments. The position distribution of read
alignments on a truly encoded domain is expected to be more uniform than that on a domain that
incurs random read alignments [50, 51]. Figure 3.3 shows the schematic representations of
three types of distributions of read alignments along a domain. The alignments in (A) and
(B) are more likely to be random. Thus the domains may not be encoded in the data set. The
alignment distribution in (C) exhibits a much more uniform distribution, providing strong
evidence for the existence of the underlying domain in the data set. Thus, by using relaxed
position-specific score cutoffs and inspecting the distribution of alignments, we expect to
classify more short reads into the correct domain families while not falsely reporting domains
that are not characterized in the data.
Figure 3.3: Three types of alignment distributions.
3.3.1 Pipeline of MetaDomain
The input to MetaDomain includes sequence reads and a list of protein domains. The
output is a list of domains encoded in the underlying data set and the number of aligned
reads. Figure 3.4 shows a schematic representation of the pipeline of MetaDomain.
MetaDomain consists of three main stages: short read alignment, filtering, and classification.
In the alignment stage, we use the Viterbi algorithm [17] to search for the best
local alignment between a query sequence and a profile HMM-represented domain family. In
the filtering stage, we first apply a position-specific score threshold to eliminate insignificant
alignments. Then we remove stacked alignments with the same alignment positions inside
a poorly conserved region. In the final stage, we use the number of aligned reads and the
distribution of alignment positions to determine whether a domain is encoded.
[Figure 3.4 diagram: sequence reads and Pfam domains -> Viterbi algorithm -> optimal local alignments -> filtering (position-specific threshold) -> trimmed alignments -> classification (read number and domain coverage thresholds) -> transcribed or encoded protein domains]
Figure 3.4: Pipeline of MetaDomain.
3.3.2 The Viterbi algorithm
The Viterbi algorithm aligns a query sequence to a profile HMM by searching for the most
probable state path in the model. Unlike HMMER, MetaDomain directly aligns a DNA
sequence to a profile HMM. To do so, we implicitly align translated peptides under different
reading frames with a profile HMM. Let π be a state path in a profile HMM M and let x
be a query DNA sequence. The Viterbi algorithm searches for the most probable path π∗
such that $\pi^* = \arg\max_{\pi} P(x, \pi)$. The output of the Viterbi algorithm includes the optimal
alignment and its score. As Viterbi is a standard algorithm designed for HMMs, we refer
readers to Durbin et al. [17] for a detailed illustration of the dynamic programming equations
for finding π∗. The major differences between our implementation and the standard Viterbi
algorithm are: 1) our implementation accepts a DNA rather than a peptide sequence
as input; 2) a local alignment can start and end with any state without incurring insertion
or deletion penalties.
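To make the notion of reading frames concrete, the following minimal sketch enumerates the six conceptual translations of a DNA read using the standard genetic code. It is only an illustration: MetaDomain handles the frames implicitly inside the dynamic program, and the function names here (`translate`, `six_frame_translations`) are hypothetical and not part of MetaDomain.

```python
# Illustrative sketch only; not MetaDomain's implementation.
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0.

    A trailing partial codon is dropped; codons containing
    characters other than A, C, G, T are translated to 'X'.
    """
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to peptide strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For a 41 bp read, each of the six frames yields a peptide of 13 residues, which indicates how little sequence is available to a peptide-level alignment.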
3.3.3 Alignment Filtering
MetaDomain employs two filtering mechanisms to increase its sensitivity in aligning short
reads while maintaining a low false positive rate: position-specific thresholds (PSTs) and
trimming.
3.3.3.1 Position-specific threshold
PST allows different alignment thresholds for well-conserved and poorly conserved regions.
Let the length of a query DNA sequence be L (in bp). Denote the profile HMM as M. Let
$M_{i,j}$ be a sub-model formed by all consecutive states from the ith match state $M_i$ to the
jth match state $M_j$. The upper bound of the alignment score against $M_{i,j}$ is the maximum
score that can be generated by aligning any input sequence of length j − i + 1 with $M_{i,j}$.
Let $a_{i,j}$ denote the transition probability from state $M_i$ to state $M_j$. Let $e_i(a)$ denote the
probability of state $M_i$ emitting amino acid a. Then the upper bound $U_{i,j}$ for sub-model
$M_{i,j}$ is calculated as follows:
$$U_{i,j} = \prod_{k=i}^{j} a_{k,k+1} \times \max_{a} e_k(a)$$
where $a_{j,j+1}$ is set to 1 because $M_j$ is the ending state of the sub-model.
We define the PST for the sub-model $M_{i,j}$ as:
$$\mathrm{PST}_{i,j} = \gamma U_{i,j}$$
where the coefficient γ is a user-specified parameter in the range [0, 1]. It can be flexibly
adjusted to control the trade-off between sensitivity and false positive rate of MetaDomain.
The default value is 0.6, which is used in our experiments.
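The upper bound and the PST can be computed directly from the definitions above. The sketch below is a minimal illustration under stated assumptions: the model is given as a list of match-to-match transition probabilities and a list of per-state emission tables, indices are 0-based, and the names `upper_bound` and `pst` are hypothetical rather than taken from MetaDomain.

```python
from math import prod


def upper_bound(trans, emit, i, j):
    """Compute U_{i,j} for the sub-model from match state i to match state j.

    trans[k]: transition probability from match state k to match state k+1.
    emit[k]:  dict mapping amino acid -> emission probability at match state k.
    Indices are 0-based and inclusive. The transition out of the last state
    of the sub-model is set to 1, as in the definition of U_{i,j}.
    """
    if not 0 <= i <= j < len(emit):
        raise ValueError("require 0 <= i <= j < number of match states")
    return prod(
        (1.0 if k == j else trans[k]) * max(emit[k].values())
        for k in range(i, j + 1)
    )


def pst(trans, emit, i, j, gamma=0.6):
    """Position-specific threshold: gamma * U_{i,j}, with gamma in [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * upper_bound(trans, emit, i, j)
```

For a toy three-state model with transitions 0.9 and 0.5 and maximum emissions 0.5, 0.8 and 0.75, the upper bound is 0.45 × 0.4 × 0.75 = 0.135, and the PST at the default γ = 0.6 is 0.081. For long sub-models, summing logarithms instead of multiplying probabilities would avoid numerical underflow.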
3.3.3.2 Alignment trimming
Alignments with scores larger than their corresponding PSTs pass the first filtering stage.
As conservation varies along the length of each domain model, well-conserved
sub-regions have high PSTs while poorly conserved sub-regions yield low PSTs. Thus, random
sequences tend to be aligned to poorly conserved regions by MetaDomain, incurring a
high FP rate. Our empirical experiments show that dozens of reads that are not sequenced
from the underlying domain can be aligned to the same position in a poorly conserved sub-region.
In order to minimize the effects of noise, we discard stacked alignments that have
the same alignment positions.
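One possible reading of the trimming step is sketched below. It assumes that each alignment is described by its start and end positions on the model and that, among alignments with identical positions, a single best-scoring representative is kept. Both the `Alignment` record and the keep-one policy are assumptions made for illustration; the text above does not specify whether one stacked alignment is retained or all are discarded, and the sketch does not restrict trimming to poorly conserved regions.

```python
from collections import namedtuple

# Hypothetical record: read identifier, inclusive model positions, score.
Alignment = namedtuple("Alignment", "read_id start end score")


def trim_stacked(alignments):
    """Collapse alignments sharing the same (start, end) model positions.

    Assumption: the highest-scoring alignment of each stack is kept
    (the first one encountered wins ties). Output is sorted by position.
    """
    best = {}
    for aln in alignments:
        key = (aln.start, aln.end)
        if key not in best or aln.score > best[key].score:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a.start, a.end))
```

With three reads stacked on positions 5 to 18 and one read on positions 20 to 33, two alignments remain after trimming.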
3.3.4 Protein domain classification
In this stage we extract two features from the collected read alignments for each domain:
the number of aligned reads and the domain coverage. The domain coverage is the fraction
of positions covered by at least one read alignment in a domain. MetaDomain then applies
a simple decision tree to classify all the target domains into two classes: encoded domains
and non-encoded domains. If both features of a domain are equal to or greater than their
corresponding thresholds, the domain is classified as encoded. Otherwise, it is classified as not
encoded in the sample. By default, the cutoff for domain coverage is 30%. Ideally, the cutoff
for the number of aligned reads should be determined based on properties of the data such
as sequencing depth. If users do not specify this value, we use 20 by default.
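The decision rule can be sketched as follows, assuming each read alignment is summarized by inclusive, 1-based start and end positions on the domain model. The function names are hypothetical; the default thresholds (20 reads, 30% coverage) are the ones stated above.

```python
def domain_coverage(alignments, domain_length):
    """Fraction of model positions covered by at least one alignment.

    alignments: iterable of (start, end) pairs, 1-based and inclusive.
    Positions outside 1..domain_length are ignored.
    """
    covered = set()
    for start, end in alignments:
        covered.update(range(max(start, 1), min(end, domain_length) + 1))
    return len(covered) / domain_length


def is_encoded(alignments, domain_length, min_reads=20, min_coverage=0.30):
    """Classify a domain as encoded if both thresholds are met."""
    alignments = list(alignments)
    return (
        len(alignments) >= min_reads
        and domain_coverage(alignments, domain_length) >= min_coverage
    )
```

Note that a domain with many reads stacked on one short region fails the coverage test, which is the behavior the second feature is meant to provide.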
3.4 Results
In order to evaluate the performance of MetaDomain on real data generated by next-
generation sequencing technologies, we applied MetaDomain to protein domain analysis in
two data sets. The first one is the transcriptome generated using RNA-seq for Burkholderia
cenocepacia. As both the reference genome and its domain annotations are available, we
can quantify the sensitivity and false positive (FP) rate of MetaDomain. The second one
is metagenome data sequenced from soil. We applied MetaDomain to identify domains en-
coded in the underlying data. In addition, we compared HMMER and MetaDomain in both
applications.
3.4.1 Identifying transcribed protein domains in transcriptome
In this experiment, we conducted transcribed domain analysis in the transcriptome from
one strain of B. cenocepacia named AU1054 [52]. By using Illumina RNA-seq, the authors
generated multiple samples for AU1054 in two growth media. We used one replicate of cDNA
sample of AU1054 in the growth medium cystic fibrosis. In total, 3,361,008 reads of a length
of 41 bp were downloaded from the website provided by the authors. We evaluated the
performance of read classification and domain identification of MetaDomain and HMMER.
3.4.1.1 Performance of read classification
The performance of read classification is quantified using both read classification sensitivity
and FP (false positive) rate. In this experiment, the read classification performance is
computed on reads that can be mapped to annotated domains. Below we sketch the main
steps to obtain mapped reads for a domain using the reference genome and the domain
annotations. First, we downloaded the genome of AU1054 and the annotated genes and
domains from the IMG website [53]. There are 2,181 annotated Pfam domains. Second, the
reads were mapped to the reference genome using Bowtie [54] with two mismatches allowed.
Third, we compared the positions of read mapping and annotated domains. For a domain, all
reads that fall into it are defined as "mapped" reads. Denote the set of mapped reads as M.
All other (unmapped) reads constitute set U. For a domain classification tool, let the set of
aligned reads for a domain be A. Thus, the sensitivity and FP rate of read classification for
a domain are $|A \cap M| / |M|$ and $|A - M| / |U|$, respectively. A perfect sensitivity indicates that all mapped
reads can be aligned. A zero FP rate indicates that only mapped reads can be aligned to a
domain.
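The two per-domain quantities follow directly from set operations on read identifiers. The sketch below is a minimal illustration; the function name and the convention of returning NaN for an empty denominator are assumptions.

```python
def read_classification_metrics(aligned, mapped, all_reads):
    """Per-domain read classification sensitivity and FP rate.

    aligned:   reads the tool aligned to the domain (set A).
    mapped:    reads mapped into the domain via the reference genome (set M).
    all_reads: every read in the data set; U = all_reads - M.
    Returns (sensitivity, fp_rate); NaN when a denominator is empty.
    """
    aligned, mapped, all_reads = set(aligned), set(mapped), set(all_reads)
    unmapped = all_reads - mapped
    nan = float("nan")
    sensitivity = len(aligned & mapped) / len(mapped) if mapped else nan
    fp_rate = len(aligned - mapped) / len(unmapped) if unmapped else nan
    return sensitivity, fp_rate
```

For example, with ten reads of which four are mapped to the domain, a tool that aligns three of the mapped reads and one unmapped read has sensitivity 3/4 and FP rate 1/6.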
Of the 2,181 annotated families, we evaluated the performance of HMMER and MetaDomain
on 1406 families that have at least one mapped read. Of the 1406 tested domains,
HMMER could not align any read to 1150 domains, resulting in zero sensitivity and zero FP
rate. For the remaining 256 domains, all reads aligned by HMMER are non-mappable reads,
resulting in zero sensitivity and a positive FP rate. The comparison between HMMER and
MetaDomain is summarized using a bubble chart in Figure 3.5. The biggest bubble indicates
that HMMER has zero sensitivity and zero FP rate for 1150 domains. As we can see, it
is highly difficult for HMMER to correctly align reads as short as 41 bp. There are two
reasons for the low sensitivity of HMMER on short reads. First, the parameter training in
the E-value calculation of HMMER is based on much longer reads (100 amino acids). Thus,
the small alignment scores generated by the short reads yield large E-values and cannot pass
the E-value threshold. Second, the small alignment scores of short reads may not pass the
filtration stage of HMMER.
3.4.1.2 Identifying transcribed domains in the transcriptome
Figure 3.5 only shows the read classification performance. MetaDomain uses both aligned
read number and domain coverage as thresholds for domain identification. We expect that
the additional constraint will reduce the false positive rate in domain identification. Because
of the low read classification sensitivity, we speculate that HMMER will have low sensitivity
in identifying transcribed domains.
[Figure 3.5 plot: bubble chart. x-axis: FP rate (0 to 7 × 10^−5); y-axis: Sensitivity (0 to 1); series: HMMER3 and MetaDomain]
Figure 3.5: Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate.
In order to quantify the performance of domain identification, we need to build positive
and negative test sets, which include transcribed and non-transcribed domains based on
mapped reads. There is no commonly accepted criterion to define transcribed genes using
the number of mapped reads. Various expression scores such as an average coverage depth
across the entire length of each gene [55] and reads per kilobase of exon model per million
mapped reads (RPKM) are used to quantify transcriptional level. In addition, the cutoffs
for defining highly transcribed, lowly transcribed, or non-transcribed genes vary across
different applications [56]. In this work, we define transcribed domains based on the rationale
that a truly transcribed domain should be mapped by a number of reads at different positions.
Correspondingly, we use the following criteria to determine whether a domain inside a gene
is transcribed: 1) at least N reads are mapped to a domain; 2) at least 30% of positions
in a domain are mapped by reads. A domain is labeled "non-transcribed" if the number
of mapped reads is zero. Domains that fall between the criteria for transcribed and
non-transcribed domains are labeled "unknown" and are excluded from the test sets.
Table 3.1 shows how the sizes of the positive and negative test sets change with the cutoff N.
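The labeling criteria above amount to a three-way rule on the number of mapped reads and the fraction of mapped positions. A minimal sketch, with a hypothetical function name, is:

```python
def label_domain(n_mapped_reads, coverage, n_cutoff, min_coverage=0.30):
    """Label a domain for the test sets from read-mapping results.

    n_mapped_reads: number of reads mapped into the domain.
    coverage:       fraction of domain positions mapped by reads (0..1).
    n_cutoff:       the cutoff N on the number of mapped reads.
    """
    if n_mapped_reads == 0:
        return "non-transcribed"
    if n_mapped_reads >= n_cutoff and coverage >= min_coverage:
        return "transcribed"
    return "unknown"
```

Because the "non-transcribed" label depends only on a zero read count, the size of the negative set does not change with N, while raising N moves domains from "transcribed" to "unknown".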
Table 3.1: Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.
N     transcribed   unknown   non-transcribed
10    318           1317      546
15    262           1373      546
20    226           1409      546
25    195           1440      546
30    169           1466      546
Intuitively, a larger N creates an easier case for domain classification than a smaller N.
We aligned all reads to the transcribed and non-transcribed domains using MetaDomain
and HMMER. The "unknown" domains are removed due to their ambiguity. For HMMER,
we first translated the short reads into peptide sequences using 6-frame translations. We
then aligned the domains with the translated sequences using 1000 as the E-value threshold,
which is chosen to maximize the sensitivity. For MetaDomain, we directly aligned the short
reads with the domains. The pipeline in Figure 3.4 was used to output a list of transcribed
domains for MetaDomain. Let $D^+$ and $D^-$ be the sets of transcribed and non-transcribed
domains identified using the read mapping results in Section 3.4.1.2. Let $M^+$ and $M^-$ be the
sets of domains predicted as transcribed and non-transcribed by MetaDomain or HMMER.
The sensitivity and FP rate of domain classification tools are defined using the following
equations:
$$\text{Sensitivity} = \frac{|D^+ \cap M^+|}{|D^+|}$$
$$\text{FP rate} = \frac{|D^- \cap M^+|}{|D^-|}$$
The values of D and M are affected by several options. First, $D^+$ and $D^-$ can change
with the cutoff N as shown in Table 3.1. Second, we used both the domain coverage and the
number of aligned reads to determine whether a domain is encoded or transcribed. In this
experiment, the cutoff for domain coverage is 30%, which we found reasonable across different
experiments. Thus, $M^+$ and $M^-$ mainly change with the required number of aligned reads
to a domain. For simplicity, we denote this cutoff as τ. Increasing τ implies a more stringent
constraint for defining transcribed domains, and thus might result in lower sensitivity and a
smaller FP rate. Decreasing τ is likely to increase the sensitivity while incurring a higher FP
rate. In order to compare the performance of MetaDomain and HMMER under different τ,
we plotted the ROC curves by changing τ from 1 to N for N=10, 20, and 30 in Figure 3.6.
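The sweep over τ can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used for the experiments: per-domain aligned-read counts and coverages are supplied as dictionaries, the positive and negative sets come from the read-mapping labels, and the function name `roc_points` is hypothetical.

```python
def roc_points(aligned_counts, coverages, positives, negatives, taus,
               min_coverage=0.30):
    """Return (tau, fp_rate, sensitivity) for each cutoff tau.

    aligned_counts: dict domain -> number of reads aligned by the tool.
    coverages:      dict domain -> domain coverage by aligned reads (0..1).
    positives:      set of transcribed domains (D+).
    negatives:      set of non-transcribed domains (D-).
    """
    if not positives or not negatives:
        raise ValueError("both test sets must be non-empty")
    points = []
    for tau in taus:
        # Domains predicted as transcribed (M+) at this cutoff.
        predicted = {
            d for d, n in aligned_counts.items()
            if n >= tau and coverages.get(d, 0.0) >= min_coverage
        }
        sensitivity = len(positives & predicted) / len(positives)
        fp_rate = len(negatives & predicted) / len(negatives)
        points.append((tau, fp_rate, sensitivity))
    return points
```

Calling `roc_points(counts, cov, d_pos, d_neg, range(1, N + 1))` for a given N yields the points of one ROC curve; the sensitivity and FP rate are non-increasing in τ because the predicted set can only shrink as τ grows.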
[Figure 3.6 plots: three ROC panels (N=10, N=20, N=30). x-axis: FP rate; y-axis: Sensitivity; series: HMMER3 and MetaDomain]
Figure 3.6: ROC curves of HMMER and MetaDomain.
Figure 3.6 shows that HMMER is highly specific (FP rate ≤ 1.3%). However, as we
speculated, its sensitivity is low, with the highest sensitivity being only 0.135. HMMER
misses a large portion of short reads that can be mapped to protein domains even when we
use a very relaxed E-value cutoff. When both tools incur an FP rate of 0.02, the sensitivity
of MetaDomain is 0.53 vs. 0.13 for HMMER. When N decreases from 30 to 10, the size of the
positive test set $D^+$ becomes larger and the sensitivity of both HMMER and MetaDomain
decreases. Note that the sensitivity and FP rate of HMMER remain the same for many different
thresholds (i.e., τ), resulting in compact ROC curves. Overall, the ROC curves show that
MetaDomain can achieve higher sensitivity while keeping an FP rate similar to that of HMMER
for domain classification in this experiment. In addition, Figure 3.6 provides guidance on
determining an appropriate τ for MetaDomain in order to achieve the desired sensitivity and FP
rate.
On average, it took MetaDomain 280 seconds to align 752,156 reads with one domain on
a 2.