
PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA

By

Yuan Zhang

A DISSERTATION

Submitted to Michigan State University

in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2013

ABSTRACT

PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA

By

Yuan Zhang

Sequence analysis is the process of analyzing DNA, RNA, or peptide sequences using a wide range of methodologies in order to understand their functions, structures, or evolutionary history. Next-generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide-level resolution at low cost, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, and ncRNA identification. Functional analysis of NGS sequences has therefore become increasingly important because it provides insight into the gene expression, protein composition, and phylogenetic complexity of the species from which the sequences are generated. One basic step in functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences.

The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences against annotated protein or domain databases. There are two types of domain analysis methods: pairwise alignment and profile-based similarity search. The first uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as those in Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity in remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis.

There are several challenges in protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms, such as pyrosequencing, have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity for short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I address these three challenges using different methods. Specifically, we have proposed four tools: HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME detects and corrects frameshift errors in sequences generated by pyrosequencing technology, and thus accurately classifies metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase sensitivity while keeping the false positive rate low. SALT combines position-specific score thresholds with graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has lower memory usage, higher gene coverage, and a lower chimera rate than existing tools. Finally, I conclude my work and briefly discuss future work.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my adviser, Dr. Yanni Sun. Her decision to admit me as her PhD student four years ago gave me the precious opportunity to study at Michigan State University and led me into the world of bioinformatics. During these four years under her guidance, I have made continuous progress in several aspects, from reading research papers, proposing research topics, developing methods, and designing experiments to writing papers. More importantly, I have gradually improved my ability to conduct in-depth analysis of sophisticated research problems, both independently and collaboratively, and to use scientific methodologies to solve challenging problems. She also gave many suggestions on how to effectively present our work to an audience, especially to people from other research areas. This ability is very important because it will profoundly determine my capability to collaborate in a team of people from different backgrounds.

I also want to thank the other committee members, Dr. C. Titus Brown, Dr. Pang-Ning Tan, and Dr. James R. Cole. They gave many useful suggestions during the course of my PhD program. I also thank my lab mates Rujira Achawanantakun, Jikai Lei, Cheng Yuan, Prapaporn Techa-angkoon, and Jiao He. During these years, we have had productive discussions and cooperation on various research topics, and I have received great help from them. I would like to acknowledge my colleagues at BeachMint Inc. during my summer internship, especially Douglas Cohen, Jeff Cooper, and Manunya Rozelle. With their help, I learned how to apply theories and methods to solve challenging problems in industry. I gratefully acknowledge other faculty and staff of the CSE department, especially Dr. Rong Jin, Dr. Jin Chen, Linda Moore, and Norma Teague. I also owe many thanks to my friends at MSU and in China. All of them have given me a lot of support and help during these years.


My final and most important acknowledgement must go to my family. They have always given me persistent and determined love and support.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1 Introduction
  1.1 Next-generation sequencing technologies
  1.2 Protein domain analysis of NGS sequences
  1.3 Challenges with protein domain analysis of NGS sequences

Chapter 2 Protein domain classification for metagenomic sequences containing frameshift errors
  2.1 Background
  2.2 Related work
  2.3 Method
    2.3.1 Error models
    2.3.2 The augmented Viterbi algorithm for sequencing error correction
    2.3.3 Running time analysis
  2.4 Results
    2.4.1 Accuracy of HMM-FRAME
    2.4.2 Using HMM-FRAME in "Targeted Metagenomic"
      2.4.2.1 Protein domain analysis of nifH sequences
      2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes
    2.4.3 Protein domain classification in the deep mine data set
  2.5 Conclusion

Chapter 3 Profile HMM-based protein domain classification for short sequences
  3.1 Background
  3.2 Related Work
  3.3 Method
    3.3.1 Pipeline of MetaDomain
    3.3.2 The Viterbi algorithm
    3.3.3 Alignment Filtering
      3.3.3.1 Position specific threshold
      3.3.3.2 Alignment trimming
    3.3.4 Protein domain classification
  3.4 Results
    3.4.1 Identifying transcribed protein domains in transcriptome
      3.4.1.1 Performance of read classification
      3.4.1.2 Identifying transcribed domains in the transcriptome
    3.4.2 Protein domain analysis in a soil metagenomic data set
  3.5 Conclusion

Chapter 4 A Sensitive and accurate protein domain classification tool (SALT) for short reads based on profile HMMs and graph algorithms
  4.1 Background
  4.2 Related Work
  4.3 Methods
    4.3.1 Overview of SALT
    4.3.2 Stage 1: profile HMM-based filtration
      4.3.2.1 Position-specific score threshold
    4.3.3 Stage 2: contig generation
      4.3.3.1 Constructing a hit graph for a family
      4.3.3.2 Find the K longest paths
    4.3.4 Stage 3: E-value computation and contig selection
    4.3.5 Running time analysis
  4.4 Results
    4.4.1 Protein domain classification of very short reads
      4.4.1.1 Determining the true membership of reads
      4.4.1.2 Performance evaluation
    4.4.2 Protein domain classification of an RNA-Seq data of Arabidopsis
    4.4.3 Protein domain classification of a non-model organism
  4.5 Discussion
  4.6 Conclusion

Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SAT-Assembler) for NGS data
  5.1 Background
  5.2 Related Work
  5.3 Methods
    5.3.1 Overview of SAT-assembler
    5.3.2 Profile HMM-based homology search
    5.3.3 Alignment informed graph construction
    5.3.4 Pruning and optimization of overlap graphs
    5.3.5 Guided graph traversal using multiple information
    5.3.6 Contig scaffolding
    5.3.7 Running time analysis
  5.4 Results
    5.4.1 Gene assembly in an RNA-Seq data set of Arabidopsis
      5.4.1.1 Edge creation performance
      5.4.1.2 Performance comparison with other assembly tools
    5.4.2 Targeted gene assembly in a human gut metagenomic data set
  5.5 Conclusion

Chapter 6 Conclusion and future work

Bibliography

LIST OF TABLES

Table 2.1 Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

Table 3.1 Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.

Table 4.1 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Burkholderia cenocepacia.

Table 4.2 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Arabidopsis.

Table 4.3 Classification results generated by different classifiers on the Radix balthica data set.

Table 4.4 Description of transcribed families uniquely identified by SALT.

Table 5.1 Edge creation performance of three strategies on the RNA-Seq data set of Arabidopsis.

Table 5.2 Performance comparison between different assembly tools on the RNA-Seq data set of Arabidopsis.

Table 5.3 Performance comparison between different assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set.

LIST OF FIGURES

Figure 2.1 Frameshifts cause short alignments with marginal scores.

Figure 2.2 Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)

Figure 2.3 Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.

Figure 2.4 Protein domain classification results for the black sample in the deep mine data set.

Figure 3.1 Change of the read classification sensitivity of HMMER over read length and the average sequence identity of domain families.

Figure 3.2 Histogram of the average pairwise sequence identity for 2558 domains.

Figure 3.3 Three types of alignment distributions.

Figure 3.4 Pipeline of MetaDomain.

Figure 3.5 Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate.

Figure 3.6 ROC curves of HMMER and MetaDomain.

Figure 3.7 Read length distribution in the soil data set.

Figure 3.8 Reads aligned by HMMER and MetaDomain.

Figure 3.9 The distributions of aligned reads for PF09703 by HMMER and MetaDomain.

Figure 4.1 Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two different genes. Both genes are transcribed and sequenced. Red lines: positive reads. Blue lines: negative reads.

Figure 4.2 The pipeline of SALT.

Figure 4.3 (A) Thirteen reads and their alignment layout w.r.t. the profile HMM represented by its matching states. The alignment scores are shown in the table. Blue reads: negative reads. Red reads: positive reads. (B) The constructed hit graph when k* = 4. For simplicity of explanation, mismatches are not allowed in this simple example (i.e., e = 0). Red nodes are created by positive reads. Blue nodes are created by negative reads. (C) The hit graph after removing transitive overlaps and adding the root node.

Figure 4.4 ROC curves of different classifiers. HHblits and SSAKE+HMMER are listed in separate embedded windows because their FP rates are orders of magnitude larger than others.

Figure 4.5 A Venn diagram of the transcribed families identified by different classifiers.

Figure 5.1 The pipeline of SAT-assembler. Reads of the same color belong to the same gene family. Reads from different genes of the same family are distinguished using different patterns. Reads shared by multiple genes from the same family have multiple patterns.

Figure 5.2 (A) Two reads a and b sequenced from different genes of the same family are aligned to the profile HMM of the family. Their sequence overlap is indicated in red. (B) Read a and read b have an alignment overlap of 66 and a sequence overlap of 25 (in bold). (C) The alignment between the translated peptides of a and b is 22 residues.

Figure 5.3 A graph containing reads from two different genes A and B. Nodes in red (v1, v4, and v7) and in blue (v2, v5, and v8) are from gene A and gene B respectively. Nodes in black (v3 and v6) are chimeric nodes because they are shared by the two genes. Arrows with solid lines are real edges. Arrows with dotted lines and dashed lines indicate paired-end reads and transitive edges between two nodes respectively.

Figure 5.4 Three contigs generated from a metagenomic data set. The green parts of the contigs are contained in the target gene and thus are gene segments. The blue parts of the contigs are not gene segments.

Figure 5.5 Chimera rate versus gene coverage when k-mer size or overlap threshold changes for different assembly tools. These values are average values of the assemblers' performance on 3,188 input families.

Chapter 1

Introduction

1.1 Next-generation sequencing technologies

In bioinformatics, sequencing means determining the primary structure of a biological sequence. Before the advent of new sequencing technologies, Sanger sequencing was the main method for sequencing DNA. However, this technique has several limitations. For example, Sanger sequencing is not applicable to sequencing a small amount of DNA, making it expensive and inaccessible to most small labs. Also, the length of the DNA that can be sequenced is limited.

Next-generation sequencing (NGS) technologies were developed in response to the demand for low-cost sequencing. These new technologies use massively parallel methods and can produce large-scale sequence data at low cost, which brings large-scale sequencing within the reach of many scientists. Moreover, they generate much more sequence data per run, with high coverage and nucleotide-level resolution.

1.2 Protein domain analysis of NGS sequences

Inferring functions from sequences is important in analyzing the different types of data generated by NGS technologies. One basic step in functional analysis is to classify NGS sequences into annotated functional categories, such as protein families or protein domain families. Protein domain analysis has been widely used for functional annotation of RNA-Seq data [1, 2, 3, 4]. In particular, quantifying the expression levels of protein domains helps us understand how transcriptional changes of domains are associated with sequencing conditions, sampling tissues, or experimental treatments in RNA-Seq data. For example, computational domain analysis was applied to identify domains that play a role in vernalization and efflux transporters in the gibberellin response in sugar beet [1]. Domain analysis is also frequently used to evaluate and compare the gene annotation quality of different gene-finding tools [3] or to compare the domain composition of data sampled using different techniques [4]. Protein domain analysis has also been used to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host [5, 6, 7]. For example, Ellrott et al. investigated the distribution of protein families in the currently available human gut genomic and metagenomic data [8]. Schlüter et al. applied HMMER to understand the genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant [9]. The phylogenetic algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify the source organisms of environmental DNA fragments.

There are two major comparative methods for protein domain analysis. The first is based on pairwise sequence alignment tools such as the BLAST software suite [11]. Query sequences are classified via comparison with annotated protein databases such as NCBI-nr using BLASTX [11]. The second is profile-based similarity search, which classifies queries against characterized protein domain or family databases such as Pfam [12], TIGRFAM [13], and FIGfams [14]. There also exist comprehensive protein domain search tools such as InterProScan [15], which combines different sequence- and profile-based domain recognition methods from the InterPro [16] consortium member databases into one resource.


Although BLAST is one of the most efficient protein homology search tools, probabilistic model-based methods have much better sensitivity for remote protein homology recognition. Using profile hidden Markov models (HMMs) to represent a protein family greatly improves homology search sensitivity between highly diverged sequences [17]. Thus it is desirable to conduct protein domain classification using profile HMM-based tools such as HMMER [18]. In conjunction with the fast-growing protein domain family database Pfam [12], which contains over 10,000 annotated protein domain families, HMMER is able to classify sequences into different domain families with high accuracy. In addition, HMMER 3.0 [18], the latest implementation of the profile HMM-based domain classification tool, has achieved speed comparable to BLAST, making it suitable for large-scale protein compositional analysis. For convenience, we use HMMER to refer to HMMER 3.0 hereafter unless otherwise specified.

1.3 Challenges with protein domain analysis of NGS sequences

Although profile HMM-based methods have been successfully applied to genome-wide domain analysis, there are still many challenges in protein domain analysis of NGS sequences, especially complex metagenomic data. First, sequences generated by some NGS platforms, such as pyrosequencing, contain sequencing errors, including insertions or deletions of nucleotides, especially in homopolymer regions. These errors create frameshifts during translation, making it difficult to classify the derived peptide sequences into their native families. Second, as the length of the query reads decreases, existing tools have low sensitivity in classifying these short reads, especially for poorly conserved domain families. Many sequencing technologies such as Illumina still generate short reads of 35 bp to 150 bp. Moreover, protein sequences encoded in individual metagenomic sequence reads may share only a small overlap with existing protein families. Therefore, a sizable portion of many data sets still consists of short reads. Third, microbial communities usually contain a large number of different microbial species, complicating the functional annotation of metagenomic data.

In order to address these challenges and improve the performance of protein domain classification, we have proposed three tools: HMM-FRAME, MetaDomain, and SALT. HMM-FRAME is designed to accurately classify metagenomic sequences containing frameshift errors. MetaDomain and SALT are designed to directly classify short reads into their native protein domain families with better sensitivity than existing tools. Compared to MetaDomain, SALT incorporates graph algorithms to improve the accuracy of protein domain classification. In my future work, I will focus on accurate and scalable gene assembly from complex metagenomic data.


Chapter 2

Protein domain classification for metagenomic sequences containing frameshift errors

2.1 Background

Culture-independent methods and high-throughput sequencing technologies now enable us to obtain community random genomes (metagenomes) from different habitats such as arctic soils and the mammalian gut. Currently, metagenomic annotation focuses on phylogenetic complexity and protein composition analysis. An important component of protein composition analysis is protein domain classification, which classifies a putative protein sequence into annotated domain families and thus aids functional analysis. Profile HMM-based alignment is the state-of-the-art method for protein domain classification because of its high sensitivity in classifying remote homologs. In conjunction with the Pfam database, HMMER [18] can accurately classify query protein sequences into existing domain families. In addition, the latest version of HMMER achieves speed comparable to BLAST, making it applicable to large-scale metagenomic data sets.

However, HMMER cannot optimally classify sequences containing frameshift errors. In HMMER-based domain analysis, the six-frame translations of a sequence read or a predicted gene fragment are aligned with annotated protein domain families. One problem with this method is that sequencing errors, including insertions or deletions of nucleotides, create frameshifts during translation. As a result, the derived peptide sequences are likely to generate alignments with marginal scores. Because HMMER uses alignment scores, E-values, or lengths to determine family membership, these reads become unclassifiable or can be falsely recognized as "novel" proteins during downstream analysis. Figure 2.1 illustrates how insertion or deletion errors cause marginal alignment scores.

Error-free DNA:        X1X2X3 X4X5X6 X7X8X9 X10X11X12 X13X14X15 X16X17X18 X19X20X21
DNA with insertions:   X1X2X3 X4X5X6 X X7X8X9 X10X11X12 X13X14X15 Y X16X17X18 X19X20X21
Error-free peptide:    aa11 aa12 aa13 aa14 aa15 aa16 aa17
Recoverable peptide:   aa11 aa12 aa23 aa24 aa25 aa36 aa37

Figure 2.1: Frameshifts cause short alignments with marginal scores

In Figure 2.1, Xi is the ith base of a DNA sequence, and each group of three bases is a codon. aij (labeled aaij in the figure) is the jth amino acid of a peptide sequence derived under reading frame i. The correct peptide sequence can be derived from the error-free sequence (shown at the top of the figure) under reading frame 1. Because of the insertion of two nucleotides (X and Y), the correct peptide sequence is the concatenation of three short peptide sequences derived using different reading frames. Thus, each peptide sequence derived using a single reading frame can only generate short alignments with insignificant scores.
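
The effect shown in Figure 2.1 can be reproduced with a few lines of Python. The following is a minimal sketch, not part of HMM-FRAME: the 21-base sequence, the two inserted bases, and the helper name `translate` are all hypothetical choices made for illustration, and reading frames are written as 0-based offsets (0, 1, 2) rather than the 1-based frames used in the figure.

```python
# Standard genetic code, enumerated in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame):
    """Translate dna from 0-based offset frame; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


# Hypothetical 21-base coding sequence (7 codons), read from offset 0.
original = "ATGGCTAAAGGTGAACTGCGT"
# Insert one base after codon 2 and one after codon 5, like X and Y in Figure 2.1.
mutated = original[:6] + "C" + original[6:15] + "T" + original[15:]

correct = translate(original, 0)
frames = [translate(mutated, f) for f in range(3)]
# The correct peptide survives only as short pieces spread over three frames.
stitched = frames[0][:2] + frames[1][2:5] + frames[2][5:]

print(correct)
print(frames)
print(stitched == correct)
```

No single frame of the mutated read contains more than three consecutive residues of the correct peptide, which is why each per-frame alignment is short and scores poorly.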

This problem is more serious in domain analysis of metagenomic data sets. Given the high complexity of many metagenomic data sets, high-quality genome assembly is not always available. Thus, protein annotation can only be conducted on short sequence reads. The average read length varies from 25-35 to around 400 bases for the next-generation sequencing methods currently in use. On average, there is about one open reading frame per 1000 base pairs in bacterial genomes. Depending on gene size, many gene fragments in metagenomic sequence reads may share only a small overlap with existing domain families, generating even shorter profile HMM alignments with significantly lower scores.

Although a number of tools [19, 20, 21, 22, 23, 24] exist for frameshift detection, they are not designed for protein domain classification using profile HMMs. In addition, these tools do not incorporate the sequencing error patterns associated with next-generation sequencing technologies. A clear disadvantage is that they do not distinguish between error rates inside and outside homopolymer regions in pyrosequencing reads. The goal of this work is to design an accurate profile HMM alignment method that can incorporate any given error pattern. Our experiments show that our tool has high sensitivity (> 95%) in detecting sequencing errors and a low false positive rate (∼ 0.15%). By correcting insertion and deletion errors, it can generate longer alignments with significantly higher alignment scores, and thus provide more accurate protein domain classification.

2.2 Related work

A number of programs exist to handle frameshifts through DNA versus protein sequence alignment. The simplest methods discard sequences that might contain frameshifts rather than trying to correct them. For example, BLASTX provides insightful information about whether a query DNA sequence contains frameshifts using six-frame translations. However, it neither explicitly outputs the positions of insertions or deletions that create frameshifts, nor tries to fix them by constructing an alignment from pieces obtained from different reading frames. Other tools are available to detect and fix frameshift errors automatically. Frame [19] uses BLASTX to compare all six reading frames of the query nucleotide sequence against protein sequences. Then the aligned regions are combined for frameshift detection. Guan et al. [20], Zhang et al. [21], and Halperin et al. [22] describe dynamic programming algorithms for frameshift detection during pairwise DNA and protein sequence alignment. Instead of using all reading frames of a DNA sequence to maximize the alignment score, another group of tools [23, 24] translates a protein sequence back into DNA sequences and formulates the alignment problem as a network matching problem. Frameshift detection has also been applied to finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions [25, 26, 27].

Some gene-finding tools detect frameshifts. FrameD [28] relies on a directed acyclic graph for gene prediction in the presence of frameshifts. Kislyuk et al. [29] apply an ab initio method to detect possible frameshifts from coding potential generated by GeneMark [30]. GeneTack [31] and FragGeneScan [32] use hidden Markov models for ab initio frameshift detection in gene finding.

Despite the extensive study of frameshift detection, the above programs are not designed for protein family classification through DNA versus protein family alignment. Alternatively, GeneWise [33], a widely used DNA versus protein alignment tool, allows comparison of a DNA sequence with a profile HMM. Our algorithm differs from GeneWise by explicitly incorporating a position-specific error model that is trained on data from different sequencing platforms such as 454 GS FLX Titanium.


2.3 Method

The representative protein domain classification tool HMMER [18] classifies a query protein sequence into a profile HMM-represented protein family using the Viterbi or the Forward algorithm [17]. The Viterbi algorithm aligns a query protein sequence to a profile HMM by searching for the most probable state path in the model. If the alignment score or E-value meets the pre-defined threshold, the query is classified into the corresponding family. The alignment generated by the Viterbi algorithm only accounts for the differences caused by evolutionary divergence between a sequence and a protein family. In order to classify error-containing sequences into their native families, the alignment algorithm must detect the differences resulting from both evolution and sequencing errors.

In this section, we describe HMM-FRAME, the implementation of an augmented Viterbi algorithm that searches for the optimal alignment between a DNA query and a profile HMM by considering both evolutionary divergence and sequencing errors. HMM-FRAME differs from HMMER in the following ways: 1) HMM-FRAME directly accepts a DNA sequence as input, 2) HMM-FRAME accepts a sequencing error model as input, and 3) HMM-FRAME can detect and fix frameshifts caused by sequencing errors in the DNA sequence. The output alignment indicates which bases are inserted or deleted due to evolutionary change or sequencing error.

2.3.1 Error models

Here we describe the error models used in our experiments. Different sequencing technologies may have different types of errors. For example, previous work [34, 35, 36] has shown that insertions and deletions occur more often in homopolymer regions than in non-homopolymer regions for pyrosequencing reads. Substitution errors occur more often than insertions or deletions in Illumina sequencing reads. Because deletion or insertion errors cause frameshifts, we focus on applying HMM-FRAME to pyrosequencing data sets.

In this work, we consider two error models. The first one is a published model trained from GS20 sequencing reads [34]. The insertion and deletion error rates in non-homopolymer and homopolymer regions are 0.0007 and 0.0044, respectively. The second error model is computed on data from the FLX Titanium sequencing platform. We obtained a set of Titanium sequence reads (Cole and Wang, unpublished) extracted from the region H of the 16S rRNA, which were amplified from the Baylor mock community (22 strains, 24 sequences). Then we computed error rates using insertions and deletions that were annotated by generating careful Needleman-Wunsch alignments between the Titanium sequencing reads and the control sequences. In total, 7,040 sequences passed the initial quality control of RDP [37] after contamination and chimera detection. There were 1,721 insertion and deletion errors. Note that PCR, which was used to generate the amplicons of the sample, can introduce errors. However, because most of the errors introduced by PCR are substitution errors, we assumed that the deletions and insertions were mainly sequencing errors. The derived error rates for homopolymers of different sizes were: 1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688, 5: 0.0372, 6: 0.00167, 7: 0.143, where the first number is the size of homopolymer regions (1 means non-homopolymer) and the second number is the rate of insertion and deletion errors. If we sum the error rates for homopolymer regions of different sizes, the insertion and deletion error rates for non-homopolymer and homopolymer regions were 0.0005 and 0.001, respectively. They are slightly smaller than the published GS20 error rates [34]. We will compare their performance on a data set with annotated errors in Section 2.4.
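As an illustration of how a homopolymer-size error table like the one above might be applied to a read, the following Python sketch assigns each base the rate of the homopolymer run that contains it. The function names, the capping of runs longer than 7, and the idea of a single per-base lookup are assumptions made for this example; the text does not specify how HMM-FRAME maps the table onto individual bases.

```python
from itertools import groupby

# Titanium insertion/deletion rates by homopolymer size, copied from the text
# (size 1 means the base is not part of a homopolymer).
TITANIUM_INDEL_RATE = {1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688,
                       5: 0.0372, 6: 0.00167, 7: 0.143}


def run_lengths(seq):
    """Return, for every base, the length of the homopolymer run containing it."""
    lengths = []
    for _, group in groupby(seq):
        n = len(list(group))
        lengths.extend([n] * n)
    return lengths


def indel_rates(seq, table=TITANIUM_INDEL_RATE):
    """Look up a per-base indel rate from the homopolymer table.

    Assumption of this sketch: runs longer than the largest tabulated size
    reuse the rate of the largest size.
    """
    longest = max(table)
    return [table[min(n, longest)] for n in run_lengths(seq)]


if __name__ == "__main__":
    read = "ACCGTTTA"
    for base, rate in zip(read, indel_rates(read)):
        print(base, rate)
```

A per-base table of this kind could then stand in for the error probabilities required by the alignment algorithm, although the exact mapping used in the experiments is not stated here.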

2.3.2 The augmented Viterbi algorithm for sequencing error correction

Let $\pi$ be a state path in a profile HMM $M$. Let $r$ be a set of insertion and deletion positions in a DNA sequence $x$. The augmented Viterbi algorithm searches for the most probable path $\pi^*$ and the most probable error position set $r^*$ such that $(\pi^*, r^*) = \arg\max_{(\pi, r)} P(x, \pi, r)$. Intuitively, this algorithm searches for an optimal alignment between a DNA sequence and a profile HMM by simultaneously considering 1) evolutionary divergence (i.e., the insertion, deletion, and substitution of amino acids) and 2) sequencing errors (i.e., insertion and deletion of nucleotides). To solve the above equation, we first divide the search space according to different types of sequencing errors inside a codon and between two consecutive codons. For each type of error, we search for the most probable state path.

Input: a DNA sequence $x$, a profile HMM $M$, and a sequencing error model. Notations of $M$ and the error model are described below.

Output: the optimal alignment between DNA sequence $x$ and $M$, as well as error positions in $r$.

Algorithm: we first define notations that will be used in the dynamic programming equations.

• Notations about the profile HMM $M$: States $M_j$, $I_j$, and $D_j$ are matching, insertion, and deletion states in $M$. $a_{s_1 s_2}$ is the transition probability from state $s_1$ to $s_2$. $e_s(T(x_{i-2}x_{i-1}x_i))$ is the emission probability for state $s$ to emit amino acid $T(x_{i-2}x_{i-1}x_i)$, which is translated from the codon $x_{i-2}x_{i-1}x_i$. For a detailed description of a profile HMM $M$, we refer the reader to the textbook [17] and the users' guide of HMMER [18]. State $G_j$ is the only state that is not defined in profile HMMs from HMMER 3.0. It encodes insertions of nucleotides between codons. $a_{M_j G_j}$ is the transition probability from matching state $M_j$ to nucleotide insertion state $G_j$. It is set to the insertion error probability. $a_{G_j G_j}$ is the self-transition probability for $G_j$, encoding the probability of consecutive insertions. When consecutive insertion is not allowed, it is set to 0. $a_{G_{j-1} M_j}$ is the transition probability from $G_{j-1}$ to the next matching state $M_j$. When only one insertion error is allowed, it is set to 1.0.

• Notations about the sequencing error model: $p_I(x_i)$ is the probability that base $x_i$ is an insertion error. $p_D(x_i)$ is the probability that there is a deletion error after base $x_i$.

• Subproblems and the recursive equations: Based on our analysis of error patterns, consecutive insertions or deletions are very rare in a sequence read. Thus, the following DP algorithm assumes that there is at most one insertion or deletion inside a codon. The algorithm can be extended to handle all possible cases.

  – $V^{M}_{j}(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the matching state $M_j$, given that $x_i$ is the third base of a codon and this codon encodes an amino acid emitted by $M_j$.

  – $V^{I}_{j}(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the insertion state $I_j$, given that $T(x_{i-2}x_{i-1}x_i)$ is emitted by $I_j$.

  – $V^{G}_{j}(i)$ is the score of the best alignment ending in $x_i$ being emitted by state $G_j$, which encodes an insertion of nucleotides between codons.

  – $V^{D}_{j}(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the deletion state $D_j$.

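$$
V^{M}_{j}(i) = \max \left\{
\begin{array}{l}
\text{Case I: no sequencing error in the codon } x_{i-2}x_{i-1}x_i: \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^{M}_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^{I}_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^{D}_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times p_I(x_{i-3}) \times V^{G}_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\
\text{Case II: nucleotide } x_{i-1} \text{ is an insertion:} \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^{M}_{j-1}(i-4) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^{I}_{j-1}(i-4) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^{D}_{j-1}(i-4) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^{G}_{j-1}(i-4) \times a_{G_{j-1}M_j}, \\
\text{Case III: nucleotide } x_{i-2} \text{ is an insertion:} \\
\quad \text{repeat the above four equations for } e_{M_j}(T(x_{i-3}x_{i-1}x_i)), \\
\text{Case IV: there is a deleted nucleotide (represented by } d \text{) between } x_{i-1} \text{ and } x_i: \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^{M}_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^{I}_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^{D}_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
\quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^{G}_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\
\text{Case V: there is a deleted nucleotide between } x_{i-2} \text{ and } x_{i-1}: \\
\quad \text{repeat the above four equations for } e_{M_j}(T(d\,x_{i-1}x_i)).
\end{array}
\right.
$$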

In cases IV and V, we use $d$ to represent the deleted base. We choose $d$ to maximize the emission probability of $T(x_{i-1}\,d\,x_i)$ (or $T(d\,x_{i-1}x_i)$) in the matching state $M_j$.

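$$
V^{I}_{j}(i) = \max\left\{ e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^{M}_{j}(i-3) \times a_{M_j I_j},\;\; e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^{I}_{j}(i-3) \times a_{I_j I_j} \right\}
$$

$$
V^{G}_{j}(i) = \max\left\{ p_I(x_i) \times V^{M}_{j}(i-1) \times a_{M_j G_j},\;\; p_I(x_i) \times V^{G}_{j}(i-1) \times a_{G_j G_j} \right\}
$$

$$
V^{D}_{j}(i) = \max\left\{ V^{M}_{j-1}(i) \times a_{M_{j-1} D_j},\;\; V^{D}_{j-1}(i) \times a_{D_{j-1} D_j} \right\}
$$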
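The choice of the deleted base $d$ in cases IV and V can be illustrated with a small Python sketch. It tries each of the four nucleotides in the deleted position, translates the resulting codon with the standard genetic code, and keeps the base whose amino acid has the highest emission probability in the matching state. The function name, the dictionary-based emission table, and the treatment of stop codons (emission 0) are assumptions of this example rather than details taken from HMM-FRAME.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def best_deleted_base(known, emission, case):
    """Pick the deleted base d that maximizes the match-state emission.

    known    -- the two observed bases x_{i-1} x_i as a 2-character string
    emission -- hypothetical dict: amino acid -> emission probability in M_j
    case     -- "IV": codon is x_{i-1} d x_i;  "V": codon is d x_{i-1} x_i
    Returns (d, amino_acid, emission_probability).
    """
    best = None
    for d in "ACGT":
        codon = known[0] + d + known[1] if case == "IV" else d + known
        amino_acid = CODON_TABLE[codon]
        score = emission.get(amino_acid, 0.0)  # stop codons and unknowns score 0
        if best is None or score > best[2]:
            best = (d, amino_acid, score)
    return best


if __name__ == "__main__":
    match_emission = {"M": 0.6, "K": 0.2, "L": 0.3}
    print(best_deleted_base("AG", match_emission, "IV"))
    print(best_deleted_base("TG", match_emission, "V"))
```

Because only four candidates exist per deleted position, this maximization adds a constant factor to each dynamic programming cell.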

2.3.3 Running time analysis

The time complexity of the above dynamic programming algorithm is $O(\delta |x| |M|)$, where $|x|$ is the length of the input DNA sequence and $|M|$ is the number of states in $M$. $\delta$ is the number of different types of errors inside a codon plus the case of insertions between two codons. In our current implementation, $\delta = 26$, which renders a longer running time than the standard Viterbi algorithm. Thus, it is not practical to compare millions of metagenomic sequence reads to over 10,000 protein families in Pfam. Instead, we only run HMM-FRAME on sequences that are likely to contain insertion or deletion errors. For large-scale applications, we suggest applying HMMER, which is as fast as BLAST, to all input sequence reads using a large E-value cutoff (such as 100). Alignments covering at least 80% of the translated DNA sequence with significant E-values can be classified by HMMER in this step. Sequence reads that do not yield any partial alignments are unlikely to be members of any protein family. Thus, we only apply HMM-FRAME to reads yielding partial alignments with marginal scores because these reads could potentially contain sequencing errors.
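The pre-filtering strategy described above can be sketched as a three-way triage of reads based on their best HMMER hit. The `Hit` record, the function name, and the value used for a "significant" E-value (1e-3) are hypothetical choices for this illustration; only the 80% coverage criterion comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Best HMMER hit for one read (hypothetical record, not HMMER's output format)."""
    aligned_residues: int  # residues of the translated read covered by the alignment
    read_residues: int     # length of the translated read
    evalue: float


def triage(hit: Optional[Hit], coverage_cutoff=0.8, significant_evalue=1e-3):
    """Decide what to do with a read after the permissive HMMER pass.

    - no hit at all                      -> discard (unlikely to belong to any family)
    - long alignment, significant score  -> accept HMMER's classification
    - anything else (partial / marginal) -> hand over to HMM-FRAME
    """
    if hit is None:
        return "discard"
    coverage = hit.aligned_residues / hit.read_residues
    if coverage >= coverage_cutoff and hit.evalue <= significant_evalue:
        return "classified_by_hmmer"
    return "run_hmm_frame"


if __name__ == "__main__":
    print(triage(None))
    print(triage(Hit(30, 33, 1e-8)))
    print(triage(Hit(12, 33, 5.0)))
```

Under this scheme the more expensive augmented Viterbi algorithm runs only on the reads in the last category.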

2.4 Results

In this section, we compare the sensitivity and false positive rates (FP rates) of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. We then apply HMM-FRAME to Targeted Metagenomics and a published metagenomic data set. Our experimental results show that the lengths, scores, and E-values of profile HMM alignments are significantly improved after error correction. As profile HMM-based alignment tools determine membership by comparing E-value or length with user-defined thresholds, the improvement of these parameters enables more error-containing sequences to be classified into their native families.

2.4.1 Accuracy of HMM-FRAME

In order to evaluate the accuracy of HMM-FRAME in detecting insertion and deletion errors, we obtained a control data set with annotated error positions from RDP (Cole and Wang, unpublished). In this data set, NifH gene families from the Desulfitobacterium hafniense strain DCB-2, the Burkholderia xenovorans strain LB40, and the PCC 7120 strain of Anabaena were amplified and then sequenced using 454 Titanium. The sequenced gene families were aligned with the nifH genes in these three organisms using the Needleman-Wunsch algorithm. Insertion and deletion errors were identified from the alignments. After contamination and chimera screening, we had 18,900 sequences, of which 3,408 sequences contained 4,623 insertion or deletion errors. We conducted the protein domain analysis on the 18,900 sequences using HMM-FRAME under the two error models presented in the Method section. The input profile HMM was trained on 25 nifH genes obtained from RDP's functional gene repository website [38].

We evaluated the performance of error-prediction tools using two types of sensitivity and FP rates. Let $S^+$ be the set of error-containing sequences in the control data set. Let $S$ be the set of predicted error-containing sequences. The sequence-level sensitivity and FP rate are $\frac{|S \cap S^+|}{|S^+|}$ and $\frac{|S - S^+|}{|S|}$, respectively. Similarly, let $Q^+$ be the set of insertion and deletion positions in error-containing sequences from the control data set. Let $Q$ be the set of predicted error positions. The base-level sensitivity and FP rate are $\frac{|Q^+ \cap Q|}{|Q^+|}$ and $\frac{|Q - Q^+|}{|Q|}$, respectively.
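The two measures defined above translate directly into set operations. The following Python sketch is a minimal illustration; the function name and the toy identifiers are made up for the example. For the base-level measures, each element would be a (read, position) pair instead of a read identifier.

```python
def sensitivity_and_fp_rate(predicted, truth):
    """Compute sensitivity |P ∩ T| / |T| and FP rate |P − T| / |P|.

    predicted -- iterable of predicted error-containing items (S or Q)
    truth     -- iterable of annotated error-containing items (S+ or Q+)
    Empty denominators yield 0.0 (a convention chosen for this sketch).
    """
    predicted, truth = set(predicted), set(truth)
    sensitivity = len(predicted & truth) / len(truth) if truth else 0.0
    fp_rate = len(predicted - truth) / len(predicted) if predicted else 0.0
    return sensitivity, fp_rate


if __name__ == "__main__":
    # Sequence level: read identifiers.
    annotated = {"r1", "r2", "r3", "r4"}
    called = {"r1", "r2", "r5"}
    print(sensitivity_and_fp_rate(called, annotated))

    # Base level: (read identifier, position) pairs.
    annotated_pos = {("r1", 10), ("r2", 57)}
    called_pos = {("r1", 10), ("r2", 58)}
    print(sensitivity_and_fp_rate(called_pos, annotated_pos))
```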

Using the control data set, we first evaluated the performance of HMM-FRAME under the published GS20 and our self-trained Titanium error models. Then we compared the performance of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. Similar to HMM-FRAME, GeneWise can directly compare DNA sequences with a profile HMM and can accept user-defined error rates. We tested GeneWise using different parameters including error rates and the alignment score thresholds (ranging from 0 to 20). The results with the best tradeoff between sensitivity and FP rate were kept for comparison with HMM-FRAME.

FragGeneScan [32] is a newly developed gene prediction tool for short and error-prone sequences. It predicts genes and identifies sequencing errors inside predicted genes. We applied FragGeneScan to the above sequence set (all genes) and tested its sensitivity and FP rate. FragGeneScan successfully recognized all input as protein-coding genes, rendering a high gene-prediction sensitivity in this data set. However, FragGeneScan had higher FP rates than HMM-FRAME in error detection. The results are summarized in Table 2.1.

Table 2.1: Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

            HMM-FRAME:   HMM-FRAME:      GeneWise   FragGeneScan
            GS20         self-trained
seq-sen     95.25%       90.6%           53.8%      83.04%
base-sen    85.08%       82.4%           -          53.39%
seq-FP      0.154%       0               0.001%     0.7%
base-FP     2.1%         0.003%          -          59.57%

Sensitivity and FP rate of each program when detecting annotated insertion and deletion errors in nifH genes. seq-sen: sequence-level sensitivity. base-sen: base-level sensitivity. seq-FP: sequence-level FP rate. base-FP: base-level FP rate. The score cutoff of GeneWise is set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we did not evaluate its performance at the base level.

As shown in Table 2.1, each tool has higher sensitivity and smaller FP rates in identifying error-containing sequences than in locating error positions. HMM-FRAME has a better tradeoff between sensitivity and FP rate than both GeneWise and FragGeneScan. Both GS20 and our self-trained Titanium error models have small FP rates in predicting error positions, but GS20 has higher sensitivity. Thus, we use GS20 in all further experiments.

2.4.2 Using HMM-FRAME in "Targeted Metagenomics"

In this section, we present the utility of HMM-FRAME in two applications of "Targeted Metagenomics", where one or several gene families are amplified from environmental DNA and these amplicons are sequenced using high-throughput sequencing platforms. One typical application of Targeted Metagenomics is to sequence the amplicons of the 16S rRNA gene for phylogenetic complexity analysis. Besides 16S rRNA, protein-coding genes that are important to a particular habitat can be amplified and sequenced for targeted functional analysis in metagenomic data sets. For example, Targeted Metagenomics of the nifH gene, which encodes nitrogenase reductase, is important for analyzing microbial genomes sequenced from soil. Although these sequences are sampled from one or several targeted gene families, frameshift errors can cause short alignments with marginal scores between the input and the targeted gene families. As a result, sequences lacking significant alignment length and scores will be regarded as contaminants and be discarded. Thus, it is desirable to fix frameshift errors to maximize the number of usable samples. Given a DNA read and a profile HMM built from a set of known protein sequences, HMM-FRAME can be applied to detect and correct frameshift errors in amplicon reads.

2.4.2.1 Protein domain analysis of nifH sequences

In the first experiment, we obtained 3,937 nifH sequences of an average length of 76 bases generated by the 454 FLX sequencing technology. In order to discard contaminants that originated from non-target genes, we aligned the 3,937 sequences with the nifH gene family, which was built on a small set of 25 expert-verified full-length nifH protein reference sequences from RDP's functional gene repository [38]. In the gene family building process, we first applied ClustalW [39] to align the 25 reference sequences. Then we applied HMMER 3.0's hmmbuild program to derive a profile HMM from the multiple sequence alignment. Of the 3,937 454 FLX sequences, 111 were found to be contaminants and were excluded from further analysis. Of the remaining 3,826 sequences, HMM-FRAME detected 296 insertions and deletions in 256 sequences. Thus, approximately 7% of the samples contained frameshift errors. Of the 256 sequences containing insertion or deletion errors, 224 (87.5%) contained only one insertion or deletion error, 24 (9.4%) contained two errors, and eight (3.1%) contained three errors. Of the 296 insertions or deletions, 224 (75.7%) were inside or beside homopolymer regions.

Because protein domain classification tools compare alignment lengths, scores, and E-values with pre-defined thresholds to determine a sequence's membership, the changes in the alignments affect the final domain composition analysis. After error correction, profile HMM-based alignment tools are expected to generate longer alignments with bigger scores and smaller E-values. This gives error-containing sequences a better chance of being classified into the correct families rather than being labeled contaminants.

In order to conduct a fair comparison of alignments before and after error correction, we chose a third-party tool, HMMER, to generate alignments for original and corrected sequences. The changes in the alignments' E-values and lengths due to error correction are presented in Figure 2.2. In this figure, the changes of alignments are presented for the 256 sequences in which HMM-FRAME detects errors. "Original" refers to HMMER alignments on sequences before error correction. "Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME. As a comparison, we also plot the length of the original sequence reads (with the legend "sequence read"). They largely overlap with the length of corrected alignments, indicating that complete sequence reads can be aligned with the nifH profile HMM after error correction.

In order to test whether the improvement was statistically significant, we conducted a two-sample Kolmogorov-Smirnov test (K-S test) on the alignments' lengths and E-values before and after error correction. The p-values for the alignments' length and E-value distributions were 3.1037e-010 and 1.1802e-045, respectively. In particular, the comparison between alignments' lengths and the sequence reads' lengths shows that most partial alignments generated by error-containing sequences become complete alignments after error correction. Thus, when comparatively longer alignments (e.g., 23 amino acids or 69 bases) are required for domain classification, more sequence reads (213 more when the threshold is 69 bases) will be classified into their native families.

[Figure 2.2 plot omitted. x-axis: NifH family reads. y-axis: lengths and E-values of pHMM alignments. Series: LOG(original E-value), LOG(corrected E-value), original length, corrected length, sequence read length.]

Figure 2.2: Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)
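For readers who want to reproduce this kind of before/after comparison, the two-sample K-S test used in this subsection can be sketched in plain Python as follows. The function name is made up for the example, the p-value uses the standard asymptotic Kolmogorov series (which is only approximate for small samples), and the toy inputs are not data from the experiments. A library routine such as SciPy's `ks_2samp` would be a natural alternative.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov test.

    Returns (D, p) where D is the largest gap between the two empirical
    distribution functions and p is the asymptotic two-sided p-value.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # The supremum of |F_a - F_b| is attained at one of the observed values.
    d = max(abs(bisect_right(a, v) / n - bisect_right(b, v) / m) for v in a + b)
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0:
        return d, 1.0
    # Asymptotic Kolmogorov distribution: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    original_lengths = [20, 25, 31, 18, 40, 22]    # toy alignment lengths
    corrected_lengths = [70, 74, 69, 75, 72, 76]
    print(ks_two_sample(original_lengths, corrected_lengths))
```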

2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes

In the second experiment, we obtained 2,486 pyrosequencing samples of an average length of 224 bases from the bacterial aromatic dioxygenase genes in a soil sample [40]. Although these pyrosequencing reads were sequenced from the 5' end of PCR amplicons of bacterial aromatic dioxygenase genes, we were interested in classifying them into three subfamilies of dioxygenase genes: toluene/biphenyl, naphthalene, and benzoate [41]. Note that there is another subfamily (phthalate). However, due to the lack of training proteins for this family (Dr. Iwai, personal communication), we only searched for members of three subfamilies. Three sets of reference protein sequences were extracted from Pfam [12] for toluene/biphenyl, naphthalene, and benzoate [41]. Based on these training sets, we built three profile HMMs using ClustalW and HMMER. Then we applied HMM-FRAME to align the 2,486 reads with the three profile HMMs. HMM-FRAME detected 77 insertions and 52 deletions, which were distributed in 121 sequences. Of the 121 error-containing sequences, 77 could not be classified into any subfamily by HMMER under the E-value threshold 0.1. After error correction using HMM-FRAME, these 77 sequences were classified into different families with an average E-value of 3.3e-06, indicating that they were highly likely to be true members of the underlying families. For the other error-containing sequences, the profile HMM alignments' E-values and lengths were significantly improved after error correction. The change is plotted in Figure 2.3. In this figure, the data set is sequenced from bacterial aromatic dioxygenase genes in a soil sample. All alignments are generated by HMMER for a fair comparison. "Original" refers to HMMER alignments on sequences before error correction. "Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME.

We also applied a two-sample K-S test on the alignments' lengths and E-values before and after error correction. The p-values for the length and E-value distributions were 8.0609e-011 and 1.9776e-040, respectively. The improved alignment lengths and E-values provide stronger evidence for the membership of the input samples. In total, after error correction by HMM-FRAME, we could classify 1,214 sequences into three subfamilies. 1,042 reads were members of the naphthalene subfamily. 96 reads belonged to the benzoate subfamily. 76 reads belonged to the toluene/biphenyl subfamily. The remaining 1,272 reads could potentially be members of the subfamily phthalate (Dr. Iwai, personal communication).

[Figure 2.3 plot omitted. x-axis: Soil sample sequence reads. y-axis: lengths and E-values of pHMM alignments. Series: original length, corrected length, LOG(original E-value), LOG(corrected E-value).]

Figure 2.3: Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.

2.4.3 Protein domain classification in the deep mine data set

In order to show the utility of HMM-FRAME in a metagenomic data set containing members of multiple domain families, we applied HMM-FRAME to the first 454 sequencing project for environmental samples, which were sequenced from two sites in the Soudan Mine, Minnesota, USA [42]. In this experiment, we downloaded the Black Sample from the paper's supplementary data website. This data set contains 388,627 sequence reads with an average length of 99 bases.

There were two steps in the annotation. First, we applied gene-prediction tools. Second, we conducted the domain classification on predicted genes. A number of gene-prediction tools are available for metagenomic data sets. However, not every tool can handle short reads. Glimmer [43] did not output meaningful predictions when it was applied to this data set. The sensitivity of Metagene [44] drops to 59% for 100-base sequences [45]. We thus chose FragGeneScan, a newly developed gene-prediction tool for short reads. FragGeneScan predicted 281,658 genes, of which 72,355 contained errors. For convenience in discussion, let S be the set of genes predicted by FragGeneScan. Let S' be the raw read set corresponding to genes in S. Thus, 72,355 sequences in S were different from their raw reads in S' because FragGeneScan predicted and corrected errors in S'. We compared three domain classification pipelines: 1) apply HMMER 3.0 on raw reads S', 2) apply FragGeneScan and then HMMER on corrected reads S, and 3) apply HMM-FRAME on raw reads S'. We recorded how many reads could be classified into one of the 2,558 Pfam domain families that contain the keyword "bacteria". The numbers of classifiable reads for the three pipelines were: 13,544 for HMMER, 12,328 for FragGeneScan + HMMER, and 17,496 for HMM-FRAME. The classification results have large overlaps, which are illustrated in Figure 2.4. In this figure, sequence sets that can be classified by HMM-FRAME, HMMER, and FragGeneScan+HMMER are represented by three sets A, B, and C: |A| = 17,496, |B| = 13,544, |C| = 12,328, |B − C| = 2,224, |C − B| = 1,008, |C − A| = 4, and |A − (B ∪ C)| = 2,948.
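The set sizes reported for the three pipelines can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers given above; the variable names are chosen for the example, and the derived quantities (such as the size of the intersection of B and C) are consequences of those numbers rather than values stated in the text.

```python
# Reported sizes (A: HMM-FRAME, B: HMMER alone, C: FragGeneScan+HMMER).
size_a, size_b, size_c = 17496, 13544, 12328
b_minus_c, c_minus_b = 2224, 1008
c_minus_a, a_only = 4, 2948           # |C - A| and |A - (B ∪ C)|

# Reads classified by both HMMER pipelines, computed two ways.
b_and_c = size_b - b_minus_c
assert b_and_c == size_c - c_minus_b  # the two reported differences agree

# Reads classified by at least one of the two HMMER-based pipelines.
b_or_c = size_b + c_minus_b

# Reads classified by HMM-FRAME and by at least one other pipeline.
a_and_b_or_c = size_a - a_only

# Reads that some other pipeline classifies but HMM-FRAME does not.
missed_by_a = b_or_c - a_and_b_or_c

print(b_and_c, b_or_c, a_and_b_or_c, missed_by_a)
```

Under these numbers, the reads missed by HMM-FRAME but classified elsewhere number exactly |C − A|, which suggests that every read classified by HMMER alone on raw reads and missed by HMM-FRAME was also classified by FragGeneScan+HMMER.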

In summary, HMM-FRAME was able to classify 2,948 more reads than the other two annotation pipelines. HMM-FRAME found errors in all of these 2,948 reads. Thus, it is likely that the other two pipelines failed to classify them because of frameshifts. HMM-FRAME failed to classify four reads that can be aligned by FragGeneScan+HMMER. A closer examination showed that FragGeneScan and HMM-FRAME output different error positions in these four sequences.

[Figure 2.4 diagram omitted. Venn diagram of three sets: A: HMM-FRAME; B: HMMER alone; C: FragGeneScan+HMMER.]

Figure 2.4: Protein domain classification results for the black sample in the deep mine data set.

The performance evaluation of FragGeneScan must consider both gene prediction and error prediction. Of the 281,658 predicted genes, only 12,328 could be classified into existing domain families. Further analysis is needed to examine whether the other predictions are novel genes or wrong predictions. It is worth noting that FragGeneScan's error correction allowed 1,008 sequences to be classified that HMMER 3.0 alone could not classify on raw reads. However, while 2,224 raw reads could be classified into existing domain families by HMMER 3.0, they could not be aligned with any family after error correction by FragGeneScan. This indicates that FragGeneScan might have over-predicted errors in the 2,224 sequences. This is consistent with our observation that FragGeneScan has a high FP rate in the control data set.

2.5 Conclusion

Despite the advances of high-throughput sequencing technologies, sequencing errors still pose challenges for data annotation. In particular, our error model analysis shows that 454 FLX Titanium only slightly decreases the insertion and deletion error rates compared to GS20. Thus, correcting frameshifts caused by insertion or deletion errors is still important for metagenomic sequence annotation. In this work, we introduce a protein domain classification tool, HMM-FRAME, which can classify error-prone DNA sequence reads into protein domain families. HMM-FRAME can accept any error model trained on data from high-throughput sequencing technologies and thus achieve high detection sensitivity while maintaining a low false positive rate.

Applying HMM-FRAME to a data set with annotated errors shows its high sensitivity and accuracy in error detection. In particular, by fixing frameshift errors, we can obtain significantly longer profile HMM alignments with smaller E-values. As alignments' lengths, scores, and E-values are often used to determine family membership, improving them helps to classify more sequences into the native domain families. In our experiments, sequences that fail HMMER 3.0 under the default E-value or score threshold are classified into correct domain families using HMM-FRAME. Thus, HMM-FRAME can be used as a complementary tool to HMMER 3.0 on error-prone sequences.

Chapter 3

Profile HMM-based protein domain classification for short sequences

3.1 Background

With the advent of next-generation sequencing and culture-independent methods, an enormous amount of metagenomic data has been sequenced from microbial communities from different habitats. In order to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host, automatic annotation tools such as CAMERA [5], MG-RAST [6], and MEGAN [7] are being used for annotating metagenomic data sets. As an important component of these metagenomic annotation tools, protein homology search provides a basis for identifying putative genes and assigning those genes to annotated functional categories (e.g., protein domain families).

Because of its high sensitivity in remote homology recognition, HMMER has been successfully applied to genome-wide domain analysis. However, its sensitivity is significantly limited by the short reads of metagenomic data sets and poorly conserved domains. In order to investigate how read length and domain identity affect the sensitivity of HMMER, we randomly sampled 200 peptides with lengths of 12, 20, and 28 amino acids from the seed sequences of each of the 2,558 Pfam domains that contain the word "Bacteria" in their descriptions. The peptides were aligned with the domain families using HMMER. We used the E-value cutoff 1000 in order to boost the sensitivity. For each domain, the read classification sensitivity of HMMER is measured as the ratio of the number of aligned reads to the total number of sampled reads. We sort all data points by domain identity in ascending order and plot them in Figure 3.1. For domains with the same identity, their average sensitivity is reported.
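The sampling experiment described above can be sketched in Python as follows. The function names, the fixed random seed, and the rule of skipping seed sequences shorter than the requested peptide length are assumptions of this illustration; the alignment step itself (running HMMER on the sampled peptides) is not shown.

```python
import random


def sample_peptides(seed_sequences, length, n_samples, rng):
    """Draw n_samples random substrings of the given length from a domain's seed sequences.

    Seed sequences shorter than `length` are skipped (an assumption of this sketch).
    """
    eligible = [s for s in seed_sequences if len(s) >= length]
    if not eligible:
        return []
    peptides = []
    for _ in range(n_samples):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        peptides.append(seq[start:start + length])
    return peptides


def read_sensitivity(n_aligned, n_sampled):
    """Sensitivity for one domain: aligned reads divided by sampled reads."""
    return n_aligned / n_sampled if n_sampled else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    toy_seeds = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "ACDEFGHIKL"]
    for length in (12, 20, 28):
        peptides = sample_peptides(toy_seeds, length, 200, rng)
        print(length, len(peptides), peptides[0])
    print(read_sensitivity(130, 200))
```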

[Figure 3.1 plot omitted. x-axis: Average sequence identity of domain. y-axis: Sensitivity of HMMER 3. Series: read length 36 bp, read length 60 bp, read length 84 bp.]

Figure 3.1: Change of the read classification sensitivity of HMMER over read length and the average sequence identity of domain families.

Figure 3.1 shows that the sensitivity of HMMER deteriorates as the query sequence length and domain identity decrease. The sensitivity decreases from 90% to 65-70% when the lengths of reads change from 28 residues (i.e., 84 bp for corresponding DNA reads) to 20 residues (i.e., 60 bp for DNA reads) for domains with identity around 40%.

Although next-generation sequencing technologies are producing longer reads and assembly tools may be available to assemble short reads into longer contigs, there is still a need for a protein domain analysis tool for short reads. First, many finished or on-going metagenomic sequencing projects contain reads with lengths from 35 to around 400 bp depending on the chosen sequencing technologies. In addition, peptide sequences encoded in individual metagenomic sequence reads may share only small overlaps with existing domain families. Thus, a sizable portion of the available data still consists of short reads. Second, the sheer amount of data and the complexity of many metagenomic data sets pose a great challenge for assembly tools [46]. A large portion of short reads cannot be correctly assembled into longer contigs. Third, many domain families exhibit low average sequence identity, which poses a challenge for short and medium-sized reads. Figure 3.2 shows the histogram of pairwise sequence identity for domains related to bacteria. Of the 2,558 domains, about 43% have average identity no greater than 0.3. For these domains, the sensitivity of HMMER is between 0.7 and 0.8 for reads of length 84 bp, between 0.4 and 0.6 for reads of length 60 bp, and smaller than 0.1 for reads of length 36 bp. As a result, although a large number of reads are sequenced from genes, which are highly compact in microbial genomes, only a small percentage of the short reads can be classified into their native domains using existing tools.

In this work, we introduce MetaDomain, a protein domain classification tool designed for short reads in metagenomic data sets. MetaDomain provides a protein analysis tool complementary to HMMER for assigning short reads to their native families.

[Figure 3.2 plot omitted. x-axis: Average sequence identity of domain. y-axis: Number of domains.]

Figure 3.2: Histogram of the average pairwise sequence identity for 2,558 domains

3.2 Related Work

Profile HMM-based protein homology search is widely used for mining microbial genomes. Knowing the composition of different domain families encoded in a metagenomic data set helps us understand which functions are important for a particular habitat. For example, Ellrott et al. [8] investigated the distribution of protein families in the available human gut genomic and metagenomic data. As the data set contains assembled contigs, using HMMER is expected to achieve high sensitivity. Schlüter et al. [9] used HMMER to understand the genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant. The reads have an average length of 104 bp, which is also adequate for HMMER to achieve high sensitivity.

Besides providing a basis for functional profiling, profile HMM-based homology search has also been used for phylogenetic complexity analysis in metagenomic data. The phylogenetic algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify the source organisms of environmental DNA fragments as short as 80 bp. As we show in Figure 3.1, profile HMM-based tools have a sensitivity of at least 0.9 in classifying reads of 80 bp into domains with average sequence identity above 40%. However, for poorly conserved domains, a significant number of reads might be missed. A similar but faster tool, Treephyler [47], conducted community profiling in metagenomics and metatranscriptomics based on Pfam domain assignments. Treephyler was applied to a data set with an average read length of 200 bp. It is unclear how shorter reads affect its performance.

In our previous work we designed HMM-FRAME [48], a tool that can identify and correct frame-shift errors in pyrosequencing reads during protein domain classification using profile HMM-based alignment. However, it was not specifically designed to handle short reads.

Finally, we note that the method used in MetaDomain shares a similar rationale with the recent work by Weng et al. [49]. Weng et al. reported that taxonomic binning tools for metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST cut-offs. Thus, they re-analyzed the discarded reads using less stringent cut-offs. In order to control the false positive matches introduced by the relaxed cut-offs, they used the evolutionary conservation of adjacency between neighboring genes as an additional criterion.

3.3 Method

HMMER uses E-values as the discrimination threshold to determine the membership of a query sequence. However, short reads may only generate low alignment scores and thus insignificant E-values. In particular, the conservation across the entire length of a domain family can be highly variable, posing a great challenge for classifying reads sequenced from poorly conserved sub-regions. In order to increase the sensitivity of aligning remotely-related short reads, we propose position-specific score cutoffs, by which poorly conserved regions allow more relaxed discrimination thresholds than well-conserved regions. However, the low thresholds can easily incur random matches. In order to control the false positive rate, we examine the position distribution of read alignments. The position distribution of read alignments on a truly encoded domain is expected to be more uniform than that on a domain that incurs random read alignments [50, 51]. Figure 3.3 shows schematic representations of three types of distributions of read alignments along a domain. The alignments in (A) and (B) are more likely to be random, so the corresponding domains may not be encoded in the data set. The alignment distribution in (C) is much more uniform, providing strong evidence for the existence of the underlying domain in the data set. Thus, by using relaxed position-specific score cutoffs and inspecting the distribution of alignments, we expect to classify more short reads into the correct domain families while not falsely reporting domains that are not encoded in the data.

Figure 3.3: Three types of alignment distributions.

3.3.1 Pipeline of MetaDomain

The input to MetaDomain includes sequence reads and a list of protein domains. The

output is a list of domains encoded in the underlying data set and the number of aligned

reads. Figure 3.4 shows a schematic representation of the pipeline of MetaDomain.

MetaDomain consists of three main stages: short read alignment, filtering, and classification. In the alignment stage, we use the Viterbi algorithm [17] to search for the best local alignment between a query sequence and a profile HMM-represented domain family. In the filtering stage, we first apply a position-specific score threshold to eliminate insignificant alignments. Then we remove stacked alignments with the same alignment positions inside a poorly conserved region. In the final stage, we use the number of aligned reads and the distribution of alignment positions to determine whether a domain is encoded.

Figure 3.4: Pipeline of MetaDomain. (Flowchart: sequence reads and Pfam domains are aligned with the Viterbi algorithm to give optimal local alignments; filtering with the position-specific threshold gives trimmed alignments; classification with the read number and domain coverage thresholds gives the transcribed or encoded protein domains.)

3.3.2 The Viterbi algorithm

The Viterbi algorithm aligns a query sequence to a profile HMM by searching for the most probable state path in the model. Unlike HMMER, MetaDomain directly aligns a DNA sequence to a profile HMM. To do so, we implicitly align translated peptides under different reading frames with a profile HMM. Let $\pi$ be a state path in a profile HMM $M$ and let $x$ be a query DNA sequence. The Viterbi algorithm searches for the most probable path $\pi^*$ such that $\pi^* = \arg\max_{\pi} P(x, \pi)$. The output of the Viterbi algorithm includes the optimal alignment and its score. As Viterbi is a standard algorithm designed for HMMs, we refer readers to Durbin et al. [17] for a detailed illustration of the dynamic programming equations for finding $\pi^*$. The major differences between our implementation and the standard Viterbi algorithm are: 1) our implementation accepts a DNA rather than a peptide sequence as input; 2) a local alignment can start and end with any state without incurring insertion or deletion penalties.
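As a rough illustration of the recurrence behind $\pi^*$, the following Python sketch runs log-space Viterbi on a small generic HMM. It is not MetaDomain's implementation: it takes a plain symbol sequence rather than DNA, ignores reading frames and the local start/end rule described above, and the function name `viterbi` and the toy two-state model are made up for this example.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for obs under a generic HMM.

    All inputs are hypothetical illustration values:
    log_start[s], log_trans[s][t] and log_emit[s][symbol] are log probabilities.
    """
    # score[s] = best log probability of any path that ends in state s
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict per step: best predecessor of each state
    for symbol in obs[1:]:
        prev, score, ptr = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            score[t] = prev[best] + log_trans[best][t] + log_emit[t][symbol]
            ptr[t] = best
        back.append(ptr)
    # Trace back from the best final state to recover the path
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


# Toy model (assumed values): state H mostly emits "a", state L mostly emits "b".
STATES = ["H", "L"]
LOG_START = {"H": math.log(0.5), "L": math.log(0.5)}
LOG_TRANS = {"H": {"H": math.log(0.9), "L": math.log(0.1)},
             "L": {"H": math.log(0.1), "L": math.log(0.9)}}
LOG_EMIT = {"H": {"a": math.log(0.9), "b": math.log(0.1)},
            "L": {"a": math.log(0.1), "b": math.log(0.9)}}

best_log_prob, best_path = viterbi("aabb", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

A profile HMM version would replace the two toy states with match, insert and delete states per model position, and a DNA-aware version would additionally have to consume three nucleotides per emitted amino acid.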

3.3.3 Alignment Filtering

MetaDomain employs two filtering mechanisms to increase its sensitivity in aligning short reads while maintaining a low false positive rate: position-specific thresholds (PSTs) and trimming.

3.3.3.1 Position-specific threshold

PST allows different alignment thresholds for well-conserved and poorly conserved regions. Let the length of a query DNA sequence be $L$ (in bp). Denote the profile HMM as $M$. Let $M_{i,j}$ be the sub-model formed by all consecutive states from the $i$th match state $M_i$ to the $j$th match state $M_j$. The upper bound of the alignment score against $M_{i,j}$ is the maximum score that can be generated by aligning any input sequence of length $j - i + 1$ with $M_{i,j}$. Let $a_{i,j}$ denote the transition probability from state $M_i$ to state $M_j$. Let $e_i(a)$ denote the probability of state $M_i$ emitting amino acid $a$. Then the upper bound $U_{i,j}$ for sub-model $M_{i,j}$ is calculated as follows:

$$U_{i,j} = \prod_{k=i}^{j} a_{k,k+1} \times \max_{a} e_k(a)$$

where $a_{j,j+1}$ is set to 1 because $M_j$ is the ending state of the sub-model.

We define the PST for the sub-model $M_{i,j}$ as:

$$\mathrm{PST}_{i,j} = \gamma U_{i,j}$$

where the coefficient $\gamma$ is a user-specified parameter in the range [0,1]. It can be flexibly adjusted to control the trade-off between the sensitivity and the false positive rate of MetaDomain. The default value is 0.6, which is used in our experiments.
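The two formulas above can be sketched in Python as follows. This is a minimal illustration that takes the product of probabilities literally; the data layout (`match_emissions`, `match_transitions`), the 0-based indexing and the function names are assumptions of this example, and a practical implementation might instead work with log-odds scores.

```python
def upper_bound(match_emissions, match_transitions, i, j):
    """U_{i,j}: product over k = i..j of a_{k,k+1} * max_a e_k(a).

    match_emissions[k] maps amino acid -> emission probability of match state k.
    match_transitions[k] is the transition probability from match state k to k+1.
    Indices are 0-based here (an assumption of this sketch).
    """
    u = 1.0
    for k in range(i, j + 1):
        # The transition out of the last state of the sub-model is set to 1.
        a = 1.0 if k == j else match_transitions[k]
        u *= a * max(match_emissions[k].values())
    return u


def position_specific_threshold(match_emissions, match_transitions, i, j, gamma=0.6):
    """PST_{i,j} = gamma * U_{i,j}, with gamma restricted to [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * upper_bound(match_emissions, match_transitions, i, j)


# Toy three-state model over a two-letter alphabet (made-up numbers).
emissions = [{"A": 0.5, "C": 0.5}, {"A": 0.8, "C": 0.2}, {"A": 0.25, "C": 0.75}]
transitions = [0.9, 0.8, 0.7]
u_full = upper_bound(emissions, transitions, 0, 2)                    # 0.45 * 0.64 * 0.75
pst_full = position_specific_threshold(emissions, transitions, 0, 2)  # 0.6 * u_full
```

A sub-model whose states have flat emission distributions gets a small $U_{i,j}$ and therefore a low threshold, which is the behaviour the text describes for poorly conserved regions.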

3.3.3.2 Alignment trimming

Alignments with scores larger than their corresponding PSTs pass the first filtering stage. As conservation varies along the length of each domain model, well-conserved sub-regions have high PSTs while poorly conserved sub-regions yield low PSTs. Thus, random sequences tend to be aligned to poorly conserved regions by MetaDomain, incurring a high FP rate. Our empirical experiments show that dozens of reads that are not sequenced from the underlying domain can be aligned to the same position in a poorly conserved sub-region. In order to minimize the effects of this noise, we discard stacked alignments that have the same alignment positions.
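One possible reading of the trimming step is sketched below in Python. The text does not say whether one representative of a stack is kept or the whole stack is dropped, so the sketch exposes both behaviours through a hypothetical `keep_one` flag. It also applies the rule along the whole model rather than only inside poorly conserved regions, which is a simplification.

```python
from collections import defaultdict


def trim_stacked(alignments, keep_one=True):
    """Collapse alignments that share the same (start, end) on the domain model.

    alignments: iterable of (read_id, start, end) tuples (assumed layout).
    keep_one=True keeps the first alignment of each stack;
    keep_one=False drops every alignment that belongs to a stack.
    """
    by_position = defaultdict(list)
    for aln in alignments:
        by_position[(aln[1], aln[2])].append(aln)
    kept = []
    for group in by_position.values():
        if len(group) == 1 or keep_one:
            kept.append(group[0])
    return kept


alns = [("r1", 5, 18), ("r2", 5, 18), ("r3", 7, 20), ("r4", 5, 18)]
collapsed = trim_stacked(alns)               # one representative per position
strict = trim_stacked(alns, keep_one=False)  # stacks removed entirely
```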

3.3.4 Protein domain classification

In this stage we extract two features from the collected read alignments for each domain: the number of aligned reads and the domain coverage. The domain coverage is the fraction of positions covered by at least one read alignment in a domain. MetaDomain then applies a simple decision tree to classify all the target domains into two classes: encoded domains and non-encoded domains. If both features of a domain are equal to or greater than their corresponding thresholds, the domain is classified as encoded. Otherwise it is considered not encoded in the sample. By default, the cutoff for domain coverage is 30%. Ideally, the cutoff for the number of aligned reads should be determined based on properties of the data such as sequencing depth. If users do not specify this value, we use 20 by default.
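The decision rule described above can be sketched in Python as follows. The defaults (20 reads, 30% coverage) come from the text; the function names and the representation of an alignment as a 1-based inclusive `(start, end)` pair on the domain model are assumptions of this example.

```python
def domain_coverage(alignments, domain_length):
    """Fraction of model positions covered by at least one alignment."""
    covered = set()
    for start, end in alignments:
        covered.update(range(start, end + 1))
    return len(covered) / domain_length


def is_encoded(alignments, domain_length, min_reads=20, min_coverage=0.30):
    """A domain is called encoded only if both features reach their cutoffs."""
    return (len(alignments) >= min_reads
            and domain_coverage(alignments, domain_length) >= min_coverage)


# Three alignments on a 100-position model cover positions 1-20 and 50-60.
example = [(1, 10), (5, 20), (50, 60)]
example_coverage = domain_coverage(example, 100)
```

With the default read cutoff, this three-read example is rejected even though its coverage exceeds 30%, because both conditions have to hold.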

3.4 Results

In order to evaluate the performance of MetaDomain on real data generated by next-generation sequencing technologies, we applied MetaDomain to protein domain analysis in two data sets. The first one is the transcriptome generated using RNA-seq for Burkholderia cenocepacia. As both the reference genome and its domain annotations are available, we can quantify the sensitivity and false positive (FP) rate of MetaDomain. The second one is metagenome data sequenced from soil. We applied MetaDomain to identify domains encoded in the underlying data. In addition, we compared HMMER and MetaDomain in both applications.

3.4.1 Identifying transcribed protein domains in transcriptome

In this experiment, we conducted transcribed domain analysis in the transcriptome from one strain of B. cenocepacia named AU1054 [52]. By using Illumina RNA-seq, the authors generated multiple samples for AU1054 in two growth media. We used one replicate of the cDNA sample of AU1054 in the cystic fibrosis growth medium. In total, 3,361,008 reads of length 41 bp were downloaded from the website provided by the authors. We evaluated the performance of read classification and domain identification of MetaDomain and HMMER.

3.4.1.1 Performance of read classification

The performance of read classification is quantified using both read classification sensitivity and FP (false positive) rate. In this experiment, the read classification performance is computed on reads that can be mapped to annotated domains. Below we sketch the main steps to obtain mapped reads for a domain using the reference genome and the domain annotations. First, we downloaded the genome of AU1054 and the annotated genes and domains from the IMG website [53]. There are 2,181 annotated Pfam domains. Second, the reads were mapped to the reference genome using Bowtie [54] with two mismatches allowed. Third, we compared the positions of read mapping and annotated domains. For a domain, all reads that fall into it are defined as "mapped" reads. Denote the set of mapped reads as $M$. All other (unmapped) reads constitute set $U$. For a domain classification tool, let the set of aligned reads for a domain be $A$. Thus, the sensitivity and FP rate of read classification for a domain are $\frac{|A \cap M|}{|M|}$ and $\frac{|A - M|}{|U|}$, respectively. A sensitivity of 1 indicates that all mapped reads can be aligned. A zero FP rate indicates that only mapped reads are aligned to the domain.
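The two read-level measures defined above reduce to set operations. The following Python sketch shows them for a single domain; the function name and the read identifiers are made up, and the sketch assumes the domain has at least one mapped and one unmapped read, so that neither denominator is zero.

```python
def read_classification_rates(aligned, mapped, unmapped):
    """Return (sensitivity, fp_rate) for one domain.

    aligned:  reads the tool aligned to the domain (set A)
    mapped:   reads that fall into the annotated domain (set M)
    unmapped: all other reads (set U)
    """
    aligned, mapped, unmapped = set(aligned), set(mapped), set(unmapped)
    sensitivity = len(aligned & mapped) / len(mapped)
    fp_rate = len(aligned - mapped) / len(unmapped)
    return sensitivity, fp_rate


mapped_reads = {"r1", "r2", "r3", "r4"}
unmapped_reads = {"r%d" % i for i in range(5, 15)}  # ten reads outside the domain
aligned_reads = {"r1", "r2", "r3", "r5"}
sens, fpr = read_classification_rates(aligned_reads, mapped_reads, unmapped_reads)
```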

Of the 2,181 annotated families, we evaluated the performance of HMMER and MetaDomain on the 1406 families that have at least one mapped read. Of the 1406 tested domains, HMMER could not align any read to 1150 domains, resulting in zero sensitivity and zero FP rate. For the remaining 256 domains, all reads aligned by HMMER are non-mappable reads, resulting in zero sensitivity and a positive FP rate. The comparison between HMMER and MetaDomain is summarized using a bubble chart in Figure 3.5. The biggest bubble indicates that HMMER has zero sensitivity and zero FP rate for 1150 domains. As we can see, it is highly difficult for HMMER to correctly align reads as short as 41 bp. There are two reasons for the low sensitivity of HMMER on short reads. First, the parameter training in the E-value calculation of HMMER is based on much longer reads (100 amino acids). Thus, the small alignment scores generated by the short reads yield large E-values and cannot pass the E-value threshold. Second, the small alignment scores of short reads may not pass the filtration stage of HMMER.

3.4.1.2 Identifying transcribed domains in the transcriptome

Figure 3.5 only shows the read classification performance. MetaDomain uses both aligned read number and domain coverage as thresholds for domain identification. We expect that the additional constraint will reduce the false positive rate in domain identification. Because of the low read classification sensitivity, we speculate that HMMER will have low sensitivity in identifying transcribed domains.

Figure 3.5: Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate. (Bubble chart; x-axis: FP rate, in units of 10^-5; y-axis: sensitivity; series: HMMER3 and MetaDomain.)

In order to quantify the performance of domain identification, we need to build positive and negative test sets, which include transcribed and non-transcribed domains based on mapped reads. There is no commonly accepted criterion to define transcribed genes using the number of mapped reads. Various expression scores, such as the average coverage depth across the entire length of each gene [55] and reads per kilobase of exon model per million mapped reads (RPKM), are used to quantify transcriptional level. In addition, the cutoffs for defining highly transcribed, lowly transcribed, or non-transcribed genes vary across applications [56]. In this work, we define transcribed domains based on the rationale that a truly transcribed domain should be mapped by a number of reads at different positions. Correspondingly, we use the following criteria to determine whether a domain inside a gene is transcribed: 1) at least N reads are mapped to the domain; 2) at least 30% of positions in the domain are mapped by reads. A domain is labeled "non-transcribed" if the number of mapped reads is zero. Domains that fall between the criteria for transcribed and non-transcribed domains are labeled "unknown" and are excluded from the test sets. Table 3.1 shows how the sizes of the positive and negative test sets change with the cutoff N.
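The labeling criteria above can be sketched as a small Python function. The representation of each mapped read as a 1-based inclusive `(start, end)` interval in domain coordinates, and the function name `label_domain`, are assumptions of this example rather than details given in the text.

```python
def label_domain(mapped_intervals, domain_length, n_cutoff, min_coverage=0.30):
    """Label one annotated domain from its mapped reads.

    Returns "transcribed", "non-transcribed" or "unknown".
    """
    if not mapped_intervals:
        return "non-transcribed"
    covered = set()
    for start, end in mapped_intervals:
        covered.update(range(start, end + 1))
    enough_reads = len(mapped_intervals) >= n_cutoff
    enough_coverage = len(covered) / domain_length >= min_coverage
    if enough_reads and enough_coverage:
        return "transcribed"
    return "unknown"


# Ten overlapping reads that tile positions 1-46 of a 100-position domain.
tiled = [(4 * i + 1, 4 * i + 10) for i in range(10)]
```

The same ten reads give "transcribed" for N=10 and "unknown" for N=20, which mirrors how the positive set shrinks as N grows in Table 3.1.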

Table 3.1: Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.

N     transcribed   unknown   non-transcribed
10    318           1317      546
15    262           1373      546
20    226           1409      546
25    195           1440      546
30    169           1466      546

Intuitively, bigger N creates an easier case for domain classi�cation than smaller N.

We aligned all reads to the transcribed and non-transcribed domains using MetaDomain and HMMER. The "unknown" domains were removed due to their ambiguity. For HMMER, we first translated the short reads into peptide sequences using 6-frame translations. We then aligned the domains with the translated sequences using 1000 as the E-value threshold, which was chosen to maximize the sensitivity. For MetaDomain we directly aligned the short reads with the domains. The pipeline in Figure 3.4 was used to output a list of transcribed domains for MetaDomain. Let $D^+$ and $D^-$ be the sets of transcribed and non-transcribed domains identified using the read mapping results in Section 3.4.1.2. Let $M^+$ and $M^-$ be the sets of domains predicted as transcribed and non-transcribed by MetaDomain or HMMER. The sensitivity and FP rate of domain classification tools are defined using the following equations:

$$\text{Sensitivity} = \frac{|D^+ \cap M^+|}{|D^+|}$$

$$\text{FP rate} = \frac{|D^- \cap M^+|}{|D^-|}$$
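The 6-frame translation applied to the reads before running HMMER, mentioned above, can be sketched in Python as follows. The sketch uses the standard genetic code, writes stop codons as `*` and codons with ambiguous bases as `X`; these output conventions and the function names are choices of this example, not something the text specifies.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(read):
    """Return the three forward frames followed by the three reverse frames."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]  # reverse complement
    return [translate(strand[offset:])
            for strand in (read, reverse)
            for offset in range(3)]


frames = six_frame_translations("ATGGCCTAA")
```

A 41 bp read yields six peptides of 13 residues each, which gives a sense of how little sequence each HMMER alignment has to work with in this experiment.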

The values of D and M are affected by several options. First, $D^+$ and $D^-$ change with the cutoff N, as shown in Table 3.1. Second, we used both the domain coverage and the number of aligned reads to determine whether a domain is encoded or transcribed. In this experiment, the cutoff for domain coverage is 30%, which we found reasonable across different experiments. Thus, $M^+$ and $M^-$ mainly change with the required number of reads aligned to a domain. For simplicity, we denote this cutoff as τ. Increasing τ implies a more stringent constraint for defining transcribed domains, and thus might result in lower sensitivity and a smaller FP rate. Decreasing τ is likely to increase the sensitivity while incurring a higher FP rate. In order to compare the performance of MetaDomain and HMMER under different τ, we plotted the ROC curves by changing τ from 1 to N for N=10, 20, and 30 in Figure 3.6.
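A sweep over τ of the kind described above could be computed along the following lines. This Python sketch assumes that, for each domain, the number of aligned reads and the domain coverage have already been collected from a tool's output; the dictionary layout, the function name `roc_points` and the toy numbers are all hypothetical.

```python
def roc_points(aligned_counts, coverage, positives, negatives,
               max_tau, min_coverage=0.30):
    """Return a list of (tau, fp_rate, sensitivity) for tau = 1..max_tau.

    aligned_counts[d]: number of reads a tool aligned to domain d
    coverage[d]:       fraction of positions of d covered by those alignments
    positives:         set of domains labeled transcribed
    negatives:         set of domains labeled non-transcribed
    """
    points = []
    for tau in range(1, max_tau + 1):
        # Domains the tool would report as transcribed at this cutoff
        predicted = {d for d, n in aligned_counts.items()
                     if n >= tau and coverage[d] >= min_coverage}
        sensitivity = len(predicted & positives) / len(positives)
        fp_rate = len(predicted & negatives) / len(negatives)
        points.append((tau, fp_rate, sensitivity))
    return points


counts = {"d1": 12, "d2": 2, "d3": 1, "d4": 0}
cov = {"d1": 0.5, "d2": 0.4, "d3": 0.35, "d4": 0.0}
curve = roc_points(counts, cov, positives={"d1", "d2"},
                   negatives={"d3", "d4"}, max_tau=3)
```

In the toy data, raising τ from 1 to 2 removes the single false positive without losing a true positive, and raising it to 3 then costs sensitivity, which is the trade-off the ROC curves visualize.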

Figure 3.6: ROC curves of HMMER and MetaDomain. (Three panels: N=10, N=20, and N=30; x-axis: FP rate; y-axis: sensitivity; series: HMMER3 and MetaDomain.)

Figure 3.6 shows that HMMER is highly specific (FP rate ≤ 1.3%). However, as we speculated, its sensitivity is low, with the highest sensitivity being only 0.135. HMMER misses a large portion of short reads that can be mapped to protein domains even when we use a very relaxed E-value cutoff. When both tools incur an FP rate of 0.02, the sensitivity of MetaDomain is 0.53 vs. 0.13 for HMMER. When N decreases from 30 to 10, the size of the positive test set $D^+$ becomes larger and the sensitivity of both HMMER and MetaDomain decreases. Note that the sensitivity and FP rate of HMMER remain the same for many different thresholds (i.e., τ), resulting in compact ROC curves. Overall, the ROC curves show that MetaDomain can achieve higher sensitivity while keeping an FP rate similar to that of HMMER for domain classification in this experiment. In addition, Figure 3.6 provides guidance on determining an appropriate τ for MetaDomain in order to achieve the desired sensitivity and FP rate.

On average, it took MetaDomain 280 seconds to align 752,156 reads with one domain on

a 2.
