Date post: 01-Feb-2021
  • PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA

    By

    Yuan Zhang

    A DISSERTATION

    Submitted to Michigan State University

    in partial fulfillment of the requirements for the degree of

    Computer Science - Doctor of Philosophy

    2013

  • ABSTRACT

    PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA

    By

    Yuan Zhang

    Sequence analysis is the process of analyzing DNA, RNA, or peptide sequences using a wide range of methodologies in order to understand their functions, structures, or evolutionary history. Next-generation sequencing (NGS) technologies generate large-scale sequence data with high coverage and nucleotide-level resolution at low cost, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, and ncRNA identification. Functional analysis of NGS sequences is therefore increasingly important because it provides information, such as gene expression, protein composition, and phylogenetic complexity, about the species from which the sequences are generated. One basic step in functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences.

    The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences against annotated protein or domain databases. There are two types of domain analysis methods: pairwise alignment and profile-based similarity search. The first uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as those in Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity in remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis.

    There are several challenges in protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms, such as pyrosequencing, have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity for short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I address these three challenges using different methods. Specifically, we have proposed four tools: HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME detects and corrects frameshift errors in sequences generated by pyrosequencing technology, and thus accurately classifies metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase sensitivity while keeping the false positive rate low. SALT combines position-specific score thresholds with graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has lower memory usage, higher gene coverage, and a lower chimera rate than existing tools. Finally, I summarize my work and briefly discuss future work.

  • ACKNOWLEDGMENTS

    First and foremost, I would like to thank my adviser, Dr. Yanni Sun. Her decision to admit me as her PhD student four years ago gave me the precious opportunity to study at Michigan State University and led me into the world of bioinformatics. During these four years under her guidance, I have made continuous progress in several aspects, from reading research papers, proposing research topics, developing methods, and designing experiments to writing papers. More importantly, I have gradually improved my ability to conduct in-depth analysis of sophisticated research problems, both independently and collaboratively, and to use scientific methodologies to solve challenging problems. She also gave me many suggestions on how to effectively present our work to an audience, especially to people from other research areas. This ability is very important because it will profoundly determine my capability to collaborate in a team of people from different backgrounds.

    I also want to thank the other committee members, Dr. C. Titus Brown, Dr. Pang-Ning Tan, and Dr. James R. Cole, who gave me many useful suggestions during the course of my PhD program. I also thank my lab mates Rujira Achawanantakun, Jikai Lei, Cheng Yuan, Prapaporn Techa-angkoon, and Jiao He. During these years, we have had productive discussions and cooperation on various research topics, and I have received great help from them. I would like to acknowledge my colleagues at BeachMint Inc. during my summer internship, especially Douglas Cohen, Jeff Cooper, and Manunya Rozelle. With their help, I learned how to apply theories and methods to solve challenging problems in industry. I gratefully acknowledge the other faculty and staff of the CSE department, especially Dr. Rong Jin, Dr. Jin Chen, Linda Moore, and Norma Teague. I also owe many thanks to my friends at MSU and in China. All of them have given me a lot of support and help during these years.

    My final and most important acknowledgement must go to my family. They have always given me persistent and determined love and support.

  • TABLE OF CONTENTS

    LIST OF TABLES

    LIST OF FIGURES

    Chapter 1 Introduction
        1.1 Next-generation sequencing technologies
        1.2 Protein domain analysis of NGS sequences
        1.3 Challenges with protein domain analysis of NGS sequences

    Chapter 2 Protein domain classification for metagenomic sequences containing frameshift errors
        2.1 Background
        2.2 Related work
        2.3 Method
            2.3.1 Error models
            2.3.2 The augmented Viterbi algorithm for sequencing error correction
            2.3.3 Running time analysis
        2.4 Results
            2.4.1 Accuracy of HMM-FRAME
            2.4.2 Using HMM-FRAME in "Targeted Metagenomic"
                2.4.2.1 Protein domain analysis of nifH sequences
                2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes
            2.4.3 Protein domain classification in the deep mine data set
        2.5 Conclusion

    Chapter 3 Profile HMM-based protein domain classification for short sequences
        3.1 Background
        3.2 Related Work
        3.3 Method
            3.3.1 Pipeline of MetaDomain
            3.3.2 The Viterbi algorithm
            3.3.3 Alignment Filtering
                3.3.3.1 Position specific threshold
                3.3.3.2 Alignment trimming
            3.3.4 Protein domain classification
        3.4 Results
            3.4.1 Identifying transcribed protein domains in transcriptome
                3.4.1.1 Performance of read classification
                3.4.1.2 Identifying transcribed domains in the transcriptome
            3.4.2 Protein domain analysis in a soil metagenomic data set
        3.5 Conclusion

    Chapter 4 A Sensitive and accurate protein domain classification tool (SALT) for short reads based on profile HMMs and graph algorithms
        4.1 Background
        4.2 Related Work
        4.3 Methods
            4.3.1 Overview of SALT
            4.3.2 Stage 1: profile HMM-based filtration
                4.3.2.1 Position-specific score threshold
            4.3.3 Stage 2: contig generation
                4.3.3.1 Constructing a hit graph for a family
                4.3.3.2 Find the K longest paths
            4.3.4 Stage 3: E-value computation and contig selection
            4.3.5 Running time analysis
        4.4 Results
            4.4.1 Protein domain classification of very short reads
                4.4.1.1 Determining the true membership of reads
                4.4.1.2 Performance evaluation
            4.4.2 Protein domain classification of an RNA-Seq data of Arabidopsis
            4.4.3 Protein domain classification of a non-model organism
        4.5 Discussion
        4.6 Conclusion

    Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SAT-Assembler) for NGS data
        5.1 Background
        5.2 Related Work
        5.3 Methods
            5.3.1 Overview of SAT-assembler
            5.3.2 Profile HMM-based homology search
            5.3.3 Alignment informed graph construction
            5.3.4 Pruning and optimization of overlap graphs
            5.3.5 Guided graph traversal using multiple information
            5.3.6 Contig scaffolding
            5.3.7 Running time analysis
        5.4 Results
            5.4.1 Gene assembly in an RNA-Seq data set of Arabidopsis
                5.4.1.1 Edge creation performance
                5.4.1.2 Performance comparison with other assembly tools
            5.4.2 Targeted gene assembly in a human gut metagenomic data set
        5.5 Conclusion

    Chapter 6 Conclusion and future work

    Bibliography

  • LIST OF TABLES

    Table 2.1 Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

    Table 3.1 Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.

    Table 4.1 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Burkholderia cenocepacia.

    Table 4.2 Performance comparison of SALT against the other classifiers on the RNA-Seq data set of Arabidopsis.

    Table 4.3 Classification results generated by different classifiers on the Radix balthica data set.

    Table 4.4 Description of transcribed families uniquely identified by SALT.

    Table 5.1 Edge creation performance of three strategies on the RNA-Seq data set of Arabidopsis.

    Table 5.2 Performance comparison between different assembly tools on the RNA-Seq data set of Arabidopsis.

    Table 5.3 Performance comparison between different assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set

  • LIST OF FIGURES

    Figure 2.1 Frameshifts cause short alignments with marginal scores

    Figure 2.2 Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)

    Figure 2.3 Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.

    Figure 2.4 Protein domain classification results for the black sample in the deep mine data set.

    Figure 3.1 Change of the read classification sensitivity of HMMER over read length and the average sequence identity of domain families.

    Figure 3.2 Histogram of the average pairwise sequence identity for 2558 domains

    Figure 3.3 Three types of alignment distributions.

    Figure 3.4 Pipeline of MetaDomain.

    Figure 3.5 Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate.

    Figure 3.6 ROC curves of HMMER and MetaDomain.

    Figure 3.7 Read length distribution in the soil data set.

    Figure 3.8 Reads aligned by HMMER and MetaDomain.

    Figure 3.9 The distributions of aligned reads for PF09703 by HMMER and MetaDomain.

    Figure 4.1 Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two different genes. Both genes are transcribed and sequenced. Red lines: positive reads. Blue lines: negative reads.

    Figure 4.2 The pipeline of SALT.

    Figure 4.3 (A) Thirteen reads and their alignment layout w.r.t. the profile HMM represented by its matching states. The alignment scores are shown in the table. Blue reads: negative reads. Red reads: positive reads. (B) The constructed hit graph when k* = 4. For simplicity of explanation, mismatches are not allowed in this simple example (i.e., e = 0). Red nodes are created by positive reads. Blue nodes are created by negative reads. (C) The hit graph after removing transitive overlaps and adding the root node.

    Figure 4.4 ROC curves of different classifiers. HHblits and SSAKE+HMMER are listed in separate embedded windows because their FP rates are orders of magnitude larger than others.

    Figure 4.5 A Venn diagram of the transcribed families identified by different classifiers.

    Figure 5.1 The pipeline of SAT-assembler. Reads of the same color belong to the same gene family. Reads from different genes of the same family are distinguished using different patterns. Reads shared by multiple genes from the same family have multiple patterns.

    Figure 5.2 (A) Two reads a and b sequenced from different genes of the same family are aligned to the profile HMM of the family. Their sequence overlap is indicated in red. (B) Read a and read b have an alignment overlap of 66 and a sequence overlap of 25 (in bold). (C) The alignment between the translated peptides of a and b is 22 residues.

    Figure 5.3 A graph containing reads from two different genes A and B. Nodes in red (v1, v4, and v7) and in blue (v2, v5, and v8) are from gene A and gene B respectively. Nodes in black (v3 and v6) are chimeric nodes because they are shared by the two genes. Arrows with solid lines are real edges. Arrows with dotted lines and dashed lines indicate paired-end reads and transitive edges between two nodes respectively.

    Figure 5.4 Three contigs generated from a metagenomic data set. The green parts of the contigs are contained in the target gene and thus are gene segments. The blue parts of the contigs are not gene segments.

    Figure 5.5 Chimera rate versus gene coverage when k-mer size or overlap threshold changes for different assembly tools. These values are average values of the assemblers' performance on 3,188 input families.

  • Chapter 1

    Introduction

    1.1 Next-generation sequencing technologies

    In bioinformatics, sequencing means determining the primary structure of a biological sequence. Before the advent of new sequencing technologies, Sanger sequencing was the main method for sequencing DNA. However, this technique has several limitations. For example, Sanger sequencing is not applicable to sequencing a small amount of DNA, which makes it expensive and inaccessible to most small labs. Also, the length of the DNA that can be sequenced is limited.

    Next-generation sequencing (NGS) technologies were developed in response to the demand for low-cost sequencing. These new technologies use massively parallel methods and can produce large-scale sequence data at low cost, bringing large-scale sequencing within the reach of many scientists. Moreover, they generate much more sequence data per run, with high coverage and nucleotide-level resolution.

    1.2 Protein domain analysis of NGS sequences

    Inferring functions from sequences is important in analyzing the different types of data generated by NGS technologies. One basic step in functional analysis is to classify NGS sequences into annotated functional categories, such as protein families or protein domain families. Protein domain analysis has been widely used for functional annotation of RNA-Seq data [1, 2, 3, 4]. In particular, quantifying the expression levels of protein domains helps us understand how transcriptional changes of domains are associated with sequencing conditions, sampled tissues, or experimental treatments in RNA-Seq data. For example, computational domain analysis was applied to identify domains that play a role in vernalization and efflux transporters in the gibberellin response in sugar beet [1]. Domain analysis is also frequently used to evaluate and compare the gene annotation quality of different gene-finding tools [3] or to compare the domain composition of data sampled using different techniques [4]. Protein domain analysis has also been used to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host [5, 6, 7]. For example, Ellrott et al. investigated the distribution of protein families in the currently available human gut genomic and metagenomic data [8]. Schlüter et al. applied HMMER to understand the genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant [9]. The phylogenetic algorithm CARMA [10] uses all Pfam domains and protein families as phylogenetic markers to identify the source organisms of environmental DNA fragments.

    There are two major comparative methods for protein domain analysis. The first is based on pairwise sequence alignment tools such as the BLAST software suite [11]. Query sequences are classified by comparing them with annotated protein databases such as NCBI-nr using BLASTX [11]. The second is profile-based similarity search, which classifies queries against characterized protein domain or family databases such as Pfam [12], TIGRFAM [13], and FIGfams [14]. There are also comprehensive protein domain search tools such as InterProScan [15], which combines different sequence- and profile-based domain recognition methods from the InterPro [16] consortium member databases into one resource.

    Although BLAST is one of the most efficient protein homology search tools, probabilistic model-based methods have much better sensitivity for remote protein homology recognition. Using profile hidden Markov models (HMMs) to represent a protein family greatly improves homology search sensitivity between highly diverged sequences [17]. It is thus desirable to conduct protein domain classification using profile HMM-based tools such as HMMER [18]. In conjunction with the fast-growing protein domain family database Pfam [12], which contains over 10,000 annotated protein domain families, HMMER can classify sequences into different domain families with high accuracy. In addition, the latest implementation of the profile HMM-based domain classification tool, HMMER 3.0 [18], has achieved speed comparable to BLAST, making it suitable for large-scale protein compositional analysis. For convenience, we use HMMER to refer to HMMER 3.0 hereafter unless otherwise specified.
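
    The idea behind profile-based search can be illustrated with a deliberately simplified sketch. The Python code below builds a gap-free position-specific scoring matrix from a toy alignment and scores two queries against it. It is not a profile HMM and not HMMER's algorithm: it has no insert or delete states, and the alignment, alphabet, uniform background, pseudocount, and function names (build_profile, score) are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_profile(alignment, alphabet, pseudocount=1.0):
    """Per-column log-odds scores (vs. a uniform background) from a gap-free alignment."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            residue: math.log((counts[residue] + pseudocount) / total / background)
            for residue in alphabet
        })
    return profile


def score(profile, query):
    """Sum of position-specific log-odds scores for a query as long as the profile."""
    return sum(column[residue] for column, residue in zip(profile, query))


# Toy family and alphabet (hypothetical, for illustration only).
alignment = ["ACDE", "ACDF", "ACEE", "GCDE"]
alphabet = "ACDEFG"

profile = build_profile(alignment, alphabet)
family_like = score(profile, "ACDE")   # agrees with the conserved columns
unrelated = score(profile, "GFAC")     # mostly residues rare in each column
print(family_like, unrelated)
```

    In this toy setting the family-like query scores above zero and the unrelated query below zero, because each column rewards the residues that the family conserves at that position.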

    1.3 Challenges with protein domain analysis of NGS sequences

    Although profile HMM-based methods have been successfully applied to genome-wide domain analysis, there are still many challenges in protein domain analysis of NGS sequences, especially complex metagenomic data. First, sequences generated by some NGS platforms, such as pyrosequencing, contain sequencing errors, including insertions or deletions of nucleotides, especially in homopolymer regions. These errors create frameshifts during translation, making it difficult to classify the derived peptide sequences into their native families. Second, as the length of the query reads decreases, existing tools have low sensitivity in classifying these short reads, especially for poorly conserved domain families. Many sequencing technologies such as Illumina still generate short reads of 35 bp to 150 bp. Moreover, protein sequences encoded in individual metagenomic sequence reads may share only a small overlap with existing protein families. Therefore, a sizable portion of many data sets still consists of short reads. Third, microbial communities usually contain a large number of different microbial species, complicating the functional annotation of metagenomic data.

    In order to address these challenges and improve the performance of protein domain classification, we have proposed three tools: HMM-FRAME, MetaDomain, and SALT. HMM-FRAME is designed to accurately classify metagenomic sequences containing frameshift errors. MetaDomain and SALT are designed to directly classify short reads into their native protein domain families with better sensitivity than existing tools. Compared to MetaDomain, SALT incorporates graph algorithms to improve the accuracy of protein domain classification. In my future work, I will focus on accurate and scalable gene assembly from complex metagenomic data.

  • Chapter 2

    Protein domain classification for metagenomic sequences containing frameshift errors

    2.1 Background

    Culture-independent methods and high-throughput sequencing technologies now enable us to obtain community random genomes (metagenomes) from different habitats such as arctic soils and the mammalian gut. Currently, metagenomic annotation focuses on phylogenetic complexity and protein composition analysis. An important component of protein composition analysis is protein domain classification, which classifies a putative protein sequence into annotated domain families and thus aids functional analysis. Profile HMM-based alignment is the state-of-the-art method for protein domain classification because of its high sensitivity in classifying remote homologs. In conjunction with the Pfam database, HMMER [18] can accurately classify query protein sequences into existing domain families. In addition, the latest version of HMMER achieves speed comparable to BLAST, making it applicable to large-scale metagenomic data sets.

    However, HMMER cannot optimally classify sequences containing frameshift errors. In HMMER-based domain analysis, the six-frame translations of a sequence read or a predicted gene fragment are aligned with annotated protein domain families. One problem with this method is that sequencing errors, including insertions or deletions of nucleotides, create frameshifts during translation. As a result, the derived peptide sequences are likely to generate alignments with marginal scores. Because HMMER uses alignment scores, E-values, or lengths to determine family membership, these reads become unclassifiable or can be falsely recognized as "novel" proteins during downstream analysis. Figure 2.1 illustrates how insertion or deletion errors cause marginal alignment scores.

    X1X2X3 X4X5X6 X7X8X9 X10X11X12 X13X14X15 X16X17X18 X19X20X21

    X1X2X3 X4X5X6 X X7X8X9 X10X11X12 X13X14X15 Y X16X17X18 X19X20X21

    aa11 aa12 aa13 aa14 aa15 aa16 aa17

    aa11 aa12 aa23 aa24 aa25 aa36 aa37

    Figure 2.1: Frameshifts cause short alignments with marginal scores

    In Figure 2.1, Xi is the ith base of a DNA sequence. Every codon is underscored. aij (labeled aaij in the figure) is the jth amino acid of a peptide sequence derived under reading frame i. The correct peptide sequence can be derived from the error-free sequence (shown at the top of the figure) under reading frame 1. Because of the insertion of two nucleotides (X and Y, in bold), the correct peptide sequence is the concatenation of three short peptide sequences derived using different reading frames. Thus, each peptide sequence derived using one reading frame can only generate short alignments with insignificant scores.
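
    The effect described for Figure 2.1 can be reproduced with a small sketch. The Python code below translates a toy 21-base read in each forward reading frame before and after two single-base insertions. The read, the inserted bases, and the helper name translate are assumptions chosen for illustration; only the standard genetic code is taken as given. Frame offsets 0, 1, and 2 in the code correspond to reading frames 1, 2, and 3 in the text.

```python
# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna from offset `frame` (0, 1, or 2); a trailing partial codon is ignored."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )


# Hypothetical error-free read: seven codons, like the seven codons in Figure 2.1.
clean = "ATGGCTAAAGGTCTGGAAGTT"

# Insert one base after codon 2 and one after codon 5, mirroring X and Y in the figure.
with_errors = clean[:6] + "C" + clean[6:15] + "T" + clean[15:]

print("error-free, frame 0:", translate(clean, 0))
for frame in range(3):
    print("with errors, frame", frame, ":", translate(with_errors, frame))
```

    With these inputs, frame 0 of the corrupted read recovers only the first two residues of the original peptide, frame 1 contains the middle three, and frame 2 ends with the last two, so no single frame supports a long alignment.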

    This problem is more serious in domain analysis of metagenomic data sets. Given the high complexity of many metagenomic data sets, high-quality genome assembly is not always available, so protein annotation can only be conducted on short sequence reads. The average read length varies from 25-35 to around 400 bases for the next-generation sequencing methods currently in use. On average, there is about one open reading frame per 1000 base pairs in bacterial genomes. Depending on gene size, many gene fragments in metagenomic sequence reads may share only a small overlap with existing domain families, generating even shorter profile HMM alignments with significantly lower scores.

    Although a number of tools [19, 20, 21, 22, 23, 24] exist for frameshift detection, they are not designed for protein domain classification using profile HMMs. In addition, these tools do not incorporate the sequencing error patterns associated with next-generation sequencing technologies. A clear disadvantage is that they do not distinguish between error rates inside and outside homopolymer regions in pyrosequencing reads. The goal of this work is to design an accurate profile HMM alignment method that can incorporate any given error pattern. Our experiments show that our tool has high sensitivity (> 95%) in detecting sequencing errors and a low false positive rate (~0.15%). By correcting insertion and deletion errors, it can generate longer alignments with significantly higher alignment scores, and thus provide more accurate protein domain classification.
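
    To make the distinction between positions inside and outside homopolymer regions concrete, the following Python sketch locates homopolymer runs in a read and assigns each base one of two indel rates. It is only an illustration of the idea: the minimum run length, the two rates, the example read, and the function names are placeholders, not values or code from this work.

```python
from itertools import groupby


def homopolymer_runs(read, min_len=3):
    """Return (start, length, base) for each run of identical bases of length >= min_len."""
    runs, pos = [], 0
    for base, group in groupby(read):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, length, base))
        pos += length
    return runs


def indel_rate_per_base(read, rate_inside, rate_outside, min_len=3):
    """Give every base an indel rate depending on whether it lies in a homopolymer run."""
    rates = [rate_outside] * len(read)
    for start, length, _ in homopolymer_runs(read, min_len):
        rates[start:start + length] = [rate_inside] * length
    return rates


read = "ACGTTTTTGCAAAC"                      # hypothetical read
runs = homopolymer_runs(read)
# The two rates below are made-up placeholders, not measured error rates.
rates = indel_rate_per_base(read, rate_inside=0.01, rate_outside=0.001)
print(runs)
print(rates)
```

    A position-aware alignment method could consult such per-base rates when deciding how heavily to penalize a proposed insertion or deletion at a given position.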

    2.2 Related work

    A number of programs exist to handle frameshifts through DNA versus protein sequence alignment. The simplest methods discard sequences that might contain frameshifts rather than trying to correct them. For example, BLASTX provides insightful information about whether a query DNA sequence contains frameshifts using six-frame translations. However, it neither explicitly outputs the positions of insertions or deletions that create frameshifts, nor tries to fix them by constructing an alignment from pieces obtained from different reading frames. Other tools are available to detect and fix frameshift errors automatically. Frame [19] uses BLASTX to compare all six reading frames of the query nucleotide sequence against protein sequences. The aligned regions are then combined for frameshift detection. Guan et al. [20], Zhang et al. [21], and Halperin et al. [22] describe dynamic programming algorithms for frameshift detection during pairwise DNA and protein sequence alignment. Instead of using all reading frames of a DNA sequence to maximize the alignment score, another group of tools [23, 24] translates a protein sequence back into DNA sequences and formulates the alignment problem as a network matching problem. Frameshift detection has also been applied to finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions [25, 26, 27].

    Some gene-finding tools detect frameshifts. FrameD [28] relies on a directed acyclic graph for gene prediction in the presence of frameshifts. Kislyuk et al. [29] apply an ab initio method to detect possible frameshifts from coding potential generated by GeneMark [30]. GeneTack [31] and FragGeneScan [32] use hidden Markov models for ab initio frameshift detection in gene finding.

    Despite the extensive study of frameshift detection, the above programs are not designed for protein family classification through DNA versus protein family alignment. Alternatively, GeneWise [33], a widely used DNA versus protein alignment tool, allows comparison of a DNA sequence with a profile HMM. Our algorithm differs from GeneWise by explicitly incorporating a position-specific error model that is trained on data from different sequencing platforms such as 454 GS FLX Titanium.

    2.3 Method

    The representative protein domain classification tool HMMER [18] classifies a query protein sequence into a profile HMM-represented protein family using the Viterbi or the Forward algorithm [17]. The Viterbi algorithm aligns a query protein sequence to a profile HMM by searching for the most probable state path in the model. If the alignment score or E-value meets the pre-defined threshold, the query is classified into the corresponding family. The alignment generated by the Viterbi algorithm only accounts for the difference caused by evolutionary divergence between a sequence and a protein family. In order to classify error-containing sequences into their native families, the alignment algorithm must detect the differences resulting from both evolution and sequencing errors.

    In this section, we describe HMM-FRAME, the implementation of an augmented Viterbi algorithm that searches for the optimal alignment between a DNA query and a profile HMM by considering both evolutionary divergence and sequencing errors. HMM-FRAME differs from HMMER in the following ways: 1) HMM-FRAME directly accepts a DNA sequence as input, 2) HMM-FRAME accepts a sequencing error model as input, 3) HMM-FRAME can detect and fix frameshifts caused by sequencing errors in the DNA sequence. The output alignment indicates which bases are inserted or deleted due to evolutionary change or sequencing error.

    2.3.1 Error models

    Here we describe the error models used in our experiments. Different sequencing technologies may have different types of errors. For example, previous work [34, 35, 36] has shown that insertions and deletions occur more often in homopolymer regions than in non-homopolymer regions for pyrosequencing reads. Substitution errors occur more often than insertions or deletions in Illumina sequencing reads. Because deletion or insertion errors cause frameshifts, we focus on applying HMM-FRAME to pyrosequencing data sets.

    In this work, we consider two error models. The first one is a published model trained from GS20 sequencing reads [34]. The insertion and deletion error rates in non-homopolymer and homopolymer regions are 0.0007 and 0.0044, respectively. The second error model is computed on data from the FLX Titanium sequencing platform. We obtained a set of Titanium sequence reads (Cole and Wang, unpublished) extracted from the region H of the 16S rRNA, which were amplified from the Baylor mock community (22 strains, 24 sequences). Then we computed error rates using insertions and deletions that were annotated by generating careful Needleman-Wunsch alignments between the Titanium sequencing reads and the control sequences. In total, 7,040 sequences passed the initial quality control of RDP [37] after contamination and chimera detection. There were 1,721 insertion and deletion errors. Note that PCR, which was used to generate the amplicons of the sample, can introduce errors. However, because most of the errors introduced by PCR are substitution errors, we assumed that the deletions and insertions were mainly sequencing errors. The derived error rates for homopolymers of different sizes were: 1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688, 5: 0.0372, 6: 0.00167, 7: 0.143, where the first number is the size of the homopolymer region (1 means non-homopolymer) and the second number is the rate of insertion and deletion errors. If we sum the error rates for homopolymer regions of different sizes, the insertion and deletion error rates for non-homopolymer and homopolymer regions were 0.0005 and 0.001, respectively. They are slightly smaller than the published GS20 error rates [34]. We will compare the performance of the two models on a data set with annotated errors in Section 2.4.
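To illustrate how a homopolymer-dependent error model of this kind could be applied to a read, the following is a minimal sketch. It looks up a per-base insertion/deletion error probability from the Titanium rates listed above. The function names and the handling of runs longer than seven bases (reusing the rate for length 7) are assumptions made for illustration and are not taken from HMM-FRAME.

```python
from itertools import groupby

# Titanium indel error rates by homopolymer length, as listed in the text
# (length 1 means a non-homopolymer base).
TITANIUM_INDEL_RATE = {1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688,
                       5: 0.0372, 6: 0.00167, 7: 0.143}


def homopolymer_lengths(seq):
    """Return, for every base, the length of the homopolymer run containing it."""
    lengths = []
    for _, run in groupby(seq):
        n = len(list(run))
        lengths.extend([n] * n)
    return lengths


def indel_error_rates(seq, table=TITANIUM_INDEL_RATE):
    """Per-base indel error probability, looked up by homopolymer run length.

    Assumption (not from the source): runs longer than the largest tabulated
    length reuse the rate of the largest tabulated length.
    """
    longest = max(table)
    return [table[min(n, longest)] for n in homopolymer_lengths(seq)]
```

For example, in the read `AACGGGT` the two leading `A` bases would receive the rate for homopolymers of size 2, and the three `G` bases the rate for size 3.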

    2.3.2 The augmented Viterbi algorithm for sequencing error correction

    Let $\pi$ be a state path in a profile HMM $M$. Let $r$ be a set of insertion and deletion positions in a DNA sequence $x$. The augmented Viterbi algorithm searches for the most probable path $\pi^*$ and the most probable error position set $r^*$ such that $(\pi^*, r^*) = \arg\max_{(\pi, r)} P(x, \pi, r)$. Intuitively, this algorithm searches for an optimal alignment between a DNA sequence and a profile HMM by simultaneously considering 1) evolutionary divergence (i.e., the insertion, deletion, and substitution of amino acids) and 2) sequencing errors (i.e., insertion and deletion of nucleotides). To solve the above equation, we first divide the search space according to different types of sequencing errors inside a codon and between two consecutive codons. For each type of error, we search for the most probable state path.

    Input: a DNA sequence x, a profile HMM M, and a sequencing error model. Notations of M and the error model will be described below.

    Output: the optimal alignment between DNA sequence x and M, as well as error positions in r.

    Algorithm: we first define notations that will be used in the dynamic programming equations.

    • Notations about the profile HMM $M$: States $M_j$, $I_j$, and $D_j$ are the matching, insertion, and deletion states in $M$. $a_{s_1 s_2}$ is the transition probability from state $s_1$ to $s_2$. $e_s(T(x_{i-2}x_{i-1}x_i))$ is the emission probability for state $s$ to emit amino acid $T(x_{i-2}x_{i-1}x_i)$, which is translated from the codon $x_{i-2}x_{i-1}x_i$. For a detailed description of a profile HMM $M$, we refer the reader to the textbook [17] and the users' guide of HMMER [18]. State $G_j$ is the only state that is not defined in profile HMMs from HMMER 3.0. It encodes insertions of nucleotides between codons. $a_{M_j G_j}$ is the transition probability from matching state $M_j$ to nucleotide insertion state $G_j$. It is set to the insertion error probability. $a_{G_j G_j}$ is the self-transition probability for $G_j$, encoding the probability of consecutive insertions. When consecutive insertion is not allowed, it is set to 0. $a_{G_{j-1} M_j}$ is the transition probability from $G_{j-1}$ to the next matching state $M_j$. When only one insertion error is allowed, it is set to 1.0.

    • Notations about the sequencing error model: $p_I(x_i)$ is the probability that base $x_i$ is an insertion error. $p_D(x_i)$ is the probability that there is a deletion error after base $x_i$.

    • Subproblems and the recursive equations: Based on our analysis of error patterns, it is very rare that there are consecutive insertions or deletions in a sequence read. Thus, the following DP algorithm assumes that there is at most one insertion or deletion inside a codon. The algorithm can be extended to handle all possible cases.

      – $V^M_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the matching state $M_j$, given that $x_i$ is the third base of a codon and this codon encodes an amino acid emitted by $M_j$.

      – $V^I_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the insertion state $I_j$, given that $T(x_{i-2}x_{i-1}x_i)$ is emitted by $I_j$.

      – $V^G_j(i)$ is the score of the best alignment ending in $x_i$ being emitted by state $G_j$, which encodes an insertion of nucleotides between codons.

      – $V^D_j(i)$ is the score of the best alignment matching subsequence $x_{1..i}$ to the submodel up to the deletion state $D_j$.

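    \[
    V^M_j(i) = \max \left\{
    \begin{array}{l}
    \text{Case I: no sequencing error in the codon } x_{i-2}x_{i-1}x_i\text{:} \\
    \quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^M_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^I_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times V^D_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-2}x_{i-1}x_i)) \times p_I(x_{i-3}) \times V^G_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\
    \text{Case II: nucleotide } x_{i-1} \text{ is an insertion:} \\
    \quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^M_{j-1}(i-4) \times a_{M_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^I_{j-1}(i-4) \times a_{I_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^D_{j-1}(i-4) \times a_{D_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-3}x_{i-2}x_i)) \times p_I(x_{i-1}) \times V^G_{j-1}(i-4) \times a_{G_{j-1}M_j}, \\
    \text{Case III: nucleotide } x_{i-2} \text{ is an insertion:} \\
    \quad \text{repeat the above four equations for } e_{M_j}(T(x_{i-3}x_{i-1}x_i)), \\
    \text{Case IV: there is a deleted nucleotide (represented by } d \text{) between } x_{i-1} \text{ and } x_i\text{:} \\
    \quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^M_{j-1}(i-3) \times a_{M_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^I_{j-1}(i-3) \times a_{I_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^D_{j-1}(i-3) \times a_{D_{j-1}M_j}, \\
    \quad e_{M_j}(T(x_{i-1}\,d\,x_i)) \times p_D(x_{i-1}) \times V^G_{j-1}(i-3) \times a_{G_{j-1}M_j}, \\
    \text{Case V: there is a deleted nucleotide between } x_{i-2} \text{ and } x_{i-1}\text{:} \\
    \quad \text{repeat the above four equations for } e_{M_j}(T(d\,x_{i-1}x_i)).
    \end{array}
    \right\}
    \]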

    In cases IV and V, we use $d$ to represent the deleted bases. We choose $d$ to maximize the emission probability of $T(x_{i-1}\,d\,x_i)$ (or $T(d\,x_{i-1}x_i)$) in the matching state $M_j$.
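The choice of the deleted base d can be sketched as follows. This is a minimal illustration rather than the HMM-FRAME implementation: the function name, the dictionary-based emission distribution, and the tie-breaking rule (the first base in A, C, G, T order wins) are assumptions. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def best_deleted_base(known, emission, d_index):
    """Choose the deleted base d that maximizes the match-state emission.

    known    -- the two observed bases of the codon, in read order
    emission -- hypothetical dict mapping amino acids to emission
                probabilities of the match state (missing entries count as 0)
    d_index  -- position of the deleted base within the codon
                (0 for T(d x_{i-1} x_i), 1 for T(x_{i-1} d x_i))
    Returns (d, codon, emission probability).
    """
    best = None
    for d in "ACGT":
        codon = known[:d_index] + d + known[d_index:]
        p = emission.get(CODON_TABLE[codon], 0.0)
        if best is None or p > best[2]:
            best = (d, codon, p)
    return best
```

With a match state that strongly prefers methionine, the observed bases `AG` and a deletion in the middle position would be completed to the codon `ATG`.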

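    \[
    V^I_j(i) = \max\left\{ e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^M_j(i-3) \times a_{M_j I_j},\;
    e_{I_j}(T(x_{i-2}x_{i-1}x_i)) \times V^I_j(i-3) \times a_{I_j I_j} \right\}
    \]

    \[
    V^G_j(i) = \max\left\{ p_I(x_i) \times V^M_j(i-1) \times a_{M_j G_j},\;
    p_I(x_i) \times V^G_j(i-1) \times a_{G_j G_j} \right\}
    \]

    \[
    V^D_j(i) = \max\left\{ V^M_{j-1}(i) \times a_{M_{j-1} D_j},\;
    V^D_{j-1}(i) \times a_{D_{j-1} D_j} \right\}
    \]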
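As a sketch of how the search space of the match-state recursion is divided, the snippet below enumerates the candidate codons ending at base x_i for cases I to III (no error, or a single inserted base inside the codon). It is an illustration under stated assumptions, not the HMM-FRAME code: the function name and the tuple layout are hypothetical, indices are 1-based to mirror the notation in the text, and the deletion cases IV and V are omitted.

```python
def match_state_codons(x, i):
    """Candidate codons ending at x_i (1-based) for cases I-III.

    Returns tuples (case, codon, prev_i, inserted_i), where prev_i is the
    read index passed to the V terms of the previous model column and
    inserted_i is the index of the base treated as an insertion (or None).
    """
    def base(k):
        # 1-based access, mirroring the x_k notation
        return x[k - 1]

    candidates = []
    if i >= 3:
        # Case I: codon x_{i-2} x_{i-1} x_i, no sequencing error
        candidates.append(("I", base(i - 2) + base(i - 1) + base(i), i - 3, None))
    if i >= 4:
        # Case II: x_{i-1} is an insertion, codon x_{i-3} x_{i-2} x_i
        candidates.append(("II", base(i - 3) + base(i - 2) + base(i), i - 4, i - 1))
        # Case III: x_{i-2} is an insertion, codon x_{i-3} x_{i-1} x_i
        candidates.append(("III", base(i - 3) + base(i - 1) + base(i), i - 4, i - 2))
    return candidates
```

Each candidate would then be scored by multiplying the emission probability of its translated codon, the insertion probability of the skipped base (if any), the previous column's score at prev_i, and the corresponding transition probability.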

    2.3.3 Running time analysis

    The time complexity of the above dynamic programming algorithm is O(δ|x||M|), where |x| is the length of the input DNA sequence and |M| is the number of states in M. δ is the number of different types of errors inside a codon plus the case of insertions between two codons. In our current implementation, δ = 26, which renders a longer running time than the standard Viterbi algorithm. Thus, it is not practical to compare millions of metagenomic sequence reads to over 10,000 protein families in Pfam. Instead, we only run HMM-FRAME on sequences that are likely to contain insertion or deletion errors. For large-scale applications, we suggest applying HMMER, which is as fast as BLAST, to all input sequence reads using a large E-value cutoff (such as 100). Alignments covering at least 80% of the translated DNA sequence with significant E-values can be classified by HMMER in this step. Sequence reads that do not yield any partial alignments are unlikely to be members of any protein family. Thus, we only apply HMM-FRAME to reads yielding partial alignments with marginal scores because these reads could potentially contain sequencing errors.
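The suggested pre-filtering step can be sketched as a simple triage rule. This is only an illustration: the function name, the representation of a HMMER hit as an (alignment length, E-value) pair, and the E-value considered "significant" (1e-3 here) are assumptions. Only the 80% coverage requirement comes from the text.

```python
def triage_read(read_len_aa, hit, min_coverage=0.8, significant_evalue=1e-3):
    """Decide how to handle one read after a permissive HMMER search.

    read_len_aa -- length of the translated read, in amino acids
    hit         -- None if HMMER reported no alignment, otherwise a
                   hypothetical (alignment_length_aa, evalue) pair for the best hit
    significant_evalue is an assumed cutoff; the text does not specify one.
    """
    if hit is None:
        # No partial alignment at all: unlikely to belong to any family.
        return "discard"
    aln_len, evalue = hit
    if aln_len / read_len_aa >= min_coverage and evalue <= significant_evalue:
        # Long, significant alignment: HMMER alone can classify the read.
        return "classified_by_hmmer"
    # Partial alignment with a marginal score: possible sequencing error.
    return "run_hmm_frame"
```

Under this rule, only the reads returned as `run_hmm_frame` incur the cost of the augmented Viterbi algorithm.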

    2.4 Results

    In this section, we compare the sensitivity and false positive rates (FP rates) of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. We then apply HMM-FRAME to Targeted Metagenomics and a published metagenomic data set. Our experimental results show that the length, scores, and E-values of profile HMM alignments are significantly improved after error correction. As profile HMM-based alignment tools determine membership by comparing E-value or length with user-defined thresholds, the improvement of these parameters enables more error-containing sequences to be classified into their native families.

    2.4.1 Accuracy of HMM-FRAME

    In order to evaluate the accuracy of HMM-FRAME in detecting insertion and deletion errors, we obtained a control data set with annotated error positions from RDP (Cole and Wang, unpublished). In this data set, NifH gene families from the Desulfitobacterium hafniense strain DCB-2, the Burkholderia xenovorans strain LB400, and the PCC 7120 strain of Anabaena were amplified and then sequenced using 454 Titanium. The sequenced gene families were aligned with the nifH genes in these three organisms using the Needleman-Wunsch algorithm. Insertion and deletion errors were identified from the alignments. After contamination and chimera screening, we had 18,900 sequences, of which 3,408 sequences contained 4,623 insertion or deletion errors. We conducted the protein domain analysis on the 18,900 sequences using HMM-FRAME under the two error models presented in the Method section. The input profile HMM was trained on 25 nifH genes obtained from RDP's functional gene repository website [38].

    We evaluated the performance of error-prediction tools using two types of sensitivity and FP rates. Let $S^+$ be the set of error-containing sequences in the control data set. Let $S$ be the set of predicted error-containing sequences. The sequence-level sensitivity and FP rate are $\frac{|S \cap S^+|}{|S^+|}$ and $\frac{|S - S^+|}{|S|}$, respectively. Similarly, let $Q^+$ be the set of insertion and deletion positions in error-containing sequences from the control data set. Let $Q$ be the set of predicted error positions. The base-level sensitivity and FP rate are $\frac{|Q^+ \cap Q|}{|Q^+|}$ and $\frac{|Q - Q^+|}{|Q|}$, respectively.
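These two measures translate directly into set operations. The following sketch uses a hypothetical function name and toy identifiers. At the sequence level the elements would be read identifiers, and at the base level they could be (read identifier, position) pairs.

```python
def sensitivity_and_fp_rate(predicted, truth):
    """Sensitivity |P ∩ T| / |T| and FP rate |P − T| / |P| as defined in the text.

    predicted -- predicted error-containing items (S or Q)
    truth     -- annotated error-containing items (S+ or Q+)
    Empty denominators yield 0.0, a convention chosen for this sketch.
    """
    predicted, truth = set(predicted), set(truth)
    sensitivity = len(predicted & truth) / len(truth) if truth else 0.0
    fp_rate = len(predicted - truth) / len(predicted) if predicted else 0.0
    return sensitivity, fp_rate
```

For instance, predicting reads r1, r2, and r5 when r1 to r4 are annotated gives a sensitivity of 2/4 and an FP rate of 1/3.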

    Using the control data set, we first evaluated the performance of HMM-FRAME under the published GS20 and our self-trained Titanium error models. Then we compared the performance of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. Similar to HMM-FRAME, GeneWise can directly compare DNA sequences with a profile HMM and can accept user-defined error rates. We tested GeneWise using different parameters including error rates and the alignment score thresholds (ranging from 0 to 20). The results with the best tradeoff between sensitivity and FP rate were kept for comparison with HMM-FRAME.

    FragGeneScan [32] is a newly developed gene prediction tool for short and error-prone sequences. It predicts genes and identifies sequencing errors inside predicted genes. We applied FragGeneScan on the above sequence set (all genes) and tested its sensitivity and FP rate. FragGeneScan successfully recognized all input as protein-coding genes, rendering a high gene-prediction sensitivity in this data set. However, FragGeneScan had higher FP rates than HMM-FRAME in error detection. The results are summarized in Table 2.1.

    Table 2.1: Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

                HMM-FRAME   HMM-FRAME        GeneWise   FragGeneScan
                (GS20)      (self-trained)
    seq-sen     95.25%      90.6%            53.8%      83.04%
    base-sen    85.08%      82.4%            -          53.39%
    seq-FP      0.154%      0                0.001%     0.7%
    base-FP     2.1%        0.003%           -          59.57%

    Sensitivity and FP rate of each program when detecting annotated insertion and deletion errors in nifH genes. seq-sen: sequence-level sensitivity. base-sen: base-level sensitivity. seq-FP: sequence-level FP rate. base-FP: base-level FP rate. The score cutoff of GeneWise is set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we did not evaluate its performance at the base level.

    As shown in Table 2.1, each tool has higher sensitivity and smaller FP rates in identifying error-containing sequences than in locating error positions. HMM-FRAME has a better tradeoff between sensitivity and FP rate than both GeneWise and FragGeneScan. Both the GS20 and our self-trained Titanium error models have small FP rates in predicting error positions, but GS20 has higher sensitivity. Thus, we use the GS20 model in all further experiments.

    2.4.2 Using HMM-FRAME in "Targeted Metagenomics"

    In this section, we present the utility of HMM-FRAME in two applications of "Targeted Metagenomics", where one or several gene families are amplified from environmental DNA and these amplicons are sequenced using high-throughput sequencing platforms. One typical application of Targeted Metagenomics is to sequence the amplicons of the 16S rRNA gene for phylogenetic complexity analysis. Besides 16S rRNA, protein-coding genes that are important to a particular habitat can be amplified and sequenced for targeted functional analysis in metagenomic data sets. For example, Targeted Metagenomics of the nifH gene, which encodes nitrogenase reductase, is important for analyzing microbial genomes sequenced from soil. Although these sequences are sampled from one or several targeted gene families, frameshift errors can cause short alignments with marginal scores between the input and the targeted gene families. As a result, sequences lacking significant alignment length and scores will be regarded as contaminants and be discarded. Thus, it is desirable to fix frameshift errors to maximize the number of usable samples. Given a DNA read and a profile HMM built from a set of known protein sequences, HMM-FRAME can be applied to detect and correct frameshift errors in amplicon reads.

    2.4.2.1 Protein domain analysis of nifH sequences

    In the first experiment, we obtained 3,937 nifH sequences of an average length of 76 bases generated by the 454 FLX sequencing technology. In order to discard contaminants that originated from non-target genes, we aligned the 3,937 sequences with the nifH gene family, which was built on a small set of 25 expert-verified full-length nifH protein reference sequences from RDP's functional gene repository [38]. In the gene family building process, we first applied ClustalW [39] to align the 25 reference sequences. Then we applied HMMER 3.0's hmmbuild program to derive a profile HMM from the multiple sequence alignment. Of the 3,937 sequences generated by 454 FLX, 111 were found to be contaminants and were excluded from further analysis. Of the remaining 3,826 sequences, HMM-FRAME detected 296 insertions and deletions in 256 sequences. Thus, approximately 7% of the samples contained frameshift errors. Of the 256 sequences containing insertion or deletion errors, 224 (87.5%) contained only one insertion or deletion error, 24 (9.4%) contained two errors, and eight (3.1%) contained three errors. Of the 296 insertions or deletions, 224 (75.7%) were inside or beside homopolymer regions.

    Because protein domain classification tools compare alignment lengths, scores, and E-values with pre-defined thresholds to determine a sequence's membership, the changes in the alignments affect the final domain composition analysis. After error correction, profile HMM-based alignment tools are expected to generate longer alignments with higher scores and smaller E-values. This gives error-containing sequences a better chance of being classified into the correct families rather than being labeled contaminants.

    In order to conduct a fair comparison of alignments before and after error correction, we chose a third-party tool, HMMER, to generate alignments for the original and corrected sequences. The changes in the alignments' E-values and lengths due to error correction are presented in Figure 2.2. In this figure, the changes are presented for the 256 sequences in which HMM-FRAME detects errors. "Original" refers to HMMER alignments on sequences before error correction. "Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME. As a comparison, we also plot the length of the original sequence reads (with the legend "sequence read"). They largely overlap with the lengths of the corrected alignments, indicating that complete sequence reads can be aligned with the nifH profile HMM after error correction.

    In order to test whether the improvement was statistically significant, we conducted a two-sample Kolmogorov-Smirnov test (K-S test) on the alignments' lengths and E-values before and after error correction. The p-values for the alignments' length and E-value distributions were 3.1037e-010 and 1.1802e-045, respectively. In particular, the comparison between alignments' lengths and the sequence reads' lengths shows that most partial alignments generated by error-containing sequences become complete alignments after error correction. Thus, when comparatively longer alignments (e.g., 23 amino acids or 69 bases) are required for domain classification, more sequence reads (213 more when the threshold is 69 bases) will be classified into their native families.

    [Figure 2.2 plot. x-axis: NifH family reads. y-axis: Lengths and E-values of pHMM alignments. Series: LOG(original E-value), LOG(corrected E-value), original length, corrected length, sequence read length.]

    Figure 2.2: Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.)
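The statistic underlying a two-sample K-S comparison such as the one above can be sketched in a few lines. This is a minimal pure-Python illustration with a hypothetical function name. It computes only the K-S statistic D (the largest gap between the two empirical distribution functions) and not the p-value, which statistics libraries such as SciPy (`scipy.stats.ks_2samp`) provide. It is not the code used to produce the reported p-values.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: max |F_a(v) - F_b(v)| over all observed values."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:
        # Empirical CDFs evaluated at v (fraction of values <= v).
        fa = bisect_right(a, v) / len(a)
        fb = bisect_right(b, v) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

Here `a` and `b` would hold, for example, the alignment lengths before and after error correction. A value of D close to 1 indicates that the two distributions barely overlap.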

    2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes

    In the second experiment, we obtained 2,486 pyrosequencing samples of an average length of 224 bases from the bacterial aromatic dioxygenase genes in a soil sample [40]. Although these pyrosequencing reads were sequenced from the 5' end of PCR amplicons of bacterial aromatic dioxygenase genes, we were interested in classifying them into three sub-families of dioxygenase genes: toluene/biphenyl, naphthalene, and benzoate [41]. Note that there is another subfamily (phthalate). However, due to lack of training proteins for this family (Dr. Iwai, personal communication), we only searched for members of three sub-families.

    Three sets of reference protein sequences were extracted from Pfam [12] for toluene/biphenyl, naphthalene, and benzoate [41]. Based on these training sets, we built three profile HMMs using ClustalW and HMMER. Then we applied HMM-FRAME to align the 2,486 reads with the three profile HMMs. HMM-FRAME detected 77 insertions and 52 deletions, which were distributed in 121 sequences. Of the 121 error-containing sequences, 77 could not be classified into any subfamily by HMMER under the E-value threshold 0.1. After error correction using HMM-FRAME, these 77 sequences were classified into different families with an average E-value of 3.3e-06, indicating that they were highly likely to be true members of the underlying families. For other error-containing sequences, the profile HMM alignments' E-values and lengths were significantly improved after error correction. The change is plotted in Figure 2.3. In this figure, the data set was sequenced from bacterial aromatic dioxygenase genes in a soil sample. All alignments are generated by HMMER for a fair comparison. "Original" refers to HMMER alignments on sequences before error correction. "Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME.

    We also applied a two-sample K-S test on the alignments' lengths and E-values before and after error correction. The p-values for the length and E-value distributions were 8.0609e-011 and 1.9776e-040, respectively. The improved alignment lengths and E-values provide stronger evidence for the membership of the input samples. In total, after error correction by HMM-FRAME, we could classify 1,214 sequences into three subfamilies. 1,042 reads were members of the naphthalene subfamily. 96 reads belonged to the benzoate subfamily. 76 reads belonged to the toluene/biphenyl subfamily. The remaining 1,272 reads could potentially be members of the subfamily phthalate (Dr. Iwai, personal communication).

    [Figure 2.3 plot. x-axis: Soil sample sequence reads. y-axis: Lengths and E-values of pHMM alignments. Series: original length, corrected length, LOG(original E-value), LOG(corrected E-value).]

    Figure 2.3: Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.

    2.4.3 Protein domain classification in the deep mine data set

    In order to show the utility of HMM-FRAME in a metagenomic data set containing members of multiple domain families, we applied HMM-FRAME to the first 454 sequencing project for environmental samples, which were sequenced from two sites in the Soudan Mine, Minnesota, USA [42]. In this experiment, we downloaded the Black Sample from the paper's supplementary data website. This data set contains 388,627 sequence reads with an average length of 99 bases.

    There were two steps in the annotation. First, we applied gene-prediction tools. Second, we conducted the domain classification on predicted genes. A number of gene-prediction tools are available for metagenomic data sets. However, not every tool can handle short reads. Glimmer [43] did not output meaningful predictions when it was applied to this data set. The sensitivity of Metagene [44] drops to 59% for 100-base sequences [45]. We thus chose FragGeneScan, a newly developed gene-prediction tool for short reads. FragGeneScan predicted 281,658 genes, of which 72,355 contained errors. For convenience in discussion, let S be the set of genes predicted by FragGeneScan. Let S' be the raw read set corresponding to genes in S. Thus, 72,355 sequences in S were different from their raw reads in S' because FragGeneScan predicted and corrected errors in S'. We compared three domain classification pipelines: 1) apply HMMER 3.0 on raw reads S', 2) apply FragGeneScan and then HMMER on corrected reads S, and 3) apply HMM-FRAME on raw reads S'. We recorded how many reads could be classified into one of the 2,558 Pfam domain families that contain the keyword "bacteria". The numbers of classifiable reads for the three pipelines were: 13,544 for HMMER, 12,328 for FragGeneScan + HMMER, and 17,496 for HMM-FRAME. The classification results have large overlaps, which are illustrated in Figure 2.4. In this figure, sequence sets that can be classified by HMM-FRAME, HMMER, and FragGeneScan+HMMER are represented by three sets A, B, and C, with |A| = 17,496, |B| = 13,544, |C| = 12,328, |B − C| = 2,224, |C − B| = 1,008, |C − A| = 4, and |A − (B ∪ C)| = 2,948.
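The set sizes reported above can be cross-checked with elementary set arithmetic. The following worked calculation uses only the reported counts and is offered as a consistency check rather than as part of the original analysis:

```latex
\begin{align*}
|B \cap C| &= |B| - |B - C| = 13{,}544 - 2{,}224 = 11{,}320 \\
|B \cap C| &= |C| - |C - B| = 12{,}328 - 1{,}008 = 11{,}320 \\
|B \cup C| &= |B| + |C - B| = 13{,}544 + 1{,}008 = 14{,}552 \\
|A \cup B \cup C| &= |B \cup C| + |A - (B \cup C)| = 14{,}552 + 2{,}948 = 17{,}500 \\
|(B \cup C) - A| &= |A \cup B \cup C| - |A| = 17{,}500 - 17{,}496 = 4
\end{align*}
```

Both routes to $|B \cap C|$ agree, and the final line matches $|C - A| = 4$, so the four reads in $C - A$ account for every read classified by an HMMER-based pipeline but not by HMM-FRAME.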

    In summary, HMM-FRAME was able to classify 2,948 more reads than the other two annotation pipelines. HMM-FRAME found errors in all of these 2,948 reads. Thus, it is likely that the other two pipelines failed to classify them because of frameshifts. HMM-FRAME failed to classify four reads that can be aligned by FragGeneScan+HMMER. A closer examination showed that FragGeneScan and HMM-FRAME output different error positions in these four sequences.

    [Figure 2.4 Venn diagram. A: HMM-FRAME. B: HMMER alone. C: FragGeneScan+HMMER.]

    Figure 2.4: Protein domain classification results for the black sample in the deep mine data set.

    The performance evaluation of FragGeneScan must consider both gene-prediction and error-prediction. Of the 281,658 predicted genes, only 12,328 could be classified into existing domain families. Further analysis is needed to examine whether other predictions are novel genes or wrong predictions. It is worth noting that FragGeneScan could classify 1,008 more sequences after its error correction than applying HMMER 3.0 alone on raw reads. However, while 2,224 raw reads could be classified into existing domain families by HMMER 3.0, they could not be aligned with any family after error correction by FragGeneScan. This indicates that FragGeneScan might have over-predicted errors in the 2,224 sequences. This is consistent with our observation that FragGeneScan has a high FP rate in the control data set.

    2.5 Conclusion

    Despite the advances of high-throughput sequencing technologies, sequencing errors still pose challenges for data annotation. In particular, our error model analysis shows that 454 FLX Titanium only slightly decreases the insertion and deletion error rates compared to GS20. Thus, correcting frameshifts caused by insertion or deletion errors is still important for metagenomic sequence annotation. In this work, we introduce a protein domain classification tool, HMM-FRAME, which can classify error-prone DNA sequence reads into protein domain families. HMM-FRAME can accept any error model trained on data from high-throughput sequencing technologies and thus achieve high detection sensitivity while maintaining a low false positive rate.

    Applying HMM-FRAME to a data set with annotated errors shows its high sensitivity and accuracy in error detection. In particular, by fixing frameshift errors, we can obtain significantly longer profile HMM alignments with smaller E-values. As alignments' lengths, scores, and E-values are often used to determine family membership, improving them helps to classify more sequences into the native domain families. In our experiments, sequences that fail HMMER 3.0 under the default E-value or score threshold are classified into correct domain families using HMM-FRAME. Thus, HMM-FRAME can be used as a complementary tool to HMMER 3.0 on error-prone sequences.

    Chapter 3

    Profile HMM-based protein domain classification for short sequences

    3.1 Background

    With the advent of next-generation sequencing and culture-independent methods, an enormous amount of metagenomic data has been sequenced from microbial communities in different habitats. In order to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host, automatic annotation tools such as CAMERA [5], MG-RAST [6], and MEGAN [7] are being used for annotating metagenomic data sets. As an important component of these metagenomic annotation tools, protein homology search provides the basis for identifying putative genes and assigning those genes to annotated functional categories (e.g., protein domain families).

    Because of its high sensitivity in remote homology recognition, HMMER has been
    successfully applied to genome-wide domain analysis. However, its sensitivity is significantly
    limited by the short reads of metagenomic data sets and by poorly conserved domains. In order
    to investigate how read length and domain identity affect the sensitivity of HMMER, we
    randomly sampled 200 peptides with lengths of 12, 20, and 28 amino acids from the seed
    sequences of each of the 2,558 Pfam domains that contain the word "Bacteria" in their
    descriptions. The peptides were aligned with the domain families using HMMER. We used an
    E-value cutoff of 1000 in order to boost the sensitivity. For each domain, the read classification
    sensitivity of HMMER is measured as the ratio of the number of aligned reads to the total
    number of sampled reads. We sort all data points by domain identity in ascending order
    and plot them in Figure 3.1. For domains with the same identity, their average sensitivity
    is reported.
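    The sensitivity measure described above can be sketched in a few lines of Python. This is a
    minimal illustration, not the code used in the experiments: the function names and the toy
    counts are hypothetical, and the only logic taken from the text is the ratio of aligned to
    sampled reads and the averaging over domains that share the same identity.

```python
from collections import defaultdict


def read_sensitivity(num_aligned, num_sampled):
    """Fraction of the sampled reads that the search tool aligned to the domain."""
    return num_aligned / num_sampled


def average_by_identity(records):
    """Average sensitivity over domains with equal identity, sorted by identity.

    records: iterable of (identity, sensitivity) pairs, one per domain.
    """
    groups = defaultdict(list)
    for identity, sensitivity in records:
        groups[identity].append(sensitivity)
    return [(ident, sum(vals) / len(vals)) for ident, vals in sorted(groups.items())]


# Hypothetical toy data: (average identity of the domain, aligned reads, sampled reads).
toy_domains = [(0.45, 180, 200), (0.30, 90, 200), (0.30, 110, 200)]
points = average_by_identity(
    (ident, read_sensitivity(aligned, sampled)) for ident, aligned, sampled in toy_domains
)
```

    Each entry of `points` corresponds to one plotted data point: an identity value and the mean
    sensitivity of all domains with that identity.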

    [Figure 3.1 plot: x-axis, average sequence identity of domain (0 to 1); y-axis, sensitivity of HMMER3 (0 to 1); curves for read lengths of 36 bp, 60 bp, and 84 bp.]

    Figure 3.1: Change of the read classification sensitivity of HMMER over read length and the average sequence identity of domain families.

    Figure 3.1 shows that the sensitivity of HMMER deteriorates as the query sequence length
    and domain identity decrease. For domains with identity around 40%, the sensitivity drops
    from 90% to 65-70% when the read length changes from 28 residues (i.e., 84 bp for the
    corresponding DNA reads) to 20 residues (i.e., 60 bp for DNA reads).

    Although next-generation sequencing technologies are producing longer reads and assembly
    tools may be available to assemble short reads into longer contigs, there is still a need
    for a protein domain analysis tool for short reads. First, many finished or ongoing
    metagenomic sequencing projects contain reads with lengths from 35 to around 400 bp, depending
    on the chosen sequencing technologies. In addition, peptide sequences encoded in individual
    metagenomic sequence reads may share only small overlaps with existing domain families.
    Thus, a sizable portion of the available data still consists of short reads. Second, the sheer
    amount of data and the complexity of many metagenomic data sets pose a great challenge
    for assembly tools [46]. A large portion of short reads cannot be correctly assembled into
    longer contigs. Third, many domain families exhibit low average sequence identity, which
    poses a challenge for short and medium-sized reads. Figure 3.2 shows the histogram of
    pairwise sequence identity for domains related to bacteria. Of the 2,558 domains, about
    43% have an average identity no greater than 0.3. For these domains, the sensitivity
    of HMMER is between 0.7 and 0.8 for reads of length 84 bp, between 0.4 and 0.6 for reads
    of length 60 bp, and smaller than 0.1 for reads of length 36 bp. As a result, although a large
    number of reads are sequenced from genes, which are highly compact in microbial genomes,
    only a small percentage of the short reads can be classified into their native domains using
    existing tools.

    In this work, we introduce MetaDomain, a protein domain classification tool designed
    for short reads in metagenomic data sets. MetaDomain provides a protein analysis tool
    complementary to HMMER for assigning short reads to their native families.


    [Figure 3.2 plot: x-axis, average sequence identity of domain (0 to 1); y-axis, number of domains (0 to 800).]

    Figure 3.2: Histogram of the average pairwise sequence identity for 2558 domains


    3.2 Related Work

    Profile HMM-based protein homology search is widely used for mining microbial genomes.
    Knowing the composition of different domain families encoded in a metagenomic data set
    helps us understand which functions are important for a particular habitat. For example,
    Ellrott et al. [8] investigated the distribution of protein families in the available human gut
    genomic and metagenomic data. As the data set contains assembled contigs, using HMMER
    is expected to achieve high sensitivity. Schlüter et al. [9] used HMMER to understand the
    genetic diversity and composition of a plasmid metagenome from a wastewater treatment
    plant. The reads have an average length of 104 bp, which is also adequate for HMMER to
    achieve high sensitivity.

    Besides providing a basis for functional profiling, profile HMM-based homology search
    has also been used for phylogenetic complexity analysis in metagenomic data. The phylogenetic
    algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers
    to identify the source organisms of environmental DNA fragments as short as 80 bp. As we
    show in Figure 3.1, profile HMM-based tools have a sensitivity of at least 0.9 in classifying
    reads of 80 bp into domains with average sequence identity above 40%. However, for poorly
    conserved domains, a significant number of reads might be missed. A similar but faster tool,
    Treephyler [47], conducted community profiling in metagenomics and metatranscriptomics
    based on Pfam domain assignments. Treephyler was applied to a data set with an average read
    length of 200 bp. It is unclear how shorter reads affect its performance.

    In our previous work, we designed a tool, HMM-FRAME [48], which can identify and correct
    frameshift errors in pyrosequencing reads during protein domain classification using profile
    HMM-based alignment. However, it was not specifically designed to handle short reads.


    Finally, we note that the method used in MetaDomain shares a similar rationale to the
    recent work by Weng et al. [49]. Weng et al. reported that taxonomic binning tools for
    metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST
    cutoffs. Thus, they re-analyzed the discarded reads using less stringent cutoffs. In order
    to control the false positive matches introduced by the relaxed cutoffs, they used the
    evolutionary conservation of adjacency between neighboring genes as an additional criterion.

    3.3 Method

    HMMER uses E-values as the discrimination threshold to determine the membership of a
    query sequence. However, short reads may only generate low alignment scores and thus
    insignificant E-values. In particular, the conservation across the entire length of a domain
    family can be highly variable, posing a great challenge for classifying reads sequenced from
    poorly conserved sub-regions. In order to increase the sensitivity of aligning remotely related
    short reads, we propose position-specific score cutoffs, by which poorly conserved regions
    allow more relaxed discrimination thresholds than well-conserved regions. However, the low
    thresholds can easily incur random matches. In order to control the false positive rate,
    we examine the position distribution of read alignments. The position distribution of read
    alignments on a truly encoded domain is expected to be more uniform than that on a domain
    that incurs random read alignments [50, 51]. Figure 3.3 shows schematic representations of
    three types of distributions of read alignments along a domain. The alignments in (A) and
    (B) are more likely to be random. Thus, the domains may not be encoded in the data set. The
    alignments in (C) exhibit a much more uniform distribution, providing strong
    evidence for the existence of the underlying domain in the data set. Thus, by using relaxed
    position-specific score cutoffs and inspecting the distribution of alignments, we expect to
    classify more short reads into the correct domain families while not falsely reporting domains
    that are not characterized in the data.

    Figure 3.3: Three types of alignment distributions.

    3.3.1 Pipeline of MetaDomain

    The input to MetaDomain includes sequence reads and a list of protein domains. The

    output is a list of domains encoded in the underlying data set and the number of aligned

    reads. Figure 3.4 shows a schematic representation of the pipeline of MetaDomain.

    MetaDomain consists of three main stages: short read alignment, filtering, and
    classification. In the alignment stage, we use the Viterbi algorithm [17] to search for the best
    local alignment between a query sequence and a profile HMM-represented domain family. In
    the filtering stage, we first apply a position-specific score threshold to eliminate insignificant
    alignments. Then we remove stacked alignments with the same alignment positions inside
    a poorly conserved region. In the final stage, we use the number of aligned reads and the
    distribution of alignment positions to determine whether a domain is encoded.


    [Figure 3.4 diagram. Box labels: sequence reads; Pfam domains; Viterbi algorithm; optimal local alignments; filtering (position-specific threshold); trimmed alignments; classification (read number and domain coverage thresholds); transcribed or encoded protein domains.]

    Figure 3.4: Pipeline of MetaDomain.


    3.3.2 The Viterbi algorithm

    The Viterbi algorithm aligns a query sequence to a profile HMM by searching for the most
    probable state path in the model. Unlike HMMER, MetaDomain directly aligns a DNA
    sequence to a profile HMM. To do so, we implicitly align translated peptides under different
    reading frames with a profile HMM. Let π be a state path in a profile HMM M and let x
    be a query DNA sequence. The Viterbi algorithm searches for the most probable path π*
    such that π* = argmax_π P(x, π | M). The output of the Viterbi algorithm includes the optimal
    alignment and its score. As Viterbi is a standard algorithm designed for HMMs, we refer
    readers to Durbin et al. [17] for a detailed illustration of the dynamic programming equations
    for finding π*. The major differences between our implementation and the standard Viterbi
    algorithm are: 1) our implementation accepts a DNA rather than a peptide sequence
    as input; 2) a local alignment can start and end with any state without incurring insertion
    or deletion penalties.
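    To make the notion of reading frames concrete, the sketch below enumerates the six
    translations of a DNA read explicitly. It is only an illustration: MetaDomain is described as
    handling the frames implicitly inside its Viterbi implementation, which this code does not
    reproduce. The function names are hypothetical, the standard genetic code is assumed, and
    codons containing ambiguous bases are translated to "X" by assumption.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to translated peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

    For a 41 bp read, each frame yields at most 13 residues, which illustrates why alignment
    scores for such reads are small.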

    3.3.3 Alignment Filtering

    MetaDomain employs two filtering mechanisms to increase its sensitivity in aligning short
    reads while maintaining a low false positive rate: position-specific thresholds (PSTs) and
    trimming.

    3.3.3.1 Position-specific threshold

    PST allows different alignment thresholds for well-conserved and poorly conserved regions.
    Let the length of a query DNA sequence be L (in bp). Denote the profile HMM as M. Let
    M_{i,j} be the sub-model formed by all consecutive states from the i-th match state M_i to the
    j-th match state M_j. The upper bound of the alignment score against M_{i,j} is the maximum
    score that can be generated by aligning any input sequence of length j − i + 1 with M_{i,j}.
    Let a_{i,j} denote the transition probability from state M_i to state M_j. Let e_i(a) denote the
    probability of state M_i emitting amino acid a. Then the upper bound U_{i,j} for sub-model
    M_{i,j} is calculated as follows:

        U_{i,j} = \prod_{k=i}^{j} a_{k,k+1} \times \max_{a} e_k(a)

    where a_{j,j+1} is set to 1 because M_j is the ending state of the sub-model.

    We define the PST for the sub-model M_{i,j} as:

        PST_{i,j} = \gamma U_{i,j}

    where the coefficient γ is a user-specified parameter in the range [0, 1]. It can be flexibly
    adjusted to control the trade-off between sensitivity and false positive rate of MetaDomain.
    The default value is 0.6, which is used in our experiments.
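    The two formulas above can be sketched as follows. This is a minimal illustration under
    stated assumptions, not MetaDomain's implementation: the data layout (a list of
    match-to-match transition probabilities and a list of per-state emission dictionaries), the
    0-based indexing, and the function names are all hypothetical. The product is computed
    directly on probabilities, as written in the text.

```python
def upper_bound(transitions, emissions, i, j):
    """Upper bound U_{i,j} for the sub-model of match states i..j (0-based, inclusive).

    transitions[k]: probability of moving from match state k to match state k + 1.
    emissions[k]:   dict mapping amino acid -> emission probability at match state k.
    """
    u = 1.0
    for k in range(i, j + 1):
        # The transition out of the last state of the sub-model is set to 1.
        a = 1.0 if k == j else transitions[k]
        u *= a * max(emissions[k].values())
    return u


def position_specific_threshold(transitions, emissions, i, j, gamma=0.6):
    """PST_{i,j} = gamma * U_{i,j}, with gamma restricted to [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * upper_bound(transitions, emissions, i, j)


# Hypothetical three-state model over a two-letter alphabet.
toy_transitions = [0.9, 0.8, 0.95]
toy_emissions = [
    {"A": 0.5, "C": 0.5},
    {"A": 0.7, "C": 0.3},
    {"A": 0.2, "C": 0.8},
]
toy_u = upper_bound(toy_transitions, toy_emissions, 0, 2)
toy_pst = position_specific_threshold(toy_transitions, toy_emissions, 0, 2)
```

    A flat emission distribution (a poorly conserved position) lowers the maximum emission
    term, so sub-models covering such positions receive a lower threshold, which is the
    behavior the text describes.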

    3.3.3.2 Alignment trimming

    Alignments with scores larger than their corresponding PSTs pass the first filtering stage.
    As conservation varies along the entire length of a domain model, well-conserved
    sub-regions have high PSTs while poorly conserved sub-regions yield low PSTs. Thus,
    random sequences tend to be aligned to poorly conserved regions by MetaDomain, incurring a
    high FP rate. Our empirical experiments show that dozens of reads that are not sequenced
    from the underlying domain can be aligned to the same position in a poorly conserved
    sub-region. In order to minimize the effects of noise, we discard stacked alignments that have
    the same alignment positions.
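    One possible reading of the trimming step is sketched below. The text does not state how
    many identical alignments constitute a stack, whether one representative is kept, or how
    poorly conserved regions are delimited, so every such detail here is an assumption: the
    sketch groups alignments by their exact (start, end) positions on the model, drops any
    group larger than a hypothetical `max_stack` parameter, and applies this to the whole
    model rather than to poorly conserved regions only.

```python
from collections import defaultdict


def trim_stacked(alignments, max_stack=1):
    """Drop groups of alignments that share identical model coordinates.

    alignments: list of (read_id, start, end) tuples in model coordinates.
    max_stack:  hypothetical limit; groups with more members are discarded entirely.
    """
    stacks = defaultdict(list)
    for aln in alignments:
        _, start, end = aln
        stacks[(start, end)].append(aln)
    kept = []
    for group in stacks.values():
        if len(group) <= max_stack:
            kept.extend(group)
    return kept


# Hypothetical alignments: three reads stacked on positions 5..18, two isolated reads.
toy_alignments = [
    ("r1", 5, 18), ("r2", 5, 18), ("r3", 5, 18),
    ("r4", 20, 33), ("r5", 40, 52),
]
trimmed = trim_stacked(toy_alignments)
```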

    3.3.4 Protein domain classification

    In this stage, we extract two features from the collected read alignments for each domain:
    the number of aligned reads and the domain coverage. The domain coverage is the fraction
    of positions covered by at least one read alignment in a domain. MetaDomain then applies
    a simple decision tree to classify all the target domains into two classes: encoded domains
    and non-encoded domains. If both features of a domain are greater than or equal to their
    corresponding thresholds, the domain is classified as encoded. Otherwise, it is classified as
    not encoded in the sample. By default, the cutoff for domain coverage is 30%. Ideally, the cutoff
    for the number of aligned reads should be determined based on properties of the data such
    as sequencing depth. If users do not specify this value, we use 20 by default.
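    The decision rule above can be sketched as follows. The function names and the
    representation of alignments as 1-based inclusive (start, end) model positions are
    assumptions made for illustration; the default cutoffs of 20 reads and 30% coverage are the
    ones stated in the text.

```python
def domain_coverage(alignments, model_length):
    """Fraction of model positions covered by at least one alignment.

    alignments: list of (start, end) pairs, 1-based and inclusive.
    """
    covered = set()
    for start, end in alignments:
        covered.update(range(start, end + 1))
    return len(covered) / model_length


def is_encoded(alignments, model_length, min_reads=20, min_coverage=0.30):
    """A domain is called encoded only if both features reach their cutoffs."""
    return (
        len(alignments) >= min_reads
        and domain_coverage(alignments, model_length) >= min_coverage
    )


# Hypothetical cases on a model of length 100.
spread = [(1 + 4 * k, 10 + 4 * k) for k in range(20)]  # 20 reads tiling positions 1..86
stacked = [(1, 10)] * 25                               # many reads, one narrow region
sparse = [(1, 50)] * 5                                 # good coverage, too few reads
```

    Requiring both features is what separates case (C) of Figure 3.3 from the stacked or sparse
    patterns: `stacked` passes the read-count cutoff but fails on coverage, and `sparse` does
    the opposite.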

    3.4 Results

    In order to evaluate the performance of MetaDomain on real data generated by
    next-generation sequencing technologies, we applied MetaDomain to protein domain analysis in
    two data sets. The first one is the transcriptome generated using RNA-seq for Burkholderia
    cenocepacia. As both the reference genome and its domain annotations are available, we
    can quantify the sensitivity and false positive (FP) rate of MetaDomain. The second one
    is a metagenomic data set sequenced from soil. We applied MetaDomain to identify domains
    encoded in the underlying data. In addition, we compared HMMER and MetaDomain in both
    applications.


    3.4.1 Identifying transcribed protein domains in transcriptome

    In this experiment, we conducted transcribed domain analysis in the transcriptome from
    one strain of B. cenocepacia named AU1054 [52]. Using Illumina RNA-seq, the authors
    generated multiple samples for AU1054 in two growth media. We used one replicate of the cDNA
    sample of AU1054 in the cystic fibrosis growth medium. In total, 3,361,008 reads of length
    41 bp were downloaded from the website provided by the authors. We evaluated the
    performance of read classification and domain identification of MetaDomain and HMMER.

    3.4.1.1 Performance of read classification

    The performance of read classification is quantified using both read classification sensitivity
    and FP (false positive) rate. In this experiment, the read classification performance is
    computed on reads that can be mapped to annotated domains. Below we sketch the main
    steps to obtain mapped reads for a domain using the reference genome and the domain
    annotations. First, we downloaded the genome of AU1054 and the annotated genes and
    domains from the IMG website [53]. There are 2,181 annotated Pfam domains. Second, the
    reads were mapped to the reference genome using Bowtie [54] with two mismatches allowed.
    Third, we compared the positions of read mapping and annotated domains. For a domain, all
    reads that fall into it are defined as "mapped" reads. Denote the set of mapped reads as M.
    All other (unmapped) reads constitute set U. For a domain classification tool, let the set of
    aligned reads for a domain be A. Thus, the sensitivity and FP rate of read classification for
    a domain are |A ∩ M| / |M| and |A − M| / |U|, respectively. A perfect sensitivity indicates
    that all mapped reads can be aligned. A zero FP rate indicates that only mapped reads can
    be aligned to a domain.
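    With reads represented as sets of identifiers, the two ratios defined above translate
    directly into set operations. The sketch below is illustrative only; the function name and
    the toy read identifiers are hypothetical, and it assumes a domain with at least one
    mapped read, as in the evaluation that follows.

```python
def read_classification_rates(aligned, mapped, unmapped):
    """Return (sensitivity, FP rate) for one domain.

    aligned:  set A of reads the tool aligned to the domain.
    mapped:   set M of reads mapped to the domain via the reference genome.
    unmapped: set U of all other reads.
    """
    sensitivity = len(aligned & mapped) / len(mapped)
    fp_rate = len(aligned - mapped) / len(unmapped)
    return sensitivity, fp_rate


# Hypothetical example: 4 mapped reads, 10 unmapped reads, 4 aligned reads.
toy_mapped = {"r1", "r2", "r3", "r4"}
toy_unmapped = {f"r{n}" for n in range(5, 15)}
toy_aligned = {"r1", "r2", "r3", "r9"}
toy_sens, toy_fp = read_classification_rates(toy_aligned, toy_mapped, toy_unmapped)
```

    In this toy case three of the four mapped reads are aligned and one of the ten unmapped
    reads is aligned by mistake.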

    Of the 2,181 annotated families, we evaluated the performance of HMMER and MetaDomain
    on the 1,406 families that have at least one mapped read. Of the 1,406 tested domains,
    HMMER could not align any read to 1,150 domains, resulting in zero sensitivity and zero FP
    rate. For the remaining 256 domains, all reads aligned by HMMER are unmapped reads,
    resulting in zero sensitivity and a positive FP rate. The comparison between HMMER and
    MetaDomain is summarized using a bubble chart in Figure 3.5. The biggest bubble indicates
    that HMMER has zero sensitivity and zero FP rate for 1,150 domains. As we can see, it
    is highly difficult for HMMER to correctly align reads as short as 41 bp. There are two
    reasons for the low sensitivity of HMMER on short reads. First, the parameter training in
    the E-value calculation of HMMER is based on much longer reads (100 amino acids). Thus,
    the small alignment scores generated by the short reads yield large E-values and cannot pass
    the E-value threshold. Second, the small alignment scores of short reads may not pass the
    filtration stage of HMMER.

    3.4.1.2 Identifying transcribed domains in the transcriptome

    Figure 3.5 only shows the read classification performance. MetaDomain uses both aligned
    read number and domain coverage as thresholds for domain identification. We expect that
    the additional constraint will reduce the false positive rate in domain identification. Because
    of the low read classification sensitivity, we speculate that HMMER will have low sensitivity
    in identifying transcribed domains.

    [Figure 3.5 bubble chart: x-axis, FP rate (0 to 7 × 10^-5); y-axis, sensitivity (0 to 1); series for HMMER3 and MetaDomain.]

    Figure 3.5: Read classification sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate.

    In order to quantify the performance of domain identification, we need to build positive
    and negative test sets, which include transcribed and non-transcribed domains based on
    mapped reads. There is no commonly accepted criterion to define transcribed genes using
    the number of mapped reads. Various expression scores, such as the average coverage depth
    across the entire length of each gene [55] and reads per kilobase of exon model per million
    mapped reads (RPKM), are used to quantify transcriptional level. In addition, the cutoffs
    for defining highly transcribed, lowly transcribed, or non-transcribed genes vary across
    applications [56]. In this work, we define transcribed domains based on the rationale
    that a truly transcribed domain should be mapped by a number of reads at different positions.
    Correspondingly, we use the following criteria to determine whether a domain inside a gene
    is transcribed: 1) at least N reads are mapped to the domain; 2) at least 30% of the positions
    in the domain are mapped by reads. A domain is labeled "non-transcribed" if the number
    of mapped reads is zero. Domains that fall between the criteria for transcribed and
    non-transcribed domains are labeled "unknown" and are excluded from the test sets.
    Table 3.1 shows how the sizes of the positive and negative test sets change with the cutoff N.

    39

    Table 3.1: Number of transcribed and non-transcribed domains using different cutoffs (N) for the number of mapped reads.

    N     transcribed   unknown   non-transcribed
    10    318           1317      546
    15    262           1373      546
    20    226           1409      546
    25    195           1440      546
    30    169           1466      546

    Intuitively, a bigger N creates an easier case for domain classification than a smaller N.
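    The labeling criteria used to build the test sets can be sketched as a small function. The
    function name and argument layout are hypothetical; the rules themselves (at least N
    mapped reads and at least 30% of positions mapped for "transcribed", zero mapped reads
    for "non-transcribed", everything else "unknown") follow the definitions given above.

```python
def label_domain(num_mapped_reads, mapped_coverage, n_cutoff, min_coverage=0.30):
    """Assign a reference label to a domain from its read-mapping statistics."""
    if num_mapped_reads == 0:
        return "non-transcribed"
    if num_mapped_reads >= n_cutoff and mapped_coverage >= min_coverage:
        return "transcribed"
    return "unknown"


# Hypothetical domains: (mapped reads, fraction of positions mapped).
toy_stats = [(0, 0.0), (25, 0.5), (40, 0.1), (3, 0.6)]
labels_n20 = [label_domain(reads, cov, n_cutoff=20) for reads, cov in toy_stats]
labels_n30 = [label_domain(reads, cov, n_cutoff=30) for reads, cov in toy_stats]
```

    Raising N moves borderline domains from "transcribed" to "unknown" while leaving the
    "non-transcribed" set unchanged, which matches the constant last column of Table 3.1.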

    We align all reads to the transcribed and non-transcribed domains using MetaDomain
    and HMMER. The "unknown" domains are removed due to their ambiguity. For HMMER,
    we first translated the short reads into peptide sequences using 6-frame translations. We
    then aligned the domains with the translated sequences using 1000 as the E-value threshold,
    which is chosen to maximize the sensitivity. For MetaDomain, we directly aligned the short
    reads with the domains. The pipeline in Figure 3.4 was used to output a list of transcribed
    domains for MetaDomain. Let D+ and D− be the sets of transcribed and non-transcribed
    domains identified using the read mapping results in Section 3.4.1.2. Let M+ and M− be the
    sets of domains predicted as transcribed and non-transcribed by MetaDomain or HMMER.
    The sensitivity and FP rate of domain classification tools are defined using the following
    equations:

        \text{Sensitivity} = \frac{|D^{+} \cap M^{+}|}{|D^{+}|}

        \text{FP rate} = \frac{|D^{-} \cap M^{+}|}{|D^{-}|}

    The values of D and M are affected by several options. First, D+ and D− can change
    over the cutoff N as shown in Table 3.1. Second, we used both the domain coverage and the
    number of aligned reads to determine whether a domain is encoded or transcribed. In this
    experiment, the cutoff for domain coverage is 30%, which we found reasonable across different
    experiments. Thus, M+ and M− mainly change over the required number of aligned reads
    to a domain. For simplicity, we denote the cutoff as τ. Increasing τ implies a more stringent
    constraint for defining transcribed domains, and thus might result in lower sensitivity and a
    smaller FP rate. Decreasing τ is likely to increase the sensitivity while incurring a higher FP
    rate. In order to compare the performance of MetaDomain and HMMER under different τ,
    we plotted the ROC curves by changing τ from 1 to N for N=10, 20, and 30 in Figure 3.6.
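    The threshold sweep behind such ROC curves can be sketched as follows. This is a minimal
    illustration with hypothetical names and toy data, assuming each tool's output has been
    reduced to an aligned-read count and a coverage value per domain; for each τ it
    recomputes the predicted set M+ and evaluates the sensitivity and FP rate defined above.

```python
def roc_points(predictions, positives, negatives, n_cutoff, min_coverage=0.30):
    """Return a list of (tau, FP rate, sensitivity) for tau = 1..n_cutoff.

    predictions: dict mapping domain -> (aligned read count, domain coverage).
    positives:   set D+ of reference transcribed domains.
    negatives:   set D- of reference non-transcribed domains.
    """
    points = []
    for tau in range(1, n_cutoff + 1):
        # M+ for this tau: domains meeting both the read-count and coverage cutoffs.
        called = {
            domain
            for domain, (reads, coverage) in predictions.items()
            if reads >= tau and coverage >= min_coverage
        }
        sensitivity = len(called & positives) / len(positives)
        fp_rate = len(called & negatives) / len(negatives)
        points.append((tau, fp_rate, sensitivity))
    return points


# Hypothetical predictions for five domains.
toy_predictions = {
    "d1": (12, 0.6), "d2": (3, 0.4), "d3": (8, 0.2),
    "d4": (2, 0.5), "d5": (0, 0.0),
}
toy_roc = roc_points(toy_predictions, {"d1", "d2", "d3"}, {"d4", "d5"}, n_cutoff=10)
```

    Because many values of τ can select the same set of domains, consecutive points may
    coincide, which gives short, compact curves.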

    [Figure 3.6 plots: three ROC panels titled N=10, N=20, and N=30; x-axis, FP rate (0 to 0.2); y-axis, sensitivity; curves for HMMER3 and MetaDomain.]

    Figure 3.6: ROC curves of HMMER and MetaDomain.

    Figure 3.6 shows that HMMER is highly specific (FP rate ≤ 1.3%). However, as we
    speculated, its sensitivity is low, with the highest sensitivity being only 0.135. HMMER
    misses a large portion of short reads that can be mapped to protein domains even when we
    use a very relaxed E-value cutoff. When both tools incur an FP rate of 0.02, the sensitivity
    of MetaDomain is 0.53 vs. 0.13 for HMMER. When N decreases from 30 to 10, the size of the
    positive test set D+ becomes larger and the sensitivity of both HMMER and MetaDomain
    decreases. Note that the sensitivity and FP rate of HMMER remain the same for many different
    thresholds (i.e., τ), resulting in compact ROC curves. Overall, the ROC curves show that
    MetaDomain can achieve higher sensitivity while keeping an FP rate similar to that of HMMER
    for domain classification in this experiment. In addition, Figure 3.6 provides guidance on
    determining an appropriate τ for MetaDomain in order to achieve the desired sensitivity and FP
    rate.

    On average, it took MetaDomain 280 seconds to align 752,156 reads with one domain on

    a 2.

