2011 09 Human Genebuild


Ensembl gene annotation project (e! 62 and e! 63)

Homo sapiens (human, GRCh37 assembly)

Raw computes stage: Searching for sequence patterns, aligning proteins and cDNAs to the genome.

Approximate time: 3 weeks

The annotation process of the high-coverage human assembly began with the raw compute stage [Figure 1] whereby the genomic sequence was screened for sequence patterns including repeats using RepeatMasker [1.] (version 3.2.5, with parameters '-nolow -species "homo sapiens" -s'), Dust [2.] and TRF [3.]. RepeatMasker and Dust combined masked 46.60% of the species genome.
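The masked percentage quoted above can be illustrated with a minimal sketch. It assumes the masked genome is available as a string in which repeats are soft-masked (lowercase) or hard-masked (N or X); the function name and the toy sequence are hypothetical and are not part of the Ensembl pipeline. Note that counting N as masked would also count assembly gaps, so this is only an approximation.

```python
def masked_fraction(sequence: str) -> float:
    """Return the fraction of bases that are masked.

    Assumption: masked bases are lowercase (soft-masking) or N/X (hard-masking).
    """
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base in "NX")
    return masked / len(sequence)


# Toy sequence: 20 bases, of which 6 are lowercase and 2 are N.
example = "ACGTacgtacNNACGTACGT"
print(f"{masked_fraction(example):.1%} of bases masked")
```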

Figure 1: Summary of human gene annotation project

Transcription start sites were predicted using Eponine-scan [4.] and FirstEF [5.]. CpG islands and tRNAs [6.] were also predicted. Genscan [7.] was run across RepeatMasked sequence and the results were used as input for UniProt [8.], UniGene [9.] and Vertebrate RNA [10.] alignments by WU-BLAST. [...] Exonerate (cdna2genome model) [12.] to generate coding models [Figure 2].

Additionally, pre-aligned annotated cDNAs were re-aligned to unmasked genomic regions. This approach helped in discovering small exons which may have been ignored by Exonerate because of their size [Figure 2]. Because all cDNAs used in this step had known pairing with proteins (e.g. RefSeq cDNAs with accession prefix “NM_” matching RefSeq proteins with “NP_” prefix), the coding models generated by Exonerate for a given cDNA could be compared to those generated by Genewise using its counterpart protein. The Apollo software [15.] was used to visualise the results of filtering.

Where one protein sequence had generated more than one candidate coding model at a locus, the BestTargetted module was used to select the coding model that most closely matched the source protein to take through to the next stage of the gene annotation process. The generation of transcript models using species-specific (in this case, human) data is referred to as the “Targetted stage”. This stage resulted in 120658 coding models built from 41383 human proteins and 6722 cDNAs which were taken through to the UTR addition stage.
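The selection step described above can be sketched as follows. This is only an illustration of "keep the model that most closely matches its source protein"; the record fields, the use of percent identity and coverage as the measure of closeness, and the function name are all assumptions, not the actual BestTargetted implementation.

```python
from dataclasses import dataclass


@dataclass
class CodingModel:
    model_id: str
    protein_id: str   # source protein the model was built from
    locus: str        # hypothetical locus label
    identity: float   # assumed: % identity of the model translation to the protein
    coverage: float   # assumed: % of the source protein covered by the model


def select_best_models(models):
    """Keep one model per (protein, locus): highest identity, then highest coverage."""
    best = {}
    for model in models:
        key = (model.protein_id, model.locus)
        current = best.get(key)
        if current is None or (model.identity, model.coverage) > (
            current.identity,
            current.coverage,
        ):
            best[key] = model
    return list(best.values())


candidates = [
    CodingModel("m1", "P1", "locusA", 97.0, 90.0),
    CodingModel("m2", "P1", "locusA", 99.5, 88.0),
    CodingModel("m3", "P2", "locusA", 80.0, 70.0),
]
print([m.model_id for m in select_best_models(candidates)])
```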

Similarity stage: Generating additional coding models using proteins from related species

Approximate time: 2 weeks

Following the human Targetted alignments, additional coding models were generated as follows. The UniProt alignments from the Raw Computes step were filtered to retain only those sequences belonging to UniProt's “Mammalia” and “Vertebrata” taxonomical classes as well as UniProt's Protein Existence (PE) classification level 1 and 2. In genomic regions which were not covered by any coding models from Targetted alignments, WU-BLAST was rerun for the UniProt protein sequences and the results were passed to Genewise [13.] to build coding models. In most cases, multiple coding models built from different UniProt proteins were generated in a single locus, each model with a slightly different exon-intron structure. To filter for the best supported structures, the TranscriptConsensus module was used to compare each Genewise model against human cDNA and EST alignments in the region (see next section on how these alignments were generated), where exons in the Genewise model were scored for overlapping with exons of cDNA/EST alignments, and model(s) with the highest combined score in a region were kept. The generation of transcript models using data from related species is referred to as the “Similarity stage” [Figure 3]. This stage resulted in 4452 and 1995 coding models supported by mammalian UniProt proteins and non-mammalian vertebrate UniProt proteins respectively.
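The exon-overlap scoring described above can be sketched in a few lines. This is a simplified illustration, not the TranscriptConsensus code: it assumes exons are (start, end) pairs on the same strand with inclusive coordinates, and it scores a model by counting overlapping exon pairs, which is one of several plausible scoring schemes. All names are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def score_model(model_exons, evidence_exons):
    """Count (model exon, evidence exon) pairs that overlap."""
    return sum(
        1 for exon in model_exons for ev in evidence_exons if overlaps(exon, ev)
    )


def best_models(models, evidence_exons):
    """Return the names of the model(s) with the highest combined score."""
    if not models:
        return []
    scores = {name: score_model(exons, evidence_exons) for name, exons in models.items()}
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


models = {
    "genewise_A": [(100, 200), (300, 400)],
    "genewise_B": [(100, 200), (500, 600)],
}
cdna_est_exons = [(150, 210), (290, 390)]
print(best_models(models, cdna_est_exons))
```

Ties are kept deliberately, mirroring the "model(s)" wording in the text.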

Figure 3: Alignment and filtering of mammalian and vertebrate proteins

cDNA and EST alignments

Approximate time: 2-3 weeks

Human cDNA and EST sequences were previously downloaded from ENA/Genbank/DDBJ and RefSeq [9.], clipped to remove polyA tails, and aligned to the genome using Exonerate (est2genome model) [Figure 4]. Of these, 221864 (of 276510) human cDNAs and 7297521 (of 8174393) human ESTs aligned to the genome. The coverage and percentage identity cut-offs for cDNA alignments were both set at 98%, which is higher than those for ESTs (90% coverage, 97% identity) because cDNAs are generally less fragmented than ESTs. EST alignments were used to generate EST-based gene models similar to those for mouse [14.] and these are displayed on the website in a separate track from the Ensembl gene set.
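As a small illustration of how the two sets of cut-offs differ, the sketch below applies the thresholds quoted above to alignments described by a coverage and an identity value. The dictionary layout, the function name and the use of "greater than or equal to" are assumptions; the pipeline's real configuration is not shown in this document.

```python
# Thresholds taken from the text: cDNAs 98% coverage and identity,
# ESTs 90% coverage and 97% identity.
THRESHOLDS = {
    "cdna": {"coverage": 98.0, "identity": 98.0},
    "est": {"coverage": 90.0, "identity": 97.0},
}


def passes_cutoff(kind: str, coverage: float, identity: float) -> bool:
    """Return True if an alignment meets both cut-offs for its sequence type.

    Assumption: values equal to the threshold are accepted.
    """
    limits = THRESHOLDS[kind]
    return coverage >= limits["coverage"] and identity >= limits["identity"]


print(passes_cutoff("cdna", 95.0, 99.0))  # coverage below the cDNA cut-off
print(passes_cutoff("est", 95.0, 97.5))   # meets both EST cut-offs
```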

Figure 4: Alignment of human cDNAs and ESTs to the human genome

Filtering coding models


The set of coding models was finalised after another stage of filtering, which involved manual removal of some more Targetted models supported by dubious human protein/cDNA evidence on a case-by-case basis, and removal of ~60% of Similarity alignments which contained non-canonical (non GT/AG) splice sites using a Perl script. The Apollo software [15.] was used to visualise the results of filtering.
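The splice-site check mentioned above can be sketched as follows. The original filter was a Perl script that is not reproduced in this document; this Python version is only an illustration under simplifying assumptions: forward-strand models, 0-based end-exclusive exon coordinates, and a genome held in memory as a string. Function names are hypothetical.

```python
def intron_is_canonical(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron begins with GT and ends with AG (forward strand)."""
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def has_noncanonical_intron(genome: str, exons) -> bool:
    """Check every intron implied by consecutive exons of one model."""
    ordered = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if not intron_is_canonical(genome, prev_end, next_start):
            return True
    return False


# Toy genome: exon, GT...AG intron, exon, GC...AG intron, exon.
genome = "AAAA" + "GTCCCAG" + "TTTT" + "GCCCCAG" + "CCCC"
print(has_noncanonical_intron(genome, [(0, 4), (11, 15)]))            # canonical only
print(has_noncanonical_intron(genome, [(0, 4), (11, 15), (22, 26)]))  # second intron is GC/AG
```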

Addition of UTR to coding models


After finalising the set of coding models, those generated by Genewise alignments were extended into the untranslated regions (UTRs) using human cDNAs. Coding models generated by Exonerate's cdna2genome model (including the exonerate2genes_region approach, where pre-aligned cDNA sequences are aligned to unmasked genomic regions) already contained UTR annotations and hence did not go through this UTR addition step. Where available, human DiTag alignments were used to guide the positioning of UTRs and add additional weight to some UTR structures, while RefSeq “NM” cDNA vs “NP” protein pairing information was used to ensure the correct matching of cDNAs to coding models supported by RefSeq proteins. This resulted in 41017 (of 48223) coding models from 37007 human proteins with UTR, and 405 (of 2994) coding models from 370 UniProt proteins with UTR.

Generating multi-transcript Ensembl genes

Approximate time: 4-5 weeks

The above steps generated a large set of potential transcript models, with or without UTR, many of which overlapped one another. Redundant transcript models were collapsed and the remaining unique set of transcript models were clustered into multi-transcript genes where each transcript in a gene has at least one coding exon that overlaps a coding exon from another transcript within the same gene. The resulting Ensembl gene set contained 23086 genes, of which 22369 contained transcripts supported by human cDNAs/proteins only (from the “Targetted” stage of the build), and 717 contained transcripts supported by UniProt proteins only from the “Similarity” stage of the build [Figure 5]. The Ensembl genes were associated with a total of 52559 Ensembl transcripts, of which 51835 were supported by human cDNAs/proteins, and 724 had support from UniProt proteins [Figure 6].
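The clustering rule above (transcripts belong to the same gene when they are linked by overlapping coding exons) can be sketched with a small union-find. This is an illustrative sketch, not the Ensembl clustering code: it assumes all transcripts lie on the same chromosome and strand, uses inclusive (start, end) coding-exon coordinates, and compares every pair of transcripts, which is fine for a toy example but not for a whole genome.

```python
from itertools import combinations


def share_coding_exon(exons_a, exons_b) -> bool:
    """True if any coding exon of one transcript overlaps one of the other."""
    return any(a[0] <= b[1] and b[0] <= a[1] for a in exons_a for b in exons_b)


def cluster_transcripts(transcripts):
    """Group transcript names into genes via single-linkage on exon overlap."""
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(transcripts, 2):
        if share_coding_exon(transcripts[a], transcripts[b]):
            parent[find(a)] = find(b)

    genes = {}
    for name in transcripts:
        genes.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in genes.values())


transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(350, 450)],
    "t3": [(440, 500)],
    "t4": [(1000, 1100)],
}
print(cluster_transcripts(transcripts))
```

In the toy data, t1 and t3 do not overlap each other directly but end up in the same gene because both overlap t2.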

Figure 5: Supporting evidence for human Ensembl gene set (chart categories: human cDNAs/proteins; UniProt mammalian/vertebrate proteins)

Figure 6: Supporting evidence for human Ensembl transcript set (chart categories: human cDNAs/proteins; UniProt mammalian/vertebrate proteins)

Pseudogenes, immunoglobulin genes, mitochondrial genes


The Ensembl gene set was screened for pseudogenes and retrotransposed genes. Next, human immunoglobulin (Ig) genes were annotated using the Ensembl “Ig genebuild” pipeline [16.]. Briefly, human proteins and cDNAs for Ig genes were downloaded from IMGT [17.] and aligned to the human genome using Exonerate. The Exonerate alignments were processed to join the V/D/J/C segments together into Ig gene models, which were then compared to the Ig genes already present in the Ensembl gene set (generated at the Targetted stage). If the models generated by the “Ig genebuild” pipeline overlapped with existing Ensembl genes at the exon level, the existing Ensembl genes were replaced by the new Ig gene models, as the latter are usually more accurate representations of Ig genes. Also imported into the Ensembl gene set were annotations of mitochondrial genes in INSDC [18.] and short non-coding RNAs (e.g. miRNAs, snoRNAs) generated by the ncRNA pipeline [19.].

Merging Ensembl and Vega gene sets, annotating long intergenic non-coding RNA genes and generating the final gene set.


Following the completion of the Ensembl gene set, Ensembl annotations and manual annotations (primarily generated by the HAVANA team at the Wellcome Trust Sanger Institute) from the Vega database [20., 21.] were merged at the transcript level to create the final gene set. The Vega database (as of 12 September 2010) contained 39565 genes and 133451 transcripts.

In the merge process, Ensembl and Vega transcripts were merged if they had identical exon-intron structures. If transcripts from the two annotation sources matched at all internal exon-intron boundaries, i.e. had identical splicing pattern, but one of them had longer terminal exons, usually the UTRs, they were merged too, but the resulting merged transcript would adopt the exon-intron structure of the Vega transcript as we prioritised Vega annotation over Ensembl. Transcripts which had not been merged, either because of differences in internal exon-intron boundaries or presence of transcripts in only one annotation source, were transferred from the source to the final gene set intact.
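The structural part of the merge rule can be sketched as a small decision function. It is a hedged illustration rather than the Ensembl-Vega merge code: exons are assumed to be (start, end) pairs on the same strand, "matching at all internal exon-intron boundaries" is interpreted as having an identical list of introns, and single-exon transcripts with different ends are left unmerged because the text does not say how they are handled.

```python
def introns(exons):
    """Return the introns implied by a list of (start, end) exons."""
    ordered = sorted(exons)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


def merge_decision(ensembl_exons, vega_exons):
    """Return ("merge", exons_to_use) or ("keep_both", None).

    When a merge happens the Vega structure is always adopted,
    reflecting the priority given to Vega annotation.
    """
    if sorted(ensembl_exons) == sorted(vega_exons):
        return "merge", sorted(vega_exons)
    ensembl_introns = introns(ensembl_exons)
    if ensembl_introns and ensembl_introns == introns(vega_exons):
        # Same splicing pattern, different terminal exon ends.
        return "merge", sorted(vega_exons)
    return "keep_both", None


ensembl = [(100, 200), (300, 400), (500, 650)]
vega = [(80, 200), (300, 400), (500, 600)]
print(merge_decision(ensembl, vega))
```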

The Ensembl-Vega merge code also took into account the biotype and supporting evidence associated with the transcripts from both annotation sources. For a pair of transcripts to be merged, if there was a mismatch in biotype, e.g. the Ensembl transcript is protein-coding but the Vega counterpart is non-coding, the Vega biotype would have precedence over the Ensembl model and the Ensembl transcript would undergo a biotype change to match its Vega counterpart. The translation for the Ensembl transcript would then be removed if the transcript has lost its protein-coding biotype. Biotype conflicts between Ensembl and Vega were always reported to the HAVANA team for investigation, and when resolved, could improve the merged gene set in the future. As for supporting evidence, the merge of Ensembl and Vega transcripts also involved merging of protein/cDNA supporting evidence associated with the transcripts to ensure the basis on which the annotations were made would not be lost.
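The biotype and evidence handling can be illustrated with a minimal sketch. The dictionary layout, the "protein_coding" label and the function name are assumptions made for the example; only the rules themselves (Vega biotype wins, translation dropped when the coding biotype is lost, conflicts flagged, evidence pooled) come from the text.

```python
def resolve_merged_transcript(ensembl: dict, vega: dict):
    """Apply Vega biotype precedence and pool supporting evidence.

    Returns the merged record and a flag saying whether a biotype
    conflict should be reported for manual investigation.
    """
    merged = dict(ensembl)
    conflict = ensembl["biotype"] != vega["biotype"]
    merged["biotype"] = vega["biotype"]
    if merged["biotype"] != "protein_coding":
        # The translation is removed once the coding biotype is lost.
        merged["translation"] = None
    merged["evidence"] = sorted(set(ensembl["evidence"]) | set(vega["evidence"]))
    return merged, conflict


ensembl_tx = {"biotype": "protein_coding", "translation": "MKT", "evidence": ["NP_0001"]}
vega_tx = {"biotype": "processed_transcript", "translation": None, "evidence": ["cDNA_X"]}
print(resolve_merged_transcript(ensembl_tx, vega_tx))
```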

Following the merge, long intergenic non-coding RNA genes (lincRNAs) were annotated by the Ensembl lincRNA pipeline [19.] and incorporated in the final gene set.

An important feature of the merged gene set is the presence of all Vega source transcripts. This has been made possible by allowing Vega annotation to take precedence over Ensembl's when merging transcripts which do not match at their terminal exons or have different biotypes. Of all Vega transcripts, 18.3% were merged with Ensembl transcripts. The vast majority of merged transcripts (89.6%) are of protein-coding biotype. Vega transcripts which were not merged (82.7% of Vega source transcripts) were mostly alternative splice variants, pseudogenes or non-coding. These transcripts were fully transferred into the final gene set. The final Ensembl-Vega set consisted of 44314 genes and 160002 transcripts. Of the 160002 transcripts, 15.3% (24492) were the result of merging Ensembl and Vega annotations, 16.1% (25718) originated from Ensembl, 68.5% (109650) originated from Vega, and the remaining ~0.4% were incorporated from other sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT data).

As a quality-control measure, Ensembl translations of protein-coding transcripts in the final merged gene set were aligned against the NCBI RefSeq and UniProt/SwissProt sets of public curated protein sequences (which were used in the “Targetted” stage of the gene build) to calculate the proportion of curated sequences covered by the merged gene set. Over 99% of RefSeq and SwissProt proteins were represented in the merged gene set, and in the majority of cases, there was a 100% match between the curated protein and Ensembl translation.

Since Ensembl release 56 (September 2009), the Ensembl-Vega gene set has exactly corresponded to a GENCODE release [23.]. The gene set in release 62, which this document describes, corresponds to GENCODE release 7. Each GENCODE release also contains the full annotation of the consensus coding sequence (CCDS) transcript models [24.]. All CCDS models are included in each release of the human gene set.

Protein annotation, cross-referencing, stable identifiers

Approximate time: 4 weeks

Before public release the transcripts and translations were given external references (cross-references to external databases), while translations were searched for domains/signatures of interest and labelled where appropriate. Stable identifiers were assigned to each gene, transcript, exon and translation. When annotating a species for the first time, these identifiers are auto-generated. In all subsequent annotations for a species, the stable identifiers are propagated based on comparison of the new gene set to the previous gene set.

Additional annotation and post genebuild filtering in Ensembl release 63

Addition of annotation on haplotype regions

Approximate time: 1-2 weeks

The annotation of the haplotype regions on chromosomes 6, 14 and 17 was added after the main reference genome had been annotated. Figure 7 shows the annotation pipeline, which closely follows the procedure described earlier. The annotation resulted in a final gene set of 2831 genes, of which 240 were pseudogenes or retrotransposed genes.

Post genebuild filtering

Approximate time: 3-4 weeks

To filter out poorly supported models that may have erroneously been included in the full annotation, the human gene set undergoes an additional filtering process after each annotation. This is to take advantage of the comparative genomics information that becomes available only after the first annotation has been released.

Figure 7: Workflow for the annotation of haplotype regions in chromosomes 6, 14 and 17

All models annotated by Ensembl were filtered systematically by a series of Perl scripts to remove models with erroneous structures. Examples of such scenarios would be where a model differed considerably in its internal structure compared to other models in the same locus, or if exons were missing or had non-consistent splice sites. In addition, models supported by cDNA fragments with wrongly annotated short open-reading frames were removed manually on a case-by-case basis. Further filtering of the models was done using the following criteria at gene level:

–



The quality of a gene set is dependent on the quality of the genome assembly. Genome assembly can be assessed in a number of ways, including:

1. Coverage estimate
   o A higher coverage usually indicates a more complete assembly.
   o Using Sanger sequencing only, a coverage of at least 2x is preferred.

2. N50 of contigs and scaffolds
   o A longer N50 usually indicates a more complete genome assembly.
   o Bearing in mind that an average human gene may be 10-15 kb in length, contigs shorter than this length will be unlikely to hold full-length gene models.

3. Number of contigs and scaffolds
   o A lower number of toplevel sequences usually indicates a more complete genome assembly.

4. Alignment of cDNAs and ESTs to the genome
   o A higher number of alignments, using stringent thresholds, usually indicates a more complete genome assembly.
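For readers unfamiliar with the N50 statistic in the list above, the following sketch computes it from a list of contig or scaffold lengths using the common definition: the length L such that sequences of length L or longer together contain at least half of the total assembled bases. The function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that takes the running sum to half the total.
        if running * 2 >= total:
            return length


contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy values, total = 54
print(n50(contig_lengths))
```

With these toy values the three longest sequences (10, 9 and 8) reach half of the total, so the N50 is 8.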

More information on the Ensembl automatic gene annotation process can be found at:

• Curwen V, Eyras E, Andrews TD, Clarke ...

• http://www.ensembl.org/info/docs/genebuild/genome_annotation.html

• http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/the_genebuild_process.txt?root=ensembl&view=co

References

1. Smit, AFA, Hubley, R & Green, P: RepeatMasker Open-3.0. 1996-2010. www.repeatmasker.org

2. Kuzio J, Tatusov R, and ...

13. Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14(5):988-995. [PMID: 15123596]

14. Eyras E, Caccamo M, Curwen V, Clamp M. ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res. 2004 14(5):976-987. [PMID: 15123595]

15. The vertebrate genome annotation (Vega) database. Nucleic Acid Res. 2008 Jan; Advance Access published on November 14, 2007; doi:10.1093/nar/gkm987

22. http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

23. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K., Chrast,J.,
