8/13/2019 2011 09 Human Genebuild
Ensembl gene annotation project (e! 62 and e! 63)
Homo sapiens (human, GRCh37 assembly)
Raw computes stage: Searching for sequence patterns,
aligning proteins and cDNAs to the genome.
Approximate time: 3 weeks
The annotation process of the high-coverage human assembly began with the
raw compute stage [Figure 1] whereby the genomic sequence was screened
for sequence patterns including repeats using RepeatMasker [1]
(version 3.2.5, with parameters ‘-nolow -species “homo sapiens”
-s’), Dust [2] and TRF [3]. RepeatMasker and Dust combined masked
46.60% of the species genome.
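Figure 1: Summary of human gene annotation project
(flowchart boxes: Raw computes; Proteins/cDNAs aligned; Filter; Add UTR to
coding models; Ensembl gene set; Human Vega gene set; Merge; Final gene set)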
Transcription start sites were predicted using Eponine-scan [4] and
FirstEF [5]. CpG islands and tRNAs [6] were also predicted. Genscan [7]
was run across RepeatMasked sequence and the results were used as input
for UniProt [8], UniGene [9] and Vertebrate RNA [10] alignments by
WU-BLAST [...] Exonerate (cdna2genome model) [12] to generate coding models
[Figure 2]. Additionally, pre-aligned annotated cDNAs were re-aligned to
unmasked genomic regions. This approach helped in discovering small exons
which may have been ignored by exonerate because of their size [Figure 2].
Because all cDNAs used in this step had known pairing with proteins (e.g.
RefSeq cDNAs with accession prefix “NM_” matching RefSeq proteins with “NP_”
prefix), it allowed the comparison of coding models generated by exonerate
for a given cDNA to those generated by Genewise using its counterpart
protein. The Apollo software [15] was used to visualise the results of
filtering.
Where one protein sequence had generated more than one candidate coding
model at a locus, the BestTargetted module was used to select the coding
model that most closely matched the source protein to take through to the
next stage of the gene annotation process. The generation of transcript
models using species-specific (in this case, human) data is referred to as the
“Targetted stage”. This stage resulted in 120658 coding models built from
41383 human proteins and 6722 cDNAs which were taken through to the
UTR addition stage.
Similarity stage: Generating additional coding models using
proteins from related species
Approximate time: 2 weeks
Following the human Targetted alignments, additional coding models were
generated as follows. The UniProt alignments from the Raw Computes step
were filtered to retain only those sequences belonging to UniProt's
“Mammalia” and “Vertebrata” taxonomical classes as well as UniProt's Protein
Existence (PE) classification level 1 and 2. In genomic regions which were
not covered by any coding models from Targetted alignments, WU-BLAST
was rerun for the UniProt protein sequences and the results were passed to
Genewise [13] to build coding models. In most cases, multiple coding
models built from different UniProt proteins were generated in a single locus,
each model with a slightly different exon-intron structure. To filter for the best
supported structures, the TranscriptConsensus module was used to compare
each Genewise model against human cDNA and EST alignments in the
region (see next section on how these alignments were generated), where
exons in the Genewise model were scored for overlapping with exons of
cDNA/EST alignments, and model(s) with the highest combined score in a
region were kept. The generation of transcript models using data from related
species is referred to as the “Similarity stage” [Figure 3]. This stage resulted in
4452 and 1995 coding models supported by mammalian UniProt proteins and
non-mammalian vertebrate UniProt proteins respectively.
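
The consensus filtering described above can be illustrated with a minimal sketch. It is not the TranscriptConsensus module itself: the scoring rule here (one point per model exon that overlaps any cDNA/EST exon) and all function names are assumptions made for illustration, since the text does not give the actual scoring formula.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two exons share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def score_model(model_exons: List[Exon], evidence_exons: List[Exon]) -> int:
    """Count the model exons that overlap at least one cDNA/EST exon.

    Assumed scoring rule; the real module may weight exons differently.
    """
    return sum(
        1 for exon in model_exons
        if any(overlaps(exon, ev) for ev in evidence_exons)
    )


def best_models(models: Dict[str, List[Exon]],
                evidence_exons: List[Exon]) -> List[str]:
    """Return the names of the model(s) with the highest score in a region."""
    scores = {name: score_model(exons, evidence_exons)
              for name, exons in models.items()}
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)
```

Tied models are all returned, which mirrors the "model(s)" wording in the text. The sketch expects at least one candidate model per region.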
Figure 3: Alignment and filtering of mammalian and vertebrate proteins
cDNA and EST alignments
Approximate time: 2-3 weeks
Human cDNA and EST sequences were previously downloaded from
ENA/Genbank/DDBJ and RefSeq [9], clipped to remove polyA tails, and
aligned to the genome using Exonerate (est2genome model) [Figure 4].
221864 (of 276510) human cDNAs and 7297521 (of 8174393) human
ESTs aligned to the genome. The coverage and percentage identity cut-offs
for cDNA alignments were both set at 98%, higher than those
for ESTs (90% coverage, 97% percentage identity), because cDNAs are
generally less fragmented than ESTs. EST alignments were used to generate
EST-based gene models similar to those for mouse [14] and these are
displayed on the website in a separate track from the Ensembl gene set.
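
As an illustration of how such cut-offs act on a set of alignments, the following minimal sketch applies the thresholds quoted above (98% coverage and identity for cDNAs; 90% coverage and 97% identity for ESTs). The `Alignment` record, its field names, and the choice of inclusive comparisons are assumptions for the example and do not describe the Exonerate output format or the pipeline's own code.

```python
from dataclasses import dataclass
from typing import List

# Cut-offs quoted in the text, in percent.
THRESHOLDS = {
    "cdna": {"coverage": 98.0, "identity": 98.0},
    "est": {"coverage": 90.0, "identity": 97.0},
}


@dataclass
class Alignment:
    name: str        # hypothetical sequence accession
    kind: str        # "cdna" or "est"
    coverage: float  # percent of the query covered by the alignment
    identity: float  # percent identity over the aligned region


def passes_cutoffs(aln: Alignment) -> bool:
    """True if the alignment meets both cut-offs for its sequence type.

    Treating the thresholds as inclusive (>=) is an assumption.
    """
    cutoff = THRESHOLDS[aln.kind]
    return (aln.coverage >= cutoff["coverage"]
            and aln.identity >= cutoff["identity"])


def filter_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Keep only the alignments that pass their type-specific cut-offs."""
    return [aln for aln in alignments if passes_cutoffs(aln)]
```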
Figure 4: Alignment of human cDNAs and ESTs to the human genome
Filtering coding models
Approximate time: 2 weeks
The set of coding models was finalised after another stage of filtering, which
involved manual removal of some more Targetted models supported by
dubious human protein/cDNA evidence on a case-by-case basis, and removal
of ~60% of Similarity alignments which contained non-canonical (non GT/AG)
splice sites using a Perl script. The Apollo software [15] was used to
visualise the results of filtering.
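
The splice-site check can be sketched as follows. The filtering described above was done with a Perl script that is not shown in this document; the Python below is only an illustrative stand-in. It assumes forward-strand models, 0-based end-exclusive coordinates, and hypothetical function names.

```python
from typing import List, Tuple


def intron_is_canonical(genome: str, start: int, end: int) -> bool:
    """True if the intron genome[start:end] begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive on the forward strand
    (an assumption; reverse-strand models would need reverse complementing).
    """
    intron = genome[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def model_is_canonical(genome: str, exons: List[Tuple[int, int]]) -> bool:
    """True if every intron implied by the sorted exon list is GT/AG.

    A single-exon model has no introns and is reported as canonical.
    """
    introns = [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]
    return all(intron_is_canonical(genome, s, e) for s, e in introns)
```

For example, a two-exon model whose intron reads `GTAAGTTTTCAG` passes, while the same model with the intron changed to `GCAAGTTTTCAG` would be flagged as non-canonical.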
Addition of UTR to coding models
Approximate time: 2 weeks
After finalising the set of coding models, those generated by Genewise
alignments were extended into the untranslated regions (UTRs) using human
cDNAs. Coding models generated by exonerate's cdna2genome model (this
includes the exonerate2genes_region approach, where pre-aligned cDNA
sequences are aligned to unmasked genomic regions) already contained UTR
annotations and hence did not go through this UTR addition step. Where
available, human DiTag alignments were used to guide the positioning of
UTRs and add additional weight to some UTR structures, while RefSeq “NM”
cDNA vs “NP” protein pairing information was used to ensure the correct
matching of cDNAs to coding models supported by RefSeq proteins. This
resulted in 41017 (of 48223) coding models from 37007 human proteins with
UTR, and 405 (of 2994) coding models from 370 UniProt proteins with UTR.
Generating multi-transcript Ensembl genes
Approximate time: 4-5 weeks
The above steps generated a large set of potential transcript models, with or
without UTR, many of which overlapped one another. Redundant transcript
models were collapsed and the remaining unique set of transcript models
were clustered into multi-transcript genes where each transcript in a gene has
at least one coding exon that overlaps a coding exon from another transcript
within the same gene. The resulting Ensembl gene set contained 23086
genes, of which 22369 contained transcripts supported by human
cDNAs/proteins only (from the “Targetted” stage of the build), and 717
contained transcripts supported by UniProt proteins only from the “Similarity”
stage of the build [Figure 5]. The Ensembl genes were associated with a total
of 52559 Ensembl transcripts, of which 51835 were supported by human
cDNAs/proteins, and 724 had support from UniProt proteins [Figure 6].
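
The clustering rule above (each transcript shares at least one overlapping coding exon with another transcript in the same gene) amounts to single-linkage clustering. The sketch below shows one way to express it with a small union-find structure. It is not the Ensembl implementation: the function names are hypothetical, and all transcripts are assumed to lie on the same chromosome and strand.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two coding exons share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_transcripts(transcripts: Dict[str, List[Exon]]) -> List[Set[str]]:
    """Group transcripts into genes by overlapping coding exons."""
    parent = {name: name for name in transcripts}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(transcripts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if any(exons_overlap(ea, eb)
                   for ea in transcripts[a] for eb in transcripts[b]):
                parent[find(a)] = find(b)  # join the two clusters

    clusters: Dict[str, Set[str]] = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Note that two transcripts can end up in the same gene without overlapping each other directly, provided a third transcript bridges them.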
Figure 5: Supporting evidence for human Ensembl gene set
(chart legend: human cDNAs/proteins; UniProt mamm./vert. proteins)
Figure 6: Supporting evidence for human Ensembl transcript set
(chart legend: human cDNAs/proteins; UniProt mamm./vert. proteins)
Pseudogenes, immunoglobulin genes, mitochondrial genes
Approximate time: 3 weeks
The Ensembl gene set was screened for pseudogenes and retrotransposed
genes. Next, human immunoglobulin (Ig) genes were annotated using the
Ensembl “Ig genebuild” pipeline [16]. Briefly, human proteins and cDNAs for
Ig genes were downloaded from IMGT [17] and aligned to the human
genome using Exonerate. The Exonerate alignments were processed to join
the V/D/J/C segments together into Ig gene models, which were then
compared to the Ig genes already present in the Ensembl gene set
(generated at the Targetted stage). If the models generated by the “Ig
genebuild” pipeline overlapped with existing Ensembl genes at the exon level,
the existing Ensembl genes were replaced by the new Ig gene models, for
the latter are usually more accurate representations of Ig genes. Also
imported into the Ensembl gene set were annotations of mitochondrial genes in
INSDC [18] and short non-coding RNAs (e.g. miRNAs, snoRNAs) generated
by the ncRNA pipeline [19].
Merging Ensembl and Vega gene sets, annotating long
intergenic non-coding RNA genes and generating the
final gene set.
Approximate time: 1 week
Following the completion of the Ensembl gene set, Ensembl annotations and
manual annotations (primarily generated by the HAVANA team at the
Wellcome Trust Sanger Institute) from the Vega database [20, 21] were
merged at the transcript level to create the final gene set. The Vega database
(as of 12 September 2010) contained 39565 genes and 133451 transcripts.
In the merge process, Ensembl and Vega transcripts were merged if they had
identical exon-intron structures. If transcripts from the two annotation sources
matched at all internal exon-intron boundaries, i.e. had identical splicing
pattern, but one of them had longer terminal exons, usually the UTRs, they
were merged too, but the resulting merged transcript would adopt the exon-
intron structure of the Vega transcript as we prioritised Vega annotation over
Ensembl. Transcripts which had not been merged, either because of
differences in internal exon-intron boundaries or presence of transcripts in
only one annotation source, were transferred from the source to the final gene
set intact.
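
The merge rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the Ensembl-Vega merge code: the function names are hypothetical, exons are sorted (start, end) pairs on the same strand, and single-exon transcripts are merged only when their coordinates are identical (the text does not say how differing single-exon transcripts were handled).

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates


def intron_chain(exons: List[Exon]) -> Tuple[Tuple[int, int], ...]:
    """Internal exon-intron boundaries: (end of exon i, start of exon i+1)."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def merge_transcripts(ensembl: List[Exon], vega: List[Exon]) -> Optional[List[Exon]]:
    """Return the exon structure of the merged transcript, or None.

    Identical structures merge. Structures that agree at every internal
    boundary but differ in terminal exon length also merge, and the Vega
    structure is adopted because Vega annotation takes priority.
    """
    if ensembl == vega:
        return list(vega)
    both_spliced = len(ensembl) > 1 and len(vega) > 1
    if both_spliced and intron_chain(ensembl) == intron_chain(vega):
        return list(vega)
    return None  # not merged: each transcript is transferred intact
```

In this sketch a return value of `None` corresponds to the unmerged case, in which both source transcripts would be carried into the final gene set unchanged.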
The Ensembl-Vega merge code also took into account the biotype and
supporting evidence associated with the transcripts from both annotation
sources. For a pair of transcripts to be merged, if there was a mismatch in
biotype, e.g. the Ensembl transcript was protein-coding but the Vega counterpart
was non-coding, the Vega biotype would have precedence over the Ensembl
model and the Ensembl transcript would undergo a biotype change to match
its Vega counterpart. The translation for the Ensembl transcript would then be
removed if the transcript had lost its protein-coding biotype. Biotype conflicts
between Ensembl and Vega were always reported to the HAVANA team for
investigation, and when resolved, could improve the merged gene set in the
future. As for supporting evidence, the merge of Ensembl and Vega
transcripts also involved merging of protein/cDNA supporting evidence
associated with the transcripts to ensure the basis on which the annotations
were made would not be lost.
Following the merge, long intergenic non-coding RNA genes (lincRNAs) were
annotated by the Ensembl lincRNA pipeline [19] and incorporated in the final
gene set.
An important feature of the merged gene set is the presence of all Vega
source transcripts. This has been made possible by allowing Vega annotation
to take precedence over Ensembl's when merging transcripts which do not
match at their terminal exons or have different biotypes. Of all Vega
transcripts, 18.3% of them were merged with Ensembl transcripts. The vast
majority of merged transcripts (89.6%) are of protein-coding biotype. Vega
transcripts which were not merged (82.7% of Vega source transcripts) were
mostly alternative splice variants, pseudogenes or non-coding. These
transcripts were fully transferred into the final gene set. The final
Ensembl-Vega set consisted of 44314 genes and 160002 transcripts. Of the
160002 transcripts, 15.3% (24492) were the result of merging Ensembl and
Vega annotations, 16.1% (25718) originated from Ensembl, 68.5% (109650)
originated from Vega, and the remaining ~0.4% were incorporated from other
sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT
data).
As a quality-control measure, Ensembl translations of protein-coding
transcripts in the final merged gene set were aligned against the NCBI
RefSeq and UniProt/SwissProt sets of public curated protein sequences
(which were used in the “Targetted” stage of the gene build) to calculate the
proportion of curated sequences covered by the merged gene set. Over 99%
of RefSeq and SwissProt proteins were represented in the merged gene set,
and in the majority of cases, there was a 100% match between the curated
protein and Ensembl translation.
Since Ensembl release 56 (September 2009), the Ensembl-Vega gene set
has exactly corresponded to a GENCODE release [23]. The gene set in
release 62, which this document describes, corresponds to GENCODE
release 7. Each GENCODE release also contains the full annotation of the
consensus coding sequence (CCDS) transcript models [24]. All CCDS
models are included in each release of the human gene set.
Protein annotation, cross-referencing, stable identifiers
Approximate time: 4 weeks
Before public release the transcripts and translations were given external
references (cross-references to external databases), while translations were
searched for domains/signatures of interest and labelled where appropriate.
Stable identifiers were assigned to each gene, transcript, exon and
translation. When annotating a species for the first time, these identifiers are
auto-generated. In all subsequent annotations for a species, the stable
identifiers are propagated based on comparison of the new gene set to the
previous gene set.
Additional annotation and post genebuild filtering in Ensembl release 63
Addition of annotation on haplotype regions
Approximate time: 1-2 weeks
The annotation of the haplotype regions on chromosomes 6, 14 and 17 was
added after the main reference genome had been annotated. Figure 7 shows
the annotation pipeline which closely follows the procedure described earlier.
The annotation resulted in a final gene set of 2831 genes of which 240 were
pseudogenes or retrotransposed genes.
Post genebuild filtering
Approximate time: 3-4 weeks
To filter out poorly supported models that may have erroneously
been included in the full annotation, the human gene set undergoes an
additional filtering process after each annotation. This is to take advantage of
the comparative genomics information that becomes available only after the
first annotation has been released.
Figure 7: Workflow for the annotation of haplotype regions in
chromosomes 6, 14 and 17
All models annotated by Ensembl were filtered systematically by a series of
Perl scripts to remove models with erroneous structures. Examples of such
scenarios would be where a model differed considerably in its internal
structure compared to other models in the same locus, or if exons were
missing or had non-consistent splice sites. In addition, models supported by
cDNA fragments with wrongly annotated short open-reading frames were
removed manually on a case-by-case basis. Further filtering of the models
was done using the following criteria at gene level:
–
The quality of a gene set is dependent on the quality of the genome assembly.
Genome assembly can be assessed in a number of ways, including:
1. Coverage estimate
   o A higher coverage usually indicates a more complete assembly.
   o Using Sanger sequencing only, a coverage of at least 2x is
     preferred.
2. N50 of contigs and scaffolds
   o A longer N50 usually indicates a more complete genome
     assembly.
   o Bearing in mind that an average human gene may be 10-15 kb
     in length, contigs shorter than this length will be unlikely to hold
     full-length gene models.
3. Number of contigs and scaffolds
   o A lower number of toplevel sequences usually indicates a more
     complete genome assembly.
4. Alignment of cDNAs and ESTs to the genome
   o A higher number of alignments, using stringent thresholds,
     usually indicates a more complete genome assembly.
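
Of the measures listed above, N50 is the one most easily misread, so a short worked sketch may help. It uses the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length). The function name is illustrative and is not part of any Ensembl tool.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half.
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")
```

For lengths 10, 20, 30 and 40 (total 100), the two longest sequences sum to 70, which is the first running total to reach half of the assembly, so the N50 is 30.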
More information on the Ensembl automatic gene annotation process can be
found at:
• Curwen V, Eyras E, Andrews TD, Clarke [...]
• http://www.ensembl.org/info/docs/genebuild/genome_annotation.html
• http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/the_genebuild_process.txt?root=ensembl&view=co
References
1. Smit, AFA, Hubley, R & Green, P: RepeatMasker Open-3. 1996-2010.
www.repeatmasker.org
2. Kuzio J, Tatusov R, and
13. Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004,
14(5):988-995. [PMID: 15123596]
14. Eyras E, Caccamo M, Curwen V, Clamp M. ESTGenes: alternative splicing from
ESTs in Ensembl. Genome Res. 2004 14(5):976-987. [PMID: 15123595]
15. The vertebrate genome annotation (Vega) database. Nucleic Acid Res. 2008 Jan;
Advance Access published on November 14, 2007; doi:10.1093/nar/gkm987
22. http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html
23. Harrow, J., Denoeud, F., Frankish, A., Reymond, A., Chen, C.K., Chrast, J.,