
2011 09 Human Genebuild

Date post: 04-Jun-2018
Upload: gabriel-caicedo-russy
  • 8/13/2019 2011 09 Human Genebuild


    Ensembl gene annotation project (e! 62 and e! 63)

    Homo sapiens (human, GRCh37 assembly)

    Raw computes stage: Searching for sequence patterns,

    aligning proteins and cDNAs to the genome.

    Approximate time: 3 weeks

    The annotation process of the high-coverage human assembly began with the
    raw compute stage [Figure 1] whereby the genomic sequence was screened
    for sequence patterns including repeats using RepeatMasker [1]
    (version 3.2.5, with parameters '-nolow -species "homo sapiens" -s'),
    Dust [2] and TRF [3]. RepeatMasker and Dust combined masked 46.6% of the
    species genome.

    Figure 1: Summary of human gene annotation project

    Transcription start sites were predicted using Eponine-scan [4] and
    FirstEF [5]. CpG islands and tRNAs [6] were also predicted. Genscan [7]
    was run across RepeatMasked sequence and the results were used as input
    for UniProt [8], UniGene [9] and Vertebrate RNA [10] alignments by
    WU-BLAST [...] Exonerate (cdna2genome model) [12] to generate coding
    models [Figure 2].

    Additionally, pre-aligned annotated cDNAs were re-aligned to unmasked
    genomic regions. This approach helped in discovering small exons which may
    have been ignored by Exonerate because of their size [Figure 2]. Because all
    cDNAs used in this step had known pairing with proteins (e.g. RefSeq cDNAs
    with accession prefix "NM_" matching RefSeq proteins with "NP_" prefix), it
    allowed the comparison of coding models generated by Exonerate for a given
    cDNA to those generated by Genewise using its counterpart protein. The
    Apollo software [15] was used to visualise the results of filtering.
    Where one protein sequence had generated more than one candidate coding
    model at a locus, the BestTargetted module was used to select the coding
    model that most closely matched the source protein to take through to the
    next stage of the gene annotation process. The generation of transcript
    models using species-specific (in this case, human) data is referred to as the
    "Targetted stage". This stage resulted in 120658 coding models built from
    41383 human proteins and 6722 cDNAs which were taken through to the
    UTR addition stage.

    Similarity stage: Generating additional coding models using
    proteins from related species

    Approximate time: 2 weeks

    Following the human Targetted alignments, additional coding models were
    generated as follows. The UniProt alignments from the Raw Computes step
    were filtered to retain only those sequences belonging to UniProt's
    "Mammalia" and "Vertebrata" taxonomical classes as well as UniProt's Protein
    Existence (PE) classification level 1 and 2. In genomic regions which were
    not covered by any coding models from Targetted alignments, WU-BLAST
    was rerun for the UniProt protein sequences and the results were passed to
    Genewise [13] to build coding models. In most cases, multiple coding
    models built from different UniProt proteins were generated in a single locus,
    each model with a slightly different exon-intron structure. To filter for the best
    supported structures, the TranscriptConsensus module was used to compare
    each Genewise model against human cDNA and EST alignments in the
    region (see next section on how these alignments were generated), where
    exons in the Genewise model were scored for overlapping with exons of
    cDNA/EST alignments, and model(s) with the highest combined score in a
    region were kept. The generation of transcript models using data from related
    species is referred to as the "Similarity stage" [Figure 3]. This stage resulted in
    4452 and 1995 coding models supported by mammalian UniProt proteins and
    non-mammalian vertebrate UniProt proteins respectively.
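
    The exon-overlap scoring described above can be illustrated with a minimal
    Python sketch. This is not the TranscriptConsensus module itself: the scoring
    rule (bases of each model exon covered by its best-matching cDNA/EST exon),
    the function names and the toy coordinates are all assumptions made for
    illustration.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; one strand assumed


def overlap_bases(a: Exon, b: Exon) -> int:
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def score_model(model_exons: List[Exon], evidence_exons: List[Exon]) -> int:
    """Hypothetical score: sum over model exons of the best overlap with any evidence exon."""
    return sum(
        max((overlap_bases(exon, ev) for ev in evidence_exons), default=0)
        for exon in model_exons
    )


def best_models(models: Dict[str, List[Exon]], evidence_exons: List[Exon]) -> List[str]:
    """Return the name(s) of the highest-scoring model(s) in a region."""
    scores = {name: score_model(exons, evidence_exons) for name, exons in models.items()}
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


# Toy region: two competing Genewise-style models and exons from cDNA/EST alignments.
models = {
    "model_a": [(100, 200), (300, 400)],
    "model_b": [(100, 180), (320, 400)],
}
evidence = [(100, 200), (300, 400), (500, 600)]
print(best_models(models, evidence))
```

    In this toy region, model_a matches the evidence exons exactly and is the
    one kept; ties would keep several models, in line with "model(s)" above.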

    Figure 3: Alignment and filtering of mammalian and vertebrate proteins

    cDNA and EST alignments

    Approximate time: 2-3 weeks

    Human cDNA and EST sequences were previously downloaded from
    ENA/Genbank/DDBJ and RefSeq [9], clipped to remove polyA tails, and
    aligned to the genome using Exonerate (est2genome model) [Figure 4].
    221864 (of 276510) human cDNAs and 7297521 (of 8174393) human
    ESTs aligned to the genome. The coverage and percentage identity
    cut-offs for cDNA alignments were set at 98%, higher than those
    for ESTs (90% coverage, 97% percentage identity), because cDNAs are
    generally less fragmented than ESTs. EST alignments were used to generate
    EST-based gene models similar to those for mouse [14] and these are
    displayed on the website in a separate track from the Ensembl gene set.
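
    As a rough illustration of the cut-offs quoted above (98% coverage and
    identity for cDNAs; 90% coverage and 97% identity for ESTs), the filter can
    be sketched in Python. The Alignment record, its field names, the accessions
    and the use of an inclusive ">=" comparison are assumptions for this sketch,
    not the pipeline's actual data model.

```python
from dataclasses import dataclass

# Cut-offs (percent) as quoted in the text; whether they are inclusive is assumed.
THRESHOLDS = {
    "cdna": {"coverage": 98.0, "identity": 98.0},
    "est": {"coverage": 90.0, "identity": 97.0},
}


@dataclass
class Alignment:
    seq_id: str      # hypothetical accession
    seq_type: str    # "cdna" or "est"
    coverage: float  # percent of the query sequence that aligned
    identity: float  # percent identity over the aligned bases


def passes_cutoffs(aln: Alignment) -> bool:
    """Keep an alignment only if it meets both cut-offs for its sequence type."""
    cutoff = THRESHOLDS[aln.seq_type]
    return aln.coverage >= cutoff["coverage"] and aln.identity >= cutoff["identity"]


alignments = [
    Alignment("cdna_1", "cdna", 99.2, 99.5),  # kept
    Alignment("cdna_2", "cdna", 95.0, 99.9),  # coverage too low for a cDNA
    Alignment("est_1", "est", 95.0, 97.5),    # kept: ESTs use the looser coverage cut-off
    Alignment("est_2", "est", 92.0, 96.0),    # identity too low
]
kept = [aln.seq_id for aln in alignments if passes_cutoffs(aln)]
print(kept)
```

    Note that cdna_2 and est_1 have the same coverage but different outcomes,
    which is the practical effect of using stricter thresholds for cDNAs.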

    Figure 4: Alignment of human cDNAs and ESTs to the human genome

    Filtering coding models

    Approximate time: 2 weeks

    The set of coding models was finalised after another stage of filtering, which
    involved manual removal of some more Targetted models supported by
    dubious human protein/cDNA evidence on a case-by-case basis, and removal
    of ~60% of Similarity alignments which contained non-canonical (non GT/AG)
    splice sites using a Perl script. The Apollo software [15] was used to
    visualise the results of filtering.
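
    The canonical splice-site check can be sketched as follows. The original
    filter was a Perl script that is not reproduced in this document; this Python
    version is only an illustration that assumes forward-strand models, 1-based
    inclusive coordinates and a toy genome string.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand assumed


def intron_is_canonical(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron begins with GT (donor) and ends with AG (acceptor)."""
    donor = genome[intron_start - 1:intron_start + 1].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return donor == "GT" and acceptor == "AG"


def has_only_canonical_introns(genome: str, exons: List[Exon]) -> bool:
    """Check every intron lying between consecutive exons of a model."""
    ordered = sorted(exons)
    introns = [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]
    return all(intron_is_canonical(genome, start, end) for start, end in introns)


# Toy sequences: exon (1-6), intron (7-18), exon (19-24).
canonical_genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGCTAA"
noncanonical_genome = "ATGGCC" + "GCAAGTTTTCAG" + "GGCTAA"  # GC donor instead of GT
model = [(1, 6), (19, 24)]

print(has_only_canonical_introns(canonical_genome, model))
print(has_only_canonical_introns(noncanonical_genome, model))
```

    A reverse-strand model would need its intron sequence reverse-complemented
    before the same comparison; that step is left out of this sketch.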

    Addition of UTR to coding models

    Approximate time: 2 weeks

    After finalising the set of coding models, those generated by Genewise
    alignments were extended into the untranslated regions (UTRs) using human
    cDNAs. Coding models generated by Exonerate's cdna2genome model (this
    includes the exonerate2genes_region approach, where pre-aligned cDNA
    sequences are aligned to unmasked genomic regions) already contained UTR
    annotations and hence did not go through this UTR addition step. Where
    available, human DiTag alignments were used to guide the positioning of
    UTRs and add additional weight to some UTR structures, while RefSeq "NM"
    cDNA vs "NP" protein pairing information was used to ensure the correct
    matching of cDNAs to coding models supported by RefSeq proteins. This
    resulted in 41017 (of 48223) coding models from 37007 human proteins with
    UTR, and 405 (of 2994) coding models from 370 UniProt proteins with UTR.

    Generating multi-transcript Ensembl genes

    Approximate time: 4-5 weeks

    The above steps generated a large set of potential transcript models, with or
    without UTR, many of which overlapped one another. Redundant transcript
    models were collapsed and the remaining unique set of transcript models
    were clustered into multi-transcript genes where each transcript in a gene has
    at least one coding exon that overlaps a coding exon from another transcript
    within the same gene. The resulting Ensembl gene set contained 23086
    genes, of which 22369 contained transcripts supported by human
    cDNAs/proteins only (from the "Targetted" stage of the build), and 717
    contained transcripts supported by UniProt proteins only from the "Similarity"
    stage of the build [Figure 5]. The Ensembl genes were associated with a total
    of 52559 Ensembl transcripts, of which 51835 were supported by human
    cDNAs/proteins, and 724 had support from UniProt proteins [Figure 6].
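
    The clustering rule above (transcripts belong to the same gene when linked
    by overlapping coding exons) amounts to single-linkage clustering. A minimal
    Python sketch using a union-find structure is given below; it assumes all
    transcripts lie on the same strand of one sequence, and the transcript names
    and coordinates are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # coding exon (start, end), 1-based inclusive


def exons_overlap(a: Exon, b: Exon) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def share_coding_exon(t1: List[Exon], t2: List[Exon]) -> bool:
    """True if any coding exon of t1 overlaps any coding exon of t2."""
    return any(exons_overlap(e1, e2) for e1 in t1 for e2 in t2)


def cluster_transcripts(transcripts: Dict[str, List[Exon]]) -> List[List[str]]:
    """Group transcript names into genes by transitive coding-exon overlap."""
    parent = {name: name for name in transcripts}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        if share_coding_exon(transcripts[a], transcripts[b]):
            parent[find(a)] = find(b)

    genes: Dict[str, List[str]] = {}
    for name in transcripts:
        genes.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in genes.values())


transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(350, 450), (500, 600)],
    "t3": [(550, 650)],
    "t4": [(1000, 1100)],
}
print(cluster_transcripts(transcripts))
```

    Here t1 and t3 share no exon directly but end up in the same gene through
    t2, while t4 forms a gene of its own.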

    Figure 5: Supporting evidence for human Ensembl gene set
    (chart "Evidence for Ensembl genes"; categories: human cDNAs/proteins,
    UniProt mamm./vert. proteins)

    Figure 6: Supporting evidence for human Ensembl transcript set
    (chart "Evidence for Ensembl transcripts"; categories: human cDNAs/proteins,
    UniProt mamm./vert. proteins)

    Pseudogenes, immunoglobulin genes, mitochondrial genes

    Approximate time: 3 weeks

    The Ensembl gene set was screened for pseudogenes and retrotransposed
    genes. Next, human immunoglobulin (Ig) genes were annotated using the
    Ensembl "Ig genebuild" pipeline [16]. Briefly, human proteins and cDNAs for
    Ig genes were downloaded from IMGT [17] and aligned to the human
    genome using Exonerate. The Exonerate alignments were processed to join
    the V/D/J/C segments together into Ig gene models, which were then
    compared to the Ig genes already present in the Ensembl gene set
    (generated at the Targetted stage). If the models generated by the "Ig
    genebuild" pipeline overlapped with existing Ensembl genes at the exon level,
    the existing Ensembl genes were replaced by the new Ig gene models, as
    the latter are usually more accurate representations of Ig genes. Also
    imported into the Ensembl gene set were annotations of mitochondrial genes in
    INSDC [18] and short non-coding RNAs (e.g. miRNAs, snoRNAs) generated
    by the ncRNA pipeline [19].

    Merging Ensembl and Vega gene sets, annotating long
    intergenic non-coding RNA genes and generating the
    final gene set

    Approximate time: 1 week

    Following the completion of the Ensembl gene set, Ensembl annotations and
    manual annotations (primarily generated by the HAVANA team at the
    Wellcome Trust Sanger Institute) from the Vega database [20, 21] were
    merged at the transcript level to create the final gene set. The Vega database
    (as of 12 September 2010) contained 39565 genes and 133451 transcripts.
    In the merge process, Ensembl and Vega transcripts were merged if they had
    identical exon-intron structures. If transcripts from the two annotation sources
    matched at all internal exon-intron boundaries, i.e. had identical splicing
    pattern, but one of them had longer terminal exons, usually the UTRs, they
    were merged too, but the resulting merged transcript would adopt the
    exon-intron structure of the Vega transcript as we prioritised Vega annotation
    over Ensembl. Transcripts which had not been merged, either because of
    differences in internal exon-intron boundaries or presence of transcripts in
    only one annotation source, were transferred from the source to the final gene
    set intact.
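
    The structural merge rules above can be summarised in a short Python
    sketch. It is not the Ensembl-Vega merge code: the function names, the
    decision labels and the handling of single-exon transcripts (merged only when
    identical) are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same strand assumed


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Intron coordinates implied by consecutive exons (the splicing pattern)."""
    ordered = sorted(exons)
    return [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]


def merge_transcripts(
    ensembl_exons: List[Exon], vega_exons: List[Exon]
) -> Tuple[str, Optional[List[Exon]]]:
    """Return a decision label and the exon structure of the merged transcript, if any."""
    ens, vega = sorted(ensembl_exons), sorted(vega_exons)
    if ens == vega:
        return "merged_identical", vega
    if len(ens) > 1 and introns(ens) == introns(vega):
        # Same internal boundaries, different terminal exon ends: Vega structure is adopted.
        return "merged_vega_structure", vega
    # Otherwise both transcripts are carried into the final set unchanged.
    return "not_merged", None


ensembl = [(100, 200), (300, 400), (500, 650)]
vega_longer_utr = [(80, 200), (300, 400), (500, 700)]
vega_other_splicing = [(100, 210), (300, 400), (500, 650)]

print(merge_transcripts(ensembl, ensembl))
print(merge_transcripts(ensembl, vega_longer_utr))
print(merge_transcripts(ensembl, vega_other_splicing))
```

    Comparing intron lists rather than exon lists is what lets transcripts with
    different terminal exon ends (typically UTR length) still count as a match.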

    The Ensembl-Vega merge code also took into account the biotype and
    supporting evidence associated with the transcripts from both annotation
    sources. For a pair of transcripts to be merged, if there was a mismatch in
    biotype, e.g. the Ensembl transcript is protein-coding but the Vega counterpart
    is non-coding, the Vega biotype would have precedence over the Ensembl
    model and the Ensembl transcript would undergo a biotype change to match
    its Vega counterpart. The translation for the Ensembl transcript would then be
    removed if the transcript has lost its protein-coding biotype. Biotype conflicts
    between Ensembl and Vega were always reported to the HAVANA team for
    investigation, and when resolved, could improve the merged gene set in the
    future. As for supporting evidence, the merge of Ensembl and Vega
    transcripts also involved merging of protein/cDNA supporting evidence
    associated with the transcripts to ensure the basis on which the annotations
    were made would not be lost.
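
    The biotype and evidence handling described above can be sketched as
    follows. The dictionary layout, the biotype strings and the evidence
    identifiers are hypothetical; only the rules (Vega biotype wins, translation
    dropped when the protein-coding biotype is lost, evidence pooled, conflicts
    flagged for reporting) come from the text.

```python
from typing import Any, Dict


def reconcile(ensembl: Dict[str, Any], vega: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the biotype-precedence and evidence-merging rules to one transcript pair."""
    merged = {
        "biotype": vega["biotype"],  # Vega biotype takes precedence
        "evidence": set(ensembl["evidence"]) | set(vega["evidence"]),  # nothing is lost
        "translation": ensembl["translation"],
        "biotype_conflict": ensembl["biotype"] != vega["biotype"],  # to be reported
    }
    if merged["biotype"] != "protein_coding":
        # The transcript has lost its protein-coding biotype, so its translation goes too.
        merged["translation"] = None
    return merged


ensembl_transcript = {
    "biotype": "protein_coding",
    "evidence": {"protein_1", "cdna_1"},   # hypothetical evidence identifiers
    "translation": "MPEPTIDE",             # hypothetical peptide sequence
}
vega_transcript = {
    "biotype": "non_coding",
    "evidence": {"cdna_1", "cdna_2"},
    "translation": None,
}

result = reconcile(ensembl_transcript, vega_transcript)
print(result["biotype"], result["translation"], sorted(result["evidence"]))
```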

    Following the merge, long intergenic non-coding RNA genes (lincRNAs) were
    annotated by the Ensembl lincRNA pipeline [19] and incorporated in the final
    gene set.

    An important feature of the merged gene set is the presence of all Vega
    source transcripts. This has been made possible by allowing Vega annotation
    to take precedence over Ensembl's when merging transcripts which do not
    match at their terminal exons or have different biotypes. Of all Vega
    transcripts, 18.3% were merged with Ensembl transcripts. The vast
    majority of merged transcripts (89.6%) are of protein-coding biotype. Vega
    transcripts which were not merged (81.7% of Vega source transcripts) were
    mostly alternative splice variants, pseudogenes or non-coding. These
    transcripts were fully transferred into the final gene set. The final
    Ensembl-Vega set consisted of 44314 genes and 160002 transcripts. Of the
    160002 transcripts, 15.3% (24492) were the result of merging Ensembl and
    Vega annotations, 16.1% (25718) originated from Ensembl, 68.5% (109650)
    originated from Vega, and the remaining ~0.4% were incorporated from other
    sources (e.g. immunoglobulin gene segments/transcripts imported from IMGT
    data).

    As a quality-control measure, Ensembl translations of protein-coding
    transcripts in the final merged gene set were aligned against the NCBI
    RefSeq and UniProt/SwissProt sets of public curated protein sequences
    (which were used in the "Targetted" stage of the gene build) to calculate the
    proportion of curated sequences covered by the merged gene set. Over 99%
    of RefSeq and SwissProt proteins were represented in the merged gene set,
    and in the majority of cases, there was a 100% match between the curated
    protein and Ensembl translation.

    Since Ensembl release 56 (September 2009), the Ensembl-Vega gene set
    has exactly corresponded to a GENCODE release [23]. The gene set in
    release 62, which this document describes, corresponds to GENCODE
    release 7. Each GENCODE release also contains the full annotation of the
    consensus coding sequence (CCDS) transcript models [24]. All CCDS
    models are included in each release of the human gene set.

    Protein annotation, cross-referencing, stable Identifiers

    Approximate time: 4 weeks

    Before public release the transcripts and translations were given external
    references (cross-references to external databases), while translations were
    searched for domains/signatures of interest and labelled where appropriate.
    Stable identifiers were assigned to each gene, transcript, exon and
    translation. When annotating a species for the first time, these identifiers are
    auto-generated. In all subsequent annotations for a species, the stable
    identifiers are propagated based on comparison of the new gene set to the
    previous gene set.


    Additional annotation and post genebuild filtering in Ensembl release 63

    Addition of annotation on haplotype regions

    Approximate time: 1-2 weeks

    The annotation of the haplotype regions on chromosomes 6, 14 and 17 was
    added after the main reference genome had been annotated. Figure 7 shows
    the annotation pipeline, which closely follows the procedure described earlier.
    The annotation resulted in a final gene set of 2831 genes, of which 240 were
    pseudogenes or retrotransposed genes.

    Post genebuild filtering

    Approximate time: 3-4 weeks

    To filter out poorly supported models that may have erroneously
    been included in the full annotation, the human gene set undergoes an
    additional filtering process after each annotation. This is to take advantage of
    the comparative genomics information that becomes available only after the
    first annotation has been released.

    Figure 7: Workflow for the annotation of haplotype regions in
    chromosomes 6, 14 and 17

    All models annotated by Ensembl were filtered systematically by a series of
    Perl scripts to remove models with erroneous structures. Examples of such
    scenarios would be where a model differed considerably in its internal
    structure compared to other models in the same locus, or if exons were
    missing or had non-consistent splice sites. In addition, models supported by
    cDNA fragments with wrongly annotated short open-reading frames were
    removed manually on a case-by-case basis. Further filtering of the models
    was done using the following criteria at gene level:


    The quality of a gene set is dependent on the quality of the genome assembly.
    Genome assembly can be assessed in a number of ways, including:

    1. Coverage estimate
       o A higher coverage usually indicates a more complete assembly.
       o Using Sanger sequencing only, a coverage of at least 2x is
         preferred.
    2. N50 of contigs and scaffolds
       o A longer N50 usually indicates a more complete genome
         assembly.
       o Bearing in mind that an average human gene may be 10-15 kb
         in length, contigs shorter than this length will be unlikely to hold
         full-length gene models.
    3. Number of contigs and scaffolds
       o A lower number of toplevel sequences usually indicates a more
         complete genome assembly.
    4. Alignment of cDNAs and ESTs to the genome
       o A higher number of alignments, using stringent thresholds,
         usually indicates a more complete genome assembly.
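
    For reference, the N50 statistic mentioned in the list above can be computed
    as in the following Python sketch. It uses the common definition (the length
    of the shortest sequence among the longest sequences that together contain
    at least half of all bases); the contig lengths are toy values, not figures
    from any assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the sequence at which the running total, longest first, reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths; kept for completeness


# Toy contig lengths in kb: total 300, so half is reached after 80 + 70.
contig_lengths_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths_kb))
```

    With a typical human gene spanning 10-15 kb, an N50 well above that range
    suggests that most bases sit in contigs long enough to hold full-length
    gene models.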

    More information on the Ensembl automatic gene annotation process can be
    found at:

    • Curwen V, Eyras E, Andrews TD, Clarke [...]
    • http://www.ensembl.org/info/docs/genebuild/genome_annotation.html
    • http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/the_genebuild_process.txt?root=ensembl&view=co


    References

    1. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. 1996-2010.
       www.repeatmasker.org

    2. Kuzio J, Tatusov R, and [...]

    13. Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res.
        2004, 14(5):988-995. [PMID: 15123596]

    14. Eyras E, Caccamo M, Curwen V, Clamp M: ESTGenes: alternative splicing
        from ESTs in Ensembl. Genome Res. 2004, 14(5):976-987. [PMID: 15123595]

    15. [...] The vertebrate genome annotation (Vega) database. Nucleic Acid Res.
        2008 Jan; Advance Access published on November 14, 2007;
        doi:10.1093/nar/gkm987

    22. http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

    23. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K., Chrast,J., [...]

