Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 ·...

Gene annotation

1

Genome annotation VS. Genome sequencing

• Genome sequencing alone does not provide information about the functions encoded by the nucleotide sequences.

• Genome annotation: the sequence is “decorated” with evidence indicating regional characteristics of the genome (features), providing the basis for further analysis to understand the nature of organisms

>chr1
ACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAAACCGTAAAGTCCAAAACGCTAACCCCTTAACCCTAAACCCTAAACCCTGAACCCTAAATCCCT ….

2

A “standard” annotation pipeline

• Detection and masking of repeated sequences

• Ab initio gene prediction

• Evidence alignment: – ESTs

– Full length cDNAs

– RNA-Seq sequences

– Proteins

• Gene annotation by generating a consensus of ab initio gene predictions and evidence alignments

3









4

Types of repeats

1. Interspersed repeats

2. Processed pseudogenes

3. Simple sequence repeats

4. Segmental duplications

5. Blocks of tandem repeats

5

Interspersed repeats (transposon-derived repeats)

Constitute ~45% of the human genome. They involve RNA intermediates (retroelements) or DNA intermediates (DNA transposons).

• Long-terminal repeat transposons (RNA-mediated)

• Long interspersed elements (LINEs); these encode a reverse transcriptase

• Short interspersed elements (SINEs) (RNA-mediated); these include Alu repeats

• DNA transposons (3% of human genome)

6

Interspersed repeats (transposon-derived repeats)

Heredity (2010) 104, 520–533; doi:10.1038/hdy.2009.165

Processed pseudogenes

• These genes have a stop codon or frameshift mutation and do not encode a functional protein. They commonly arise from retrotransposition, or following gene duplication and subsequent gene loss.

8

Simple sequence repeats

• Microsatellites: from one to a dozen base pairs (examples: (A)n, (CA)n, (CGG)n)

• Minisatellites: a dozen to 500 base pairs

9

Segmental duplications

• These are blocks of about 1 kilobase to 300 kb that are copied intra- or interchromosomally (about 5% of the human genome).

10

Blocks of Tandem Repeats

• These include telomeric repeats (e.g. TTAGGG in humans) and centromeric repeats (e.g. a 171 base pair repeat of α satellite DNA in humans).

11

Biological meaning of repeats

• Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in:

– Genome rearrangements (Kazazian, 2004)

– Drift to new biological function (Kidwell and Lisch, 2001)

– Increased rate of evolution under stress (Capy et al, 2000)

12

Detection and masking of repeated sequences

• Repeats need to be masked prior to performing most single-species or multi-species analyses.

• Masking is necessary to avoid artifacts during gene annotation:

– Repeated sequences may generate spurious alignments of ESTs, cDNAs, etc.

– ORFs of transposon genes are detected by ab initio gene predictors and annotated as protein-coding genes

• Software:

– RepeatMasker (must be trained for each genome)

• Repeats are substituted by Ns
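The substitution of repeats by Ns can be illustrated with a minimal Python sketch. The function name `hard_mask`, the 1-based inclusive interval convention, and the example coordinates are assumptions made for illustration; this is not RepeatMasker code.

```python
def hard_mask(sequence, repeats):
    """Replace every repeat interval with Ns.

    repeats: iterable of (start, end) pairs, 1-based and inclusive
    (an assumed convention, similar to GFF coordinates).
    """
    masked = list(sequence)
    for start, end in repeats:
        # Convert to 0-based indices and stay within the sequence
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Hypothetical repeat coordinates on a short stretch of sequence
seq = "ACCCTAAACCCTAAACCGTAAAGT"
print(hard_mask(seq, [(1, 7), (13, 15)]))
```

The masked string keeps the original length, so coordinates of genes predicted on the masked sequence remain valid on the unmasked genome.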

13

RepeatMasker

• Software

– Smit AFA, Hubley R, and Green P. “RepeatMasker-Open 3.0.” 1996-2004. (http://www.repeatmasker.org)

– CrossMatch / WU-BLAST (alignments)

• Repeats Database

– RepBase library (http://www.girinst.org)

14


Repeats library

• Uses a library of known eukaryotic repeat sequences

• Supplied by the RepBase project

• Repeats in RepBase are manually curated

• Requires registration (free for academics)

15

Type of repeats in RepBase library

• Interspersed (Alu, LINE, MIR, …)

• Simple (agagagag, atcatcatc, …)

• Micro- and mini-satellites

• Noncoding RNAs (tRNA, rRNA, snRNA, …)

• Common contaminants (E. coli, vectors)

16

RepBase reports

EMBL format: an example report

Consensus sequence

• A repeat family is represented by a consensus sequence. Example: more than 1 million Alu repeats in human genome.

accgataggtatacgtatca-tttacgatac

atcgct-ggtttacgcgtcaattcaggatgc

accggt-tgtttacgtagcaatctaggatac

accgat-ggtttacgtatcaatttaggatac

Consensus sequences can be efficiently aligned to a reference genome
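How a consensus can be derived from aligned copies of a repeat is sketched below in Python, as a simple per-column majority vote. The function name `consensus`, the tie-breaking rule and the choice to keep gap characters are assumptions for illustration; curated RepBase consensus sequences are built with more elaborate procedures.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # For each alignment column keep the most frequent symbol;
    # on a tie, the symbol seen first in the column wins.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# The first three aligned copies shown on the slide
copies = [
    "accgataggtatacgtatca-tttacgatac",
    "atcgct-ggtttacgcgtcaattcaggatgc",
    "accggt-tgtttacgtagcaatctaggatac",
]
print(consensus(copies))
```

With this tie-breaking rule, the result coincides with the fourth sequence of the slide, which suggests that the last line of the alignment is the consensus of the three copies above it.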

Alu Alu Alu Alu Alu

18

RepeatMasker

• Sequences in the FASTA input file are split into short overlapping fragments (overlap = 2000 bp)

• Repeat consensus sequences are aligned to the split input sequences

– Different BLAST-like software can be used to align repeats to the reference genome:

• Cross-Match

• WU-BLAST (faster)

– Output is converted to a standard format (cross-match format)

• Removes duplicates and assembles fragmented hits

• Removes insignificant hits (based on a Smith-Waterman threshold)
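The first step above, splitting the input into overlapping fragments, can be sketched in Python as follows. Only the 2000 bp overlap comes from the slide; the function name `split_overlapping` and the fragment length used in the example are assumptions.

```python
def split_overlapping(seq, frag_len, overlap):
    """Yield (offset, fragment) pairs; consecutive fragments share `overlap` bases."""
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must be non-negative and smaller than frag_len")
    step = frag_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + frag_len]
        if start + frag_len >= len(seq):
            break  # the last fragment reached the end of the sequence
        start += step


# Hypothetical fragment length of 60,000 bp with the 2000 bp overlap of the slide
genome = "ACGT" * 50000
fragments = list(split_overlapping(genome, frag_len=60000, overlap=2000))
print(len(fragments), [offset for offset, _ in fragments])
```

Keeping the offset of each fragment makes it possible to translate hit coordinates back to the original sequence, and the overlap prevents repeats lying across a fragment boundary from being missed.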

19

De novo repeats annotation

• Repeats library is available only for a number of species

• Highly diverged repeats can be tough to find

• In case a novel species is being annotated for which no repeats library is available repeats must be annotated de novo.

20

De novo identification of repeats

• An all vs. all pairwise comparison is performed

• Unknown repeats will yield alignments

• Overlapping aligned fragments from the same element are grouped

• Elements can thus be defined

• Defined elements are then clustered into one family because they are all similar to each other

Bao Z, Eddy SR. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. 2002; 1269–1276. doi:10.1101/gr.88502.

21

De novo repeats annotation

RepeatScout http://repeatscout.bioprojects.org/ Price et al. (2005)
RAP http://genomics.cribi.unipd.it/index.php/Rap_Repeat_Filter Campagna et al. (2005)
REPuter http://www.genomes.de/ Kurtz and Schleiermacher (1999)
Repeat-match http://mummer.sourceforge.net/ Delcher et al. (1999)
RepSeek http://wwwabi.snv.jussieu.fr/public/RepSeek/ Achaz et al. (2007)
Tallymer http://www.zbh.uni-hamburg.de/Tallymer/ Kurtz et al. (2008)
Vmatch http://www.vmatch.de/ Kurtz (unpublished)
mer-engine http://roma.cshl.org/mer-home.php Healy et al. (2003)
FORRepeats http://al.jalix.org/FORRepeats/ Lefebvre et al. (2003)
P-Clouds http://www.evolutionarygenomics.com/PClouds.html Gu et al. (2008)
Spectral repeat finder http://www.imtech.res.in/raghava/srf/ Sharma et al. (2004)
RepeatFinder http://cbcb.umd.edu/software/RepeatFinder/ Volfovsky et al. (2001)
REPEATGLUER http://nbcr.sdsc.edu/euler/intro_tmp.htm Pevzner et al. (2004)
DAWG-PAWS http://dawgpaws.sourceforge.net/ Estill and Bennetzen (2009)
RepeatModeler http://www.repeatmasker.org/RepeatModeler.html Smit (unpublished)
RepeatRunner http://www.yandell-lab.org/software/repeatrunner.html Smith et al. (2007)
REannotate http://www.bioinformatics.org/reannotate/index.html Pereira (2008)

Example: mask repeats with RepeatMasker

• Go to the RepeatMasker web server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker/) and paste your sequence in the Sequence box.

• In DNA source, select Other from the drop-down menu and specify arabidopsis (for example).

• Then press the Submit sequence button.







RepeatMasker Output Summary
==================================================
file name: RM2_arabidopsis.txt_1433174330
sequences:             1
total length:      20251 bp  (20251 bp excl N/X-runs)
GC level:          42.31 %
bases masked:      17639 bp ( 87.10 %)
==================================================
                       number of      length   percentage
                       elements*    occupied  of sequence
--------------------------------------------------
Retroelements                  2    10010 bp    49.43 %
   SINEs:                      0        0 bp     0.00 %
   Penelope                    0        0 bp     0.00 %
   LINEs:                      0        0 bp     0.00 %
    CRE/SLACS                  0        0 bp     0.00 %
     L2/CR1/Rex                0        0 bp     0.00 %
     R1/LOA/Jockey             0        0 bp     0.00 %
     R2/R4/NeSL                0        0 bp     0.00 %
     RTE/Bov-B                 0        0 bp     0.00 %
     L1/CIN4                   0        0 bp     0.00 %
   LTR elements:               2    10010 bp    49.43 %
     BEL/Pao                   0        0 bp     0.00 %
     Ty1/Copia                 0        0 bp     0.00 %
     Gypsy/DIRS1               2    10010 bp    49.43 %
       Retroviral              0        0 bp     0.00 %
DNA transposons                6     6788 bp    33.52 %
   hobo-Activator              0        0 bp     0.00 %
   Tc1-IS630-Pogo              0        0 bp     0.00 %
   En-Spm                      0        0 bp     0.00 %
   MuDR-IS905                  0        0 bp     0.00 %
   PiggyBac                    0        0 bp     0.00 %
   Tourist/Harbinger           0        0 bp     0.00 %
   Other (Mirage,              0        0 bp     0.00 %
    P-element, Transib)
Rolling-circles                0        0 bp     0.00 %
Unclassified:                  4      841 bp     4.15 %
Total interspersed repeats:         17639 bp    87.10 %
Small RNA:                     0        0 bp     0.00 %
Satellites:                    0        0 bp     0.00 %
Simple repeats:                0        0 bp     0.00 %
Low complexity:                0        0 bp     0.00 %
==================================================
* most repeats fragmented by insertions or deletions have been counted as one element

Provides a summary of repeats identified in the input sequence
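The figures of the summary can be checked with a few lines of Python. The numbers are copied from the report above; the variable names and the helper `percent` are assumptions introduced for this check.

```python
# Values copied from the RepeatMasker summary above
total_bp = 20251
masked_by_class = {
    "Retroelements": 10010,
    "DNA transposons": 6788,
    "Unclassified": 841,
}


def percent(bp, total):
    """Percentage of the sequence occupied, rounded as in the report."""
    return round(100 * bp / total, 2)


masked_bp = sum(masked_by_class.values())
for name, bp in masked_by_class.items():
    print(f"{name:16s}{bp:6d} bp {percent(bp, total_bp):6.2f} %")
print(f"{'Total':16s}{masked_bp:6d} bp {percent(masked_bp, total_bp):6.2f} %")
```

The three classes add up to the 17639 bp reported as total interspersed repeats, which in this example is also the total number of masked bases.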

24

RepeatMasker output

25

RepeatMasker output

HSPs can be expanded

26

RepeatMasker output

Genomic sequence Repeat sequence

27

RepeatMasker output

The sequence is ready for gene prediction and evidence alignment

Repeats are “hard masked” by substituting the nucleotide sequence with Ns

28









29

Types of genes

• Protein-coding genes

• Pseudogenes

• Functional RNA genes:

– tRNA

– rRNA

– snoRNA

– snRNA

– miRNA

Software for the identification of non-coding RNAs: tRNAscan-SE, snoscan, and Infernal (Inference of RNA alignments), which is based on probabilistic models of the sequence and secondary structure of an RNA sequence

30

Gene annotation

• Gene structure is annotated by defining exons, CDSs, UTRs and intron boundaries.

• Both intrinsic properties of the sequence and extrinsic data are used to define the gene structure.

genome enhancer promoter

UTRs

CDS

Introns

31

Ab initio gene prediction

Uses intrinsic properties of the genome sequence.

Syntax rules used to predict genes:

• Generally, the coding segment (CDS) of the gene begins with a start codon (ATG) and ends with a stop codon (TGA, TAA, TAG). The stop codon must be in frame with the start codon.

• Introns begin with a donor splice site (GT) and end with an acceptor splice site (AG).

• 5’- and 3’-UTRs (untranslated regions) are usually present but are generally not predicted.
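The syntax rules above can be expressed as a small Python sketch. It only checks the forward strand and single-exon open reading frames; the function names `find_orfs` and `has_canonical_splice_sites` and the `min_codons` parameter are assumptions for illustration, not part of any gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) of forward-strand ORFs; 0-based, end exclusive.

    An ORF starts at ATG and ends at the first in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def has_canonical_splice_sites(intron):
    """True if the intron begins with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return intron.startswith("GT") and intron.endswith("AG")


print(find_orfs("CCATGAAATAGCC"))
print(has_canonical_splice_sites("GTAAGTTTCAG"))
```

Many sequences satisfy these rules by chance, which is why syntactic validity alone is not sufficient to call a gene.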

Figure modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.

32

Statistical measures and training

• Many syntactically valid putative genes do not correspond to real genes.

• Statistical measures are used to compare the putative genes with the statistical profile of a real gene in the organism of interest.

• Statistical measures are learned from a training set of known genes different from genome to genome

• Training can be performed also with ESTs, proteins and RNA-Seq alignments

33

Signal sensors and content sensors

• Putative boundaries (start/stop codons, splice sites) are predicted by signal sensors

• CDSs are scored by a content sensor

• Together, the two sensors make it possible to define exon boundaries

Exon Exon

Likelihood ratio scores. Modified from: Majoros, W. H. (2007). Methods for Computational Gene Prediction.

Position weight matrix (PWM) of a donor site. Modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.
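A signal sensor based on a position weight matrix can be sketched in Python as below. The training windows are invented toy data, and the function names, the pseudocount and the uniform background frequency are assumptions; real gene finders train their matrices on known genes of the organism of interest.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length example sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            # Likelihood ratio of the site model versus the background, in bits
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def score(pwm, window):
    """Sum of the per-position log-odds scores of a candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


# Hypothetical donor-site windows: 3 exonic bases followed by 6 intronic bases
donor_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT", "TAGGTGAGC"]
pwm = build_pwm(donor_sites)
print(round(score(pwm, "CAGGTAAGT"), 2), round(score(pwm, "CAGCCAAGT"), 2))
```

A window containing the GT dinucleotide at the expected position scores higher than one without it; candidate boundaries are those whose score exceeds a chosen threshold.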

34

Eukaryotic ab initio gene finders

• GlimmerM http://www.cbcb.umd.edu/software/GlimmerM/

• GeneID http://genome.crg.es/software/geneid/

• GeneZilla http://www.genezilla.org/

• GeneMark-ES http://exon.gatech.edu/

• Augustus http://bioinf.uni-greifswald.de/augustus/

35

http://www.cbcb.umd.edu/software/GlimmerHMM/







Example: ab initio gene prediction

• Go to http://bioinf.uni-greifswald.de/augustus/submission

• Paste your MASKED sequence in the submission form and select an organism similar to the organism of interest (e.g. Arabidopsis thaliana)

• Run AUGUSTUS

36





• Click on graphical and text results

37


• Then on graphical browsable results

38


• Results displayed on a gbrowse interface

39









40

Evidence alignment

• Alignment of extrinsic data provides experimental evidence for the gene predictions

– ESTs, cDNAs

• From the same or a related species

• Mapped with Exonerate, Gmap, BLAT

– Proteins

• A highly curated protein dataset (SwissProt) doesn’t need to be from the same species

• Mapped with TBLASTX, BLAT

– RNA-Seq data

• TopHat mapping processed with Cufflinks or Scripture

• Mappings (Gmap or BLAT) of de novo assembled transcripts (Oases, Trinity, TransAbyss)

41

Evidence alignment

Gene predictions

Protein alignments

EST alignments

RNA-Seq reads alignments

42

Gene annotation using RNA-Seq data

Representative samples

mRNA isolation

Fragmentation

Library construction

Sequencing

Reconstruction of transcript isoforms

43

Generation of paired-end reads

ACCTGACTGG

A2 A1 SP1

CACGTCTCTGG SP2

Paired-end reads

Sequence_1.fastq Sequence_2.fastq

Data are generated as paired-end reads: 100 bp are sequenced from each end of a fragment of about 250 bp

44

Alignment of RNA-Seq reads to a reference genome

genome

exons

introns

mRNA

In an RNA-Seq experiment, reads are generated by sequencing the ends of 200-300 bp fragments of messenger RNA, from which the intronic sequences have been removed by the splicing machinery during mRNA maturation. Some fragments will span exon-exon junctions.


Reads derived from fragments entirely contained within a single exon map correctly, with a distance between the two reads of a pair that is compatible with the library size.

46


Read pairs mapping to two different exons will show an apparent insert size on the genome that is not compatible with the library size.
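The comparison between the apparent insert size and the library size can be sketched in Python as follows. The function names, the coordinate convention (1-based, inclusive) and the tolerance value are assumptions for illustration.

```python
def pair_span(left_start, right_end):
    """Distance on the genome covered by a read pair (1-based, inclusive)."""
    return right_end - left_start + 1


def is_concordant(left_start, right_end, library_size, tolerance):
    """True if the genomic span of the pair is compatible with the library size."""
    return abs(pair_span(left_start, right_end) - library_size) <= tolerance


# Hypothetical pairs for a library of about 250 bp
print(is_concordant(1001, 1250, library_size=250, tolerance=50))  # same exon
print(is_concordant(1001, 3250, library_size=250, tolerance=50))  # intron in between
```

A pair whose span exceeds the library size by roughly the length of an intron is a hint that the two reads lie on different exons of the same transcript.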

47


Reads spanning an exon-exon junction cannot be mapped correctly by standard algorithms.

48



49


Ideally, the read should be split into a spliced alignment that accounts for the intron.

50

Using a database of splice junctions

[…]

Custom database of known junctions.

A custom junction database is built by joining the ends of the exons. Spliced reads are detected by aligning the reads that do not map to the genome against the junction database.

A limitation of this approach is that it can only detect known junctions.
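The construction of a junction database from known exons can be sketched in Python. The function name `junction_database`, the `flank` parameter, the toy genome and the choice of joining only consecutive exons of one transcript are assumptions made for this illustration; they are not taken from the cited paper.

```python
def junction_database(genome, exons, flank):
    """Build junction sequences by joining the ends of consecutive exons.

    exons: sorted (start, end) pairs of one transcript, 0-based, end exclusive.
    Returns {(donor_exon_end, acceptor_exon_start): junction_sequence}.
    """
    junctions = {}
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        left = genome[max(s1, e1 - flank):e1]      # last bases of the upstream exon
        right = genome[s2:min(e2, s2 + flank)]     # first bases of the downstream exon
        junctions[(e1, s2)] = left + right
    return junctions


# Toy genome: exon AAACCC, intron GTAAAAAG, exon GGGTTT
genome = "AAACCCGTAAAAAGGGGTTT"
db = junction_database(genome, [(0, 6), (14, 20)], flank=3)
print(db)

read = "CCGG"  # toy read spanning the junction
print(read in genome, any(read in seq for seq in db.values()))
```

The toy read does not occur in the genome but is found in the junction sequence, which is how a spliced read is recognized with this approach. A junction that is absent from the exon annotation can never be found this way.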

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470–6. doi:10.1038/nature07509

51

TopHat 1.0

• Since version 1.0, it exploits the greater length of the reads

Higher sensitivity

Reference genome

Unmappable read

Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11. doi: 10.1093/bioinformatics/btp120.

TopHat 1.0

Reference genome

Unmappable read

25nt




• Unmapped reads of 75 bases (or longer) are split into 3 or more 25-base sub-reads, which are mapped independently.

53

TopHat 1.0

Reference genome

Unmappable read

25nt





• Reads whose segments can only be mapped non-contiguously

are marked as possible intron-spanning reads
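The splitting of an unmapped read into 25-base segments and the test for non-contiguous mapping can be sketched in Python. This is a simplified illustration: the function names and the representation of segment hits as genomic start positions are assumptions, and the actual TopHat implementation is considerably more elaborate.

```python
def split_into_segments(read, seg_len=25):
    """Cut a read into consecutive segments of seg_len bases (a shorter tail is dropped)."""
    return [read[i:i + seg_len] for i in range(0, len(read) - seg_len + 1, seg_len)]


def looks_intron_spanning(segment_hits, seg_len=25):
    """segment_hits: genomic start of each segment in read order, None if unmapped.

    Returns True if two mapped segments lie farther apart (or closer together)
    on the genome than they do in the read.
    """
    mapped = [(i, pos) for i, pos in enumerate(segment_hits) if pos is not None]
    for (i, a), (j, b) in zip(mapped, mapped[1:]):
        if b - a != (j - i) * seg_len:
            return True
    return False


segments = split_into_segments("ACGTA" * 15)       # a 75-base read
print(len(segments), {len(s) for s in segments})
print(looks_intron_spanning([100, 125, 150]))      # contiguous mapping
print(looks_intron_spanning([100, None, 950]))     # middle segment crosses a junction
```

In the second example the middle segment cannot be mapped because it contains the junction, while its neighbours map on either side of the intron.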

54

TopHat 1.0

Reference genome

Unmappable read

25nt

L1 L2







• The set of all possible donor-acceptor combinations is described by:

L1+L2=k; 1 < L1 < k-1; L2 = k-L1

55

TopHat 1.0

Reference genome

Unmappable read

25nt

donor site acceptor

site








L1+L2=k; 1 < L1 < k-1; L2 = k-L1

• k bases upstream of the donor site concatenated with k bases downstream of the acceptor site

TopHat 1.0

Unmappable reads


Alignment of the unalignable reads against the junction database








The Cufflinks workflow - 1

Paired-end reads are mapped to the genome with an aligner capable of spliced alignments (e.g. TopHat).

The SAM output of TopHat is used as input for Cufflinks: the reads are assembled and an overlap graph is built.

The Cufflinks workflow - 2

Transcripts are inferred from the graph by searching for the minimum path (maximum parsimony) that explains the observed overlaps.

60


Genome

● Reads are assigned to transcripts based on the compatibility of their alignment with the transcript model.

● Because genes have multiple isoforms, some of which share exons, reads cannot always be assigned unambiguously to a single isoform.

● Cufflinks handles this uncertainty by building a likelihood function that models the sequencing process and by identifying the isoform abundance estimates that best explain the observed reads.

● The estimate is defined as the set of isoform abundances that maximizes the likelihood function (maximum likelihood estimate; MLE).
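The idea of a maximum likelihood estimate of isoform abundances can be illustrated with a toy expectation-maximization (EM) loop in Python. This is a deliberately simplified sketch and not the Cufflinks model: transcript lengths and the fragment length distribution are ignored, and the function name and input representation are assumptions.

```python
def estimate_abundances(read_compat, isoforms, iterations=200):
    """Toy EM estimate of relative isoform abundances.

    read_compat: for each read, the set of isoforms it is compatible with.
    """
    theta = {t: 1.0 / len(isoforms) for t in isoforms}
    for _ in range(iterations):
        counts = {t: 0.0 for t in isoforms}
        # E-step: share each read among its compatible isoforms
        for compatible in read_compat:
            norm = sum(theta[t] for t in compatible)
            if norm == 0:
                continue
            for t in compatible:
                counts[t] += theta[t] / norm
        total = sum(counts.values())
        if total == 0:
            break
        # M-step: abundances proportional to the expected read counts
        theta = {t: counts[t] / total for t in isoforms}
    return theta


# Two reads specific to isoform A, one read shared between A and B
reads = [{"A"}, {"A"}, {"A", "B"}]
print(estimate_abundances(reads, ["A", "B"]))
```

In this example no read is specific to isoform B, so the shared read is progressively attributed to A and the estimated abundance of B tends to zero.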

Fragment distribution

61

Annotations generated by Cufflinks

Chromosome

Region

Gene

Transcripts annotated by Cufflinks

Coverage

Read alignments

62

Comparative gene finders

• Use additional information in the form of cross-species conservation at the DNA or amino acid level.

63

Annotation consensus

• All sources of structural evidence contain errors

• Manual curation of all the evidence is not feasible

• Integrating the evidence into a consensus gives higher accuracy

Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.

64

Annotation Edit Distance (AED)

SN = TP/(TP+FN) (sensitivity)
SP = TP/(TP+FP) (specificity)
AC = (SN+SP)/2 (accuracy)
AED = 1-AC

AED = 0: the annotation is in perfect agreement with its evidence
AED = 1: complete lack of evidence support for the annotation

100bp 50bp 50bp

75bp 50bp

Numbers at nucleotide level

Numbers at exon level
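The AED formulas can be applied at the nucleotide level with a short Python sketch. The function names and the representation of annotation and evidence as lists of 1-based inclusive intervals are assumptions; the intervals in the example are invented and do not reproduce the figure.

```python
def positions(intervals):
    """Set of genomic positions covered by (start, end) intervals, 1-based inclusive."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def aed(annotation, evidence):
    """Annotation edit distance at the nucleotide level."""
    ann, ev = positions(annotation), positions(evidence)
    tp = len(ann & ev)            # bases in both annotation and evidence
    if tp == 0:
        return 1.0                # no support at all
    sn = tp / len(ev)             # TP / (TP + FN)
    sp = tp / len(ann)            # TP / (TP + FP)
    return 1 - (sn + sp) / 2


# Hypothetical annotation of 100 bp supported by evidence over its first 50 bp
print(aed([(1, 100)], [(1, 50)]))
```

Here SN is 1 and SP is 0.5, so the accuracy is 0.75 and the AED is 0.25.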


65



66


67

File formats

• Annotations can be downloaded and redistributed in a variety of formats. Among the most common:

– GFF (General Feature Format), GTF (Gene Transfer Format), BED: display the coordinates of genes, transcripts, alignments, repeats, etc.

– Wiggle: describes quantitative data

– VCF (Variant Call Format): specific for the description of variants (developed by the 1000 Genomes Project)

68

GFF(3) file format

Widely used file format to describe the position of features on the genome. Also used by the GMOD project, in particular by GBrowse.

chrom  source  feature type  start  end  score  strand  frame  attributes
1  Ensembl  gene  3028681  3030154  .  -  .  ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding
1  Ensembl  transcript  3028681  3030154  .  -  .  ID=Vv01s0011g03340.t01;Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340
1  Ensembl  exon  3030130  3030154  .  -  1  Name=Vv01s0011g03340.t01.exon1;Parent=Vv01s0011g03340.t01
1  Ensembl  CDS  3030130  3030154  .  -  1  Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
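A GFF3 record such as the gene line above can be read with a few lines of Python. The sketch assumes tab-separated columns, as in GFF3 files (the slide shows them separated by spaces); the function name and the keys of the returned dictionary are choices made for this illustration.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line must have 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Attributes are tag=value pairs separated by semicolons
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


line = ("1\tEnsembl\tgene\t3028681\t3030154\t.\t-\t.\t"
        "ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding")
gene = parse_gff3_line(line)
print(gene["attributes"]["ID"], gene["end"] - gene["start"] + 1)
```

Since both start and end are inclusive, the length of a feature is end - start + 1.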




69

GFF format explained

Field  Description
chrom  The name of the sequence. Must be a chromosome or scaffold.
source  The program that generated this feature.
feature  The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
start  The starting position of the feature in the sequence. The first base is numbered 1.
end  The ending position of the feature (inclusive).
score  A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
strand  Valid entries include '+' and '-'.
phase  For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
attributes  A list of feature attributes in the format tag=value.

chrom  source  feature type  start  end  score  strand  frame  attributes
1  Ensembl  gene  3028681  3030154  .  -  .  ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding

70

GFF attributes

ID  Unique ID
Name  Name displayed to the user
Alias  Alternative name
Parent  Indicates the parent of the feature
Target  Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment
Gap  The alignment of the feature to the target if the two are not collinear
Derives_from  Used to disambiguate the relationship between one feature and another
Note  Free-form text
Dbxref  A database cross reference
Ontology_term  Cross reference to an ontology term

A gene is parent to its mRNAs, which are parents to their exons, etc.

Integrative Genomics Viewer (IGV) stand-alone genome browser

http://www.broadinstitute.org/igv/home

Click here to register and download the application

72

4 AUGUSTUS gene 2916 6002 0.09 + . g1
4 AUGUSTUS transcript 2916 6002 0.09 + . g1.t1
4 AUGUSTUS tss 2916 2916 . + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS exon 2916 3164 . + . transcript_id
4 AUGUSTUS start_codon 3073 3075 . + 0 transcript_id
4 AUGUSTUS initial 3073 3164 0.99 + 0 transcript_id
4 AUGUSTUS internal 3243 3308 0.97 + 1 transcript_id
4 AUGUSTUS internal 4332 4466 1 + 1 transcript_id
4 AUGUSTUS internal 4563 4626 1 + 1 transcript_id
4 AUGUSTUS internal 4725 4810 0.93 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4913 4995 0.94 + 1 transcript_id
4 AUGUSTUS terminal 5616 5914 0.42 + 2 transcript_id
4 AUGUSTUS intron 3165 3242 0.97 + . transcript_id
4 AUGUSTUS intron 4467 4562 1 + . transcript_id

73

IGV Navigation Bar

Names column Memory usage

Gene Navigation: Draw a box to enlarge the region

Track visualization panels

74

IGV

With the File drop-down menu it is possible to load genome and track data or to save the results.

1st step: The Genome

There are three kinds of genomes you can use:

1. Preloaded genomes.

2. A pre-built IGV genome that you load.

3. Your own genome. To create it, click on Import Genome and, in the pop-up window, enter the files of your genome (FASTA for the sequence; GFF for the gene locations).

The sequence in FASTA format

If you have it, the annotation of the genome in GTF or GFF format

IGV

The Genome sequence file has been loaded

A new track with genome annotation has been added

A right click on the track opens a menu that lets you change the track visualization properties

Hovering over the track displays a pop-up window with information about the feature.

76

IGV

2nd step: Load a track

From the File drop-down menu choose load file; the pop-up window lets you browse your files and choose the one you want to load.

Tracks may be of different kinds and of many file types. Here are some examples of the most used:

Alignments: SAM, BAM (must be indexed), BLAT PSL files

Annotations: GTF, GFF, BED

Variants: VCF

IGV

• Load the following annotations:

– RPTchr10.gff: repeats masked by RepeatMasker

– V0chr10.gff: annotation, V0 version

– V1repeatchr10.gff: annotation of ORFs in repeated sequences

78

IGV Example: RNA-Seq data

Load the BAM file chr10.bam containing the alignments and the file Cufflinks_chr10.gtf containing the annotations generated by Cufflinks

Assembly 12x V1 annotation

Transcripts identified by Cufflinks

Reads Coverage (From Alignment)

Reads Alignment

79

IGV

Unannotated gene identified by Cufflinks

Known genes reconstructed by Cufflinks

80

IGV

A right click on the track opens a menu that lets you change the track visualization properties

Hovering over the track displays a pop-up window with information about the feature.

Zooming in to the single-nucleotide level, we can notice the presence of SNPs

81

Date post:	23-Feb-2020