Gene annotation
Genome annotation vs. genome sequencing
• Genome sequencing alone does not provide information about the functions encoded by the nucleotide sequence.
• Genome annotation: the sequence is “decorated” with evidence describing the characteristics of genomic regions (features), providing the basis for further analysis to understand the biology of the organism.
>chr1
ACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAAACCGTAAAGTCCAAAACGCTAACCCCTTAACCCTAAACCCTAAACCCTGAACCCTAAATCCCT…
A “standard” annotation pipeline
• Detection and masking of repeated sequences
• Ab initio gene prediction
• Evidence alignment:
– ESTs
– Full-length cDNAs
– RNA-Seq sequences
– Proteins
• Gene annotation by generating a consensus of ab initio gene predictions and evidence alignments
Types of repeats
1. Interspersed repeats
2. Processed pseudogenes
3. Simple sequence repeats
4. Segmental duplications
5. Blocks of tandem repeats
Interspersed repeats (transposon-derived repeats)
Constitute ~45% of the human genome. They transpose via RNA intermediates (retroelements) or DNA intermediates (DNA transposons).
• Long terminal repeat (LTR) retrotransposons (RNA-mediated)
• Long interspersed elements (LINEs) (RNA-mediated); these encode a reverse transcriptase
• Short interspersed elements (SINEs) (RNA-mediated); these include Alu repeats
• DNA transposons (3% of the human genome)
Figure source: Heredity (2010) 104, 520–533; doi:10.1038/hdy.2009.165
Processed pseudogenes
• Pseudogenes carry a premature stop codon or a frameshift mutation and do not encode a functional protein. They commonly arise from retrotransposition (processed pseudogenes), or from gene duplication followed by loss of function.
Simple sequence repeats
• Microsatellites: repeat units from one to a dozen base pairs (examples: (A)n, (CA)n, (CGG)n)
• Minisatellites: repeat units from a dozen to 500 base pairs
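As an illustration of the microsatellite definition, a perfect tandem repeat can be found with a regular expression that uses a back-reference. This is a minimal sketch, not the method used by any specific tool: the function name `find_ssrs` and the thresholds (`max_unit=6`, `min_repeats=4`) are assumptions chosen for the example.

```python
import re

def find_ssrs(seq, max_unit=6, min_repeats=4):
    """Return (start, end, unit, copies) for each perfect tandem repeat.

    Coordinates are 0-based, end-exclusive. The thresholds are illustrative.
    """
    # A short unit (lazy match) followed by at least min_repeats-1 more copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), m.end(), unit, len(m.group(0)) // len(unit)))
    return hits

# A (CA)5 and a (CGG)4 repeat embedded in a short sequence.
print(find_ssrs("TTCACACACACATTCGGCGGCGGCGGAT"))
# [(2, 12, 'CA', 5), (14, 26, 'CGG', 4)]
```

Imperfect repeats (with mismatches or indels) are not detected by this pattern.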
Segmental duplications
• Blocks of about 1 kb to 300 kb that are copied intra- or interchromosomally (about 5% of the human genome).
Blocks of Tandem Repeats
• These include telomeric repeats (e.g. TTAGGG in humans) and centromeric repeats (e.g. the 171 bp repeat of α-satellite DNA in humans).
Biological meaning of repeats
• Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in:
– Genome rearrangements (Kazazian, 2004)
– Drift to new biological function (Kidwell and Lisch, 2001)
– Increased rate of evolution under stress (Capy et al, 2000)
Detection and masking of repeated sequences
• Repeats need to be masked before performing most single-species or multi-species analyses.
• Masking is necessary to avoid artifacts during gene annotation:
– Repeated sequences may generate spurious alignments of ESTs, cDNAs, etc.
– ORFs of transposon genes are picked up by ab initio gene predictors and annotated as protein-coding genes
• Software:
– RepeatMasker (requires a repeat library appropriate for the genome being annotated)
• Masked repeats are replaced by Ns
RepeatMasker
• Software
– Smit AFA, Hubley R, and Green P. “RepeatMasker-Open 3.0.” 1996-2004. (http://www.repeatmasker.org)
– CrossMatch / WU-BLAST (alignments)
• Repeats Database
– RepBase library (http://www.girinst.org)
Repeats library
• Uses a library of known eukaryotic repeat sequences
• Supplied by the RepBase project
• Repeats in RepBase are manually curated
• Requires registration (free for academics)
Types of repeats in the RepBase library
• Interspersed (Alu, LINE, MIR, …)
• Simple (agagagag, atcatcatc, …)
• Micro- and mini-satellites
• Noncoding RNAs (tRNA, rRNA, snRNA, …)
• Common contaminants (E. coli, vectors)
RepBase reports
[Figure: an example RepBase report in EMBL format]
Consensus sequence
• A repeat family is represented by a consensus sequence. Example: there are more than 1 million Alu repeats in the human genome.
accgataggtatacgtatca-tttacgatac
atcgct-ggtttacgcgtcaattcaggatgc
accggt-tgtttacgtagcaatctaggatac
accgat-ggtttacgtatcaatttaggatac
Consensus sequences can be efficiently aligned to a reference genome
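To make the idea of a consensus concrete, the four aligned copies shown above can be reduced to one sequence by a per-column majority vote. This is only a sketch under simple assumptions (equal-length rows, `-` as gap, columns dominated by gaps dropped); curated libraries build their consensus sequences with more refined procedures, and the function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(rows):
    """Majority-vote consensus of equal-length aligned rows ('-' marks a gap)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned rows must have the same length")
    out = []
    for column in zip(*rows):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":  # drop columns where a gap is the most common symbol
            out.append(base)
    return "".join(out)

alignment = [
    "accgataggtatacgtatca-tttacgatac",
    "atcgct-ggtttacgcgtcaattcaggatgc",
    "accggt-tgtttacgtagcaatctaggatac",
    "accgat-ggtttacgtatcaatttaggatac",
]
print(consensus(alignment))
# accgatggtttacgtatcaatttaggatac
```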
RepeatMasker
• Sequences in the FASTA input file are split into short overlapping fragments (overlap = 2000 bp)
• Repeat consensus sequences are aligned to the split input sequences
– Different BLAST-like programs can be used to align the repeats to the reference genome:
  • Cross-Match
  • WU-BLAST (faster)
– Output is converted to a standard format (cross-match format)
• Duplicates are removed and fragmented hits are assembled
• Insignificant hits are removed (based on a Smith-Waterman score threshold)
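The first step, splitting the input into overlapping fragments, can be sketched as follows. Only the 2000 bp overlap comes from the slide; the fragment length used in the example (60000) and the function name `split_overlapping` are assumptions made for illustration.

```python
def split_overlapping(seq_len, frag_len, overlap):
    """Return 0-based, end-exclusive (start, end) windows covering seq_len bases.

    Consecutive windows share `overlap` bases, so a repeat lying across a
    window boundary is still seen whole in at least one window (if short enough).
    """
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must be >= 0 and smaller than frag_len")
    if seq_len <= 0:
        return []
    step = frag_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + frag_len, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows

# Hypothetical fragment length of 60 kb with the 2000 bp overlap from the slide.
print(split_overlapping(150_000, frag_len=60_000, overlap=2_000))
```

Hits found twice in an overlap region are the duplicates that have to be removed in the later step.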
De novo repeats annotation
• Repeat libraries are available only for a limited number of species
• Highly diverged repeats can be hard to find
• When a novel species is being annotated and no repeat library is available for it, repeats must be annotated de novo.
De novo identification of repeats
• An all vs. all pairwise comparison is performed
• Unknown repeats will yield alignments
• Overlapping aligned fragments from the same element are grouped
• Elements can thus be defined
• Defined elements are then clustered into one family because they are all similar to each other
Bao Z, Eddy SR. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Research. 2002;12:1269–1276. doi:10.1101/gr.88502.
De novo repeats annotation
Tool  URL  Reference
RepeatScout  http://repeatscout.bioprojects.org/  Price et al. (2005)
RAP  http://genomics.cribi.unipd.it/index.php/Rap_Repeat_Filter  Campagna et al. (2005)
REPuter  http://www.genomes.de/  Kurtz and Schleiermacher (1999)
Repeat-match  http://mummer.sourceforge.net/  Delcher et al. (1999)
RepSeek  http://wwwabi.snv.jussieu.fr/public/RepSeek/  Achaz et al. (2007)
Tallymer  http://www.zbh.uni-hamburg.de/Tallymer/  Kurtz et al. (2008)
Vmatch  http://www.vmatch.de/  Kurtz (unpublished)
mer-engine  http://roma.cshl.org/mer-home.php  Healy et al. (2003)
FORRepeats  http://al.jalix.org/FORRepeats/  Lefebvre et al. (2003)
P-Clouds  http://www.evolutionarygenomics.com/PClouds.html  Gu et al. (2008)
Spectral repeat finder  http://www.imtech.res.in/raghava/srf/  Sharma et al. (2004)
RepeatFinder  http://cbcb.umd.edu/software/RepeatFinder/  Volfovsky et al. (2001)
REPEATGLUER  http://nbcr.sdsc.edu/euler/intro_tmp.htm  Pevzner et al. (2004)
DAWG-PAWS  http://dawgpaws.sourceforge.net/  Estill and Bennetzen (2009)
RepeatModeler  http://www.repeatmasker.org/RepeatModeler.html  Smit (unpublished)
RepeatRunner  http://www.yandell-lab.org/software/repeatrunner.html  Smith et al. (2007)
REannotate  http://www.bioinformatics.org/reannotate/index.html  Pereira (2008)
Example: mask repeats with RepeatMasker
• Go to the RepeatMasker web server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker/) and paste your sequence in the Sequence box.
• In DNA source, select Other from the drop-down menu and specify arabidopsis (for example).
• Then press the Submit sequence button.
RepeatMasker Output Summary
==================================================
file name: RM2_arabidopsis.txt_1433174330
sequences: 1
total length: 20251 bp (20251 bp excl N/X-runs)
GC level: 42.31 %
bases masked: 17639 bp ( 87.10 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
Retroelements 2 10010 bp 49.43 %
SINEs: 0 0 bp 0.00 %
Penelope 0 0 bp 0.00 %
LINEs: 0 0 bp 0.00 %
CRE/SLACS 0 0 bp 0.00 %
L2/CR1/Rex 0 0 bp 0.00 %
R1/LOA/Jockey 0 0 bp 0.00 %
R2/R4/NeSL 0 0 bp 0.00 %
RTE/Bov-B 0 0 bp 0.00 %
L1/CIN4 0 0 bp 0.00 %
LTR elements: 2 10010 bp 49.43 %
BEL/Pao 0 0 bp 0.00 %
Ty1/Copia 0 0 bp 0.00 %
Gypsy/DIRS1 2 10010 bp 49.43 %
Retroviral 0 0 bp 0.00 %
DNA transposons 6 6788 bp 33.52 %
hobo-Activator 0 0 bp 0.00 %
Tc1-IS630-Pogo 0 0 bp 0.00 %
En-Spm 0 0 bp 0.00 %
MuDR-IS905 0 0 bp 0.00 %
PiggyBac 0 0 bp 0.00 %
Tourist/Harbinger 0 0 bp 0.00 %
Other (Mirage, P-element, Transib) 0 0 bp 0.00 %
Rolling-circles 0 0 bp 0.00 %
Unclassified: 4 841 bp 4.15 %
Total interspersed repeats: 17639 bp 87.10 %
Small RNA: 0 0 bp 0.00 %
Satellites: 0 0 bp 0.00 %
Simple repeats: 0 0 bp 0.00 %
Low complexity: 0 0 bp 0.00 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Provides a summary of repeats identified in the input sequence
RepeatMasker output
• HSPs can be expanded to display the alignment between the genomic sequence and the repeat sequence
• Repeats are “hard masked” by replacing the nucleotide sequence with Ns
• The masked sequence is ready for gene prediction and evidence alignment
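Hard masking itself is a simple operation once the repeat coordinates are known. The sketch below assumes the hits are given as 1-based, inclusive (start, end) pairs; the function name `hard_mask` and the example coordinates are hypothetical.

```python
def hard_mask(seq, intervals):
    """Replace every base inside the given intervals with 'N'.

    intervals: iterable of 1-based, inclusive (start, end) pairs (assumed
    convention for this sketch); coordinates outside the sequence are clipped.
    """
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)

print(hard_mask("ACGTACGTAC", [(3, 5)]))
# ACNNNCGTAC
```

The sequence length is preserved, so the coordinates of any gene later predicted on the masked sequence remain valid on the original one.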
Types of genes
• Protein-coding genes
• Pseudogenes
• Functional RNA genes:
– tRNA
– rRNA
– snoRNA
– snRNA
– miRNA
Software for the identification of non-coding RNAs: tRNAscan-SE, snoscan, and Infernal (INFERence of RNA ALignment), which is based on probabilistic models of the sequence and secondary structure of an RNA family
Gene annotation
• Gene structure is annotated by defining exons, CDSs, UTRs and intron boundaries.
• Both intrinsic properties of the sequence and extrinsic data are used to define the gene structure.
[Figure: structure of a gene on the genome, showing enhancer, promoter, UTRs, CDS and introns]
Ab initio gene prediction
Uses intrinsic properties of the genome sequence.
Syntax rules used to predict genes:
• Generally, the coding sequence (CDS) of a gene begins with a start codon (ATG) and ends with a stop codon (TGA, TAA, TAG). The stop codon must be in frame with the start codon.
• Introns begin with a donor splice site (GT) and end with an acceptor splice site (AG).
• 5’ and 3’ untranslated regions (UTRs) are usually present but are generally not predicted.
Figure modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.
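The first syntax rule (ATG, then an in-frame stop codon) can be sketched as a small scanner. This is a deliberately reduced illustration: it looks only at the forward strand, ignores introns and splice sites, and the function name `find_orfs` and the `min_codons` threshold are assumptions. Real gene finders combine such syntax rules with statistical models.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def find_orfs(seq, min_codons=2):
    """Return 0-based, end-exclusive (start, end) spans from ATG to the
    first in-frame stop codon, forward strand only."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG seen in this frame opens a candidate
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:  # codons before the stop
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATGGTAACC"))
# [(2, 14)]
```

Many spans returned by such a scan are syntactically valid but are not real genes, which is why statistical scoring is needed on top of it.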
Statistical measures and training
• Many syntactically valid putative genes do not correspond to real genes.
• Statistical measures are used to compare the putative genes with the statistical profile of a real gene in the organism of interest.
• Statistical measures are learned from a training set of known genes, which differs from genome to genome
• Training can also be performed with EST, protein and RNA-Seq alignments
Signal sensors and content sensors
• Putative boundaries (start/stop codons, splice sites) are predicted by signal sensors
• CDSs are scored by a content sensor
• Together, the two sensors allow exon boundaries to be defined
Figure: likelihood ratio scores. Modified from: Majoros, W. H. (2007). Methods for Computational Gene Prediction.
Figure: position weight matrix (PWM) of a donor site. Modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.
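A signal sensor based on a PWM can be sketched in a few lines: position-specific base frequencies are learned from known sites, and a candidate window is scored as a log-likelihood ratio against a background model. The training sites below are toy data invented for the example, and the uniform background (0.25) and the pseudocount are assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length training sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm

def log_odds(pwm, window, background=0.25):
    """Log2 likelihood ratio of the window under the PWM vs. the background."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(window))

# Toy donor-site windows (hypothetical training data, GT at positions 3-4).
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
pwm = build_pwm(toy_sites)
print(log_odds(pwm, "AGGTAAGT"))  # positive: resembles the training sites
print(log_odds(pwm, "CCCCCCCC"))  # negative: unlike the training sites
```

A positive score means the window is more likely under the site model than under the background; the threshold to call a site is a tuning choice.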
Eukaryotic ab initio gene finders
• GlimmerM http://www.cbcb.umd.edu/software/GlimmerM/
• GeneID http://genome.crg.es/software/geneid/
• GeneZilla http://www.genezilla.org/
• GeneMark-ES http://exon.gatech.edu/
• Augustus http://bioinf.uni-greifswald.de/augustus/
Example: ab initio gene prediction
• Go to http://bioinf.uni-greifswald.de/augustus/submission
• Paste your MASKED sequence in the submission form and select an organism similar to the organism of interest (e.g. Arabidopsis thaliana)
• Run AUGUSTUS
• Click on “graphical and text results”
• Then on “graphical browsable results”
• Results are displayed in a GBrowse interface
Evidence alignment
• Alignment of extrinsic data provides experimental evidence for the gene predictions
– ESTs, cDNAs
  • From the same or a related species
  • Mapped with Exonerate, GMAP, BLAT
– Proteins
  • A highly curated protein dataset (Swiss-Prot) does not need to be from the same species
  • Mapped with TBLASTN, BLAT
– RNA-Seq data
  • TopHat mappings processed with Cufflinks or Scripture
  • Mappings (GMAP or BLAT) of de novo assembled transcripts (Oases, Trinity, Trans-ABySS)
Evidence alignment
[Figure: genome browser view with tracks for gene predictions, protein alignments, EST alignments and RNA-Seq read alignments]
Gene annotation using RNA-Seq data
Workflow: representative samples → mRNA isolation → fragmentation → library construction → sequencing → reconstruction of transcript isoforms
Generation of paired-end reads
[Figure: a fragment flanked by adapters A1 and A2 is sequenced from both ends with sequencing primers SP1 and SP2, producing the paired files Sequence_1.fastq and Sequence_2.fastq]
Data are generated as paired-end reads: 100 bp are sequenced from each end of a fragment of about 250 bp.
Alignment of RNA-Seq reads to a reference genome
[Figure: genome with exons and introns, and the corresponding mRNA]
• In an RNA-Seq experiment, reads are generated by sequencing the ends of 200-300 bp fragments of messenger RNA, from which the intronic sequences have been removed by the splicing machinery during mRNA maturation. Some fragments will span exon-exon junctions.
• Reads from fragments entirely contained within a single exon map correctly, with a distance between mates compatible with the library size.
• Read pairs mapping to two different exons will have an apparent insert size that is not compatible with the library size.
• Reads spanning an exon-exon junction cannot be mapped correctly by standard algorithms.
• Ideally, such a read should be split into a spliced alignment that accounts for the intron.
Using a database of splice junctions
[…]
Custom database of known junctions.
• A custom junction database is built by joining the ends of exons. Spliced reads are detected by aligning the reads that did not map to the genome against the junction database.
• A limitation of this approach is that it can only detect known junctions.
Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470–6. doi:10.1038/nature07509
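The junction-database idea can be sketched as follows: for every pair of annotated exons, the last k bases of the upstream exon are joined to the first k bases of the downstream exon. This is a toy illustration, not the procedure of the cited paper; the function names, the exon coordinates and the tiny genome (exons in upper case, introns in lower case) are all invented for the example.

```python
def junction_library(genome, exons, k):
    """Map (i, j) exon index pairs to the junction sequence joining exon i to exon j.

    exons: sorted 0-based, end-exclusive (start, end) pairs.
    """
    library = {}
    for i in range(len(exons)):
        for j in range(i + 1, len(exons)):
            left = genome[max(exons[i][0], exons[i][1] - k):exons[i][1]]
            right = genome[exons[j][0]:min(exons[j][1], exons[j][0] + k)]
            library[(i, j)] = left + right
    return library

def matching_junctions(read, library):
    """Exon pairs whose junction sequence contains the read (exact match only)."""
    return [pair for pair, seq in library.items() if read in seq]

genome = "ACGTACGT" + "gtaaag" + "TTGGCCAA" + "gtccag" + "CATCATCA"
exons = [(0, 8), (14, 22), (28, 36)]
library = junction_library(genome, exons, k=4)
print(library)
print(matching_junctions("GTTTGG", library))  # read longer than k, so it spans the junction
```

Because the library is built from annotated exons only, a read crossing an unannotated junction finds no match, which is the limitation stated above.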
TopHat 1.0
Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11. doi: 10.1093/bioinformatics/btp120.
[Figure: an unmappable read split into 25 nt segments on the reference genome, with lengths L1 and L2 on either side of the donor and acceptor sites]
• Since version 1.0, TopHat exploits the greater length of the reads, which gives higher sensitivity.
• Unmapped reads of 75 bases (or longer) are split into 3 or more 25-base segments that are mapped independently.
• Reads whose segments can only be mapped non-contiguously are marked as possible intron-spanning reads.
• The set of all possible donor-acceptor combinations is described by:
L1+L2=k; 1 < L1 < k-1; L2 = k-L1
• For each candidate junction, the k bases upstream of the donor site are concatenated with the k bases downstream of the acceptor site.
• The reads that could not be aligned to the genome are then aligned to this junction database.
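The two bookkeeping steps on this slide, cutting a read into 25-base segments and enumerating the (L1, L2) pairs, can be sketched as below. This is an illustration of the slide's description only, not TopHat's implementation: the function names are hypothetical, trailing bases that do not fill a whole segment are simply dropped here, and the bounds on L1 are taken literally from the formula on the slide.

```python
def split_segments(read, seg_len=25):
    """Consecutive, non-overlapping seg_len-base segments of a read.

    Simplification for this sketch: a trailing remainder shorter than
    seg_len is discarded.
    """
    return [read[i:i + seg_len] for i in range(0, len(read) - seg_len + 1, seg_len)]

def junction_splits(k):
    """All (L1, L2) with L1 + L2 = k and 1 < L1 < k-1, as written on the slide."""
    return [(l1, k - l1) for l1 in range(2, k - 1)]

read = "ACGTA" * 15              # a 75-base read
print(len(split_segments(read)))  # 3
print(junction_splits(6))         # [(2, 4), (3, 3), (4, 2)]
```

Each segment is then mapped on its own; a read whose segments land on the genome with a gap between them is a candidate for spanning an intron.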
The Cufflinks workflow - 1
Paired-end reads are mapped to the genome with an aligner able to produce spliced alignments (e.g. TopHat).
The Cufflinks workflow - 2
The SAM output of TopHat is used as input for Cufflinks; the reads are assembled and an overlap graph is built.
Transcripts are inferred from the graph by looking for the minimum set of paths (maximum parsimony) that explains the observed overlaps.
The Cufflinks workflow - 3
● Reads are assigned to transcripts based on the compatibility of the alignment with the transcript model.
● Because genes have multiple isoforms, some of which share exons, reads cannot always be assigned unambiguously to one isoform.
● Cufflinks deals with this uncertainty by building a likelihood function that models the sequencing process, and identifies the isoform abundance estimates that best explain the observed reads.
● The estimate is defined as the set of isoform abundances that maximizes the likelihood function (maximum likelihood estimate; MLE).
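The principle of a maximum likelihood estimate with ambiguous reads can be shown on a toy model solved by expectation-maximization (EM). This is not the Cufflinks model: the sketch assumes all isoforms have the same length and ignores the fragment length distribution, both of which the real likelihood takes into account. The function name and the input encoding (one set of compatible isoform indices per read) are assumptions.

```python
def estimate_abundances(read_compat, n_isoforms, n_iter=200):
    """EM estimate of relative isoform abundances.

    read_compat: list of non-empty sets; each set holds the indices of the
    isoforms a read is compatible with.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        for compat in read_compat:
            # E-step: share the read among compatible isoforms,
            # proportionally to their current abundance estimates.
            total = sum(theta[i] for i in compat)
            for i in compat:
                counts[i] += theta[i] / total
        # M-step: new abundances are the normalized expected read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta

# 6 reads unique to isoform 0, 2 unique to isoform 1, 4 shared by both.
reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
print(estimate_abundances(reads, n_isoforms=2))
```

In this example the shared reads end up split 3:1, the same ratio as the uniquely assigned reads, giving abundances close to 0.75 and 0.25.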
[Figure: reads assigned to transcripts along the genome, with the fragment length distribution]
Annotations generated by Cufflinks
[Figure: genome browser view showing chromosome, region, gene, transcripts annotated by Cufflinks, coverage and read alignments]
Comparative gene finders
• Use additional information in the form of cross-species conservation at the DNA or amino acid level.
Annotation consensus
• All sources of structural evidence contain errors
• Manual curation of all the evidence is not feasible
• Integrating the evidence into a consensus gives higher accuracy
Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.
Annotation Edit Distance (AED)
SN = TP/(TP+FN) (sensitivity)
SP = TP/(TP+FP) (specificity)
AC = (SN+SP)/2 (accuracy)
AED = 1-AC
AED = 0: the annotation is in perfect agreement with its evidence
AED = 1: complete lack of evidence support for the annotation
[Figure: worked example with exon lengths of 100 bp, 50 bp, 50 bp and 75 bp, 50 bp, with the numbers computed at the nucleotide level and at the exon level]
Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.
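The AED formulas can be applied at the nucleotide level with a few set operations. In this sketch the evidence plays the role of the reference: TP are bases shared by annotation and evidence, FN are evidence bases missing from the annotation, FP are annotated bases without evidence. The function name `aed` and the 1-based inclusive interval convention are assumptions.

```python
def aed(annotation, evidence):
    """Nucleotide-level Annotation Edit Distance.

    annotation, evidence: lists of 1-based, inclusive (start, end) exon intervals.
    """
    ann = {p for s, e in annotation for p in range(s, e + 1)}
    evi = {p for s, e in evidence for p in range(s, e + 1)}
    if not ann or not evi:
        return 1.0  # nothing annotated or no evidence: no support at all
    tp = len(ann & evi)
    fn = len(evi - ann)
    fp = len(ann - evi)
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return 1 - (sn + sp) / 2

print(aed([(1, 100)], [(1, 100)]))     # 0.0, perfect agreement
print(aed([(1, 100)], [(51, 150)]))    # 0.5, half of the bases are shared
print(aed([(1, 100)], [(201, 300)]))   # 1.0, no overlap
```

Building explicit position sets is fine for single genes; for whole genomes an interval-based computation would be preferable.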
File formats
• Annotations can be downloaded and redistributed in a variety of formats. Among the most common:
– GFF (General Feature Format), GTF (Gene Transfer Format), BED: describe the coordinates of genes, transcripts, alignments, repeats, etc.
– Wiggle: describes quantitative data
– VCF (Variant Call Format): specific for the description of variants (developed by the 1000 Genomes Project)
GFF(3) file format
Widely used file format to describe the position of features on the genome. Also used by the GMOD project, in particular by GBrowse.
chrom source feature_type start end score strand phase attributes
1 Ensembl gene 3028681 3030154 . - . ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding
1 Ensembl transcript 3028681 3030154 . - . ID=Vv01s0011g03340.t01;Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340
1 Ensembl exon 3030130 3030154 . - 1 Name=Vv01s0011g03340.t01.exon1;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3029539 3029592 . - 1 Name=Vv01s0011g03340.t01.exon2;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3029303 3029419 . - 1 Name=Vv01s0011g03340.t01.exon3;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3028681 3028748 . - 0 Name=Vv01s0011g03340.t01.exon4;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3030130 3030154 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3029539 3029592 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3029303 3029419 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3028681 3028748 . - 0 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
GFF format explained
Field  Description
chrom  The name of the sequence. Must be a chromosome or scaffold.
source  The program that generated this feature.
feature  The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
start  The starting position of the feature in the sequence. The first base is numbered 1.
end  The ending position of the feature (inclusive).
score  A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
strand  Valid entries include '+', '-', or '.'.
phase  For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
attributes  A list of feature attributes in the format tag=value.
GFF attributes
ID  Unique ID
Name  Name displayed to the user
Alias  Alternative name
Parent  Indicates the parent of the feature
Target  Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment
Gap  The alignment of the feature to the target if the two are not collinear
Derives_from  Used to disambiguate the relationship between one feature and another
Note  Free-form text
Dbxref  A database cross-reference
Ontology_term  Cross-reference to an ontology term
A gene is the parent of its mRNAs, which are the parents of their exons, etc.
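The nine-column layout and the Parent relationship can be sketched with a small parser. This is a minimal illustration rather than a complete GFF3 reader: it assumes tab-separated columns, a single Parent per feature, and no URL-escaped characters in the attributes. The function names are hypothetical; the records are taken from the grapevine example shown earlier.

```python
def parse_gff3_line(line):
    """Parse one tab-separated GFF3 record into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 record has exactly 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }

def children_by_parent(records):
    """Group records under the ID named in their Parent attribute."""
    tree = {}
    for rec in records:
        parent = rec["attributes"].get("Parent")
        if parent is not None:
            tree.setdefault(parent, []).append(rec)
    return tree

lines = [
    "1\tEnsembl\tgene\t3028681\t3030154\t.\t-\t.\tID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding",
    "1\tEnsembl\ttranscript\t3028681\t3030154\t.\t-\t.\tID=Vv01s0011g03340.t01;Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340",
    "1\tEnsembl\texon\t3030130\t3030154\t.\t-\t1\tName=Vv01s0011g03340.t01.exon1;Parent=Vv01s0011g03340.t01",
]
records = [parse_gff3_line(l) for l in lines]
for parent, kids in children_by_parent(records).items():
    print(parent, "->", [k["type"] for k in kids])
```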
Integrative Genomics Viewer (IGV) stand-alone genome browser
http://www.broadinstitute.org/igv/home
Click here to register and download the application
4 AUGUSTUS gene 2916 6002 0.09 + . g1
4 AUGUSTUS transcript 2916 6002 0.09 + . g1.t1
4 AUGUSTUS tss 2916 2916 . + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS exon 2916 3164 . + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS start_codon 3073 3075 . + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS initial 3073 3164 0.99 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 3243 3308 0.97 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 3780 3921 0.97 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4043 4095 0.99 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4332 4466 1 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4563 4626 1 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4725 4810 0.93 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4913 4995 0.94 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 5076 5152 0.99 + 2 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 5236 5308 0.5 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS terminal 5616 5914 0.42 + 2 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3165 3242 0.97 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3309 3779 0.88 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3922 4042 0.99 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 4096 4331 0.99 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 4467 4562 1 + . transcript_id
IGV
[Figure: IGV main window, with labels for the navigation bar, the names column, the memory usage indicator, the gene track and the track visualization panels]
Navigation: draw a box to enlarge the region
IGV
With the File drop-down menu it is possible to load genome and track data or save the results.
1st step: the genome
There are 3 kinds of genomes you can use:
1. Preloaded ones.
2. A pre-built IGV genome that you load.
3. One you create yourself. To do this, click on Import Genome and enter the files of your genome in the pop-up window:
– the sequence in FASTA format
– if you have it, the annotation of the genome (gene locations) in GTF or GFF format
IGV
• The genome sequence file has been loaded
• A new track with the genome annotation has been added
• Right-clicking on a track opens a menu that lets you change the track visualization properties
• Hovering over a track displays a pop-up window with information on the feature
IGV
2nd step: load a track
From the File drop-down menu choose load file; the pop-up window lets you browse your files and choose the one you want to load.
Tracks can be of different kinds and in many file types. Some of the most used:
– Alignments: SAM, BAM (must be indexed), BLAT PSL files
– Annotations: GTF, GFF, BED
– Variants: VCF
IGV
• Load the following annotations:
– RPTchr10.gff: repeats masked by RepeatMasker
– V0chr10.gff: annotation, V0 version
– V1repeatchr10.gff: annotation of ORFs in repeated sequences
IGV Example: RNA-Seq data
Load the BAM file chr10.bam, containing the alignments, and Cufflinks_chr10.gtf, containing the annotations generated by Cufflinks.
[Figure: IGV tracks showing the assembly 12x V1 annotation, the transcripts identified by Cufflinks, the read coverage (from the alignments) and the read alignments]
IGV
• Unannotated gene identified by Cufflinks
• Known genes reconstructed by Cufflinks
• Zooming in to the single-nucleotide level, we can notice the presence of SNPs