Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 ·...

Date post: 23-Feb-2020

Gene annotation


Genome annotation vs. genome sequencing

• Genome sequencing alone does not provide information about the functions encoded by the nucleotide sequences.

• Genome annotation: the sequence is “decorated” with evidence indicating regional characteristics of the genome (features), providing the basis for further analysis to understand the nature of the organism.

>chr1
ACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAAACCGTAAAGTCCAAAACGCTAACCCCTTAACCCTAAACCCTAAACCCTGAACCCTAAATCCCT…


A “standard” annotation pipeline

• Detection and masking of repeated sequences

• Ab initio gene prediction

• Evidence alignment:
  – ESTs
  – Full-length cDNAs
  – RNA-Seq sequences
  – Proteins

• Gene annotation by generating a consensus of ab initio gene predictions and evidence alignments


Types of repeats

1. Interspersed repeats

2. Processed pseudogenes

3. Simple sequence repeats

4. Segmental duplications

5. Blocks of tandem repeats


Interspersed repeats (transposon-derived repeats)

Constitute ~45% of the human genome. They transpose via RNA intermediates (retroelements) or DNA intermediates (DNA transposons).

• Long terminal repeat (LTR) transposons (RNA-mediated)
• Long interspersed elements (LINEs); these encode a reverse transcriptase
• Short interspersed elements (SINEs) (RNA-mediated); these include Alu repeats
• DNA transposons (3% of the human genome)


Interspersed repeats (transposon-derived repeats)

Figure source: Heredity (2010) 104, 520–533; doi:10.1038/hdy.2009.165

Processed pseudogenes

• These genes have a premature stop codon or a frameshift mutation and do not encode a functional protein. They commonly arise from retrotransposition, or following gene duplication and subsequent gene loss.


Simple sequence repeats

• Microsatellites: repeat units of one to a dozen base pairs (examples: (A)n, (CA)n, (CGG)n)

• Minisatellites: repeat units of a dozen to 500 base pairs
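As an illustration of the definitions above, the following minimal Python sketch scans a sequence for perfect microsatellites with a regular expression. The function name `find_microsatellites`, the unit-length range (1 to 6 bp) and the minimum of four copies are assumptions chosen for the example, not values taken from the slides.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats of short units.

    The thresholds are illustrative assumptions, not standard cut-offs.
    """
    # ([ACGT]{1,6}?) captures the shortest possible unit; \1{n,} requires
    # at least n further adjacent copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Toy sequence containing a (CA)6 and a (CGG)4 repeat
example = "GGT" + "CA" * 6 + "TT" + "CGG" * 4 + "TA"
print(find_microsatellites(example))
```

Imperfect repeats (with substitutions or indels) would not be found by this exact-match approach; dedicated tools handle those cases.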


Segmental duplications

• These are blocks of about 1 kilobase to 300 kb that are copied intra- or interchromosomally (about 5% of the human genome).


Blocks of Tandem Repeats

• These include telomeric repeats (e.g. TTAGGG in humans) and centromeric repeats (e.g. a 171 base pair repeat of α-satellite DNA in humans).


Biological meaning of repeats

• Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in:

– Genome rearrangements (Kazazian, 2004)

– Drift to new biological function (Kidwell and Lisch, 2001)

– Increased rate of evolution under stress (Capy et al, 2000)


Detection and masking of repeated sequences

• Repeats need to be masked prior to performing most single-species or multi-species analyses.

• Masking is necessary to avoid artifacts during gene annotation:
  – Repeated sequences may generate spurious alignments of ESTs, cDNAs, etc.
  – ORFs of transposon genes are detected by ab initio gene predictors and annotated as protein-coding genes

• Software:
  – RepeatMasker (requires a repeat library appropriate for each genome)

• Repeats are substituted by Ns


RepeatMasker

• Software

– Smit AFA, Hubley R, and Green P. “RepeatMasker-Open 3.0.” 1996-2004. (http://www.repeatmasker.org)

– CrossMatch / WU-BLAST (alignments)

• Repeats Database

– RepBase library (http://www.girinst.org)


Repeats library

• RepeatMasker uses a library of known eukaryotic repeat sequences
• The library is supplied by the RepBase project
• Repeats in RepBase are manually curated
• Requires registration (free for academics)


Type of repeats in RepBase library

• Interspersed (Alu, LINE, MIR, …)

• Simple (agagagag, atcatcatc, …)

• Micro- and mini-satellites

• Noncoding RNAs (tRNA, rRNA, snRNA, …)

• Common contaminants (E. coli, vectors)


RepBase reports

[Figure: an example RepBase report in EMBL format]

Consensus sequence

• A repeat family is represented by a consensus sequence. Example: there are more than 1 million Alu repeats in the human genome.

accgataggtatacgtatca-tttacgatac

atcgct-ggtttacgcgtcaattcaggatgc

accggt-tgtttacgtagcaatctaggatac

accgat-ggtttacgtatcaatttaggatac

Consensus sequences can be efficiently aligned to a reference genome

[Figure: copies of the Alu consensus aligned at several positions along the genome]
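The way a consensus summarizes a repeat family can be sketched with a column-wise majority vote over the aligned rows shown above. This is only an illustration: treating all four rows as family members is an assumption (the last row on the slide may itself be the consensus), and the function name `consensus` is hypothetical. Ties, which do not occur in this example, would be resolved arbitrarily.

```python
from collections import Counter

def consensus(aligned_rows):
    """Majority-vote consensus of equal-length aligned sequences ('-' is a gap)."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    # zip(*rows) iterates over alignment columns; keep the most frequent symbol
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_rows)
    )

rows = [
    "accgataggtatacgtatca-tttacgatac",
    "atcgct-ggtttacgcgtcaattcaggatgc",
    "accggt-tgtttacgtagcaatctaggatac",
    "accgat-ggtttacgtatcaatttaggatac",
]
print(consensus(rows))
```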


RepeatMasker

• Sequences in the FASTA input file are split into short overlapping fragments (overlap = 2000 bp)

• Repeat consensus sequences are aligned to the split input sequences
  – Different BLAST-like programs can be used to align repeats to the reference genome:
    • Cross-Match
    • WU-BLAST (faster)
  – Output is converted to a standard format (cross-match format)

• Duplicates are removed and fragmented hits are assembled
• Insignificant hits are removed (based on a Smith-Waterman score threshold)
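The first step, splitting the input into overlapping fragments, can be sketched as follows. The 2000 bp overlap comes from the slide; the fragment length of 40000 bp and the function name `split_overlapping` are assumptions made only for this example and are not claimed to be RepeatMasker's actual settings.

```python
def split_overlapping(seq, frag_len=40000, overlap=2000):
    """Split seq into fragments of frag_len that overlap by `overlap` bases.

    Returns a list of (start_offset, fragment) tuples. frag_len is a
    hypothetical value; only the overlap is given in the slides.
    """
    if not 0 <= overlap < frag_len:
        raise ValueError("overlap must be non-negative and smaller than frag_len")
    step = frag_len - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + frag_len, len(seq))
        fragments.append((start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return fragments

for offset, fragment in split_overlapping("A" * 100000):
    print(offset, len(fragment))
```

The overlap ensures that a repeat lying across a fragment boundary is still fully contained in at least one fragment, provided it is shorter than the overlap.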


De novo repeats annotation

• Repeat libraries are available only for a limited number of species

• Highly diverged repeats can be tough to find

• When a novel species is being annotated and no repeat library is available for it, repeats must be annotated de novo.


De novo identification of repeats

• An all vs. all pairwise comparison is performed

• Unknown repeats will yield alignments

• Overlapping aligned fragments from the same element are grouped

• Elements can thus be defined

• Defined elements are then clustered into one family because they are all similar to each other

Bao Z, Eddy SR. Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes. Genome Research. 2002; 12: 1269–1276. doi:10.1101/gr.88502.
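The grouping step (overlapping aligned fragments are merged to define an element) can be illustrated with a simple interval merge. This is a sketch of that single step under simplifying assumptions, not the algorithm of Bao and Eddy; the coordinates and the function name `merge_overlapping` are hypothetical.

```python
def merge_overlapping(intervals):
    """Merge overlapping or touching (start, end) intervals into elements.

    Intervals are half-open genomic coordinates of aligned fragments.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Fragment overlaps the current element: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Fragment starts a new element
            merged.append([start, end])
    return [tuple(element) for element in merged]

# Hypothetical aligned fragments from an all-vs-all comparison
fragments = [(100, 250), (200, 400), (900, 1000), (380, 420)]
print(merge_overlapping(fragments))
```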


De novo repeats annotation

• RepeatScout: http://repeatscout.bioprojects.org/ (Price et al., 2005)
• RAP: http://genomics.cribi.unipd.it/index.php/Rap_Repeat_Filter (Campagna et al., 2005)
• REPuter: http://www.genomes.de/ (Kurtz and Schleiermacher, 1999)
• Repeat-match: http://mummer.sourceforge.net/ (Delcher et al., 1999)
• RepSeek: http://wwwabi.snv.jussieu.fr/public/RepSeek/ (Achaz et al., 2007)
• Tallymer: http://www.zbh.uni-hamburg.de/Tallymer/ (Kurtz et al., 2008)
• Vmatch: http://www.vmatch.de/ (Kurtz, unpublished)
• mer-engine: http://roma.cshl.org/mer-home.php (Healy et al., 2003)
• FORRepeats: http://al.jalix.org/FORRepeats/ (Lefebvre et al., 2003)
• P-Clouds: http://www.evolutionarygenomics.com/PClouds.html (Gu et al., 2008)
• Spectral repeat finder: http://www.imtech.res.in/raghava/srf/ (Sharma et al., 2004)
• RepeatFinder: http://cbcb.umd.edu/software/RepeatFinder/ (Volfovsky et al., 2001)
• REPEATGLUER: http://nbcr.sdsc.edu/euler/intro_tmp.htm (Pevzner et al., 2004)
• DAWG-PAWS: http://dawgpaws.sourceforge.net/ (Estill and Bennetzen, 2009)
• RepeatModeler: http://www.repeatmasker.org/RepeatModeler.html (Smit, unpublished)
• RepeatRunner: http://www.yandell-lab.org/software/repeatrunner.html (Smith et al., 2007)
• REannotate: http://www.bioinformatics.org/reannotate/index.html (Pereira, 2008)


Example: mask repeats with RepeatMasker

• Go to the RepeatMasker web server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker/) and paste your sequence in the Sequence box.
• In DNA source, select Other from the dropdown menu and specify arabidopsis (for example).
• Then press the Submit sequence button.


RepeatMasker Output Summary

==================================================
file name: RM2_arabidopsis.txt_1433174330
sequences:            1
total length:     20251 bp  (20251 bp excl N/X-runs)
GC level:         42.31 %
bases masked:     17639 bp ( 87.10 %)
==================================================
                       number of      length   percentage
                       elements*    occupied  of sequence
--------------------------------------------------
Retroelements                  2    10010 bp    49.43 %
   SINEs:                      0        0 bp     0.00 %
   Penelope                    0        0 bp     0.00 %
   LINEs:                      0        0 bp     0.00 %
      CRE/SLACS                0        0 bp     0.00 %
      L2/CR1/Rex               0        0 bp     0.00 %
      R1/LOA/Jockey            0        0 bp     0.00 %
      R2/R4/NeSL               0        0 bp     0.00 %
      RTE/Bov-B                0        0 bp     0.00 %
      L1/CIN4                  0        0 bp     0.00 %
   LTR elements:               2    10010 bp    49.43 %
      BEL/Pao                  0        0 bp     0.00 %
      Ty1/Copia                0        0 bp     0.00 %
      Gypsy/DIRS1              2    10010 bp    49.43 %
      Retroviral               0        0 bp     0.00 %

DNA transposons                6     6788 bp    33.52 %
   hobo-Activator              0        0 bp     0.00 %
   Tc1-IS630-Pogo              0        0 bp     0.00 %
   En-Spm                      0        0 bp     0.00 %
   MuDR-IS905                  0        0 bp     0.00 %
   PiggyBac                    0        0 bp     0.00 %
   Tourist/Harbinger           0        0 bp     0.00 %
   Other (Mirage,              0        0 bp     0.00 %
    P-element, Transib)

Rolling-circles                0        0 bp     0.00 %

Unclassified:                  4      841 bp     4.15 %

Total interspersed repeats:         17639 bp    87.10 %

Small RNA:                     0        0 bp     0.00 %
Satellites:                    0        0 bp     0.00 %
Simple repeats:                0        0 bp     0.00 %
Low complexity:                0        0 bp     0.00 %
==================================================

* most repeats fragmented by insertions or deletions have been counted as one element

The output summary lists the repeats identified in the input sequence.
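The percentages in the summary are simply the occupied length divided by the total sequence length. A short check in Python, using only the numbers reported above (the variable and function names are illustrative):

```python
total_length = 20251  # bp, from the summary header

# Occupied length (bp) of the three non-empty top-level classes
occupied = {
    "Retroelements": 10010,
    "DNA transposons": 6788,
    "Unclassified": 841,
}

def percent(bp, total=total_length):
    """Percentage of the sequence occupied, rounded as in the report."""
    return round(100.0 * bp / total, 2)

masked = sum(occupied.values())
for name, bp in occupied.items():
    print(f"{name:16s} {bp:6d} bp {percent(bp):6.2f} %")
print(f"{'Total':16s} {masked:6d} bp {percent(masked):6.2f} %")
```

The three classes add up to the 17639 bp reported as total interspersed repeats, which is 87.10 % of the 20251 bp input.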

RepeatMasker output

[Figure: RepeatMasker results page. HSPs can be expanded to show the alignment between the genomic sequence and the repeat sequence.]

Repeats are “hard masked” by replacing their nucleotide sequence with Ns. The masked sequence is then ready for gene prediction and evidence alignment.
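Hard masking itself is a simple operation once the repeat coordinates are known. A minimal sketch, assuming 0-based, end-exclusive intervals (the function name `hard_mask` and the toy coordinates are hypothetical, and this is not RepeatMasker's own code):

```python
def hard_mask(seq, repeats):
    """Replace every base inside the given repeat intervals with 'N'.

    repeats: iterable of (start, end) pairs, 0-based and end-exclusive.
    """
    masked = list(seq)
    for start, end in repeats:
        # Clip to the sequence bounds so the length never changes
        start, end = max(start, 0), min(end, len(seq))
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)

print(hard_mask("ACGTACGTAC", [(2, 5), (8, 10)]))
```

Because masked bases become Ns, aligners and gene predictors downstream cannot place hits or gene models inside the repeats.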


Types of genes

• Protein-coding genes

• Pseudogenes

• Functional RNA genes:

– tRNA

– rRNA

– snoRNA

– snRNA

– miRNA

Software for the identification of non-coding RNAs: tRNAscan-SE, snoscan, and Infernal (INFERence of RNA ALignment), which is based on probabilistic models of the sequence and secondary structure of an RNA family.


Gene annotation

• Gene structure is annotated by defining exons, CDSs, UTRs and intron boundaries.

• Both intrinsic properties of the sequence and extrinsic data are used to define the gene structure.

[Figure: gene structure on the genome, showing enhancer, promoter, UTRs, CDS and introns]


Ab initio gene prediction

Uses intrinsic properties of the genome sequence.

Syntax rules used to predict genes:

• Generally the coding segment (CDS) of the gene begins with a start codon (ATG) and ends with a stop codon (TGA, TAA, TAG). The stop codon must be in frame with the start codon.

• Introns begin with a donor splice site (GT) and end with an acceptor splice site (AG).

• 5’- and 3’-UTRs (untranslated regions) are usually present but are generally not predicted.

Figure modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.
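The first syntax rule (ATG, then an in-frame stop codon) can be sketched as a simple open reading frame scan. This is a deliberately reduced illustration: it ignores introns and the reverse strand, and the function name `find_orfs` and the `min_codons` parameter are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) of ATG...stop stretches on the forward strand.

    The stop codon is in frame with the start codon; end is exclusive
    and includes the stop codon. Introns are not modelled.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATGGTAACC"))
```

In the toy sequence the ATG at position 2 reaches the in-frame TAA, while the ATG at position 7 (a different frame) never meets a stop codon and is therefore not reported.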


Statistical measures and training

• Many syntactically valid putative genes do not correspond to real genes.

• Statistical measures are used to compare the putative genes with the statistical profile of a real gene in the organism of interest.

• Statistical measures are learned from a training set of known genes, which differs from genome to genome

• Training can also be performed with EST, protein and RNA-Seq alignments


Signal sensors and content sensors

• Putative boundaries (start/stop codons, splice sites) are predicted by signal sensors

• CDSs are scored by a content sensor

• Together, the two sensors make it possible to define exon boundaries

[Figure: likelihood ratio scores along a sequence containing two exons. Modified from: Majoros, W. H. (2007). Methods for Computational Gene Prediction.]

[Figure: position weight matrix (PWM) of a donor site. Modified from: Majoros, W. H., Korf, I., & Ohler, U. (2009). Gene Prediction Methods. (D. Edwards, J. Stajich, & D. Hansen, Eds.). New York, NY: Springer New York.]
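A signal sensor based on a PWM can be sketched as a log-odds score of a candidate site against a background model. The matrix below is entirely hypothetical (four positions around an exon/intron boundary, with made-up probabilities), as are the uniform background and the names `DONOR_PWM`, `log_odds` and `scan_donors`; a real PWM would be estimated from a training set of known donor sites.

```python
import math

# Hypothetical donor-site PWM: one column of base probabilities per position
# (last exon base, then the first three intron bases). Values are invented.
DONOR_PWM = [
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
]
BACKGROUND = {base: 0.25 for base in "ACGT"}  # assumed uniform background

def log_odds(site, pwm=DONOR_PWM, background=BACKGROUND):
    """Sum of log2(P(base | site model) / P(base | background)) over positions."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the PWM width")
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, site))

def scan_donors(seq, threshold=0.0):
    """Return (position, score) for every window scoring above the threshold."""
    width = len(DONOR_PWM)
    hits = []
    for i in range(len(seq) - width + 1):
        score = log_odds(seq[i:i + width])
        if score > threshold:
            hits.append((i, score))
    return hits

print(scan_donors("ACGGTAAC"))
```

A positive score means the window looks more like a donor site than like background sequence; the threshold controls the trade-off between missed sites and false positives.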


Eukaryotic ab initio gene finders

• GlimmerM http://www.cbcb.umd.edu/software/GlimmerM/

• GeneID http://genome.crg.es/software/geneid/

• GeneZilla http://www.genezilla.org/

• GeneMark-ES http://exon.gatech.edu/

• Augustus http://bioinf.uni-greifswald.de/augustus/


Example: ab initio gene prediction

• Go to http://bioinf.uni-greifswald.de/augustus/submission

• Paste your MASKED sequence in the submission form and select an organism similar to the organism of interest (e.g. Arabidopsis thaliana)

• Run AUGUSTUS

• Click on graphical and text results

• Then on graphical browsable results

• Results are displayed in a GBrowse interface


Evidence alignment

• Alignment of extrinsic data provides experimental evidence for the gene predictions
  – ESTs, cDNAs
    • From the same or a related species
    • Mapped with Exonerate, GMAP, BLAT
  – Proteins
    • A highly curated protein dataset (SwissProt) doesn’t need to be from the same species
    • Mapped with TBLASTN, BLAT
  – RNA-Seq data
    • TopHat mappings processed with Cufflinks or Scripture
    • Mappings (GMAP or BLAT) of de novo assembled transcripts (Oases, Trinity, Trans-ABySS)


Evidence alignment

[Figure: genome browser tracks showing gene predictions, protein alignments, EST alignments and RNA-Seq read alignments]


Gene annotation using RNA-Seq data

[Figure: workflow from representative samples to mRNA isolation, fragmentation, library construction, sequencing, and reconstruction of transcript isoforms]


Generation of paired-end reads

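[Figure: a library fragment flanked by adapters A1 and A2 is sequenced from both ends using sequencing primers SP1 and SP2; the two reads of each pair are written to Sequence_1.fastq and Sequence_2.fastq]

Data are generated as paired-end reads: 100 bp are sequenced from each end of a fragment of about 250 bp.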
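The relationship between fragment, reads and the unsequenced middle portion can be sketched as follows. The 100 bp read length and 250 bp fragment size come from the slide; the function names and the toy fragment are illustrative, and the second read is taken as the reverse complement of the fragment's far end, as is usual for this kind of paired-end library.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A<->T, C<->G, N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def paired_end_reads(fragment, read_len=100):
    """Return (read1, read2, inner_distance) for one library fragment.

    read1 is the first read_len bases; read2 is sequenced from the
    opposite end, so it is the reverse complement of the last read_len bases.
    inner_distance is the unsequenced stretch between the two reads
    (negative when the reads overlap).
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    inner_distance = len(fragment) - 2 * read_len
    return read1, read2, inner_distance

fragment = "ACGTT" * 50  # toy 250 bp fragment
read1, read2, inner = paired_end_reads(fragment)
print(len(read1), len(read2), inner)
```

With 100 bp reads from a 250 bp fragment, about 50 bp in the middle remain unsequenced.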

Alignment of RNA-Seq reads to a reference genome

[Figure: a genome with exons and introns, the spliced mRNA, and read pairs aligned back to the genome]

In an RNA-Seq experiment, reads are generated by sequencing the ends of 200-300 bp fragments of messenger RNA, from which intronic sequences have been removed by the splicing machinery during mRNA maturation. Some fragments will span exon-exon junctions.

• Reads from fragments lying entirely within a single exon map correctly, with a distance between the two reads compatible with the library size.

• Read pairs mapping to two different exons have an apparent insert size that is not compatible with the library size.

• Reads spanning an exon-exon junction cannot be mapped correctly by standard algorithms.

• Ideally, such a read should be split into a spliced alignment that accounts for the intron.

Using a database of splice junctions

[Figure: custom database of known junctions]

A custom junction database is built by joining the ends of exons. Spliced reads are detected by aligning the unmapped reads against the junction database.

A limitation of this approach is that it can only detect known junctions.

Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., … Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221), 470–6. doi:10.1038/nature07509
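Building such a junction database can be sketched by concatenating the last k bases of one annotated exon with the first k bases of a downstream exon. Pairing every exon with every downstream exon, the value of k, the forward-strand-only coordinates and the function name `build_junction_db` are assumptions made for this example, not details taken from Wang et al.

```python
def build_junction_db(genome, exons, k=20):
    """Map (donor_end, acceptor_start) to the junction sequence.

    exons: list of (start, end) pairs, 0-based, end-exclusive, sorted by
    position and belonging to the same forward-strand gene.
    Each junction is the last k bases of the upstream exon joined to the
    first k bases of the downstream exon.
    """
    db = {}
    for i, (start1, end1) in enumerate(exons):
        for start2, end2 in exons[i + 1:]:
            left = genome[max(start1, end1 - k):end1]
            right = genome[start2:min(end2, start2 + k)]
            db[(end1, start2)] = left + right
    return db

# Toy genome: exon (0-8), intron GT...AG (8-16), exon (16-24)
genome = "AAAACCCC" + "GTTTTTAG" + "GGGGTTTT"
junctions = build_junction_db(genome, [(0, 8), (16, 24)], k=4)
print(junctions)

read = "CCGG"  # spans the exon-exon junction
print(read in genome, any(read in js for js in junctions.values()))
```

The junction-spanning read is absent from the genome but present in the junction database, which is what allows it to be recovered.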

TopHat 1.0

• From version 1.0, TopHat exploits the greater length of the reads, which gives higher sensitivity.

• Unmapped reads of 75 bases (or longer) are split into 3 or more 25-base segments, which are mapped independently.

• Reads whose segments can only be mapped non-contiguously are marked as possible intron-spanning reads.

• The set of all possible donor-acceptor combinations is described by:

L1 + L2 = k; 1 < L1 < k-1; L2 = k - L1

• The k bases upstream of the donor site are concatenated with the k bases downstream of the acceptor site.

• The reads that could not be aligned are then aligned against this junction database.

[Figure: an unmappable read split into 25-nt segments and aligned to the reference genome; L1 and L2 are the parts of a segment on either side of the donor and acceptor sites]

Trapnell, C., Pachter, L., & Salzberg, S. L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9), 1105-11. doi:10.1093/bioinformatics/btp120
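The segment-splitting idea can be illustrated with a toy example in Python. This is a sketch of the principle only: exact string matching stands in for the real aligner, the sequences are randomly generated, and all function names are hypothetical rather than part of TopHat.

```python
import random

def split_segments(read, seg_len=25):
    """Split a read into consecutive, non-overlapping segments of seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read) - seg_len + 1, seg_len)]

def map_segments(genome, read, seg_len=25):
    """Exact-match position of each segment in the genome, or -1 if not found."""
    return [genome.find(seg) for seg in split_segments(read, seg_len)]

def is_intron_spanning_candidate(positions, seg_len=25):
    """True if mapped segments are not contiguous on the genome."""
    mapped = [(i, p) for i, p in enumerate(positions) if p >= 0]
    return any(
        p2 - p1 != (i2 - i1) * seg_len
        for (i1, p1), (i2, p2) in zip(mapped, mapped[1:])
    )

# Toy genome: exon1 (50 bp) + intron (40 bp, GT...AG) + exon2 (50 bp)
rng = random.Random(0)
def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

exon1, exon2 = random_dna(50), random_dna(50)
intron = "GT" + random_dna(36) + "AG"
genome = exon1 + intron + exon2

# A 75 bp read from the spliced mRNA, spanning the exon-exon junction
read = exon1[-37:] + exon2[:38]
positions = map_segments(genome, read)
print(positions, is_intron_spanning_candidate(positions))
```

The first and last segments map 90 bp apart instead of 50 bp, and the middle segment (which contains the junction) does not map at all, so the read is flagged as a possible intron-spanning read.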

Page 58: Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 · R1/LOA/Jockey 0 0 bp 0.00 % R2/R4/NeSL 0 0 bp 0.00 % RTE/Bov-B 0 0 bp 0 ... •Gene annotation

Il workflow Cufflinks - 1

58

Read paired-end vengono mappate sul genoma con un allineatore in grado di eseguire allineamenti di tipo spliced (es.: TopHat).

Page 59: Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 · R1/LOA/Jockey 0 0 bp 0.00 % R2/R4/NeSL 0 0 bp 0.00 % RTE/Bov-B 0 0 bp 0 ... •Gene annotation

Il workflow Cufflinks - 2

L’output SAM di TopHat viene utilizzato come input di Cufflinks e le read vanno incontro ad un processo di assemblaggio e viene costruito un grafo di overlap.

59

Page 60: Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 · R1/LOA/Jockey 0 0 bp 0.00 % R2/R4/NeSL 0 0 bp 0.00 % RTE/Bov-B 0 0 bp 0 ... •Gene annotation

Il workflow Cufflinks- 2

I trascritti vengono dedotti dal grafo cercando il percorso minimo (massima parsimonia) che spieghi gli overlap osservati.

60

Page 61: Gene annotationddlab.sci.univr.it/alberto/bioinformatica/Teoria_L11... · 2015-06-01 · R1/LOA/Jockey 0 0 bp 0.00 % R2/R4/NeSL 0 0 bp 0.00 % RTE/Bov-B 0 0 bp 0 ... •Gene annotation

The Cufflinks workflow - 3

Genome

● Reads are assigned to transcripts based on the compatibility of their alignment with the transcript model.

● Because genes have multiple isoforms, some of which share exons, reads cannot always be assigned unambiguously to a single isoform.

● Cufflinks handles this uncertainty by building a likelihood function that models the sequencing process and identifies the isoform abundance estimates that best explain the observed reads.

● The estimate is defined as the set of isoform abundances that maximizes the likelihood function (maximum likelihood estimate; MLE).
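The idea of a maximum likelihood estimate for ambiguous reads can be shown with a small expectation-maximization loop. This is a deliberately simplified sketch, not the Cufflinks model: it ignores transcript length and the fragment-size distribution, and the function name `em_abundances` and the read counts are invented for the example.

```python
def em_abundances(compat, n_isoforms, n_iter=200):
    """Estimate isoform proportions by expectation-maximization.

    compat[i] is the set of isoform indices that read i is compatible with.
    Toy model (assumption): a read is produced by an isoform with probability
    equal to that isoform's proportion, regardless of length.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        for isoforms in compat:
            total = sum(theta[j] for j in isoforms)
            for j in isoforms:
                # E-step: fractional assignment of the read to isoform j.
                counts[j] += theta[j] / total
        # M-step: new proportions are the normalized expected counts.
        theta = [c / len(compat) for c in counts]
    return theta


# Hypothetical data: 6 reads fall in a shared exon, 3 are unique to
# isoform 0 and 1 is unique to isoform 1.
reads = [{0, 1}] * 6 + [{0}] * 3 + [{1}] * 1
theta = em_abundances(reads, n_isoforms=2)
print([round(t, 3) for t in theta])
```

The shared reads carry no information about the ratio, so the estimate converges to the 3:1 ratio of the uniquely assignable reads.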

Fragment distribution

Annotations generated by Cufflinks

[Figure: annotated tracks labelled Chromosome, Region, Gene, Transcripts annotated by Cufflinks, Coverage, Read alignments]

Comparative gene finders

• Use additional information in the form of cross-species conservation at the DNA or amino acid level.

Annotation consensus

• All sources of structural evidence contain errors

• Manual curation of all the evidence is not feasible

• Integrating the evidence into a consensus gives higher accuracy

Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.

Annotation Edit Distance (AED)

SN = TP/(TP+FN)   sensitivity
SP = TP/(TP+FP)   specificity
AC = (SN+SP)/2    accuracy
AED = 1-AC

AED = 0 indicates that the annotation is in perfect agreement with its evidence
AED = 1 indicates a complete lack of evidence support for the annotation
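The formulas above can be applied at nucleotide level with a few lines of Python. The sketch below is illustrative only: the helper names and the exon coordinates are hypothetical and are not taken from the figure, and it assumes both the annotation and the evidence cover at least one base.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation, evidence):
    """Nucleotide-level AED between an annotation and its evidence."""
    ann, evi = covered_bases(annotation), covered_bases(evidence)
    tp = len(ann & evi)   # bases present in both
    fn = len(evi - ann)   # evidence bases missed by the annotation
    fp = len(ann - evi)   # annotated bases without evidence
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return 1 - (sn + sp) / 2


annotation = [(1, 100), (201, 250)]   # hypothetical exons, 150 bp in total
evidence = [(26, 100), (201, 250)]    # hypothetical aligned evidence, 125 bp
score = aed(annotation, evidence)
print(round(score, 3))
```

Here every evidence base is annotated (SN = 1) but 25 annotated bases lack support (SP = 125/150), so the AED is small but not zero.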

[Figure: worked AED example comparing two gene models (exon labels: 100bp, 50bp, 50bp and 75bp, 50bp), with numbers at nucleotide level and numbers at exon level]

Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.

Annotation Edit Distance (AED)

Figure modified from: Yandell, M., & Ence, D. (2012). A beginner’s guide to eukaryotic genome annotation. Nature Reviews Genetics, 13(5), 329-342. Nature Publishing Group.

Annotation Edit Distance (AED)

File formats

• Annotations can be downloaded and redistributed in a variety of formats. Among the most common:

– GFF (General Feature Format)

– GTF (Gene Transfer Format)

– BED

GFF, GTF and BED describe the coordinates of genes, transcripts, alignments, repeats, etc.

– Wiggle

Describes quantitative data.

– VCF (Variant Call Format)

Specific for describing variants (developed by the 1000 Genomes Project).

GFF(3) file format

A widely used file format for describing the position of features on the genome. It is also used by the GMOD project, in particular by GBrowse.

chrom source feature start end score strand frame attributes
1 Ensembl gene 3028681 3030154 . - . ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding
1 Ensembl transcript 3028681 3030154 . - . ID=Vv01s0011g03340.t01;Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340
1 Ensembl exon 3030130 3030154 . - 1 Name=Vv01s0011g03340.t01.exon1;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3029539 3029592 . - 1 Name=Vv01s0011g03340.t01.exon2;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3029303 3029419 . - 1 Name=Vv01s0011g03340.t01.exon3;Parent=Vv01s0011g03340.t01
1 Ensembl exon 3028681 3028748 . - 0 Name=Vv01s0011g03340.t01.exon4;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3030130 3030154 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3029539 3029592 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3029303 3029419 . - 1 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
1 Ensembl CDS 3028681 3028748 . - 0 Name=Vv01s0011g03340.t01;Parent=Vv01s0011g03340.t01
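A record like the ones above can be read with plain string handling. The following is a minimal sketch, assuming tab-separated columns as in real GFF3 files; the function name `parse_gff3_line` is hypothetical, and the sketch does not handle percent-encoded characters or comma-separated multi-value attributes.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Attributes are tag=value pairs separated by semicolons.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if pair
    )
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


line = "\t".join([
    "1", "Ensembl", "gene", "3028681", "3030154", ".", "-", ".",
    "ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding",
])
gene = parse_gff3_line(line)
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["ID"])
```

Missing values written as "." are mapped to `None` so that downstream code does not confuse them with a real score or phase.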

GFF format explained

Field       Description
chrom       The name of the sequence. Must be a chromosome or scaffold.
source      The program that generated this feature.
feature     The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
start       The starting position of the feature in the sequence. The first base is numbered 1.
end         The ending position of the feature (inclusive).
score       A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
strand      Valid entries include '+', '-', or '.'.
phase       For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
attributes  A list of feature attributes in the format tag=value.

chrom source feature start end score strand frame attributes
1 Ensembl gene 3028681 3030154 . - . ID=Vv01s0011g03340;Name=Vv01s0011g03340;biotype=protein_coding
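Because GFF positions are 1-based and the end is inclusive, converting to BED (which is 0-based with an exclusive end) only shifts the start coordinate. A short sketch of that arithmetic follows; the function names are hypothetical and the BED score column is filled with a placeholder 0.

```python
def gff_to_bed(chrom, start, end, name=".", strand="."):
    """Convert 1-based inclusive GFF coordinates to a 0-based half-open BED line."""
    if start < 1 or end < start:
        raise ValueError("invalid GFF interval")
    bed_start = start - 1   # shift the start; the end value stays the same
    # BED6 columns: chrom, chromStart, chromEnd, name, score, strand
    return "\t".join([chrom, str(bed_start), str(end), name, "0", strand])


def feature_length(start, end):
    """Length in bases of a 1-based inclusive interval."""
    return end - start + 1


bed_line = gff_to_bed("1", 3028681, 3030154, "Vv01s0011g03340", "-")
print(bed_line)
print(feature_length(3028681, 3030154))
```

In BED the length is simply chromEnd minus chromStart, which gives the same number of bases as the inclusive GFF calculation.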

GFF attributes

ID             Unique ID
Name           Name displayed to the user
Alias          Alternative name
Parent         Indicates the parent of the feature
Target         Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment
Gap            The alignment of the feature to the target if the two are not collinear
Derives_from   Used to disambiguate the relationship between one feature and another
Dbxref         A database cross-reference
Ontology_term  A cross-reference to an ontology term
Note           Free-form text

A gene is parent to its mRNAs, which are parents to their exons, etc.
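The Parent attribute is enough to rebuild the gene, transcript, exon hierarchy. The sketch below shows one way to do it; the feature list is modelled on the GFF3 example, but the exon identifiers are used here as hypothetical IDs, and the helper `tree_lines` is invented for illustration.

```python
from collections import defaultdict

# (ID, type, Parent) triples; exon IDs are hypothetical.
features = [
    ("Vv01s0011g03340", "gene", None),
    ("Vv01s0011g03340.t01", "transcript", "Vv01s0011g03340"),
    ("Vv01s0011g03340.t01.exon1", "exon", "Vv01s0011g03340.t01"),
    ("Vv01s0011g03340.t01.exon2", "exon", "Vv01s0011g03340.t01"),
]

children = defaultdict(list)   # parent ID -> list of child IDs
types = {}                     # feature ID -> feature type
for feature_id, ftype, parent in features:
    types[feature_id] = ftype
    if parent is not None:
        children[parent].append(feature_id)


def tree_lines(feature_id, depth=0):
    """Return the feature and its descendants as indented text lines."""
    lines = ["  " * depth + f"{types[feature_id]} {feature_id}"]
    for child in children[feature_id]:
        lines.extend(tree_lines(child, depth + 1))
    return lines


roots = [fid for fid, _, parent in features if parent is None]
tree = tree_lines(roots[0])
print("\n".join(tree))
```

Features without a Parent are the roots of the hierarchy, which in this example is the gene.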

Integrative Genomics Viewer (IGV) stand-alone genome browser

http://www.broadinstitute.org/igv/home

Click here to register and download the application

4 AUGUSTUS gene 2916 6002 0.09 + . g1
4 AUGUSTUS transcript 2916 6002 0.09 + . g1.t1
4 AUGUSTUS tss 2916 2916 . + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS exon 2916 3164 . + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS start_codon 3073 3075 . + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS initial 3073 3164 0.99 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 3243 3308 0.97 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 3780 3921 0.97 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4043 4095 0.99 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4332 4466 1 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4563 4626 1 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4725 4810 0.93 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 4913 4995 0.94 + 1 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 5076 5152 0.99 + 2 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS internal 5236 5308 0.5 + 0 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS terminal 5616 5914 0.42 + 2 transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3165 3242 0.97 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3309 3779 0.88 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 3922 4042 0.99 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 4096 4331 0.99 + . transcript_id "g1.t1"; gene_id "g1";
4 AUGUSTUS intron 4467 4562 1 + . transcript_id
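In this GTF-style listing the attributes use the `key "value";` syntax, and each intron occupies exactly the gap between two consecutive coding exons. The sketch below checks both points on a few records from the listing; the function names are hypothetical, and the gap calculation assumes the exons do not overlap.

```python
def parse_gtf_attributes(text):
    """Parse GTF attributes such as: transcript_id "g1.t1"; gene_id "g1";"""
    attrs = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attrs[key] = value.strip().strip('"')
    return attrs


def introns_between(exons):
    """Given 1-based inclusive exon intervals, return the gaps between them."""
    exons = sorted(exons)
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


attrs = parse_gtf_attributes('transcript_id "g1.t1"; gene_id "g1";')
# First five coding exons from the listing (one initial, four internal).
coding_exons = [(3073, 3164), (3243, 3308), (3780, 3921),
                (4043, 4095), (4332, 4466)]
introns = introns_between(coding_exons)
print(attrs)
print(introns)
```

The computed gaps can be compared directly with the first four intron records reported by AUGUSTUS.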

IGV

[Figure: IGV main window; callouts: Navigation Bar, Names column, Memory usage, Gene Navigation (draw a box to enlarge the region), Track visualization panels]

IGV

The File drop-down menu lets you load genome and track data or save the results.

1st step: The Genome

There are three kinds of genome you can use:

1. Preloaded genomes.

2. A pre-built IGV genome that you load.

3. A genome you create yourself. To do this, click on Import Genome and enter the files for your genome in the pop-up window:

– the sequence in FASTA format

– if available, the genome annotation (gene locations) in GTF or GFF format

IGV

The Genome sequence file has been loaded

A new track with genome annotation has been added

Right-clicking on the track opens a menu that lets you change the track's visualization properties.

Hovering over the track displays a pop-up window with information about the feature.

IGV

2nd step: Load a track

From the File drop-down menu choose load file; the pop-up window lets you browse your files and choose the one you want to load.

Tracks can be of different kinds and come in many file types. Some of the most commonly used:

– Alignments: SAM, BAM (must be indexed), BLAT PSL files

– Annotations: GTF, GFF, BED

– Variants: VCF

IGV

• Load the following annotations:

– RPTchr10.gff: repeats masked by RepeatMasker

– V0chr10.gff: annotation, version V0

– V1repeatchr10.gff: annotation of ORFs in repeated sequences

IGV Example: RNA-Seq data

Load the BAM file chr10.bam, which contains the alignments, and Cufflinks_chr10.gtf, which contains the annotations generated by Cufflinks.

[Figure: IGV screenshot; track labels: Assembly 12x V1 annotation, Transcripts identified by Cufflinks, Reads Coverage (From Alignment), Reads Alignment]

IGV

Unannotated gene identified by Cufflinks

Known genes reconstructed by Cufflinks

IGV

Right-clicking on the track opens a menu that lets you change the track's visualization properties.

Hovering over the track displays a pop-up window with information about the feature.

Zooming in to the single-nucleotide level, we can see the presence of SNPs.
