
From Genomes to Genes Rui Alves. How to make sense of genome sequences? …atgattattggcgga...

Date post: 14-Dec-2015
Upload: roger-daniel

From Genomes to Genes

Rui Alves


How to make sense of genome sequences?

…atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc…

How do I know where genes are?


Predicting ORFs is easy, predicting genes is hard

• An ORF is a sequence of nucleotides that goes from a start codon (ATG, GTG,…) to an in-frame stop codon (TAA, TAG or TGA)

• Finding them is as easy as reading the DNA sequence

• How do we know if an ORF is a gene?
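A minimal sketch of what "reading the DNA sequence" for ORFs can look like. It scans the three forward reading frames only, uses the two start codons named above plus the standard stop codons, and reports the first start before each stop. The function name and the returned tuple layout are illustrative choices, not something the slides prescribe.

```python
START_CODONS = {"ATG", "GTG"}        # the start codons named on the slide
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons

def find_orfs(dna, starts=START_CODONS, stops=STOP_CODONS):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i
            elif orf_start is not None and codon in stops:
                orfs.append((orf_start, i + 3, frame))
                orf_start = None
    return orfs

print(find_orfs("ccatgaaatagg"))  # [(2, 11, 2)]
```

The scan finds every ORF, which is exactly why the question that follows matters: most of what it returns will not be genes.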


There are several ways to predict genes

• By homology


Homology predictions

Figure: the sequence of a known gene is matched against the sequenced genome to locate the homologue gene.


How are sequences aligned?

Substitution probability table

      A      C      -      …
A     1      0.001  …      …
C     0.001  1      …      …
-     …      …      1      …

Alignment 1 (score S1):

…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…

Alignment 2 (score S2 > S1):

…UUACAUUUCCCGUCCGCUCU…
…GGGGUUAAUUUGCCCGUCCA…
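As a rough illustration of how two alignments of the same pair of sequences can be given scores and compared, the sketch below slides one sequence along the other without internal gaps and sums log-probabilities. The match (1) and mismatch (0.001) values are the ones visible in the table; the value used for unaligned overhangs (GAP_P) is an assumption, since the slide elides it, and all function names are hypothetical.

```python
import math

MATCH_P = 1.0       # diagonal entries of the substitution table
MISMATCH_P = 0.001  # off-diagonal value shown on the slide, assumed for every mismatch
GAP_P = 0.001       # assumed value for a base left unaligned (elided on the slide)

def alignment_score(s1, s2, offset):
    """Log-probability score with s2 shifted `offset` positions to the right of s1."""
    score = 0.0
    for j in range(min(0, offset), max(len(s1), len(s2) + offset)):
        a = s1[j] if 0 <= j < len(s1) else "-"
        i = j - offset
        b = s2[i] if 0 <= i < len(s2) else "-"
        if a == "-" or b == "-":
            p = GAP_P
        elif a == b:
            p = MATCH_P
        else:
            p = MISMATCH_P
        score += math.log(p)
    return score

def best_offset(s1, s2, max_shift):
    """Offset in [-max_shift, max_shift] giving the highest score."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda k: alignment_score(s1, s2, k))

print(best_offset("GGACGT", "ACGTCC", 3))  # 2
```

Real alignment programs also allow gaps inside the sequences, typically through dynamic programming; this sketch only compares ungapped shifts.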


Problems of homology predictions: The genetic code

…UUAAUUUCCCGUCCG…

…CUUAUAAGUAGACCA…

…LISRP…

NO HOMOLOGY!!

Yet, the code is for the same peptide
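The point can be checked directly: the short sketch below translates both RNA fragments from this slide with the standard genetic code and also measures their nucleotide identity. The helper names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna):
    """Translate an RNA string in frame 1, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def identity(a, b):
    """Fraction of positions at which two sequences carry the same base."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

seq1 = "UUAAUUUCCCGUCCG"
seq2 = "CUUAUAAGUAGACCA"
print(translate(seq1), translate(seq2))  # LISRP LISRP
print(identity(seq1, seq2))              # 0.4
```

Only 6 of the 15 nucleotides agree, yet the peptides are identical, which is why a nucleotide-level comparison can miss the relationship.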


Solution for redundancy of genetic code:

Use synonymous substitution when doing the DNA alignment

The problem of doing this:

…UUAAUUUCCCGUCCG…

…UUAAUUUCCCGUCCA…

…UUAAUUUCCAGACCG…

…CUUAUAAGUAGACCA…

Combinatorial Explosion!!!

Solutions?

Not many: efficient algorithms, more computer power, patience
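To give a feel for the size of the explosion, the sketch below counts how many distinct nucleotide sequences encode a given peptide under the standard genetic code, by multiplying the number of synonymous codons of each residue. The function name is illustrative.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
# Number of codons that encode each amino acid.
DEGENERACY = Counter(CODON_TABLE.values())

def count_encodings(peptide):
    """Number of distinct RNA sequences that translate to `peptide`."""
    return prod(DEGENERACY[aa] for aa in peptide)

print(count_encodings("LISRP"))  # 2592
```

Even the five-residue peptide from the previous slide has 2592 possible encodings, and the count multiplies with every additional residue.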


Homology predictions most effective for closely related organisms

Thus, homology-based gene prediction works best when the genome of a closely related organism has been fully sequenced and annotated!!!


There are other ways to predict if ORFs are genes

• By homology

• Ab initio methods
  – Signal Sensors
    • ATG sites
    • Promoter elements id
    • Regulatory elements id
    • Shine-Dalgarno sequences id (i.e. ribosome binding sites)
    • …


Using initiation and termination codons to identify ORFs

• ATG is the start codon
  – GTG, CTG, TTG are minor start codons

• If a termination codon is too close to the ATG, the ORF is unlikely to be a gene

atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag…

atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag…
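One way to see why ORF length is informative: under the simplifying assumption (not made in the slides) that all 64 codons are equally likely and independent, the chance that a stretch of random sequence stays free of stop codons falls quickly with its length. The sketch below computes that chance; the function name is hypothetical.

```python
def chance_of_open_frame(n_codons, stop_codons=3, all_codons=64):
    """Probability that n_codons random codons contain no stop codon,
    assuming uniform and independent codons (a simplification)."""
    return (1 - stop_codons / all_codons) ** n_codons

for n in (10, 50, 100, 300):
    print(n, round(chance_of_open_frame(n), 4))
```

Short open frames are therefore expected by chance alone, while long ones are rare unless something, such as selection on a real gene, keeps them open.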


Using Promoter sequences to identify ORFs

• Many promoters have a known structure

• Identifying Promoters close to initiation codons increases likelihood of ORF being gene

Lac promoter


Using response elements to identify ORFs

• Regulatory binding sites (RBS) have a known structure

• Identifying RBS close to initiation codons increases likelihood of ORF being gene


Using Ribosomal binding sequences to identify ORFs

• Ribosomal binding sites (SDS) have a known structure

• Identifying SDS close to initiation codons increases likelihood of ORF being gene

AGGAGG: Consensus Shine-Dalgarno sequence
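A small sketch of such a signal sensor: it looks for the best match to the AGGAGG consensus in a window upstream of a candidate start codon and reports how many mismatches it has. The window size of 20 nucleotides and the function name are assumptions made for illustration, not values taken from the slides.

```python
SD_CONSENSUS = "AGGAGG"  # consensus Shine-Dalgarno sequence from the slide

def best_sd_match(dna, start_pos, window=20, consensus=SD_CONSENSUS):
    """Return (position, mismatches) of the best consensus match that lies
    entirely within `window` nucleotides upstream of start_pos, or None."""
    dna = dna.upper()
    lo = max(0, start_pos - window)
    best = None
    for i in range(lo, start_pos - len(consensus) + 1):
        site = dna[i:i + len(consensus)]
        mismatches = sum(a != b for a, b in zip(site, consensus))
        # Keep the earliest site among equally good ones.
        if best is None or mismatches < best[1]:
            best = (i, mismatches)
    return best

print(best_sd_match("TTTAGGAGGTTTTTTATGAAA", 15))  # (3, 0)
```

A low mismatch count upstream of a start codon would then count as one piece of evidence that the ORF is a gene.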


There are several ways to predict genes

• By homology

• Ab initio methods
  – Signal Sensors
    • Promoter elements id
    • Regulatory elements id
    • Shine-Dalgarno sequences id (i.e. ribosome binding sites)
    • ATG sites
    • …
  – Content Sensors
    • Codon usage
    • GC content
    • Position asymmetry
    • CpG islands
    • …


Using codon bias to predict expressed ORFs

• The frequencies of synonymous codons in an organism are not uniform

• The frequencies of synonymous codons in coding sequences differ from those in non-coding sequences

• This can be used to predict coding open reading frames

atgaatgcatgctgccgaagatctctggcaccaaattttggagcggttgcag…

Average codon usage for Ile

            ATT   ATC   ATA
Average     0.34  0.46  0.20
RF1         0.34  0.26  0.40
RF2         0.40  0.20  0.40
RF3         0.32  0.42  0.25

The third reading frame is the most likely to be a gene
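The comparison behind this conclusion can be sketched as a distance between codon-usage vectors. The numbers below are the Ile values from the slide; using Euclidean distance is an assumption made here for illustration, and the function names are hypothetical.

```python
from math import dist

ILE_CODONS = ("ATT", "ATC", "ATA")
AVERAGE_ILE = (0.34, 0.46, 0.20)  # average usage in known genes (from the slide)
FRAMES = {
    "RF1": (0.34, 0.26, 0.40),
    "RF2": (0.40, 0.20, 0.40),
    "RF3": (0.32, 0.42, 0.25),
}

def ile_usage(dna, frame):
    """Relative usage of the three Ile codons in one reading frame (0, 1 or 2)."""
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    counts = [codons.count(c) for c in ILE_CODONS]
    total = sum(counts)
    return tuple(n / total for n in counts) if total else None

def closest_frame(reference, frames):
    """Name of the frame whose usage vector is nearest to the reference."""
    return min(frames, key=lambda name: dist(reference, frames[name]))

print(closest_frame(AVERAGE_ILE, FRAMES))  # RF3
```

A fuller content sensor would compare usage over all amino acids rather than Ile alone.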


Using GC content to predict expressed ORFs

G+C at the third codon position

Frame 1   Frame 2   Frame 3
11        9         5

gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag…

Genes have a very high (or very low) G+C content at the third position of the codons in the reading frame. Frame 1 (or frame 3) is therefore the more likely to be expressed.

Not very useful for eukaryotes

The G+C content of the third position of codons in coding sequences is biased
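The per-frame counts on this slide can be reproduced with a few lines. The sketch below counts G or C at the third codon position of each of the three forward reading frames of the example sequence; the function name is illustrative.

```python
def gc3_counts(dna):
    """Count G or C at the third codon position for reading frames 1, 2 and 3."""
    dna = dna.upper()
    # In frame f (0-based), third codon positions are f+2, f+5, f+8, ...
    return [sum(base in "GC" for base in dna[frame + 2::3]) for frame in range(3)]

seq = "gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag"
print(gc3_counts(seq))  # [11, 9, 5]
```

Dividing each count by the number of codons in the frame gives the GC3 fraction, which is easier to compare between sequences of different lengths.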


Using position asymmetry to predict expressed ORFs

gtgaatgtatgctctgccgaagatctctggcaccaaattttggagcggttgcag…

Av Gene      A     T     C     G
Position 1   0.20  0.20  0.22  0.40
Position 2   0.38  0.22  0.20  0.20
Position 3   0.30  0.22  0.24  0.24

RF1          A     T     C     G
Position 1   0.19  0.19  0.24  0.38
Position 2   0.38  0.24  0.19  0.19
Position 3   0.29  0.24  0.24  0.24

RF2          A     T     C     G
Position 1   0.38  0.24  0.19  0.19
Position 2   0.19  0.38  0.24  0.19
Position 3   0.25  0.25  0.25  0.25

RF3          A     T     C     G
Position 1   0.45  0.15  0.15  0.25
Position 2   0.20  0.18  0.30  0.32
Position 3   0.11  0.36  0.25  0.25

• Coding sequences have a characteristic distribution of nucleotides in each of the three positions of codons


Using position asymmetry to predict expressed ORFs

Figures: position asymmetry for A, T, C and G. Each panel plots frequency against codon position (1, 2, 3) for the average gene and for reading frames R1, R2 and R3.

Reading frame 1 is the most likely because its position asymmetry is the most similar to that of known genes.
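A sketch of the comparison that leads to this conclusion, using the per-position nucleotide frequencies given in the tables for the average gene and the three reading frames. The distance measure (sum of absolute differences) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
# Rows: codon positions 1-3; columns: A, T, C, G (values from the slide).
AVERAGE_GENE = [(0.20, 0.20, 0.22, 0.40),
                (0.38, 0.22, 0.20, 0.20),
                (0.30, 0.22, 0.24, 0.24)]
FRAMES = {
    "RF1": [(0.19, 0.19, 0.24, 0.38), (0.38, 0.24, 0.19, 0.19), (0.29, 0.24, 0.24, 0.24)],
    "RF2": [(0.38, 0.24, 0.19, 0.19), (0.19, 0.38, 0.24, 0.19), (0.25, 0.25, 0.25, 0.25)],
    "RF3": [(0.45, 0.15, 0.15, 0.25), (0.20, 0.18, 0.30, 0.32), (0.11, 0.36, 0.25, 0.25)],
}

def position_profile(dna, frame=0):
    """Frequencies of A, T, C, G at codon positions 1-3 of one reading frame."""
    dna = dna.upper()
    usable = (len(dna) - frame) // 3 * 3  # drop a trailing partial codon
    if usable < 3:
        raise ValueError("need at least one complete codon")
    profile = []
    for pos in range(3):
        column = dna[frame + pos:frame + usable:3]
        profile.append(tuple(column.count(b) / len(column) for b in "ATCG"))
    return profile

def profile_distance(p, q):
    """Sum of absolute frequency differences over all positions and bases."""
    return sum(abs(a - b) for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q))

def most_gene_like(frames, reference=AVERAGE_GENE):
    return min(frames, key=lambda name: profile_distance(frames[name], reference))

print(most_gene_like(FRAMES))  # RF1
```

With these numbers RF1 differs from the average gene by about 0.13 in total, far less than RF2 or RF3.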


CpG Islands are signals for transcription initiation

• Near the promoter of known genes, the content of CG dinucleotides is higher than it is away from transcription initiation sites

• Thus, ORFs whose ATG is preceded by a CpG island are more likely to be genes
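One commonly used way to quantify CG enrichment is the observed/expected CpG ratio: the number of CG dinucleotides, times the window length, divided by the product of the C and G counts. The sketch below applies it to a window upstream of a candidate ATG; the 200-nucleotide window and the function names are assumptions for illustration.

```python
def cpg_obs_exp(dna):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0
    return dna.count("CG") * len(dna) / (c * g)

def upstream_cpg_score(dna, atg_pos, window=200):
    """CpG ratio of the `window` nucleotides immediately upstream of atg_pos."""
    return cpg_obs_exp(dna[max(0, atg_pos - window):atg_pos])

print(upstream_cpg_score("CGCGCGCGATG", 8))  # 2.0
```

A ratio well above the genomic background upstream of an ATG would support the ORF being a gene.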


Other asymmetry measures of gene likelihood

• Dinucleotide bias

• Hexanucleotide bias

• …
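Both measures listed above rest on the same operation, counting overlapping words of length k (k = 2 for dinucleotides, k = 6 for hexanucleotides). A minimal sketch, with an illustrative function name:

```python
from collections import Counter

def kmer_frequencies(dna, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    dna = dna.upper()
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

print(kmer_frequencies("ACGACG", 2))
```

The bias is then judged by comparing these frequencies in a candidate ORF with those of known coding and non-coding sequences.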


Summary

• Genes can be predicted by
  • Homology
  • Content sensors
  • Signal sensors

If you need to annotate a genome, go to e.g. TIGR


How are eukaryotic genes different?

Figure: DNA → (RNA Pol) → mRNA → (ribosome) → Protein


How are eukaryotic genes different?

Figure: DNA → (RNA Pol) → mRNA → (Spliceosome) → spliced mRNAs → (ribosome) → Protein

Correctly identifying splicing sites is not a trivial task


How do we predict splicing sites?

• By Homology

• Ab initio
  – SS motifs
  – Codon usage
  – Exonic Splicing Enhancers
  – Intronic Splicing Enhancers
  – Exonic Splicing Silencers
  – Intronic Splicing Silencers
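As a small illustration of the splice-site motif approach, the sketch below lists every stretch that begins with GT and ends with AG, the canonical dinucleotides at the two ends of most introns. The minimum length is a hypothetical parameter kept tiny so the example stays readable, and the function name is an assumption.

```python
def candidate_introns(dna, min_len=6):
    """Return (start, end) of every stretch that starts with GT and ends with AG.

    start is 0-based, end is exclusive; min_len is an illustrative cut-off.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

print(candidate_introns("AAGTCCCCAGTT"))  # [(2, 10)]
```

On real sequences the two dinucleotides occur so often that the candidate list grows very quickly, which is why the other items in the list are needed to narrow it down.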


Homology Splice Site Prediction

Known spliced gene

Predicted spliced gene


Splice Site Motifs


Exonic Splicing Enhancers


Exonic Splicing Silencers

Genes & Development 18:1241-1250


Interaction between SE and SI


Rules for Splicing

• 3’ end likely target for repression

• Distance between SE and 3’ end < 100bp

• Splicing efficiency p(interaction SEC-3’ end)


Methods for splicing detection

Figure: a set of known spliced genes is divided into a training set and a test set. The training set is used to fit an algorithm (GA, NN, HMM, Bayesian, ME), which is then applied to the test set to produce the test set predictions.


A Genetic Algorithm Method

Table: motifs (DM1, …, AMi, …, EM, IM) against motifs, with entries p(i)

Shuffle lines and columns k times and each time calculate the probability of a given combination of motifs getting spliced

Select the m best combinations and continue to evolve the algorithm until it predicts the training set
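The select-and-evolve loop can be sketched with a toy example. Everything specific here is hypothetical: the training examples, the four motif features, the linear scoring rule, and the use of random mutation in place of the slide's shuffling of lines and columns. It is meant only to show the shape of the loop (score candidates, keep the m best, vary them, stop once the training set is predicted), not the method used in the lecture.

```python
import random

# Hypothetical training set: motif scores (donor, acceptor, enhancer, silencer)
# and whether the site was observed to be spliced.
TRAIN = [
    ((1.0, 1.0, 1.0, 0.0), True),
    ((1.0, 1.0, 0.0, 1.0), False),
    ((1.0, 0.0, 1.0, 0.0), False),
    ((0.0, 1.0, 1.0, 0.0), False),
    ((1.0, 1.0, 1.0, 1.0), False),
    ((1.0, 1.0, 0.0, 0.0), True),
]

def predicts_spliced(weights, features):
    """Linear rule: weighted motif scores plus a bias (last weight) above zero."""
    return sum(w * f for w, f in zip(weights, features)) + weights[-1] > 0

def fitness(weights, examples):
    """Number of training examples predicted correctly."""
    return sum(predicts_spliced(weights, f) == label for f, label in examples)

def evolve(examples, pop_size=30, keep=5, generations=200, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0]) + 1  # one weight per motif plus a bias
    population = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, examples), reverse=True)
        if fitness(population[0], examples) == len(examples):
            break  # the training set is fully predicted
        parents = population[:keep]  # the m best combinations
        children = [[w + rng.gauss(0, 0.3) for w in rng.choice(parents)]
                    for _ in range(pop_size - keep)]
        population = parents + children
    best = max(population, key=lambda w: fitness(w, examples))
    return best, fitness(best, examples)

best, score = evolve(TRAIN)
print(score, "of", len(TRAIN), "training examples predicted")
```

Because the best candidates are always carried over unchanged, the best score never decreases from one generation to the next.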


A Neural Net Method

Weight Table for splice elements

Hidden Nodes

Sequences

Predicted Splicing

Corrected Weight Table for splice elements


Summary

• Eukaryotic genes have exons separated by introns

• Biological rules combined with mathematical and statistical approaches can be used to predict the boundaries for the exons and to predict the splice variants


How to find what genes a string of DNA contains

Rui Alves


Simple steps

• Go to a known gene prediction server (or google for one)

• Input sequence and wait for prediction

• Get prediction(s), either as cDNA or as a translated protein sequence and do homology searches to identify them in a known database (e.g. NCBI or SWISSPROT)


Simple steps a)

• Go to a known gene prediction server (or google for one)

• Input sequence and wait for prediction

• Get prediction(s), either as cDNA or as a translated protein sequence and do homology searches to identify them


Paper Presentation

The human genome (Science) vs. The human genome (Nature)

Nature : Pages 875 to 901

Science: Pages 1317-1337

Compare the differences in methods and results for the annotation

DO NOT SPEND TIME TALKING ABOUT THE SEQUENCING OR ASSEMBLY ITSELF

Do not go into the comparative genome analysis

