Phyloinformatics or How to analyze LOTS of sequences Heath Blackmon University of Texas at Arlington Bioinformatics – Spring 2014
Page 1

Phyloinformatics
or
How to analyze LOTS of sequences

Heath Blackmon
University of Texas at Arlington

Bioinformatics – Spring 2014

Page 2

Phyloinformatic workflow

Retrieve Sequences
• Phylota
• GenBank

Align
• MAFFT

Evaluate Alignment
• LAST
• Gblocks / Guidance

Infer Phylogeny

Page 4

www.phylota.net

Page 10

Select and Download Data

• Find a sequence cluster with:
  > 500 sequences
  < 2000 base pairs

  - Tetrapoda
  - Teleostei
  - eudicotyledons
  - Arthropoda

Download the example file of 18S sequences from the class Google Drive: 18S.fa
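A quick way to check whether a downloaded cluster meets the size criteria above is to count its sequences and find the longest one. The sketch below assumes a plain FASTA file such as 18S.fa and reads "< 2000 base pairs" as a limit on the longest sequence. The function names and the toy records are made up for illustration.

```python
def read_fasta(lines):
    """Return a list of (name, sequence) pairs from FASTA-formatted lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the name.
            parts = line[1:].split()
            records.append([parts[0] if parts else "", []])
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(chunks)) for name, chunks in records]


def cluster_summary(records, min_seqs=500, max_len=2000):
    """Check the criteria: more than min_seqs sequences, all shorter than max_len."""
    lengths = [len(seq) for _, seq in records]
    longest = max(lengths, default=0)
    return {
        "n_seqs": len(lengths),
        "longest": longest,
        "ok": len(lengths) > min_seqs and longest < max_len,
    }


if __name__ == "__main__":
    # Toy input; for real data: with open("18S.fa") as handle: read_fasta(handle)
    toy = [">seq1 example", "ACGTAC", "GT", ">seq2", "ACG"]
    print(cluster_summary(read_fasta(toy)))
```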

Page 13

Alignment Programs

• ProbCons
• T-Coffee
• Clustal
• Muscle
• Kalign
• PRRN
• DIALIGN-T
• MAFFT
• Clustal Omega
• BAli-Phy
• DECIPHER

Page 14

Balance Between Scalability & Accuracy

Method            Score    CPU time (s)

Consistency-based methods
MAFFT 5.662       86.91           6,000
ProbCons 1.10     87.25          43,000
T-Coffee 2.46     84.56         210,000

Iterative refinement methods
Muscle 3.52       81.67           3,400
PRRN 3.11         82.61         250,000
MAFFT 3.89        82.16           3,600
ClustalW 2.0      76.67          58,000

Progressive methods
Kalign 1.0        80.25             480
MAFFT 5.662       78.63             140
Muscle 3.52       77.63             160
ClustalW 1.83     75.34           2,000
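One way to read the table above is to ask which methods are not beaten on both accuracy and speed by some other method. A minimal sketch, with the scores and CPU times copied from the table; the short strategy labels and the function name are only for illustration.

```python
# (method, strategy, score, cpu_seconds), copied from the table above.
RESULTS = [
    ("MAFFT 5.662", "consistency", 86.91, 6000),
    ("ProbCons 1.10", "consistency", 87.25, 43000),
    ("T-Coffee 2.46", "consistency", 84.56, 210000),
    ("Muscle 3.52", "iterative", 81.67, 3400),
    ("PRRN 3.11", "iterative", 82.61, 250000),
    ("MAFFT 3.89", "iterative", 82.16, 3600),
    ("ClustalW 2.0", "iterative", 76.67, 58000),
    ("Kalign 1.0", "progressive", 80.25, 480),
    ("MAFFT 5.662", "progressive", 78.63, 140),
    ("Muscle 3.52", "progressive", 77.63, 160),
    ("ClustalW 1.83", "progressive", 75.34, 2000),
]


def pareto_front(results):
    """Keep rows for which no other row is at least as accurate AND at least as fast."""
    front = []
    for row in results:
        _, _, score, cpu = row
        dominated = any(
            other_score >= score
            and other_cpu <= cpu
            and (other_score > score or other_cpu < cpu)
            for _, _, other_score, other_cpu in results
        )
        if not dominated:
            front.append(row)
    # Fastest first, so the list reads as "what extra time buys".
    return sorted(front, key=lambda r: r[3])


if __name__ == "__main__":
    for method, strategy, score, cpu in pareto_front(RESULTS):
        print(f"{method:14s} {strategy:12s} {score:6.2f} {cpu:>8,d} s")
```

Every row left out of the result is matched or beaten on both columns by at least one other row in the table.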

Page 15

MAFFT

• Align 1,000s of sequences in minutes/hours
• Progressive and iterative methods supported
• Multiple scoring schemes
• Install locally or run on the CBRC servers

Page 21

Go ahead and try aligning the 18S.fa file that you downloaded from the class google drive.
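If MAFFT is installed locally, the alignment can also be run from a script. The sketch below assumes the `mafft` executable is on the PATH and uses its `--auto` option, which lets MAFFT choose a strategy based on the size of the data. The output file name 18S.aln.fa and the helper names are hypothetical.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def mafft_command(fasta, extra_args=("--auto",)):
    """Build the argument list for a basic MAFFT run."""
    return ["mafft", *extra_args, str(fasta)]


def run_mafft(fasta, out_path, extra_args=("--auto",)):
    """Run MAFFT and save the alignment, which MAFFT writes to standard output."""
    with open(out_path, "w") as out:
        subprocess.run(mafft_command(fasta, extra_args), stdout=out, check=True)
    return Path(out_path)


if __name__ == "__main__":
    command = mafft_command("18S.fa")
    if shutil.which("mafft") and Path("18S.fa").exists():
        print("Alignment written to", run_mafft("18S.fa", "18S.aln.fa"))
    else:
        # Nothing to run here; show the equivalent shell command instead.
        print("Would run:", shlex.join(command), "> 18S.aln.fa")
```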

Page 25

Dot Plot

[Figure: dot plot. Horizontal axis: C T A A A T A C G A G C A T A A C A. Vertical axis, top to bottom: A C A A T A C G A G C A T A A A T C (the same sequence read from bottom to top).]

Page 26

DELETION / INSERTION

[Figure: dot plot. Horizontal axis: C T A A A T A C G C A T A A C A. Vertical axis, top to bottom: A C A A T A C G A G C A T A A A T C.]

Page 27

INVERSION

[Figure: dot plot. Horizontal axis: C T A A A T A C A C A A T A C G A G. Vertical axis, top to bottom: A C A A T A C G A G C A T A A A T C.]

Page 28

INVERSION

[Figure: dot plot. Horizontal axis: C T A A A T A C T G T T A T G C T C. Vertical axis, top to bottom: A C A A T A C G A G C A T A A A T C. Legend: matches between same strand; matches between opposite strand.]
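The dot plots on the preceding slides can be reproduced in text form. This is a minimal sketch that compares short words rather than single bases, which reduces noise. The word size of 3, the function names and the ASCII symbols are arbitrary choices for illustration; the two sequences are the ones on the slides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def dot_plot(x, y, word=3):
    """Return (same, opposite): sets of (i, j) where the word starting at x[i]
    matches the word starting at y[j] on the same or the opposite strand."""
    same, opposite = set(), set()
    for i in range(len(x) - word + 1):
        wx = x[i:i + word]
        for j in range(len(y) - word + 1):
            wy = y[j:j + word]
            if wx == wy:
                same.add((i, j))
            if wx == reverse_complement(wy):
                opposite.add((i, j))
    return same, opposite


def render(x, y, word=3):
    """Draw the plot with x across and y down: '\\' same strand, '/' opposite."""
    same, opposite = dot_plot(x, y, word)
    rows = []
    for j in range(len(y) - word + 1):
        row = ""
        for i in range(len(x) - word + 1):
            if (i, j) in same:
                row += "\\"
            elif (i, j) in opposite:
                row += "/"
            else:
                row += "."
        rows.append(row)
    return "\n".join(rows)


if __name__ == "__main__":
    reference = "CTAAATACGAGCATAACA"
    inverted = "CTAAATACTGTTATGCTC"  # second half is reverse-complemented
    print(render(inverted, reference))
```

With y drawn downward, same-strand matches run from top left to bottom right, and the reverse-complemented block shows up as a line running the other way.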

Page 29

Evaluating the 18S alignment

• Look at your dot plots first. What is wrong with the sequences?

• How would you fix/prevent this problem?

Page 30

Evaluating Sites in an Alignment

• Bootstrapping - Guidance
• ID regions with strong support - Gblocks

Page 33

Gblocks

9 W residues
6 I residues
8 F residues
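The residue counts above suggest how a single column can be judged by how often its most common residue occurs. The sketch below is a simplified, per-column illustration only. The two thresholds (half the sequences plus one, and 85% of the sequences), the rule that any gap makes a column nonconserved, and the 10-sequence toy columns are all assumptions; Gblocks also applies block-level rules that are not shown, so check its documentation for the actual parameters.

```python
import math
from collections import Counter


def classify_column(column, conserved_frac=0.5, highly_frac=0.85):
    """Label one alignment column by the count of its most common residue."""
    n = len(column)
    if "-" in column:
        # Simplifying assumption: a column with any gap is never conserved.
        return "nonconserved"
    top_count = Counter(column).most_common(1)[0][1]
    if top_count >= math.ceil(highly_frac * n):
        return "highly conserved"
    if top_count >= math.floor(conserved_frac * n) + 1:
        return "conserved"
    return "nonconserved"


if __name__ == "__main__":
    # Hypothetical columns from a 10-sequence alignment, one character per sequence.
    for column in ("WWWWWWWWWA", "IIIIIILVMA", "FFFFFFFFYL", "ACDEFGHIKL"):
        print(column, "->", classify_column(column))
```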

Page 35

Bootstrapping

The scores across the bottom, scaled between 0 and 1, report the proportion of alignments that agree with the original MSA on the assignment of nucleotides.
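A per-column score of this kind can be sketched as follows. This is a simplified illustration, not the GUIDANCE implementation: it assumes every alternative alignment contains the same sequences in the same order as the original MSA, and it counts a column as supported only when an alternative alignment contains exactly the same column. The function names and toy alignments are made up.

```python
def columns(msa):
    """Describe each column by which residue of each sequence it holds.

    msa is a list of equal-length gapped strings. Each column becomes a tuple
    with, per sequence, the index of the residue placed there or -1 for a gap.
    """
    next_index = [0] * len(msa)
    result = []
    for c in range(len(msa[0])):
        column = []
        for s, row in enumerate(msa):
            if row[c] == "-":
                column.append(-1)
            else:
                column.append(next_index[s])
                next_index[s] += 1
        result.append(tuple(column))
    return result


def column_support(original, alternatives):
    """For each column of the original MSA, return the proportion of
    alternative alignments that contain the identical column."""
    alternative_columns = [set(columns(alt)) for alt in alternatives]
    return [
        sum(col in cols for cols in alternative_columns) / len(alternative_columns)
        for col in columns(original)
    ]


if __name__ == "__main__":
    original = ["AC-GT", "ACAGT"]
    alternatives = [["AC-GT", "ACAGT"], ["ACG-T", "ACAGT"]]
    print(column_support(original, alternatives))
```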

Page 36

Try The Data You Downloaded

• Make an alignment
• Check the dot plots
• Use Gblocks to remove uncertain sites
  – How many sites in initial alignment?
  – How many sites in filtered alignment?
  – Did you lose any taxa?
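The three questions above can be answered with a short script once both alignments are saved as FASTA. A minimal sketch, assuming the sequence names are unchanged by filtering and treating a taxon as lost when it is missing from the filtered file or is left with only gap or "?" characters. The function names and toy alignments are hypothetical.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA text, using the first word of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def n_sites(alignment):
    """Number of columns; all sequences in an alignment must be the same length."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; is this an alignment?")
    return lengths.pop()


def lost_taxa(before, after, missing="-?"):
    """Taxa absent from the filtered alignment or reduced to missing data only."""
    return [
        name for name in before
        if name not in after or not after[name].strip(missing)
    ]


if __name__ == "__main__":
    initial = parse_fasta(">t1\nAC-GT\n>t2\n--AGT\n>t3\nAC---\n")
    filtered = parse_fasta(">t1\nGT\n>t2\nGT\n>t3\n--\n")
    print("sites before:", n_sites(initial))
    print("sites after:", n_sites(filtered))
    print("taxa lost:", lost_taxa(initial, filtered))
```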

Page 37

• Treat your alignment as a model parameter!

• BAli-Phy: estimates phylogenetic trees across all possible alignments, without conditioning on a single alignment being “true”

• Thanks for listening to me!

