
CAP 5510; CGS 5166

Giri Narasimhan

Molecular Biology Preliminaries

Databases

Sequence Alignment

CAP 5510: Introduction to Bioinformatics
CGS 5166: Bioinformatics Tools

Giri Narasimhan

ECS 254A / EC 2474; Phone x3748; Email: [email protected]

My Homepage: http://www.cs.fiu.edu/~giri

http://www.cs.fiu.edu/~giri/teach/BioinfS15.html

Office ECS 254 (and EC 2474); Phone: x-3748
Office Hours: By Appointment Only

Jan 19, 2015

Presentation Outline

1 Molecular Biology Preliminaries

2 Databases

3 Sequence Alignment

The drama of Molecular Biology . . . the actors

http://exploringorigins.org/images/centralDogma.jpg

The Polymeric Players

Molecule   Unit Name            Unit Composition
DNA        Nucleotide or Base   A, C, G, T
RNA        Nucleotide or Base   A, C, G, U
Protein    Amino acid residue   Amino acids represented by a 20-letter alphabet (missing {B, J, O, U, X, Z})
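The three alphabets in the table can be written down directly. The following is a minimal Python sketch, assuming uppercase one-letter codes and ignoring ambiguity symbols such as N or X; the function name `guess_molecule` is hypothetical and not part of the lecture.

```python
import string

DNA = set("ACGT")
RNA = set("ACGU")
# 20-letter protein alphabet: the 26 letters minus B, J, O, U, X, Z
PROTEIN = set(string.ascii_uppercase) - set("BJOUXZ")

def guess_molecule(seq):
    """Return the first alphabet from the table that contains every symbol of seq."""
    symbols = set(seq.upper())
    if symbols <= DNA:
        return "DNA"
    if symbols <= RNA:
        return "RNA"
    if symbols <= PROTEIN:
        return "Protein"
    return "unknown"

print(guess_molecule("gggagaacac"))
print(guess_molecule("MEEPQSDPSV"))
```

The check is only a heuristic: a string such as ACG fits all three alphabets, and the sketch reports the first one that matches.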

Typical DNA Sequence

1 gggagaacac ccggagaagg aggaggaggc gaagaaaagc aacagaagcc cagttgctgc

61 tccaggtccc tcggacagag ctttttccat gtggagactc tctcaatgga cgtgccccct

121 agtgcttctt agacggactg cggtctccta aaggtcgacc atggtggccg ggacccgctg

181 tcttctagtg ttgctgcttc cccaggtcct cctgggcggc gcggccggcc tcattccaga

241 gctgggccgc aagaagttcg ccgcggcatc cagccgaccc ttgtcccggc cttcggaaga

301 cgtcctcagc gaatttgagt tgaggctgct cagcatgttt ggcctgaagc agagacccac

361 ccccagcaag gacgtcgtgg tgccccccta tatgctagat ctgtaccgca ggcactcagg

421 ccagccagga gcgcccgccc cagaccaccg gctggagagg gcagccagcc gcgccaacac

481 cgtgcgcagc ttccatcacg aagaagccgt ggaggaactt ccagagatga gtgggaaaac

541 ggcccggcgc ttcttcttca atttaagttc tgtccccagt gacgagtttc tcacatctgc

601 agaactccag atcttccggg aacagataca ggaagctttg ggaaacagta gtttccagca

661 ccgaattaat atttatgaaa ttataaagcc tgcagcagcc aacttgaaat ttcctgtgac

721 cagactattg gacaccaggt tagtgaatca gaacacaagt cagtgggaga gcttcgacgt

781 caccccagct gtgatgcggt ggaccacaca gggacacacc aaccatgggt ttgtggtgga

841 agtggcccat ttagaggaga acccaggtgt ctccaagaga catgtgagga ttagcaggtc

901 tttgcaccaa gatgaacaca gctggtcaca gataaggcca ttgctagtga cttttggaca

961 tgatggaaaa ggacatccgc tccacaaacg agaaaagcgt caagccaaac acaaacagcg

Building Blocks of DNA & RNA

Fig 1.1, Zvelebil & Baum

DNA – Double Helix Structure

Fig 1.3, Zvelebil & Baum

DNA Molecule

From http://www.cellsalive.com/cells/cell_model.htm

RNA Molecule

Fig 1.5, Zvelebil & Baum

Protein – The 20 Amino Acids

Letter  3-Letter  Amino Acid       Letter  3-Letter  Amino Acid
Code    Code                       Code    Code
A       Ala       Alanine          M       Met       Methionine
C       Cys       Cysteine         N       Asn       Asparagine
D       Asp       Aspartic Acid    P       Pro       Proline
E       Glu       Glutamic Acid    Q       Gln       Glutamine
F       Phe       Phenylalanine    R       Arg       Arginine
G       Gly       Glycine          S       Ser       Serine
H       His       Histidine        T       Thr       Threonine
I       Ile       Isoleucine       V       Val       Valine
K       Lys       Lysine           W       Trp       Tryptophan
L       Leu       Leucine          Y       Tyr       Tyrosine
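The code table maps naturally onto a lookup dictionary. This is a small Python sketch of that idea, using the standard abbreviation Gln for glutamine; the names `THREE_LETTER` and `to_three_letter` are hypothetical helpers introduced here for illustration.

```python
# One-letter -> three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}

def to_three_letter(seq):
    """Spell out a one-letter protein sequence with three-letter codes."""
    # Raises KeyError for any symbol outside the 20-letter alphabet.
    return "-".join(THREE_LETTER[aa] for aa in seq.upper())

print(to_three_letter("MEEP"))
```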

Protein – Typical Sequence

>gi|23491729|dbj|BAC16799.1| P53 [Homo sapiens]

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA

PRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT

CPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRN

TFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVHVCACPGR

DRRTEEENLRKKGEPHHELPPGSTKRALSNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEL

KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
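The record above is in FASTA format: a header line starting with ">" followed by sequence lines. A minimal Python sketch of reading such text is given below, assuming well-formed input; `parse_fasta` is a hypothetical helper, and only the first two sequence lines of the record are repeated in the example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>gi|23491729|dbj|BAC16799.1| P53 [Homo sapiens]
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA
PRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT
"""
header, seq = parse_fasta(example)[0]
fields = header.split("|")  # pipe-separated identifiers, then the description
print(fields[3], len(seq))
```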

Protein molecules have a 3D structure

From Branden & Tooze

Central Dogma

Fig 1.6, Zvelebil & Baum

DNA Replication

Fig 1.4, Zvelebil & Baum

The Cell

From http://www.cellsalive.com/cells/cell_model.htm

Bacterial Chromosomes

From http://www.cellsalive.com/cells/cell_model.htm

Human Chromosomes

From http://www.cellsalive.com/cells/cell_model.htm

Genes

From http://www2.le.ac.uk/departments/genetics/vgec/diagrams/36chromosomeunravel.jpg

Human Chromosomes

From http://www.cellsalive.com/cells/cell_model.htm

Central Dogma

RNA and Genetic Code

Central Dogma

Transcription

Fig 1.7, Zvelebil & Baum

Transcription

Courtesy: Dr. Kalai Mathee

Transcription

Fig 1.6, Zvelebil & Baum

Transcription

Translation

Translation

Presentation Outline

1 Molecular Biology Preliminaries

2 Databases

3 Sequence Alignment

3 Major Public Databases

GenBank: National Center for Biotechnology Information (NCBI)

EMBL (European Molecular Biology Laboratory): European Bioinformatics Institute (EBI)

DDBJ (DNA Data Bank of Japan): National Institute of Genetics (NIG)

All 3 are completely integrated: they exchange data daily through the International Nucleotide Sequence Database Collaboration (INSDC).

Entrez Portal @ NCBI

PubMed; Bookshelf

DNA and Protein Sequence database

Protein Structure database

Genome Assemblies

BLAST

dbSNP

Taxonomy Browser

Population study data sets

PubChem

GEO (Gene Expression Omnibus)

OMIM (Online Mendelian Inheritance in Man)

Other Important Databases

PDB http://www.wwpdb.org/

KEGG http://www.genome.jp/kegg/

MetaCyc http://metacyc.org

ENCODE http://encodeproject.org/ENCODE/

(functional elements in human genome)

1000 Genomes Project

International HapMap Project

Human Microbiome Project

Human Epigenome Project

Gene Ontology (GO)

Human Connectome Project

Presentation Outline

1 Molecular Biology Preliminaries

2 Databases

3 Sequence Alignment

1. Can show sequences are close

2. Can show sequences have similar parts

3. Can identify similar sequences from DB

4. Can pinpoint mutations

5. Can help in sequence assembly

6. Can be basis for discovery

Early 1970s: Simian sarcoma virus (SSV) causes cancer in some species of monkeys.

1970s: Infection by certain viruses causes some cells in culture (in vitro) to grow without bounds.

Hypothesis: Oncogenes in viruses encode cellular growth factors (proteins that stimulate growth); uncontrolled quantities of growth factors produced by infected cells cause cancer-like behavior.

1983:

  Oncogene from SSV, called v-sis, isolated & sequenced.

  Partial sequence for platelet-derived growth factor (PDGF) sequenced & published; PDGF stimulates proliferation of cells.

  R.F. Doolittle was maintaining one of the earliest home-grown databases of published amino acid (aa) sequences.

  Sequence alignment of v-sis and PDGF had surprises.

PDGF and v-SIS

Alignment was good.

Two regions of alignment

  region of 31 aa with 26 matches

  region of 39 aa with 35 matches
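The match counts for the two regions translate into percent identity by simple division. The short Python sketch below shows the arithmetic; `percent_identity` is a hypothetical helper name, and identity is taken here as matches divided by region length.

```python
def percent_identity(matches, length):
    """Percentage of aligned positions that are identical."""
    return 100.0 * matches / length

# (matches, region length in amino acids) for the two aligned regions
regions = [(26, 31), (35, 39)]
for matches, length in regions:
    print(f"{matches}/{length} = {percent_identity(matches, length):.1f}% identity")
```

Both regions come out above 80% identity, which is the sense in which the alignment "was good".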

Conclusion:

  A previously harmless virus incorporates a growth-related gene (proto-oncogene) of its host into its genome.

  The gene gets mutated in the virus, or moves closer to a strong enhancer, or moves away from a repressor.

  When the virus infects a cell, it causes production of an uncontrolled amount of growth factor.

Several other oncogenes are known to be similar to growth-regulating proteins in normal cells.

7. Can help describe motifs, domains, and families of sequences

Implications of Sequence Alignment

Mutation in DNA is a natural evolutionary process. Thus sequence similarity may indicate common ancestry.

In biomolecular sequences (DNA, RNA, protein), high sequence similarity implies significant structural and/or functional similarity.

Similarity vs. Homology

Homologous sequences share common ancestry.

Similar sequences are near to each other by some appropriately defined measurable criteria.

Types of Sequence Alignment ... 1

Types of Sequence Alignment ... 2

Types of Sequence Alignment ... 3

Types of Sequence Alignment ... 4

Sequence Alignment Algorithms

Global alignments: Needleman-Wunsch-Sellers 1970

Local alignments: Smith-Waterman 1981

Both use Dynamic Programming
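Since both algorithms fill the same kind of dynamic programming table, one routine can illustrate the difference between them. The Python sketch below computes only the optimal score (no traceback) and assumes a simple scheme of match +1, mismatch -1 and a linear gap penalty of -2; these values and the name `align_score` are illustrative assumptions, not taken from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), all symbols aligned.
    local=True:  local alignment (Smith-Waterman style), best pair of substrings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment starts free.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)                 # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

print(align_score("ACGT", "AGT"))
print(align_score("TTTACGTTT", "GGACGGG", local=True))
```

The only differences between the two modes are the boundary initialization, the floor at zero, and where the final score is read from: the bottom-right cell for global, the table maximum for local.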

How to Score Mismatches

Revolution in Sequence Alignment Algorithms

FASTA: Lipman, Pearson ’85, ’88

Basic Local Alignment Search Tool (BLAST): Altschul, Gish, Miller, Myers, Lipman ’90

Both programs:

  search entire databases

  tremendous speed and sensitivity

  report statistical significance

BLAST

BLAST Strategy & Improvements

Lipman et al.: sped up finding runs of hot spots.

Eugene Myers ’94: Sublinear algorithm for approximate keyword matching.

Karlin, Altschul, Dembo ’90, ’91: Statistical significance of matches.
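One way to picture "hot spots" is as short words shared by the query and a database sequence, with runs of hits on the same diagonal hinting at a longer match. The Python sketch below illustrates that idea with exact word matches only; it is a simplification and not the actual FASTA or BLAST implementation, and the names `word_hits` and `hits_by_diagonal` are hypothetical.

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where the same length-w word occurs."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits

def hits_by_diagonal(hits):
    """Group query positions by diagonal (subject_pos - query_pos)."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append(i)
    return dict(diagonals)

hits = word_hits("ACGTAC", "TTACGTA", w=3)
print(hits_by_diagonal(hits))
```

Indexing the words once makes each query word a dictionary lookup, which is what lets this style of search avoid filling a full dynamic programming table for every database sequence.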

BLAST Variants

Nucleotide BLAST

  blastn

  MEGABLAST

  Short sequences (higher E-value threshold, smaller word size, no low-complexity filtering)

Protein BLAST

  blastp

  PSI-BLAST

  PHI-BLAST

  Short sequences (higher E-value threshold, smaller word size, no low-complexity filtering, PAM-30)

Translating BLAST

  blastx: Search nucleotide sequence (6 reading frames) in protein database

  tblastn: Search protein sequence in nucleotide DB (6 reading frames)

  tblastx: Search nucleotide sequence (6 frames) in nucleotide DB (6 frames)

Pairwise BLAST
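The "6 reading frames" used by the translating BLAST variants are three codon offsets on the given strand plus three on its reverse complement. The Python sketch below enumerates them as codon lists, assuming an unambiguous A/C/G/T sequence; translating each codon to an amino acid would need a genetic code table, which is left out here. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return the six reading frames as lists of complete codons."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any partial codon.
            codons = [seq[k:k + 3] for k in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```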

BLAST Parameters

Type of query: nucleotide / protein

Word size, w

Gap penalties, p1, p2

Threshold scores, S, T

E-value cutoff, E

E-value, E, is the expected number of sequences that would, by chance alone, have an alignment score at least as high as the current score, S

Number of hits to display, H

Database to search, D

Scoring Matrix, M
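
To make the word size parameter w concrete, the sketch below indexes every length-w word of a query and reports where those words occur exactly in a subject sequence. This is a simplified illustration under stated assumptions: it uses exact matches only, whereas protein BLAST also seeds on neighboring words that score at least T against a query word, and the function names are hypothetical.

```python
# Sketch: exact-match word seeding for a given word size w.
# This is a simplification, not BLAST's actual seeding procedure.
from collections import defaultdict


def index_words(query: str, w: int) -> dict:
    """Map each length-w word of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return dict(index)


def find_seeds(query: str, subject: str, w: int) -> list:
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    # Smaller w yields more seeds (more sensitive, slower); larger w yields fewer.
    for w in (3, 4, 5):
        print(w, find_seeds("ACGTACGT", "TTACGTT", w))
```

The trade-off this exposes is the one behind the "Short Sequences" presets: a smaller word size finds seeds in short or divergent queries at the cost of more candidate hits to extend.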

BLAST Database and Scoring Matrix

Databases:

  Protein: NR (non-redundant), Swiss-Prot/UniProt, pdb, custom
  Nucleotide: NR, dbest, dbsts, htgs, gss, pdb, vector, ...

Scoring Matrices

  PAM matrices: PAM 40, 160, 250, going from short alignments with high similarity (70-90%) to members of a protein family (50-60%), to longer alignments with divergent homologous sequences (less than 30%)
  BLOSUM matrices: BLOSUM90, 80, 62, 30, going from short alignments with high similarity (70-90%) to members of a protein family (50-60%), to weak homologs (30-40%), to longer alignments with divergent homologous sequences (less than 30%)

Rules of Thumb

Homology is often characterized by significant similarity over the entire sequence or strong similarity in key places.

Matches that are > 50% identical in a 20-40 aa region occur frequently by chance.

Distantly related homologs may lack significant similarity. Homologous sequences may have few absolutely conserved residues.

Homology is transitive: if A is homologous to B and B is homologous to C, then A is homologous to C.

Low-complexity regions, transmembrane regions, and coiled-coil regions frequently display significant similarity without homology.

The greater the evolutionary distance, the longer a local alignment must be to achieve a statistically significant score.
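
The percent-identity figure in the second rule can be computed from a pair of aligned strings as in the sketch below. The function name and the choice of denominator (all alignment columns, including gapped ones) are assumptions made for illustration; other tools divide by the shorter sequence length or by ungapped columns only, so reported values can differ.

```python
# Sketch: percent identity of two aligned sequences of equal length.
# The denominator convention used here is an assumption, not a standard.

def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("AC-E", "ACDE"))
```

A high value over only 20-40 residues is weak evidence on its own, which is why the statistical significance of the score, not the identity percentage, should drive the homology call.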

Rules of Thumb ... 2

Results of searches using different scoring systems may be compared directly using normalized scores.

If S is the (raw) score for a local alignment, the normalized score S' (in bits) is given by

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

The parameters $\lambda$ and $K$ depend on the scoring system.

A normalized score is statistically significant when

$$S' > \log_2 (N/E)$$

where E-value = E and N = size of search space.
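
As a worked illustration of the normalized score, the sketch below converts a raw score S into bits via S' = (λS − ln K)/ln 2, and uses the significance condition S' > log2(N/E) in both directions: the E-value implied by a bit score is N · 2^(−S'), and the minimum bit score for a cutoff E is log2(N/E). The values of λ, K, the raw score, and the search-space size in the example are placeholders chosen for illustration, not parameters taken from the lecture.

```python
# Sketch: raw score -> bit score -> E-value, following the formulas above.
# lam, K, raw score and search-space size below are illustrative placeholders.
import math


def bit_score(raw_score: float, lam: float, K: float) -> float:
    """Normalized score S' = (lam * S - ln K) / ln 2, in bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(bits: float, search_space: float) -> float:
    """E-value implied by a bit score: E = N * 2^(-S')."""
    return search_space * 2.0 ** (-bits)


def min_significant_bits(search_space: float, e_cutoff: float) -> float:
    """Smallest bit score satisfying S' > log2(N / E) for the given cutoff."""
    return math.log2(search_space / e_cutoff)


if __name__ == "__main__":
    lam, K = 0.3, 0.1          # placeholder scoring-system parameters
    N = 1e9                    # placeholder search-space size
    bits = bit_score(80, lam, K)
    print("bit score:", bits)
    print("E-value:", e_value(bits, N))
    print("bits needed for E = 0.01:", min_significant_bits(N, 0.01))
```

Because the bit score already absorbs λ and K, two searches run with different scoring matrices can be compared on S' alone, given the same search-space size N.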
