Genome of the week

Genome of the week: Bacillus subtilis

• Gram-positive soil bacterium
• Genetically tractable, well-studied
• Developmental pathways (sporulation, genetic competence)
• Industrial and agricultural importance
• 4.2 Mb genome (sequence completed 1997)
• Close relative of Bacillus anthracis (anthrax)

B. subtilis genome features

• 4,106 protein-coding genes
• 10 rRNA operons
• Nearly 50% of the genome consists of paralogous genes.
  – 77 ABC transporter binding proteins
• 10 phage-like regions (horizontal transfer); these are low-GC regions in the genome.
• 18 sigma factors, which initiate transcription
• 34 two-component regulatory systems

Annotating genes

• How to assign preliminary functions to genes
• Automated programs
• Similarity searches
  – BLAST and PSI-BLAST
  – COGs, Pfam, CDD, other databases
  – Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome.

• Functional characterization (individual genes)
  – Gene knockouts
  – Overexpression
• In many cases computer annotation can only predict function, NOT assign function!
  – The biological function of many genes has not been determined, even in model systems.
  – As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.

• Molecular function: the activity of a protein at the molecular level.
  – Examples: ATPase activity, metal binding, converting glucose-6-phosphate to fructose-6-phosphate.
• Biological function: the cellular role of the protein.
  – Examples: translation initiation, DNA replication, glycolysis.

Homologs, orthologs, and paralogs.

• Homologous genes are genes that share a common evolutionary ancestor.
  – Orthologs are genes found in different organisms that arose from a common ancestor (separated by speciation).
  – Paralogs are genes found in the same organism that arose from a common ancestor (separated by gene duplication). The duplication could have occurred in the species or earlier.

Using BLAST to predict gene function.

• BLAST the predicted protein sequence against the non-redundant database.
• Determine the best hits.
• Automated annotation programs will often assign the function of the best hit to the gene being searched.
• Automated annotations must be confirmed manually. (Final project.)
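The best-hit step can be sketched in a few lines of Python. This is a minimal sketch, assuming the search results were saved in the BLAST+ 12-column tabular format (`-outfmt 6`); the function name `best_hits`, the E value cutoff, and every row in `EXAMPLE` are made up for illustration. The result is only a predicted function and still needs manual confirmation.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query id: best hit}, keeping the highest bit score per query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # too likely to be a chance match
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Hypothetical rows: every identifier and number here is made up.
EXAMPLE = "\n".join([
    "geneA\thitX\t43.5\t322\t175\t3\t221\t539\t197\t514\t4e-68\t255",
    "geneA\thitY\t31.0\t150\t98\t2\t230\t379\t10\t158\t2e-12\t70.1",
    "geneB\thitZ\t25.0\t60\t45\t0\t5\t64\t100\t159\t3.2\t30.0",
])

for query, hit in best_hits(EXAMPLE).items():
    print(f"{query}: predicted (unconfirmed) function from {hit['sseqid']}")
```

With the cutoff used here, the hypothetical geneB gets no annotation at all, which is the expected outcome for a gene whose only hit could easily be a chance match.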

Basic Local Alignment Search Tool

• Calculates similarity for biological sequences
• Finds best local alignments
• Heuristic approach based on Smith-Waterman algorithm
• Searches for matching "words" rather than individual residues
• Uses statistical theory to determine if a match might have occurred by chance
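For comparison, the exact Smith-Waterman local alignment that BLAST approximates can be sketched as below. This is a minimal score-only version with a linear gap penalty; the scoring values (match +3, mismatch -3, gap -2) are illustrative assumptions, not BLAST defaults, and BLAST itself does not fill this full table.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

The table has one cell per pair of residues, so the cost grows with the product of the two sequence lengths. That is why a word-based heuristic is used for database-sized searches.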

NCBI Field Guide

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Word Size = 11

GTACTGGACAT
 TACTGGACATG
  ACTGGACATGG
   CTGGACATGGA
    TGGACATGGAC
     GGACATGGACC
      GACATGGACCC
       ACATGGACCCT
        ...........

Make a lookup table of words

• Minimum word size = 7
• blastn default = 11
• megablast default = 28
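The lookup table of words can be sketched as a dictionary from each word to the offsets where it starts in the query. This is a minimal sketch; the names `build_lookup_table` and `find_exact_seeds` and the short subject sequence are hypothetical, chosen only to show how a subject is scanned for exact word matches.

```python
from collections import defaultdict


def build_lookup_table(query, word_size=11):
    """Map every overlapping word of length word_size to its query offsets."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


def find_exact_seeds(subject, table, word_size=11):
    """Return (query offset, subject offset) for every exact word match."""
    seeds = []
    for s_start in range(len(subject) - word_size + 1):
        for q_start in table.get(subject[s_start:s_start + word_size], []):
            seeds.append((q_start, s_start))
    return seeds


QUERY = "GTACTGGACATGGACCCTACAGGAA"
table = build_lookup_table(QUERY)

# Hypothetical subject that contains one of the query words.
SUBJECT = "TTTGACATGGACCCAAA"
print(find_exact_seeds(SUBJECT, table))
```

Each seed found this way is only a starting point; BLAST then tries to extend it into a longer alignment.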

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Word Size = 3

GTQ
 TQI
  QIT
   ITV   Neighborhood words: LTV, MTV, ISV, LSV, etc.
    TVE
     VED
      EDL
       DLF
        ...

Make a lookup table of words

Word Size can be 2 or 3 (default = 3)
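Neighborhood words can be enumerated by scoring every possible 3-letter word against the query word and keeping those at or above a threshold. The sketch below does this for the word ITV. The three matrix rows are an excerpt of BLOSUM62 (rows I, T, and V only), and the threshold values are assumptions for illustration: 11 is commonly cited as the blastp default, while a lower value is needed to admit all four neighbors named on the slide.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for I, T and V only (enough to score neighbors of "ITV").
BLOSUM62_ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def neighborhood(word, threshold):
    """Return {neighbor word: score} for all words scoring >= threshold."""
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[residue])) for residue in word]
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(row[aa] for row, aa in zip(rows, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


strict = neighborhood("ITV", threshold=11)
relaxed = neighborhood("ITV", threshold=7)
print(sorted(strict))
print(len(relaxed))
```

Lowering the threshold enlarges the neighborhood, which raises sensitivity at the cost of more seeds to check.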

Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

Nucleotide: exact word match (one match)

ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT

Protein: neighborhood words (two matches)

GTQITVEDLFYNI
 SEI     YYN

Scoring Systems - Nucleotides

      A    G    C    T
A    +1   -3   -3   -3
G    -3   +1   -3   -3
C    -3   -3   +1   -3
T    -3   -3   -3   +1

Identity matrix

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19 - 9 = 10
CACGTAGCAAGCTTG-GTGTCA
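The raw score above can be recomputed column by column. This is a minimal sketch using the identity matrix values from the slide (+1 match, -3 mismatch). The gap cost of 3 is an assumption inferred from the slide's arithmetic (19 matches, 2 mismatches, 1 gap, total penalty 9); the slide does not state it.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum per-column scores of two aligned sequences ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap  # assumed gap cost, see note above
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA"))
```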

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)
• PSI- and RPS-BLAST

BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions

Scores

             V    D    S    -    C    Y
             V    E    T    L    C    F

BLOSUM62    +4   +2   +1  -12   +9   +3   =  7
PAM30       +7   +2    0  -10  +10   +2   = 11

Simply add the scores for each pair of aligned residues.

Different matrices produce different scores!
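The two totals can be reproduced by adding the per-column scores. The sketch below uses only the pair scores and gap costs shown on the slide for this one alignment; `PAIR_SCORES` is an excerpt for illustration, not a full substitution matrix.

```python
# Pair scores and gap costs copied from the slide for this alignment only.
PAIR_SCORES = {
    "BLOSUM62": {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
                 ("C", "C"): 9, ("Y", "F"): 3},
    "PAM30": {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0,
              ("C", "C"): 10, ("Y", "F"): 2},
}
GAP_COST = {"BLOSUM62": 12, "PAM30": 10}


def alignment_score(top, bottom, matrix):
    """Add the score of each aligned pair; a gap column costs GAP_COST."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total -= GAP_COST[matrix]
        else:
            total += PAIR_SCORES[matrix][(x, y)]
    return total


for name in PAIR_SCORES:
    print(name, alignment_score("VDS-CY", "VETLCF", name))
```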

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution.

[Figure: extreme value distribution of alignment scores; x-axis: Score, y-axis: Alignments; labels: size of database, your score, expected number of random hits]

Expect Value

E = number of database hits you expect to find by chance

At low E values E approximates a P value
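The relationship between E and P can be checked numerically. The slide does not give the formulas; the sketch below assumes the standard Karlin-Altschul model, E = K·m·n·e^(-λS), and P = 1 - e^(-E) for the probability of at least one chance hit. The parameter values passed in the usage lines are arbitrary illustrations, not values from a real search.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance hits with at least this raw score."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given E expected hits."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for tiny e


# Illustrative parameters only (lam and k depend on the scoring system).
small_db = expect_value(score=80, query_len=300, db_len=1_000_000, lam=0.3, k=0.1)
large_db = expect_value(score=80, query_len=300, db_len=2_000_000, lam=0.3, k=0.1)
print(small_db, large_db)

for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_value(e))
```

E grows in proportion to the database size and falls exponentially with the score. For small E the two numbers are nearly identical, while for large E the P value saturates near 1.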

BLAST Databases for Proteins

• nr (non-redundant protein sequences)
  – GenBank CDS translations
  – NP_ RefSeqs
  – PIR, Swiss-Prot, PRF
  – PDB (sequences from structures)
• swissprot
• pat – patents
• pdb – sequences with 3D structures
• month – sequences updated within 30 days

Assessment of BLAST output

• What is the level of identity and similarity of the best hits?
  – The higher the identity, the more likely the proteins have similar functions.
• Does the area of similarity cover the entire protein, or just part of it? (fig. 2.19)
  – Often you will find hits to only part of your protein, a GTP-binding domain for example.
• Have any of the best hits been characterized experimentally?
  – With so many microbial genomes sequenced, chances are you will have to search extensively to find a hit that has been characterized experimentally.

BLAST Formatting Page

BLAST Output: Graphic Overview

[Figure: graphic overview of BLAST hits; labeled regions: SH3, PX]

BLAST Output: Descriptions

• Links to Entrez
• Example E value: 4 × 10^-68
• Default E value cutoff = 10

TaxBLAST: Taxonomy Reports

BLAST Output: Alignments

>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-containing protein 1) (SDP1 protein)
          Length = 595

 Score = 255 bits (652), Expect = 4e-68
 Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)

Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
           SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
            DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
            TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
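The Identities and Gaps figures in the report header are simple column counts. The sketch below recomputes them for the first 60 columns of the alignment (Query 221-280 against Sbjct 197-254); `alignment_stats` is a hypothetical helper name. Counting Positives would additionally need the full substitution matrix, so it is left out.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in one aligned block."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": len(query),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(query),
    }


QUERY_ROW = "SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI"
SBJCT_ROW = "SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV"

print(alignment_stats(QUERY_ROW, SBJCT_ROW))
```

Comparing the aligned length with the full protein length (595 for this hit) is one quick way to see whether the similarity covers the whole protein or only a domain.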

Blink – Protein BLAST Alignments

• Lists only 200 hits
• List is nonredundant

Nucleotide vs. Protein BLAST

        aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
         +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:   S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
        agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

Comparing ADSS from H. sapiens and A. thaliana

• BLASTp finds three matching words.
• BLASTn finds no match, because the two sequences share no 7 bp words.
• Protein searches are generally more sensitive than nucleotide searches.
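The nucleotide claim can be checked directly by intersecting the word sets of the two sequences. This is a minimal sketch; `shared_words` is a hypothetical helper that only looks at exact words, so on the protein side it ignores the neighborhood words that BLASTp also accepts.

```python
def words(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(a, b, k):
    """Words of length k that occur in both sequences."""
    return words(a, k) & words(b, k)


HUMAN_NT = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ATH_NT = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
HUMAN_AA = "NRVTVVLGAQWGDEG"
ATH_AA = "SQVSGVLGCQWGDEG"

print("shared 7 bp words:", shared_words(HUMAN_NT, ATH_NT, 7))
print("shared 3 aa words:", sorted(shared_words(HUMAN_AA, ATH_AA, 3)))
```

Synonymous codon differences break up the nucleotide words while leaving the protein words intact, which is the reason for the sensitivity gap.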

Translated BLAST

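Program    Query                      Database
blastx     Nucleotide (translated)    Protein
tblastn    Protein                    Nucleotide (translated)
tblastx    Nucleotide (translated)    Nucleotide (translated)

N = nucleotide, P = protein. Nucleotide sequences are translated in all six reading frames before the protein comparison.

Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA.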
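The six-frame translation behind the translated searches can be sketched as follows. This is a minimal sketch assuming the standard genetic code; the frame labels (+1 to +3, -1 to -3) and function names are illustrative, and codons containing anything other than A, C, G, or T are written as X.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna ('*' marks a stop)."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse-strand translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc").items():
    print(frame, protein)
```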

Linking Protein Sequence, Structure, and Function

CDD: Conserved functional domains in proteins, each represented by a PSSM

[Diagram: protein sequences, protein domains, and 3D domains, linked through PSI-BLAST, RPS-BLAST, and CDART]

Position Specific Substitution Rates

Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine is scored differently in these two positions: the active site serine (the active site nucleophile) and the weakly conserved serine.
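Scoring with a PSSM is a per-position lookup rather than a per-pair lookup. The sketch below copies rows 210 to 218 from the matrix on the slide and scores a 9-residue segment against them. Which two rows the slide's serine labels point at is not stated in the text, so treating 216 as the strongly conserved serine and 210 as the weak one is an assumption based on the scores alone.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 210-218 copied from the PSSM on the slide (one score per residue).
PSSM_ROWS = {
    210: [-2, -5, 0, 8, -5, -3, -2, -1, -4, -7, -6, -4, -6, -7, -5, 1, -3, -7, -5, -6],
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    212: [-4, -7, -6, -7, 12, -7, -7, -5, -6, -5, -5, -7, -5, 0, -7, -4, -4, -5, 0, -4],
    213: [-2, 0, 2, -1, -6, 7, 0, -2, 0, -6, -4, 2, 0, -2, -5, -1, -3, -3, -4, -3],
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
}
PSSM = {pos: dict(zip(COLUMNS, row)) for pos, row in PSSM_ROWS.items()}


def pssm_score(segment, start=210):
    """Sum the position-specific score of each residue in the segment."""
    return sum(PSSM[start + offset][residue]
               for offset, residue in enumerate(segment))


print("S at 210:", PSSM[210]["S"], " S at 216:", PSSM[216]["S"])
print("SSCNGDSGG:", pssm_score("SSCNGDSGG"))
print("SSCNGDAGG:", pssm_score("SSCNGDAGG"))  # serine at 216 replaced by alanine
```

The same serine-to-alanine change costs far more at a strongly conserved position than it would at a weakly conserved one, which a single position-independent matrix such as BLOSUM62 cannot express.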

PSI-BLAST

Create your own PSSM:

Confirming relationships of purine nucleotide metabolism proteins

[Diagram labels: query, BLOSUM62, PSSM, Alignment, Alignment]

PSI BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY

e value cutoff for PSSM

PSI Results: Initial BLAST Run

First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Third PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Entrez Domains (CDD)

A Database of Position Specific Score Matrices

16,482 records

• Pfam (35%): Sanger Center. Pfam-A seeds: HMM based models representing a wide variety of functional domains, derived from SWISS-PROT.
• COG (30%): NCBI. BLAST based alignments derived from complete proteomes of unicellular organisms.
• KOG (29%): NCBI. Eukaryotic COGs.
• SMART (4%): EMBL. HMM based models originally concentrating on eukaryotic signaling domains, now expanding.
• CDD (2%): NCBI Curated Alignments.
• LOAD (0.3%): NCBI. Library of Ancient Domains.

Date post:	31-Dec-2015
Upload:	lane-walters