Date post: | 31-Dec-2015 |
Upload: | lane-walters |
Genome of the week: Bacillus subtilis
• Gram-positive soil bacterium
• Genetically tractable, well-studied
• Developmental pathways (sporulation, genetic competence)
• Industrial and agricultural importance
• 4.2 Mb genome (sequence completed 1997)
• Close relative of Bacillus anthracis (Anthrax)
B. subtilis genome features
• 4,106 protein coding genes
• 10 rRNA operons
• Nearly 50% of the genome consists of paralogous genes.
– 77 ABC transporter binding proteins
• 10 phage like regions - horizontal transfer. Low GC regions in the genome.
• 18 sigma factors - initiate transcription.
• 34 two-component regulatory systems.
Annotating genes
• How to assign preliminary functions to genes.
• Automated programs.
• Similarity searches
– BLAST and PSI-BLAST
– COGs, Pfam, CDD, other databases
– Only 50-75% of genes will have a predicted function. Some have no known homologs in any other genome.
• Functional characterization (individual genes)
– Gene knockouts
– Overexpression
• In many cases computer annotation will only be able to predict function - NOT assign function!
– The biological function of many genes has not been determined, even in model systems.
– As genomic characterization of gene function continues, more and more computer-generated annotations will be correct.
• Molecular function - activity of a protein at the molecular level.
– Examples would be ATPase, metal binding, converting glucose-6-phosphate to fructose-6-phosphate.
• Biological function - cellular role of the protein.
– Examples would be translation initiation, DNA replication, glycolysis.
Homologs, orthologs, and paralogs.
• Homologous genes are genes that share a common evolutionary ancestor.
– Orthologs are genes found in different organisms that arose from a common ancestor
– Paralogs are genes found in the same organism that arose from a common ancestor. Duplication could have occurred in the species or earlier.
Using BLAST to predict gene function.
• BLAST predicted protein sequence against the non-redundant database.
• Determine best hits
• Automated annotation programs will often assign the best hit function to the gene being searched.
• Must manually confirm automated annotations. (Final project).
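The "assign the best hit function" step can be sketched in a few lines of Python. This is only an illustration: the tab-separated layout (query, subject, percent identity, E value, bit score, description), the sample hits, and the 1e-5 E-value cutoff are all assumptions made for this sketch, not output from a real search.

```python
import csv
import io

# Hypothetical tabular hits: query, subject, % identity, E value, bit score, description.
# The column layout and every value below are illustrative assumptions.
SAMPLE_HITS = (
    "geneA\tsubj1\t62.0\t1e-80\t290.0\tATP-binding cassette transporter\n"
    "geneA\tsubj2\t35.5\t2e-12\t70.1\thypothetical protein\n"
    "geneB\tsubj3\t28.0\t0.5\t31.2\tsigma factor\n"
)

def best_hits(tsv_text, max_evalue=1e-5):
    """Return {query: (subject, description)} for the top-scoring hit per query."""
    best = {}
    for query, subject, _pident, evalue, bits, desc in csv.reader(
        io.StringIO(tsv_text), delimiter="\t"
    ):
        evalue, bits = float(evalue), float(bits)
        if evalue > max_evalue:
            continue  # too likely to be a chance match
        if query not in best or bits > best[query][0]:
            best[query] = (bits, subject, desc)
    return {q: (s, d) for q, (_bits, s, d) in best.items()}

annotations = best_hits(SAMPLE_HITS)
for gene in ("geneA", "geneB"):
    print(gene, annotations.get(gene, ("-", "no confident hit; needs manual review")))
```

The transferred description is a prediction only; genes such as the hypothetical geneB, which has no hit under the cutoff, are exactly the ones that need manual confirmation.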
Basic Local Alignment Search Tool
• Calculates similarity for biological sequences
• Finds best local alignments
• Heuristic approach based on Smith-Waterman algorithm
• Searches for matching “words” rather than individual residues
• Uses statistical theory to determine if a match might have occurred by chance
NCBI Field Guide
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........
Make a lookup table of words
Minimum word size = 7
blastn default = 11
megablast default = 28
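The word list above can be reproduced with a short Python sketch. The function name `word_lookup_table` is made up for this example; it simply slides a window of the chosen word size along the query and records where each word starts.

```python
from collections import defaultdict

def word_lookup_table(query, word_size=11):
    """Map each word of length word_size to the query positions where it starts."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)

query = "GTACTGGACATGGACCCTACAGGAA"
table = word_lookup_table(query)
for word, starts in table.items():
    print(word, starts)
```

Storing start positions rather than just the words lets a later scan of the database jump straight from a matching word to the place in the query where an alignment should be seeded.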
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3
Neighborhood words (for ITV): LTV, MTV, ISV, LSV, etc.
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Make a lookup table of words
Word Size can be 2 or 3 (default = 3)
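A rough Python sketch of how neighborhood words could be enumerated: score every possible 3-letter word against the query word with BLOSUM62 and keep those at or above a threshold. The threshold of 11 is an assumption chosen for illustration, and the function names are invented for this example; the matrix values are the standard BLOSUM62 lower triangle.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Lower triangle of BLOSUM62, one row per residue in AMINO_ACIDS order.
BLOSUM62_TRIANGLE = """\
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""

# Expand the triangle into a symmetric lookup keyed by residue pair.
SCORES = {}
for i, row in enumerate(BLOSUM62_TRIANGLE.splitlines()):
    for j, value in enumerate(row.split()):
        SCORES[AMINO_ACIDS[i], AMINO_ACIDS[j]] = int(value)
        SCORES[AMINO_ACIDS[j], AMINO_ACIDS[i]] = int(value)

def word_score(a, b):
    """Sum of substitution scores for two equal-length words."""
    return sum(SCORES[x, y] for x, y in zip(a, b))

def neighborhood(word, threshold):
    """All words of the same length scoring at least threshold against word."""
    candidates = ("".join(p) for p in product(AMINO_ACIDS, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)

print(neighborhood("ITV", threshold=11))
print(word_score("ITV", "MTV"), word_score("ITV", "ISV"), word_score("ITV", "LSV"))
```

Which words count as neighbors depends entirely on the threshold: lowering it admits more distant words and increases sensitivity at the cost of more seeds to examine.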
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

Protein: two matches (neighborhood words)
GTQITVEDLFYNI
 SEI     YYN

Nucleotide: one match (exact word match)
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT
Scoring Systems - Nucleotides
     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||    raw score = 19 - 9 = 10
CACGTAGCAAGCTTG-GTGTCA
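The arithmetic on this slide can be checked with a small Python sketch. It applies +1 for a match and -3 for a mismatch, as in the identity matrix; scoring the single gap column as -3 as well is an assumption inferred from the slide's "19 - 9" (19 matches, three non-matching columns), not a general BLAST gap penalty.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Column-by-column score of a gapped nucleotide alignment."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap       # assumed: gap column counted like a mismatch
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))
```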
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
• PSI- and RPS-BLAST
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
BLOSUM62
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Scores
             V    D    S    –    C    Y
             V    E    T    L    C    F
BLOSUM62    +4   +2   +1  -12   +9   +3   = 7
PAM30       +7   +2    0  -10  +10   +2   = 11
Simply add the scores for each pair of aligned residues
Different matrices produce different scores!
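The two totals can be reproduced with a minimal Python sketch. The per-pair and gap values below are copied from the slide only for the residue pairs in this one alignment; they are not complete matrices, and the dictionary and function names are invented for the example.

```python
# Scores for just the residue pairs in the slide's alignment (not full matrices).
PAIR_SCORES = {
    "BLOSUM62": {"VV": 4, "DE": 2, "ST": 1, "CC": 9, "YF": 3},
    "PAM30": {"VV": 7, "DE": 2, "ST": 0, "CC": 10, "YF": 2},
}
# Score the slide assigns to the single gap column under each matrix.
GAP_SCORES = {"BLOSUM62": -12, "PAM30": -10}

def alignment_score(top, bottom, matrix):
    """Add the score of every aligned column under the named matrix."""
    total = 0
    for a, b in zip(top, bottom):
        if "-" in (a, b):
            total += GAP_SCORES[matrix]
        else:
            total += PAIR_SCORES[matrix][a + b]
    return total

top, bottom = "VDS-CY", "VETLCF"
for name in PAIR_SCORES:
    print(name, alignment_score(top, bottom, name))
```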
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Figure: number of alignments plotted against score, annotated with the size of the database, your score, and the expected number of random hits]
Expect Value
E = number of database hits you expect to find by chance
At low E values E approximates a P value
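The relationship between E and P can be made concrete with a short Python sketch. It assumes the usual Karlin-Altschul relations, which are not spelled out on the slide: E = m * n * 2^(-S') for a bit score S', query length m and database length n, and P = 1 - e^(-E). The lengths used in the example call are hypothetical, and real BLAST uses length-corrected (effective) values, so its reported E values will differ.

```python
import math

def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for a bit score S' (search space m * n is assumed)."""
    return query_length * database_length * 2.0 ** (-bit_score)

def p_value(e_value):
    """Probability of at least one chance hit this good: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # expm1 keeps precision when E is tiny

# Hypothetical search: 100-residue query against a 1,000,000-residue database.
print(expect_value(30, 100, 1_000_000))

for e in (10.0, 1.0, 0.01, 1e-6):
    print(f"E = {e:g}  P = {p_value(e):.6g}")
```

For large E the P value saturates near 1 while E keeps growing, which is why the two are interchangeable only at the low end.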
BLAST Databases for Proteins
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
swissprot
pat - patents
pdb – sequences with 3D structures
month – sequences updated within 30 days
Assessment of BLAST output
• What is the level of identity and similarity of the best hits?
– More identity - more likely the proteins may have similar functions.
• Does the area of similarity occur over the entire protein? Or just part of the protein? (fig. 2.19)
– Often you will find hits to only part of your protein. A GTP-binding domain for example.
• Have any of the best hits been characterized experimentally?
– With so many microbial genomes sequenced chances are you will have to search extensively to find a hit that has been characterized experimentally.
BLAST Output: Alignments
>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-
containing protein 1) (SDP1 protein) Length = 595
Score = 255 bits (652), Expect = 4e-68
Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)
Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
           SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
            DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
            TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
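As a rough illustration of where the "Identities" and "Gaps" figures come from, the Python sketch below counts them for the first 60-column block of the alignment only. The function name is made up for this example, and the totals in the header (140/322, 7/322) cover the whole alignment, so the numbers for one block will differ.

```python
def alignment_stats(query, subject):
    """Return (identities, gap columns, alignment length) for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)

# First block of the SNX9_HUMAN alignment (query 221-280, subject 197-254).
query_row = "SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI"
sbjct_row = "SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV"

identities, gaps, length = alignment_stats(query_row, sbjct_row)
print(f"Identities = {identities}/{length} ({100 * identities // length}%), "
      f"Gaps = {gaps}/{length}")
```

Counting "Positives" would additionally need a substitution matrix, since a column is positive when its score is greater than zero.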
Nucleotide vs. Protein BLAST
       aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:  N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
        +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
       agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
Comparing ADSS from H. sapiens and A. thaliana
BLASTp finds three matching words
BLASTn finds no match, because there are no 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
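The comparison above can be checked with a short Python sketch: translate both coding sequences with the standard genetic code, then look for the longest exact word the two nucleotide sequences share. The helper names are invented for this example, and the exhaustive substring search is only a stand-in for illustration, not how BLAST finds words.

```python
from itertools import product

# Standard genetic code in the conventional t, c, a, g codon order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate frame 1 of a lowercase DNA string, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def longest_shared_word(a, b):
    """Longest exact substring present in both sequences (brute force)."""
    for size in range(min(len(a), len(b)), 0, -1):
        words = {a[i:i + size] for i in range(len(a) - size + 1)}
        for j in range(len(b) - size + 1):
            if b[j:j + size] in words:
                return b[j:j + size]
    return ""

human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
a_th = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

print(translate(human))
print(translate(a_th))
shared = longest_shared_word(human, a_th)
print(len(shared), shared)
```

Synonymous codon differences break up the nucleotide identity, so the shared words stay shorter than the minimum word size even though most of the protein residues are conserved.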
Translated BLAST
Program   Query        Database
blastx    Nucleotide   Protein
tblastn   Protein      Nucleotide
tblastx   Nucleotide   Nucleotide

Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
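The translated programs compare sequences at the protein level by translating the nucleotide side in all six reading frames. A minimal Python sketch of that translation step follows; the function names and the short example sequence are made up for illustration, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six conceptual proteins is then searched like an ordinary protein sequence, which is why these programs work on unannotated ESTs or genomic DNA.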
Linking Protein Sequence, Structure, and Function
CDD: Conserved functional domains in proteins represented by a PSSM
Protein sequences
Protein Domains
3D Domains
PSI-BLAST, RPS-BLAST, CDART
Position Specific Score Matrix (PSSM)
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine is scored differently in these two positions
Active site nucleophile
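A small Python sketch shows what position-specific scoring means in practice. It uses a handful of rows copied from the PSSM on this slide; positions 211 and 216 are picked here only as an example of two serine rows, and the function names are invented for the illustration.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Selected rows of the slide's PSSM, keyed by position.
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    214: "-2 -3 -3 -4 -4 -4 -5 7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6",
    215: "-5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
    217: "-3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7",
    218: "-3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7",
    219: "-2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7 9 -4 -4 -7 -7 -6",
}
PSSM = {pos: dict(zip(COLUMNS, map(int, row.split()))) for pos, row in PSSM_ROWS.items()}

def residue_score(position, residue):
    """Score of one residue at one PSSM position."""
    return PSSM[position][residue]

def window_score(start, peptide):
    """Sum of position-specific scores for a peptide placed at start."""
    return sum(PSSM[start + offset][aa] for offset, aa in enumerate(peptide))

print(residue_score(211, "S"), residue_score(216, "S"))
print(window_score(214, "GDSGGP"), window_score(214, "GDAGGP"))
```

A position-independent matrix such as BLOSUM62 would give serine the same score against serine everywhere; here the same residue earns more where the family conserves it strongly, and a substitution there costs more.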
PSI-BLAST
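Create your own PSSM:
Confirming relationships of purine nucleotide metabolism proteins
[Figure: the query is searched with BLOSUM62 to give an alignment; a PSSM built from that alignment is used to produce the next alignment]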
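The idea of turning an alignment into a PSSM can be sketched in Python as column-wise log-odds scores. This is a deliberately simplified stand-in, not PSI-BLAST's actual procedure: it assumes a uniform background frequency of 1/20, a flat pseudocount, no sequence weighting, and a hypothetical four-sequence alignment.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(alignment, pseudocount=1.0, background=1 / 20):
    """Return one {residue: score} dict per column, in rounded half-bit log-odds."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: round(2 * math.log2((counts[aa] + pseudocount) / total / background))
            for aa in AMINO_ACIDS
        })
    return pssm

# Hypothetical alignment of hits to a query, used only to exercise the function.
alignment = ["GDSGGP", "GDSGGP", "GDSGSP", "GNSGGP"]
pssm = build_pssm(alignment)
print(pssm[2]["S"], pssm[4]["S"], pssm[2]["A"])
```

Serine scores highest in the column where every sequence has it, lower where it appears once, and a residue never observed in a column scores below zero; searching again with such a matrix is what lets each iteration pick up more distant relatives.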
PSI BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYYVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNGRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTHVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
e value cutoff for PSSM
First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Entrez Domains (CDD)
A Database of Position Specific Score Matrices
16,482 records
• Pfam 35% - Sanger Center. Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT
• COG 30% - NCBI. BLAST based alignments derived from complete proteomes of unicellular organisms
• KOG 29% - NCBI. Eukaryotic COGs
• SMART 4% - EMBL. HMM based models originally concentrating on eukaryotic signaling domains, now expanding
• CDD 2% - NCBI Curated Alignments
• LOAD 0.3% - NCBI. Library of Ancient Domains