Brian Kidd
October 21, 2010
Computational Biology Tools
Lecture 8: Protein Sequence Databases and Analysis Tools
see survey 04 on eCommons (tests and quizzes)
Questions/Concerns from Last Time
Overview
1. Protein Sequence Databases
• SwissProt, UniProt, NCBI
2. Protein Analysis tools
• Linear sequence analysis
• 3D structure analysis
3. Finding Distant Homologs
Sequence Databases
SwissProt (ExPASy)
highly curated, updated less frequently
TrEMBL (ExPASy)
translated nucleotide sequences
automatic translation, fast but less info
UniProt (EBI)
Unified Protein Resource
Combines SwissProt, TrEMBL, PIR sequences
Sequence Analysis Sites
For protein sequences and tools to analyze them, the two major centers are:
ExPASy: Expert Protein Analysis System
many tools – http://ca.expasy.org/tools
Databases: SwissProt, TrEMBL
PIR: Protein Information Resource (folded into UniProt consortium; no longer major resource site)
NCBI: Entrez Protein and Domains
More Sequence Databases
Non-redundant
NR (NCBI), UniRef (PIR/EBI)
Reference
RefSeq (NCBI) – reannotated by NCBI
Domains/Families
Pfam – protein families (Sanger center + mirror sites)
SMART – Simple Modular Architecture Research Tool
CDD – Conserved protein Domain Database (NCBI), combines Pfam, SMART, and COGs databases
InterPro – (based on UniProt, at EMBL-EBI)
Many others...
Linear Sequence Analysis
What can you learn from a (single) protein sequence?
Calculate its physical properties
Molecular weight (MW), isoelectric point (pI), amino acid content, hydropathy (hydrophobic vs. hydrophilic regions)
Does not take into account post-translational modifications, so calculations are usually not 100% accurate
Identify sequence motifs and families
Signal sequences, transmembrane domains, coiled-coils, post-translational modification sites, secondary structure (non-homologous)
Domains, functional motifs (homologous)
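A minimal sketch of how some of the physical properties on this slide can be computed from a sequence alone. The function names are hypothetical. The residue masses are standard average values and the hydropathy values are the Kyte-Doolittle scale; using that particular scale is an assumption, since the slide does not name one. As noted on the slide, post-translational modifications are ignored.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq):
    """Sum of residue masses plus one water for the free chain termini."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def composition(seq):
    """Fraction of the sequence made up by each amino acid that occurs."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over the whole sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

For a hydropathy plot rather than a single average, the same scale would be averaged over a sliding window along the sequence.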
Protein Sequence Analysis Tools
ExPASy Proteomics Tools
Calculate physical properties
Predict sequence motifs
What ExPASy calls “Topology”: localization, TM domains
Signal sequences, post-translational modifications
Search pattern and profile collections
PredictProtein and Meta-PP
Meta-server providing access to many servers with one submission form
Structure Databases
Experimental
PDB: Protein Data Bank
Families:
SCOP, CATH, Dali Database, Homstrad
Models/Predictions
ModBase
SwissModel
NOTE: all of these databases are described in the January Database issue of Nucleic Acids Research (NAR)
Includes links to the databases
3D Structure Analysis
Visualization
Domain structure, global fold, active sites, point mutations, SNPs, splice sites
Evaluate structure “quality”
Calculate physical properties
Surface areas, distances, side-chain conformations, contact maps
Structural alignment (i.e., similarity to other structures)
Prediction
Physical properties: binding affinity, pKa’s, stability, specificity
3D structure (homology modeling, fold recognition, de novo)
Advanced: protein design, “docking” of two proteins, active site modeling
Secondary Structure Prediction
Three good methods:
Psipred
SAM-T02/T04/T06
PHD (PredictProtein)
Compare a couple of methods
Use the three-state predictions
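One way to compare methods, sketched under assumptions: each method's output is taken to be a string of three-state labels (H = helix, E = strand, C = coil), one per residue. The function names and the example strings in the usage note are hypothetical and are not output from any of the servers listed above.

```python
from collections import Counter


def agreement(pred_a, pred_b):
    """Fraction of residues where two three-state predictions agree."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same residues")
    same = sum(a == b for a, b in zip(pred_a, pred_b))
    return same / len(pred_a)


def consensus(predictions):
    """Per-residue majority vote; '?' where no state has a strict majority."""
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        result.append(state if votes > len(column) / 2 else "?")
    return "".join(result)
```

For example, `consensus(["HHEC", "HHCC", "HECC"])` gives `"HHCC"`. Residues where the methods disagree are the ones to treat with the most caution.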
Information Flow
Sequence ⇔ Structure ⇔ Function
Evolutionary selection operates on function
Structure is more closely linked to function than is sequence, so structure tends to be more conserved than sequence
Need to search farther in sequence space to find proteins with related structures and functions
Detecting Remote Similarities
Remote similarities can more easily be detected by comparing protein sequences
DNA sequences change faster than protein sequences (wobble position, redundant codons)
4 letter DNA code vs. 20 letter amino acid code means that matches by chance are more likely in DNA ➜ the protein code has more information in it!
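The alphabet-size argument can be made concrete with a little arithmetic. This sketch assumes every letter is equally likely, which real sequences do not satisfy, so the numbers are only a rough illustration. The function names are hypothetical.

```python
import math


def chance_match_probability(alphabet_size):
    """Probability that two random positions match, assuming uniform letters."""
    return 1.0 / alphabet_size


def bits_per_position(alphabet_size):
    """Maximum information carried by one position, in bits."""
    return math.log2(alphabet_size)


for name, size in [("DNA", 4), ("protein", 20)]:
    print(name, chance_match_probability(size), round(bits_per_position(size), 2))
```

Under this assumption a random DNA position matches 1 time in 4 and carries at most 2 bits, while a random protein position matches 1 time in 20 and carries at most about 4.32 bits.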
Detecting Homology
[Figure: search methods arranged by evolutionary distance, from NEAR to FAR. METHODS: BLASTn, BLASTp, PSIBLAST, Fold Recognition. SIMILARITY: DNA Sequence, Protein Sequence, Protein Structure.]
Similar Sequences Share Similar Structures
Compare all pairs of proteins in the same “family” (pairs for which homology is very probable)
Homologs do not necessarily share much sequence similarity
Proteins with > 30% sequence identity almost always share the same fold
Sauder et al., Proteins 40:6-22 (2000).
[Figure: Family pairs; series: All others, Immunoglobulins; axis label: More structural similarity]
PSIBLAST
Position-Specific Iterated BLAST
Creates a scoring matrix specialized for your sequence
Allows more distantly related sequences to be identified
Steps:
1. Use BLASTp and identify related sequences (E-value threshold)
2. Create a profile from related sequences
3. Search for related sequences using this profile
4. Repeat steps 2 and 3
BLASTing the Protein Universe
Evolution and the Protein Universe
PSIBLASTing the Protein Universe
Sequence Profiles
Align all sequences and count how often each amino acid occurs at every position
Combine with prior information about substitution frequencies using pseudo-counts from BLOSUM62
Convert to log odds score to give a Position-Specific Scoring Matrix (PSSM)
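The three steps above can be sketched in a few lines. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment of equal-length sequences, adds a flat pseudocount to every amino acid instead of deriving pseudocounts from BLOSUM62, and uses a uniform background frequency of 1/20. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (base 2) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)  # how often each amino acid occurs here
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total  # smoothed frequency
            row[aa] = math.log2(freq / background)     # log-odds score
        pssm.append(row)
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(row[aa] for row, aa in zip(pssm, seq))
```

With the toy alignment `["MKW", "MRW", "MKW"]`, the sequence `"MKW"` scores higher than `"AAA"`, and at the second position K scores higher than R, which scores higher than any unobserved residue.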
A Sample PSSM
Pos      A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1 M   -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
  2 K   -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
  3 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  4 V    0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
  5 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  6 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
  7 L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
  8 L   -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
  9 L   -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
 10 L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 11 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 12 A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 13 W   -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
 14 A    3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
 15 A    2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
 16 A    4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
 37 S    2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
 38 G    0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
 39 T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
 40 W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 41 Y   -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
 42 A    4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
From Pevsner, Bioinformatics and Functional Genomics (ISBN 0-471-21004-8). © 2003 John Wiley & Sons, Inc.
PSSM Corruption
False positives can occur in a PSIBLAST search if the PSSM becomes corrupted
How do PSSMs become corrupted?
One sequence that is not homologous to the query gets included in the alignment used to make the PSSM
The PSSM now looks a bit like this spurious sequence and will match well to other similar spurious sequences
The additional spurious sequences that are detected are included in the new alignment, amplifying the corrupting signal
Once a “bad” sequence is included in the PSSM, the search veers off course and cannot be corrected
Preventing PSSM Corruption
Filter regions of biased composition (low-complexity filter)
Use better methods to estimate the E-value (composition-based statistics)
Use a stricter threshold for including a sequence in the PSSM: lower the E-value cutoff from 0.001 (default) to a value such as 0.0001
Manually inspect the output from each iteration and remove suspicious hits
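A small sketch of what a stricter inclusion threshold does. The hit identifiers and E-values in the usage note are made up for illustration, and `select_for_profile` is a hypothetical helper, not part of any BLAST tool. The 0.001 default is the value quoted on this slide.

```python
def select_for_profile(hits, inclusion_threshold=0.001):
    """Return IDs of hits whose E-value is at or below the threshold.

    hits is a list of (hit_id, evalue) pairs; only the returned hits would
    be used to build the next PSSM.
    """
    return [hit_id for hit_id, evalue in hits if evalue <= inclusion_threshold]
```

With the hypothetical hits `[("seqA", 1e-30), ("seqB", 5e-4), ("seqC", 0.02)]`, the default keeps `seqA` and `seqB`; lowering the threshold to 0.0001 keeps only `seqA`, so a borderline hit such as `seqB` can no longer influence the PSSM.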
PHI-BLAST
Pattern-Hit-Initiated BLAST
What other proteins contain a particular sequence pattern and are similar in the vicinity of this pattern?
May filter out cases where pattern matches randomly and doesn’t indicate homology
Combines matching of regular expressions with local alignments surrounding the match
Pattern matching uses ScanProsite syntax
Sequence similarity search is like PSIBLAST
Syntax Rules for Patterns
[] any one of the listed characters allowed
{} any character except the listed ones allowed
x(n) n positions in which any residue is allowed
x(n,m) n to m positions in which any residue is allowed
Examples:
GXW[YF][EA][IVLM]
matches:
GTWFEL
GKWYAI
does not match:
GGWYFEI
GWYEI
E[LIV]X(0,3)PP[STG]
matches:
ELPPS
EVIPPG
does not match:
ELIVPPPPG
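The syntax rules above map closely onto regular expressions, so the examples can be checked mechanically. This is a minimal sketch covering only the four rules listed on this slide; `prosite_to_regex` and `pattern_matches` are hypothetical names, and a literal X in the pattern is treated as the wildcard.

```python
import re


def prosite_to_regex(pattern):
    """Translate the pattern syntax on this slide into a Python regex."""
    pattern = pattern.replace("-", "")  # tolerate optional separators
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "xX":                      # any residue
            out.append(".")
            i += 1
        elif ch == "[":                     # any one of the listed residues
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "{":                     # any residue except those listed
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                     # repeat count: (n) or (n,m)
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        else:                               # a literal residue
            out.append(ch)
            i += 1
    return "".join(out)


def pattern_matches(pattern, sequence):
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(prosite_to_regex(pattern), sequence) is not None
```

For instance, `prosite_to_regex("E[LIV]X(0,3)PP[STG]")` yields `E[LIV].{0,3}PP[STG]`, which accepts ELPPS and EVIPPG and rejects ELIVPPPPG.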
Gene Discovery with BLAST
Start with the sequence of a known protein
tblastn: search a DNA database (e.g., HTGS, dbEST, or genomic sequence from a specific organism)
Inspect the matches...
• to DNA encoding known proteins
• to DNA encoding related (novel!) proteins
• to false positives
blastx or blastp: search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene
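Which BLAST program fits each step depends on the type of the query and of the database. The lookup below is a small sketch of that choice; `choose_blast` is a hypothetical helper, not part of any BLAST distribution.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",    # query translated in six frames
    ("protein", "nucleotide"): "tblastn",   # database translated in six frames
}
# tblastx (translated query against translated database) also takes
# nucleotide input on both sides; it is left out of this simple lookup.


def choose_blast(query_type, database_type):
    """Return the BLAST program for the given query and database types."""
    return BLAST_PROGRAMS[(query_type, database_type)]
```

In the workflow on this slide, `choose_blast("protein", "nucleotide")` gives `tblastn` for the first search, and `choose_blast("nucleotide", "protein")` gives `blastx` for confirming a DNA hit against nr.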
Essentials at this Point
Accessing literature and sequence information from various databases (NCBI and UCSC)
BLAST (all variants)
Pairwise sequence analysis tools and algorithms
Single-sequence analysis tools for DNA: EMBOSS, ORFs, restriction enzymes, and primers
Protein databases and analysis tools
PSI and PHI BLASTs
For Next Time
Reading
B4D Chapter 9 – Building a Multiple Sequence Alignment
B4D Chapter 10 – Editing and Publishing Alignments
Problem set
Continue working on PS #2 (due Friday, October 29)
http://www.soe.ucsc.edu/classes/bme110/Fall10/calendar.html