+ All Categories
Home > Documents > School B&I TCD Bioinformatics Database homology searching May 2010.

School B&I TCD Bioinformatics Database homology searching May 2010.

Date post: 02-Jan-2016
Category:
Upload: arabella-baldwin
View: 212 times
Download: 0 times
Share this document with a friend
Popular Tags:
33
School B&I TCD Bioinformatics Database homology searching May 2010
Transcript
Page 1: School B&I TCD Bioinformatics Database homology searching May 2010.

School B&I TCD Bioinformatics

Database homology searching

May 2010

Page 2: School B&I TCD Bioinformatics Database homology searching May 2010.

Why search a Database

• To find homologous sequences to your unknown to determine function

• To find other related sequences to do evolutionary studies (trees) or to make specialised database (nematode 16sRNA)

• To find the mouse or E.coli homolog of your gene of interest

• To find genes in a newly sequenced genome• To predict 3-D structure (blast vs PDB)

Page 3: School B&I TCD Bioinformatics Database homology searching May 2010.

BLAST

• For many molecular biologists

• a Blast search – with a DNA sequence – at NCBI – accepting default parameters

• IS bioinformatics– 70% of searches at NCBI at blastN

Page 4: School B&I TCD Bioinformatics Database homology searching May 2010.

BLAST

• Basic local alignment search tool• More or less unchanged for 20 years• Extraordinarily quick:

– Submit seq via satellite to Bethesda MD– Crunch thro 400 GB of sequence– Return results via satellite– ALL in 30 seconds

• Available at other sites (EBI, ExPaSy etc.)• with better response times, fewer conditions

Page 5: School B&I TCD Bioinformatics Database homology searching May 2010.

NCBI penalties

• 1st request: current time• 2nd request: current time + 60 seconds• 3rd request: current time + 120 seconds• 4th request: current time + 180 seconds• 5th request: current time + 240 second

• DON’T submit multiple simultaneous queries

Page 6: School B&I TCD Bioinformatics Database homology searching May 2010.

Reliable servers

• http://www.ch.embnet.org/software/bBLAST.html • http://www.ch.embnet.org/software/aBLAST.html

– Basic and advanced SwissBlast

• http://www.ebi.ac.uk/blast2/index.html – EBI (Cambridge, UK) NCBI Blast

• http://www.ncbi.nlm.nih.gov/blast/– Not the only Blast server!

Page 7: School B&I TCD Bioinformatics Database homology searching May 2010.

BLAST varieties• blastn: searches a DNA sequence against a DNA database like

EMBL, Genbank, or dbEST. also megablast for exact matches• blastp: searches a protein sequence against a protein database

such as UniProt, or "nr" a non-redundant database which ideally contains one copy of every available sequence.

Then you have:• blastx: searches a DNA sequence (translated in all six reading

frames) against a protein database.• tblastn: searches a protein sequence against a DNA database

(translated in all six reading frames) – essential for searching EST databases.

and in the interests of completeness there is:• tblastx: searches a DNA sequence (translated in all six reading

frames) against a DNA database (translated in all six reading frames).

finally• Psi-blast an iterative process using position specific subst matrix

PSSM to find distantly related sequences

Page 8: School B&I TCD Bioinformatics Database homology searching May 2010.

Algorithm

• Five steps– Break the query seq into short “words” – typically 3

consecutive residues for protein– Search the database for (nearly) exact matches to

these words– Extend match “hits” out on either side until score

stops going up – these are HSPs (high scoring segment pairs)

– Sort the HSPs by some “optimum” criterion– Significant hits are then formally scored, aligned and

displayed

Page 9: School B&I TCD Bioinformatics Database homology searching May 2010.

Alternatives to BLAST

• FASTA– A little slower than Blast– More sensitive (so recommended) for DNA to

DNA searches

• Smith-Waterman (Blitz)– Much slower than Blast (20x slower)– Much more sensitive

• But Blast is standard because it gives a “good enough” answer quickly.

Page 10: School B&I TCD Bioinformatics Database homology searching May 2010.

Blast Blast at NCBI

Parallel search in Conserved domain DB

Paste your query sequence here

Page 11: School B&I TCD Bioinformatics Database homology searching May 2010.

Input sequence parameters

Optional parameters to change include• Parameters that alter the hits found

– Database searched– Word size– Substitution matrix– Gap penalties– Low complexity masking

• Parameters that alter the results delivered– Expectation cut-off– Limit by organism or taxonomic group– Number of hits reported– Number of alignments shown

Page 12: School B&I TCD Bioinformatics Database homology searching May 2010.

Database

• If query is a coding gene: translate and search protein database

• Search PDB if you want a 3-D structure• Search NR if you want any hit• Search UniProt to know what the hits are• Search dbEST to know if your sequence is expressed

• UniProt90: no seq is more than 90% ident to any other (for an uncluttered tree) also UniProt50

Page 13: School B&I TCD Bioinformatics Database homology searching May 2010.

Word size

• Default at NCBI Blast– 11 for DNA; option of 7 and 5– 3 for protein; option of 2

• Increasing wordsize will speed search

… but will lose sensitivity– It will miss some useful but distant hits

• Decreasing wordsize will be more sensitive … but take longer

Page 14: School B&I TCD Bioinformatics Database homology searching May 2010.

Substitution matrix

• By default BLAST uses BLOSUM62

• Can change this– Blosum90 mirrors changes in closely related

sequences– Blosum30 changes in distant sequences

• Should run 3 blast searches with different substitution matrices.

Page 15: School B&I TCD Bioinformatics Database homology searching May 2010.

Substitution matricesTop left part of a BLOSUM 90 matrix A R N D C Q E G H I LA 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3N -2 -1 7 1 -4 0 -1 -1 0 -4 -4D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2Q -1 1 0 -1 -4 7 2 -3 1 -4 -3E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5H -2 0 0 -2 -5 1 -1 -3 8 -4 -4I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5

Changes 30-62-90 not linear

Positive off-diagonals aresimilar

Page 16: School B&I TCD Bioinformatics Database homology searching May 2010.

Reality of matrix change

• Query: GHDEICI• GH + C • Sbjct: GHACNCG

– Scores 39 with Blosum30; 5 with Blosum90

• Query: HEQCRLEN• +E LEN• Sbjct: QENAHLEN

– Scores 19 with Blosum30; 24 with Blosum90

• So GHDEICIHEQCRLEN will find different hits

Page 17: School B&I TCD Bioinformatics Database homology searching May 2010.

Matrix families

• PAM – point accepted mutation– Margaret Dayhoff 1970s (before Genbank)

• BLOSUM – based on aligned blocks– Henikoff and Henikoff 1992– Blosum 62 default (based on aligned seqs

62% identical – don’t ask why 62: it works)

• Low PAM = High Blosum– PAM 250 == Blosum 30– PAM 30 == Blosum 90

Page 18: School B&I TCD Bioinformatics Database homology searching May 2010.

Gap penalty

• Original Blast was gap-free– Because gapped aligns much more CPU

• Now can insert “affine” gaps– Gap open 10; gap extend 1

• Raise gap penalty to discourage gaps– Preferentially gets closely related hits

• Lower gap penalty for sensitive search for distant relatives

Page 19: School B&I TCD Bioinformatics Database homology searching May 2010.

Low Complexity Masking• Seg masks low-complexity regions

– “too many” serines, prolines etc.• What a masked sequence looks like:>P04729 Wheat gamma gliadinMKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQQPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQIPIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQSRYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLGQQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTGVGAY*and after low complexity masking:

>P04729 SEG low-complexity maskedMKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTGVGAY*

• Xnu masks repeats (from database of common repeats)

Page 20: School B&I TCD Bioinformatics Database homology searching May 2010.

Masking• Check if masking on/off by default (differs at

sites)• Run 2 searches with masking on, masking off

ANNYway• Masking by hand:

– If you know about the DNA binding domain already• Which is really common and will occupy the top 100 hits

against any database– Replace the region with Xs– Re-run blast to find other motifs/domains/information– NCBI blast allows select subsequences

• DUST masks low info DNA sequences– Like polyA tails

Page 21: School B&I TCD Bioinformatics Database homology searching May 2010.

Advanced

Page 22: School B&I TCD Bioinformatics Database homology searching May 2010.

Expectation cut-off• E value

– Expected number of chance hits• Given the database searched• Query HSP length

• Related to probability• Default is 10: “expect to find 10 hits as

good as this by chance alone”• E values less than 10-4 unreliable • So different from p < 0.05 • For short seqs crank E up to 100 or 1000

Page 23: School B&I TCD Bioinformatics Database homology searching May 2010.

Data deluge – hits delivered

• Default suits general purpose

• V = 50 num of “hits” one line descriptions

• B = 50 num of alignments between query and hit which are displayed

Page 24: School B&I TCD Bioinformatics Database homology searching May 2010.

Swiss Blast

Page 25: School B&I TCD Bioinformatics Database homology searching May 2010.

EBI blast

Page 26: School B&I TCD Bioinformatics Database homology searching May 2010.

Limit by taxon/organism

Also search at www.ensembl.org for “genomed” organisms

Page 27: School B&I TCD Bioinformatics Database homology searching May 2010.

Database choice

NR for getting aNNy hitswissprot or refseq for getting hits that have annotationMonth for recent hits

Page 28: School B&I TCD Bioinformatics Database homology searching May 2010.

The output

• Is in five parts

1. Admin – date, size of database, id of query

2. Graphic display of query and hits

3. List of hits with links to database and to…

4. …alignments (may be > 1 HSP per hit)

5. More admin, including errors/warnings

Page 29: School B&I TCD Bioinformatics Database homology searching May 2010.

Notes!

• Record the top admin stuff. Your search will be different – Next week– On a different server– With different DB– With different input parameters

• Mouse-over any graphic display– Shows domain structure– Shows if hit is global or local

• Read the bottom admin stuff

Page 30: School B&I TCD Bioinformatics Database homology searching May 2010.

Blast outputsp|P06471|HOR3_HORVU B3-HORDEIN Length = 264 Score = 62.5 bits (149), Expect = 1e-09 Identities = 32/63 (50%), Positives = 38/63 (59%)Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY LNPCARSQM R+EA+YSbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY Query: 191 SII 193 SI+Sbjct: 171 SIV 173

Low-complexity masked regionWhat sort of residues masked inThe query sequence??

May be more HSPs for same hitIf big deletion in either seq

Page 31: School B&I TCD Bioinformatics Database homology searching May 2010.

General blast protocol1. Find server, choose DB, paste seq, GO

2. Rerun search with/out masking

3. Rerun search with two diff subs matrix

4. 2 x 3 for six searches

5. If top N hits all same family/domain then XXX this region and resubmit

6. LOOK at the results; esp strange ones

7. Limit results by organism

Page 32: School B&I TCD Bioinformatics Database homology searching May 2010.

Blast notes

• Twilight zone <25% protein <70% NA• Because two sequences give high blast scores

to a third, doesn’t mean they are related– NNNNDOMAINANNNNDOMAINBNNN– NNNNDAMIANANNNNNNNNNNNNNN– NNNNNNNNNNNNNNNDIMIAMBNNN

• E-value vs % ID. >10-4 unreliable• Bit score <50 unreliable

Page 33: School B&I TCD Bioinformatics Database homology searching May 2010.

PSI-BLAST

• Uses a PSSM rather than BloSum/PAM

• Iterative …– Can find very distant relatives– …so deep insight

• BUT can iterate off with the fairies


Recommended