Date post: | 29-May-2018 |
Category: |
Documents |
Upload: | richa-singh |
8/8/2019 tics Day 1
1/96
WELCOME TO THE WORKSHOP ON
BIOINFORMATICS
M.Sc II REVISED SYLLABUS
DAY 1
30/09/09
INTRODUCTION
NCBI
National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
Entrez, the Life Sciences Search Engine
The Entrez page is home to the Entrez Global Query database search engine (the Entrez cross-database search page).
The entire group of individual Entrez databases is organized on this page, with the literature databases at the top, including PubMed, PubMed Central, Journals, Books, OMIM and OMIA.
NCBI
The NCBI Site Search is also listed.
The sequence databases include Nucleotide, Protein, Genome, Structure, and SNPs.
The remaining databases are Taxonomy, Gene, UniGene, HomoloGene, Conserved Domains, 3D Domains, UniSTS, PopSet, GEO Profiles, GEO Datasets, PubChem Bio-Assay, PubChem Compound, PubChem Substance, Cancer Chromosomes, Probe, MeSH, Journals and NLM Catalog.
Links to popular NCBI Web pages, such as PubMed, Human Genome, Map Viewer, and BLAST, are on the toolbar.
There is also a link to the "GenBank" database, leading to the Nucleotide database.
NCBI
By using the Entrez Global Query, a search across all Entrez databases is performed by entering a simple search term or phrase in the "Search across databases" query box.
Select the Go button to execute the search, or press the Enter key on your keyboard.
The CLEAR button erases search terms in the query box; use it to begin a new search.
The results found in each database are displayed on the Global Query page.
Click on the result number or its adjacent database name to get to the specific results.
See the link to the Global Query Help document, which is to the right of the CLEAR button.
Nucleotide Database
When a search is done in the Nucleotide database, Entrez search results are also shown for the three component Nucleotide databases on the Search statistic line.
The component Nucleotide databases together contain all the sequence data from GenBank, EMBL, and DDBJ, the members of the International Collaboration of Sequence Databases.
The new component databases are included within the Entrez linking scheme, and links within and between databases can be selected as usual from the various datasets.
Popular search strategies such as Limits, Preview/Index, History, and MyNCBI can be used within each individual database.
The Nucleotide database also includes the Reference Sequence (RefSeq) records. RefSeqs are an NCBI-curated, non-redundant set of sequences.
Protein Database
The Protein database contains sequence
data from the translated coding regions
from DNA sequences in GenBank, EMBL,
and DDBJ as well as protein sequences
submitted to Protein Information Resource
(PIR), SWISS-PROT, Protein Research
Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).
Genome Database
The Genome database provides views for
a variety of genomes, complete
chromosomes, sequence maps with
contigs, and integrated genetic and
physical maps.
Structure Database
The Structure database, or Molecular Modeling Database (MMDB), contains experimental data from crystallographic and NMR structure determinations.
The data for MMDB are obtained from the Protein Data Bank (PDB).
The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
Conserved Domains
Conserved Domains is a database of
protein domains.
The source databases for Conserved
Domains are Pfam, Smart, and COG.
3D Domains
3D Domains contains protein domains
from the Entrez Structure Database.
UniSTS
UniSTS is a unified, non-redundant view of sequence tagged sites (STSs).
UniSTS integrates marker and mapping data from a variety of public resources.
Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, and WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, and Jackson Laboratory's MGD map).
Gene
Gene provides a unified query
environment for genes defined by
sequence and/or in NCBI's Map Viewer.
You can query on names, symbols,
accessions, publications, GO terms,
chromosome numbers, E.C. numbers, and
many other attributes associated with genes and the products they encode.
Taxonomy Database
The Taxonomy database contains the
names of all organisms that are
represented in the NCBI genetic database
by at least one nucleotide or protein sequence.
PubMed Central
PubMed Central (PMC) is the U.S.
National Library of Medicine's digital
archive of life sciences journal literature.
Access to the full text of articles in PMC is
free, except where a journal requires a
subscription for access to recent articles.
Journals
The Journals database can be searched
using the journal title, MEDLINE
abbreviation, NLM ID, ISO abbreviation, or
ISSN.
The database includes the journals in all
Entrez databases, e.g., PubMed,
Nucleotide, Protein.
MeSH
MeSH (Medical Subject Headings) is the
National Library of Medicine's controlled
vocabulary used for indexing articles in
PubMed.
MeSH terminology provides a consistent
way to retrieve information that may use
different terminology for the same concepts.
Bookshelf
The Bookshelf has a collection of
Biomedical books that are linked in Entrez.
The NCBI Handbook is also available from
the Bookshelf.
OMIM Database
The OMIM (Online Mendelian Inheritance
in Man) database is a catalog of human
genes and genetic disorders.
OMIA Database
Online Mendelian Inheritance in Animals (OMIA) is a database of genes, inherited disorders and traits in animal species (other than human and mouse) authored by Professor Frank Nicholas of the University of Sydney, Australia, with help from many people over the years.
The database contains textual information and references, as well as links to relevant records from OMIM, PubMed, Gene, and soon to NCBI's Phenotype database.
Boolean Operators used in Entrez
AND: To AND two search terms together instructs Entrez to find all documents that contain BOTH terms.
OR: To OR two search terms together instructs Entrez to find all documents that contain EITHER term.
NOT: To NOT two search terms together instructs Entrez to find all documents that contain search term 1 BUT NOT search term 2.
Boolean operators AND, OR, NOT must be entered in UPPERCASE (e.g., promoters OR response elements).
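The uppercase rule is easy to get wrong when queries are assembled by a script. The following is a minimal sketch, not an NCBI tool: the function name combine_terms is hypothetical, and it only builds the query text that would be typed into the search box.

```python
def combine_terms(operator, *terms):
    """Join search terms with an Entrez Boolean operator written in UPPERCASE."""
    op = operator.upper()  # Entrez expects AND, OR, NOT in capitals
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator!r}")
    if len(terms) < 2:
        raise ValueError("need at least two search terms")
    return f" {op} ".join(terms)


# The example from the slide, entered with a lowercase operator by mistake.
print(combine_terms("or", "promoters", "response elements"))
```

The operator is normalised rather than rejected when it arrives in lowercase, which is a design choice made here for convenience.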
Searching for Unique Identifiers
Unique identifiers can be accession numbers, which apply to a complete sequence record, or sequence identification numbers, which apply to the individual sequences within a record.
The format of accession numbers varies, depending upon the source database.
Each data domain in Entrez contains records from a number of different sources.
Searching for Unique Identifiers
An accession number is the unique identifier for a sequence record.
An accession number applies to the complete record and is usually a combination of a letter(s) and numbers, such as a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456).
Some accessions might be longer, depending on the type of sequence record.
Accession numbers do not change, even if information in the record is changed at the author's request.
Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.
Searching for Unique Identifiers
Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore bar and six or more digits, for example:
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
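The accession formats described on these slides can be checked with regular expressions. This is a minimal sketch that covers only the three formats named above; real databases use further formats, and the names PATTERNS, REFSEQ_PREFIXES and classify_accession are hypothetical.

```python
import re

# Only the formats mentioned on the slides; other valid formats exist.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d{6,}$")),
    ("1 letter + 5 digits", re.compile(r"^[A-Z]\d{5}$")),
    ("2 letters + 6 digits", re.compile(r"^[A-Z]{2}\d{6}$")),
]

# RefSeq prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def classify_accession(acc):
    """Return a short label for the accession format, or 'unrecognized'."""
    acc = acc.strip().upper()
    for label, pattern in PATTERNS:
        if pattern.match(acc):
            if label == "RefSeq":
                kind = REFSEQ_PREFIXES.get(acc[:2], "other RefSeq record")
                return f"RefSeq ({kind})"
            return label
    return "unrecognized"


for example in ("U12345", "AF123456", "NM_123456"):
    print(example, "->", classify_accession(example))
```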
Searching for Unique Identifiers
GI numbers: a series of digits that are assigned consecutively by NCBI to each sequence it processes.
GI stands for "GenInfo Identifier", a sequence identification number, in this case for the nucleotide sequence.
If a sequence changes in any way, a new GI number will be assigned.
A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way.
Searching for Unique Identifiers
Nucleotide sequence:
GI: 6995995
VERSION: NM_000492.2
Protein translation:
GI: 6995996
VERSION: NP_000483.2
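The VERSION values in this example have the form accession.version, where the number after the dot increases when the sequence changes. A minimal sketch of splitting such a value follows; the function name split_version is hypothetical.

```python
def split_version(version):
    """Split an 'accession.version' string into (accession, version number)."""
    accession, dot, number = version.strip().rpartition(".")
    if not dot or not accession or not number.isdigit():
        raise ValueError(f"not in accession.version form: {version!r}")
    return accession, int(number)


# The two VERSION values shown on the slide.
print(split_version("NM_000492.2"))
print(split_version("NP_000483.2"))
```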
EBI
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL).
The EBI is a centre for research and services in bioinformatics.
http://www.ebi.ac.uk
The Institute manages databases of biological
data including nucleic acid, protein sequences
and macromolecular structures.
EBI
It is the European node for globally coordinated efforts to collect and disseminate biological data.
Many of its databases are household names to biologists; they include EMBL-Bank (DNA and RNA sequences), Ensembl (genomes), ArrayExpress (microarray-based gene-expression data), UniProt (protein sequences), InterPro (protein families, domains and motifs) and PDBe (macromolecular structures).
Others, such as IntAct (protein-protein interactions), Reactome (pathways) and ChEBI (small molecules), are new resources that help researchers to understand not only the molecular parts that go towards constructing an organism, but how these parts combine to create systems.
The details of each database vary, but they all uphold the same principles of service provision.
EMBL-Bank
EMBL-Bank is produced as part of the International Nucleotide Sequence Database Collaboration (see side panel and figure).
Each of the three groups (DDBJ, EMBL-Bank and GenBank) collects a proportion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
EMBL-Bank contains over 150 million DNA and RNA sequences, ranging from as few as ten base pairs to entire genomes.
Its sequences come from three main sources: individual research groups, genome-sequencing projects and patent applications.
ENSEMBL
Ensembl provides a comprehensive resource for the scientific community which allows analysis of genetic information within and between species.
Hence, the resource is of use in a wide range of research fields, from evolutionary biology to clinical research.
Ensembl annotates chordate genomes (i.e. vertebrates and closely related invertebrates such as the sea squirt).
Gene sets from model organisms such as yeast and fly are also imported for comparative analysis.
All Ensembl genes are placed according to the experimental evidence of protein and mRNA sequences obtained from UniProt/Swiss-Prot, UniProt/TrEMBL and RefSeq.
Sequence data is obtained from relevant genome sequencing centres and consortia.
ENSEMBL
With Ensembl you can:
- Retrieve all or part of a genome sequence.
- Use the sequence alignment search tools BLAST and BLAT against any Ensembl genome.
- Link to genome annotation from microarray results.
- View expressed sequence tags (ESTs), clones, mRNA and proteins for any chromosomal region.
- Examine genes, markers and single nucleotide polymorphisms (SNPs) in a chromosomal region.
- View variations such as SNPs across strains (rat, mouse) or populations (human).
- View all alternative transcripts (splice variants) for a gene.
UniProt
UniProt is produced by the UniProt Consortium, a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
UniProt comprises four components:
- The UniProt Knowledgebase (UniProtKB)
- UniProt Reference Clusters (UniRef)
- UniProt Archive (UniParc)
- UniProt Metagenomic and Environmental Sequences (UniMES)
UniProt
The UniProt Knowledgebase, and in particular
UniProtKB/Swiss-Prot, is used to access
functional information on proteins.
Every UniProtKB entry contains the amino acid sequence, protein name or description, taxonomic data and citation information; in addition, annotations are added.
These include widely accepted biological ontologies, classifications and cross-references, as well as clear indications of the quality of annotation in the form of evidence attribution to experimental and computational data.
UniProt
The UniRef databases provide clustered
sets of sequences from UniProtKB and
selected UniParc records to provide
complete coverage of sequence space at
several resolutions.
UniRef90 and UniRef50 yield a database
size reduction of approximately 40% and
65%, respectively, providing significantly faster sequence searches.
UniProt
UniParc is the most comprehensive publicly accessible non-redundant protein sequence database available, providing links to all underlying sources and versions of these sequences.
You can instantly find out whether a sequence of interest is already in the public domain and, if not, identify its closest relatives.
UniProt
UniMES is a repository specifically for
metagenomic and environmental data.
DAY 1
30/09/09
Session II
Sequence Alignment
Quite simply, the comparison of two or
more DNA or protein sequences to each
other.
The purpose of alignment is to highlight
similarity between the sequences.
SEQUENCE ALIGNMENT
Sequence alignment is a standard
technique in bioinformatics for visualizing
the relationships between residues in a
collection of evolutionarily or structurally
related proteins.
An alignment provides a bird's-eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format.
SEQUENCE ALIGNMENT
Fundamental assumption of sequence alignment:
Sequences that are similar share a common ancestral sequence.
Due to common ancestry, similar sequences have similar functionality.
Sequences that share a common ancestor are said to be homologous.
SEQUENCE ALIGNMENT
Why do we have divergent copies of the same sequence in genomes?
Speciation Events
- Divergence of a single species into two or more new species
Gene Duplication
- Errors in replication
SEQUENCE ALIGNMENT
What kind of Alignment?
- Global vs. Local
- Pairwise vs. Multiple Sequence Alignment
BLAST
The comparison of nucleotide or protein sequences from the same or different organisms is a very powerful tool in molecular biology.
By finding similarities between sequences, scientists can infer the function of newly sequenced genes, predict new members of gene families, and explore evolutionary relationships.
Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and transcription-regulation regions in genomic DNA.
BLAST
Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity.
BLAST comes in variations for use with different query sequences against different databases.
All BLAST applications, as well as information on which BLAST program to use and other help documentation, are listed on the BLAST homepage (http://www.ncbi.nlm.nih.gov/BLAST/).
BLAST
Nucleotide BLAST searches allow one to input nucleotide sequences and compare these against other nucleotide sequences.
- Standard nucleotide-nucleotide BLAST: takes nucleotide sequences in FASTA format, GenBank Accession numbers or GI numbers and compares them against the NCBI nucleotide databases.
- MEGABLAST: this program uses a "greedy algorithm" (Webb Miller et al.) for nucleotide sequence alignment searches and concatenates many queries to save time spent scanning the database. It is optimized for aligning sequences that differ slightly and is up to 10 times faster than more common sequence similarity programs. It can be used to swiftly compare two large sets of sequences against each other.
BLAST
Protein BLAST allows one to input protein sequences and compare these against other protein sequences.
- Standard protein-protein BLAST: takes protein sequences in FASTA format, GenBank Accession numbers or GI numbers and compares them against the NCBI protein databases.
- PSI-BLAST: Position Specific Iterated BLAST uses an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.
- PHI-BLAST: Pattern Hit Initiated BLAST combines matching of a regular expression pattern with a Position Specific iterative protein search. PHI-BLAST can locate other protein sequences which both contain the regular expression pattern and are homologous to a query protein sequence.
BLAST
Pairwise BLAST performs a comparison between two sequences using the BLAST algorithm. Note that the program considers "Sequence 1" to be the Query sequence and "Sequence 2" to be the Subject sequence. There are the following program options:
- blastn: for nucleotide-nucleotide comparisons
- blastp: for protein-protein comparisons
- tblastn: compares the protein "Sequence 1" against the nucleotide "Sequence 2", which has been translated in all six reading frames
- blastx: compares the nucleotide "Sequence 1", translated in all six reading frames, against the protein "Sequence 2"
- tblastx: compares nucleotide "Sequence 1" translated in all six reading frames against the nucleotide "Sequence 2" translated in all six reading frames.
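"Translated in all six reading frames" means three frames on the given strand plus three on its reverse complement. The sketch below illustrates the idea with the standard genetic code; it is not how BLAST is implemented, and the names translate, reverse_complement and six_frames are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```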
BLAST
Why would you translate a nucleotide sequence to protein before comparing it?
More information in proteins!
- Detect more distant homology.
BLAST
Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species.
The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity.
The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis.
If instead BLAST started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.
BLAST
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.
To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
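The two deductions described here, a fixed cost for opening a gap and a further cost for each position it covers, can be sketched as follows. The default values 11 and 1 are illustrative assumptions rather than values taken from the slides, and the function names are hypothetical.

```python
import re


def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for a single gap spanning `length` positions."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def total_gap_penalty(aligned_query, aligned_subject, gap_open=11, gap_extend=1):
    """Sum the penalties of every run of '-' in either aligned sequence."""
    total = 0
    for row in (aligned_query, aligned_subject):
        for run in re.findall(r"-+", row):
            total += gap_penalty(len(run), gap_open, gap_extend)
    return total


# One gap of length 2 costs less than two separate gaps of length 1.
print(total_gap_penalty("AC--GT", "ACTTGT"))
print(total_gap_penalty("A-C-GT", "ATCTGT"))
```

Because opening a gap is much more expensive than extending one, this scheme favours a few long gaps over many short ones.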
BLAST
Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is good and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone.
BLAST uses statistical theory to produce a bit score and expect value (E-value) for each alignment pair (query to hit).
The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.
The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. A sequence alignment that has an E-value of 0.05 means that this similarity has a 5 in 100 (1 in 20) chance of occurring by chance alone.
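For intuition, the Karlin-Altschul statistics behind BLAST relate the two numbers by E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below assumes raw lengths (BLAST itself uses corrected "effective" lengths, so reported values differ), and the function names are hypothetical. The probability of at least one chance hit is P = 1 - exp(-E), which is close to E when E is small; that is why an E-value of 0.05 is read as roughly a 5 in 100 chance.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of seeing at least one such hit by chance."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E


# A larger database makes the same bit score less significant.
for db_len in (1_000_000, 100_000_000):
    e = expect_value(40, 250, db_len)
    print(db_len, e, p_value(e))
```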
The BLAST report header.
The top line gives information about the type of program (in this case,
BLASTP), the version (2.2.1), and a version release date.
The research paper that describes BLAST is then cited, followed by the
request ID (issued by QBLAST), the query sequence definition line, and
a summary of the database searched.
The Taxonomy reports link displays this BLAST result on the basis of
information in the Taxonomy database
Graphical overview of BLAST results.
The query sequence is represented by the numbered red bar at the top of the figure.
Database hits are shown aligned to the query, below the red bar.
Of the aligned sequences, the most similar are shown closest to the query.
In this case, there are three high-scoring database matches that align to most of the query sequence.
The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3-60 and residues 220-500.
The cross-hatched parts of these bars indicate that the two regions of similarity are on the same protein, but that this intervening region does not match.
The remaining bars show lower-scoring alignments.
Mousing over the bars causes the definition line for that sequence to be shown in the window above the graphic.
One-line descriptions in the BLAST report.
Each line is composed of four fields:
(a) the gi number, database designation, Accession number, and locus name for the matched sequence, separated by vertical bars (Appendix 1);
(b) a brief textual description of the sequence, the definition. This usually includes information on the organism from which the sequence was derived, the type of sequence (e.g., mRNA or DNA), and some information about function or phenotype. The definition line is often truncated in the one-line descriptions to keep the display compact;
(c) the alignment score in bits. Higher scoring hits are found at the top of the list; and
(d) the E-value, which provides an estimate of statistical significance.
For the first hit in the list, the gi number is 116365, the database designation is sp (for SWISS-PROT), the Accession number is P26374, the locus name is RAE2_HUMAN, the definition line is Rab proteins, the score is 1216, and the E-value is 0.0.
Note that the first 17 hits have very low E-values (much less than 1) and are either RAB proteins or GDP dissociation inhibitors.
The other database matches have much higher E-values, 0.5 and above, which means that these sequences may have been matched by chance alone.
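Field (a) of a one-line description is a bar-separated identifier, so the first hit described above would read gi|116365|sp|P26374|RAE2_HUMAN. A minimal sketch of splitting such an identifier follows; it assumes exactly this five-part layout, and the function name parse_identifier is hypothetical.

```python
def parse_identifier(field):
    """Split 'gi|number|database|accession|locus' into a dictionary."""
    parts = field.strip().split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {field!r}")
    _, gi, database, accession, locus = parts
    return {
        "gi": int(gi),
        "database": database,
        "accession": accession,
        "locus": locus,
    }


print(parse_identifier("gi|116365|sp|P26374|RAE2_HUMAN"))
```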
A pairwise sequence alignment from a BLAST report.
The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence, in amino acids.
Next comes the bit score (the raw score is in parentheses) and then the E-value.
The following line contains information on the number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and, if applicable, the number of gaps in the alignment.
Finally, the actual alignment is shown, with the query on top, and the database match, labeled as Sbjct, below.
The numbers at left and right refer to the position in the amino acid sequence.
One or more dashes (-) within a sequence indicate insertions or deletions.
Amino acid residues in the query sequence that have been masked because of low complexity are replaced by Xs (see, for example, the fourth and last blocks).
The line between the two sequences indicates the similarities between the sequences.
If the query and the subject have the same amino acid at a given location, the residue itself is shown.
Conservative substitutions, as judged by the substitution matrix, are indicated with +.
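The middle line of the alignment can be rebuilt from the two aligned rows. The sketch below is an illustration only: BLAST decides what counts as a conservative substitution from the substitution matrix, whereas here a small hypothetical set of residue pairs (CONSERVATIVE) stands in for the matrix, and the function name midline is also hypothetical.

```python
# Hypothetical, deliberately tiny stand-in for "positive score in the matrix".
CONSERVATIVE = {frozenset(p) for p in ("IL", "IV", "LV", "KR", "DE", "ST", "FY")}


def midline(query, sbjct):
    """Build the line shown between Query and Sbjct in a BLAST alignment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    out = []
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            out.append(" ")   # gap position: nothing is shown
        elif q == s:
            out.append(q)     # identity: the residue itself is shown
        elif frozenset((q, s)) in CONSERVATIVE:
            out.append("+")   # conservative substitution
        else:
            out.append(" ")   # non-conservative mismatch
    return "".join(out)


query, sbjct = "MKV-LD", "MRIALE"
print("Query", query)
print("     ", midline(query, sbjct))
print("Sbjct", sbjct)
```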
BLAST
Steps to BLAST a particular sequence
- Search all databases for any protein/gene name
- Click on the hit obtained and derive the FASTA sequence
- Go to the NCBI BLAST home page
- Choose the type of BLAST program to be used
- Copy the FASTA format of the sequence and paste it in the search field.
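A FASTA record is a header line starting with ">" followed by one or more lines of sequence. Before pasting text into the search field it can help to check that it parses as expected. A minimal sketch follows; the function name parse_fasta and the example record are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without spaces
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_protein made-up record for illustration
MKVLAAGIVG
LLLAQ
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence), sequence)
```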
DAY 1
30/09/09
Session III
BLAST ORTHOLOGS AND PARALOGS
Homology refers to any similarity between characteristics of organisms that is due to their shared ancestry.
Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous.
Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor.
BLAST ORTHOLOGS AND PARALOGS
Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous.
A set of sequences that are paralogous are called paralogs of each other.
Paralogs typically have the same or similar function, but sometimes do not: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions.
BLAST ORTHOLOGS AND PARALOGS
Orthologs and paralogs are two fundamentally different types of homologous genes that evolved, respectively, by vertical descent from a single ancestral gene and by duplication.
Orthology and paralogy are key concepts of evolutionary genomics.
A clear distinction between orthologs and paralogs is critical for the construction of a robust evolutionary classification of genes and reliable functional annotation of newly sequenced genomes.
Genome comparisons show that orthologous relationships with genes from taxonomically distant species can be established for the majority of the genes from each sequenced genome.
BLAST - Orthology
http://oxytricha.princeton.edu/BlastO/
- Derive a sequence in the FASTA format for a particular protein
- Paste it in the search query field
- Keep all parameters as default
- Click the search button and wait for the results to be displayed
- Database variation can change results
BLAST - Paralogy
Perform blastp with human angiogenin
Perform blastp with:
- Phosphoglycerate kinase
- Phospholipases
- Serine proteinase
- Zinc metalloproteinase
DAY 1
30/09/09
Session IV
Global Vs Local ALIGNMENT
SEQUENCE ALIGNMENT
A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.
The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program.
Global sequence alignment algorithms
align sequences over their entire lengths.
A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences.
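The difference from a global alignment can be sketched with the Smith-Waterman dynamic programme: scores are never allowed to fall below zero, and the traceback starts at the best-scoring cell rather than at the corner. The function name and the match, mismatch and gap values below are illustrative assumptions, not the defaults of any tool mentioned in this workshop.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Assumptions: simple match/mismatch scores and a linear gap penalty.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

In the example call the flanking T and C runs do not match each other, so only the shared GATTACA core is reported.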
EMBOSS Pairwise Alignment Algorithms
http://www.ebi.ac.uk/Tools/emboss/align/index.html
This tool is used to compare two sequences.
When you want an alignment that covers the whole length of both sequences, use needle.
When you are trying to find the best region of similarity between two sequences, use water.
Water aligns the best-matching subsequences of two sequences. It does not necessarily align whole sequences against each other; use needle if you wish to align closely related sequences along their whole lengths.
PAIRWISE SEQUENCE ALIGNMENT
The %id is the percentage of identical matches between the two sequences over the reported aligned region.
The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.
The Overall %id and Overall %similarity are calculated in a similar manner for the number of matches over the length of the longer of the two sequences.
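The definitions above can be sketched in a few lines of Python. This is an illustrative calculation on two already-aligned strings (gaps written as "-"), not the EMBOSS implementation; the function name, the dictionary keys and the optional `score` callable (a stand-in for a real scoring matrix) are assumptions.

```python
def alignment_stats(aln_a, aln_b, score=None):
    """Compute %id and %similarity for two aligned, equal-length strings.

    `score` is a hypothetical stand-in for a scoring matrix lookup; by
    default identical residues score +1 and everything else -1.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    if score is None:
        score = lambda x, y: 1 if x == y else -1
    aligned_len = len(aln_a)
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    identical = sum(1 for x, y in pairs if x == y)
    # "Similar" follows the slide: matrix value greater than or equal to 0.
    similar = sum(1 for x, y in pairs if score(x, y) >= 0)
    longest = max(len(aln_a.replace("-", "")), len(aln_b.replace("-", "")))
    return {
        "pct_id": 100.0 * identical / aligned_len,
        "pct_similarity": 100.0 * similar / aligned_len,
        "overall_pct_id": 100.0 * identical / longest,
        "overall_pct_similarity": 100.0 * similar / longest,
    }


if __name__ == "__main__":
    print(alignment_stats("ACGT", "A-GT"))
```

With the default scorer, similarity equals identity; passing a scorer that gives non-negative values to chemically similar residues (for example K and R) raises %similarity above %id.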
What will you use at NCBI for pairwise sequence alignment?
LALIGN
http://www.ch.embnet.org/software/LALIGN_form.html
MULTIPLE SEQUENCE ALIGNMENT
A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
http://pbil.univ-lyon1.fr/alignment.html
http://au.expasy.org/tools/#align
Multiple alignments of protein sequences are important tools in studying sequences.
The basic information they provide is identification of conserved sequence regions.
This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.
ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences.
The alignment is progressive and considers the sequence redundancy.
Trees can also be calculated from multiple alignments.
The program has some adjustable parameters with reasonable defaults.
http://www.ebi.ac.uk/Tools/clustalw2/index.html
Show Colors
A button labeled 'Show Colors' will be displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below.
NOTE: This option only works when you have chosen ALN or GCG as the output format.

Residues    Colour     Property
AVFPMILW    RED        Small (small + hydrophobic (incl. aromatic -Y))
DE          BLUE       Acidic
RK          MAGENTA    Basic
STYHCNGQ    GREEN      Hydroxyl + Amine + Basic - Q
Others      GRAY
CONSENSUS SYMBOLS:
An alignment will display by default the following symbols denoting the degree of conservation observed in each column:
"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
":" means that conserved substitutions have been observed, according to the COLOUR table above.
"." means that semi-conserved substitutions are observed.
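A rough Python sketch of how such a consensus line could be derived from the aligned rows is given below. It follows only what the slides state: "*" for fully identical columns and ":" when all residues in a column fall in the same colour group of the Show Colors table. The "." symbol is not reproduced because the slides do not define the semi-conserved groups, and the names `COLOUR_GROUPS`, `colour_of` and `consensus_line` are illustrative assumptions, not part of ClustalW.

```python
# Colour groups as listed in the Show Colors table of this workshop.
COLOUR_GROUPS = {
    "RED": "AVFPMILW",
    "BLUE": "DE",
    "MAGENTA": "RK",
    "GREEN": "STYHCNGQ",
}


def colour_of(residue):
    """Return the colour group of a residue, or GRAY for anything else."""
    r = residue.upper()
    for colour, members in COLOUR_GROUPS.items():
        if r in members:
            return colour
    return "GRAY"


def consensus_line(rows):
    """Build a consensus line for equal-length aligned rows.

    '*' = identical column, ':' = all residues share one colour group,
    ' ' = anything else (including columns that contain a gap).
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*rows):
        residues = {c.upper() for c in column}
        if "-" in residues:
            out.append(" ")
        elif len(residues) == 1:
            out.append("*")
        else:
            colours = {colour_of(c) for c in residues}
            out.append(":" if len(colours) == 1 and "GRAY" not in colours else " ")
    return "".join(out)


if __name__ == "__main__":
    rows = ["MKDE", "MRDE", "MKEA"]
    for row in rows:
        print(row)
    print(consensus_line(rows))
```

In the example, the first column is identical, the second mixes K and R (both MAGENTA), the third mixes D and E (both BLUE), and the last mixes groups, so it is left blank.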
PHYLOGENETIC TREE
A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny; branch lengths are proportional to the amount of inferred evolutionary change.
A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length; thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa.
Tree distances can be shown; just click on the diagram to get a menu of options. The ".dnd" file is a file that describes the phylogenetic tree.
These options are now controlled with new buttons in the output file as well as a pop-up menu that is available by right-clicking on the applet.
The buttons on the page include "Show as Phylogram Tree", "Show as Cladogram Tree" and "Show Distances".
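The ".dnd" tree file written by ClustalW is plain text in Newick (parenthesised) format, so the phylogram/cladogram distinction can be illustrated directly on such a string. The sketch below is a minimal regular-expression approach, assuming a simple tree without labelled internal nodes; the function names and the example tree are invented for illustration.

```python
import re

# A branch length is the number that follows a colon in Newick notation.
_BRANCH_LENGTH = re.compile(r":[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
# A leaf is a name (no brackets, commas, colons, semicolons) followed by a length.
_LEAF = re.compile(r"([^\s(),:;]+):([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def cladogram_topology(newick):
    """Drop all branch lengths, keeping only the branching order."""
    return _BRANCH_LENGTH.sub("", newick)


def leaf_branch_lengths(newick):
    """Return {leaf name: length of the branch leading to that leaf}."""
    return {name: float(length) for name, length in _LEAF.findall(newick)}


if __name__ == "__main__":
    tree = "((human:0.1,mouse:0.2):0.05,yeast:0.3);"  # made-up example tree
    print(cladogram_topology(tree))
    print(leaf_branch_lengths(tree))
```

A phylogram viewer uses the numbers returned by the second function to scale its branches, while a cladogram viewer needs only the topology returned by the first.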
MULTIPLE SEQUENCE ALIGNMENT
KALIGN
T-COFFEE
COBALT
DIALIGN
KALIGN
Kalign is a fast alignment method for protein and nucleotide sequences.
It uses a fast approximate string matching algorithm to estimate sequence distances quickly and accurately.
As a result, Kalign is very fast compared to other programs and can align 1500 sequences in under 10 seconds.
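To give a feel for why estimating distances without a full alignment is fast, here is a small Python sketch of a k-mer based distance. This is explicitly not Kalign's approximate string matching algorithm; it is a simpler stand-in technique (Jaccard distance between sets of k-letter words), and the function names and the choice of k = 3 are assumptions.

```python
def kmer_set(seq, k=3):
    """Return the set of all overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between the k-mer sets of two sequences.

    0.0 means the same k-mer content, 1.0 means no k-mer in common.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences shorter than k
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    print(kmer_distance("GATTACAGATTACA", "GATTACACATTACA"))
```

Each sequence is scanned once, so the cost grows with sequence length rather than with the product of the two lengths, as it does for dynamic-programming alignment.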
T-COFFEE
T-Coffee is a multiple sequence alignment program.
Multiple sequence alignment programs are meant to align a set of sequences previously gathered using other programs such as BLAST.
The main characteristic of T-Coffee is that it allows you to combine results obtained with several alignment methods.
For instance, if you have an alignment from ClustalW2, another alignment from Dialign, and a structural alignment of some of your sequences, T-Coffee will combine all that information and produce a new multiple sequence alignment having the best agreement with all these methods.
By default, T-Coffee will compare all your sequences two by two, producing a global alignment and a series of local alignments (using lalign). The program will then combine all these alignments into a multiple alignment.
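The idea of combining several pairwise alignments can be sketched as a small "library" that records how often each pair of residues is aligned across the input alignments. This is only a simplified illustration of the principle: the real program weights and extends its library in ways not shown here, and the function names and tuple layout below are assumptions.

```python
from collections import Counter


def aligned_pairs(aln_a, aln_b):
    """Yield (i, j) for every column where residue i of A faces residue j of B."""
    i = j = 0
    pairs = []
    for x, y in zip(aln_a, aln_b):
        if x != "-" and y != "-":
            pairs.append((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def build_library(alignments):
    """Count support for each residue pair.

    `alignments` is an iterable of (name_a, aligned_a, name_b, aligned_b)
    tuples, for example one global and several local pairwise alignments.
    """
    library = Counter()
    for name_a, aln_a, name_b, aln_b in alignments:
        for i, j in aligned_pairs(aln_a, aln_b):
            library[(name_a, i, name_b, j)] += 1
    return library


if __name__ == "__main__":
    lib = build_library([
        ("s1", "ACGT", "s2", "A-GT"),  # e.g. from a global method
        ("s1", "ACGT", "s2", "AG-T"),  # e.g. from a different method
    ])
    for key, support in sorted(lib.items()):
        print(key, support)
```

Residue pairs supported by both input alignments get a count of 2, and a final multiple alignment would favour placing those pairs in the same column.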
COBALT
COBALT (Constraint-based Multiple Alignment Tool) is a multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.
Pairwise constraints are then incorporated into a progressive multiple alignment.
DIALIGN
DIALIGN is a software program for multiple sequence alignment developed by Burkhard Morgenstern et al.
While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing entire segments of the sequences.
No gap penalty is used.
This approach can be used for both global and local alignment, but it is particularly successful in situations where sequences share only local homologies.
http://bibiserv.techfak.uni-bielefeld.de/dialign/submission.html
Names of the aligned sequences are shown on the left-hand side of the alignment.
Numbers on the left-hand side of the alignment denote the position of the first residue in a line within the respective sequence.
Capital letters denote aligned residues.
Lower-case letters denote residues not considered to be aligned by DIALIGN. Thus, if a lower-case letter is standing in the same column with other letters, this is pure chance; these residues are not considered to be homologous.
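The upper-case/lower-case convention is easy to apply programmatically. The Python sketch below takes the residue rows of such an output (names and position numbers already stripped, which is an assumption about how the text was prepared) and reports the columns in which at least two sequences carry a capital letter; the function name is illustrative.

```python
def aligned_columns(rows):
    """Return indices of columns where two or more rows have an upper-case residue.

    Lower-case letters and gap characters are ignored, since they are
    not considered aligned.
    """
    columns = []
    for index, column in enumerate(zip(*rows)):
        aligned = [c for c in column if c.isalpha() and c.isupper()]
        if len(aligned) >= 2:
            columns.append(index)
    return columns


if __name__ == "__main__":
    rows = ["acGATTAca",
            "--GATTAgg"]
    print(aligned_columns(rows))
```

In the example, the lower-case letters in the last two columns share columns with other letters, but they are not reported because they are not considered homologous.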
The number of `*' characters below the alignment reflects the degree of local similarity among sequences.
The number of `*' characters is normalized such that regions of maximum similarity have N `*' characters per column. N can be specified by the user. By default, N = 5. Note that the number of `*' characters depicts the relative degree of similarity within an alignment, since in every alignment, the region of maximum similarity gets