tics Day 1

8/8/2019 tics Day 1

WELCOME TO THE WORKSHOP ON

BIOINFORMATICS

M.Sc II REVISED SYLLABUS

DAY 1

30/09/09

INTRODUCTION

NCBI

National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/

Entrez, the Life Sciences Search Engine

The Entrez page is home to the Entrez Global Query database search engine (the Entrez cross-database search page, http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi).

The entire group of individual Entrez databases is organized on this page, with literature databases at the top, including PubMed, PubMed Central, Journals, Books, OMIM and OMIA.

NCBI

The NCBI Site Search is also listed.

The sequence databases include Nucleotide, Protein, Genome, Structure, and SNPs.

The remaining databases are Taxonomy, Gene, UniGene, HomoloGene, Conserved Domains, 3D Domains, UniSTS, PopSet, GEO Profiles, GEO Datasets, PubChem Bio-Assay, PubChem Compound, PubChem Substance, Cancer Chromosomes, Probe, MeSH, Journals and NLM Catalog.

Links to popular NCBI Web pages, such as PubMed, Human Genome, Map Viewer, and BLAST, are on the toolbar.

There is also a link to the "GenBank" database, leading to the Nucleotide database.

NCBI

By using the Entrez Global Query, a search across all Entrez databases is performed by entering a simple search term or phrase in the "Search across databases" query box.

Select the Go button to execute the search, or press the Enter key on your keyboard.

The CLEAR button erases search terms in the query box; use it to begin a new search.

The results found in each database are displayed on the Global Query page. Click on the result number or its adjacent database name to get to the specific results.

See the link to the Global Query Help document, which is to the right of the CLEAR button.

Nucleotide Database

When a search is done in the Nucleotide database, Entrez search results are also shown for the three component Nucleotide databases on the Search statistic line.

The component Nucleotide databases together contain all the sequence data from GenBank, EMBL, and DDBJ, the members of the International Collaboration of Sequence Databases.

The new component databases are included within the Entrez linking scheme, and links within and between databases can be selected as usual from the various datasets.

Popular search strategies such as Limits, Preview/Index, History, and MyNCBI can be used within each individual database.

The Nucleotide database also includes the Reference Sequence (RefSeq) records. RefSeqs are an NCBI-curated, non-redundant set of sequences.

Protein Database

The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to the Protein Information Resource (PIR), SWISS-PROT, the Protein Research Foundation (PRF), and the Protein Data Bank (PDB) (sequences from solved structures).

Genome Database

The Genome database provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps.

Structure Database

The Structure database, or Molecular Modeling Database (MMDB), contains experimental data from crystallographic and NMR structure determinations.

The data for MMDB are obtained from the Protein Data Bank (PDB).

The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.

Conserved Domains

Conserved Domains is a database of protein domains.

The source databases for Conserved Domains are Pfam, SMART, and COG.

3D Domains

3D Domains contains protein domains from the Entrez Structure Database.

UniSTS

UniSTS is a unified, non-redundant view of sequence tagged sites (STSs).

UniSTS integrates marker and mapping data from a variety of public resources.

Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, and WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, and Jackson Laboratory's MGD map).

Gene

Gene provides a unified query environment for genes defined by sequence and/or in NCBI's Map Viewer.

You can query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode.

Taxonomy Database

The Taxonomy database contains the names of all organisms that are represented in the NCBI genetic database by at least one nucleotide or protein sequence.

PubMed Central

PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature.

Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.

Journals

The Journals database can be searched using the journal title, MEDLINE abbreviation, NLM ID, ISO abbreviation, or ISSN.

The database includes the journals in all Entrez databases, e.g., PubMed, Nucleotide, Protein.

MeSH

MeSH (Medical Subject Headings) is the National Library of Medicine's controlled vocabulary used for indexing articles in PubMed.

MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.

Bookshelf

The Bookshelf has a collection of biomedical books that are linked in Entrez.

The NCBI Handbook is also available from the Bookshelf.

OMIM Database

The OMIM (Online Mendelian Inheritance in Man) database is a catalog of human genes and genetic disorders.

OMIA Database

Online Mendelian Inheritance in Animals (OMIA) is a database of genes, inherited disorders and traits in animal species (other than human and mouse) authored by Professor Frank Nicholas of the University of Sydney, Australia, with help from many people over the years.

The database contains textual information and references, as well as links to relevant records from OMIM, PubMed, Gene, and soon to NCBI's Phenotype database.

Boolean Operators used in Entrez

- AND: To AND two search terms together instructs Entrez to find all documents that contain BOTH terms.
- OR: To OR two search terms together instructs Entrez to find all documents that contain EITHER term.
- NOT: To NOT two search terms together instructs Entrez to find all documents that contain search term 1 BUT NOT search term 2.

Boolean operators AND, OR, NOT must be entered in UPPERCASE (e.g., promoters OR response elements).
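A minimal Python sketch of how such a query could be assembled. The helper names combine and gquery_link are hypothetical, the base URL is the gquery address that appears in these slides, and whether that address still resolves is not checked here.

```python
from urllib.parse import urlencode

# Base URL as it appears in the slides (Entrez cross-database search).
GQUERY_URL = "http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi"


def combine(operator, left, right):
    """Join two search terms with a Boolean operator, forced to UPPERCASE."""
    op = operator.upper()
    if op not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f"{left} {op} {right}"


def gquery_link(term):
    """Return a link that carries the search term in the 'term' parameter."""
    return GQUERY_URL + "?" + urlencode({"term": term})


term = combine("or", "promoters", "response elements")
link = gquery_link(term)
print(term)
print(link)
```

The operator is upper-cased in one place so that a lower-case "or" typed by the user is not silently treated as an ordinary search word.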

Searching for Unique Identifiers

Unique identifiers can be accession numbers, which apply to a complete sequence record, or sequence identification numbers, which apply to the individual sequences within a record.

The format of accession numbers varies, depending upon the source database.

Each data domain in Entrez contains records from a number of different sources.

Accession number: the unique identifier for a sequence record.

An accession number applies to the complete record and is usually a combination of a letter(s) and numbers, such as a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456).

Some accessions might be longer, depending on the type of sequence record.

Accession numbers do not change, even if information in the record is changed at the author's request.

Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore and six or more digits, for example:

- NT_123456 constructed genomic contigs
- NM_123456 mRNAs
- NP_123456 proteins
- NC_123456 chromosomes

http://www.ncbi.nlm.nih.gov/RefSeq/
http://www.ncbi.nlm.nih.gov/RefSeq/key.html
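The accession formats described in these slides can be sketched as regular expressions. This is an illustration only: the patterns cover just the formats named above (longer accessions exist), and classify_accession is a hypothetical helper, not an NCBI function.

```python
import re

# Two letters, underscore, six or more digits (RefSeq style from the slides).
REFSEQ = re.compile(r"^([A-Z]{2})_(\d{6,})$")
# One letter + five digits, or two letters + six digits.
GENBANK = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")

# Only the four prefixes listed in the slides.
REFSEQ_TYPES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def classify_accession(acc):
    """Give a rough label for an accession string."""
    m = REFSEQ.match(acc)
    if m:
        kind = REFSEQ_TYPES.get(m.group(1), "record (prefix not listed in the slides)")
        return "RefSeq " + kind
    if GENBANK.match(acc):
        return "GenBank/EMBL/DDBJ-style accession"
    return "unrecognised"


for acc in ("U12345", "AF123456", "NM_123456", "NC_123456"):
    print(acc, "->", classify_accession(acc))
```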

GI numbers: a series of digits that are assigned consecutively by NCBI to each sequence it processes.

The "GenInfo Identifier" is a sequence identification number, in this case, for the nucleotide sequence.

If a sequence changes in any way, a new GI number will be assigned.

A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way.

Nucleotide sequence:

- GI: 6995995
- VERSION: NM_000492.2

Protein translation:

- GI: 6995996
- VERSION: NP_000483.2
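A small sketch of how the VERSION values above could be taken apart, assuming the field has the form accession.version (the number after the dot being the sequence version). The function name split_version and the records dictionary are hypothetical.

```python
def split_version(version_field):
    """Split 'NM_000492.2' into ('NM_000492', 2)."""
    accession, _, version = version_field.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version string: {version_field!r}")
    return accession, int(version)


# Values copied from the slide above.
records = {
    "nucleotide": {"gi": 6995995, "version": "NM_000492.2"},
    "protein": {"gi": 6995996, "version": "NP_000483.2"},
}

parsed = {kind: split_version(rec["version"]) for kind, rec in records.items()}
print(parsed)
```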


EBI

The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL, http://www.embl.org/).

The EBI is a centre for research and services in bioinformatics.

http://www.ebi.ac.uk

The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures.

EBI

It is the European node for globally coordinated efforts to collect and disseminate biological data.

Many of their databases are household names to biologists. They include:

- EMBL-Bank (DNA and RNA sequences), http://www.ebi.ac.uk/embl/
- Ensembl (genomes), http://www.ensembl.org/
- ArrayExpress (microarray-based gene-expression data), http://www.ebi.ac.uk/arrayexpress/
- UniProt (protein sequences), http://www.ebi.ac.uk/uniprot/
- InterPro (protein families, domains and motifs), http://www.ebi.ac.uk/interpro/
- PDBe (macromolecular structures), http://www.ebi.ac.uk/pdbe/

Others, such as IntAct (protein-protein interactions, http://www.ebi.ac.uk/intact/), Reactome (pathways, http://www.reactome.org/) and ChEBI (small molecules, http://www.ebi.ac.uk/chebi/), are new resources that help researchers to understand not only the molecular parts that go towards constructing an organism, but how these parts combine to create systems.

The details of each database vary, but they all uphold the same principles of service provision.

EMBL-Bank

EMBL-Bank is produced as part of the International Nucleotide Sequence Database Collaboration (see side panel and figure).

Each of the three groups (DDBJ, EMBL-Bank, GenBank) collects a proportion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

EMBL-Bank contains over 150 million DNA and RNA sequences, ranging from as few as ten base pairs to entire genomes.

Its sequences come from three main sources: individual research groups, genome-sequencing projects and patent applications.

ENSEMBL

Ensembl provides a comprehensive resource for the scientific community which allows analysis of genetic information within and between species.

Hence, the resource is of use in a wide range of research fields, from evolutionary biology to clinical research.

Ensembl annotates chordate genomes (i.e. vertebrates and closely related invertebrates such as the sea squirt).

Gene sets from model organisms such as yeast and fly are also imported for comparative analysis.

All Ensembl genes are placed according to the experimental evidence of protein and mRNA sequences obtained from UniProt/Swiss-Prot, UniProt/TrEMBL and RefSeq.

Sequence data is obtained from relevant genome sequencing centres and consortia.

ENSEMBL

With Ensembl you can:

- Retrieve all or part of a genome sequence.
- Use the sequence alignment search tools BLAST and BLAT against any Ensembl genome.
- Link to genome annotation from microarray results.
- View expressed sequence tags (ESTs), clones, mRNA and proteins for any chromosomal region.
- Examine genes, markers and single nucleotide polymorphisms (SNPs) in a chromosomal region.
- View variations such as SNPs across strains (rat, mouse) or populations (human).
- View all alternative transcripts (splice variants) for a gene.

UniProt

UniProt is produced by the UniProt Consortium, a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).

UniProt comprises four components:

- The UniProt Knowledgebase (UniProtKB)
- UniProt Reference Clusters (UniRef)
- UniProt Archive (UniParc)
- UniProt Metagenomic and Environmental Sequences (UniMES)

UniProt

The UniProt Knowledgebase, and in particular UniProtKB/Swiss-Prot, is used to access functional information on proteins.

Every UniProtKB entry contains the amino acid sequence, protein name or description, taxonomic data and citation information; in addition to this, annotations are added.

These include widely accepted biological ontologies, classifications and cross-references, as well as clear indications of the quality of annotation in the form of evidence attribution to experimental and computational data.

UniProt

The UniRef databases provide clustered sets of sequences from UniProtKB and selected UniParc records to provide complete coverage of sequence space at several resolutions.

UniRef90 and UniRef50 yield a database size reduction of approximately 40% and 65%, respectively, providing significantly faster sequence searches.

UniProt

UniParc is the most comprehensive publicly accessible non-redundant protein sequence database available, providing links to all underlying sources and versions of these sequences.

You can instantly find out whether a sequence of interest is already in the public domain and, if not, identify its closest relatives.

UniProt

UniMES is a repository specifically for metagenomic and environmental data.

DAY 1

30/09/09

Session II

Sequence Alignment

Quite simply, the comparison of two or more DNA or protein sequences to each other.

The purpose of alignment is to highlight similarity between the sequences.

SEQUENCE ALIGNMENT

Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins.

An alignment provides a bird's-eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format.

SEQUENCE ALIGNMENT

Fundamental assumption of sequence alignment:

- Sequences that are similar share a common ancestral sequence.
- Due to common ancestry, similar sequences have similar functionality.
- Sequences that share a common ancestor are said to be homologous.

SEQUENCE ALIGNMENT

Why do we have divergent copies of the same sequence in genomes?

Speciation Events

- Divergence of a single species into two or more new species

Gene Duplication

- Errors in replication

SEQUENCE ALIGNMENT

What kind of Alignment?

- Global vs. Local
- Pairwise vs. Multiple Sequence Alignment

BLAST

The comparison of nucleotide or protein sequences from the same or different organisms is a very powerful tool in molecular biology.

By finding similarities between sequences, scientists can infer the function of newly sequenced genes, predict new members of gene families, and explore evolutionary relationships.

Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and transcription-regulation regions in genomic DNA.

BLAST

Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity.

BLAST comes in variations for use with different query sequences against different databases.

All BLAST applications, as well as information on which BLAST program to use and other help documentation, are listed on the BLAST homepage: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST

Nucleotide BLAST searches allow one to input nucleotide sequences and compare these against other nucleotides.

- Standard nucleotide-nucleotide BLAST - Takes nucleotide sequences in FASTA format (http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html), GenBank Accession numbers or GI numbers and compares them against the NCBI nucleotide databases.
- MEGABLAST - This program uses a "greedy algorithm" (Webb Miller et al., http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10890397&dopt=Abstract) for nucleotide sequence alignment searches and concatenates many queries to save time spent scanning the database. It is optimized for aligning sequences that differ slightly and is up to 10 times faster than more common sequence similarity programs. It can be used to swiftly compare two large sets of sequences against each other.

BLAST

Protein BLAST allows one to input protein sequences and compare these against other protein sequences.

- Standard protein-protein BLAST - Takes protein sequences in FASTA format (http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html), GenBank Accession numbers or GI numbers and compares them against the NCBI protein databases.
- PSI-BLAST - Position Specific Iterated BLAST uses an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.
- PHI-BLAST - Pattern Hit Initiated BLAST combines matching of a regular expression pattern with a Position Specific iterative protein search. PHI-BLAST can locate other protein sequences which both contain the regular expression pattern and are homologous to a query protein sequence.

BLAST

Pairwise BLAST performs a comparison between two sequences using the BLAST algorithm. Note that the program considers "Sequence 1" to be the Query sequence and "Sequence 2" to be the Subject sequence. There are the following program options:

- blastn - for nucleotide - nucleotide comparisons
- blastp - for protein - protein comparisons
- tblastn - compares the protein "Sequence 1" against the nucleotide "Sequence 2" which has been translated in all six reading frames
- blastx - compares the nucleotide "Sequence 1", translated in all six reading frames, against the protein "Sequence 2"
- tblastx - compares nucleotide "Sequence 1" translated in all six reading frames against the nucleotide "Sequence 2" translated in all six reading frames.
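To make "translated in all six reading frames" concrete, here is a minimal Python sketch: three frames on the given strand and three on its reverse complement. It assumes the standard genetic code and plain A/C/G/T input; six_frames and the other names are hypothetical helpers, not part of BLAST.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCC").items():
    print(name, protein)
```

Stop codons are written as '*', so a frame full of asterisks is usually not the coding frame.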

BLAST

Why would you translate nucleotide sequence to protein before comparing it?

More information in proteins!

- Detect more distant homology.

BLAST

Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species.

The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity.

The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis.

If instead BLAST started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.

BLAST

A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.

To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.

Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
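The gap scoring described above can be sketched in a few lines of Python. The numbers used here (match +1, mismatch -1, gap opening -5, gap extension -2) are illustrative assumptions, not BLAST defaults, and the convention that the first gap position costs only the opening penalty is one of several in use.

```python
def alignment_score(query, subject, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two already-aligned, equal-length strings; '-' marks a gap."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            # First gap position pays the opening cost, later ones the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if q == s else mismatch
            in_gap = False
    return score


print(alignment_score("ACGTAC", "AC--AC"))  # one gap of length two
print(alignment_score("ACGTAC", "A-G-AC"))  # two separate gaps of length one
```

With these values one long gap costs less than two short ones, which is the behaviour the separate opening and extension penalties are meant to produce. For simplicity the sketch does not distinguish a gap in the query from an adjacent gap in the subject.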

BLAST

Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is "good" and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone.

BLAST uses statistical theory (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html) to produce a bit score and expect value (E-value; http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.glossary.1237) for each alignment pair (query to hit).

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.

The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An alignment with an E-value of 0.05 means that about 0.05 hits scoring at least this well are expected by chance in a search of this database, which corresponds to roughly a 5 in 100 (1 in 20) chance of the similarity occurring by chance alone.

The BLAST report header.
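The relationship between bit score, search space and E-value can be sketched as follows. It uses the Karlin-Altschul form E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. Treating m and n as raw lengths is a simplifying assumption (BLAST applies length corrections), and the function names are hypothetical.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit scoring this well: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative numbers only: a 100-residue query against 2**20 residues.
e = expect_value(bit_score=30, query_length=100, database_length=2 ** 20)
print(e, chance_probability(e))
print(chance_probability(0.05))
```

For small E-values the probability is almost equal to the E-value itself, which is why an E-value of 0.05 is commonly read as "about 1 in 20". Each extra bit of score halves the E-value, and a larger database raises it.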

The top line gives information about the type of program (in this case, BLASTP), the version (2.2.1), and a version release date.

The research paper that describes BLAST is then cited, followed by the request ID (issued by QBLAST), the query sequence definition line, and a summary of the database searched.

The Taxonomy reports link displays this BLAST result on the basis of information in the Taxonomy database.

Graphical overview of BLAST results.

The query sequence is represented by the numbered red bar at the top of the figure.

Database hits are shown aligned to the query, below the red bar. Of the aligned sequences, the most similar are shown closest to the query.

In this case, there are three high-scoring database matches that align to most of the query sequence.

The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3-60 and residues 220-500.

The cross-hatched parts of these bars indicate that the two regions of similarity are on the same protein, but that this intervening region does not match.

The remaining bars show lower-scoring alignments.

Mousing over a bar causes the definition line for that sequence to be shown in the window above the graphic.

One-line descriptions in the BLAST report.

Each line is composed of four fields:

(a) the gi number, database designation, Accession number, and locus name for the matched sequence, separated by vertical bars (Appendix 1);

(b) a brief textual description of the sequence, the definition. This usually includes information on the organism from which the sequence was derived, the type of sequence (e.g., mRNA or DNA), and some information about function or phenotype. The definition line is often truncated in the one-line descriptions to keep the display compact;

(c) the alignment score in bits. Higher scoring hits are found at the top of the list; and

(d) the E-value, which provides an estimate of statistical significance.

For the first hit in the list, the gi number is 116365, the database designation is sp (for SWISS-PROT), the Accession number is P26374, the locus name is RAE2_HUMAN, the definition line is Rab proteins, the score is 1216, and the E-value is 0.0.

Note that the first 17 hits have very low E-values (much less than 1) and are either RAB proteins or GDP dissociation inhibitors.

The other database matches have much higher E-values, 0.5 and above, which means that these sequences may have been matched by chance alone.
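The first field of a one-line description, with its parts separated by vertical bars, can be split programmatically. The sketch below assembles the identifier from the values quoted above for the first hit; the function name parse_identifier is hypothetical, and only this five-part layout is handled.

```python
def parse_identifier(identifier):
    """Split a 'gi|number|db|accession|locus' identifier into named fields."""
    parts = identifier.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    _, gi, database, accession, locus = parts
    return {
        "gi": int(gi),
        "database": database,
        "accession": accession,
        "locus": locus,
    }


# Built from the values given for the first hit in the slides.
hit = parse_identifier("gi|116365|sp|P26374|RAE2_HUMAN")
print(hit)
```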

A pairwise sequence alignment from a BLAST report.

The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence, in amino acids.

Next comes the bit score (the raw score is in parentheses) and then the E-value.

The following line contains information on the number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and, if applicable, the number of gaps in the alignment.

Finally, the actual alignment is shown, with the query on top, and the database match, labeled as Sbjct, below.

The numbers at left and right refer to the position in the amino acid sequence.

One or more dashes (-) within a sequence indicate insertions or deletions.

Amino acid residues in the query sequence that have been masked because of low complexity are replaced by Xs (see, for example, the fourth and last blocks).

The line between the two sequences indicates the similarities between the sequences. If the query and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the substitution matrix, are indicated with +.
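The rule for the line between query and subject can be sketched in Python. A real report takes "conservative" from the substitution matrix; here a tiny hand-written set of residue pairs (POSITIVE_PAIRS) stands in for "pairs with a positive matrix score", and both the set and the two sequences are made up for illustration.

```python
def midline(query, subject, positives=frozenset()):
    """Build the line printed between Query and Sbjct in a pairwise alignment."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")      # gap column: nothing to show
        elif q == s:
            out.append(q)        # identical residue is repeated
        elif frozenset((q, s)) in positives:
            out.append("+")      # conservative substitution
        else:
            out.append(" ")      # non-conservative mismatch
    return "".join(out)


# Hypothetical stand-in for pairs with a positive substitution score.
POSITIVE_PAIRS = frozenset({frozenset("IL"), frozenset("KR"), frozenset("DE")})

query = "MKIL-DE"
sbjct = "MRLLADQ"
line = midline(query, sbjct, POSITIVE_PAIRS)
print("Query", query)
print("     ", line)
print("Sbjct", sbjct)
```

Counting the letters in the middle line gives Identities, and letters plus '+' signs gives Positives.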

BLAST

Steps to BLAST a particular sequence

- Search all databases for any protein/gene name.
- Click on the hit obtained and derive the FASTA sequence.
- Go to the NCBI BLAST home page.
- Choose the type of BLAST program to be used.
- Copy the FASTA format of the sequence and paste it in the search field.
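For readers who have not met the FASTA format used in the steps above: it is a header line starting with ">" followed by one or more lines of sequence. A minimal reader could look like the sketch below; read_fasta is a hypothetical helper and the record is made up for illustration, not a real database entry.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only.
example = ">example_protein hypothetical record\nMKTAYIAK\nQRQISFVK\n"
records = read_fasta(example)
print(records)
```

The whole text, header line included, is what gets pasted into the BLAST search field.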

DAY 1

30/09/09

Session III

BLAST ORTHOLOGS AND PARALOGS

Homology refers to any similarity between characteristics of organisms that is due to their shared ancestry.

Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous.

Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor.

http://en.wikipedia.org/wiki/Characteristic
http://en.wikipedia.org/wiki/Organisms
http://en.wikipedia.org/wiki/Common_descent
http://en.wikipedia.org/wiki/Speciation

BLAST ORTHOLOGS AND PARALOGS

Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous.

A set of sequences that are paralogous are called paralogs of each other.

Paralogs typically have the same or similar function, but sometimes do not: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions.

http://en.wikipedia.org/wiki/Gene_duplication

BLAST ORTHOLOGS AND PARALOGS

Orthologs and paralogs are two fundamentally different types of homologous genes that evolved, respectively, by vertical descent from a single ancestral gene and by duplication.

Orthology and paralogy are key concepts of evolutionary genomics.

A clear distinction between orthologs and paralogs is critical for the construction of a robust evolutionary classification of genes and reliable functional annotation of newly sequenced genomes.

Genome comparisons show that orthologous relationships with genes from taxonomically distant species can be established for the majority of the genes from each sequenced genome.

BLAST - Orthology

http://oxytricha.princeton.edu/BlastO/

- Derive a sequence in the FASTA format for a particular protein.
- Paste it in the search query field.
- Keep all parameters as default.
- Click the search button and wait for the results to be displayed.
- Database variation can change results.

BLAST - Paralogy

Perform blastp with human angiogenin.

Perform blastp with:

- Phosphoglycerate kinase
- Phospholipases
- Serine proteinase
- Zinc metalloproteinase

DAY 1

30/09/09

Session IV

Global Vs Local ALIGNMENT

8/8/2019 tics Day 1

77/96

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.

The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program.
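A global alignment of this kind is classically computed with Needleman-Wunsch dynamic programming. The sketch below is a minimal illustration only: it assumes a simple match/mismatch score and a single linear gap penalty (the values 1, -1 and -2 are arbitrary choices made here), whereas real programs use a full scoring matrix such as BLOSUM or PAM and separate gap-open and gap-extend penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    The scoring scheme is an illustrative assumption, not a real matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(top)
    print(bottom)
```

Every residue of both sequences appears in the output, which is what makes the alignment global.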

Global sequence alignment algorithms align sequences over their entire lengths.

A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences.
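Local alignment is classically computed with the Smith-Waterman algorithm, which differs from the global method in two ways: a cell score is never allowed to drop below zero, and the traceback starts from the highest-scoring cell rather than the corner. The sketch below assumes a simple match/mismatch score and a linear gap penalty (the values 2, -1 and -2 are arbitrary illustrative choices), not a real scoring matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, sub_a, sub_b) for the best local alignment of a and b.

    The scoring scheme is an illustrative assumption, not a real matrix.
    """
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                0,                          # start a fresh local alignment
                table[i - 1][j - 1] + sub,  # extend diagonally
                table[i - 1][j] + gap,      # gap in b
                table[i][j - 1] + gap,      # gap in a
            )
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and table[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Only the shared core is reported; the unrelated flanks are ignored.
    print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```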

EMBOSS Pairwise Alignment Algorithms
http://www.ebi.ac.uk/Tools/emboss/align/index.html

This tool is used to compare two sequences.

When you want an alignment that covers the whole length of both sequences, use needle.

When you are trying to find the best region of similarity between two sequences, use water.

Water is for aligning the best matching subsequences of two sequences. It does not necessarily align whole sequences against each other; you should use needle if you wish to align closely related sequences along their whole lengths.

PAIRWISE SEQUENCE ALIGNMENT
The %id is the percentage of identical matches between the two sequences over the reported aligned region.

The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.

The Overall %id and Overall %similarity are calculated in a similar manner for the number of matches over the length of the longer of the two sequences.
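The two percentages can be computed directly from a pair of aligned rows. The sketch below follows the definitions above, with some assumptions made here for illustration: the aligned region length is taken as the number of alignment columns (gaps included), gap columns never count as matches, and `toy_score` with its `SIMILAR_GROUPS` is a hypothetical stand-in for a real scoring matrix such as BLOSUM62.

```python
# Hypothetical residue groups standing in for a real scoring matrix.
SIMILAR_GROUPS = ("ILVM", "DE", "KR", "ST")


def toy_score(x, y):
    """Toy substitution score: identical > similar group > everything else."""
    if x == y:
        return 2
    if any(x in group and y in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def percent_identity_similarity(aligned_a, aligned_b, score=toy_score):
    """Return (%id, %similarity) over the aligned region."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    length = len(aligned_a)
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
        if score(x, y) >= 0:
            similar += 1
    return 100 * identical / length, 100 * similar / length


if __name__ == "__main__":
    pct_id, pct_sim = percent_identity_similarity("MKV-LDEATG", "MRIALDSATG")
    print(f"%id = {pct_id:.1f}, %similarity = {pct_sim:.1f}")
```

Because every identical pair also scores at least 0, %similarity is always at least as large as %id.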

What will you use at NCBI for pairwise sequence alignment?

LALIGN

http://www.ch.embnet.org/software/LALIGN_form.html

MULTIPLE SEQUENCE ALIGNMENT

A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.

In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor.

From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.

Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

http://pbil.univ-lyon1.fr/alignment.html

http://au.expasy.org/tools/#align


Multiple alignments of protein sequences are important tools in studying sequences.

The basic information they provide is identification of conserved sequence regions.

This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.


ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences.

The alignment is progressive and considers the sequence redundancy.

Trees can also be calculated from multiple alignments.

The program has some adjustable parameters with reasonable defaults.

http://www.ebi.ac.uk/Tools/clustalw2/index.html


Show Colors

A button labeled 'Show Colors' will be displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below.

NOTE: This option only works when you have chosen ALN or GCG as the output format.

AVFPMILW - RED - Small (small + hydrophobic (incl. aromatic - Y))
DE - BLUE - Acidic
RK - MAGENTA - Basic
STYHCNGQ - GREEN - Hydroxyl + Amine + Basic - Q
Others - Gray


CONSENSUS SYMBOLS:

An alignment will display by default the following symbols denoting the degree of conservation observed in each column:

"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.

":" means that conserved substitutions have been observed, according to the COLOUR table above.

"." means that semi-conserved substitutions are observed.
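The consensus line can be approximated column by column. The sketch below is a simplified illustration that takes the slide's wording literally: a column gets "*" when all residues are identical, and ":" when all residues fall in the same group of the colour table. It makes several assumptions: columns containing a gap are left blank, the "." (semi-conserved) symbol is not modelled, and the actual ClustalW program uses its own residue groupings, which may not match the colour table exactly.

```python
# Groups copied from the colour table; treating them as the
# "conserved substitution" groups is an assumption of this sketch.
COLOUR_GROUPS = {
    "red": "AVFPMILW",
    "blue": "DE",
    "magenta": "RK",
    "green": "STYHCNGQ",
}
GROUP_OF = {
    residue: colour
    for colour, members in COLOUR_GROUPS.items()
    for residue in members
}


def consensus_line(rows):
    """Return one consensus symbol per alignment column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        groups = {GROUP_OF.get(residue) for residue in column}
        if "-" in column:
            marks.append(" ")   # gap in the column: no symbol
        elif len(set(column)) == 1:
            marks.append("*")   # identical in all sequences
        elif len(groups) == 1 and None not in groups:
            marks.append(":")   # all residues share one colour group
        else:
            marks.append(" ")
    return "".join(marks)


if __name__ == "__main__":
    alignment = ["MKDLS", "MRELT", "MKD-A"]
    for row in alignment:
        print(row)
    print(consensus_line(alignment))
```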


PHYLOGENETIC TREE

A Phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny; its branch lengths are proportional to the amount of inferred evolutionary change.

A Cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length; thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa.

Tree distances can be shown; just click on the diagram to get a menu of options. The ".dnd" file is a file that describes the phylogenetic tree.

These options are now controlled with new buttons on the output page, as well as a pop-up menu that is available by right-clicking on the applet.

The buttons on the page include "Show as Phylogram Tree", "Show as Cladogram Tree" and "Show Distances".
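The ".dnd" file written by ClustalW stores the tree as a Newick string, in which each branch length follows a colon. The sketch below shows how the same tree text carries both views: keeping the numbers corresponds to a phylogram, while dropping them leaves only the branching order, as in a cladogram. The tree string and sequence names are made up for illustration, and the regular expression assumes plain numeric branch lengths and unquoted names.

```python
import re

# Matches a colon followed by a plain or scientific-notation number.
BRANCH_LENGTH = re.compile(r":([-+0-9.eE]+)")


def branch_lengths(newick):
    """Return all branch lengths found in a Newick string, in order."""
    return [float(value) for value in BRANCH_LENGTH.findall(newick)]


def strip_branch_lengths(newick):
    """Return the topology only, i.e. the information a cladogram shows."""
    return BRANCH_LENGTH.sub("", newick)


if __name__ == "__main__":
    tree = "(seqA:0.10,(seqB:0.20,seqC:0.25):0.05);"  # hypothetical tree
    print(branch_lengths(tree))        # distances, as used by a phylogram
    print(strip_branch_lengths(tree))  # branching order only
```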

MULTIPLE SEQUENCE ALIGNMENT

KALIGN

T-COFFEE

COBALT

DIALIGN

KALIGN

Kalign is a fast alignment method for protein and nucleotide sequences.

It uses a fast approximate string matching algorithm to estimate sequence distances quickly and accurately.

As a result, Kalign is very fast compared to other programs and can align 1500 sequences in under 10 seconds.

T-COFFEE

T-Coffee is a multiple sequence alignment program.

Multiple sequence alignment programs are meant to align a set of sequences previously gathered using other programs such as BLAST.

The main characteristic of T-Coffee is that it will allow you to combine results obtained with several alignment methods.

For instance, if you have an alignment coming from ClustalW2, another alignment coming from Dialign, and a structural alignment of some of your sequences, T-Coffee will combine all that information and produce a new multiple sequence alignment having the best agreement with all these methods.

By default, T-Coffee will compare all your sequences two by two, producing a global alignment and a series of local alignments (using lalign). The program will then combine all these alignments into a multiple alignment.
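The idea of combining several alignments can be illustrated by counting how many input alignments agree on each residue pair. The sketch below is only a toy version of that idea: it takes several alignments of the same two sequences, converts each into a list of aligned position pairs, and tallies the support for every pair. The real T-Coffee program weights its pairs and processes them further before building the multiple alignment; none of that is modelled here, and the function names are hypothetical.

```python
from collections import Counter


def aligned_pairs(row_a, row_b):
    """Return (position in a, position in b) for every ungapped column."""
    pairs = []
    i = j = 0  # positions in the ungapped sequences
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.append((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def pair_support(alignments):
    """Count how many alignments contain each aligned residue pair."""
    support = Counter()
    for row_a, row_b in alignments:
        support.update(aligned_pairs(row_a, row_b))
    return support


if __name__ == "__main__":
    # Three hypothetical alignments of ACGT against AGT, e.g. from
    # different methods.
    alignments = [("ACGT", "A-GT"), ("ACGT", "AG-T"), ("ACGT", "A-GT")]
    for pair, count in sorted(pair_support(alignments).items()):
        print(pair, count)
```

Pairs supported by most of the inputs are the ones a consensus alignment would be expected to keep.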

COBALT

COBALT (Constraint-based Multiple Alignment Tool) is a multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

Pairwise constraints are then incorporated into a progressive multiple alignment.

DIALIGN

DIALIGN is a software program for multiple sequence alignment developed by Burkhard Morgenstern et al.

While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing entire segments of the sequences.

No gap penalty is used.

This approach can be used for both global and local alignment, but it is particularly successful in situations where sequences share only local homologies.

http://bibiserv.techfak.uni-bielefeld.de/dialign/submission.html

DIALIGN

Names of the aligned sequences are shown on the left hand side of the alignment.

Numbers on the left hand side of the alignment denote the position of the first residue in a line within the respective sequence.

Capital letters denote aligned residues.

Lower-case letters denote residues not considered to be aligned by DIALIGN. Thus, if a lower-case letter is standing in the same column with other letters, this is pure chance; these residues are not considered to be homologous.

DIALIGN

The number of `*' characters below the alignment reflects the degree of local similarity among sequences.

The number of `*' characters is normalized such that regions of maximum similarity have N `*' characters per column. N can be specified by the user. By default, N = 5. Note that the number of `*' characters depicts the relative degree of similarity within an alignment, since in every alignment, the region of maximum similarity gets

Date post: 29-May-2018
Upload: richa-singh