8/8/2019 Biological Databases Genbank
BIOLOGICAL DATABASES
Sequence Databases
Other Databases
The Nucleotide Giants
GenBank
DDBJ: DNA Data Bank of Japan
EMBL: European Molecular Biology Laboratory
GenBank
The GenBank sequence database is an annotated
collection of all publicly available nucleotide sequences
and their protein translations. This database is produced
at the National Center for Biotechnology Information (NCBI)
as part of an international collaboration with the
European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI)
and the DNA Data Bank of Japan (DDBJ).
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early 1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook the task of scanning the literature for sequences and manually typing the sequences into the database. Staff then added annotation to these records, based upon information in the published article.
Today, nearly all sequences are deposited directly by the laboratories that generate them. This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the accession number can be cited and the sequence can be retrieved when the article is published.
NCBI began accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
International Collaboration
GenBank
DDBJ
EMBL
International Collaboration
In February 1986, the GenBank database became part of the
International Nucleotide Sequence Database Collaboration with the
EMBL database (European Bioinformatics Institute
[http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome
Sequence Database (GSDB; LANL, Los Alamos, NM).
Subsequently, the GSDB was removed and DDBJ
[http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group in
1987. Each database has its own set of submission and retrieval
tools, but the three databases exchange data daily so that all three
databases should contain the same set of sequences.
To avoid conflicting data at the three sites, an entry can be updated
only by the database that initially prepared it.
International Collaboration
The Collaboration created a Feature Table Definition
[http://www.ncbi.nlm.nih.gov/collab/FT/index.html]
that outlines legal features and syntax for the DDBJ,
EMBL, and GenBank feature tables. The purpose of this document is to standardize annotation across the
databases. The presentation and format of the data are
different in the three databases; however, the underlying
biological information is the same.
The International Nucleotide Sequence Database Collaboration also exchanges new and updated records daily. Therefore, all sequences present in GenBank are also present in DDBJ and EMBL.
How to access them?
Main Sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.ebi.ac.uk/
DDBJ : http://www.ddbj.nig.ac.jp
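Besides the web pages, records can also be requested programmatically. The sketch below only builds a request URL for the NCBI E-utilities EFetch service; it does not contact the network. The E-utilities endpoint, the helper name build_efetch_url, and the example accession U49845 are not part of these slides and are used here as assumptions for illustration.

```python
from urllib.parse import urlencode

# E-utilities EFetch endpoint (an assumption: not mentioned in the slides).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="gb"):
    """Build an EFetch URL for one nucleotide record (hypothetical helper).

    rettype="gb" asks for the GenBank flatfile, rettype="fasta" for FASTA.
    """
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession number of the record
        "rettype": rettype,   # record format
        "retmode": "text",    # plain-text response
    }
    return EFETCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # U49845 is used only as an example accession number.
    print(build_efetch_url("U49845"))
    print(build_efetch_url("U49845", rettype="fasta"))
```

Opening the printed URL in a browser, or passing it to a download tool, should return the record as plain text.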
THE GENBANK FLATFILE:
A DISSECTION
In FASTA format
The GenBank flatfile (GBFF) is the elementary
unit of information in the GenBank database. It is
one of the most commonly used formats in the
representation of biological sequences.
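To make the flatfile layout concrete, here is a minimal sketch that reads a few top-level keywords and the sequence from a flatfile-style record and writes it out in FASTA format. The record SAMPLE_RECORD is made up for illustration, and the functions parse_flatfile and to_fasta are hypothetical helpers, not an official parser. Continuation lines and the feature table are deliberately ignored.

```python
# A made-up record in the general layout of a GenBank flatfile.
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flatfile(text):
    """Collect single-line top-level keywords and the ORIGIN sequence."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces, and bases.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():           # keyword starts in column 1
            keyword, _, value = line.partition(" ")
            record[keyword] = value.strip()
    return record


def to_fasta(record):
    """Format a parsed record as a two-line FASTA entry."""
    header = ">{} {}".format(record["ACCESSION"], record["DEFINITION"])
    return header + "\n" + record["sequence"]


if __name__ == "__main__":
    print(to_fasta(parse_flatfile(SAMPLE_RECORD)))
```

The FASTA form keeps only an identifier, a description, and the sequence, which is why the flatfile remains the richer of the two representations.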
EMBL and DDBJ
The European counterpart to GenBank is the European Molecular Biology
Laboratory Nucleotide Sequence Database (EMBL) located at the European
Bioinformatics Institute (EBI).
Another primary nucleotide sequence database, the DNA Data Bank of Japan
(DDBJ) [ddbj], is operated by the Center for Information Biology (CIB) [cib] in
Japan and is the primary nucleotide sequence database for Asia.
The three database operators, NCBI, EBI, and CIB, make up the International
Nucleotide Sequence Database Collaboration and synchronize their databases
every 24 h. A query of all three individual databases is therefore not necessary,
nor is it required to enter a new nucleotide sequence into all three databases.
While the database format of DDBJ is identical to that of NCBI, that of EMBL
differs somewhat.
The Sequence Retrieval System
SRS was developed at EBI to manage primary
and secondary biological databases (Etzold et al. 1996). SRS can also facilitate complex
queries. Operation of SRS is the same at
either DDBJ or EBI and the following section
describes the system at EBI.
Protein Database
SWISSPROT
One of the most important collections of annotated protein sequences is
the Swissprot database [swissprot] of the Swiss Institute of
Bioinformatics (SIB), which also operates the Expert Protein Analysis
System (Expasy) server [expasy].
The Swissprot database is a high-quality database because it is manually
curated.
Furthermore, Swissprot is part of the UniProt databases (see Sect. 3.2.2
Uniprot) collectively known as the UniProt Knowledgebase
(UniProtKB).
Because SIB specialists cannot keep pace with the growing number of new entries, a supplement to Swissprot has been developed, the
TrEMBL database. TrEMBL stands for Translated EMBL and contains all
nucleic acid to protein translations of the EMBL database that have not
yet been included in Swissprot. All entries are annotated automatically,
and so their quality is lower than that of the manually curated entries.
Both databases can be accessed via the Swissprot main page.
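The idea of a nucleic acid to protein translation can be illustrated with a small sketch. It translates a DNA coding sequence with the standard genetic code, reading from the first base and stopping at the first stop codon. This is only a toy: the function name translate and the example sequence are assumptions, and the actual TrEMBL pipeline, which works from annotated coding regions, is not described in these slides.

```python
# Standard genetic code, written in the usual T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every codon (e.g. "ATG") to its one-letter amino acid ("*" = stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence in frame 1 until the first stop codon."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Made-up coding sequence used only for illustration.
    print(translate("atggccattgtaatgggccgctga"))
```

Because such translations are produced without human review, the resulting entries carry only automatic annotation, which is the quality difference the slide refers to.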
NCBI Protein Database
Another well-known protein sequence database is maintained at
the NCBI.
This database, however, is not a single database but a
compilation of entries found in other protein sequence databases.
For example, the NCBI database contains entries from Swissprot, the PIR database [pir], the PDB database [pdb], protein
translations of the GenBank database, as well as from a number
of other sequence databases.
Its format corresponds to that of GenBank and queries are carried
out analogously to those of GenBank via the Entrez system of NCBI.
Universal Protein Resource (UniProt)
The Universal Protein Resource (UniProt) (The UniProt Consortium
2007) unites the information in the three protein databases,
Swissprot, TrEMBL, and PIR.
UniProt consists of three parts, the UniProt Knowledgebase
(UniProtKB), the UniProt Reference Clusters Database (UniRef), and the UniProt Archive (UniParc), a collection of protein
sequences and their history.
UniProtKB is a comprehensive directory of protein annotations
and is based on the Swissprot and TrEMBL databases.
UniRef is a nonredundant sequence database that allows for fast similarity searches. The database exists in three versions:
UniRef100, UniRef90, and UniRef50.
Secondary Databases
PROSITE
An important secondary biological database is Prosite (Falquet et
al. 2002), resident at the SIB.
Classification of proteins in Prosite is determined using single
conserved motifs, i.e., short sequence regions (10-20 amino
acids) that are conserved in related proteins and usually have a key role in the protein's function.
A motif is derived from multiple alignments (see Chap. 4) and
saved in the database as a regular expression, for example:
[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.
Besides searching for keywords, one can examine a sequence for the presence of Prosite motifs. Furthermore, using the algorithm
ScanProsite, Prosite offers the possibility to search Swissprot,
TrEMBL, and PDB for protein sequences that contain a user-defined pattern.
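Scanning a sequence for a Prosite-style pattern can be sketched by rewriting the pattern as an ordinary regular expression. The converter below covers only the common syntax elements (residue sets in square brackets, excluded residues in curly braces, x for any residue, repeat counts in parentheses, and the < and > anchors). The function name prosite_to_regex and the test sequence are assumptions for illustration; this is not the ScanProsite implementation.

```python
import re


def prosite_to_regex(pattern):
    """Rewrite a Prosite-style pattern as a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ANW} means "any residue except A, N, W".
        element = element.replace("{", "[^").replace("}", "]")
        # x(2) or x(2,4) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase x stands for any residue.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence ends.
        element = element.replace("<", "^").replace(">", "$")
        regex_parts.append(element)
    return "".join(regex_parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.")
    print(regex)
    # Made-up sequence used only to show a hit.
    hit = re.search(regex, "AAGSFCLLPAA")
    print(hit.group(0), hit.start())
```

The same regular expression can then be applied to any number of sequences with re.search or re.finditer.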
PRINTS
The Prints database [prints] (Attwood et al. 2003) uses
fingerprints to classify sequences.
Fingerprints consist of several sequence motifs, represented in
the Prints database by short local ungapped alignments.
The Prints database takes advantage of the fact that proteins usually contain functional regions that result in several sequence
motifs per protein.
Besides information on how to derive a fingerprint and judge its
quality, the Prints database also offers cross-references to entries in
related databases, thus permitting access to more information regarding the protein family.
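The fingerprint idea, several motifs that must all be present, can be sketched as follows. Each motif is a small ungapped alignment, and a sequence matches the fingerprint if every motif is found in order with a high enough score. The scoring rule (fraction of positions whose residue occurs in the corresponding alignment column), the threshold of 0.8, the greedy left-to-right search, and the example data are all simplifying assumptions; Prints itself uses a more refined scoring scheme.

```python
# A made-up fingerprint: two motifs, each a short ungapped alignment.
FINGERPRINT = [
    ["GSFC", "GTYC"],
    ["PLLK", "PLIK"],
]


def best_hit(sequence, motif_alignment, start=0):
    """Return (score, position) of the best window at or after start."""
    width = len(motif_alignment[0])
    # Residues observed in each column of the motif alignment.
    columns = [set(column) for column in zip(*motif_alignment)]
    best = (0.0, -1)
    for pos in range(start, len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        matches = sum(res in col for res, col in zip(window, columns))
        score = matches / width
        if score > best[0]:
            best = (score, pos)
    return best


def matches_fingerprint(sequence, fingerprint, threshold=0.8):
    """True if every motif is found, in order, with score >= threshold."""
    start = 0
    for motif in fingerprint:
        score, pos = best_hit(sequence, motif, start)
        if score < threshold:
            return False
        start = pos + len(motif[0])   # the next motif must lie downstream
    return True


if __name__ == "__main__":
    print(matches_fingerprint("MAGSYCAAAPLIKW", FINGERPRINT))
    print(matches_fingerprint("MAGSYCAAAW", FINGERPRINT))
```

Requiring several motifs at once is what makes a fingerprint more selective than a single pattern: a chance match to one motif is not enough.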
Pfam
The Pfam database [pfam] (Bateman et al. 2002) classifies
protein families according to profiles. A profile is a pattern that
evaluates the probability of the appearance of a given amino acid, an insertion, or a deletion at every position in a protein sequence.
Pfam is based on sequence alignments.
Further sequences are then automatically added to the individual
alignments of the Swissprot database.
The resulting alignments should represent functionally interesting structures and contain evolutionarily related sequences.
Because of the partly automatic construction of the alignments,
however, it is also possible that sequence alignments arise that
have no evolutionary relationship to one another. Therefore, results
of a search against the Pfam database should be carefully reviewed.
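A profile in the sense described above can be approximated by the frequency of each symbol in each column of an alignment. The sketch below builds such a table from a made-up alignment and scores a sequence against it. It is a strong simplification: gaps are counted as one more symbol, no pseudocounts are used, and the sequence must have the same length as the alignment. The names build_profile and profile_score are hypothetical, and the profiles used by Pfam itself are more elaborate.

```python
from collections import Counter

# A made-up alignment; "-" marks a deletion in that sequence.
ALIGNMENT = ["GSFC", "GTYC", "GSYC", "G-YC"]


def build_profile(alignment):
    """Return one {symbol: frequency} table per alignment column."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        profile.append({symbol: n / total for symbol, n in counts.items()})
    return profile


def profile_score(sequence, profile):
    """Multiply the per-position frequencies of the sequence's residues.

    A residue never seen in a column gives a score of 0.0 because this
    sketch uses no pseudocounts.
    """
    score = 1.0
    for residue, column in zip(sequence, profile):
        score *= column.get(residue, 0.0)
    return score


if __name__ == "__main__":
    profile = build_profile(ALIGNMENT)
    print(profile[1])                      # frequencies in the second column
    print(profile_score("GSYC", profile))
    print(profile_score("GAYC", profile))
```

A table built from a poor alignment will still score sequences, which illustrates why hits against automatically extended alignments deserve a careful second look.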
InterPro
The Integrated Resource of Protein Families,
Domains, and Sites (Interpro) [interpro] (Mulder et al.
2007) integrates important secondary databases into a
comprehensive signature database.
Interpro merges the databases Swissprot, TrEMBL,
Prosite, Pfam, Prints, ProDom, Smart, and TIGRFAMs
[tigr] and thereby allows a simple and simultaneous
query of these databases.
The result page combines the output of the individual
queries. This makes for a fast comparison of the
results while taking into account the strengths and
weaknesses of the individual databases.
Other Databases
Genotype-Phenotype Databases
For diseases to emerge and progress, several genes or their
products are frequently required. The identification of genes
relevant to disease is, therefore, of vital importance in a
target-based approach for rational drug development.
A number of genotype-phenotype databases have been
established that record relationships between genes and the
biological properties of organisms.
OMIM: Online Mendelian Inheritance in Man
dbGaP
OMIA: Online Mendelian Inheritance in Animals (except
mice and humans)
Mouse Genome Database
FlyBase & WormBase
Molecular Structure Databases
PDB: Protein Data Bank
SCOP: Structural Classification of Proteins
CATH: Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H)
PDB
The Protein Data Bank (PDB) is a database of experimentally determined
crystal structures of biological macromolecules.
The PDB was founded at the Brookhaven National Laboratory in 1971,
reflected in the frequent use of the name Brookhaven Protein Data Bank.
About 46,000 macromolecule structures are stored in the PDB database (as of September 2007).
These are predominantly proteins, but also include DNA and RNA
structures and protein-nucleic acid complexes.
As of 2002, only those crystal structures that have been solved
experimentally are stored in the PDB database, whereas data of
theoretical protein models are kept in their own section [pdb-models].
The PDB database offers a number of query options. A text-based
search for a PDB-ID or a keyword can be initiated on the main page.
SCOP
Proteins that perform a similar biological function and are evolutionarily
related must have a similar structural organization, at least in the region
of their active centers. It should, therefore, be possible to predict the
function of an unknown protein by comparison of its structural
organization with that of known proteins. Two databases, SCOP and
CATH, provide such predictions.
SCOP (Structural Classification Of Proteins) [scop] (Murzin et al. 1995)
classifies proteins of a known structure in a hierarchical manner. The
three main classifications are families, superfamilies, and folds. Families
describe proteins with a clear evolutionary relationship to each other and
are limited by a sequence identity that must be at least 30% over the total
length of the proteins.
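The 30% identity criterion for families can be illustrated with a short calculation on two already aligned sequences. In this sketch, every alignment column that contains at least one residue counts toward the length, which is one possible reading of "over the total length"; the function names and the example sequences are assumptions, and actual SCOP family assignment relies on expert curation rather than on this number alone.

```python
FAMILY_IDENTITY_THRESHOLD = 30.0  # percent, as stated in the text above


def percent_identity(aligned_a, aligned_b):
    """Percent of compared columns in which both sequences agree.

    Both inputs must come from the same pairwise alignment, with "-"
    marking gaps. Columns that are gaps in both sequences are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def meets_family_threshold(aligned_a, aligned_b):
    """True if the pair reaches the 30% identity criterion."""
    return percent_identity(aligned_a, aligned_b) >= FAMILY_IDENTITY_THRESHOLD


if __name__ == "__main__":
    # Made-up aligned sequences used only for illustration.
    print(percent_identity("MKTAYIAK", "MKSAY-AR"))
    print(meets_family_threshold("MKTAYIAK", "MKSAY-AR"))
```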