Biological Databases Genbank

8/8/2019 Biological Databases Genbank


BIOLOGICAL DATABASES



Sequence Databases



Other Databases



The Nucleotide Giants

GenBank

DDBJ: DNA Data Bank of Japan

EMBL: European Molecular Biology Laboratory



GenBank

The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library at the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).



History

Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early 1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook the task of scanning the literature for sequences and manually typing the sequences into the database. Staff then added annotation to these records, based upon information in the published article.

Today, nearly all sequences are deposited directly by the laboratories that generate them. This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the accession number can be cited and the sequence retrieved when the article is published.

NCBI began accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.



International Collaboration

GenBank

DDBJ

EMBL




In February 1986, the GenBank database became part of the International Nucleotide Sequence Database Collaboration with the EMBL database (European Bioinformatics Institute [http://www.ebi.ac.uk/], Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL, Los Alamos, NM). Subsequently, the GSDB was removed and DDBJ [http://www.ddbj.nig.ac.jp/] (Mishima, Japan) joined the group in 1987. Each database has its own set of submission and retrieval tools, but the three databases exchange data daily so that all three should contain the same set of sequences.

To avoid conflicting data at the three sites, an entry can only be updated by the database that initially prepared it.




The Collaboration created a Feature Table Definition [http://www.ncbi.nlm.nih.gov/collab/FT/index.html] that outlines legal features and syntax for the DDBJ, EMBL, and GenBank feature tables. The purpose of this document is to standardize annotation across the databases. The presentation and format of the data differ among the three databases, but the underlying biological information is the same.

The International Nucleotide Sequence Database Collaboration also exchanges new and updated records daily. Therefore, all sequences present in GenBank are also present in DDBJ and EMBL.



How to access them?

Main Sites

NCBI: http://www.ncbi.nlm.nih.gov/

EMBL: http://www.ebi.ac.uk/

DDBJ: http://www.ddbj.nig.ac.jp



THE GENBANK FLATFILE:

A DISSECTION

In FASTA format

The GenBank flatfile (GBFF) is the elementary unit of information in the GenBank database. It is one of the most commonly used formats in the representation of biological sequences.



EMBL and DDBJ

The European counterpart to GenBank is the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL), located at the European Bioinformatics Institute (EBI).

Another primary nucleotide sequence database, the DNA Data Bank of Japan (DDBJ) [ddbj], is operated by the Center for Information Biology (CIB) [cib] in Japan and is the primary nucleotide sequence database for Asia.

The three database operators NCBI, EBI, and CIB make up the International Nucleotide Sequence Database Collaboration and synchronize their databases every 24 h. A query of all three individual databases is therefore not necessary, nor is it required to enter a new nucleotide sequence into all three databases.

While the database format of DDBJ is identical to that of NCBI, that of EMBL differs somewhat.



The Sequence Retrieval System

SRS was developed at EBI to manage primary and secondary biological databases (Etzold et al. 1996). SRS can also facilitate complex queries. Operation of SRS is the same at either DDBJ or EBI, and the following section describes the system at EBI.



Protein Database

SWISSPROT

One of the most important collections of annotated protein sequences is the Swissprot database [swissprot] of the Swiss Institute of Bioinformatics (SIB), which also operates the Expert Protein Analysis System (Expasy) server [expasy]. Swissprot is a high-quality database because it is manually curated.

Furthermore, Swissprot is part of the UniProt databases (see Sect. 3.2.2 Uniprot), collectively known as the UniProt Knowledgebase (UniProtKB).

Because SIB specialists cannot keep pace with the growing number of new entries, a supplement to Swissprot has been developed, the TrEMBL database. TrEMBL stands for Translated EMBL and contains all nucleic acid to protein translations of the EMBL database that have not yet been included in Swissprot. All entries are annotated automatically, so their quality is lower than that of the manually curated entries.

Both databases can be accessed via the Swissprot main page.
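To make the idea of a "nucleic acid to protein translation" concrete, here is a minimal Python sketch that translates a coding DNA sequence with the standard genetic code. It is only an illustration: the names CODON_TABLE and translate are hypothetical, the sketch assumes the sequence starts in the correct reading frame, and it is not the pipeline actually used to build TrEMBL.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

Real annotation pipelines also have to choose the reading frame and, for some organisms and organelles, a different genetic code; both are left out here.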



NCBI Protein Database

Another well-known protein sequence database is maintained at the NCBI.

This database, however, is not a single database but a compilation of entries found in other protein sequence databases.

For example, the NCBI database contains entries from Swissprot, the PIR database [pir], the PDB database [pdb], protein translations of the GenBank database, as well as from a number of other sequence databases.

Its format corresponds to that of GenBank, and queries are carried out analogously to those of GenBank via the Entrez system of NCBI.



Universal Protein Resource (UniProt)

The Universal Protein Resource (UniProt) (The UniProt Consortium 2007) unites the information in the three protein databases Swissprot, TrEMBL, and PIR.

UniProt consists of three parts: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters Database (UniRef), and the UniProt Archive (UniParc), a collection of protein sequences and their history.

UniProtKB is a comprehensive directory of protein annotations and is based on the Swissprot and TrEMBL databases.

UniRef is a nonredundant sequence database that allows for fast similarity searches. The database exists in three versions: UniRef100, UniRef90, and UniRef50.



Secondary Databases



PROSITE

An important secondary biological database is Prosite (Falquet et al. 2002), resident at the SIB.

Classification of proteins in Prosite is determined using single conserved motifs, i.e., short sequence regions (10 to 20 amino acids) that are conserved in related proteins and usually have a key role in the protein's function.

A motif is derived from multiple alignments (see Chap. 4) and saved in the database as a regular expression, for example:

[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.

Besides searching for keywords, one can examine a sequence for the presence of Prosite motifs. Furthermore, using the algorithm ScanProsite, Prosite offers the possibility to search Swissprot, TrEMBL, and PDB for protein sequences that contain a user-defined pattern.
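A Prosite pattern such as [GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P. maps closely onto an ordinary regular expression: square brackets list allowed residues, curly braces list forbidden residues, x stands for any residue, and a number in parentheses is a repeat count. The Python sketch below shows one way to do that conversion. The function name prosite_to_regex and the test sequences are made up for illustration, and the sketch ignores rarer syntax such as anchors placed inside brackets, so it should not be read as how ScanProsite works internally.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple Prosite pattern into Python regular-expression syntax."""
    pattern = pattern.strip().rstrip(".")  # Prosite patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {ANW} means "any residue except A, N, W"; do this before the next
        # step, which reuses curly braces for repeat counts.
        element = element.replace("{", "[^").replace("}", "]")
        # x(2) or x(2,4) becomes a regex quantifier.
        element = element.replace("(", "{").replace(")", "}")
        # x is the wildcard position.
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("[GSTNE]-[GSTQCR]-[FYW]-{ANW}-x(2)-P.")
print(regex)  # [GSTNE][GSTQCR][FYW][^ANW].{2}P

# Hypothetical test sequences: the first contains the motif, the second does not.
print(bool(re.search(regex, "AAGSFCKLPAA")))  # True
print(bool(re.search(regex, "AAGSFAKLPAA")))  # False
```

The second sequence fails only because of the alanine at the fourth motif position, which the {ANW} element forbids.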



PRINTS

The Prints database [prints] (Attwood et al. 2003) uses fingerprints to classify sequences.

Fingerprints consist of several sequence motifs, represented in the Prints database by short local ungapped alignments.

The Prints database takes advantage of the fact that proteins usually contain functional regions that result in several sequence motifs per protein.

Besides information on how to derive a fingerprint and judge its quality, the Prints database also offers cross-references to entries in related databases, thus permitting access to more information regarding the protein family.



Pfam

The Pfam database [pfam] (Bateman et al. 2002) classifies protein families according to profiles. A profile is a pattern that evaluates the probability of the appearance of a given amino acid, an insertion, or a deletion at every position in a protein sequence.

Pfam is based on sequence alignments.

Further sequences are then automatically added to the individual alignments of the Swissprot database.

The resulting alignments should represent functionally interesting structures and contain evolutionarily related sequences.

Because of the partly automatic construction of the alignments, however, it is also possible that sequence alignments arise that have no evolutionary relationship to one another. Therefore, results of a search against the Pfam database should be carefully reviewed.
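The notion of a profile can be illustrated with a deliberately simplified Python sketch: per-column amino acid probabilities are estimated from a small ungapped alignment, and a candidate segment is scored by its log-odds against a uniform background. The toy alignment, the pseudocount, the uniform background, and the function names are all assumptions made for illustration. Insertions and deletions are left out entirely, whereas Pfam itself relies on profile hidden Markov models that do model them.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column amino acid probabilities from an ungapped alignment."""
    length = len(alignment[0])
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(sequence[column] for sequence in alignment)
        # The pseudocount keeps unseen residues from getting probability zero.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, segment, background=1 / 20):
    """Sum of log2(profile probability / background) over all positions."""
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(segment)
    )


# Toy alignment (hypothetical sequences, one row per family member).
toy_alignment = ["ACDE", "ACDE", "ACDF"]
toy_profile = build_profile(toy_alignment)

print(log_odds_score(toy_profile, "ACDE") > 0)  # True: resembles the family
print(log_odds_score(toy_profile, "WYWY") < 0)  # True: unlike the family
```

A positive score means the segment looks more like the aligned family than like random sequence. Where to set a cutoff is a separate question, which is one reason search hits should be reviewed.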



InterPro

The Integrated Resource of Protein Families, Domains, and Sites (Interpro) [interpro] (Mulder et al. 2007) integrates important secondary databases into a comprehensive signature database.

Interpro merges the databases Swissprot, TrEMBL, Prosite, Pfam, Prints, ProDom, Smart, and TIGRFAMs [tigr] and thereby allows a simple and simultaneous query of these databases.

The result page combines the output of the individual queries. This makes for a fast comparison of the results while taking into account the strengths and weaknesses of the individual databases.



Other Databases

Genotype-Phenotype Databases

For diseases to emerge and progress, several genes or their products are frequently required. The identification of genes relevant to disease is, therefore, of vital importance in a target-based approach for rational drug development.

A number of genotype-phenotype databases have been established that record relationships between genes and the biological properties of organisms.

OMIM: Online Mendelian Inheritance in Man

dbGaP

OMIA: Online Mendelian Inheritance in Animals (except mice and humans)

Mouse Genome Database

FlyBase & WormBase



Molecular Structure Databases

PDB: Protein Data Bank

SCOP: Structural Classification of Proteins

CATH: Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H)
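CATH classifications are commonly written as four dot-separated numbers, one for each of the four levels named above. The short Python sketch below splits such a code into its levels. The CathCode type, the parse_cath function, and the example code 3.40.50.300 are chosen for illustration, and the class names in the lookup table are given from memory and should be checked against the CATH site before being relied on.

```python
from typing import NamedTuple

# Top-level class names (an assumption; verify against the CATH website).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath(code: str) -> CathCode:
    """Split a C.A.T.H code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))


example = parse_cath("3.40.50.300")
print(example.topology)  # 50
print(CATH_CLASSES.get(example.cath_class, "Unknown"))  # Alpha Beta
```

Two domains that share the first three numbers but differ in the fourth have the same topology but belong to different homologous superfamilies.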



PDB

The Protein Data Bank (PDB) is a database of experimentally determined crystal structures of biological macromolecules.

The PDB was founded at the Brookhaven National Laboratory in 1971, reflected in the frequent use of the name Brookhaven Protein Data Bank.

About 46,000 macromolecule structures are stored in the PDB database (as of September 2007).

These are predominantly proteins, but also include DNA and RNA structures and protein-nucleic acid complexes.

As of 2002, only those crystal structures that have been solved experimentally are stored in the PDB database, whereas data of theoretical protein models are kept in their own section [pdb-models].

The PDB database offers a number of query options. A text-based search for a PDB-ID or a keyword can be initiated on the main page.



SCOP

Proteins that perform a similar biological function and are evolutionarily related must have a similar structural organization, at least in the region of their active centers. It should, therefore, be possible to predict the function of an unknown protein by comparison of its structural organization with that of known proteins. Two databases, SCOP and CATH, provide such predictions.

SCOP (Structural Classification Of Proteins) [scop] (Murzin et al. 1995) classifies proteins of a known structure in a hierarchical manner. The three main classifications are families, superfamilies, and folds. Families describe proteins with a clear evolutionary relationship to each other and are limited by a sequence identity that must be at least 30% over the total length of the proteins.
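The 30% criterion can be made concrete with a small Python sketch that computes percent identity for two sequences that have already been aligned. It rests on several assumptions: the inputs are pre-aligned strings of equal length with "-" for gaps, identity is taken over all alignment columns (other denominators are also in common use), and the name percent_identity and the toy sequences are hypothetical. SCOP curation weighs more evidence than this single number.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a gap.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


FAMILY_THRESHOLD = 30.0  # the cutoff quoted above, in percent

identity = percent_identity("AAAA", "AAWW")
print(identity)                      # 50.0
print(identity >= FAMILY_THRESHOLD)  # True
```

Because the result depends on the alignment and on the choice of denominator, values close to the threshold deserve a second look.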

Date post:	09-Apr-2018
Upload:	jaineem