Date post: | 20-Dec-2015 |
Content
• DNA sequence databases
• Protein databases
• Gene prediction
• Accession numbers
• NCBI website
• Ensembl website
Nucleotide databases
• GenBank
  • Housed at NCBI (National Center for Biotechnology Information)
  • www.ncbi.nlm.nih.gov/Genbank/
• EMBL
  • Housed at EBI (European Bioinformatics Institute)
  • www.ebi.ac.uk/embl/
• DDBJ
  • Housed in Japan
  • www.ddbj.nig.ac.jp/Welcome-e.html
The underlying raw DNA sequences are identical
>100,000 species are represented in GenBank
all species 196,538
viruses 5,214
bacteria 14,258
archaea 500
eukaryota 171,843
NCBI nucleotide databases
• GenBank
  • Individual submissions
  • Bulk submissions (genome centers)
  • High-throughput sequencing (DNA)
  • Expressed Sequence Tags (mRNA)
• RefSeq
  • Curated subset of GenBank
  • “Reference” sequence
  • Single sequence per locus / molecule
Protein databases
• NCBI
  • RefSeq and Protein
• EBI
  • Swiss-Prot, PIR and TrEMBL → UniProt
• Translated from nucleotide sequence
• Curated
• Combined
UniProt versus GenBank and RefSeq
UniProt
• Produced by SIB, EBI & Georgetown U.
• Protein data only
• Curated in Swiss-Prot, not in TrEMBL
GenBank/RefSeq
• Produced by INSDC and NCBI
• Protein and nucleotide data
• Curated in RefSeq, not in GenBank
Accession numbers
Label to unambiguously identify a sequence
Examples (all for retinol-binding protein, RBP4):
DNA
• X02775 GenBank genomic DNA sequence
• NT_030059 Genomic contig
• rs7079946 dbSNP (single nucleotide polymorphism)
• RBP4 HUGO gene name
RNA
• N91759.1 An expressed sequence tag (1 of 170)
• NM_006744 RefSeq DNA sequence (from a transcript)
Protein
• NP_007635 RefSeq protein
• AAC02945 GenBank protein
• Q28369 UniProt protein
• 1KT7 Protein Data Bank structure record
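The shape of an accession number usually hints at the database it comes from. The following is a minimal Python sketch that guesses the source of the RBP4 examples above from their format alone. The function name `classify_accession`, the pattern table and the labels are illustrative assumptions, not an official NCBI or UniProt tool. The patterns cover only common formats and overlap in places (a letter followed by five digits can be a GenBank nucleotide or a UniProt accession), so the order of the checks is a heuristic.

```python
import re

# Hypothetical lookup for the three RefSeq prefixes that appear in the examples.
REFSEQ_PREFIXES = {
    "NM": "RefSeq mRNA",
    "NP": "RefSeq protein",
    "NT": "RefSeq genomic contig",
}

# Ordered (pattern, label) pairs; earlier entries win when formats overlap.
PATTERNS = [
    (re.compile(r"^rs\d+$", re.IGNORECASE), "dbSNP"),
    (re.compile(r"^\d[A-Z0-9]{3}$"), "PDB structure"),
    (re.compile(r"^([OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d([A-Z][A-Z0-9]{2}\d){1,2})$"),
     "UniProt protein"),
    (re.compile(r"^[A-Z]{3}\d{5}$"), "GenBank protein"),
    (re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$"), "GenBank nucleotide"),
]


def classify_accession(acc):
    """Guess the source database of an accession from its format only."""
    base = acc.strip().split(".")[0]  # drop a version suffix such as ".1"
    refseq = re.match(r"^([A-Z]{2})_\d+$", base)
    if refseq:
        return REFSEQ_PREFIXES.get(refseq.group(1), "RefSeq (other)")
    for pattern, label in PATTERNS:
        if pattern.match(base):
            return label
    return "unrecognized"  # e.g. a gene symbol such as RBP4


if __name__ == "__main__":
    for acc in ["X02775", "NT_030059", "rs7079946", "RBP4", "N91759.1",
                "NM_006744", "NP_007635", "AAC02945", "Q28369", "1KT7"]:
        print(acc, "->", classify_accession(acc))
```

A gene symbol such as RBP4 is a name rather than an accession, so the sketch reports it as unrecognized instead of guessing.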
From Sequence to Genes
• Gene prediction
  • Extrinsic
    • Search for genes based on observed mRNA / protein sequences
    • UniGene
  • Ab initio
    • Predict genes based on genomic sequence alone
    • Promoter sequence
    • Poly(A) tail binding sites, CpG islands, splicing sites
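As an illustration of one ab initio signal from the list above, the following Python sketch computes two statistics often used to flag candidate CpG islands: GC content and the observed/expected CpG ratio, (number of CG dinucleotides × length) / (number of C × number of G). The function names and the thresholds (GC content above 0.5, ratio above 0.6) are assumptions chosen for illustration, following commonly cited criteria; real gene finders combine many signals, and this is not the method of any particular tool.

```python
def cpg_stats(seq):
    """Return (gc_content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so this count is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_gc=0.5, min_ratio=0.6):
    """Apply the illustrative thresholds to a single window of sequence."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content > min_gc and obs_exp > min_ratio
```

In practice the test would be applied to sliding windows of a few hundred bases along the genomic sequence rather than to one short string.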
UniGene
• Predict genes based on ESTs
• EST:
  • DNA sequence corresponding to mRNA from an expressed gene
  • ~500 base pairs long
  • Sequenced from a cDNA library
• Cluster ESTs from many cDNA libraries to predict distinct genes
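The idea of grouping ESTs into clusters can be sketched with a toy single-linkage scheme: two ESTs join the same cluster if they share at least one exact k-mer. This is an assumption made for illustration only and is not UniGene's actual pipeline; the function names and the default `k` are hypothetical, and the sketch ignores reverse complements and sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group EST indices so that ESTs sharing any k-mer end up together."""
    parent = list(range(len(ests)))  # union-find forest over EST indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq.upper(), k):
            if kmer in first_seen:
                a, b = find(idx), find(first_seen[kmer])
                if a != b:
                    parent[a] = b  # merge the two clusters
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())
```

The number of indices in each returned list corresponds to the cluster size: a list with one index is a cluster of size 1.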
EST clusters
• A gene with 1 EST associated: the cluster size is 1
• A gene with 10 ESTs associated: the cluster size is 10
UniGene clusters
[Figure: histogram of number of clusters (y-axis) against cluster size (x-axis), annotated "Likely to be real genes"]
Cluster size     Number of clusters
1                40986
2                18424
3-4              17855
5-8              13411
9-16             8288
17-32            5332
33-64            4607
65-128           4075
129-256          4052
257-512          3958
513-1024         1902
1025-2048        710
2049-4096        210
4097-8192        57
8193-16384       17
16385-32768      6
32769-65536      1
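The cluster-size bins in the histogram double in width (1, 2, 3-4, 5-8, 9-16, and so on). A short Python sketch of how a list of cluster sizes could be tallied into such power-of-two bins follows; the function names `size_bin` and `size_histogram` are illustrative assumptions, not part of any UniGene tool.

```python
from collections import Counter


def size_bin(n):
    """Return the power-of-two bin label for a cluster of size n (n >= 1)."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    upper = 1 << (n - 1).bit_length()  # smallest power of two >= n
    lower = upper // 2 + 1
    return str(upper) if lower == upper else f"{lower}-{upper}"


def size_histogram(cluster_sizes):
    """Count how many clusters fall into each power-of-two size bin."""
    return Counter(size_bin(n) for n in cluster_sizes)
```

For instance, clusters of sizes 1, 1, 3, 4 and 10 fall into the bins "1" (twice), "3-4" (twice) and "9-16" (once).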
Gene databases
• Ensembl (EBI)
  • Automatic annotation: mRNA and protein sequence
  • Curated annotation: Vega project
• Entrez Gene (NCBI)
  • Links RefSeq sequences to external annotations
Web sites for biological databases
• NCBI www.ncbi.nlm.nih.gov
• EBI www.ebi.ac.uk
• ENSEMBL www.ensembl.org (= at EBI)