Date post: | 03-Jan-2016 |
Category: |
Documents |
Upload: | kelly-george |
View: | 215 times |
Download: | 1 times |
An information explosion!
Bioinformatics
Computational tools are developed to collect, organize and analyze a wide variety of biological data
Advances in DNA sequencing technologies have accelerated the pace of discovery. Much of the process is now automated.
What is a database?
Which databases are important for molecular cell biology research?
How is information processed in databases?
Literature
Nucleotide
Protein
Organism
Structure
Function
Biological databases use different organizing principles
Hyperlinks connect records in different databases
Databases are organized collections of information
Information is stored in records
Accession #
Field 1 ........................
Field 2 .......................
Data........................................................................................
Databases assign each record a unique accession number using their own numbering system
Fields are used to cross-reference the data. Records can be searched by fields.
Data is entered in the record using a defined format
Bioinformaticians work with computer scientists to set up the database structure
Curators review and link records within and between databases
The information in databases ultimately derives from experimental data
Researchers do experiments
Researchers analyze data and write
papers
Data is published in
journals
to PubMe
d
nucleotide sequencesmuta
nts
, phen
otyp
e in
fo structural
coordinates
Curators will process the submissions and link entries in different databases
What is a database?
Which databases are important for molecular cell biology research?
How is information processed in databases?
Largest collection is housed at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine
NLM-NCBI complex in Bethesda MD
Large staff of curators process the information and compile information into derivative databases
Biologists use hundreds of different databases from around the world, some with similar foci
NCBI maintains both primary and derivative databasesWe’ll look at three of them
PubMed is the premier literature database in the world
SGD is a derivative database serving the yeast research community
Grew out of decades of research
Genome project provided a systematic organization for genes
What is a database?
Which databases are important for molecular cell biology research?
How is information processed in databases?
Questions for today:
Most records in the Protein database have been derived by automated translation of nucleotide sequences
GenBankNucleotide sequences
Annotated nucleic acid sequences are submitted to GenBank from many sources, including genome projects, individual investigators, and other databases – there is considerable REDUNDANCY in the information
ProteinAmino acid sequences
Automated translations of nucleotide sequences
Experimentally determined amino acid sequences and information from other protein databases
RefSeqNon-redundant nucleotide and
protein sequences
Sequences are compiled to generate non-redundant reference sequences
Curators are responsible for data flow between the NCBI databases
On a larger scale: Genome projects have produced the reference sequences in
nucleotide databases
(robots and computers do much of the work)
1. Pieces of chromosomal DNA are sequenced, each ~1000 bp long
S. cerevisiae genome is ~12 Mbp – how many reads would be necessary to cover each base pair in the genome once?
2. Overlapping sequence reads are aligned until sequences of entire chromosomes were complete
Computer algorithms identify areas of sequence overlap
Process is repeated to align long stretches of sequence
Complete chromosome sequences are submitted to GenBank
GenBank NC_####### (non-redundant chromosome) sequences
3. Chromosomal sequences are analyzed for the presence of potential transcripts (open reading frames; ORFs)
ORFs are characterized by an under-representation of stop codons
ORF-finding computer algorithms look for sequences that• begin with a methionine• methionine is separated from a stop codon in the same reading
frame by a large number of amino acids (often 100, equiv. to 300bp)
GenBank NM_####### records are predicted ORFs
4. Protein sequences are computationally predicted from ORF sequences
GenBank NP_###### records
Genes were given systematic (locus) names by their positions on chromosomes
Systematic name for MET1: YKR069W
Y (A-P) (L or R) (ORF number) (W or C)
yeast left or right arm of chromosome
sense strand is Watson or Crick strand (coding sequence is read 5’ to 3’)
ORF number, counting away from the centromere (position = 0)
chromosome1 = A2 = Betc.
Left arm Right arm
centromere
W W
C C