Date post: | 17-Jan-2016 |
Upload: | domenic-jeffrey-curtis |
Basic Concepts of Bioinformatics
Mitcon Biopharma Chaitanya Velhal
What is BIOINFORMATICS
All aspects of gathering, storing, handling, analyzing, interpreting and spreading vast amounts of biological information in databases. This information includes gene sequences, biological activity/function, pharmacological activity, biological structure, molecular structure, protein-protein interactions, and gene expression. Bioinformatics uses computers and statistical techniques to accomplish research objectives, for example, to discover a new pharmaceutical or herbicide.
[Figure: diagram placing Bioinformatics at the overlap of Biology, Chemistry, Statistics, and Computer Science]
Bioinformatics encompasses the use of tools and techniques from three separate disciplines:
• Molecular biology (the source of the data to be analyzed)
• Computer science (supplies the hardware for running analysis and the networks to communicate the results)
• Data analysis algorithms, which strictly define bioinformatics
All of the information needed to build an organism is contained in its DNA. If we could understand it, we would know how life works.
◦ Preventing and curing diseases like cancer (which is caused by mutations in DNA) and inherited diseases.
◦ Curing infectious diseases (everything from AIDS and malaria to the common cold). If we understand how a microorganism works, we can figure out how to block it.
◦ Understanding genetic and evolutionary relationships between species.
◦ Understanding genetic relationships between humans. Projects exist to understand human genetic diversity, and also to sequence the Neanderthal genome.
Ancient DNA: currently it is thought that under ideal conditions (continuously kept frozen), there is a limit of about 1 million years for DNA survival. So Jurassic Park will probably remain fiction.
Why it’s useful
DNA is just a long string of 4 letters (nucleotides, or bases): Adenine, Guanine, Cytosine, and Thymine.
◦ Which we will just refer to as A, C, G, and T.
Each DNA molecule has 2 strands, with the bases paired in the center.
◦ A on one strand always pairs with T on the other strand.
◦ G pairs with C.
◦ The strands run in opposite directions (like roads).
Since the two DNA strands are complementary, there is no need to write down both strands.
DNA
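Because the strands are complementary and run in opposite directions, the second strand can be computed from the first. A minimal Python sketch of that idea follows; the function name `reverse_complement` is chosen here for illustration and is not from the slides.

```python
# Pairing rules from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand):
    """Return the opposite strand, read in its own direction.

    Each base is swapped for its partner, then the result is reversed
    because the two strands run in opposite directions.
    """
    return strand.upper().translate(COMPLEMENT)[::-1]

if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Applying the function twice gives back the original strand, which is why storing one strand is enough.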
Each chromosome is a long piece of DNA.
◦ The B. megaterium genome is a circle (like most bacteria) of about 5 million bases.
◦ Human chromosomes are 100-200 million bases long. We have 46 chromosomes (2 sets of 23, one set from each parent).
Genes are just regions on that DNA. It is not obvious where genes are if you look at a DNA sequence.
◦ There is a lot of DNA that is not part of genes: in humans only 2% at most of the DNA is part of any gene.
◦ Bacteria use more of their DNA: 80% of the B. meg chromosome is genes.
B. meg has about 1 gene per 1000 base pairs (bp) of DNA, so about 5000 genes.
Humans have about 25,000 genes.
◦ We are far more complicated than bacteria: regulation of the genes is very complicated in humans.
◦ We use the same gene in different ways in different tissues.
Chromosomes and Genes
Most genes code for proteins: each gene contains the information necessary to make one protein.
Proteins are the most important type of macromolecule.
◦ Structure: collagen in skin, keratin in hair, crystallin in eye.
◦ Enzymes: all metabolic transformations (building up, rearranging, and breaking down of organic compounds) are done by enzymes, which are proteins.
◦ Transport: oxygen in the blood is carried by hemoglobin; everything that goes in or out of a cell (except water and a few gases) is carried by proteins.
◦ Also: nutrition (egg yolk), hormones, defense, movement.
Genes and Proteins
Proteins are long chains of amino acids.
There are 20 different amino acids coded in DNA.
There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids.
◦ 4 x 4 x 4 = 64 possible 3-base combinations (codons).
◦ Each codon codes for one amino acid.
◦ Most amino acids have more than one possible codon.
Genes start at a start codon and end at a stop codon.
3 codons are stop codons: all genes end at a stop codon.
Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.
◦ In eukaryotes, ATG is always the start codon, making Methionine (Met) the first amino acid in all proteins (but in many proteins it is immediately removed).
◦ In prokaryotes, ATG, GTG, or TTG can be used as a start codon. B. meg prefers ATG, but about 30% of the genes start with GTG or TTG.
The Genetic Code
In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
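The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code table from the conventional T, C, A, G ordering and translates a DNA string up to the first stop codon. The names `CODON_TABLE` and `translate` are illustrative, and only the standard code is assumed (no organism-specific variants).

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

def translate(dna):
    """Translate codon by codon from position 0 until a stop codon or the end."""
    dna = dna.upper().replace("U", "T")  # treat RNA input as DNA, as the slide suggests
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(len(CODON_TABLE), "codons;", "stop codons:", STOP_CODONS)
    print(translate("ATGGCCTAA"))
```

Note that this table reads GTG and TTG as valine and leucine; their use as start codons in prokaryotes is a separate signal that this sketch does not model.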
Brief history of bioinformatics: other important steps
• Development of sequence retrieval methods (1970-80s)
• Development of principles of sequence alignment (1980s)
• Prediction of RNA secondary structure (1980s)
• Prediction of protein secondary structure and 3D structure (1980-90s)
• The FASTA and BLAST methods for DB search (1980-90s)
• Prediction of genes (1990s)
• Studies of complete genome sequences (late 1990s-2000s)
Organizing biological knowledge in databases
Biological raw data are stored in public databanks (such as GenBank or EMBL for primary DNA sequences).
The data can be submitted and accessed via the World Wide Web.
Protein sequence databanks like TrEMBL provide the most likely translation of all coding sequences in the EMBL databank. Sequence data are prominent, but other data are also stored, e.g. yeast two-hybrid screens, expression arrays, systematic gene knock-out experiments, and metabolic pathways.
Data Schema in Warehousing: A Gene Expression Example
[Figure: schema of a gene expression warehouse. Labels in the diagram: Protein, Disease, SNP, Enzyme, Pathway, Known Gene, Sequence Cluster, Affy Fragment, Sequence, Metabolite, NMR, LocusLink, MGD, ExPASy SwissProt, PDB, OMIM, NCBI dbSNP, ExPASy Enzyme, KEGG, SPAD, UniGene, GenBank]
“Ten Important Bioinformatics Databases”
Database     URL                    Content
GenBank      www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl      www.ensembl.org        human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov   literature references
NR           www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT   www.expasy.ch          protein sequences
InterPro     www.ebi.ac.uk          protein domains
OMIM         www.ncbi.nlm.nih.gov   genetic diseases
Enzymes      www.chem.qmul.ac.uk    enzymes
PDB          www.rcsb.org/pdb/      protein structures
KEGG         www.genome.ad.jp       metabolic pathways
Source: Bioinformatics for Dummies
[Figure: Genome, Gene = DNA, RNA, and Protein (primary sequence), with the points where treatments act: gene therapy, DNA-binding drugs, RNA-binding drugs, and drugs acting as inhibitors/activators]
Central dogma of modern drug discovery
Drug Design
The information present in DNA is expressed via RNA molecules into proteins, which are responsible for carrying out various activities.
This information flow is called the central dogma of molecular biology.
Potential drugs can bind to DNA, RNA or proteins to suppress or enhance the action at any stage in the pathway.
All organisms self-replicate due to the presence of the genetic material DNA, a polynucleotide consisting of four bases: Adenine (A), Thymine (T), Guanine (G) and Cytosine (C).
The entire DNA content of the cell is known as the genome. A segment of the genome that is transcribed into RNA is called a gene.
Hereditary information is transferred in the form of genes containing the four bases. Understanding these genes is one of the modern-day challenges.
History of Bioinformatics
Year   Subject Name              Mbp (millions of base pairs)
1995   Haemophilus influenzae    1.8
1996   Baker's yeast             12.1
1997   E. coli                   4.7
2000   Pseudomonas aeruginosa    6.3
2000   A. thaliana               100
2000   D. melanogaster           180
2001   Human genome              3,000
2002   House mouse               2,500
We have sequenced genes and identified what many of them do.
The sequences are stored in databases.
So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases.
Since there are a large number of databases and sequences, we cannot do a full sequence alignment against each and every sequence.
So heuristics must be used again.
Database Searches
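One common heuristic, used in spirit by tools such as FASTA and BLAST, is to shortlist only those database sequences that share short exact words (k-mers) with the query, and to align just those. The Python sketch below is a much simplified illustration of that seeding idea, not the actual algorithm of either tool. The database contents, the function names, and the word length `k=4` are all made up for the example.

```python
from collections import defaultdict

def build_kmer_index(database, k=4):
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_hits(query, index, k=4):
    """Rank database sequences by how many distinct query words they share."""
    counts = defaultdict(int)
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for name in index.get(kmer, ()):
            counts[name] += 1
    # Best candidates first; ties broken by name so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

if __name__ == "__main__":
    # Hypothetical toy database; real databases hold millions of entries.
    database = {
        "geneA": "ATGGCCATTGTA",
        "geneB": "TTTTTTTTTTTT",
        "geneC": "GGCCATTG",
    }
    index = build_kmer_index(database)
    print(candidate_hits("GCCATT", index))
```

Only the shortlisted sequences would then go through a full alignment, which is where the time saving comes from.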
Sequence information is stored in databases so that it can be manipulated easily.
The databases are located at different places. They exchange information on a daily basis so that they are up to date and in sync.
Primary databases hold sequence data.
Databases
As there are many databases, which one should we search? Some are good in some aspects and weak in others.
A composite database is the answer: it uses several databases as its base data.
Search on these databases is indexed and streamlined so that the same stored sequence is not searched twice in different databases.
Composite DB
OWL has these as its primary databases:
◦ SWISS-PROT (top priority)
◦ PIR
◦ GenBank
◦ NRL-3D
Composite DB
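The idea of a composite database can be sketched in Python as a merge that keeps each distinct sequence only once, taking it from the highest-priority source that contains it. This is an illustration under assumptions, not OWL's actual build procedure: the slides only mark SWISS-PROT as top priority, so treating the listed order as the full priority order is an assumption, and the entry IDs and sequences below are invented.

```python
# Assumed priority order: the order in which the slide lists OWL's sources.
SOURCE_PRIORITY = ["SWISS-PROT", "PIR", "GenBank", "NRL-3D"]

def build_composite(sources):
    """Merge source databases, storing each distinct sequence once.

    `sources` maps a source name to {entry_id: sequence}. The result maps
    each sequence to the (source, entry_id) of its highest-priority copy.
    """
    composite = {}
    for source in SOURCE_PRIORITY:
        for entry_id, sequence in sources.get(source, {}).items():
            # setdefault keeps the first (highest-priority) copy seen.
            composite.setdefault(sequence, (source, entry_id))
    return composite

if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    sources = {
        "PIR": {"P1": "MKT", "P2": "MAA"},
        "SWISS-PROT": {"S1": "MKT"},
        "GenBank": {"G1": "MAA", "G2": "MLL"},
    }
    for sequence, origin in build_composite(sources).items():
        print(sequence, origin)
```

A search over the merged result then touches each stored sequence once, however many source databases carry it.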
Because of the multicellular structure, each cell type expresses its genes in a different way, although each cell has the same genetic content.
I.e., all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them.
Genomics
Finding a gene in sequence data is like finding a needle in a haystack.
However, whereas a needle looks different from the haystack, genes do not look different from the rest of the sequence data.
In a whole array of nucleotides (nt), we try to find and mark the borders of a set of nt as a gene.
This is one of the challenges of bioinformatics.
Genomics - Finding Genes
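A very naive first pass at marking gene borders is to scan for open reading frames (ORFs): stretches that begin at a start codon and run, in the same reading frame, to the next stop codon. The Python sketch below scans the three forward frames only. It is an assumption-laden illustration rather than a real gene finder: it uses the prokaryotic start codons mentioned earlier in these slides, ignores the reverse strand and introns, and the `min_codons` cutoff is arbitrary.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ORFs on the forward strand.

    `end` is exclusive and includes the stop codon, so dna[start:end]
    is the whole ORF. Lengths are counted in codons, stop included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

if __name__ == "__main__":
    dna = "CCATGAAATTTTAGCC"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

Real gene finders add statistical models on top of this, precisely because many ORFs found this way are not genes.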
Organism       Genome Size (Mb; 1 Mb = 1,000,000 bp)   Gene Number   Web Site
Yeast          13.5                                    6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit Flies    180                                     13,601        http://flybase.bio.indiana.edu
Homo Sapiens   3,000                                   45,000        http://www.ncbi.nlm.nih.gov/genome/guide
The proteome is the sum total of an organism's proteins.
Proteomics is more difficult than genomics:
◦ 4 bases vs. 20 amino acids
◦ Simple chemical makeup vs. complex chemical makeup
◦ DNA can be duplicated; proteins can't
Proteomics
Protein structure prediction is one of the biggest challenges of bioinformatics and especially biochemistry.
No algorithm is yet able to consistently predict the structure of proteins.
Protein Structure Prediction
Comparative Modeling
◦ The target protein's structure is modeled by comparison with related proteins.
◦ Proteins with similar sequences are searched for known structures.
Structure Prediction methods
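The first step of comparative modeling, choosing a related protein of known structure, can be sketched in Python as picking the template with the highest sequence identity to the target. The sketch assumes the sequences are already aligned to equal length with `-` for gaps; the alignment step itself and the later model-building steps are not shown. All sequences and names below are invented for illustration.

```python
def percent_identity(a, b):
    """Percent of aligned, non-gap positions at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def pick_template(target, templates):
    """Return the name of the template most similar to the target."""
    return max(templates, key=lambda name: percent_identity(target, templates[name]))

if __name__ == "__main__":
    # Hypothetical pre-aligned sequences of proteins with known structures.
    templates = {"tmplA": "MKTAYLAK", "tmplB": "MRSAY-GK"}
    target = "MKTAYIAK"
    best = pick_template(target, templates)
    print(best, percent_identity(target, templates[best]))
```

The higher the identity, the more reasonable it is to borrow the template's structure as a starting model.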
The taxonomical system reflects evolutionary relationships.
Phylogenetic trees reflect evolutionary relationships through a picture/graph.
◦ Rooted trees: there is only one common ancestor.
◦ Unrooted trees: just show the relationships.
Phylogenetic tree reconstruction algorithms are also an area of research.
Phylogenetics
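As one example of a tree reconstruction algorithm, the Python sketch below implements UPGMA, a simple distance-based method that produces a rooted tree by repeatedly joining the two closest clusters. UPGMA is not named in the slides and is chosen here only because it is short. The sketch assumes equal-length, pre-aligned sequences and uses the fraction of differing positions as the distance; the sequences and species labels are invented.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs):
    """Return a rooted tree as nested 2-tuples of sequence names."""
    sizes = {name: 1 for name in seqs}  # cluster -> number of leaves
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }
    while len(sizes) > 1:
        # Join the closest pair; sort by repr so the output order is stable.
        a, b = sorted(min(dist, key=dist.get), key=repr)
        merged = (a, b)
        for c in sizes:
            if c not in (a, b):
                d_ac = dist.pop(frozenset((a, c)))
                d_bc = dist.pop(frozenset((b, c)))
                # Size-weighted average distance to the new cluster.
                dist[frozenset((merged, c))] = (
                    sizes[a] * d_ac + sizes[b] * d_bc
                ) / (sizes[a] + sizes[b])
        del dist[frozenset((a, b))]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

if __name__ == "__main__":
    # Hypothetical aligned sequences.
    seqs = {
        "human": "ACGTACGT",
        "chimp": "ACGTACGA",
        "mouse": "ACGAACTA",
        "fish": "TCGAGCTA",
    }
    print(upgma(seqs))
```

The outermost tuple is the root, so this produces a rooted tree in the sense described above; other methods, such as neighbor joining, produce unrooted trees.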
Pharmacogenomics
◦ Not all drugs work on all patients; some good drugs cause death in some patients.
◦ So by doing a gene analysis before the treatment, the offending drugs can be avoided.
◦ Also, drugs which cause death in most patients can be used on the minority whose genes are well suited to that drug.
◦ Customized treatment
Gene Therapy
◦ Replace or supply the defective or missing gene.
◦ E.g. insulin, and Factor VIII for haemophilia.
Medical Implications
Diagnosis of disease
◦ Identification of the genes which cause a disease will help detect the disease at an early stage.
E.g. Huntington disease:
◦ Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
◦ Death in 10-15 years
◦ The gene responsible for the disease has been identified
◦ It contains excessively repeated sections of CAG
◦ So once analyzed, the couple can be counseled
Diagnosis of Disease
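Detecting repeated CAG sections in a sequence is a simple pattern search. The Python sketch below reports the longest run of consecutive CAG triplets in a DNA string. It is illustrative only: the function name is made up, the run is found in any reading frame, and no diagnostic threshold is implied because the slides do not give one.

```python
import re

def longest_cag_run(dna):
    """Return the largest number of back-to-back CAG triplets in the sequence."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

if __name__ == "__main__":
    # Hypothetical fragment with two CAG runs, of 3 and 2 repeats.
    print(longest_cag_run("ATGCAGCAGCAGTTTCAGCAG"))
```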
Designing a drug can take up to 15 years and cost $700 million.
One of the goals of bioinformatics is to reduce the time and cost involved.
The process:
◦ Discovery: computational methods can improve this
◦ Testing
Drug Design
Target identification
◦ Identifying the molecule on which the germ relies for its survival.
◦ Then we develop another molecule, i.e. a drug, which will bind to the target.
◦ So the germ will not be able to interact with the target.
◦ Proteins are the most common targets.
Discovery
For example, HIV produces HIV protease, which is a protein that in turn cuts other proteins.
This HIV protease has an active site where it binds to other molecules.
So an HIV drug will go and bind to that active site.
◦ Easier said than done!
Discovery…
Lead compounds are the molecules that go and bind to the target protein's active site.
Traditionally, finding them has been a trial-and-error process.
Now this is being moved into the realm of computers.
Discovery…