Mitcon Biopharma Chaitanya Velhal. What is BIOINFORMATICS All aspects of gathering, storing,...

Basic Concepts of Bioinformatics

Mitcon Biopharma Chaitanya Velhal

What is Bioinformatics?

Bioinformatics covers all aspects of gathering, storing, handling, analyzing, interpreting, and disseminating vast amounts of biological information in databases. This information includes gene sequences, biological activity/function, pharmacological activity, biological structure, molecular structure, protein-protein interactions, and gene expression. Bioinformatics uses computers and statistical techniques to accomplish research objectives, for example, discovering a new pharmaceutical or herbicide.

[Figure: diagram relating Biology, Chemistry, Statistics, Computer Science, and Bioinformatics]

Bioinformatics encompasses the use of tools and techniques from three separate disciplines:

• Molecular biology (the source of the data to be analyzed)
• Computer science (supplies the hardware for running analyses and the networks to communicate the results)
• Data analysis algorithms (which strictly define bioinformatics)

Why it’s useful

All of the information needed to build an organism is contained in its DNA. If we could understand it, we would know how life works.
◦ Preventing and curing diseases like cancer (which is caused by mutations in DNA) and inherited diseases.
◦ Curing infectious diseases (everything from AIDS and malaria to the common cold). If we understand how a microorganism works, we can figure out how to block it.
◦ Understanding genetic and evolutionary relationships between species.
◦ Understanding genetic relationships between humans. Projects exist to understand human genetic diversity. Related work includes sequencing the Neanderthal genome.
◦ Ancient DNA: it is currently thought that even under ideal conditions (continuously kept frozen), DNA survives for at most about 1 million years. So Jurassic Park will probably remain fiction.

DNA

DNA is just a long string of 4 letters (nucleotides, or bases): Adenine, Guanine, Cytosine, and Thymine.
◦ We will refer to them simply as A, C, G, and T.

Each DNA molecule has 2 strands, with the bases paired in the center.
◦ A on one strand always pairs with T on the other strand.
◦ G pairs with C.
◦ The strands run in opposite directions (like roads).

Since the two DNA strands are complementary, there is no need to write down both strands.
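Because the strands are complementary and run in opposite directions, the second strand can always be recomputed from the first. A minimal Python sketch of that idea follows; the function name `reverse_complement` and the example sequence are illustrative choices, not something taken from the slides.

```python
# Map each base to its pairing partner (A-T, G-C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own reading direction."""
    # Complement every base, then reverse, because the strands are antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]

top = "ATGGCC"                       # made-up example sequence
bottom = reverse_complement(top)
print(top, bottom)
```

Applying the function twice returns the original strand, which is why storing a single strand loses no information.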

Chromosomes and Genes

Each chromosome is a long piece of DNA.
◦ The B. megaterium genome is a circle (like most bacterial genomes) of about 5 million bases.
◦ Human chromosomes are 100-200 million bases long. We have 46 chromosomes (2 sets of 23, one set from each parent).

Genes are just regions on that DNA. It is not obvious where genes are if you look at a DNA sequence.
◦ There is a lot of DNA that is not part of genes: in humans, at most 2% of the DNA is part of any gene.
◦ Bacteria use more of their DNA: 80% of the B. meg chromosome is genes.

B. meg has about 1 gene per 1000 base pairs (bp) of DNA, so about 5000 genes.

Humans have about 25,000 genes.
◦ We are far more complicated than bacteria: regulation of the genes is very complicated in humans.
◦ We use the same gene in different ways in different tissues.
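The gene-density figures can be checked with simple arithmetic. The sketch below uses the rounded numbers quoted in this deck (about 5 million bases and 5000 genes for B. meg, about 25,000 genes for humans) together with the roughly 3,000 million base pairs that the deck's genome table lists for the human genome; the helper name `bp_per_gene` is an illustrative choice.

```python
def bp_per_gene(genome_bp: int, gene_count: int) -> float:
    """Average number of base pairs of genome per gene."""
    return genome_bp / gene_count

# Rounded figures quoted in the slides, not precise measurements.
b_meg = bp_per_gene(5_000_000, 5_000)
human = bp_per_gene(3_000_000_000, 25_000)

print(f"B. meg: about 1 gene per {b_meg:,.0f} bp")
print(f"Human:  about 1 gene per {human:,.0f} bp")
```

On these rounded numbers the human genome has about 120 times more DNA per gene than B. meg, which is consistent with most human DNA lying outside genes.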

Genes and Proteins

Most genes code for proteins: each gene contains the information necessary to make one protein.

Proteins are the most important type of macromolecule.
◦ Structure: collagen in skin, keratin in hair, crystallin in the eye.
◦ Enzymes: all metabolic transformations (building up, rearranging, and breaking down organic compounds) are done by enzymes, which are proteins.
◦ Transport: oxygen in the blood is carried by hemoglobin; everything that goes in or out of a cell (except water and a few gases) is carried by proteins.
◦ Also: nutrition (egg yolk), hormones, defense, movement.

The Genetic Code

Proteins are long chains of amino acids. There are 20 different amino acids coded in DNA.

There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids (2 bases would give only 4 x 4 = 16 combinations, which is too few).
◦ 4 x 4 x 4 = 64 possible 3-base combinations (codons).
◦ Each codon, other than the 3 stop codons, codes for one amino acid.
◦ Most amino acids have more than one possible codon.

Genes start at a start codon and end at a stop codon.

3 codons are stop codons: all genes end at a stop codon.

Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.
◦ In eukaryotes, ATG is almost always the start codon, making Methionine (Met) the first amino acid in proteins (but in many proteins it is immediately removed).
◦ In prokaryotes, ATG, GTG, or TTG can be used as a start codon. B. meg prefers ATG, but about 30% of its genes start with GTG or TTG.

In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.
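The codon arithmetic above can be made concrete with a short Python sketch. It builds the 64-entry standard genetic code from a compact string (bases ordered T, C, A, G, with `*` marking stop codons) and translates a DNA sequence written with T rather than U. The `translate` helper and the example sequence are illustrative, and the sketch assumes the reading frame starts at position 0.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate DNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":        # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))              # number of codons
print(translate("ATGGCCAAATAA"))     # made-up gene: ATG GCC AAA TAA
```

Counting the distinct letters in `AMINO_ACIDS` (excluding `*`) gives the 20 amino acids, and counting the `*` entries gives the 3 stop codons.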

Brief history of bioinformatics: other important steps

• Development of sequence retrieval methods (1970-80s)

• Development of principles of sequence alignment (1980s)

• Prediction of RNA secondary structure (1980s)

• Prediction of protein secondary structure and 3D structure (1980-90s)

• The FASTA and BLAST methods for DB search (1980-90s)

• Prediction of genes (1990s)

• Studies of complete genome sequences (late 1990s–2000s)

Organizing biological knowledge in databases

Biological raw data are stored in public databanks (such as GenBank or EMBL for primary DNA sequences).

The data can be submitted and accessed via the World Wide Web.

Protein sequence databanks like TrEMBL provide the most likely translation of all coding sequences in the EMBL databank. Sequence data are prominent, but other data are also stored, e.g. yeast two-hybrid screens, expression arrays, systematic gene knock-out experiments, and metabolic pathways.

Data Schema in Warehousing: A Gene Expression Example

[Figure: Gene Expression Warehouse schema. Entities shown: Protein, Disease, SNP, Enzyme, Pathway, Known Gene, Sequence Cluster, Affy Fragment, Sequence, Metabolite. Data sources shown: LocusLink, MGD, ExPASy SwissProt, PDB, OMIM, NCBI dbSNP, ExPASy Enzyme, KEGG, SPAD, UniGene, Genbank, NMR.]

“Ten Important Bioinformatics Databases”

Database     URL                    Content
GenBank      www.ncbi.nlm.nih.gov   nucleotide sequences
Ensembl      www.ensembl.org        human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov   literature references
NR           www.ncbi.nlm.nih.gov   protein sequences
SWISS-PROT   www.expasy.ch          protein sequences
InterPro     www.ebi.ac.uk          protein domains
OMIM         www.ncbi.nlm.nih.gov   genetic diseases
Enzymes      www.chem.qmul.ac.uk    enzymes
PDB          www.rcsb.org/pdb/      protein structures
KEGG         www.genome.ad.jp       metabolic pathways

Source: Bioinformatics for Dummies


Central dogma of modern drug discovery

[Figure: flow from genome and gene (DNA) through RNA to protein (primary sequence), labeled with gene therapy, DNA-binding drugs, RNA-binding drugs, and drugs acting as inhibitors/activators]

Drug Design

The information present in DNA is expressed via RNA molecules into proteins, which are responsible for carrying out various activities. This information flow is called the central dogma of molecular biology.

Potential drugs can bind to DNA, RNA, or proteins to suppress or enhance the action at any stage in the pathway.

All organisms self-replicate because they carry the genetic material DNA, a polynucleotide built from four bases: Adenine (A), Thymine (T), Guanine (G), and Cytosine (C).

The entire DNA content of the cell is known as the genome. A segment of the genome that is transcribed into RNA is called a gene.

Hereditary information is transferred in the form of genes containing the four bases. Understanding these genes is one of the modern-day challenges.

History of Bioinformatics

Year   Subject name              Mbp (millions of base pairs)
1995   Haemophilus influenzae    1.8
1996   Baker's yeast             12.1
1997   E. coli                   4.7
2000   Pseudomonas aeruginosa    6.3
2000   A. thaliana               100
2000   D. melanogaster           180
2001   Human genome              3,000
2002   House mouse               2,500

Database Searches

We have sequenced and identified many genes, so we know what they do.

The sequences are stored in databases.

So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases.

Because the databases hold such a large number of sequences, we cannot do a full sequence alignment against each and every one.

So heuristics must be used again.
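One common family of heuristics first looks for short exact word matches and only then aligns the few sequences that share words with the query. The sketch below is a toy version of that filtering step in Python. It is not the actual FASTA or BLAST algorithm, and the function names, the word length `k`, and the three database entries are all made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every length-k word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_hits(query: str, index: dict[str, set[str]], k: int) -> list[tuple[str, int]]:
    """Rank database sequences by how many distinct query words they share."""
    counts = defaultdict(int)
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in query_words:
        for name in index.get(word, ()):
            counts[name] += 1
    # Most shared words first; ties broken by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy database; real databanks hold millions of entries.
database = {
    "geneA": "ATGGCCAAGTTGACC",
    "geneB": "ATGCGTCGTCGTTAA",
    "geneC": "TTGACCGGGCCCAAA",
}
index = build_kmer_index(database, k=5)
print(candidate_hits("GCCAAGTTGA", index, k=5))
```

Only the sequences returned by `candidate_hits` would then be passed to a full alignment, which is where the time saving comes from.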

Databases

Sequence information is stored in databases so that it can be manipulated easily.

The databases are located at different places. They exchange information on a daily basis so that they stay up to date and in sync.

Primary databases hold sequence data.

Composite DB

As there are many databases, which one should we search? Some are good in some aspects and weak in others.

A composite database is the answer: it draws its base data from several databases.

Searches on these databases are indexed and streamlined so that the same stored sequence is not searched twice in different databases.

Composite DB

OWL has these as its primary databases:
◦ SWISS-PROT (top priority)
◦ PIR
◦ GenBank
◦ NRL-3D
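The idea of a prioritized, non-redundant composite can be sketched in a few lines of Python. This is only an illustration of the principle, not how OWL is actually built: the function name `build_composite`, the entry identifiers, and the sequences are hypothetical, and real composites also have to handle near-identical sequences, not just exact duplicates.

```python
def build_composite(sources: list[tuple[str, dict[str, str]]]) -> dict[str, tuple[str, str]]:
    """Merge databases given in priority order, keeping each sequence once.

    Returns a mapping: sequence -> (database name, entry id) of the
    highest-priority database that contains that exact sequence.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            # setdefault keeps the first (highest-priority) copy only.
            composite.setdefault(sequence, (db_name, entry_id))
    return composite

# Hypothetical entries; the priority order follows the list on the slide.
sources = [
    ("SWISS-PROT", {"sp1": "MKTAYIAK", "sp2": "MGLSDGEW"}),
    ("PIR",        {"pir1": "MKTAYIAK", "pir2": "MVHLTPEE"}),
    ("GenBank",    {"gb1": "MVHLTPEE", "gb2": "MSTNPKPQ"}),
]
composite = build_composite(sources)
for sequence, origin in composite.items():
    print(sequence, origin)
```

A search over `composite` touches each distinct sequence once, even though two of the toy sequences appear in more than one source database.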

Genomics

In a multicellular organism, each cell type expresses genes in a different way, although every cell has the same genetic content.

That is, all the information a liver cell needs to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them.

Genomics - Finding Genes

Finding a gene in sequence data is like finding a needle in a haystack.

However, whereas a needle looks different from the haystack, genes do not look different from the rest of the sequence data.

In a whole array of nucleotides (nt), we try to find and mark the borders of a set of nt as a gene.

This is one of the challenges of bioinformatics.
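A very simple way to mark candidate gene borders is to scan for open reading frames: stretches that begin with a start codon and run, in steps of three bases, to the first stop codon. The Python sketch below does this for the three forward reading frames only. It is a deliberately naive illustration; the function name, the `min_codons` cutoff, and the example sequence are assumptions, and real gene finders also use the reverse strand, statistical models, and much longer length thresholds.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3, starts: tuple[str, ...] = ("ATG",)) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF in this frame
    return orfs

dna = "CCATGGCCAAATAAGG"                       # made-up sequence
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])
```

For a prokaryote, passing `starts=("ATG", "GTG", "TTG")` lets the same scan accept the alternative start codons.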

Organism       Genome size (Mb; 1 Mb = 1,000,000 bp)   Gene number   Web site
Yeast          13.5                                    6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit flies    180                                     13,601        http://flybase.bio.indiana.edu
Homo sapiens   3,000                                   45,000        http://www.ncbi.nlm.nih.gov/genome/guide

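Proteomics

The proteome is the sum total of an organism's proteins.

Proteomics is more difficult than genomics:
◦ DNA has an alphabet of 4 bases; proteins have an alphabet of 20 amino acids.
◦ DNA has a simple chemical makeup; that of proteins is complex.
◦ DNA can be duplicated; proteins can't.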

Protein Structure Prediction

Protein structure prediction is one of the biggest challenges of bioinformatics, and especially of biochemistry.

There is currently no algorithm that consistently predicts the structure of proteins.

Structure Prediction methods

Comparative Modeling
◦ The target protein's structure is modeled by comparison with related proteins.
◦ Proteins with similar sequences are searched for known structures.
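The first step of comparative modeling, choosing a related protein of known structure, can be sketched as picking the template with the highest sequence identity to the target. The Python example below assumes the sequences have already been aligned to equal length (gaps written as `-`); the function names, template names, and sequences are hypothetical, and real pipelines compute the alignment themselves and weigh more than identity alone.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over the aligned, gap-free columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

def best_template(target: str, templates: dict[str, str]) -> str:
    """Name of the template sequence most similar to the target."""
    return max(templates, key=lambda name: percent_identity(target, templates[name]))

# Hypothetical aligned sequences of proteins with known structures.
templates = {
    "templateA": "MKTAYLAK",
    "templateB": "MRSGYIQK",
}
target = "MKTAYIAK"
print(best_template(target, templates))
print(percent_identity(target, templates["templateA"]))
```

The structure of the chosen template would then serve as the starting scaffold for modeling the target.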

Phylogenetics

The taxonomical system reflects evolutionary relationships.

Phylogenetic trees depict evolutionary relationships as a picture or graph.

Rooted trees have a single common ancestor.

Unrooted trees just show the relationships.

Algorithms for reconstructing phylogenetic trees are also an area of research.
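As one example of a tree reconstruction algorithm, the sketch below implements UPGMA, a simple distance-based method that produces a rooted tree. The slides do not name a specific algorithm, so this choice is an assumption made for illustration. The sequences `s1` to `s4` are made up, are assumed to be pre-aligned to equal length, and distance is measured as the fraction of differing positions.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs: dict[str, str]):
    """Build a rooted tree (nested tuples) by repeatedly joining the closest clusters."""
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> leaf count
    dist = {
        (i, j): p_distance(seqs[names[i]], seqs[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)                   # closest pair of clusters
        new = next_id
        next_id += 1
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            # Size-weighted average distance to the merged cluster.
            dist[(k, new)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)
    return next(iter(trees.values()))

# Made-up aligned sequences.
seqs = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "ACGAACTTAA",
    "s4": "TCGAATTTGA",
}
print(upgma(seqs))
```

The nested tuples read as a rooted tree: the two most similar sequences are joined first, and the most divergent one branches off at the root.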

Medical Implications

Pharmacogenomics
◦ Not all drugs work on all patients; some good drugs cause death in some patients.
◦ By doing a gene analysis before treatment, the offending drugs can be avoided.
◦ Also, drugs that cause death in most patients can be used on the minority whose genes are well suited to the drug.
◦ Customized treatment.

Gene Therapy
◦ Replace or supply the defective or missing gene.
◦ E.g., insulin, and Factor VIII for haemophilia.

Diagnosis of Disease

Identifying the genes that cause a disease helps detect the disease at an early stage, e.g. Huntington disease:
◦ Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes, and intellectual impairment.
◦ Death in 10-15 years.
◦ The gene responsible for the disease has been identified.
◦ It contains excessively repeated sections of CAG.
◦ So once the gene has been analyzed, the couple can be counseled.
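Detecting "excessively repeated sections of CAG" amounts to measuring the longest uninterrupted run of CAG in a sequence. A minimal Python sketch is shown below; the function name and the example sequence are made up, and the sketch deliberately applies no clinical cutoff, since interpreting a repeat count is a medical question beyond this illustration.

```python
import re

def longest_cag_run(dna: str) -> int:
    """Number of CAG units in the longest uninterrupted CAG repeat."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

# Made-up sequence: a run of 5 CAGs, an interruption, then a run of 2.
sequence = "ATG" + "CAG" * 5 + "CCG" + "CAG" * 2
print(longest_cag_run(sequence))
```

In practice the count would be taken from the relevant region of the sequenced gene rather than from an arbitrary string.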

Drug Design

Developing a drug can take up to 15 years and $700 million.

One of the goals of bioinformatics is to reduce the time and cost involved.

The process:
◦ Discovery (computational methods can improve this)
◦ Testing

Discovery

Target identification
◦ Identify the molecule on which the germ relies for its survival.
◦ Then develop another molecule, i.e. a drug, that will bind to the target.
◦ The germ will then be unable to interact with the target.
◦ Proteins are the most common targets.

Discovery…

For example, HIV produces HIV protease, a protein that in turn cuts other proteins.

HIV protease has an active site where it binds to other molecules.

So an HIV drug binds to that active site.
◦ Easier said than done!

Discovery…

Lead compounds are the molecules that bind to the target protein’s active site.

Traditionally, finding them has been a trial-and-error process.

Now this is being moved into the realm of computers.

Date post:	17-Jan-2016
Upload:	domenic-jeffrey-curtis