Date post: | 13-Jan-2016 |
Upload: | gloria-lester |
Bioinformatics
By: Jesus Caban, Boxuan Gu
What is Bioinformatics?
Bioinformatics has been defined as a means for analyzing, comparing, graphically displaying, modeling, storing, systemizing, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny.
Research in Bioinformatics
• the study of DNA structure and its functions
• gene and protein expression
• protein production, structure, and functions
• genetic regulatory systems
• clinical applications
Biology employs a digital language for representing its information, using an alphabet of four letters (A, C, G, T).
All the chromosomes in an organism's cells can be represented and identified using this alphabet.
The demanding challenge here is to determine how this digital language of the chromosomes is converted into the three-dimensional, and sometimes four-dimensional, languages of living and breathing organisms.
Bioinformatics language
Central Dogma of Molecular Biology
DNA → RNA → Protein
Structures, Processes
Cell, Environment
Protein structure
How many different protein structures are there? How different are they?
Protein structure
What is this? Is the DNA normal?
Computational Biology - Applications and Approaches
The Overview of DNA and RNA
Deoxyribonucleic acid (DNA) is a macromolecular chain of nucleotides that serves as a basic carrier of genetic information and is able to self-replicate. DNA can be represented as a sequence of nucleotide bases.
DNA
DNA sequences are typically thousands to millions of bases long. DNA usually consists of two strands of complementary nucleotide sequences that are base-paired to each other.
DNA in humans forms a linear chain, but DNA can also form a circular molecule.
DNA
A hypothetical double-stranded DNA molecule can be represented as
ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA
TGCACCATCTCTGGGACACACTATCTGGTGCCCAT
because A pairs with T and C pairs with G, and vice versa.
Here A = Adenine, C = Cytosine, G = Guanine, T = Thymine.
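The pairing rule can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names PAIRS, complement, and is_base_paired are chosen here for illustration only. It confirms that the two strands shown are complementary position by position.

```python
# Watson-Crick pairing as stated on the slide: A<->T and C<->G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def is_base_paired(top: str, bottom: str) -> bool:
    """True if every base in `top` pairs with the base below it in `bottom`."""
    return len(top) == len(bottom) and complement(top) == bottom


top = "ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA"
bottom = "TGCACCATCTCTGGGACACACTATCTGGTGCCCAT"
print(is_base_paired(top, bottom))  # True
```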
RNA
RNA is written in the same way as the DNA above, except that T is replaced by U, which represents the uracil nucleotide.
Organisms are further classified into two types:
• Eukaryotes - higher-order organisms whose DNA is enclosed in a cell nucleus, e.g. humans.
• Prokaryotes - organisms whose DNA is not enclosed in a nucleus, e.g. bacteria.
GENE
A gene is a contiguous interval of DNA that contains the information needed to code for a protein. Genes form the basic units of heredity.
The Biological Problem in CS
Computer scientists were empowered to consider a variety of biologically important problems defined primarily on sequences or, in the computer science vernacular, on strings:
• reconstructing long strings of DNA from overlapping string fragments
• determining physical and genetic maps from probe data under various experimental protocols
• storing, retrieving, and comparing DNA strings
• comparing two or more strings for similarities
• searching databases for related strings and substrings
• defining and exploring different notions of string relationships
• looking for new or ill-defined patterns occurring frequently in DNA
• looking for structural patterns in DNA and protein
• determining the secondary (two-dimensional) structure of RNA
• finding conserved, but faint, patterns in many DNA and protein sequences; and more
Challenges
In molecular biology, there are several hundred specialized databases holding raw DNA, RNA, and amino acid strings, or processed patterns (called motifs) derived from the raw string data.
The currently available algorithms for searching them are somewhat slow and error-prone, given that billions of DNA sequences are stored in the DNA and protein databases available worldwide today.
Exact matching will remain a problem of interest, both because the databases are growing exponentially and because it will continue to be a subtask of the more complex searches that will be devised to meet the increasingly advanced requirements of molecular biologists.
DNA contamination
Contamination is often caused by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism, or it comes from the DNA of the host itself.
Contamination can also come from very small amounts of undesired foreign DNA that get physically mixed into the desired DNA and are then amplified by the PCR (polymerase chain reaction) used to make copies of the desired DNA.
The DNA contamination problem can be stated as follows: given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and that are longer than some given length l.
These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string.
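The problem statement above can be sketched directly in Python. This is only an illustration under stated assumptions: the function name contamination_candidates and the example strings are hypothetical, and the quadratic dynamic-programming table used here is a simple stand-in for the faster suffix-tree techniques usually applied to this problem. It reports only maximal shared substrings, since every shorter piece of a reported match is also shared.

```python
def contamination_candidates(s1: str, s2: str, min_len: int):
    """Return maximal substrings shared by s1 and s2 that are longer than min_len.

    Each hit is (start_in_s1, start_in_s2, substring).
    Runs in O(len(s1) * len(s2)) time, which is fine for a sketch only.
    """
    hits = []
    # prev[j] = length of the longest common suffix of s1[:i-1] and s2[:j]
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1] + 1
                # A match is reported only when it cannot be extended to the right;
                # it is already as long as possible to the left.
                can_extend = i < len(s1) and j < len(s2) and s1[i] == s2[j]
                if curr[j] > min_len and not can_extend:
                    k = curr[j]
                    hits.append((i - k, j - k, s1[i - k:i]))
        prev = curr
    return hits


# Hypothetical example strings (not from the slides).
sample = "TTACGTACGGA"      # S1: newly sequenced DNA
vector = "GGACGTACCC"       # S2: known contamination source
print(contamination_candidates(sample, vector, 4))  # [(2, 2, 'ACGTAC')]
```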
Matching and Alignment
• Exact string matching: Knuth-Morris-Pratt and Boyer-Moore
• Exact matching with a set of patterns: Aho-Corasick
• Inexact matching: edit distance and dynamic programming
• Sequence alignment problems
• Multiple alignment problems
How to Solve
Naive method: slide the pattern P (length n) along the text T (length m) and, for each alignment, compare characters from left to right. O(n*(m-n+1)) comparisons in the worst case.
Knuth-Morris-Pratt (KMP) algorithm: O(n+m).
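Both methods can be sketched in a few lines of Python. The function names below are chosen for illustration and the example text is hypothetical; the KMP version follows the usual textbook formulation with a failure table rather than any particular library.

```python
def naive_search(pattern: str, text: str):
    """Slide the pattern along the text; compare left to right at each shift."""
    n, m = len(pattern), len(text)
    hits = []
    for shift in range(m - n + 1):
        k = 0
        while k < n and text[shift + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(shift)
    return hits


def failure_table(pattern: str):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern: str, text: str):
    """Report every start position of pattern in text in O(n + m) time."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]      # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]      # allow overlapping matches
    return hits


print(kmp_search("ACG", "ACGTACGTGACG"))  # [0, 4, 9]
```

The key difference is that KMP never moves backwards in the text: after a mismatch it reuses what the failure table says about the characters already matched.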
String is not Sequence
A string is not the same as the concept of a (sub)sequence in biology!
(Sub)sequences in the biological literature refer to strings that might be interspersed with other characters, such as gaps.
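The distinction is easy to see in code. In this small Python sketch (function names and example strings are illustrative, not from the slides), a substring must appear contiguously, while a subsequence only has to appear in order, possibly with other characters in between.

```python
def is_substring(s: str, t: str) -> bool:
    """True if s occurs in t as one contiguous block."""
    return s in t


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of s occur in t in order, gaps allowed."""
    remaining = iter(t)
    # `ch in remaining` advances the iterator until ch is found (or t runs out),
    # so each character of s must be found after the previous one.
    return all(ch in remaining for ch in s)


print(is_substring("ACT", "AGCTT"))    # False
print(is_subsequence("ACT", "AGCTT"))  # True
```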
More on Bioinformatics
Bioinformatics facts
• 3 billion chemical base pairs make up human DNA
• There are about 30,000 genes
• There are about 100,000 proteins
• Changes in a single base pair are responsible for many illnesses
Genomics
Genome: the complete set of genetic instructions for making an organism
Genomics: attempts to analyze or compare the entire genetic complement of a species
Genome Project
The U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health.
Goals:
• identify genes in human DNA
• determine chemical base pairs
• create databases
• create tools for data analysis
The Feb. 16, 2001, issue of Science contains the first analysis of the working draft human genome sequence.
More on Genomics
Comparative Genomics: the management and analysis of the millions of data points that result from Genomics
Functional Genomics: ways of identifying gene functions and associations
Structural Genomics: whole-genome analysis
Modern Molecular Biology
Signals are received at the cell surface and eventually travel to the nucleus.
Transcription factors cause the signal to be converted into a change in expression of a gene.
The gene products are converted to proteins in the cytoplasm, where they can effect further changes in the cell.
From: Genes for Geeks. http://www.hpcf.upr.edu/~humberto/presentations
http://www.bioteach.ubc.ca
Proteins
Proteins make up about 15% of the mass of the average person
Proteins make up most of the components of cells
Polypeptides: small soluble proteins consisting of a few amino acids linked together
3D structures are composed of one or more polypeptides
An amino acid is a small organic molecule; there are about 20 different amino acids
Structural Biology
The function of a protein is completely determined by its structure (3D shape)
The structure of a protein is completely determined by the sequence of its polypeptide components
The first biopolymers to be sequenced were proteins, but now it is much simpler, faster, and cheaper to sequence DNA
Genomics Language
Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T)
DNA forms the double helix by pairing with its reverse complement (A-T, G-C)
Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns
A gene is copied into complementary RNA in a process called transcription (U substitutes for T)
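The reverse complement and the transcription step can be illustrated with a short Python sketch. The function names and the six-base example are hypothetical, and the sketch assumes the common convention that the RNA is complementary to the template strand and is written 5' to 3'.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """The strand that pairs with `dna` in the double helix, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(template_strand: str) -> str:
    """RNA complementary to the template strand, with U in place of T."""
    return reverse_complement(template_strand).replace("T", "U")


coding = "ATGGCC"                      # hypothetical stretch of a gene
template = reverse_complement(coding)  # the opposite strand
print(template)                        # GGCCAT
print(transcribe(template))            # AUGGCC
```

Note that the RNA reads the same as the coding strand with U substituted for T, which is why the substitution rule on the slide is enough in practice.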
Protein Sequence Alignments
1. Assign functions to unknown proteins
2. Determine relatedness of organisms
3. Identify structurally and functionally important elements
4. Make predictions about the 3D structure
Sequence alignment problem
Given two sequences over an alphabet Σ, and a cost function that assigns a cost to an alignment, find an alignment of the two sequences that minimizes the cost.
Example: given two DNA or protein sequences, find the best match between them.
The best match is the one with the minimum cost.
Sequence alignment problem
Usually solved with dynamic programming
Recall your algorithms class? O(m*n) for sequences of lengths m and n.
G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A

G A A T T C A G T T A
|   | |   |   |     |
G G A T _ C _ G _ _ A
From: http://www.sbc.su.se/~per/molbioinfo2001/dynprog/adv_dynamic.html
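The dynamic program can be sketched as follows. This is an illustrative Python version, not the code from the cited page: the function name align and the unit costs (0 for a match, 1 for a mismatch, 1 per gap) are assumptions made here. Under these costs, both alignments shown above cost 5 (one mismatch plus four gaps), and the sketch finds an alignment of the same cost.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Minimum-cost global alignment by dynamic programming.

    Returns (cost, aligned_a, aligned_b), with '_' marking a gap.
    Time and space are O(m * n) for sequences of lengths m and n.
    """
    m, n = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap
    for j in range(1, n + 1):
        cost[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("_"); i -= 1
        else:
            out_a.append("_"); out_b.append(b[j - 1]); j -= 1
    return cost[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top_row, bottom_row = align("GAATTCAGTTA", "GGATCGA")
print(total)        # 5
print(top_row)
print(bottom_row)
```

Several alignments can share the minimum cost, so the traceback may return a different optimal alignment from the two drawn above.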
The Protein Folding Problem
Given a sequence S and an integer E, is there a fold that has energy -E or lower?
Has been shown to be NP-complete in 2D (HAMILTONIAN PATH).
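To make "energy of a fold" concrete, here is a small Python sketch. It assumes the widely used HP lattice model, which the slides do not name explicitly: residues are H (hydrophobic) or P (polar), the chain is laid out on a 2D grid, and each pair of H residues that are grid neighbours but not chain neighbours contributes -1. The function name fold_energy and the move encoding (U, D, L, R) are hypothetical.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_energy(sequence: str, moves: str):
    """Energy of a 2D lattice fold in the HP model (an assumed model).

    `sequence` is a string of 'H'/'P'; `moves` gives the direction of each bond.
    Returns None if the fold visits the same grid point twice.
    """
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move per bond")
    positions = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = positions[-1]
        positions.append((x + dx, y + dy))
    if len(set(positions)) != len(positions):
        return None  # not self-avoiding, so not a valid fold

    index_at = {p: i for i, p in enumerate(positions)}
    contacts = 0
    for i, (x, y) in enumerate(positions):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


print(fold_energy("HPPH", "RUL"))  # -1: the two H residues touch
print(fold_energy("HPPH", "RRR"))  # 0: stretched out, no contacts
```

Scoring one given fold is cheap, as the sketch shows; the hard part is that the number of candidate folds grows exponentially with the sequence length.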
The Inverse Protein Folding (IFP) problem
Given a target structure or conformation G of a protein, find a sequence S of length n that:
• has G as its minimum energy state
• has the lowest degeneracy (number of other conformations with the same energy) of any possible sequence
The Heuristic Sequence Design (HSD) Problem
IFP is NP-complete; the best known algorithm must search over all possible conformations of all possible sequences.
HSD problems try to simplify the computation by restricting the problem:
• The Canonical Method: find the sequence with at most λn hydrophobic residues
• The Grand Canonical Method: change the energy function
Bioinformatics Software
GCG
Genetics Computer Group
• The Wisconsin Package for Sequence Analysis
• Consists of 130+ integrated programs
• Web-based, command-line, and X Window System interfaces
SeqWeb
• Database searching and retrieval for GCG
• Comparison
• Protein analysis
• Mapping
• Pattern recognition
EMBOSS
EMBOSS is a suite of around 100 bioinformatics programs
• Sequence alignment
• Database search with sequence pattern
• Protein motif identification
Link: http://emboss.org/
Artemis
Artemis is a free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation
Link: http://www.sanger.ac.uk/Software/Artemis/
RasMol
RasMol is a free program that displays molecular structure
Link: http://www.umass.edu/microbio/rasmol/index2.htm
Bio-Knoppix
Bio-Knoppix is a customized distribution of Knoppix enhanced for bioinformatics applications and presentations.
Link: http://bioknoppix.hpcf.upr.edu/
Some useful sites
http://www.bioinformatics.ca/links_directory/
http://www.hgmp.mrc.ac.uk/GenomeWeb/docs-theory.html
http://biotech.icmb.utexas.edu/pages/bioinform/biresources.html
http://scop.berkeley.edu/
http://www.cse.ucsc.edu/~karplus/compbio_pages.html
http://www.peterindia.net/ComputBioArticles.html
http://bioknoppix.hpcf.upr.edu/
http://www.ornl.gov/sci/techresources/Human_Genome
http://bioinformatics.org/