Bioinformatics By: Jesus Caban Boxuan GU. What is Bioinformatics? Bioinformatics has been defined as...

Bioinformatics

By: Jesus Caban, Boxuan GU

What is Bioinformatics?

Bioinformatics has been defined as a means for analyzing, comparing, graphically displaying, modeling, storing, systemizing, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny.

Research in Bioinformatics

• the study of DNA structure and its functions

• gene and protein expression

• protein production, structure and functions

• genetic regulatory systems

• clinical applications

Biology employs a digital language for representing its information, using an alphabet of four basic letters (A, C, G, T).

All the chromosomes in an organism's cells can be represented and identified using this alphabet.

The demanding challenge here is to determine how this digital language of the chromosomes is converted into the three-dimensional and sometimes four-dimensional languages of living and breathing organisms.

Bioinformatics language

Central Dogma of Molecular Biology

DNA → RNA → Protein

Structures, Processes

Cell, Environment

Protein structure

How many different protein structures are there? How different are they?

Protein structure

What is this? Is the DNA normal?

Computational Biology - Applications and Approaches

The Overview of DNA and RNA

Deoxyribonucleic acid (DNA) is a macromolecular chain of nucleotides that serves as a basic carrier of genetic information and is able to self-replicate. DNA can be represented as a sequence of nucleotide bases.

DNA

DNA sequences are typically thousands to millions of bases long. DNA usually consists of two strands of complementary nucleotide sequences that are base-paired to each other.

DNA in humans forms a linear chain, but DNA can also form a circular molecule.

DNA

A hypothetical double-stranded DNA molecule can be represented as

ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA

TGCACCATCTCTGGGACACACTATCTGGTGCCCAT

since A pairs with T and C pairs with G, and vice versa.

Here A - Adenine, C - Cytosine, G - Guanine, T - Thymine
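The pairing rule can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `PAIR`, `complement`, and `is_base_paired` are chosen here for illustration. It derives the second strand of the hypothetical molecule from the first.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G (and vice versa).
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand (same left-to-right order)."""
    return "".join(PAIR[base] for base in strand)

def is_base_paired(top, bottom):
    """True if every position of `top` pairs with the same position of `bottom`."""
    return len(top) == len(bottom) and complement(top) == bottom

top = "ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA"
bottom = complement(top)
print(bottom)
print(is_base_paired(top, bottom))
```

Any character outside A, C, G, T raises a `KeyError`, which is a deliberate choice in this sketch so that malformed input is not silently accepted.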

RNA

RNA is represented in the same way as the DNA above, with the exception that T is replaced by U, which represents the nucleotide uracil.

Organisms are further classified into two types:

• Eukaryotes - higher-order organisms whose DNA is enclosed in a cell nucleus, e.g. humans

• Prokaryotes - organisms whose DNA is not enclosed in a nucleus, e.g. bacteria

GENE

A gene is a contiguous interval of DNA that contains the information needed to code for a protein. Genes form the basic units of heredity.

The Biological problem in CS

Computer scientists were empowered to consider a variety of biologically important problems defined primarily on sequences, or (more in the computer science vernacular) on strings:

• reconstructing long strings of DNA from overlapping string fragments

Cont.

• determining physical and genetic maps from probe data under various experimental protocols

• storing, retrieving, and comparing DNA strings

• comparing two or more strings for similarities

• searching databases for related strings and substrings

• defining and exploring different notions of string relationships

Cont.

• looking for new or ill-defined patterns occurring frequently in DNA

• looking for structural patterns in DNA and protein

• determining secondary (two-dimensional) structure of RNA

• finding conserved, but faint, patterns in many DNA and protein sequences; and more

Challenges

In molecular biology, there are several hundred specialized databases holding raw DNA, RNA, and amino acid strings, or processed patterns (called motifs) derived from the raw string data.

The currently available algorithms for this problem are relatively slow and error-prone, given that billions of DNA sequences are stored in the present-day DNA and protein databases available worldwide.

Cont.

Exact matching will remain a problem of interest as the size of the databases grows exponentially, and also because it will continue to be a subtask of the more complex searches that will be devised to fulfill the various and advanced requirements of molecular biologists.

DNA contamination

Contamination is often caused by a fragment (substring) of a vector (a DNA string) used to incorporate the desired DNA into a host organism, or it comes from the DNA of the host itself.

Contamination can also come from very small amounts of undesired foreign DNA that get physically mixed into the desired DNA and are then amplified by PCR (polymerase chain reaction), which is used to make copies of the desired DNA.

Cont.

The DNA contamination problem can be represented as follows:

Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and that are longer than some given length l.

These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string.
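As an illustration of the problem statement, here is a small Python sketch that reports the maximal common substrings of S1 and S2 that are longer than l. It uses a simple quadratic-time dynamic program over longest common suffixes, chosen here for clarity; it is not the method the slides prescribe, and suffix-tree based approaches are generally much faster on real data. The function name `contamination_candidates` is hypothetical.

```python
def contamination_candidates(s1, s2, min_len):
    """Return the maximal substrings of s2 that occur in s1 and are longer than min_len.

    curr[i] holds the length of the longest common suffix of s2[:j] and s1[:i].
    Time is O(len(s1) * len(s2)); memory is O(len(s1)).
    """
    found = set()
    prev = [0] * (len(s1) + 1)
    for j in range(1, len(s2) + 1):
        curr = [0] * (len(s1) + 1)
        for i in range(1, len(s1) + 1):
            if s2[j - 1] == s1[i - 1]:
                curr[i] = prev[i - 1] + 1
                # The match is maximal only if it cannot be extended to the right.
                extendable = j < len(s2) and i < len(s1) and s2[j] == s1[i]
                if curr[i] > min_len and not extendable:
                    found.add(s2[j - curr[i]:j])
        prev = curr
    return sorted(found)

s1 = "AAACGTACGTTT"   # newly sequenced DNA (made-up example)
s2 = "GGCGTACGCC"     # possible contamination source (made-up example)
print(contamination_candidates(s1, s2, 3))
```

Every substring of a reported candidate is also shared by both strings, so listing only the maximal matches keeps the output small without losing information.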

Matching and Alignment

• Exact string matching: Knuth-Morris-Pratt and Boyer-Moore

• Exact matching with a set of patterns: Aho-Corasick

• Inexact matching: edit distance and dynamic programming

• Sequence alignment problems

• Multiple alignment problems

How to Solve

Naive method: slide the pattern P (length n) along the text T (length m) and, for each alignment, compare characters from left to right. Worst case O(n*(m-n+1)).

Knuth-Morris-Pratt (KMP algorithm): O(n+m).
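The two approaches can be compared side by side. The sketch below is one common textbook formulation of each, written in Python for illustration; the function names are chosen here and do not come from the slides. Both return the starting positions of every occurrence of the pattern in the text.

```python
def naive_search(pattern, text):
    """Try every alignment of pattern against text, comparing left to right."""
    n, m = len(pattern), len(text)
    hits = []
    if n == 0:
        return hits
    for start in range(m - n + 1):
        j = 0
        while j < n and text[start + j] == pattern[j]:
            j += 1
        if j == n:
            hits.append(start)
    return hits

def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: never re-reads text characters, so time is O(n + m)."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping occurrences
    return hits

text = "ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA"
print(naive_search("TAGA", text), kmp_search("TAGA", text))
```

The saving in KMP comes from the failure table: after a mismatch the pattern is shifted by an amount computed from the pattern alone, instead of restarting the comparison one position to the right.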

String is not Sequence

A (sub)string is not the same as a (sub)sequence! The characters of a substring must be contiguous, whereas the characters of a subsequence might be interspersed with other characters, such as gaps.

Note that the biological literature often says "sequence" where computer scientists would say "string".
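Under the computer-science convention, where a substring is contiguous and a subsequence may skip characters, the difference is easy to demonstrate. This is a small illustrative Python sketch; the helper names are chosen here.

```python
def is_substring(p, t):
    """True if the characters of p appear contiguously in t."""
    return p in t

def is_subsequence(p, t):
    """True if the characters of p appear in t in order, possibly with other characters between them."""
    remaining = iter(t)
    # `ch in remaining` consumes the iterator up to and including the first match.
    return all(ch in remaining for ch in p)

print(is_substring("ACG", "A_C_G"), is_subsequence("ACG", "A_C_G"))
```

Every substring is also a subsequence, but not the other way around.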

More on Bioinformatics

Bioinformatics facts

• 3 billion chemical base pairs make up human DNA

• There are about 30,000 genes

• There are about 100,000 proteins

• Changes in a single base pair are responsible for many illnesses

Genomics

Genome: the complete set of genetic instructions for making an organism

Genomics: attempts to analyze or compare the entire genetic complement of a species

Genome Project

U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health.

Goals:

• identify genes in human DNA

• determine chemical base pairs

• create databases

• tools for data analysis

The Feb. 16, 2001 issue of Science contains the first analysis of the working draft human genome sequence.

More on Genomics

Comparative Genomics: the management and analysis of the millions of data points that result from Genomics

Functional Genomics: ways of identifying gene functions and associations

Structural Genomics: whole-genome analysis

Modern Molecular Biology

Signals are received at the cell surface, and travel eventually to the nucleus

Transcription factors cause the signal to be converted into a change in expression of a gene

The gene products are converted to proteins in the cytoplasm, where they can now effect further changes in the cell.

From: Genes for Geeks. http://www.hpcf.upr.edu/~humberto/presentations

http://www.bioteach.ubc.ca

Proteins

Proteins make up about 15% of the mass of the average person

Proteins make up most of the components of cells

Polypeptides: small soluble proteins consisting of a few amino acids linked together

3D structures are composed of one or more polypeptides

An amino acid is a small organic molecule; there are about 20 different amino acids

Structural Biology

The function of a protein is completely determined by its structure (3D shape)

The structure of a protein is completely determined by the sequence of its polypeptide components

The first biopolymers to be sequenced were proteins, but now it is much simpler, faster, and cheaper to sequence DNA

Genomics Language

Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T)

DNA forms the double helix by pairing with its reverse complement (A-T, G-C)

Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns

A gene is copied into complementary RNA in a process called transcription (U substitutes for T)
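Reverse complement and transcription are both simple string operations. The Python sketch below is illustrative only; the function names are chosen here, and it assumes strands are written 5' to 3' using only the letters A, C, G, T.

```python
# Translation table for Watson-Crick complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the strand that pairs with `dna`, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def rna_from_template(template):
    """RNA copied from a template strand: complementary to it, with U in place of T."""
    return reverse_complement(template).replace("T", "U")

def rna_from_coding(coding):
    """Equivalent shortcut: the RNA reads like the coding strand with U in place of T."""
    return coding.replace("T", "U")

coding = "TACGGT"
template = reverse_complement(coding)
print(template, rna_from_template(template), rna_from_coding(coding))
```

The two transcription helpers give the same result because the coding strand is itself the reverse complement of the template strand.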

Protein Sequence Alignments

1. Assign functions to unknown proteins

2. Determine relatedness of organisms

3. Identify structurally and functionally important elements

4. Make predictions about the 3D structure

Sequence alignment problem

Given two sequences over an alphabet Σ, and a cost function that assigns a cost to an alignment, find an alignment of the two sequences that minimizes the cost.

Example: given two DNA or protein sequences, find the best match between them. The best match is the one with the minimum cost.

Sequence alignment problem

Usually solved with dynamic programming

Recall your algorithms class? O(m*n) for sequences of lengths m and n

G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A

G A A T T C A G T T A
|   | |   |   |     |
G G A T _ C _ G _ _ A

From: http://www.sbc.su.se/~per/molbioinfo2001/dynprog/adv_dynamic.html
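The dynamic program can be sketched in a few lines of Python. This is a minimal illustration under assumed unit costs (0 for a match, 1 for a mismatch, 1 for a gap); the cost values, the function name `global_align`, and the use of `_` as the gap symbol are choices made here, not something the slides specify.

```python
def global_align(a, b, mismatch=1, gap=1):
    """Minimum-cost global alignment of a and b in O(len(a) * len(b)) time.

    Returns (cost, aligned_a, aligned_b), with '_' marking gaps.
    """
    m, n = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap
    for j in range(1, n + 1):
        cost[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return cost[m][n], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = global_align("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```

Several alignments can share the same minimum cost, as the two alignments shown above do, so the traceback returns just one of them; which one depends on the order in which the three moves are tried.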

The Protein Folding Problem

Given a sequence S and an integer E, is there a fold that has energy -E or lower?

This has been shown to be NP-complete in 2D (HAMILTONIAN PATH)
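To make the decision problem concrete, the sketch below assumes the widely used 2D HP lattice model, which the slides do not name explicitly: each residue is hydrophobic (H) or polar (P), a fold is a self-avoiding walk on the square grid, and the energy is minus the number of H-H pairs that are grid neighbors but not consecutive in the chain. The function names and the move encoding (U, D, L, R) are hypothetical. The exhaustive search is only meant to show why the problem is hard; it tries 4^(n-1) walks.

```python
from itertools import product

STEPS = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def hp_energy(sequence, moves):
    """Energy of a 2D lattice fold in the (assumed) HP model.

    `moves` lists the direction of each step between consecutive residues.
    Raises ValueError if the walk has the wrong length or crosses itself.
    """
    pos = [(0, 0)]
    for mv in moves:
        dx, dy = STEPS[mv]
        x, y = pos[-1]
        pos.append((x + dx, y + dy))
    if len(pos) != len(sequence) or len(set(pos)) != len(pos):
        raise ValueError("not a valid self-avoiding fold")
    index = {p: i for i, p in enumerate(pos)}
    contacts = 0
    for i, (x, y) in enumerate(pos):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighboring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts

def exists_fold(sequence, e):
    """Brute force: is there a fold of a non-empty sequence with energy -e or lower?"""
    for moves in product("UDLR", repeat=len(sequence) - 1):
        try:
            if hp_energy(sequence, moves) <= -e:
                return True
        except ValueError:
            continue
    return False

print(hp_energy("HPPH", "RUL"), exists_fold("HPPH", 1), exists_fold("HPPH", 2))
```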

The Inverse Protein Folding (IFP) problem

Given a target structure or conformation of a protein G, find a sequence S of length n that:

• Has G as its minimum energy state.

• Has the lowest degeneracy (number of other conformations with the same energy) of any possible sequence.

The Heuristic Sequence Design (HSD) Problem

IFP is NP-complete; the best known algorithm must search over all possible conformations of all possible sequences.

HSD problems try to simplify the computation by restricting the problem:

• The Canonical Method: find the sequence with at most λn hydrophobic residues

• The Grand Canonical Method: change the energy function

Bioinformatics Software

GCG

• Genetics Computer Group

• The Wisconsin Package for Sequence Analysis

• Consists of 130+ integrated programs

• Web-based, command-line and X Window System interfaces

SeqWeb

Database Searching and Retrieval for GCG

• Comparison

• Protein Analysis

• Mapping

• Pattern Recognition

EMBOSS

EMBOSS is a suite where you will find around 100 bioinformatics programs

• Sequence alignment

• Database search with sequence pattern

• Protein motif identification

Link: http://emboss.org/

Artemis

Artemis is a free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation

Link: http://www.sanger.ac.uk/Software/Artemis/

Rasmol

RasMol is a free program which displays molecular structure

http://www.umass.edu/microbio/rasmol/index2.htm

Bio-Knoppix

Bio-Knoppix is a customized distribution of Knoppix enhanced for bioinformatics applications and presentations.

Link: http://bioknoppix.hpcf.upr.edu/

Some useful sites

http://www.bioinformatics.ca/links_directory/

http://www.hgmp.mrc.ac.uk/GenomeWeb/docs-theory.html

http://biotech.icmb.utexas.edu/pages/bioinform/biresources.html

http://scop.berkeley.edu/

http://www.cse.ucsc.edu/~karplus/compbio_pages.html

http://www.peterindia.net/ComputBioArticles.html

http://bioknoppix.hpcf.upr.edu/

http://www.ornl.gov/sci/techresources/Human_Genome

http://bioinformatics.org/

Date post:	13-Jan-2016
