Bioinformatics

Cindy Burklow, Kyle Eli, Clay Harris

What is Bioinformatics?

− “Any use of computers to handle biological information.”

− Or, more specifically:

− “The use of computers to characterize the molecular components of living things.”


− Biomolecules

− “Doing Bioinformatics”

− And simulate!

− Classical bioinformatics deals primarily with sequence analysis

− Polymers

− Monomers

− Macromolecules

− Sequences


− “Post-genomic” era

− Comparative genomics

− New technologies to measure gene expression

− Large-scale methods for identifying gene function

− A shift to finding gene products

− Proteomics

− Structural Genomics

Bioinformatic Fields

Biophysics

Cheminformatics

Computational Biology

Genomics

Mathematical Biology

Medical informatics/Medinformatics

Pharmacogenomics

Pharmacogenetics

Proteomics

BLAST

− Basic Local Alignment Search Tool (BLAST)

− Collection of Software Program Tools

− Software version 2.1.13, offered by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health

− Compares nucleotide or protein sequences to sequence databases

− Finds regions of local similarity between sequences

− Calculates the statistical significance of matches

− Helps infer functional relationships between sequences and identify members of gene families

BLAST

Offers different program tools & databases

Provides a guide to help users decide which BLAST tool to use, based on:

− Nature & size of the input query

− Primary goal of the search

A BLAST search comprises four components:

− Query

− Database

− Program

− Search purpose/goal

Ways to interface with BLAST

− Uses a standardized application program interface (API) for accessing the NCBI QBlast system

− Uses direct HTTP-encoded requests to NCBI web server

− Blast utilities allow you to run searches on your own computer

− NetBlast has command-line network clients that allow you to submit searches to NCBI

A Case Study of High-Throughput Biological Data Processing on Parallel Platforms

San Diego Supercomputer Center and Department of Pharmacology,

University of California

History

Structure comparison algorithms for protein structures have been under development for more than 20 years

Traditionally uses conventional functionally-driven structure determination

Algorithm classifications, by the unit used to build alignments:

− Single residues

− Fragments of multiple residues

− Secondary structure elements

CHALLENGE: Highly redundant datasets requiring very large computations to be performed to gain insight into the meaning of the data

Protein Structures

What is important about Protein Structures?

Comparing a single data sequence string against a very large sequence database called Protein Data Bank (PDB)

Types of Comparisons

Sequence-Sequence

Sequence-Structure

Structure-Structure

Used for protein classification, a better understanding of function, and a clear explanation of distant homologous relationships that cannot be detected from sequence alone, since sequence is more variable than structure

Scale of Problem

Protein Data Bank of 35,000 chains

Pairwise comparison = average ~3 seconds.

Without considering redundancy or chain size, a complete all-against-all computation would take on average…

((35,000 × 35,000)/2) × 3 seconds ≈ 21,000 processor-days, or about 58 YEARS on a single processor!!!!

TIME IS A BIG PROBLEM!!!
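The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the slide's own figures (35,000 chains, about 3 seconds per comparison); the variable names are illustrative.

```python
# Back-of-the-envelope estimate of the all-against-all comparison cost.
chains = 35_000          # chains in the Protein Data Bank (figure from the slide)
seconds_per_pair = 3     # average time for one pairwise comparison

pairs = chains * chains // 2              # approximate number of unordered pairs
total_seconds = pairs * seconds_per_pair

processor_days = total_seconds / (60 * 60 * 24)
processor_years = processor_days / 365

print(f"{pairs:,} pairs")
print(f"{processor_days:,.0f} processor-days")
print(f"{processor_years:.1f} processor-years")
```

Spread evenly over P processors with no overhead, the wall-clock time would be roughly this total divided by P, which is why the rest of the case study focuses on parallel execution.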

Problems

− Determination & Comparison of 3-D protein structures

− Massively parallel computations are needed

Background

Looking for more efficient way to analyze large data sets

Taking advantage of redundancy present in data sets

KEY: Data preprocessing step & organization of the data being searched BEFORE being passed to PARALLEL COMPUTERS

Other Issues to Consider

− Algorithm should give optimal performance

− Scale with the number of processors involved.

Optimization Procedures

− Dynamic Programming

− Monte-Carlo

− Graph Theory

− Combinatorial Search

What does CEPAR stand for?

CE: Combinatorial Extension Algorithm

PAR: Parallel Mode

What is Combinatorial Extension Algorithm?

Method of automatically aligning pairs of structures

Compiles an alignment of a given pair of protein chains by considering the chains sectioned into all possible octapeptide fragments, as defined by the backbone α-carbons

Those octapeptide pairs that have a high distance-based similarity score are deemed “aligned fragment pairs” (AFPs) & used in the next step of analysis

Then the CE algorithm tries to join each AFP to a maximal number of other AFPs in order to create the longest possible alignment path through the two proteins under consideration (with allowance for gaps of up to 30 residues in either protein chain), stitching together a set of AFPs covering a contiguous region.

After possible paths through the two proteins are determined, CE uses additional heuristics to try to improve the final alignment

The 20 best-scoring paths are compiled & the proteins are directly compared based upon the superimposition of the aligned residues.

The path that yields the lowest Root Mean Square Deviation (RMSD) is retained as the “optimal path”.

This path is then subjected to dynamic programming on the structural alignment directly between the two structures, which tests all possible residue equivalences & the resulting RMSD from their superposition.
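The fragment-pairing step can be illustrated with a small sketch. It is not CE's actual scoring: as a simplifying assumption, two octapeptide fragments are compared through the mean absolute difference of their internal Cα–Cα distance matrices, and the function names and the 1.0 Å cutoff are hypothetical.

```python
import math

FRAGMENT_LEN = 8  # octapeptide fragments, as described above

def distance_matrix(coords):
    """All pairwise distances between the C-alpha coordinates of one fragment."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def fragment_dissimilarity(frag_a, frag_b):
    """Mean absolute difference between the two intra-fragment distance matrices.

    Lower means more similar. Simplified, hypothetical score (not CE's formula).
    """
    da, db = distance_matrix(frag_a), distance_matrix(frag_b)
    n = len(frag_a)
    total = sum(abs(da[i][j] - db[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)

def candidate_afps(chain_a, chain_b, cutoff=1.0):
    """Yield (i, j) fragment start positions scoring below the assumed cutoff."""
    for i in range(len(chain_a) - FRAGMENT_LEN + 1):
        for j in range(len(chain_b) - FRAGMENT_LEN + 1):
            frag_a = chain_a[i:i + FRAGMENT_LEN]
            frag_b = chain_b[j:j + FRAGMENT_LEN]
            if fragment_dissimilarity(frag_a, frag_b) < cutoff:
                yield i, j

# Toy chains: a straight line of 10 points and a translated copy of it.
chain_a = [(3.8 * k, 0.0, 0.0) for k in range(10)]
chain_b = [(3.8 * k + 5.0, 2.0, -1.0) for k in range(10)]
afps = list(candidate_afps(chain_a, chain_b))
```

Because the score depends only on internal distances, it is unchanged by translating or rotating either chain, so no superposition is needed at this stage. The path-building, heuristic refinement and RMSD steps are not covered by this sketch.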

Parallel Algorithm

CEPAR uses coarse-grain parallel implementation involving a master/worker strategy suitable for a massively parallel computer architecture.

A parallel algorithm, as opposed to a traditional serial algorithm, is one which can be executed a piece at a time on many different processing devices, and then put back together again at the end to get the correct result.

What does CEPAR do?

Finds pairwise protein structure similarities

Pairwise 3D protein structure comparison

Aligns protein structure from Protein Data Bank

Matches protein structure-to-structure

Runs on a large number of processors

How does CEPAR work?

Optimizes the use of Combinatorial Extension algorithm for the pairwise alignment of polypeptide chains to manage comparative structural information

Builds a structurally representative set of protein chains & reveals structure similarities in the Protein Data Bank that scale with fast growing source of data

Only one master processor was used. It was not advantageous to use more than one master processor, because of communication issues.

Each worker receives a work assignment from the master, compares the 2 entities contained in the assignment using the CE algorithm, returns the results of the comparison to the master & is then ready to receive another assignment

Workers only need to communicate with the Master processor and not each other

Program written in C++ and uses MPI for communication between master & workers
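The master/worker pattern can be sketched in a few lines. CEPAR itself is written in C++ with MPI; this stand-in uses Python threads only to show the flow of assignments, and `compare` is a hypothetical placeholder for a CE comparison.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def compare(pair):
    """Placeholder for one CE comparison; here it only compares string lengths."""
    a, b = pair
    return (a, b, abs(len(a) - len(b)))

def run_master(entities, n_workers=4):
    """Master: hand out pairwise assignments and collect the workers' results."""
    assignments = list(combinations(entities, 2))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each idle worker takes the next assignment; workers never talk to each other.
        return list(pool.map(compare, assignments))

results = run_master(["MKV", "MKVLA", "MK", "MKVLAAG"])
```

All coordination goes through the single master, which matches the design above and also explains why the master can become a bottleneck at high processor counts.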

Computer

“Blue Horizon” – IBM SP parallel computer at the San Diego Supercomputer Center

1152 Power3+ processors each running at 375MHz

Sun Enterprise 10,000 server & Linux PC cluster

Software can work on any parallel machine or PC cluster with Message Passing Interface (MPI)

Assignments & Problem Formulation

Entity list of N entities, where each entity is a protein polypeptide chain characterized by an amino acid sequence & a set of 3D coordinates

Algorithm for pairwise comparison of entities (CE)

Select Representative Protein Structure

Order of Operations

Representation Criteria Notes

Looking for similarity criterion between representatives

Alignments not satisfying this criterion are not recorded

Output: List of representatives as well as entities represented by them & detailed information on alignment satisfying either representative or similarity criterion

It is not vector quantization (so as to minimize computer time)

Representatives are randomly chosen instead of calculating the centroid of a cluster

The applied criteria are believed to adequately describe the structural space of the Protein Data Bank

Representation Criteria

Length difference: $|L_1 - L_2| < L_{thr}\,(L_1 + L_2)/2$

Sequence lengths of the two entities: $L_1$ & $L_2$

Length difference threshold parameter: $L_{thr}$

Alignment length: $L_{ali} > A_{thr}\,(L_1 + L_2)/2$

Number of aligned positions: $L_{ali}$

Alignment length threshold parameter: $A_{thr}$

Gaps: $L_{gap} < G_{thr}\,L_{ali}$

Gap threshold parameter: $G_{thr}$

Number of residues in gaps: $L_{gap}$

Final RMSD of the alignment: $RMSD < R_{thr}$, where $R_{thr}$ is the RMSD threshold parameter
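A minimal sketch of how the four criteria could be applied together. It reads them as: length difference below Lthr times the mean chain length, aligned positions above Athr times the mean chain length, gap residues below Gthr times the aligned length, and RMSD below Rthr. The directions of the first three inequalities are an assumption, and the default threshold values are placeholders, not values from the slides.

```python
def is_represented(l1, l2, l_ali, l_gap, rmsd,
                   l_thr=0.1, a_thr=0.9, g_thr=0.2, r_thr=2.0):
    """Return True if an alignment satisfies all four representation criteria.

    Default thresholds are illustrative placeholders only.
    """
    mean_len = (l1 + l2) / 2
    return (abs(l1 - l2) < l_thr * mean_len   # similar chain lengths
            and l_ali > a_thr * mean_len      # alignment covers most of both chains
            and l_gap < g_thr * l_ali         # few residues in gaps
            and rmsd < r_thr)                 # close superposition

print(is_represented(100, 104, l_ali=98, l_gap=5, rmsd=1.5))
print(is_represented(100, 150, l_ali=98, l_gap=5, rmsd=1.5))
```

The first call passes every test; the second fails on the length-difference criterion alone, which is cheap to check before any structural alignment is run.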

Order of Operation

− Entity-first (2-step)

− Family-first (2-step)

− Family-first (1-step)

New problems uncovered….

Running CEPAR in one step causes….

Limited Scalability

WHY? At high processor counts…

1. Number of idle workers

2. Time taken for communication operations

This is the result of load imbalance at the end of the run, because at this point most of the worker processors run out of tasks while only a few finish their last assignments.

Resource reservation systems on most public supercomputer reserve a block of processors making it impossible to release them one by one.

How to deal with Limited Scalability Issue

Idea for production mode:

The number of processors assigned should stay below a threshold (processor number < threshold number)

Use alternative: two steps instead of one

Utilizes an early stopping condition, which causes the first of the two runs to abort when the accumulated average idle time of the workers exceeds a predefined amount (such as 20% of the total run time).

The remaining part of the calculation is then completed on a smaller number of processors.
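The early stopping condition can be sketched as a simple check the master might run periodically. The function name and bookkeeping are hypothetical, and "total run time" is interpreted here as the time elapsed so far, which is an assumption.

```python
def should_abort(idle_seconds_per_worker, elapsed_seconds, max_idle_fraction=0.20):
    """True once the average accumulated worker idle time exceeds the allowed
    fraction of the run time so far."""
    avg_idle = sum(idle_seconds_per_worker) / len(idle_seconds_per_worker)
    return avg_idle > max_idle_fraction * elapsed_seconds

# Three workers, 200 seconds into the run.
print(should_abort([10, 20, 30], elapsed_seconds=200))    # average idle 20 s
print(should_abort([100, 50, 60], elapsed_seconds=200))   # average idle 70 s
```

With a 20% limit the cutoff at 200 seconds is 40 seconds of average idle time, so the first run continues and the second would be stopped and resumed on fewer processors.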

Two other problems….

Master processor congestion

Redundancy in assignments

How to avoid congestion….

• Improve communications between processors

• Implement advance buffering of assignments

• Decrease amount of disk I/O

• Implement single-CPU optimization techniques

Keys to success

Detecting a match between a representative (rep) & an entity early avoids redundant comparisons.

It is important to sort the reps in decreasing order of their chance of being similar to the given entity.

Estimate that chance by giving priority to those reps having a number of residues within 10% of the current entity AND by using similarity in amino acid content based on frequency profiles.

The approach is approximate but provides performance gains over a random or sequential choice of reps.
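One way to picture the ordering heuristic is the sketch below. The slides do not specify the similarity measure, so the sum of absolute frequency differences used here is an assumption, as are all function names.

```python
from collections import Counter

def composition(seq):
    """Amino acid frequency profile of a sequence, as a dict of fractions."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def profile_distance(p, q):
    """Sum of absolute frequency differences (an assumed, simple measure)."""
    return sum(abs(p.get(aa, 0.0) - q.get(aa, 0.0)) for aa in set(p) | set(q))

def order_representatives(entity, reps):
    """Sort reps so the most promising candidates are compared first."""
    target = composition(entity)

    def key(rep):
        within_10_percent = abs(len(rep) - len(entity)) <= 0.10 * len(entity)
        # False sorts before True, so negate the length test to put matches first.
        return (not within_10_percent, profile_distance(target, composition(rep)))

    return sorted(reps, key=key)

ordered = order_representatives(
    "AAAAGGGGLL",
    ["AAAAGGGGLLAAAAGGGGLL", "WWWWWWWWWW", "AAAAGGGGLK"],
)
```

Here the rep with similar length and composition comes first, and the rep with identical composition but twice the length comes last, because the length test is given priority. Both signals are cheap to compute compared with a structural alignment.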

MPI Communication

At first, the efficiency of MPI communication appears to play an insignificant role in overall performance, since communication time is a small fraction of the overall CEPAR computation time. However, that time does add up, and tuning the MPI communication does help.

Key: Select the appropriate MPI send function for the hardware/software in hand.

Example: IBM’s implementation of MPI’s blocking send function MPI_Send() is not appropriate, because this implementation does not buffer large messages.

An MPI implementation that avoids buffering messages can cause deadlock in some cases.

In CEPAR no deadlocks occur. However, the master processor can be blocked while waiting for some worker processors to finish. The MPI_Bsend() function for buffered sends solves this problem.

Results

− Family-first approach outperformed the entity-first approach.

− End-of-run load imbalance and allocation of processors were addressed by running in two steps

− Careful Selection of MPI implementation

− Overall CEPAR performance….

Advantages of CEPAR

Ensures optimal use of high-performance computing resources

Analysis of large amounts of data

Can be used on any distributed-memory platform

Can scale with the number of processors involved

Saves time & computational resources

Summary

− Efficient use of resources depends on meticulous design of the algorithm and software, with performance & scalability given a high priority.

− Organization of data being fed to processors

− Optimization of algorithm for distribution of assignments

Proteomics

What is Proteomics?

− The study of the proteome.

− A proteome is “the set of proteins that can be expressed by the genetic material of an organism.”

− In other words, the study of all proteins, the interactions between them, and “their role in physiological and pathophysiological functions”.

− Hopefully will directly contribute to a full description of cellular function.

Challenges in Proteomics Research

− Limited and variable sample material.

− Sample Degradation.

− Vast dynamic range.

− For example, in human serum the concentration of albumin is 10 billion times greater than the concentration of the signaling protein interleukin-6.

Challenges in Proteomics Research (cont’d)

− Plethora of post-translational modifications.

− Nearly boundless tissue, developmental and temporal specificity.

− Disease and drug perturbations.

− “…these difficulties render any comprehensive proteomics project an inherently intimidating and often humbling exercise.”

Five Pillars of Proteomics Research

− Mass spectrometry-based.

− Proteome-wide biochemical arrays.

− Systematic structural biology and imaging techniques.

− Proteome informatics.

− Clinical applications.

Mass spectrometry-based Proteomics

− A primary driving force in proteomics.

− Advancements allow the identification of smaller proteins in more complex mixtures.

− Initially, research required separation of protein by two-dimensional gel electrophoresis before using mass spectrometry.

− Limited to the most abundant proteins.

Mass spectrometry-based Proteomics (cont’d)

− Now, mass spectrometric analysis is used directly.

− Advancements are increasing sensitivity, robustness and data handling.

− Plenty of work to do…

− Much higher throughput and sensitivity is needed for observing proteome dynamics and cellular response.

− More complete sequence coverage.

− Process and workflow refinement.

− Automated protein identification.

− Detection of post-translational modification.

Array-based Proteomics

− Array of immobilized proteins on a support surface.

− One of the most active areas in biotechnology.

− Sensitive, high-throughput.

− Wide range of applications.

− Diagnostics.

− Protein-protein interaction.

− Protein expression profiling on a small or large scale.

− Target identification and validation in the pharmaceutical industry.

Array-based Proteomics (cont’d)

− Arrays give an abundance of data for a single experiment.

− Data handling demands sophisticated software and data comparison analysis.

− Some of the software used for DNA arrays is applicable, along with much of the hardware and detection systems.

Structural Proteomics

− Systematically understanding the structural basis for protein interactions and function.

− Full description of cell behavior requires structural information for all salient protein complexes and their organization at a cellular level.

− Requires a wide scale of measurements…

− From X-ray crystallography and nuclear magnetic resonance at the protein level…

− …to electron microscopy of mega-complexes and electron tomography for high-resolution visualization of the entire cellular environment.

− Modeling of dynamics and interaction through computer simulation.

Informatics

− Proteomics research generates an enormous amount of data.

− A “simple” experiment for a single microbe involving 90 biological samples could generate 18TB of proteomics data.

− Sample documentation, rigorous process standards, and proper annotation are necessary.

− Software development requires a collaborative and documented design process.

− Data stored as XML with an agreed-upon schema.

− HUPO (Human Proteome Organization) defines community standards for data representation: http://psidev.sourceforge.net/

Informatics (cont’d)

− Considerable effort has been applied to interaction databases and systems biology software infrastructure.

− A system for automating protein identification from mass spectral data is needed for generating databases.

− Currently a manual and error-prone process.

− Much was learned from DNA array analysis.


− Current equipment is far from optimal.

− Manufacturers need time to build platforms tailored specifically for proteomics.

− Mass spectrometry should improve significantly.

− Large market for sensitive, affordable mass spectrometers.

− Robotics for sample preparation.

− Availability of large datasets will drive research.

− Modeling cellular behavior.


− Open access for proteomics researchers is needed.

− Academic institutions typically have the basic necessary tools.

− Mismanagement of data.

− Poor throughput.

− Equipment is extremely expensive.

− National proteome centers have been proposed to make expertise and equipment more available.


− Lessons learned from genome sequencing.

− Raw data must be publicly accessible on-line to foster a sense of participation.

− Agreements that mandate public accessibility and non-patenting of basic data

− Large-scale efforts must be coordinated to avoid duplication.

− Also, funding.

Clinical Proteomics

− Proteomics impacts diagnostics as well as drug discovery.

− Most drug targets are proteins.

− Currently a variety of technological platforms in development.

− Still undecided as to which methods will work best.

− The robust and high-throughput nature of mass spectrometric instrumentation is eminently suited to clinical applications.

Clinical Proteomics (cont’d)

− Protein- and antibody-based arrays with validated diagnostic readouts may also become amenable to the clinical setting.

− Proteomics accelerates drug discovery.

− Understanding biological networks within a cell will provide a basis for identifying suitable targets.

Computational Proteomics Examples

− Protein Docking

− In cellular biology, function is accomplished by proteins interacting with themselves and other molecular components.

− Helps verify our understanding of the energetics of macromolecular interactions.

− Characterization of the structures of protein-protein complexes.

RosettaDock

TreeDock

− TreeDock uses a deterministic search

− Can explore all orientations at a very fine resolution in a reasonable amount of time.

TreeDock

− Searching for docking configurations…

− Provide models of each molecule

− Provide anchors for each molecule

− Not necessary for small molecules, all atoms will be tried

TreeDock

− One molecule has a fixed position, other is movable

− Movable molecule is translated, rotated while maintaining contact between anchors

− All positions are tried within a specified resolution
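The deterministic search can be illustrated with a toy two-dimensional sketch. This is not TreeDock's algorithm or code: it only shows the idea of keeping one molecule fixed, holding the two anchor atoms in contact, and trying every pose on a regular grid. The 12-6 Lennard-Jones parameters, the contact distance, and the function names are all assumptions.

```python
import math

def lj_energy(fixed, movable, sigma=3.5, epsilon=1.0):
    """Sum of 12-6 Lennard-Jones terms over all fixed/movable atom pairs."""
    total = 0.0
    for fx, fy in fixed:
        for mx, my in movable:
            r = math.hypot(fx - mx, fy - my)
            sr6 = (sigma / r) ** 6
            total += 4 * epsilon * (sr6 * sr6 - sr6)
    return total

def place(movable, anchor_idx, anchor_pos, phi):
    """Rotate the movable molecule by phi about its anchor, then move the anchor to anchor_pos."""
    ax, ay = movable[anchor_idx]
    c, s = math.cos(phi), math.sin(phi)
    return [(anchor_pos[0] + c * (x - ax) - s * (y - ay),
             anchor_pos[1] + s * (x - ax) + c * (y - ay)) for x, y in movable]

def dock(fixed, fixed_anchor, movable, movable_anchor, contact=3.9, steps=36):
    """Try every (theta, phi) pair on a regular grid; return the lowest-energy pose."""
    fx, fy = fixed[fixed_anchor]
    best = None
    for i in range(steps):
        theta = 2 * math.pi * i / steps      # where the movable anchor sits
        pos = (fx + contact * math.cos(theta), fy + contact * math.sin(theta))
        for j in range(steps):
            phi = 2 * math.pi * j / steps    # orientation of the movable molecule
            energy = lj_energy(fixed, place(movable, movable_anchor, pos, phi))
            if best is None or energy < best[0]:
                best = (energy, theta, phi)
    return best

# Two toy diatomic "molecules"; atom 0 of each is the anchor.
fixed = [(0.0, 0.0), (3.8, 0.0)]
movable = [(0.0, 0.0), (3.8, 0.0)]
energy, theta, phi = dock(fixed, 0, movable, 0)
```

The `steps` parameter plays the role of the specified resolution: a finer grid tries more poses at a cost that grows with the square of `steps` in this two-angle toy.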

Synchrotron IR Analysis of Murine Abdominal Aortic Aneurysm

The Problem

− Abdominal aortic aneurysms (AAAs) occur in 5-7% of people over age 60 in the US

− Some individuals have aorta thickening but never have an AAA

− Chemical precursors to AAA are unknown

− Current drugs treat the symptoms not the cause

Purpose

− Analysis of large 2D FTIR microspectroscopic data sets for anomalies to …

− Determine why infusion of Angiotensin II (AngII) into Apolipoprotein E (apoE) -/- knockout mice causes aorta thickening in some mice and aneurysm in other mice…

− Identify chemical precursors to AAA and ultimately…

− Save Lives!

Data Analysis Issues with 2D FTIR Microspectroscopy

− Spectral features are a blend of what is in each sample

− Datasets are very continuous in nature (Principal Component Analysis (PCA) is often not sufficient to identify chemically similar clusters)

− Subclusters within each PC may be overlooked

− Large datasets (10s of GBs) require substantial computational resources for typical statistical analysis

Large Dataset Example

Scores Analysis with Quantile-Quantile Plots (SAQQ) – The Concept

− Principal Component Analysis (PCA)

− Quantile-Quantile (QQ) Plotting of a single PC

− Linear regression to find “normal” distributions

− Average the original data to find multidimensional centers

− Calculate loadings with inverse principal axis transformation

SAQQ – The Concept

− Calculate QBEAST distances to all points from each cluster center

− Reorganize distances into the original map configuration

− Create “digitally stained” images based upon distance (highlight spectral deviations from the normal distribution)

Principal Component Analysis

− Linear dimension-reduction technique

− Points in multidimensional space are projected onto a space of fewer dimensions

− Creates a new coordinate system based upon variance

− The first axis (PC) has the greatest variance of any projection, the second has the second greatest orthogonal variance, and so on…
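As a minimal sketch of the idea, the first principal component can be approximated by power iteration on the covariance matrix. This is a textbook technique chosen for brevity, not necessarily the method used for the data in these slides. The function name is illustrative, and the sketch assumes the data are not all identical and that the starting vector is not orthogonal to the first PC.

```python
import math

def first_principal_component(rows, iterations=200):
    """Approximate the first PC of the data by power iteration.

    Returns the unit direction vector and the score of each row along it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores

# Points lying exactly on the line y = 2x: the first PC points along (1, 2).
direction, scores = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
```

The `scores` are the projections of each point onto the new axis; these per-PC scores are what the quantile-quantile step operates on.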

SAQQ – The Quantile-Quantile Plot

− Plot order statistics vs. normal cumulative distribution function
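A short sketch of how the points of such a plot can be computed for one PC's scores. The plotting positions (i + 0.5)/n are one common convention and an assumption here; the slides do not say which convention was used.

```python
from statistics import NormalDist

def qq_points(scores):
    """Pair each sorted score with the matching standard-normal quantile."""
    ordered = sorted(scores)
    n = len(ordered)
    normal = NormalDist()
    # Plotting positions (i + 0.5) / n are an assumed convention.
    return [(normal.inv_cdf((i + 0.5) / n), value)
            for i, value in enumerate(ordered)]

points = qq_points([3.0, 1.0, 2.0])
```

If the scores come from a single normal distribution the points fall close to a straight line; bends and breaks in the line suggest more than one underlying population.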

SAQQ – Linear Regression of the QQ plot

1. Take the first (next) 10% of the data

2. Calculate r² and compare to 0.9

3. If r² > 0.9 add the next point and go to step 2

4. If r² < 0.9 consider data a cluster and go to step 1
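The four steps can be sketched as a loop over the QQ-plot points. Several details are assumptions not stated on the slide: the point that drops the fit below the threshold is kept in the cluster it closes, a fit exactly at 0.9 closes the cluster, the seed holds at least three points, and data left over at the end form a final cluster.

```python
def r_squared(points):
    """Squared Pearson correlation of (x, y) pairs; 1.0 for a perfect line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def segment_clusters(points, threshold=0.9, seed_fraction=0.10):
    """Split QQ-plot points into runs that each stay close to a straight line."""
    seed = max(3, int(len(points) * seed_fraction))
    clusters, start = [], 0
    while start < len(points):
        end = min(start + seed, len(points))       # step 1: take the next 10%
        # steps 2-3: while the fit is good, add the next point and re-check
        while end < len(points) and r_squared(points[start:end]) > threshold:
            end += 1
        clusters.append(points[start:end])         # step 4: close the cluster
        start = end
    return clusters

straight = [(float(i), 2.0 * i + 1.0) for i in range(30)]
kinked = ([(float(i), float(i)) for i in range(20)]
          + [(float(i), 0.0 if i % 2 else 50.0) for i in range(20, 40)])
one_cluster = segment_clusters(straight)
several = segment_clusters(kinked)
```

Perfectly linear input comes back as a single cluster, while the kinked input is cut shortly after the straight section ends.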

SAQQ – The Quantile-Quantile Plot

− SAQQ must be applied to all PCs

SAQQ Continued

− Average the original data to find multidimensional centers

− Calculate loadings with inverse principal axis transformation

− Calculate QBEAST distances to all points from each cluster center

SAQQ – QBEAST Distances

[Figure panels: QBEAST Distances, Mahalanobis Distances, Euclidean Distances]

− QBEAST takes into account skew as well as dispersion

− QBEAST is faster than Mahalanobis as n samples approach d dimensions

− QQ plot parameterizes non-normal distributions

SAQQ Continued

− Reorganize distances into the original map configuration

− Create “stained” images based upon distance (highlight spectral deviations from the normal distribution)

Cluster Analysis Using SAQQ

6.25 x 6.25 μm pixel size (113 pixels x 102 pixels x 410 spectral data points)

Separation of two Identical Gaussian Clusters

− 3 SDs (cluster displacement)

− 3 SDs (size increase)

− 4 SDs (size decrease)

The Problem – FTIR Microspectroscopic Data Overload

− Approximately 1 GB of raw data per hour collected

− 100s of GB of data waiting to be analyzed

− Massive array size (250,000 x 1000 double-precision)

− Massive file sizes (~ 1 GB compressed binary)
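For a sense of scale, the in-memory size of one such array follows directly from its dimensions, assuming the usual 8 bytes per double-precision value.

```python
rows, cols = 250_000, 1_000    # array dimensions from the slide
bytes_per_value = 8            # IEEE 754 double precision

array_bytes = rows * cols * bytes_per_value
print(f"{array_bytes / 1e9:.1f} GB")   # prints "2.0 GB" (decimal gigabytes)
```

A single uncompressed array therefore needs about 2 GB of memory before any intermediate results, such as a covariance matrix or a set of distances, are allocated.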

Specific Aims

Identify precursors to AAA by using SAQQ to rapidly reduce data obtained from FTIR microspectrometry, producing digitally stained images corresponding to those clusters.

Identify overlapping clusters of collagen I, collagen III, elastin, macrophages, and necrotic debris

SAQQ analysis of PC1 of x-bk-1

SAQQ analysis of PC2 of x-bk-1

Proposed Research on Abdominal Aortic Aneurysm

− Process data with SAQQ

− Understand vessel wall thickening

− Identify biochemical pathways to aneurysm

− Develop iterative SAQQ

− Apply to reduce 60 “stained” images down to 1

− Develop better linear fitting algorithms

Conclusions

− SAQQ is a useful method as a digital staining technique

− SAQQ “stains” based upon chemical significance

− SAQQ allows progress in determining the chemical process behind AAA formation

References

BLAST - http://www.ncbi.nlm.nih.gov/BLAST/

CEPAR - http://www.sdsc.edu/pb/papers/cepar.pdf

Protein Data Bank - http://www.rcsb.org/pdb

Bioinformatics Fields - http://www.bioplanet.com/bioinformatics_faq.html

http://www.answers.com/proteome

From Genomics to Proteomics. M. Tyers, M. Mann. Nature 2003 Mar;422(6928);193-7.

http://www.chem.agilent.com/cag/feature/02-04/Feb04_Serum.htm

http://www.functionalgenomics.org.uk/sections/resources/protein_arrays.htm

http://doegenomestolife.org/research/facilities/fac3table1.shtml

Treedock: A Tool for Protein Docking Based on Minimizing van der Waals Energies. A. Fahmy, G. Wagner. JACS 2002; Vol 124, No. 7

Protein-Protein Docking with Simultaneous Optimization of Rigid-body Displacement and Side-chain Conformations. J. Gray, S. Moughon, C. Wang, O. Schueler-Furman, B. Kuhlman, C. Rohl, D. Baker. JMB 2003; Vol 331;281-299

Date post:	11-Jan-2016