
IBM T. J. Watson Research Center

IBM Computational Biology Center June 22, 2005 © 2005 IBM Corporation

Analysis of High Bandwidth Molecular Dynamics Results from the Blue Gene Project

Frank Suits

Overview

Current biomolecular simulations on Blue Gene
– Simulations range from small proteins in water to large proteins in lipid membranes

Types of output and storage needs
– Small numbers of large systems; large numbers of small systems

Ways to handle I/O
– Data reduction is key

Role of visualization in data reduction
– Several examples, including an experimental view of protein sequence motifs

December 1999:

IBM Announces $100 Million Research Initiative to build World's Fastest Supercomputer

"Blue Gene" to Tackle Protein Folding Grand Challenge

YORKTOWN HEIGHTS, NY, December 6, 1999 -- IBM today announced a new $100 million exploratory research initiative to build a supercomputer 500 times more powerful than the world’s fastest computers today. The new computer -- nicknamed "Blue Gene" by IBM researchers -- will be capable of more than one quadrillion operations per second (one petaflop). This level of performance will make Blue Gene 1,000 times more powerful than the Deep Blue machine that beat world chess champion Garry Kasparov in 1997, and about 2 million times more powerful than today's top desktop PCs.

Blue Gene's massive computing power will initially be used to model the folding of human proteins, making this fundamental study of biology the company's first computing "grand challenge" since the Deep Blue experiment. Learning more about how proteins fold is expected to give medical researchers better understanding of diseases, as well as potential cures.

Actual Blue Gene Timeline

December 1999: Blue Gene project announcement

October 2000: Blue Matter software development begins

June 2003: First chips completed

November 2003: BG/L Half rack prototype (512 nodes) ranked #73 (1.435 TFlop/s).

May 2004: First production Blue Matter runs on membrane systems

November 2004: 16-rack Livermore system #1 in Top500 at 70 TFlop/s (1/4 of completed system)

May 2005: 32-rack Livermore system achieves 135 TFlop/s

May 2005: Watson 20-rack system, BG/W, completed – 91 TFlop/s, unofficially #2 in world.

Later in 2005: Livermore completes 64-rack system

Molecular Dynamics Time Scales

[Figure: logarithmic time axis in seconds, with ticks at 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 1, 10^3, 10^6, 10^9. Processes placed along the axis: Bond Vibration, DNA Twisting, Helix-Coil Transition, Protein Folding, Electron Transfer, Hinge Motion, Ligand-Protein Binding, Lipid exchange via diffusion, Torsional correlation in lipid headgroups. Ranges marked: Simulation, Experiment.]

Adapted from “The Protein Folding Problem”, Chan and Dill, Physics Today, Feb. 1993

Hairpin – first Blue Matter system – 5000 atoms

From Packets to Publications

What data are we analyzing?
– Molecular dynamics data are output from each node as individual binary packets of information
– These packets must be framed for each timestep and checked for completeness
– Then they are aggregated and stored in more usable form as energy traces or atom coordinates over time (trajectory)
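
The framing and completeness check described above can be sketched as follows. This is a minimal illustration, not Blue Matter's actual packet format: the `Packet` fields, the node list, and the `frame_packets` helper are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical packet layout; the real format is not given in the talk."""
    timestep: int
    node_id: int
    payload: bytes

def frame_packets(packets, expected_nodes):
    """Group packets by timestep and separate complete from incomplete frames.

    A frame counts as complete when its set of node ids matches expected_nodes.
    """
    by_step = defaultdict(dict)
    for p in packets:
        by_step[p.timestep][p.node_id] = p.payload
    complete, incomplete = {}, {}
    for step in sorted(by_step):
        nodes = by_step[step]
        if set(nodes) == set(expected_nodes):
            # Aggregate payloads in node order into one record per timestep.
            complete[step] = b"".join(nodes[n] for n in sorted(nodes))
        else:
            # Record which expected nodes are missing for this timestep.
            incomplete[step] = sorted(set(expected_nodes) - set(nodes))
    return complete, incomplete

packets = [
    Packet(0, 0, b"a"), Packet(0, 1, b"b"),
    Packet(1, 1, b"d"),                       # node 0 missing at timestep 1
    Packet(2, 1, b"f"), Packet(2, 0, b"e"),   # packets arrive out of order
]
complete, incomplete = frame_packets(packets, expected_nodes=[0, 1])
```

Keying on timestep first makes the sketch tolerant of out-of-order arrival; only complete frames would be passed on to build energy traces or trajectories.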

Systems studied with Blue Matter
Increasing size and complexity with increasing compute power

Hairpin in water – 5000 atoms
– 237 serial runs on SP
– 2 publications

Lipid and Lipid/Cholesterol bilayers – 15000 atoms
– 32-way MPI runs on SP
• Some on BG/L
– Several papers and talks pending

Rhodopsin in Lipid/Cholesterol bilayer – 40000 atoms
– Milestone system, running microsecond scale simulation this year on BG/L
– Significant interest in scientific results expected

How much data?

50K atoms with pos, vel ⇒ 2.5MB “state”

1 rack yields 5 hours per nanosecond with 2 fs timestep ⇒ 100K steps/hour

1 rack produces 250 GB / hour

BGW (20 racks) produces 5 TB / hour

How to capture, analyze, and archive??
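
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes double-precision (8-byte) values for each position and velocity component, which the slide does not state; with that assumption the results round to the slide's 2.5 MB, 250 GB/hour, and 5 TB/hour.

```python
ATOMS = 50_000
VALUES_PER_ATOM = 6      # x, y, z position plus x, y, z velocity
BYTES_PER_VALUE = 8      # assumption: double precision

state_bytes = ATOMS * VALUES_PER_ATOM * BYTES_PER_VALUE   # one full "state"

TIMESTEP_FS = 2
HOURS_PER_NS = 5
steps_per_ns = 1_000_000 // TIMESTEP_FS        # 1 ns = 1,000,000 fs
steps_per_hour = steps_per_ns // HOURS_PER_NS

rack_bytes_per_hour = state_bytes * steps_per_hour
bgw_bytes_per_hour = 20 * rack_bytes_per_hour  # 20 racks, each running its own system

print(f"state: {state_bytes / 1e6:.1f} MB")
print(f"steps/hour: {steps_per_hour}")
print(f"1 rack: {rack_bytes_per_hour / 1e9:.0f} GB/hour")
print(f"20 racks: {bgw_bytes_per_hour / 1e12:.1f} TB/hour")
```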

Answer – don’t need all that data

Positions and velocities needed only rarely

Typically store positions every 500 timesteps at low resolution (16-bit) for analysis

Positions and velocities stored at full resolution every 5000 timesteps or so, for “restart”

Immediate reduction and selective archiving of data makes it much more manageable
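
A rough sketch of what this reduction buys, under stated assumptions: the 16-bit encoding below is a simple uniform quantization over a hypothetical 80 Å box (the talk does not describe the actual encoding), and full-resolution values are assumed to be 8 bytes each.

```python
CELLS = 1 << 16  # 65536 levels available in 16 bits

def quantize16(x, box_length):
    """Map a coordinate in [0, box_length) to an integer in 0..65535."""
    q = int(x / box_length * CELLS)
    return min(max(q, 0), CELLS - 1)

def dequantize16(q, box_length):
    """Return the center of the quantization cell for code q."""
    return (q + 0.5) * box_length / CELLS

BOX = 80.0   # hypothetical box edge in Angstrom
x = 12.3456
q = quantize16(x, BOX)
error = abs(dequantize16(q, BOX) - x)   # at most half a cell, BOX / CELLS / 2

ATOMS = 50_000
STEPS_PER_HOUR = 100_000
# Low-resolution positions: 3 values x 2 bytes, every 500 steps.
analysis_bytes = ATOMS * 3 * 2 * (STEPS_PER_HOUR // 500)
# Full-resolution restart state: 6 values x 8 bytes, every 5000 steps.
restart_bytes = ATOMS * 6 * 8 * (STEPS_PER_HOUR // 5000)
# Storing the full state at every step, for comparison.
full_bytes = ATOMS * 6 * 8 * STEPS_PER_HOUR
reduction_factor = full_bytes / (analysis_bytes + restart_bytes)
```

Under these assumptions one rack drops from hundreds of gigabytes per hour to roughly a hundred megabytes per hour, and the quantization error stays well below a thousandth of an Angstrom.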

Mystery Plot: Many stories in a simple line plot (Familiar data to all)

[Line plot: horizontal axis from 355 to 360, vertical axis from 10 to 80]

Some analysis modes

Validation
– Energy and momentum should be conserved
– No temperature drift

Visual inspection of configuration and behavior
– Sanity check – system is behaving as it should

Reduction to quantitative results that can match experiment
– Diffusion constants, NMR-related correlation times, lifetimes
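
One simple way to express the validation step is to fit a straight line to the total-energy trace and report the fitted change relative to the mean energy. The sketch below is an illustration with made-up numbers; the helper names and the choice of a least-squares fit are assumptions, not the procedure used in the talk.

```python
def linear_slope(t, y):
    """Least-squares slope of y against t."""
    n = len(t)
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, y))
    return sxy / sxx

def relative_drift(times, energies):
    """Fitted energy change over the whole trace, relative to the mean |E|."""
    slope = linear_slope(times, energies)
    mean_e = sum(energies) / len(energies)
    return slope * (times[-1] - times[0]) / abs(mean_e)

times = [float(i) for i in range(100)]
conserved = [-1000.0 + 0.01 * (-1) ** i for i in range(100)]  # fluctuates, no trend
drifting = [-1000.0 + 0.05 * i for i in range(100)]           # steady upward trend

print(relative_drift(times, conserved))
print(relative_drift(times, drifting))
```

The same fit applied to a temperature trace would flag temperature drift.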

The end result of validation: Excellent energy conservation

Examples of analysis for Blue Gene Mol. Dynamics

Hairpin free energy surface – Thermodynamic view (static)

Free energy surface with trajectories: Kinetics

Each color is a separate trajectory

Some overlap, others are distinct

Can they be chained together?

System moves among 30 bins. Stripchart view
Huge reduction: From complex arrangement of atoms to 5 bits

30 bins

237 trajectories in full set of runs. Build transition matrix
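
Building a transition matrix from binned trajectories can be sketched as below. The toy trajectories use 3 bins rather than 30, and the function names are illustrative; the counting-and-normalizing scheme is the standard one and is not claimed to match the authors' code.

```python
def count_transitions(trajectories, n_bins, lag=1):
    """Count observed bin-to-bin transitions at the given lag, over all trajectories."""
    counts = [[0] * n_bins for _ in range(n_bins)]
    for traj in trajectories:
        for a, b in zip(traj, traj[lag:]):
            counts[a][b] += 1
    return counts

def row_normalize(counts):
    """Turn transition counts into row-stochastic transition probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Each trajectory is a "stripchart": the bin index visited at each saved frame.
trajectories = [[0, 0, 1, 1, 2], [2, 1, 0, 0]]
counts = count_transitions(trajectories, n_bins=3)
T = row_normalize(counts)

# 30 bins fit in 5 bits because the largest index, 29, needs 5 binary digits.
bits_per_sample = (30 - 1).bit_length()
```

Transitions are counted within each trajectory only, so separate runs never contribute a spurious jump from the end of one run to the start of another.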

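Assign bins to free energy surface

[Figure: free energy surface with vertical axis Radius of gyration and horizontal axis Number of native hydrogen bonds; regions of the surface are numbered as bins 0, 1, 2, 3, 4, 5, ...]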
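
Assigning each frame to a bin from its two order parameters might look like the sketch below. The bin boundaries here are purely hypothetical (5 radius-of-gyration intervals times 6 hydrogen-bond counts, chosen only so that the total is 30); the actual bin definitions on the slide are drawn on the free energy surface and are not recoverable from the transcript.

```python
import bisect

RG_EDGES = [6.0, 7.0, 8.0, 9.0]   # hypothetical edges in Angstrom, giving 5 intervals
MAX_HBONDS = 5                    # hypothetical: 0..5 native hydrogen bonds

def assign_bin(rg, n_hbonds):
    """Map (radius of gyration, native H-bond count) to a bin index in 0..29."""
    rg_index = bisect.bisect_right(RG_EDGES, rg)    # 0..4
    hb_index = min(max(n_hbonds, 0), MAX_HBONDS)    # clamp to 0..5
    return rg_index * (MAX_HBONDS + 1) + hb_index

# One (rg, n_hbonds) pair per saved frame becomes one small integer per frame.
frames = [(5.5, 5), (6.5, 3), (9.7, 0)]
stripchart = [assign_bin(rg, hb) for rg, hb in frames]
```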

Markov view of protein folding kinetics
Captures timescale and pathways of proteins in landscape

Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory
Swope, Pitera, Suits
J. Phys. Chem. B; 2004; 108(21) 6571-81

Describing Protein Folding Kinetics by MD Sim. 2 Applications to Alanine Dipeptide and B Hairpin
Swope, Pitera, Suits, Pitman, Eleftheriou, Fitch, Germain, Rayshubskiy, Ward, Zhestkov, Zhou
J. Phys. Chem. B; 2004; 108(21) 6582-94
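
In a Markov picture, kinetics follow from repeatedly applying the transition matrix to a population vector. The two-state matrix below is invented for illustration and is not taken from the cited papers; it only shows how a population that starts entirely in one state relaxes toward its stationary distribution.

```python
def propagate(population, T, n_steps):
    """Apply p <- p T repeatedly; T[i][j] is the probability of moving from i to j."""
    p = list(population)
    n = len(p)
    for _ in range(n_steps):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p

# Hypothetical two-state example (state 0 and state 1).
T = [[0.9, 0.1],
     [0.3, 0.7]]
p_early = propagate([1.0, 0.0], T, 1)
p_late = propagate([1.0, 0.0], T, 200)
```

For this matrix the stationary populations satisfy 0.1·π₀ = 0.3·π₁, so the long-time result approaches 0.75 and 0.25.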

Lipid/Cholesterol/Water membrane simulation

Lipid Membrane – 13,000 atoms

How to quantify what’s going on?

Cholesterols - how are they interacting?

Cholesterol and Lipid Diffusion as r² vs. time lag

[Plot: r² (Å²), from 0 to 50, against Time (ns), from 0 to 14]
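
The r² versus time lag curve is a mean squared displacement. A minimal sketch for one molecule's lateral (in-plane) track follows; the toy coordinates are invented, and averaging over all time origins is a common choice that the slide does not confirm.

```python
def msd_by_lag(track, max_lag):
    """Mean squared lateral displacement for lags 1..max_lag.

    track is a list of (x, y) positions at equally spaced frames; every pair of
    frames separated by the lag contributes to the average.
    """
    result = []
    for lag in range(1, max_lag + 1):
        squared = [
            (x2 - x1) ** 2 + (y2 - y1) ** 2
            for (x1, y1), (x2, y2) in zip(track, track[lag:])
        ]
        result.append(sum(squared) / len(squared))
    return result

# Toy track in Angstrom: one unit step per frame, alternating x and y.
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0)]
msd = msd_by_lag(track, 2)
```

In practice the result would also be averaged over all lipids or all cholesterols, using coordinates that have not been wrapped back into the periodic box.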

Diffusion “constants” as function of time lag

[Plot: D (slope of r²/4) (1E-8 cm²/s), from 0 to 10, against Time (ns), from 0 to 12; series: Cholesterol, Lipid]

Diffusion Calculations in Simulated Lipid-Cholesterol Bilayers
Suits, Pitman, Feller
Gordon Conference on Computational Chemistry
July, 2004
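
The plotted quantity, D as the slope of r²/4, corresponds to two-dimensional diffusion, where r² grows as 4Dt. The sketch below estimates a lag-dependent D from local finite-difference slopes of an r² curve; the input numbers are invented, and the finite-difference scheme is an assumption about how the slope was taken.

```python
# Unit conversion: 1 A^2/ns = 1e-16 cm^2 / 1e-9 s = 1e-7 cm^2/s = 10 x (1e-8 cm^2/s).
A2_PER_NS_TO_1E8_CM2_PER_S = 10.0

def diffusion_vs_lag(lags_ns, msd_a2):
    """Return (lag, D) pairs with D in units of 1e-8 cm^2/s.

    D at each lag is the local slope of r^2 between neighboring lags, divided by 4.
    """
    out = []
    for i in range(1, len(lags_ns)):
        slope = (msd_a2[i] - msd_a2[i - 1]) / (lags_ns[i] - lags_ns[i - 1])
        out.append((lags_ns[i], slope / 4 * A2_PER_NS_TO_1E8_CM2_PER_S))
    return out

lags = [0.0, 1.0, 2.0, 4.0]      # ns
r2 = [0.0, 4.0, 6.0, 8.0]        # Angstrom^2, flattening with increasing lag
d_curve = diffusion_vs_lag(lags, r2)
```

A curve that keeps falling with lag, as in this toy input, is why the slide puts “constants” in quotation marks.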

Lipid neighborhood around a cholesterol

Each lipid has two different “chains,” shown red and blue

2D contours give some idea of neighborhood, but only in a slice. 3D possibilities?

3D isosurfaces of density show lipid distributed symmetrically, while cholesterols show strong orientation preference…

Red: Lipid
Blue: Other cholesterols
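
The density behind such isosurfaces can be accumulated as a 3D histogram of neighbor positions. The sketch below assumes the neighbor coordinates have already been transformed into a frame fixed on the central cholesterol; that transformation, the voxel size, and the threshold are all assumptions, and the sample points are invented.

```python
import math
from collections import Counter

def density_grid(relative_positions, voxel=1.0):
    """Count neighbor positions per cubic voxel of edge length voxel."""
    counts = Counter()
    for x, y, z in relative_positions:
        key = (math.floor(x / voxel), math.floor(y / voxel), math.floor(z / voxel))
        counts[key] += 1
    return counts

def voxels_above(counts, n_frames, voxel, threshold):
    """Voxels whose mean number density per frame exceeds threshold.

    An isosurface at that density level is the boundary of this set.
    """
    volume = voxel ** 3
    return {k for k, c in counts.items() if c / (n_frames * volume) > threshold}

# Invented positions (Angstrom) pooled from 2 frames.
positions = [(0.2, 0.1, 0.3), (0.4, 0.9, 0.5), (-0.5, 0.1, 0.1), (3.2, 0.0, 0.0)]
grid = density_grid(positions, voxel=1.0)
dense = voxels_above(grid, n_frames=2, voxel=1.0, threshold=0.75)
```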

Also see water pulled in from above and cholesterols preferentially oriented to each other . . .

Red: water layer

Blue: other cholesterols

… while the two individual lipid chains have preference

Molecular Dynamics Investigation of Structure and Dynamics of Cholesterol in a Polyunsaturated Lipid Bilayer
Pitman, Suits, MacKerell, Feller
Emerging Challenges in Membrane Biophysics
June, 2004, Sun Valley, Idaho

Molecular Dynamics Investigation of the Structural Properties of Phosphatidylethanolamine Lipid Bilayers
Pitman, Suits, Feller
J. Phys. Chem. B., 2005

Red and Blue: The two different chains on each lipid

Current and future work

Milestone system currently in production on BG/L
– Rhodopsin: GPCR protein in cholesterol/lipid bilayer
– Light receptor, and represents large class of drug targets

Combines all aspects of previous simulations:
– Protein behavior
– Lipid membrane environment
– Effect of cholesterol on membrane and protein

Rich with analysis opportunities

Ensemble now running on many racks

Rhodopsin and the Eye

http://www2.mrc-lmb.cam.ac.uk/groups/GS/eye.html

[Figure labels: Retina; Light sensitive protein; Outer segment of rod]

GPCR-based drugs among the 200 best-selling prescriptions

GPCR target                          Drug       Disease                   Company                2000 sales (US $m)
Histamine receptors                  Zantac     Ulcers                    AstraZeneca            870
                                     Pepcid     Ulcers                    Merck                  850
                                     Claritin   Allergies                 Schering-Plough        2,200
                                     Allegra    Allergies                 Aventis                1,100
5-HT receptors                       Risperdal  Psychosis                 Johnson & Johnson      1,600
                                     Imitrex    Migraine                  GlaxoSmithKline        1,100
                                     BuSpar     Anxiety                   Bristol-Myers Squibb   714
                                     Zyprexa    Schizophrenia             Eli Lilly              2,400
Angiotensin receptors                Cozaar     Hypertension              Merck                  1,700
Adrenoceptors                        Toprol-XL  Hypertension              AstraZeneca            580
                                     Coreg      Congestive heart failure  GlaxoSmithKline        250
                                     Serevent   Asthma                    GlaxoSmithKline        940
Muscarinic acetylcholine receptors   Atrovent   COPD                      Boehringer Ingelheim   600
GnRH receptors                       Zoladex    Cancer                    AstraZeneca            740
Dopamine receptors                   Requip     Parkinson’s disease       AstraZeneca            90
Prostaglandin (PGE1) receptors       Cytotec    Ulcers                    Pharmacia              100
ADP receptors                        Plavix     Stroke                    Bristol-Myers Squibb   900

http://www.predixpharm.com/market_table.htm

Rhodopsin in lipid/cholesterol bilayer – 43,000 atoms

First scientific publication with BG/L hardware (JACS 2005)

Bioinformatics visualization experiment

Find novel 2D visualization that captures protein motifs from simple patterns

Simple reduction of data with minimal transformation/heuristics

Let the eye find the patterns, if any

Motifs from E-Coli Protein Sequences

Count  Pattern
:      :
8      DEADR
4      DEAEA
6      DEAEL
4      DEAER
6      DEAIA
5      DEAKA
6      DEAKR
:      :
(50,000 lines)

With large number of related patterns, how to see relative pattern frequency?

Motif visualization

Goals:
– Provide an understandable reduction of the full data

– Show relative population of 3-character patterns

– Allow “drill-down” on selected patterns of interest
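
One plausible way to obtain the relative population of 3-character patterns from the motif list is to slide a 3-letter window over each motif and weight it by the motif's count. Whether the talk's populations were computed this way is not stated, so the weighting scheme is an assumption; the input rows are the seven shown on the earlier motif slide.

```python
from collections import Counter

def three_char_population(motifs):
    """Sum motif counts over every 3-letter window of each pattern."""
    population = Counter()
    for count, pattern in motifs:
        for i in range(len(pattern) - 2):
            population[pattern[i:i + 3]] += count
    return population

motifs = [
    (8, "DEADR"), (4, "DEAEA"), (6, "DEAEL"), (4, "DEAER"),
    (6, "DEAIA"), (5, "DEAKA"), (6, "DEAKR"),
]
population = three_char_population(motifs)
top = population.most_common(3)
```

Drill-down on a selected 3-character pattern then amounts to filtering the original rows for patterns that contain it.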

3D Scatter plot view of three-letter distribution

Origin is AAA, axes end at YAA, AYA, AAY.
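
The mapping from a three-letter pattern to a scatter-plot point can be sketched as below. The 25-letter alphabet A through Y is inferred from the axis end points YAA, AYA, and AAY; the function name is illustrative.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"   # 25 letters, inferred from the axis ends

def scatter_point(pattern):
    """Map a 3-letter pattern to integer (x, y, z) coordinates, one axis per letter."""
    if len(pattern) != 3:
        raise ValueError("expected a 3-letter pattern")
    return tuple(ALPHABET.index(ch) for ch in pattern)

origin = scatter_point("AAA")
x_end = scatter_point("YAA")
example = scatter_point("DEA")
```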

2D representation of 4-char alphabet

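[Figure: a 2×2 grid holding the letters A B (top row) and C D (bottom row). Each cell is subdivided into its own 2×2 grid for the second letter, giving the two-letter cells AA AB / AC AD, BA BB / BC BD, CA CB / CC CD, DA DB / DC DD.]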

Extend to 25-char alphabet:

2D representation with 25 character alphabet

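[Figure: the same nested layout with a 5×5 grid per letter. Labels shown: AA, AB, AF, AT, AX, AY within the A cell, and AA, AY, BA, BY, YA, YY at the corners of the larger cells.]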
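
The nested layout can be expressed as a small function: each letter selects a cell in a square grid, and each following letter subdivides that cell again. The row-major letter order (A B on the top row, then C D) is read off the 4-character figure; the function name and the 25-letter alphabet A through Y are assumptions.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"   # 25 letters laid out as a 5 x 5 grid
SIDE = 5

def grid_position(pattern, alphabet=ALPHABET, side=SIDE):
    """Return (row, col) of a pattern in the nested grid.

    The first letter picks the coarsest cell; each later letter refines it.
    """
    row = col = 0
    for ch in pattern:
        i = alphabet.index(ch)
        row = row * side + i // side
        col = col * side + i % side
    return row, col

# 4-character alphabet from the earlier slide: CB sits in the lower-left block.
small = grid_position("CB", alphabet="ABCD", side=2)
# 25-character alphabet: a 3-letter pattern lands in a 125 x 125 image.
pixel = grid_position("DEA")
```

Filling a 125 × 125 array at these positions with pattern counts would give a single image covering all 3-character patterns.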

RESULT ⇒

Result for E-coli

Single view shows distribution for all 3-char patterns

Diff views

Results of experiment

Novel view

Possibly interesting

Need to try with other data sets

Combine with hierarchical reordering of axes

Easy to try, and captures original data in form the eye can process without imposing bias

Simple form of data reduction

Conclusions and future directions in Blue Gene Analysis

Analysis involves a staged reduction of info

Use visualization when appropriate

Useful for results and validation – and insight

BG/L machine continues to grow

Many molecular dynamics studies are ongoing and will accelerate as machine grows

Small number of large molecular systems, and ensembles of smaller systems

Stay tuned for more results and publications

Many Acknowledgements

Alex Balaeff

Bruce Berne

Maria Eleftheriou

Scott Feller

Blake Fitch

Robert Germain

Alan Grossfield

Laxmi Parida

Jed Pitera

Mike Pitman

Alex Rayshubskiy

Bill Swope

Chris Ward

Yuri Zhestkov

Ruhong Zhou

And… Blue Gene Hardware & System Software teams

Backup

Scaling Directions

[Figure: three scaling axes labeled timescale, complexity, and statistical certainty]

BG/L communication network

Ocean view with Torus

The science plan – a spectrum of projects

systematically cover a range of system sizes, topological complexity
– discovering the "rules" of folding
– applying those rules to have impact on disease

address a broad range of scientific questions and impact areas:
– thermodynamics
– folding kinetics
– folding-related disease (CF, Alzheimer's, GPCR's)

improve our understanding not just of protein folding but protein function

[Figure labels: 1LE1, 1L2Y, 1EOM, 1ENH, 1BBL, 1LMB, 1FME, GPCR in membrane]