Virtual Organizations: Building Interdisciplinary Collaborations
Chancellor’s Eminent Professor
Vice Chancellor for IT
University of North Carolina at Chapel Hill
Director, Renaissance Computing Institute
Acknowledgments
• Funding agencies
  – NIH
    • Carolina Center for Exploratory Genetic Analysis (CCEGA)
  – NSF
    • TeraGrid Science Gateways
  – State of North Carolina
    • RENCI and ancillary Bioportal support
• RENCI staff
  – Alan Blatecky, Kevin Gamiel, Xiaojun Guan
  – Clark Jefferies, Howard Lander
  – John Magee, Ruth Marinshaw, Jeff Tilson
  – Lavanya Ramakrishnan
• And a host of others …
21st Century Challenges
• The threefold way
  – theory and scholarship
  – experiment and measurement
  – computation and analysis
• Supported by
  – distributed, multidisciplinary teams
  – multimodal collaboration systems
  – distributed, large scale data sources
  – leading edge computing systems
  – distributed experimental facilities
• Socialization and community
  – multidisciplinary groups
  – geographic distribution
  – new enabling technologies
  – creation of 21st century IT infrastructure
    • sustainable, multidisciplinary communities
• “Come as you are” response
[Figure: Theory, Experiment, Computation]
Exemplar 21st Century Challenges
• Population growth in sensitive areas
  – severe weather sensitivity
    • national impact
  – geobiology and environment
  – economics and finance
  – sociology and policy
• Economics and health care
  – longitudinal public health data
    • environmental interactions
  – genetic susceptibility
    • heart disease, cancer, Alzheimer's
  – privacy and insurance
  – public policy and coordination
Mean Onset of Alzheimer’s Disease
• apolipoprotein (apo)
  – apoE2, apoE3 and apoE4 alleles
    • on chromosome 19
  – apoE4 allele
    • 40% to 60% of Alzheimer's patients
    • not the only cause for Alzheimer’s
• apo gene inheritance
  – ~25% inherit 1 copy of apoE4 allele
    • Alzheimer's risk increases 4X
  – 2% inherit 2 copies of apoE4 allele
    • Alzheimer's risk increases 10X
[Figure: proportion of each genotype unaffected (0 to 1.0) versus age at onset (60 to 85), with curves for apoE genotypes 2/3, 2/4, 3/3, 3/4 and 4/4]
Source: Alan Roses, GSK
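The inheritance figures above (~25% with one copy, 2% with two copies) are roughly what Hardy-Weinberg proportions give for an apoE4 allele frequency near 0.14. A minimal sketch of that arithmetic follows; the allele frequency is an assumption chosen to match the slide, not a number from the source, and the function name is illustrative.

```python
def hardy_weinberg(p):
    """Return (no copies, one copy, two copies) genotype frequencies
    for an allele of frequency p, assuming random mating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return q * q, 2.0 * p * q, p * p


# Assumed apoE4 allele frequency (illustrative, not from the slides).
APOE4_FREQ = 0.14

none, one_copy, two_copies = hardy_weinberg(APOE4_FREQ)
print(f"no apoE4 copies:  {none:.1%}")
print(f"one apoE4 copy:   {one_copy:.1%}")
print(f"two apoE4 copies: {two_copies:.1%}")
```

With p = 0.14 the one-copy and two-copy classes come out near 24% and 2%, close to the slide's figures.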
Big Questions
[Figure: from DNA sequence to organisms. Labels: DNA sequence; protein sequence and regulation; sequence annotation (codons CGT, TAC, CAG; residues R, Y, Q; TATA, promoter, message); protein structure; homology based protein structure prediction; protein/enzyme function; molecular simulations; multi-protein machines; metabolic pathways and regulatory networks; pathway simulations; network analysis; data integration; bacteria and cells; organs, organisms and ecologies]
Genetics and Disease Susceptibility
[Figure: Identify Genes; Phenotypes 1 to 4; Predictive Disease Susceptibility. Labels: physiology, metabolism, endocrine, proteome, immune, transcriptome, biomarker signatures, morphometrics, pharmacokinetics, ethnicity, environment, age, gender]
Source: Terry Magnuson, UNC
PITAC Report Contents
• Computational Science: Ensuring America’s Competitiveness
  1. A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
  2. Medieval or Modern? Research and Education Structures for the 21st Century
  3. Multi-decade Roadmap for Computational Science
  4. Sustained Infrastructure for Discovery and Competitiveness
  5. Research and Development Challenges
• Two key appendices
  – Examples of Computational Science at Work
  – Computational Science Warnings – A Message Rarely Heeded
• Available at www.nitrd.gov
Life Science Lessons from Astronomy
• Historically, discoveries accrued to those
  – with access to unique data
  – who built next generation telescopes
• Two things changed
  – growing costs and complexity of telescopes
  – emergence of whole sky surveys
• The result – virtual astronomy
  – discovering significant patterns
    • analysis of rich image/catalog databases
  – understanding complex astrophysical systems
    • integrated data/large numerical simulations
{Inter}national Virtual Observatory
Cluster Galaxy Morphology Analysis Portal
[Figure: portal workflow. Services shown: internal cluster catalog; Chandra SIA, Skyview SIA, DSS SIA, CNOC SIA; NED Cone Search, CADC CNOC Cone Search; Morphology Calculation Service]
1. User selects a cluster (web browser on the user’s machine)
2. Look up cluster in internally stored catalog
3. X-ray and optical images retrieved via SIA interface
4. User launches distributed analysis
5. Initial galaxy catalog generated via Cone Search
6. Image cutout pointers merged into catalog
7. Morphological parameters calculated on grid for each galaxy
8. User downloads final table and images for analysis & visualization
Source: Ray Plante, NCSA
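The Cone Search step in the workflow above returns catalog entries within a given radius of a sky position. The sketch below shows only that geometric filter over an in-memory list; it does not call the NED or CADC services, and the catalog rows and function names are made up for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Made-up catalog rows, for illustration only.
catalog = [
    {"name": "galaxy-a", "ra": 150.00, "dec": 2.20},
    {"name": "galaxy-b", "ra": 150.05, "dec": 2.25},
    {"name": "galaxy-c", "ra": 151.50, "dec": 3.90},
]

nearby = cone_search(catalog, ra=150.0, dec=2.2, radius_deg=0.1)
print([row["name"] for row in nearby])
```

A production service would answer the same question against a remote catalog and return a table; the filter logic is the same.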
The Bioinformatics Challenge
• Challenge
  – the rise of quantitative biology
    • burgeoning bioinformatics data
  – complex analysis and modeling problems
  – education and training in new technologies
• Reality
  – diverse tools with idiosyncratic interfaces
    • steep learning curves
  – software development by diverse groups
  – distributed databases with diverse metadata
• Need
  – integrated, easy-to-use toolset with standard interfaces
  – extensible mechanisms that hide idiosyncrasies
  – tool and bioinformatics training
• The solution
  – bioinformatics infrastructure and coupled training
Need: Simple, Easy-To-Use Tools
“Genome. Bought the book. Hard to read.”
Eric Lander
Web and Social Processes
• Google
  – it’s a search engine, it’s a verb, …
• Blogs
  – published self-expression
• Instant Messenger
  – social networks
• Wireless messaging
  – semi-synchronous
• Internet commerce
  – the dot.com boom/bust
  – eBay, Amazon
• Spam, phishing, …
  – anti-social behavior
Benefits of Standards
• Interoperability
• Separation of concerns
• Reuse
• Independence
• Dependability
• Sharing
• Commonality
• Shared knowledge base
  – knowledge reuse
  – simplification (one hopes)
What’s A Grid/Web Service?
[Figure: the Web versus Grid/Web Services]
Web: Uniform access to documents
Grid/Web Services: Flexible, high-performance access to resources and services for distributed communities
Resources shown: sensors and instruments, data archives, computers, software catalogs, colleagues
It’s been 12 years!
Grid History: I-Way at SC’95
• A prototype national infrastructure
  – 17 sites, connected by
    • vBNS and six other ATM networks
  – 60 applications
• Features
  – I-POPs for site access
  – Kerberos authentication
  – manual scheduling
  – distributed communication libraries
• Experiences
  – led to Globus Grid toolkit
• Concurrent industry needs
  – led to web services for B2B interoperation
Web Services: “Commercial Grids”
• From browser-centric to service-centric
  – from human-computer to computer-computer
  – structured negotiation and response
• Workflow creation and management
  – end-to-end service negotiation
  – inter-organizational interaction
• Prerequisites
  – metadata standard for service descriptions
  – standard communication mechanisms
  – resource discovery and registration
eBay Web Services Architecture
• Over 40% of eBay's listings are now via API calls
Source: IBM
Web Services: A Definition
A web service is … designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact … [using] its description using SOAP-messages, … using HTTP with an XML serialization ....
W3C Working Draft, August 2003
• SOAP (Simple Object Access Protocol)
• WSDL (Web Services Description Language)
• UDDI (Universal Description, Discovery and Integration)
[Figure: Service Provider, Service Consumer and Service Broker linked by Publish, Locate and Invoke operations carried over SOAP; additional labels WSDL and UDDI]
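To make the SOAP part of the definition concrete, here is a minimal sketch that builds and reads back a SOAP 1.1 envelope with Python's standard library. The service namespace, the `GetSequence` operation and the `accession` parameter are hypothetical and not taken from any service in this talk; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/bioservice"            # hypothetical service namespace


def build_soap_request(operation, params):
    """Build a SOAP 1.1 request envelope and return it as an XML string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}", 1)[1] if "}" in tag else tag


def parse_soap_body(xml_text):
    """Return (operation name, {parameter: text}) from a request envelope."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]
    return _local_name(call.tag), {_local_name(c.tag): c.text for c in call}


# Hypothetical operation and parameter, for illustration only.
request = build_soap_request("GetSequence", {"accession": "EXAMPLE0001"})
print(request)
print(parse_soap_body(request))
```

In a real deployment the operation names and message types would come from the service's WSDL rather than being hard-coded, and the envelope would be sent as the body of an HTTP POST.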
Technology Push
Source: Gartner Group
European myGrid Architecture
Source: www.mygrid.org
The Bioinformatics Challenges
• Complex, multilevel models
  – integration and in silico designs
• Information visualization
  – complexity and scale
• Data models and ontologies
  – community definition
• Data federation, storage and management
  – shared access and support
• User access portals
  – web-based tool and service interfaces
• Packaging, distribution and deployment
  – community building
Multilevel Cellular Models
• Signaling networks
  – environmental triggers and behavior
    • e.g., cell lifecycle
  – different pathways in each tissue type
• Metabolic networks
  – measurable products in pathway
  – many systems are steady state
  – negative feedback leads to stabilization
• Protein interaction networks
  – localization of proteins that interact for function
  – protein-protein interactions for specific actions
• Gene regulatory networks
  – many things affect gene product concentration
  – nucleic-nucleic, protein-nucleic interactions
• Computing, physics, engineering and biology
  – control theory, mathematical models, phase spaces
  – from biological cartoons to predictive models
    • e.g., microRNAs and gene expression controls
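The point that negative feedback leads to stabilization can be illustrated with a toy one-variable model in which a product represses its own synthesis. This is a sketch under assumed parameters (production rate, repression threshold, Hill exponent, decay rate are all illustrative), not a model of any pathway from the talk.

```python
def simulate_feedback(x0, production=10.0, threshold=1.0, hill=2,
                      decay=1.0, dt=0.01, steps=2000):
    """Euler-integrate dx/dt = production / (1 + (x/threshold)**hill) - decay * x."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        # Synthesis falls as the product x accumulates (negative feedback).
        synthesis = production / (1.0 + (x / threshold) ** hill)
        x += dt * (synthesis - decay * x)
        trajectory.append(x)
    return trajectory


# Two very different starting concentrations (arbitrary units).
low = simulate_feedback(x0=0.0)
high = simulate_feedback(x0=8.0)
print(f"start 0.0 -> {low[-1]:.3f}")
print(f"start 8.0 -> {high[-1]:.3f}")
```

With these assumed parameters the steady state satisfies 10 / (1 + x^2) = x, i.e. x = 2, and both trajectories settle there regardless of where they start.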
Biological Models
• Simulation and prediction
  – structures and dynamics
• Reasoning and discovery
  – reverse engineering
[Figure: temporal and spatial scales. Temporal axis (seconds), 10^-12 to 10^6: bond motion, catalysis, diffusion, transcription, translation, growth & division. Spatial axis (nm^3), 10^0 to 10^12: metabolites, proteins, ribosomes, prokaryotes, eukaryotes]
Biophysical and Environmental Modeling
[Figure: modeling levels: genomics, proteomics, cell biochemistry and structure, cilia, mucus, airway/flow]
Source: Ric Boucher, UNC
Data Heterogeneity and Complexity
[Figure: linked data types: disease, drug, clinical trial, phenotype, protein structure, protein sequence, P-P interactions, proteome, gene sequence, genome sequence, gene expression, homology]
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bio-networks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …
Source: Carole Goble (Manchester)
Source: Robert Morris, IBM
Sensor Data Overload
• High resolution brain imaging
  – 4.5 petabytes (PB) per brain
Source: Chris Johnson, Utah; Art Toga, UCLA
RENCI: What Is It?
• Statewide objectives
  – create broad benefit in a competitive world
  – engage industry, academia, government and citizens
• Four target areas
  – public benefit
    • supporting urban planning, disaster response, …
  – economic development
    • helping companies and people with innovative ideas
  – research engagement across disciplines
    • catalyzing new projects and increasing success
    • building multidisciplinary partnerships
  – education and outreach
    • providing hands on experiences and broadening participation
• Mechanisms and approaches
  – partnerships and collaborations
  – infrastructure as needed to accomplish goals
Carolina Center for Exploratory Genetic Analysis (CCEGA)
[Figure: virtuous cycle of interdisciplinary research & education. Labels: Faculty, Staff & Students; Extant Data Models; Statistical & Computational Techniques; Experimental Genetics Portal; Driving Problems; Analysis Techniques; Promoting Mutual Awareness; Interoperable Data Management]
CCEGA Participants
• Coordination team
  – Dan Reed, RENCI
  – Terry Magnuson, CCGS
  – Alan Blatecky, RENCI
  – Kirk Wilhelmsen, CCGS
• Eleven departments/institutes
  – Biostatistics
  – Cancer Center
  – Genetics
  – Computer Science
  – Epidemiology
  – Genetics
  – Health Science Library
  – Information and Library Science
  – Pharmacy
  – RENCI
  – Statistics
• Campus wide support
  – from many sources
• Project participants
  – Brad Hemminger, Information & Library Science
  – James Evans, Genetics
  – Kevin Gamiel, RENCI
  – Xiaojun Guan, RENCI
  – Barrie Hays, Health Science Library
  – Clark Jefferies, RENCI
  – Ethan Lange, Genetics
  – Andrew Nobel, Statistics
  – Karen Mohlke, Genetics
  – Kari North, Epidemiology
  – Susan Paulsen, Computer Science
  – Fernando Manuel Pardo, Genetics
  – Charles Perou, Cancer Center
  – Lavanya Ramakrishnan, RENCI
  – Jan Prins, Computer Science
  – Patrick Sullivan, Genetics
  – Lisa Susswein, Cancer Center
  – David Threadgill, Genetics
  – Alexander Tropsha, Pharmacy
  – K.T.L. Vaughan, Health Science Library
  – Fred Wright, Biostatistics
  – Wei Wang, Computer Science
  – Fei Zou, Biostatistics
Data: From Lab and Clinic to Analysis
• Independent data management
  – data security
  – version control
  – redundancy
  – controlled access
• NIH CCEGA
  – Carolina Center for Exploratory Genetic Analysis
[Figure: Clinic (clinical), Lab (laboratory), ELSI and Analysis linked through Integration & Informatics]
Source: Brad Hemminger, UNC
Data Management and Information Viz
[Figure: Information Mining Module and Information Visualization Module over GenBank, Taxonomy Annotation, Ontology Annotation, DB Schema Ontology Annotation, Annotated Domain Literature and Published Domain Literature]
From SNPs to HapMap
• Single Nucleotide Polymorphisms (SNPs)
  – one in ~1200 bases differs across individuals
  – SNPs act as markers to locate genes
• Common groups of SNPs are shared
  – i.e., form a haplotype
• HapMap data sources
  – 90 Yoruba individuals (30 trios) from Nigeria (YRI)
  – 90 individuals (30 trios) of European descent from Utah (CEU)
  – 45 Han Chinese individuals from Beijing (CHB)
  – 45 Japanese individuals from Tokyo (JPT)
• ~3,500,000 SNPs typed
  – basis for association studies for disease identification
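As a small illustration of SNPs grouping into shared haplotypes, the sketch below counts the allele combinations observed at chosen SNP positions across a set of phased sequences. The sequences and positions are made up; this is not HapMap data or the HapMap project's method.

```python
from collections import Counter


def haplotype_frequencies(chromosomes, snp_positions):
    """Return {haplotype: frequency} for the alleles seen at snp_positions,
    ordered from most to least common."""
    counts = Counter(
        "".join(chrom[pos] for pos in snp_positions) for chrom in chromosomes
    )
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.most_common()}


# Made-up phased sequences (one string per chromosome), for illustration only.
chromosomes = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ATGTCAGC",
    "ATGTCAGC",
    "ATGTCAGC",
    "ACGTCAGC",
]
snp_positions = [1, 4]  # 0-based positions where these sequences differ

freqs = haplotype_frequencies(chromosomes, snp_positions)
for hap, freq in freqs.items():
    print(hap, f"{freq:.2f}")
```

Of the four possible two-SNP combinations only three occur here, and two of them account for most chromosomes, which is the sense in which a few common haplotypes can stand in for many individual SNPs.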
CCEGA HapMap Simulator
• Synthetic data
  – disease models
  – model testing
    • mining bakeoffs
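One way to picture synthetic data with a planted disease model is sketched below: genotypes are drawn at random and disease status is assigned with a genotype-dependent risk, so an analysis method can be checked against a known answer. This is not the CCEGA simulator; the function names, allele frequency, baseline risk and relative risks are all illustrative assumptions.

```python
import random


def simulate_cohort(n, allele_freq, baseline_risk, relative_risk, seed=0):
    """Draw n individuals as (risk-allele copies, affected?) pairs.

    Disease risk is baseline_risk * relative_risk[copies], capped at 1.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        # Two independent draws, one per chromosome copy.
        copies = sum(rng.random() < allele_freq for _ in range(2))
        risk = min(1.0, baseline_risk * relative_risk[copies])
        cohort.append((copies, rng.random() < risk))
    return cohort


def affected_fraction(cohort):
    """Fraction affected within each genotype class present in the cohort."""
    result = {}
    for copies in (0, 1, 2):
        group = [affected for c, affected in cohort if c == copies]
        if group:
            result[copies] = sum(group) / len(group)
    return result


# Illustrative parameters: a planted 2x / 4x risk model.
cohort = simulate_cohort(n=20000, allele_freq=0.3, baseline_risk=0.05,
                         relative_risk={0: 1.0, 1: 2.0, 2: 4.0})
for copies, fraction in affected_fraction(cohort).items():
    print(f"{copies} copies: {fraction:.3f} affected")
```

Because the true model is known, a mining method run on such a cohort can be scored on whether it recovers the planted effect.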
Carolina Bioportal
• Three overlapping target groups
  – undergraduate education
  – graduate education and research
  – academic/industrial research
• Features
  – access to common bioinformatics tools
  – extensible toolkit and infrastructure
    • OGCE and NSF Middleware Initiative (NMI)
    • leverages emerging international standards
  – remotely accessible or locally deployable
  – packaged and distributed with documentation
• National reach and community
  – TeraGrid deployment
    • science gateway
• Education and training
  – hands-on workshops
    • clusters, Grids, portals and bioinformatics
Distributed Grid and Web Services
[Figure: layered Grid architecture. Top: Grid Portals; Application Interface (launch, configure and control); Workflow service; App Instances. Open Grid Service Architecture Layer: Security, Data Management Service, Accounting Service, Logging, Event/Message Service, Policy, Administration & Monitoring, Grid Orchestration, Registries and Name binding, Reservations and Scheduling. Below: Open Grid Service Infrastructure (web service component model); Resource Layer (from PCs to supercomputers); online instruments]
Source: Dennis Gannon, Indiana
Bioportal Architecture
www.ncbioportal.org
[Figure: components: PISE application XML descriptions, Interface Generator, HTML files, Velocity files, command files, Application Processing, Bioportal, OGCE user databases, job history database, application databases, Gatekeeper, GridFTP, MyProxy, local cluster. Functions: authentication and Grid credential, user profile, job submission, job records, remote file access]
• OGCE toolkit
  – used by cyberinfrastructure projects
    • LEAD, NEES, PACI, DOE, TeraGrid …
Putting the Technologies Together
[Figure: software stack: NC Bioportal; OGCE Toolkit (Grid middleware); Chef (collaboration/standard portlets); Velocity (template engine); Jakarta Jetspeed (enterprise portal); Turbine (web app framework); Tomcat (Apache servlet container); GridPortlets, CoG; databases; bio applications; PISE (XML wrapper); VMC]
Community Software Toolkit: Lessons
• NSF PACI Alliance “In a Box” toolkits
  – cluster software (aka OSCAR)
  – Grid infrastructure (aka NMI)
  – Access Grid for distributed collaboration
  – tiled display walls for visualization
• Distribution materials
  – software and training materials
    • CDs and web
• Community workshops and training
  – Linux Clusters Institute
  – MSI HPC workshops
  – hands on training
• Lowering the entry barrier
  – usage and deployment
• Bioportal distribution
  – workshops, tutorials
  – training materials
  – road shows
NC Bioportal: What’s Next
• Engagement
  – workshops, experiences and deployments
• Infrastructure
  – dynamic job scheduling across multiple sites
  – migration to OGCE 2.0
  – fully automated database updates
  – workflow construction and processing
• Portal tool suite
  – expanded applications and databases
    • phylogeny, morphology, microarray analysis, …
• Training materials
  – additional modules based on user feedback
  – workshop materials packaged for self-study
• Leverage national presence
  – TeraGrid/NCSA bioinformatics portal
The Vision of Grid/Web Services
“… Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do.”
– Book of Genesis
Pieter Bruegel, The Tower of Babel (1563)
Interdisciplinary Collaborations
• Appropriate reward structures
  – well-matched time constants
• Intellectual equality
  – balanced recognition of contributions
• Research/infrastructure distinctions
  – timelines and people needs differ
• Confidentiality and openness
  – academic/industry collaboration perspectives
• Intellectual property
  – background IP and differential disciplinary models
Some Thoughts on the Future
• Grids/web services are not a panacea
  – we have seen this movie before
    • standards debates can be endless
    • make new mistakes, not the same old ones
  – code is shifted from modules to interfaces
• Danger of “Death by CS Abstraction”
  – “all problems can be solved by another level of indirection”
• Appropriate decomposition is a challenge
  – performance, usability, flexibility
• Generality and extensibility really matter
  – incremental aggregation and interoperability
  – data management and federation
• Better questions, not just private capabilities
  – limited by creativity not resources
The Cambrian Explosion
• Most phyla appear
  – sponges, archaeocyathids, brachiopods
  – trilobites, primitive mollusks, echinoderms
• Indeed, most appeared quickly!
  – Tommotian and Atdabanian
  – as little as five million years
• Lessons for computing
  – it doesn’t take long when conditions are right
    • raw materials and environment
  – leave fossil records if you want to be remembered!
Thanks for the Invitation!