© 2012 IBM Corporation
Big Data Analytics for Life & Agricultural Sciences
IBM Big Data Session

IBM Big Data Platform Overview
Bill Zanine, GM, IBM Advanced Analytics for Big Data

IBM Big Data Life Sciences: Healthcare & Research Use Cases
Bill Zanine
The IBM Big Data Platform
InfoSphere BigInsights (Hadoop)
– Hadoop-based, low-latency analytics for variety and volume

InfoSphere Streams (Stream Computing)
– Low-latency analytics for streaming data

IBM Puredata System for Operational Analytics (MPP Data Warehouse)
– BI and ad hoc analytics on structured data

IBM Smart Analytics System (MPP Data Warehouse)
– Operational analytics on structured data

High-Performance Computing
– IBM Blue Gene/Q: high performance for computationally intensive applications
– Platform Computing: high-performance framework for distributed computing
IBM Big Data Strategy: Optimized Analytic & Compute Platforms
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional Apps, Industry Apps, Predictive Analytics, Content Analytics

IBM Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Analytic Accelerators, Information Integration & Governance

Engines: Hadoop System, Stream Computing, Data Warehouse
Big data, business strategy, and new analytics require optimized analytic platforms:
• Integrate and manage the full variety, velocity, and volume of data
• Variety, velocity, and volume further drive the need for optimized compute platforms
• Operate on data in its native form and in its current location, minimizing movement
• Simplify data management, embed governance, and seamlessly leverage tools across platforms
Big Data Computing
Peta-scale Analytics Real-time Analytics
IBM Big Data Platforms Span the Breadth of Data Intensive Analytic Computing Needs
No single architecture can fulfill all the compute needs for big data analytics, exploration and reporting
Specialized architectures and technologies are optimized to solve specific compute profiles, data volumes and data types
Architectures must be leveraged appropriately for scalable, cost-effective computing
Highly planned, batch
Scalable batch, unstructured
Batch & interactive, structured
Highly planned, autonomous
Petascale Computing
Petascale Data Processing
Petascale Interactive Data Processing
In-Line Analytics
Data Intensive Computing for Life Sciences
Diagram: life-sciences data mapped to compute profiles:
Streaming Data: sensor data (manufacturing, medical, environmental & lab); in-line analytics for patient monitoring and quality control
Peta-Scale Data: sequencer data, assays, medical records, tissue & cell imaging; data-intensive compute for SNP alignment, image classification, attribute extraction
Peta-Scale Computing: simulations & models, publication graphs; compute acceleration for protein science, molecular dynamics, complex simulations, matrix acceleration
Data-Intensive, Interactive Computing: image metadata, genomic profiles, environmental & patient attribution; acceleration for complex graph analysis, translational medicine, genotype/phenotype, predictive healthcare
Tackling Computational Challenges in Life Sciences
The Vast Datascape of Life Sciences
Scalable, Cost Efficient Genomic Processing and Analysis
Scalable Sensor Processing and Geospatial Analytics
Health Outcomes and Epidemiology
Combining Big Data and Smarter Analytics to Improve Performance in the Life Sciences
Diagram: data sources and analyses combined in the life sciences:
Environmental records: weather, topography, soil, vegetation, demographics, utilities
Health records: claims, tissue, lab results, longitudinal data, diagnoses, prescriptions
Biologic records: SNPs, proteins, human tissue, RNA, plant
Plant records: yield, disease
Analyses: gene assembly, proteomics, simulations, geospatial analysis, clustering, translational analysis, gene interaction
Scalable Genomic Data Processing
SUNY Buffalo – Center for Computational Research: Data Intensive Discovery Initiative
SUNY Buffalo – Large Gene Interaction Analytics: UB Center for Protein Therapeutics
– Use new algorithms and add multiple variables that were previously nearly impossible to analyze
– Reduce the time required to conduct analysis from 27.2 hours without the IBM Puredata data warehouse appliance to 11.7 minutes with it
– Carry out their research with little to no database administration
– Publish multiple articles in scientific journals, with more in process
– Proceed with studies based on 'vector phenotypes', a more complex variable that will further push the IBM Puredata data warehouse appliance platform
Revolution R – Genome Wide Association Study
Genome Wide Association Study (GWAS)
– An epidemiological study that consists of the examination of many common genetic variants in different individuals to see if any variant is associated with a trait
Revolution R allows the bio-statisticians to work with Puredata as if they were simply using R on their desktop
– Simplicity with performance considered a “game-changer” by end-users
CRAN Library support allows them to benefit from the aggregate knowledge of the R community
– Extensive selection of packages for bio-statistics and relevant analytic techniques allows developers to be significantly more productive
“What finished in 2 hours 47 minutes on the 1000-6 was still running on the HPC environment and was not estimated to complete until after 14+ days.”
– Manager, IT Solution Delivery
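The association scan at the heart of a GWAS can be sketched in a few lines: for each variant, tally minor/major allele counts in cases and controls and rank variants by a 2x2 chi-square statistic. This is a generic illustration, not the Revolution R or Puredata implementation; the SNP IDs and counts are invented.

```python
# Minimal sketch of the per-variant test behind a GWAS: for each SNP,
# compare minor-allele counts in cases vs. controls with a 2x2 chi-square.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    # Shortcut formula for 2x2 tables
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def gwas_scan(variants):
    """variants: {snp_id: (case_minor, case_major, ctrl_minor, ctrl_major)}.
    Returns SNPs sorted by descending association statistic."""
    scored = {snp: chi_square_2x2(*counts) for snp, counts in variants.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])

variants = {
    "rs0001": (60, 40, 30, 70),   # allele counts skewed toward cases
    "rs0002": (50, 50, 48, 52),   # essentially no association
}
ranked = gwas_scan(variants)
```

A production scan differs only in scale: the same independent per-SNP test is applied to millions of variants, which is why the workload parallelizes so well across database workers.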
EC2 vs Puredata: Bowtie Financial Value Proposition
What amount of processing on EC2 equates to the cost of a Puredata system?
Bowtie on EC2
– Assume the new system will be used for Bowtie, and only Bowtie
– Today, Bowtie takes 3 hours on 320 EC2 CPUs
– Cost of each Bowtie run = 3 hours × 320 CPUs × $0.68 per CPU per hour = $653* per run

Bowtie on Puredata
– A TF-6 costs $600K*, or $200K per year assuming a 3-year deferral
– How many times would a customer have to run Bowtie on EC2 (on-demand) for the same expenditure?
– $200K per year / $653 per run = 306 Bowtie runs per year

Bottom line: Puredata is the better financial value proposition if the need to run Bowtie exceeds roughly 300 times a year for 3 years. The Puredata TF-6 also offers 9x the Bowtie processing capacity of a comparably priced EC2 environment.
*Costs are relative, based upon list pricing 2010
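The slide's break-even arithmetic can be reproduced as a short script; the prices are the 2010 list figures quoted above, not current rates.

```python
# The slide's 2010 break-even arithmetic, reproduced as a small script.
EC2_CPUS = 320
HOURS_PER_RUN = 3
PRICE_PER_CPU_HOUR = 0.68          # 2010 on-demand list price used on the slide

cost_per_run = EC2_CPUS * HOURS_PER_RUN * PRICE_PER_CPU_HOUR   # about $653

PUREDATA_COST_PER_YEAR = 200_000   # $600K TF-6 list price spread over 3 years
breakeven_runs_per_year = PUREDATA_COST_PER_YEAR / cost_per_run  # about 306
```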
Benchmarks on Tanay ZX Series
For all the options traded in the US in a given day as reported by OPRA (500K to 1 million trades), implied volatility can be calculated by the Tanay ZX Series in less than 500 milliseconds.
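An implied-volatility benchmark like the one above inverts an option-pricing model once per trade. A minimal sketch of that per-trade computation, using Black-Scholes for a European call and bisection for the inversion (the Tanay ZX implementation is not public; this only illustrates the workload):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call."""
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    """Invert bs_call for sigma by bisection (price is increasing in sigma)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Each trade's inversion is independent, so a million of them parallelize trivially, which is what makes sub-second batch timings plausible on specialized hardware.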
Applications in Healthcare
• Predictive Modeling
– Relationship between conditions, genetics, demographics, and outcomes
– Survival modeling
• Monte Carlo Simulations
– Gibbs sampling for MCMC studies on patient data
– Drug design: molecular docking
• Gene Sequencing
– Parallelized Basic Local Alignment Search Tool (BLAST)
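Gibbs sampling, mentioned above for MCMC studies, can be illustrated with a toy bivariate-normal sampler: each coordinate is redrawn from its exact conditional given the other. A real study on patient data would apply the same alternating-conditional structure to model parameters; the distribution here is chosen only because its conditionals are known in closed form.

```python
import random

# Toy Gibbs sampler for a bivariate standard normal with correlation rho:
# x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2).

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    rng = random.Random(seed)
    x = y = 0.0
    cond_sd = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)   # resample x from its conditional
        y = rng.gauss(rho * x, cond_sd)   # resample y from its conditional
        samples.append((x, y))
    return samples

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
```

The empirical correlation of the draws converges to the target rho, which is the usual sanity check that the chain is sampling the intended joint distribution.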
Sensor & Geospatial Data Processing
High Speed Analytics on Farm Telematics
IBM Puredata Spatial – Precision Farming

Example – Farm Equipment Company
Yield data (GPS), soil data, Common Land Units (CLUs), elevation, farm plots

Intersect:
– 48 million crop yield records (points)
– 30 million Common Land Units
Result:
– ~411,000 summary records by CLU (min, max, avg yield)
Total time: ~45 minutes
“We would not even attempt to do this large a process on Oracle.”-Customer GIS Analyst
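The rollup step of that spatial intersect (one min/max/avg yield record per CLU) can be sketched as below. The point-in-polygon join itself would run in-database with Puredata Spatial; this sketch only shows the summary aggregation over already-joined (CLU, yield) pairs, with invented IDs and values.

```python
from collections import defaultdict

# Roll GPS yield points up to one (min, max, avg) record per Common Land Unit.

def summarize_by_clu(yield_points):
    """yield_points: iterable of (clu_id, yield_value) pairs."""
    buckets = defaultdict(list)
    for clu_id, value in yield_points:
        buckets[clu_id].append(value)
    return {
        clu: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for clu, v in buckets.items()
    }

points = [("CLU-1", 120.0), ("CLU-1", 140.0), ("CLU-2", 90.0)]
summary = summarize_by_clu(points)
```

In SQL this is a plain GROUP BY over the join output, which is why pushing it into the warehouse scales to tens of millions of points.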
Vestas
A global wind energy company based in Denmark

Business Challenge
– Improve placement of wind turbines: save time, increase output, extend service life

Project objectives
– Leverage a large volume of weather data (2.8 PB today; ~16 PB by 2015)
– Reduce modeling time from weeks to hours
– Optimize ongoing operations

Why IBM?
– Domain expertise
– Reliability, security, scalability, and an integrated solution
– Standard enterprise software support
– Single vendor for software, hardware, storage, and support

The Solution: IBM InfoSphere BigInsights Enterprise Edition on IBM xSeries hardware
University of Ontario Institute of Technology
Detecting life-threatening conditions in neonatal care units

Business Challenge
– Premature births and associated health risks are on the rise.
– Enormous data loss from patient monitoring equipment: 3,600 readings/hr reduced to 1 spot reading/hr.
– By analyzing physical behaviors (heart rate, respiration, etc.) it is possible to determine when the body is coming under stress.

Project objectives
– Analyze ALL the data in real time to detect when a baby is becoming unwell earlier than is possible today.
– Reduce the average length of stay in neonatal intensive care, reducing healthcare costs.

The benefits
– Analyze ~90 million points of data per day per patient in real time: every reading taken is analyzed.
– Able to stream the data into a database and show that the process can keep pace with the incoming data.

Solution Components:
– InfoSphere Streams, on premises and in the cloud
– Warehouse to correlate physical behavior across different populations
– Models developed in the warehouse used to analyze streaming data
“I could see that there were enormous opportunities to capture, store and utilize this data in real time to improve the quality of care for neonatal babies.”
– Dr. Carolyn McGregor, Canada Research Chair in Health Informatics, University of Ontario Institute of Technology
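A minimal sketch of the kind of sliding-window screening a streaming platform applies to vital signs: flag any reading that drifts more than a few standard deviations from the recent window. This is illustrative only, not the UOIT or InfoSphere Streams algorithm; the window size and threshold are arbitrary.

```python
from collections import deque

class StreamMonitor:
    """Flag readings far outside the rolling window of recent values."""

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, reading):
        """Return True if the reading is anomalous vs. the current window."""
        alarm = False
        if len(self.window) >= 10:              # need a minimal baseline first
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            sd = var ** 0.5 or 1e-9             # guard against a flat window
            alarm = abs(reading - mean) > self.threshold * sd
        self.window.append(reading)
        return alarm
```

The point of the streaming architecture is that this per-reading work is constant-time, so analyzing every one of ~90 million readings per patient per day is feasible.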
Pacific Northwest Smart Grid Demonstration Project
Capabilities:
Stream Computing – real-time control system
Deep Analytics Appliance – analyze massive data sets
Demonstrates scalability from 100 to 500K homes while retaining 10 years’ historical data
60k metered customers in 5 states
Accommodates ad hoc analysis of price fluctuation, energy consumption profiles, risk, fraud detection, grid health, etc.
Hardcore Research
Computational Biology and Healthcare – Groups and Projects
Computational Biology Center (Watson Research Lab)
– Comparative Genomics
– Protein Folding
– DNA Transistor (nanopore sequencing)

Healthcare Informatics (Almaden Research Lab) ***
– AALIM: Advanced Analytics for Information Management
– The Spatiotemporal Epidemiological Modeler (STEM)
– Genome-Wide Association Studies for Predictive Healthcare ***

Healthcare Solutions (Haifa Research Lab)
– HIV Therapy Prediction (based on virus DNA markers)
– HYPERGENES (genetics of hypertension)
Diagram: MapReduce workflow for DNA sequence analysis.
MAP step: parallel Aligner tasks convert DNA reads into aligned reads, grouped by chromosome (Chr 1 … Chr 22 … Chr Y).
REDUCE step: parallel Variation Caller tasks emit variation by chromosomal region (Chr 1 SNPs … Chr 22 SNPs … Chr Y SNPs).
Bioinformatics on Hadoop: Alignment and Variant Calling
This DNA sequence analysis workflow is implemented in the academic software Crossbow (Bowtie aligner + SOAPsnp variant caller)
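The map/shuffle/reduce split above can be sketched structurally: a map phase keys toy "alignments" by chromosome and position, a shuffle groups them, and a reduce phase emits one call per region. The alignment and calling logic here are stand-in toys (the real pipeline uses Bowtie and SOAPsnp); only the data flow mirrors the slide.

```python
# Structural sketch of the Crossbow-style MapReduce workflow.

def map_align(read, reference):
    """Toy aligner: exact substring search, keyed by (chromosome, position)."""
    for chrom, seq in reference.items():
        pos = seq.find(read)
        if pos >= 0:
            yield (chrom, pos), read

def reduce_call(key, reads):
    """Toy variant caller: report each aligned position with its read depth."""
    chrom, pos = key
    return {"chrom": chrom, "pos": pos, "depth": len(reads)}

def run_pipeline(reads, reference):
    # Shuffle phase: group mapped reads by their (chromosome, position) key.
    groups = {}
    for read in reads:
        for key, value in map_align(read, reference):
            groups.setdefault(key, []).append(value)
    return [reduce_call(k, v) for k, v in sorted(groups.items())]

reference = {"chr1": "ACGTACGT", "chr2": "TTTTGGGG"}
calls = run_pipeline(["CGTA", "TTGG"], reference)
```

Because reads are independent in the map phase and chromosomes are independent in the reduce phase, both phases scale out across a Hadoop cluster, which is the point of the slide.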
Health Outcomes and Epidemiology
Health Analytics – GPRD & OMOP Benchmarking
General Practice Research Database (GPRD)
– European clinical records database: 11 million patients, 24 years
– Pharmacovigilance, drug utilization, and health outcomes analytics

Observational Medical Outcomes Partnership (OMOP)
– Shared models and methods for health outcomes analytics

Migration of GPRD & OMOP data models, ETL, and SAS programs to Puredata
– 266 GB of raw data, compressed to 75 GB
– Puredata 1000-6 with SAS & SAS/Access 9.2

Benchmark results:

Workload                            SAS alone                            SAS w/ Puredata
GPRD copy                           65 hours                             ~14 hours
OMOP copy                           Failed to complete (out of memory)   ~14 hours
GPRD to GPRD-OMOP transformation    Two weeks                            ~51 minutes
Small SAS analysis                  6h 38m                               0h 3m
Large SAS analysis                  Failed to complete (out of memory)   16m 47s
Standard "OSCAR" analysis           Failed to complete (out of memory)   6h 21m
IHCIS summary                       n/a                                  48 seconds
Improved Clinical Insights

Problem
– Post-launch monitoring of clinical data required manual integration of data across several data providers
– Differences in data assets did not facilitate integration over time
– Flat-file-oriented processing significantly increased the complexity of analysis

Effect
– Data inquiries performed overnight via manually developed SAS programs
– Pre-formatted data sets (built to simplify integration) did not enable access to the unique characteristics of different sources
– Significant data duplication, at the expense of good data management

Implementation Scope
– Migration of SAS-based flat files to a relational environment in Puredata
– Optimization of existing SAS code to best leverage Puredata in-database functionality
– Integration of clinical data from Premiere and IMS with sales & marketing data from IMS, WK, and SDI

Result
– Reduction in time to perform data inquiries and analytics
– Advanced analytic capabilities for clinical analysis
– End-users gain immediate benefit without significant retooling or new technologies

Improvement Metrics
– Immediate 10x performance improvement on existing SAS applications with little to no rework of original code
– Traditional S&M data assets leveraged as an early indicator for more detailed investigations with clinical data
– Company data management strategy focused on centralizing data assets on Puredata to improve analytic capabilities
Revolution R – Medication Response Modeling

Understand the major causes of morbidity and mortality related to inaccurate dosages of medications
– Relating individual responses to medication with genetic makeup and environmental factors indicated by biomarker measurement

The Puredata architecture allowed for the re-use of the existing investment in R
– 20x+ performance improvement over the existing HPC infrastructure
– Business logic of the R code remained intact
– The speed at which they could interrogate the data allowed them to experiment with many models

Explored use of in-database analytics (dbLytix)
– 1000x performance improvement over the existing HPC infrastructure

               HPC Environment       IBM Puredata 1000-12       IBM Puredata 1000-12
Language       R                     R (nzTapply)               dbLytix
Platform       Linux HPC             IBM Puredata               IBM Puredata
Nodes          TBD (100+)            12                         12
Deployment     R server              Revolution R Enterprise    dbLytix
Interface      Linux command line    Desktop GUI (nzTapply)     24.9 Billion
Elapsed Time   36+ hours             2 hours                    2 minutes
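The nzTapply pattern (partition the rows by a group key, apply the same function to every partition in parallel, collect one result per group) can be sketched in Python. The fit() function here is a stand-in per-group mean; in the Revolution R deployment the equivalent step is an R model call pushed down to the Puredata workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key_index=0):
    """Group rows by the value in the key column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups

def fit(rows):
    """Stand-in per-partition 'model': the mean of the value column."""
    values = [r[1] for r in rows]
    return sum(values) / len(values)

def group_apply(rows, fn=fit, workers=4):
    """Apply fn to each partition in parallel; return {group: result}."""
    groups = partition(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fn, groups.values())
    return dict(zip(groups.keys(), results))

rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
out = group_apply(rows)
```

Because partitions never share state, the same pattern scales from a thread pool on one machine to hundreds of database workers, which is what the elapsed-time column above reflects.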
Optum Insight – Predictive Healthcare with Fuzzy Logix

Predict who is at risk of developing diabetes 6 months, 1 year, and 2 years out
– Provide more advanced intervention services for health plans
– Based upon medical observations and patient demographics
– The in-memory analytic environment was limited to analyzing small sets of patient cohorts with limited computational capability

"Optum Insight could not do this work without Netezza or Fuzzy Logix"
– Leveraged 79x more medical observations, processed 150x faster
– From 150 variables to 1,700, with a capacity for 5,000+
– Fast iterative analysis for continuous model improvement

                               In-Memory Analytics    Puredata            Puredata
                               (Linux Server)         Model v.1 (UDA)     Model v.2 (UDTF)
Nodes                          Multi-core/4GB RAM     24 CPU/96 Cores     24 CPU/96 Cores
Observations (rows)            2 Million              2 Million           14 Million
Variables (dimensions)         150                    1,500               1,700
Evaluations (matrix elements)  300 Million            2.7 Billion         23.8 Billion
Elapsed Time                   > 5 Hours              30 minutes          4 hours
Improvement*                   n/a                    10x                 150x

*Improvement estimates are conservative. Models on Puredata leveraged 10x the amount of data while performing several additional calculations that were not feasible with the in-memory solution.
Health Insurance Provider – SAS SQL Pass-Thru
              IBM BCU     Puredata 1000-12
CPU           22          24 CPU/96 Cores
Storage       32.0 TB     32.0 TB
Data          1.5 TB      0.8 TB
Indices       1.5 TB      Not applicable
Duplication   15.0 TB     Not applicable
Tuning        6 months    1-week POC

• IBM BCU database used for SAS applications
– More than 5 years of policyholder data
– "Nasty" queries: "Decision tree across neural networks by zip code"
– Had just undergone 6 months of tuning
• Puredata POC loaded raw data with no optimization
– Testing used SAS/Access for ODBC
– Will be even faster with SAS/Access for Puredata
– 15x average performance improvement
– Over 26x on long-running analytics
Chart: elapsed time in seconds (0 to 10,000) for the queries SAS2, SAS3, Marketing, Mixed Query, Retention1, Retention2, Rewrite, and Indication, comparing Puredata (TwinFin-12) against the IBM BCU.
Puredata easily outperforms a finely tuned BCU.
Catalina Marketing – In-Database Analytics

35x improvement in staff productivity
– Model development reduced from 2+ months to 2 days
– From tens of models to hundreds per year with the same staff

Increased depth of data per model
– From 150 to 3.2 million features
– From 1 million to 14.5 trillion records per analysis

Impressive ROI on IT investment
– 12 million times more data processed per CPU per day
– Direct correlation of model development to revenue
Charts:
"Developing Models Faster..." shows speedup (up to 40x) versus number of CPUs (1, 4, 24, 400, 4000).
"While Increasing Depth of Data" shows days for model build and rows/day/CPU (millions) versus number of CPUs (1, 4, 24, 400, 4000).
Catalina – SAS Scoring Accelerator

PC SAS: 1 CPU, 0.5 TB storage; 1 million rows; 30 brands, 5 variables; model build and deploy in 70-84 days; 14,286 rows/CPU/day; 14,286 rows/day.
Unix SAS: 4 CPU, 5 TB; 4 billion rows; 30 brands, 800 categories, 5 variables; 35 days; 28,571,429 rows/CPU/day (2,000x per-CPU speedup over PC SAS); 114,285,714 rows/day (8,000x per-day speedup).
SAS MP Connect: 24 CPU, 15 TB; 4 billion rows; 30 brands, 800 categories, 5 variables; 10 days; 16,666,667 rows/CPU/day (1,167x); 400,000,000 rows/day (28,000x).
Puredata SAS SQL Pass-Thru: 400 CPU, 120 TB; 14 trillion rows; 80,000 brands, 800 categories, 5 variables; 3 days; 12,083,333,333 rows/CPU/day (845,833x); 4,833,333,333,333 rows/day (338,333,333x).
Puredata SAS Scoring Accelerator: 400 CPU, 120 TB; 140 trillion rows; 80,000 brands, 800 categories, 5 variables; 2 days; 175,625,000,000 rows/CPU/day (12,293,750x); 70,250,000,000,000 rows/day (4,917,500,000x).
Why computational pharmaco-epidemiology?
• The FDA will be implementing a system to track the safety of drugs and devices through active surveillance on tens to hundreds of terabytes of claims data (FDA Mini-Sentinel)
• Pharma companies want to innovate ahead of Sentinel and find new markets and risks
• Payers want to measure their insured populations, providers, outcomes, and ACOs
• Comparative effectiveness research is a top priority for next-generation healthcare

What is the Harvard Computational Pharmaco-Epidemiology Program?
• Harvard Medical School faculty selected Puredata for complex pharmaco-epi analytics and studies on drug effectiveness & safety
• 100% of computation runs on Puredata
• Faculty are in the Methods Core of FDA Mini-Sentinel and globally esteemed

Harvard Medical School collaboration

Why is it special?
• These end users have no IT budget and no DBAs, period!