© 2012 IBM Corporation
Big Data Analytics for Life & Agricultural Sciences
IBM Big Data Session

IBM Big Data Platform Overview
Bill Zanine, GM, IBM Advanced Analytics for Big Data

IBM Big Data Life Sciences: Healthcare & Research Use Cases
Bill Zanine
The IBM Big Data Platform
InfoSphere BigInsights (Hadoop)
– Hadoop-based, low-latency analytics for variety and volume

InfoSphere Streams (Stream Computing)
– Low-latency analytics for streaming data

IBM Puredata System for Operational Analytics (MPP Data Warehouse)
– BI and ad hoc analytics on structured data

IBM Smart Analytics System (MPP Data Warehouse)
– Operational analytics on structured data

High-Performance Computing
– IBM Blue Gene/Q: high performance for computationally intensive applications
– Platform Computing: high-performance framework for distributed computing
IBM Big Data Strategy: Optimized Analytic & Compute Platforms
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional Apps, Industry Apps, Predictive Analytics, Content Analytics

IBM Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Analytic Accelerators, Information Integration & Governance

Engines: Hadoop System, Stream Computing, Data Warehouse
Big data, business strategy, and new analytics require optimized analytic platforms:
• Integrate and manage the full variety, velocity, and volume of data
• Variety, velocity, and volume further drive the need for optimized compute platforms
• Operate on data in its native form and in its current location, minimizing movement
• Simplify data management, embed governance, and seamlessly leverage tools across platforms
Big Data Computing
Peta-scale Analytics Real-time Analytics
IBM Big Data Platforms Span the Breadth of Data Intensive Analytic Computing Needs
No single architecture can fulfill all the compute needs for big data analytics, exploration and reporting
Specialized architectures and technologies are optimized to solve specific compute profiles, data volumes and data types
Architectures must be leveraged appropriately for scalable, cost-effective computing
Highly planned, batch
Scalable batch, unstructured
Batch & interactive, structured
Highly planned, autonomous
Petascale Computing
Petascale Data Processing
Petascale Interactive Data Processing
In-Line Analytics
Data Intensive Computing for Life Sciences
Diagram: life-sciences data mapped to compute profiles:
Streaming Data: sensor data (manufacturing, medical, environmental & lab); in-line analytics for patient monitoring and quality control
Peta-Scale Data: sequencer data, assays, medical records, tissue & cell imaging; data-intensive compute for SNP alignment, image classification, attribute extraction
Peta-Scale Computing: simulations & models, publication graphs; compute acceleration for protein science, molecular dynamics, complex simulations, matrix acceleration
Data-Intensive, Interactive Computing: image metadata, genomic profiles, environmental & patient attribution; acceleration for complex graph analysis, translational medicine, genotype/phenotype, predictive healthcare
Tackling Computational Challenges in Life Sciences
The Vast Datascape of Life Sciences
Scalable, Cost Efficient Genomic Processing and Analysis
Scalable Sensor Processing and Geospatial Analytics
Health Outcomes and Epidemiology
Combining Big Data and Smarter Analytics to Improve Performance in the Life Sciences
Diagram: data sources and analyses combined in the life sciences:
Environmental records: weather, topography, soil, vegetation, demographics, utilities
Health records: claims, tissue, lab results, longitudinal data, diagnoses, prescriptions
Biologic records: SNPs, proteins, human tissue, RNA, plant
Plant records: yield, disease
Analyses: gene assembly, proteomics, simulations, geospatial analysis, clustering, translational analysis, gene interaction
Scalable Genomic Data Processing
SUNY Buffalo – Center for Computational Research: Data Intensive Discovery Initiative
SUNY Buffalo – Large Gene Interaction Analytics: UB Center for Protein Therapeutics
– Use new algorithms and add multiple variables that were previously nearly impossible to analyze
– Reduce the time required to conduct analysis from 27.2 hours without the IBM Puredata data warehouse appliance to 11.7 minutes with it
– Carry out their research with little to no database administration
– Publish multiple articles in scientific journals, with more in process
– Proceed with studies based on 'vector phenotypes', a more complex variable that will further push the IBM Puredata data warehouse appliance platform
Revolution R – Genome Wide Association Study
Genome Wide Association Study (GWAS)
– An epidemiological study that consists of the examination of many common genetic variants in different individuals to see if any variant is associated with a trait
Revolution R allows the bio-statisticians to work with Puredata as if they were simply using R on their desktop
– Simplicity with performance considered a “game-changer” by end-users
CRAN Library support allows them to benefit from the aggregate knowledge of the R community
– Extensive selection of packages for bio-statistics and relevant analytic techniques allows developers to be significantly more productive
“What finished in 2 hours 47 minutes on the 1000-6 was still running on the HPC environment and was not estimated to complete until after 14+ days.”
– Manager, IT Solution Delivery
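The association scan at the heart of a GWAS can be sketched in a few lines: for each variant, tally minor/major allele counts in cases and controls and rank variants by a 2x2 chi-square statistic. This is a generic illustration, not the Revolution R or Puredata implementation; the SNP IDs and counts are invented.

```python
# Minimal sketch of the per-variant test behind a GWAS: for each SNP,
# compare minor-allele counts in cases vs. controls with a 2x2 chi-square.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    # Shortcut formula for 2x2 tables
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def gwas_scan(variants):
    """variants: {snp_id: (case_minor, case_major, ctrl_minor, ctrl_major)}.
    Returns SNPs sorted by descending association statistic."""
    scored = {snp: chi_square_2x2(*counts) for snp, counts in variants.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])

variants = {
    "rs0001": (60, 40, 30, 70),   # allele counts skewed toward cases
    "rs0002": (50, 50, 48, 52),   # essentially no association
}
ranked = gwas_scan(variants)
```

A production scan differs only in scale: the same independent per-SNP test is applied to millions of variants, which is why the workload parallelizes so well across database workers.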
EC2 vs Puredata: Bowtie Financial Value Proposition
What amount of processing on EC2 equates to the cost of a Puredata system?
Bowtie on EC2
– Assume the new system will be used for Bowtie, and only Bowtie
– Today, Bowtie takes 3 hours on 320 EC2 CPUs
– Cost of each Bowtie run = 3 hours × 320 CPUs × $0.68 per CPU per hour = $653* per run

Bowtie on Puredata
– A TF-6 costs $600K*, or $200K per year assuming a 3-year deferral
– How many times would a customer have to run Bowtie on EC2 (on-demand) for the same expenditure?
– $200K per year / $653 per run = 306 Bowtie runs per year

Bottom line: Puredata is the better financial value proposition if the need to run Bowtie exceeds roughly 300 times a year for 3 years. The Puredata TF-6 also offers 9x the Bowtie processing capacity of a comparably priced EC2 environment.
*Costs are relative, based upon list pricing 2010
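The slide's break-even arithmetic can be reproduced as a short script; the prices are the 2010 list figures quoted above, not current rates.

```python
# The slide's 2010 break-even arithmetic, reproduced as a small script.
EC2_CPUS = 320
HOURS_PER_RUN = 3
PRICE_PER_CPU_HOUR = 0.68          # 2010 on-demand list price used on the slide

cost_per_run = EC2_CPUS * HOURS_PER_RUN * PRICE_PER_CPU_HOUR   # about $653

PUREDATA_COST_PER_YEAR = 200_000   # $600K TF-6 list price spread over 3 years
breakeven_runs_per_year = PUREDATA_COST_PER_YEAR / cost_per_run  # about 306
```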
Benchmarks on Tanay ZX Series
For all the options traded in the US in a given day as reported by OPRA (500K to 1 million trades), implied volatility can be calculated by the Tanay ZX Series in less than 500 milliseconds.
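An implied-volatility benchmark like the one above inverts an option-pricing model once per trade. A minimal sketch of that per-trade computation, using Black-Scholes for a European call and bisection for the inversion (the Tanay ZX implementation is not public; this only illustrates the workload):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call."""
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    """Invert bs_call for sigma by bisection (price is increasing in sigma)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Each trade's inversion is independent, so a million of them parallelize trivially, which is what makes sub-second batch timings plausible on specialized hardware.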
Applications in Healthcare
• Predictive Modeling
– Relationship between conditions, genetics, demographics, and outcomes
– Survival modeling
• Monte Carlo Simulations
– Gibbs sampling for MCMC studies on patient data
– Drug design: molecular docking
• Gene Sequencing
– Parallelized Basic Local Alignment Search Tool (BLAST)
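Gibbs sampling, mentioned above for MCMC studies, can be illustrated with a toy bivariate-normal sampler: each coordinate is redrawn from its exact conditional given the other. A real study on patient data would apply the same alternating-conditional structure to model parameters; the distribution here is chosen only because its conditionals are known in closed form.

```python
import random

# Toy Gibbs sampler for a bivariate standard normal with correlation rho:
# x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2).

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    rng = random.Random(seed)
    x = y = 0.0
    cond_sd = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)   # resample x from its conditional
        y = rng.gauss(rho * x, cond_sd)   # resample y from its conditional
        samples.append((x, y))
    return samples

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
```

The empirical correlation of the draws converges to the target rho, which is the usual sanity check that the chain is sampling the intended joint distribution.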
Sensor & Geospatial Data Processing
High Speed Analytics on Farm Telematics
IBM Puredata Spatial – Precision Farming

Example – Farm Equipment Company
Yield data (GPS), soil data, Common Land Units (CLUs), elevation, farm plots

Intersect:
– 48 million crop yield records (points)
– 30 million Common Land Units
Result:
– ~411,000 summary records by CLU (min, max, avg yield)
Total time: ~45 minutes
“We would not even attempt to do this large a process on Oracle.”-Customer GIS Analyst
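The rollup step of that spatial intersect (one min/max/avg yield record per CLU) can be sketched as below. The point-in-polygon join itself would run in-database with Puredata Spatial; this sketch only shows the summary aggregation over already-joined (CLU, yield) pairs, with invented IDs and values.

```python
from collections import defaultdict

# Roll GPS yield points up to one (min, max, avg) record per Common Land Unit.

def summarize_by_clu(yield_points):
    """yield_points: iterable of (clu_id, yield_value) pairs."""
    buckets = defaultdict(list)
    for clu_id, value in yield_points:
        buckets[clu_id].append(value)
    return {
        clu: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for clu, v in buckets.items()
    }

points = [("CLU-1", 120.0), ("CLU-1", 140.0), ("CLU-2", 90.0)]
summary = summarize_by_clu(points)
```

In SQL this is a plain GROUP BY over the join output, which is why pushing it into the warehouse scales to tens of millions of points.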
Vestas
A global wind energy company based in Denmark

Business Challenge
– Improve placement of wind turbines: save time, increase output, extend service life

Project objectives
– Leverage a large volume of weather data (2.8 PB today; ~16 PB by 2015)
– Reduce modeling time from weeks to hours
– Optimize ongoing operations

Why IBM?
– Domain expertise
– Reliability, security, scalability, and an integrated solution
– Standard enterprise software support
– Single vendor for software, hardware, storage, and support

The Solution: IBM InfoSphere BigInsights Enterprise Edition on IBM xSeries hardware
University of Ontario Institute of Technology
Detecting life-threatening conditions in neonatal care units

Business Challenge
– Premature births and associated health risks are on the rise.
– Enormous data loss from patient monitoring equipment: 3,600 readings/hr reduced to 1 spot reading/hr.
– By analyzing physical behaviors (heart rate, respiration, etc.) it is possible to determine when the body is coming under stress.

Project objectives
– Analyze ALL the data in real time to detect when a baby is becoming unwell earlier than is possible today.
– Reduce the average length of stay in neonatal intensive care, reducing healthcare costs.

The benefits
– Analyze ~90 million points of data per day per patient in real time: every reading taken is analyzed.
– Able to stream the data into a database and show that the process can keep pace with the incoming data.

Solution Components:
– InfoSphere Streams, on premises and in the cloud
– Warehouse to correlate physical behavior across different populations
– Models developed in the warehouse used to analyze streaming data
“I could see that there were enormous opportunities to capture, store and utilize this data in real time to improve the quality of care for neonatal babies.”
– Dr. Carolyn McGregor, Canada Research Chair in Health Informatics, University of Ontario Institute of Technology
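A minimal sketch of the kind of sliding-window screening a streaming platform applies to vital signs: flag any reading that drifts more than a few standard deviations from the recent window. This is illustrative only, not the UOIT or InfoSphere Streams algorithm; the window size and threshold are arbitrary.

```python
from collections import deque

class StreamMonitor:
    """Flag readings far outside the rolling window of recent values."""

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, reading):
        """Return True if the reading is anomalous vs. the current window."""
        alarm = False
        if len(self.window) >= 10:              # need a minimal baseline first
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            sd = var ** 0.5 or 1e-9             # guard against a flat window
            alarm = abs(reading - mean) > self.threshold * sd
        self.window.append(reading)
        return alarm
```

The point of the streaming architecture is that this per-reading work is constant-time, so analyzing every one of ~90 million readings per patient per day is feasible.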
Pacific Northwest Smart Grid Demonstration Project
Capabilities:
Stream Computing – real-time control system
Deep Analytics Appliance – analyze massive data sets
Demonstrates scalability from 100 to 500K homes while retaining 10 years’ historical data
60k metered customers in 5 states
Accommodates ad hoc analysis of price fluctuation, energy consumption profiles, risk, fraud detection, grid health, etc.
Hardcore Research
Computational Biology and Healthcare – Groups and Projects
Computational Biology Center (Watson Research Lab)
– Comparative Genomics
– Protein Folding
– DNA Transistor (nanopore sequencing)

Healthcare Informatics (Almaden Research Lab) ***
– AALIM: Advanced Analytics for Information Management
– The Spatiotemporal Epidemiological Modeler (STEM)
– Genome-Wide Association Studies for Predictive Healthcare ***

Healthcare Solutions (Haifa Research Lab)
– HIV Therapy Prediction (based on virus DNA markers)
– HYPERGENES (genetics of hypertension)
Diagram: MapReduce workflow for DNA sequence analysis.
MAP step: parallel Aligner tasks convert DNA reads into aligned reads, grouped by chromosome (Chr 1 … Chr 22 … Chr Y).
REDUCE step: parallel Variation Caller tasks emit variation by chromosomal region (Chr 1 SNPs … Chr 22 SNPs … Chr Y SNPs).
Bioinformatics on Hadoop: Alignment and Variant Calling
This DNA sequence analysis workflow is implemented in the academic software Crossbow (Bowtie aligner + SOAPsnp variant caller)
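The map/shuffle/reduce split above can be sketched structurally: a map phase keys toy "alignments" by chromosome and position, a shuffle groups them, and a reduce phase emits one call per region. The alignment and calling logic here are stand-in toys (the real pipeline uses Bowtie and SOAPsnp); only the data flow mirrors the slide.

```python
# Structural sketch of the Crossbow-style MapReduce workflow.

def map_align(read, reference):
    """Toy aligner: exact substring search, keyed by (chromosome, position)."""
    for chrom, seq in reference.items():
        pos = seq.find(read)
        if pos >= 0:
            yield (chrom, pos), read

def reduce_call(key, reads):
    """Toy variant caller: report each aligned position with its read depth."""
    chrom, pos = key
    return {"chrom": chrom, "pos": pos, "depth": len(reads)}

def run_pipeline(reads, reference):
    # Shuffle phase: group mapped reads by their (chromosome, position) key.
    groups = {}
    for read in reads:
        for key, value in map_align(read, reference):
            groups.setdefault(key, []).append(value)
    return [reduce_call(k, v) for k, v in sorted(groups.items())]

reference = {"chr1": "ACGTACGT", "chr2": "TTTTGGGG"}
calls = run_pipeline(["CGTA", "TTGG"], reference)
```

Because reads are independent in the map phase and chromosomes are independent in the reduce phase, both phases scale out across a Hadoop cluster, which is the point of the slide.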
Health Outcomes and Epidemiology
Health Analytics – GPRD & OMOP Benchmarking
General Practice Research Database (GPRD)
– European clinical records database: 11 million patients, 24 years
– Pharmacovigilance, drug utilization, and health outcomes analytics

Observational Medical Outcomes Partnership (OMOP)
– Shared models and methods for health outcomes analytics

Migration of GPRD & OMOP data models, ETL, and SAS programs to Puredata
– 266 GB of raw data, compressed to 75 GB
– Puredata 1000-6 with SAS & SAS/Access 9.2

Benchmark results:

Workload                            SAS alone                            SAS w/ Puredata
GPRD copy                           65 hours                             ~14 hours
OMOP copy                           Failed to complete (out of memory)   ~14 hours
GPRD to GPRD-OMOP transformation    Two weeks                            ~51 minutes
Small SAS analysis                  6h 38m                               0h 3m
Large SAS analysis                  Failed to complete (out of memory)   16m 47s
Standard "OSCAR" analysis           Failed to complete (out of memory)   6h 21m
IHCIS summary                       n/a                                  48 seconds
Improved Clinical Insights

Problem
– Post-launch monitoring of clinical data required manual integration of data across several data providers
– Differences in data assets did not facilitate integration over time
– Flat-file-oriented processing significantly increased the complexity of analysis

Effect
– Data inquiries performed overnight via manually developed SAS programs
– Pre-formatted data sets (built to simplify integration) did not enable access to the unique characteristics of different sources
– Significant data duplication, at the expense of good data management

Implementation Scope
– Migration of SAS-based flat files to a relational environment in Puredata
– Optimization of existing SAS code to best leverage Puredata in-database functionality
– Integration of clinical data from Premiere and IMS with sales & marketing data from IMS, WK, and SDI

Result
– Reduction in time to perform data inquiries and analytics
– Advanced analytic capabilities for clinical analysis
– End-users gain immediate benefit without significant retooling or new technologies

Improvement Metrics
– Immediate 10x performance improvement on existing SAS applications with little to no rework of original code
– Traditional S&M data assets leveraged as an early indicator for more detailed investigations with clinical data
– Company data management strategy focused on centralizing data assets on Puredata to improve analytic capabilities
Revolution R – Medication Response Modeling

Understand the major causes of morbidity and mortality related to inaccurate dosages of medications
– Relating individual responses to medication with genetic makeup and environmental factors indicated by biomarker measurement

The Puredata architecture allowed for the re-use of the existing investment in R
– 20x+ performance improvement over the existing HPC infrastructure
– Business logic of the R code remained intact
– The speed at which they could interrogate the data allowed them to experiment with many models

Explored use of in-database analytics (dbLytix)
– 1000x performance improvement over the existing HPC infrastructure

               HPC Environment       IBM Puredata 1000-12       IBM Puredata 1000-12
Language       R                     R (nzTapply)               dbLytix
Platform       Linux HPC             IBM Puredata               IBM Puredata
Nodes          TBD (100+)            12                         12
Deployment     R server              Revolution R Enterprise    dbLytix
Interface      Linux command line    Desktop GUI (nzTapply)     24.9 Billion
Elapsed Time   36+ hours             2 hours                    2 minutes
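The nzTapply pattern (partition the rows by a group key, apply the same function to every partition in parallel, collect one result per group) can be sketched in Python. The fit() function here is a stand-in per-group mean; in the Revolution R deployment the equivalent step is an R model call pushed down to the Puredata workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key_index=0):
    """Group rows by the value in the key column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return groups

def fit(rows):
    """Stand-in per-partition 'model': the mean of the value column."""
    values = [r[1] for r in rows]
    return sum(values) / len(values)

def group_apply(rows, fn=fit, workers=4):
    """Apply fn to each partition in parallel; return {group: result}."""
    groups = partition(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fn, groups.values())
    return dict(zip(groups.keys(), results))

rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
out = group_apply(rows)
```

Because partitions never share state, the same pattern scales from a thread pool on one machine to hundreds of database workers, which is what the elapsed-time column above reflects.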
Optum Insight – Predictive Healthcare with Fuzzy Logix

Predict who is at risk of developing diabetes 6 months, 1 year, and 2 years out
– Provide more advanced intervention services for health plans
– Based upon medical observations and patient demographics
– The in-memory analytic environment was limited to analyzing small sets of patient cohorts with limited computational capability

"Optum Insight could not do this work without Netezza or Fuzzy Logix"
– Leveraged 79x more medical observations, processed 150x faster
– From 150 variables to 1,700, with a capacity for 5,000+
– Fast iterative analysis for continuous model improvement

                               In-Memory Analytics    Puredata            Puredata
                               (Linux Server)         Model v.1 (UDA)     Model v.2 (UDTF)
Nodes                          Multi-core/4GB RAM     24 CPU/96 Cores     24 CPU/96 Cores
Observations (rows)            2 Million              2 Million           14 Million
Variables (dimensions)         150                    1,500               1,700
Evaluations (matrix elements)  300 Million            2.7 Billion         23.8 Billion
Elapsed Time                   > 5 Hours              30 minutes          4 hours
Improvement*                   n/a                    10x                 150x

*Improvement estimates are conservative. Models on Puredata leveraged 10x the amount of data while performing several additional calculations that were not feasible with the in-memory solution.
Health Insurance Provider – SAS SQL Pass-Thru
              IBM BCU     Puredata 1000-12
CPU           22          24 CPU/96 Cores
Storage       32.0 TB     32.0 TB
Data          1.5 TB      0.8 TB
Indices       1.5 TB      Not applicable
Duplication   15.0 TB     Not applicable
Tuning        6 months    1-week POC

• IBM BCU database used for SAS applications
– More than 5 years of policyholder data
– "Nasty" queries: "Decision tree across neural networks by zip code"
– Had just undergone 6 months of tuning
• Puredata POC loaded raw data with no optimization
– Testing used SAS/Access for ODBC
– Will be even faster with SAS/Access for Puredata
– 15x average performance improvement
– Over 26x on long-running analytics
Chart: elapsed time in seconds (0 to 10,000) for the queries SAS2, SAS3, Marketing, Mixed Query, Retention1, Retention2, Rewrite, and Indication, comparing Puredata (TwinFin-12) against the IBM BCU.
Puredata easily outperforms a finely tuned BCU.
Catalina Marketing – In-Database Analytics

35x improvement in staff productivity
– Model development reduced from 2+ months to 2 days
– From tens of models to hundreds per year with the same staff

Increased depth of data per model
– From 150 to 3.2 million features
– From 1 million to 14.5 trillion records per analysis

Impressive ROI on IT investment
– 12 million times more data processed per CPU per day
– Direct correlation of model development to revenue
Charts:
"Developing Models Faster..." shows speedup (up to 40x) versus number of CPUs (1, 4, 24, 400, 4000).
"While Increasing Depth of Data" shows days for model build and rows/day/CPU (millions) versus number of CPUs (1, 4, 24, 400, 4000).
Catalina – SAS Scoring Accelerator

PC SAS: 1 CPU, 0.5 TB storage; 1 million rows; 30 brands, 5 variables; model build and deploy in 70-84 days; 14,286 rows/CPU/day; 14,286 rows/day.
Unix SAS: 4 CPU, 5 TB; 4 billion rows; 30 brands, 800 categories, 5 variables; 35 days; 28,571,429 rows/CPU/day (2,000x per-CPU speedup over PC SAS); 114,285,714 rows/day (8,000x per-day speedup).
SAS MP Connect: 24 CPU, 15 TB; 4 billion rows; 30 brands, 800 categories, 5 variables; 10 days; 16,666,667 rows/CPU/day (1,167x); 400,000,000 rows/day (28,000x).
Puredata SAS SQL Pass-Thru: 400 CPU, 120 TB; 14 trillion rows; 80,000 brands, 800 categories, 5 variables; 3 days; 12,083,333,333 rows/CPU/day (845,833x); 4,833,333,333,333 rows/day (338,333,333x).
Puredata SAS Scoring Accelerator: 400 CPU, 120 TB; 140 trillion rows; 80,000 brands, 800 categories, 5 variables; 2 days; 175,625,000,000 rows/CPU/day (12,293,750x); 70,250,000,000,000 rows/day (4,917,500,000x).
Why computational pharmaco-epidemiology?
• The FDA will be implementing a system to track the safety of drugs and devices through active surveillance on tens to hundreds of terabytes of claims data (FDA Mini-Sentinel)
• Pharma companies want to innovate ahead of Sentinel and find new markets and risks
• Payers want to measure their insured populations, providers, outcomes, and ACOs
• Comparative effectiveness research is a top priority for next-generation healthcare

What is the Harvard Computational Pharmaco-Epidemiology Program?
• Harvard Medical School faculty selected Puredata for complex pharmaco-epi analytics and studies on drug effectiveness & safety
• 100% of computation runs on Puredata
• Faculty are in the Methods Core of FDA Mini-Sentinel and globally esteemed

Harvard Medical School collaboration

Why is it special?
• These end users have no IT budget and no DBAs, period!