Industry Perspective: Big Data and Big Data Analytics
David Barnes, Program Director, Emerging Internet Technologies, IBM Software Group
What is Big Data?
The Adjacent Possible
Inexpensive disk + Increased processing power + Data Warehouse + The Web + X = Big Data
X = Sensors used to gather climate information, posts to social media sites, digital pictures and videos, transaction records, cell phone GPS signals, and more.
© 2010 IBM Corporation
161 exabytes of data were created in 2006, 3 million times the amount of information contained in all the books ever written.
In 2010 the number reached 988 exabytes.
IDC estimates that 1.8 zettabytes were created and replicated in 2011.
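As a rough check on these figures, the growth rate they imply can be worked out directly. The sketch below is illustrative only: the helper name `implied_annual_growth` is an assumption, and the 2011 figure counts data "created and replicated", so it may not be strictly comparable with the earlier numbers.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value in `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted on the slide, in exabytes (1 zettabyte = 1,000 exabytes).
created_2006 = 161
created_2010 = 988
created_2011 = 1800

growth_2006_2010 = implied_annual_growth(created_2006, created_2010, years=4)
growth_2010_2011 = implied_annual_growth(created_2010, created_2011, years=1)

print(f"2006-2010: {growth_2006_2010:.0%} per year")
print(f"2010-2011: {growth_2010_2011:.0%} per year")
```

Under these assumptions the 2006 to 2010 figures work out to a little under 60% growth per year.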
Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices,
online transactions, and social networks.
Every month people send one billion Tweets and post 30 billion messages on Facebook.
90% (or more) of the world’s data is unstructured.
The true nature of information
Is noisy
Is often dirty
Is often full of valuable information
Unstructured Data
Big Data has swept into every industry and business function.
Businesses need to put the power of Big Data analytics in the hands of their business employees; the title “Data Scientist” is somewhat misleading.
“Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers.” – McKinsey Global Institute
The Big Data Imperative
Big Data Business Patterns
Computational Journalism
Chief Legal Officer
Retail Business Planner
IT Systems Management
Pharma - Clinical Trials
Business Fraud Detection
Evidence Based Medicine
Web Archiving
. . .
Today’s Problem
Data is growing at a compound annual rate of 60%
Storage capacity continues to increase dramatically
Storage access speeds have not kept up
At a transfer speed of 500 MB/sec, 1 terabyte of data requires ~30 minutes to read from a single drive
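The read-time figure can be reproduced with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 terabyte = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split across drives; the function name `read_time_seconds` is hypothetical.

```python
def read_time_seconds(data_bytes: float, bytes_per_second: float, drives: int = 1) -> float:
    """Time to read data split evenly across `drives` drives that are read in parallel."""
    return data_bytes / drives / bytes_per_second


TERABYTE = 10**12  # decimal units (an assumption)
MB = 10**6

single = read_time_seconds(TERABYTE, 500 * MB)
print(f"1 drive: {single:.0f} s (~{single / 60:.0f} min)")
```

Calling the same function with `drives=100` gives 20 seconds, which is the motivation for spreading the data across many drives.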
Enter Map/Reduce
• Automates the mechanisms of large-scale distributed computation (i.e. work distribution, load balancing, replication, failure/recovery)
• Divide & Conquer: 1 terabyte split among 100 drives requires ~20 seconds to read
• The M/R parallel processing model provides a cost-effective framework for a new generation of analytic applications on unstructured or semi-structured data
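To make the programming model concrete, here is a single-process word count written in the map, shuffle, reduce style. It is a sketch of the idea only, not Hadoop's API: the function names and the sample input are assumptions. In a real framework each map call would run on a different machine against its own split of the data, and the framework would handle distribution, load balancing and recovery.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Two made-up input splits standing in for data stored on separate drives.
splits = ["big data is big", "data about data"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be run in parallel without coordination between workers.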
Requirement: A New Class of Big Data Applications
Big Data analytics must be brought to the line-of-business user.
•Leverage easy-to-use manipulation metaphors
•Use natural language technologies for analytics
•Provide rich visualizations to quickly identify insights
Demo: Buyer Sentiment Analysis
Social Media: Chilean Earthquake 2010
The 2010 Chilean earthquake was the fifth-largest earthquake in recorded history
The affected areas suffered major devastation - buildings, airports, hospitals, prisons, bridges, and roads were severely damaged
Land-based communications systems suffered major outages
The wireless 3G infrastructure remained intact and operational
Social Media: Chilean Earthquake 2010
Social networking over wireless networks became the major form of communication
Extreme Blue students collected 226 million Tweets, then analyzed and categorized them by incident type and location
Tweets included questions (“Can I get food?”, “Can I get gas?”, “Are the bridges down?”) and images
The results were visualized
Completed in ~12 weeks
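The talk does not say how the Tweets were categorized. As one possible illustration, the sketch below tags each Tweet with an incident type using simple keyword rules and counts the results per location. The categories, keywords, and sample records are all made up for this example; naive substring matching like this would need refinement before use on real data.

```python
from collections import Counter

# Hypothetical keyword rules; the real categories are not given in the talk.
CATEGORY_KEYWORDS = {
    "food": ("food", "water"),
    "fuel": ("gas", "fuel"),
    "infrastructure": ("bridge", "road"),
}


def categorize(text: str) -> str:
    """Return the first category whose keyword appears in the Tweet text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


# Made-up sample records: (Tweet text, location).
tweets = [
    ("Can I get food?", "Concepcion"),
    ("Can I get gas?", "Santiago"),
    ("Are the bridges down?", "Concepcion"),
]

counts = Counter((categorize(text), location) for text, location in tweets)
for (category, location), n in counts.items():
    print(category, location, n)
```

Counts keyed by (incident type, location) are the kind of aggregate that can then be plotted on a map or chart.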
Big Data = Volume, Variety and Velocity
• Volume - Scale from terabytes to zettabytes
• Variety - Relational and non-relational data types from an ever-expanding variety of sources
• Velocity - Streaming data and large volume data movement
The supercomputer is based on over 1,200 high-powered IBM System x servers and can perform 150 trillion calculations per second, equivalent to 30 million calculations per Danish citizen per second.
Vestas expects its data sets will grow to 20-plus petabytes over the next four years.
© 2011 IBM Corporation
Seton Healthcare Family
Reducing CHF readmission to improve care
Business Challenge
Seton Healthcare strives to reduce the occurrence of high-cost Congestive Heart Failure (CHF) readmissions by proactively identifying patients likely to be readmitted on an emergent basis.
Smarter Business Outcomes
• Seton will be able to proactively target care management and reduce re-admission of CHF patients.
• Teaming unstructured content with predictive analytics, Seton will be able to identify patients likely for re-admission and introduce early interventions to reduce cost, mortality
IBM solution
• IBM Content and Predictive Analytics for Healthcare
• IBM Cognos Business Intelligence
• IBM BAO solution services
What’s Smart?
IBM Content and Predictive Analytics for Healthcare solution will help to better target and understand high-risk CHF patients for care management programs by:
• Utilizing natural language processing to extract key elements from unstructured History and Physical, Discharge Summaries, Echocardiogram Reports, and Consult Notes
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action
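To give a feel for what "extracting key elements" from free text means, the sketch below pulls an ejection fraction value and a smoking status out of a clinical note with two regular expressions. This is not how the IBM product works: the patterns, field names, and sample note are all hypothetical, and a real clinical NLP pipeline would also have to handle negation, abbreviations, and context.

```python
import re

# Hypothetical patterns for illustration only.
LVEF_PATTERN = re.compile(
    r"\b(?:LVEF|ejection fraction)\b[^0-9%]{0,20}(\d{1,2})\s*%",
    re.IGNORECASE,
)
SMOKING_PATTERN = re.compile(
    r"\b(current smoker|former smoker|non-?smoker|never smoked)\b",
    re.IGNORECASE,
)


def extract_elements(note: str) -> dict:
    """Return the structured elements found in one free-text note (None when absent)."""
    lvef = LVEF_PATTERN.search(note)
    smoking = SMOKING_PATTERN.search(note)
    return {
        "lvef_percent": int(lvef.group(1)) if lvef else None,
        "smoking_status": smoking.group(1).lower() if smoking else None,
    }


# Made-up note text, not taken from any real record.
note = "Echocardiogram: LVEF estimated at 35%. Patient is a former smoker, lives alone."
result = extract_elements(note)
print(result)
```

Once elements like these sit in named fields, they can be fed to a predictive model alongside the structured data.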
“IBM Content and Predictive Analytics for Healthcare uses the same type of natural language processing as IBM Watson, enabling us to leverage information in new ways not possible before. We can access an integrated view of relevant clinical and operational information to drive more informed decision making and optimize patient and operational outcomes.”
IBM Content and Predictive Analytics for Healthcare: The Seton CHF Readmission Solution
Raw Information
• Unstructured Data (Cerner Clinical Documentation: History and Physical, Discharge Summary, Echocardiogram)
• Structured Data (Avega Cost Data, DSS Admission History, DSS Procedure History, Cerner Clinical Events)
IBM Content and Predictive Analytics
• Content Analytics: Natural Language Processing; Medical Fact and Relationship Extraction (Annotation); Trend, Pattern, Anomaly, Deviation Analysis
• Predictive Analytics: Predictive Scoring and Probability Analysis
Analyzed and Visualized Information (Dynamic Multimode Interaction)
• Search and Visually Explore (Mine)
• Monitor, Dashboard and Report (Cognos BI)
• Question and Answer*
• Custom Solutions
Health Integration Framework
• Data Warehouse and Model
• Master Data Management
• Advanced Case Management
• Business Analytics
• Partners (HLI)
• Specialized Research
IBM Watson for Healthcare
* Confirm hypotheses or seek alternative ideas with confidence based responses from learned knowledge
• Utilizing natural language processing to extract key elements from unstructured History and Physical and Discharge Summary
• Leveraging predictive models that have demonstrated high positive predictive value against extracted elements of structured and unstructured data
• Providing an interface through which providers can intuitively navigate, interpret and take action
The Data We Thought Would Be Useful … Wasn’t
• 113 candidate predictors from structured and unstructured data sources
• Structured data was less reliable than unstructured data, which increased the reliance on unstructured data
New Unexpected Indicators Emerged … Highly Predictive Model
• 18 accurate indicators or predictors (see next slide)
Predictor Analysis
Predictor                  % Encounters, Structured Data   % Encounters, Unstructured Data
Ejection Fraction (LVEF)   2%                              74%
Smoking Indicator          35% (65% Accurate)              81% (95% Accurate)
Living Arrangements        <1%                             73% (100% Accurate)
Drug and Alcohol Abuse     16%                             81%
Assisted Living            0%                              13%
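One way to read the "% Encounters" columns is as coverage: the share of patient encounters in which a predictor has any value in a given source. Under that reading (an interpretation, not something the slide states), the figures could be computed as in the sketch below. The field names and the four sample encounters are made up.

```python
def coverage_percent(encounters: list[dict], field: str) -> float:
    """Percent of encounters in which `field` has a value (assumes a non-empty list)."""
    populated = sum(1 for encounter in encounters if encounter.get(field) is not None)
    return 100 * populated / len(encounters)


# Made-up encounters: the value found in the structured record and the value
# extracted from free text, with None when the source has nothing.
encounters = [
    {"lvef_structured": None, "lvef_text": 35},
    {"lvef_structured": None, "lvef_text": 50},
    {"lvef_structured": 40, "lvef_text": 40},
    {"lvef_structured": None, "lvef_text": None},
]

structured_coverage = coverage_percent(encounters, "lvef_structured")
text_coverage = coverage_percent(encounters, "lvef_text")
print(structured_coverage, text_coverage)
```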
What Really Causes Readmissions at Seton: Key Findings
97% at 80th percentile
49% at 20th percentile
The Cognos dashboard reporting system helps monitor the key clinical, operational and financial metrics. More importantly, it makes it possible to track down the top-priority cases for case management.
Visualizing the Results: Readmissions Dashboard
1. Clinical Statistics: admission count, readmission count and readmission rate
2. Operational Statistic: Counts of different length of stay periods
3. Financial Statistic: Total direct cost by total admission and by readmission
4. Mortality: mortality rate
5. Average length of stay
6. Average direct cost by total admission and by readmission only
7. PA Model Score: Distribution of propensity of readmission
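Several of the dashboard figures listed above are simple aggregates over admission records. The sketch below shows how they might be computed. The `Admission` record, its field names, the sample data, and the definition of readmission rate as readmissions divided by total admissions are all assumptions; the slide does not define them.

```python
from dataclasses import dataclass


@dataclass
class Admission:
    length_of_stay_days: int
    direct_cost: float
    is_readmission: bool
    died: bool


def dashboard_metrics(admissions: list[Admission]) -> dict:
    """Aggregate dashboard statistics; assumes at least one admission."""
    total = len(admissions)
    readmissions = [a for a in admissions if a.is_readmission]
    return {
        "admission_count": total,
        "readmission_count": len(readmissions),
        "readmission_rate": len(readmissions) / total,
        "mortality_rate": sum(a.died for a in admissions) / total,
        "average_length_of_stay": sum(a.length_of_stay_days for a in admissions) / total,
        "average_direct_cost": sum(a.direct_cost for a in admissions) / total,
        "average_direct_cost_readmission": (
            sum(a.direct_cost for a in readmissions) / len(readmissions)
            if readmissions
            else None
        ),
    }


# Made-up sample data.
sample = [
    Admission(4, 8000.0, False, False),
    Admission(6, 12000.0, True, False),
    Admission(2, 4000.0, False, False),
    Admission(8, 16000.0, True, True),
]
metrics = dashboard_metrics(sample)
print(metrics)
```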
USC Annenberg School of Communications
InfoSphere Streams
Big Data Platform Vision
Bringing Big Data to the Enterprise
• Big Data Solutions: Client and Partner Solutions
• Big Data User Environments: Developers, End Users, Administrators
• Big Data Enterprise Engines: Internet Scale Analytics, Streaming Analytics
• Open Source Foundational Components: Hadoop, MapReduce, HDFS, HBase, Pig, Lucene, Jaql
• Agents
• Integration
  • Marketing: Unica
  • Warehouse Appliances: Netezza
  • Data Warehouse: InfoSphere Warehouse
  • Database: DB2
  • Analytics: SPSS
  • Business Intelligence: Cognos
  • Master Data Mgmt: InfoSphere MDM