Factors
Scores / Classes
User Inputs
Prediction or Selection
Scoring Rules
Structured Data
EXAMPLES
Predictive Modeling Applications
• Credit Risk Analysis
• Financial Networks
• Crime mapping
“The core innovation that Zillow offers are its advanced statistical predictive products, including the Zestimate®, the Rent Zestimate and the ZHVI® family of real estate indexes. By using R in production as well as research, Zillow maximizes flexibility and minimizes the latency in rolling out updates and new products.”
• Statistical forecasting
Legend: Operational / Announced
Central US: Iowa
West US: California
North Europe: Ireland
East US: Virginia
East US 2: Virginia
US Gov: Virginia
North Central US: Illinois
US Gov: Iowa
South Central US: Texas
Brazil South: Sao Paulo
West Europe: Netherlands
China North*: Beijing
China South*: Shanghai
Japan East: Saitama
Japan West: Osaka
India West: TBD
India East: TBD
East Asia: Hong Kong
SE Asia: Singapore
Australia West: Melbourne
Australia East: Sydney
* Operated by 21Vianet
http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
REAL TIME
BIG DATA
PREDICTIVE ANALYTICS
Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0
"CLOCK" by Heiko Klingele flickr.com/photos/divdax/3458668053/ CC-BY 2.0
Structured Data
Log Files
Sensor Streams
Language Text
Extraction / Ingestion
Historical Data
“IO VAPOURA” by Jaya Prime flickr.com/photos/sanjayaprime/4924462993 CC-BY 2.0
Factors
• User ID
• Browser
• Time/Date / Location
• Previous purchases
• Friend data
• Any known information
Scoring Rules
• Decision Tree
• Logistic Regression
• Neural Network
• K-means clustering
• Ensemble Model
Scores / Classes (Prediction or Selection)
• Product of most interest
• Offer of most likely sale
• Most relevant link
• Forecast sale value
• Optimal Bid
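A minimal Python sketch of how a scoring rule might turn factors into a score, assuming a logistic regression. The factor names, the intercept and the weights are hypothetical, chosen only for illustration; they do not come from any model in this deck.

```python
import math

# Hypothetical coefficients, as if exported from a fitted logistic regression.
INTERCEPT = -2.0
WEIGHTS = {
    "previous_purchases": 0.4,   # count of earlier orders
    "is_mobile_browser": -0.3,   # 1 if the browser is mobile, else 0
    "friends_who_bought": 0.6,   # count of friends who bought the item
}

def score(factors):
    """Scoring rule: map one visitor's factors to a probability of sale."""
    z = INTERCEPT + sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def best_offer(factors_by_offer):
    """Selection: return the offer whose factors give the highest score."""
    return max(factors_by_offer, key=lambda offer: score(factors_by_offer[offer]))

visitor = {"previous_purchases": 3, "is_mobile_browser": 1, "friends_who_bought": 2}
print(round(score(visitor), 3))
```

Missing factors default to 0, so the rule still returns a score when only some information about the visitor is known.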
Feature Selection
Sampling
Aggregation
Variable Transformation
Model Estimation
Model Refinement
Model Comparison / Benchmarking
Known Factors
Known Outcomes
Predictive Model
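The model development steps above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the workflow of any particular product: the data is synthetic, the model is a one-factor logistic regression fitted by plain gradient descent, and the benchmark is a majority-class guess.

```python
import math
import random

def fit_scaler(values):
    """Variable transformation: learn the mean and standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return mean, sd

def fit_logistic(xs, ys, steps=500, rate=0.5):
    """Model estimation: one-factor logistic regression by gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= rate * grad_a / n
        b -= rate * grad_b / n
    return a, b

def accuracy(a, b, xs, ys):
    """Share of rows where the predicted class matches the known outcome."""
    hits = sum((a + b * x > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Synthetic known factors and known outcomes (assumption: one numeric factor).
random.seed(0)
raw = [random.uniform(0, 100) for _ in range(200)]
outcomes = [1 if r + random.gauss(0, 10) > 50 else 0 for r in raw]

# Sampling: first 150 rows for estimation, last 50 held out for comparison.
mean, sd = fit_scaler(raw[:150])
train_x = [(r - mean) / sd for r in raw[:150]]
test_x = [(r - mean) / sd for r in raw[150:]]
train_y, test_y = outcomes[:150], outcomes[150:]

a, b = fit_logistic(train_x, train_y)
model_acc = accuracy(a, b, test_x, test_y)

# Benchmarking: always guessing the majority class of the held-out rows.
baseline = max(sum(test_y), len(test_y) - sum(test_y)) / len(test_y)
print(f"model {model_acc:.2f} vs baseline {baseline:.2f}")
```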
HDFS: Name Node, Data Node (×5)
MapReduce: Job Tracker, Task Tracker (×5)
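The MapReduce pattern itself can be shown in a few lines of single-process Python. This is a sketch of the programming model only, not of Hadoop: the log lines are made up, and the two inner lists stand in for blocks stored on separate data nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (status, 1) pair for one log line."""
    status = line.split()[-1]
    yield status, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a count."""
    return key, sum(values)

def run_job(blocks):
    """Run map over every line of every block, then shuffle and reduce."""
    pairs = [pair for block in blocks for line in block for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical log lines, split into two blocks.
blocks = [
    ["GET /home 200", "GET /cart 500"],
    ["GET /home 200", "POST /buy 200"],
]
counts = run_job(blocks)
print(counts)
```

On a cluster, the map calls would run in parallel next to the data, and only the grouped pairs would move across the network.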
Factors
Score
Structured Data
Factors
Scores
Actual Outcomes
Structured Data
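Once actual outcomes are recorded next to the scores, the two can be compared to check how well the model performs. A minimal Python sketch, assuming binary outcomes and a 0.5 cut-off; the scores and outcomes below are invented for illustration.

```python
def confusion(scores, actual, threshold=0.5):
    """Count true/false positives and negatives at a score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for s, y in zip(scores, actual):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def auc(scores, actual):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, actual) if y == 1]
    neg = [s for s, y in zip(scores, actual) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores from a model and the outcomes later observed.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
actual = [1, 1, 1, 0, 0, 0]
print(confusion(scores, actual))
print(round(auc(scores, actual), 3))
```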
Phase                “Big Data”                  “Real Time”
Unstructured Data    Petabytes (or Exabytes!)    Minutes to Hours
Advanced Analytics   Gigabytes to Terabytes      Minutes
Deployment           Megabytes/second            Milliseconds
Consumption          Kilobytes                   Seconds
powerbi.microsoft.com/en-us/industries/airline
Data
• SQL Server 2016: Big-data R analytics integrated with SQL Server database
• HDInsight: Cloud-based Hadoop clusters
Develop
• Microsoft R Server: Big-data R with distributed and in-database computing
• Visual Studio: R Tools for Visual Studio, an integrated development environment for R
Deploy
• Azure ML Studio: ML, Python and R in cloud-based Experiment workflows
• Cortana Analytics Suite: Cloud-based R APIs and Virtual Machines
Consume
• Power BI: Computations and charts from R scripts in dashboards
• Excel: With Azure ML Web Services plug-in
cloud computing: 5x increase, 2011 to 2016
data science: universities filling 300,000 US talent gap
big data: 90% of the data in the world today has been created in the last two years alone
open source: including R, Linux, Hadoop
Getting Started with R tutorials:
• http://mran.microsoft.com/documents/getting-started/
Import/export data from SQL tables
• RODBC package: http://mran.microsoft.com/packages/info/?RODBC
Machine Learning Task View
• http://mran.microsoft.com/taskview/info/?MachineLearning
Applied Predictive Modeling (Kuhn & Johnson, 2013)
• http://appliedpredictivemodeling.com/ & R “caret” package
http://blog.revolutionanalytics.com/2015/06/r-build-keynote.html/
Building a genetic disease risk application with R
Data
• Public genome data from 1000 Genomes
• About 2TB of raw data
Analytics Development
• Microsoft R Server
• VariantTools variant caller in R
Factors & Scores
• DNA Sample / genetic variations
• Risk association
Deployment and Consumption
• Expose as API
• Web page, phone app, etc
Data Platform
• HDInsight Hadoop 1800 Nodes
• Raw genome sequence data in HDFS
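The "Expose as API" step under Deployment and Consumption can be pictured as a function that accepts a JSON request and returns a JSON score. A toy Python sketch: the variant identifiers, the risk multipliers and the rule of multiplying them together are all hypothetical placeholders, not the method used in the application described here.

```python
import json

# Hypothetical lookup table: variant identifier -> relative risk multiplier.
RISK_BY_VARIANT = {"variant_a": 1.5, "variant_b": 0.8}

def handle_request(body):
    """Score one JSON request of the form {"variants": [...]}."""
    request = json.loads(body)
    risk = 1.0
    matched = []
    for variant in request.get("variants", []):
        if variant in RISK_BY_VARIANT:
            # Toy rule: multiply the multipliers of every known variant.
            risk *= RISK_BY_VARIANT[variant]
            matched.append(variant)
    return json.dumps({"relative_risk": round(risk, 3), "matched": matched})

response = handle_request('{"variants": ["variant_a", "variant_x"]}')
print(response)
```

A web page or phone app would send the request body over HTTP and render the returned fields; unknown variants are ignored rather than rejected.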
The Ultimate Business Analytics Training
Business analytics training doesn’t end today. Join us at the upcoming PASS Business Analytics Conference to gain more Power BI and Excel skills through practical, hands-on training that you can put to use immediately.
Like What You Heard?
Join David Smith again at the PASS BA Conference in the session:
“Power BI Desktop Deep Dive including R Integration”
May 2 – 4, 2016
San Jose, CA
REGISTER TODAY: passbaconference.com
Use discount code BACDATA for $150 savings*
Please Note: Discount Codes cannot be applied retroactively.