Date posted: 06-Aug-2015
Category: Technology
Uploaded by: rajesh-nambiar
© Copyright 2015 EMC Corporation. All rights reserved.
INTERNET OF THINGS (IOT) TRENDS & IMPLICATIONS FOR ENTERPRISE & DATA SCIENCE
Rashmi Raghu, Principal Data Scientist, Pivotal
Acknowledgments: Regunathan Radhakrishnan, Kaushik Das
What can the Internet of Things do in the real world?
CAN WE PREVENT ACCIDENTS LIKE THE MACONDO DISASTER?
HOW DO WE KNOW WHEN SOMEONE BECOMES LIKELY TO DEFAULT OR CHURN?
HOW DO WE KNOW A TREE HAS FALLEN ON A POWER LINE EVEN BEFORE THE RESIDENTS COMPLAIN?
“MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS - AND THE ECONOMY.”
JEFF IMMELT, CEO, GE
THE POWER OF 1%
Driving Outcomes That Matter: One Percent Improvement Equals
• Rail (Increasing Freight Utilization): $27B Industry Value by Reducing System Inefficiency
• Healthcare (Predictive Maintenance): $63B Industry Value by Reducing Process Inefficiency
• Power (Predictive Diagnostics): $66B Industry Value with Efficiency Improvements In Gas-fired Power Plant Fleets
Source: General Electric
Creating A Smart System
How does a human react to an external event?
THE HUMAN BRAIN BRINGS IT ALL TOGETHER
The Brain:
1. Takes in the input from the eyes
2. Analyzes it to compute the trajectory of the ball
3. Tells the body what action to take to hit the ball
HOW CAN AN ORGANIZATION BE SMART?
Human Being → Organization
• Sense Organs (eyes, ears …) → Sensors
• Limbs … → Actuators
• Nervous System → Network
• Brain → Digital Brain
Sensors
Actuators
BUT WHERE IS THE BRAIN?
LET’S BUILD A DIGITAL BRAIN
The brain brings it all together, over a network of machines:
1. Takes in the input from a large number of sensors
2. Builds a model and uses it to analyze incoming data
3. Tells the actuators what action to take
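The three steps above can be sketched as a tiny sense, analyze, act loop. This is only an illustration: the readings, the mean-and-standard-deviation "model", the threshold `k` and the `shut_valve` command are all hypothetical and do not come from the slides.

```python
from statistics import mean, stdev

def build_model(history):
    """Summarize normal behaviour as a (mean, standard deviation) pair."""
    return mean(history), stdev(history)

def analyze(model, reading, k=3.0):
    """Return True when a reading lies more than k standard deviations from the mean."""
    mu, sigma = model
    return abs(reading - mu) > k * sigma

def act(is_anomaly):
    """Stand-in for an actuator command (hypothetical command names)."""
    return "shut_valve" if is_anomaly else "no_action"

# 1. Input from sensors (hypothetical temperature readings)
history = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2]
# 2. Build a model and use it to analyze incoming data
model = build_model(history)
# 3. Tell the actuator what to do for each new reading
actions = [act(analyze(model, reading)) for reading in [70.1, 75.0]]
print(actions)
```

A real deployment would replace the simple statistical model with whatever the modeling step produces; the loop structure is the point here.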
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and automate at scale
Perform advanced analytics Discover insights
Modernize data infrastructure
SMART SYSTEMS = SENSORS + DIGITAL BRAIN + ACTUATORS
Problem Formulation
Modeling Step
Data Step Application Step
Data Science for Building Models
Sensors & Actuators
Data Lake
World’s First Open Sourced,
Enterprise-Class Data Portfolio
+ Open Data Platform
PIVOTAL BIG DATA SUITE
OPEN AGILE CLOUD-READY
Modern Data Infrastructure
+ Advanced Analytics
+ Apps at Scale
Multiple Cloud Deployment Models
+ Big Data Suite on Pivotal Cloud Foundry
THE INTERNET OF THINGS JOURNEY
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advanced Data Science
• Real-time Analytics
DEVELOP
• Advanced Analytic Pipelines
• Real-time Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, IoT, and Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Hybrid Cloud
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA PREDICTIVE ANALYTICS
ENTERPRISE PAAS
Data Science for IoT
DATA SCIENCE?
App Development
Analytics
Business Intelligence Reporting
Visualization
Dashboards
Insights Big Data
Machine Learning Statistics
Mathematics Time Series
Algorithms
Databases
Software
Modeling
Queries
Real-Time
Sensors
Predictive Models
ETL
Research
Hadoop
Distributed Computing
MapReduce
SQL
In-Memory
OLAP
Text Mining
Unstructured Data
Open Source
Decision Science
Ad Hoc Queries
Hacking
In-Database Analytics
Internet of Things
Data Cleansing
Sentiment
WHAT IS DATA SCIENCE?
The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.
DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
THE EIGHTFOLD PATH OF DATA SCIENCE: FOUR PHASES AND FOUR DIFFERENTIATING FACTORS
Phase 1: Problem Formulation. Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders
Phase 2: Data Step. Build the right feature set making full use of the volume, variety and velocity of all available data
Phase 3: Modeling Step. This is where you move from answering what, where and when to answering why and what if?
Phase 4: Application. Create a framework for integrating the model with decision making processes and taking action using the Internet of Things
Technology Selection. Select the right platform and the right set of tools for solving the problem at hand
Iterative Approach. Perform each phase in an agile manner, team up with domain experts and SMEs, and iterate as required
Creativity. Take the opportunity to innovate at every phase
Building a Narrative. Create a fact-based narrative that clearly communicates insights to stakeholders
http://blog.pivotal.io/data-science-pivotal/p-o-v/the-eightfold-path-of-data-science
• Gene Sequencing: COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014
• Smart Grids: READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE
• Social Media: FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY
• Oil Exploration: OIL RIGS GENERATE 25000 DATA POINTS PER SECOND
• Stock Market
• Video Surveillance
• Medical Imaging
• Mobile Sensors
BILLIONS OF DATA POINTS
PLATFORM
DATA SCIENCE TOOLKIT
TOOLS LANGUAGES
SQL
ANALYTICS WITH PIVOTAL: A SINGLE ADDRESS FOR EVERYTHING ANALYTICS
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
IoT Trends & Implications
IOT UBIQUITY ACROSS VERTICALS
Healthcare & Lifestyle
Agriculture & Farming
Energy
Retail
Manufacturing & Heavy Industry
Transportation
Financial Services
Communications
Industrial Internet Smart Cities
Smart Homes Connected Cars
Connected Wearables …
COMMON USE CASES ACROSS VERTICALS
• Predictive Maintenance
• Predict Quality of Product / System
• Security Analytics
• Demand Modeling
• Anomaly Detection
• Recommendation Systems
COMMON USE CASES ACROSS VERTICALS
Predictive Maintenance
• Energy: Predict drilling equipment function and failure. Optimize drilling efficiency and maintenance schedules
• Manufacturing: Predict failure of disk drives in servers to help in planning replacement schedule
Security Analytics
• Healthcare: Network intrusion detection of covert threats and malware in medical devices
• Energy: Detection of malware and threats in electricity distribution grid
Predict Quality of Product / System
• Manufacturing: Predict vaccine potency based on manufacturing sensor data
• Transportation: Predict duration of road traffic incidents. Helps improve quality of commute
COMMON USE CASES ACROSS VERTICALS
Recommendation Systems
• Retail: Recommend products / services at point of sale or while browsing
• Manufacturing: Recommend parts and service methods for post-sale product maintenance
Demand Modeling
• Retail: Enhanced and granular modeling of consumer demand using large historical data repositories and more accurate inventory / point-of-sale records
• Energy: Smart meter data enables improved electricity demand modeling
Anomaly Detection
• Energy: Smart meter data enables detection of anomalous power usage patterns related to theft, vegetation management issues, meter malfunction
• Manufacturing: Identifying die failures or anomalies from wafer bin map images
CHALLENGES IN IOT USE CASES
Data Location
Data Integration
Data Cleansing
Labels, Labels, Labels
Normal vs. Anomaly
False Alarms vs. No Alarms
Value Creation
CHALLENGES IN IOT USE CASES
Data Location
– Data from ‘things’ and business processes on different platform(s)
– Locating data centrally in Data Lake alleviates access issues
Data Cleansing
– Missing values, dirty data and the presence of sensor/device malfunctions require analytical techniques to resolve
Data Integration
– Integration of IoT data with business data and external data can be non-trivial
– Tools from statistics and machine learning can play an important role here too, e.g. algorithms to recommend natural join mechanisms for different data sources
CHALLENGES IN IOT USE CASES
Normal vs. Anomaly
– New types of data collected from IoT devices might be poorly understood
– Unsupervised learning techniques can help uncover normal and anomalous patterns
Labels, Labels, Labels
– Labeled data for training models is hard to find
– Problems can be approached in two stages: unsupervised techniques to learn ‘labels’ from the data, followed by supervised learning
False Alarms vs. No Alarms
– A user's tolerance for false alarms, and for no alarm when there should be one, needs to be taken into account in predictive models
– Ensemble models can improve the predictive power of models and reduce these issues
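One way to take tolerance for false alarms versus missed alarms into account is to choose the alert threshold that minimizes a user-supplied cost. The sketch below assumes a model that outputs a score per event; the scores, labels and cost weights are made-up illustration values, and `best_threshold` is a hypothetical helper rather than anything named in the slides.

```python
def confusion_counts(scores, labels, threshold):
    """Count false alarms and missed alarms when alerting on score >= threshold."""
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    missed = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_alarms, missed

def best_threshold(scores, labels, cost_false_alarm, cost_missed):
    """Pick the candidate threshold with the lowest total cost."""
    # Candidates: every observed score, plus one value above all of them (never alert).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    def total_cost(threshold):
        false_alarms, missed = confusion_counts(scores, labels, threshold)
        return false_alarms * cost_false_alarm + missed * cost_missed

    return min(candidates, key=total_cost)

scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical model scores
labels = [0, 0, 1, 0, 1, 1]               # 1 = a real event occurred

# A user who hates missed alarms accepts a lower threshold than one who hates false alarms.
cautious = best_threshold(scores, labels, cost_false_alarm=1, cost_missed=10)
lenient = best_threshold(scores, labels, cost_false_alarm=10, cost_missed=1)
print(cautious, lenient)
```

The same cost weights can be reused when comparing ensemble models against single models, so that "better" means better for the user's actual tolerance.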
IoT Use Cases
Data: The New Oil
Producing Value for the Oil & Gas Industry
• Oil and gas exploration and production activities generate large amounts of data from sensors, logistics, business operations and more
• Rise of cost-effective data collection, storage and computing devices is giving an established industry a new boost
• The promise of Data as “the new oil” is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making, which provides the competitive advantage
DATA: THE NEW OIL
http://commons.wikimedia.org/wiki/File:Rig_wind_river.jpg
PREDICTING EQUIPMENT FUNCTION AND FAILURE
• Predictive Maintenance: take steps towards zero unplanned downtime
• Business Problem: Predict drilling equipment function and failure
• Motivation: Drilling wells is expensive, and so is equipment failure during the process
– Drilling motor damage could account for 35% of rig non-productive time (NPT) and can cost $150,000 per incident [1]
– Given the presence of over 800,000 oil & gas wells in the US (as of 2009 [2]), the total cost of such incidents could amount to billions of dollars
• Goals:
– Provide early warning system
– Provide insights into prominent features impacting operation and failure
– Reduction of non-productive drill time (and costs)
– Reduction of failure incidents
[1] The American Oil & Gas Reporter, April 2014 Cover Story
[2] data.gov
PREDICTIVE ANALYTICS FOR DRILLING OPERATIONS
• Predicting drill rate-of-penetration (ROP) / drilling equipment failure
• Primary data sources:
– Drill Rig Sensor Data: Depth, Rate of Penetration (ROP), RPM, Torque, Weight on Bit, etc. (> billions of records)
– Operator Data: Drill Bit details, Failure details, Component details, etc. (> thousands of records)
Data Integration → Feature Building → Modeling
• Platform for all phases of the analytics cycle
• Support development of complex and extensible predictive models to predict equipment function and failure
• Provide framework for integrating data from multiple sources across data warehouses and rig operators
• Ability to analyze both structured and unstructured data in a unified manner. For instance:
– Support fast computation of hundreds of features over time windows within 100s of millions (or billions / trillions) of records of time-series data
– Natural language processing pipeline for analysis of operator comments to identify failures from unstructured text
TECHNOLOGY SELECTION
PL/Python PL/R
Drill Rig Sensor
data
• Need a comprehensive framework for data integration at scale
• Data cleansing
– Removing NULLs and outliers
– Missing value imputation
– Manually entered data (operator data) is prone to errors
– Invalid values for sensor measurements
• Standardizing columns – data sources do not use consistent entries in features / columns that link them e.g. well names across different data sources
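The cleansing and standardization steps above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the valid range for RPM, the choice of median imputation, the `-9999.0` sentinel, and the rule for normalizing well names are not taken from the slides.

```python
from statistics import median

def standardize_well_name(name):
    """Normalize case and drop spaces/punctuation so names match across sources."""
    return "".join(ch for ch in name.upper() if ch.isalnum())

def clean_readings(values, lower, upper):
    """Treat None and out-of-range values as missing, then impute with the median."""
    def is_valid(v):
        return v is not None and lower <= v <= upper

    fill = median(v for v in values if is_valid(v))
    return [v if is_valid(v) else fill for v in values]

# Hypothetical RPM readings with a NULL and an invalid sentinel value
rpm = [120.0, None, 118.0, -9999.0, 122.0]
cleaned = clean_readings(rpm, lower=0.0, upper=500.0)
print(cleaned)

# The same well written two different ways in two data sources
print(standardize_well_name("Well-12 a"), standardize_well_name("well 12A"))
```

Once the linking column is standardized, sensor data and operator data can be joined on it; at the data volumes described here that join would run in the database rather than in a single Python process.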
COMPREHENSIVE DATA INTEGRATION FRAMEWORK
Integrated
Operator data
COMPLEX FEATURE SET ACROSS MULTIPLE DATA SOURCES
Drill Rig Sensor data
• Depth • Rate of Penetration • Torque • Weight on Bit • RPM • …
Operator data
• Drill Bit details • Component details etc. • Failure events • …
Features on Time Windows
• Mean • Median • Standard Deviation • Range • Skewness • …
Final Set of Features on Time Windows
Leverage GPDB / HAWQ (+ MADlib, PL/R and PL/Python as needed) for fast computation of hundreds of features over time windows within millions or billions of rows of time-series data
• Pivotal GPDB has built-in support for dealing with time series data
• SQL window functions: e.g. lead, lag, custom windows
• More details in Pivotal’s Time Series Analysis blogs:
http://blog.pivotal.io/tag/time-series-analysis
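To make the idea of features on time windows concrete, here is a small in-memory sketch. The slides describe doing this with SQL window functions in GPDB / HAWQ; the Python below only mirrors the logic on a toy series. The function name, the window size and the torque values are hypothetical, and the window size is assumed to be at least 2.

```python
from statistics import mean, median, pstdev

def window_features(series, size):
    """Compute summary features over each trailing window of `size` readings."""
    rows = []
    for end in range(size, len(series) + 1):
        window = series[end - size:end]
        rows.append({
            "mean": mean(window),
            "median": median(window),
            "std": pstdev(window),               # population standard deviation
            "range": max(window) - min(window),
            "lag1_delta": window[-1] - window[-2],  # change versus the previous reading
        })
    return rows

torque = [10.0, 12.0, 11.0, 15.0, 14.0]   # hypothetical torque readings
features = window_features(torque, size=3)
for row in features:
    print(row)
```

Each dictionary corresponds to one row of the final feature table; the `lag1_delta` entry plays the role that a `lag` window function would play in SQL.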
DRILLING OPERATIONS: EXAMPLES OF PREDICTIVE MODELS
• Predict Rate-of-Penetration
– Linear Regression
– Elastic Net Regularized Regression (Gaussian)
– Support Vector Machines
• Predict occurrence of equipment failure in a chosen future time window
– Logistic Regression
– Elastic Net Regularized Regression (Binomial)
– Support Vector Machines
• Predict remaining life of equipment
– Cox Proportional Hazards Regression
BIG DATA MACHINE LEARNING IN SQL
http://madlib.net/
Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary
Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• PMML Export
ONE STEP CLOSER TO ZERO UNPLANNED DOWNTIME …
• Every incident identified prior to failure potentially saves hundreds of thousands of dollars [1]
• Ability to fully utilize big data: volume, variety and velocity
• Comprehensive data integration framework for multiple complex data sources
• Learn and implement best practices for:
– Data capture techniques, flow, and curation
– Platform and toolset for data fabric
• Build and operationalize complex and extensible predictive models
• Improve efficiency, reduce costs and risks
[1] The American Oil & Gas Reporter, April 2014 Cover Story
Automatic Clustering of IT Infrastructure Alerts
MOTIVATION
• Enterprise network is complex
– Different technology components with lots of dependencies
• Role of Reliability Engineering
– Ensure 24x7 uptime, network monitoring and quick resolution
• Alerts are high volume; eyes-on-glass operation
– Event logs, performance metric logs, incident tickets
• What intelligence can we mine from data to improve operational efficiency for Reliability Engineering?
DATA SOURCES
Business Service (Examples: Debit Processing, Fraud Detection)
IT infrastructure: Network, Storage, Servers, Middleware
Alert Management Tool
Rules based Engine
Operations Support personnel
Incident tickets
Incident Resolution
Manually created incidents
• Alerts (semi-structured)
• Incidents (unstructured)
• Incident Work Info
OPPORTUNITIES FOR DATA SCIENCE
• What are the typical failure patterns?
– Top 20 incident types that operations support is busy with
• Given an incident and historical resolution data, can we recommend a set of resolutions?
– Incident A is similar to incident B in the past => maybe the resolution for B applies to A
• Can we predict failures before they occur?
– Collect component-wise logs to predict failures
CHALLENGES
Data
– Large volume: 10 million alerts and incidents in 6 months from just one business service. There are numerous services: debit processing, fraud detection, etc.
– Multi-structured: Semi-structured and unstructured text
– No labeled data
Analytics
– Alerts / incidents have short text
– Clustering techniques at scale for alerts / incidents
– Cluster interpretability for qualitative evaluation
FRAMEWORK TO CLUSTER ALERTS: WHAT ARE THE TYPICAL FAILURE PATTERNS?
1. Token Normalization: Alert text includes specific dates, times and IP addresses; such details are not important for clustering
2. Stopword Removal: Words that are common across alerts are not important for clustering
3. Compute Distance Metric: Compute string distance metrics (such as Jaccard distance) to compare alerts
4. Clustering Method: Build a graph with alerts as nodes and “high” similarity between the nodes as edges. Detect Connected Components from the graph to identify clusters
5. Visualize Results
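The clustering framework can be sketched end to end on a handful of alerts. This is a single-machine illustration under stated assumptions: the stopword list, the regular expressions, the 0.5 distance cutoff and the sample alert strings are all hypothetical, and the pairwise comparison is quadratic, so it would not scale to millions of alerts without the distributed platform the slides describe.

```python
import re

STOPWORDS = {"is", "on", "the", "at", "please"}   # hypothetical stopword list

def normalize(alert):
    """Strip IP addresses and clock times, keep alphabetic tokens, drop stopwords."""
    text = alert.lower()
    text = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", " ", text)     # IP addresses
    text = re.sub(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", " ", text)    # times such as 10:42:07
    tokens = re.findall(r"[a-z]+", text)                         # also drops remaining digits
    return frozenset(t for t in tokens if t not in STOPWORDS)

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two token sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def cluster(alerts, max_distance=0.5):
    """Link alerts within max_distance and return the connected components."""
    token_sets = [normalize(a) for a in alerts]
    parent = list(range(len(alerts)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(alerts)):
        for j in range(i + 1, len(alerts)):
            if jaccard_distance(token_sets[i], token_sets[j]) <= max_distance:
                parent[find(i)] = find(j)   # add an edge between similar alerts

    groups = {}
    for i in range(len(alerts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

alerts = [
    "Process X is not running on 10.0.0.5 at 10:42:07",
    "Process X is not running on 10.0.0.9 at 23:15:00",
    "Hard disk free space issue on 10.0.0.5",
    "Hard disk free space issue on 10.0.1.17 at 03:00",
]
clusters = cluster(alerts)
print(clusters)
```

Union-find is used here simply as a compact way to compute connected components; any graph library's connected-components routine would serve the same purpose.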
CLUSTER VISUALIZATION
[Figure: alert clusters annotated with alert and incident (INC) counts. Labels: Alerts: 15,722, INCs: 462; Alerts: 13,899, INCs: 264; Alerts: 5,630, INCs: 93; Alerts: 2,778, INCs: 65; Alerts: 1,795; Alerts: 1,772; Alerts: 1,556; Alerts: 1,537; Alerts: 1,420]
MEAN-TIME-TO-RESOLVE FOR CLUSTERS
[Figure: bar charts of Alert Counts, MTTR* and Total MTTR* per alert cluster, in two panels.
Windows Open Systems clusters: Health Service heartbeat failure; Process X is not running, please restart; Web session emulator process hung issue; Hyperion Foundation Services is not running; Website / URL unavailable; Symantec critical system protection service is not running; Server booted, ensure critical apps are running; Web Service / Probe URL unavailable; CPU Utilization Issue; Playspan Issue; Agent is not running; Hard disk free space issue.
Unix Open Systems clusters: Space utilization issue; AIX Hardware error; Process X is not running, please restart; Process X may have missed an execution interval; CPU Utilization Issue; Server booted, ensure critical apps are running; Splunk Agent unavailable; Connection failed issue; LDAP connectivity issue; Net Backup: History file process failure.]
CLUSTER VISUALIZATION – AN ALTERNATIVE TO WORD CLOUDS
• Perform clustering on a 3-6 month window of data
• Assign incoming alerts to one of the existing clusters according to a distance criterion
– May find emerging patterns
• Monitor cluster statistics (Mean Time To Resolve, number of incidents etc.) on a dashboard
• Perform clustering again every few months on new data
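Assigning an incoming alert to an existing cluster can be sketched as a nearest-representative lookup with a distance cutoff. The cluster names, representative token sets and the 0.5 cutoff below are hypothetical; the slides do not say how cluster representatives are chosen.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two token sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def assign(alert_tokens, representatives, max_distance=0.5):
    """Return the id of the nearest cluster, or None when no cluster is close enough."""
    best_id, best_distance = None, max_distance
    for cluster_id, representative in representatives.items():
        distance = jaccard_distance(alert_tokens, representative)
        if distance <= best_distance:
            best_id, best_distance = cluster_id, distance
    return best_id

# Hypothetical representatives taken from a previous clustering run
representatives = {
    "process_down": {"process", "not", "running"},
    "disk_space": {"hard", "disk", "free", "space", "issue"},
}

known = assign({"process", "x", "not", "running"}, representatives)
novel = assign({"ldap", "connectivity", "issue"}, representatives)
print(known, novel)
```

Alerts that come back as `None` are candidates for emerging patterns, and a growing pile of them is one signal that it is time to re-run clustering on new data.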
OPERATIONALIZATION
TECHNOLOGY Platform
SQL PL/Python PL/R
Data Science
Visualization
• Understanding where your alerts/incidents are coming from is an important step in improving infrastructure support
– Profiling classes of alerts for business intelligence
– Resolution recommendation
– Application or hardware failure prediction
– Reduction in time to resolve alerts and incidents (reduce costs)
• Distributed computing architecture + proper algorithm choice are needed to deal with scalability
• Tuning clustering results requires good cluster visualization techniques
TAKEAWAYS AND LESSONS LEARNED
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Oil & Gas Use Case Webinar:
– Video: https://www.youtube.com/watch?v=dhT-tjHCr9E
– Slides: http://www.slideshare.net/Pivotal/data-as-thenewoil
• IT Operations Use Case Webinar:
– Video: https://www.youtube.com/watch?v=2goBoBp1klg
– KDD 2014 paper: “Unveiling Clusters of Events for Alert and Incident Management in Large-Scale Enterprise IT”, http://dl.acm.org/citation.cfm?id=2623360
FOR FURTHER INFO, CHECK OUT…
APPENDIX
Smart Meter Analytics
The Digital Brain: Making a Smart Grid Smarter!
Action: Where (and when) to send trucks, preventive maintenance
The Digital Brain: Uses a Fourier transform to extract patterns and flag outliers/anomalies
Input: Data from smart meters
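One possible reading of the Fourier step is to find each meter's dominant usage cycle and flag meters whose cycle is unusual. The sketch below uses a plain discrete Fourier transform on synthetic hourly loads; the 24-hour expectation, the signals and the function names are illustrative assumptions, not details from the slides.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (quadratic time, fine for short series)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def dominant_period(signal):
    """Return the period, in samples, of the strongest non-constant frequency."""
    n = len(signal)
    spectrum = dft(signal)
    strongest = max(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]))
    return n / strongest

hours = range(96)   # four days of hourly readings (synthetic)
daily_meter = [5.0 + 2.0 * math.sin(2 * math.pi * h / 24) for h in hours]
odd_meter = [5.0 + 0.5 * math.sin(2 * math.pi * h / 8) for h in hours]

periods = {"daily_meter": dominant_period(daily_meter),
           "odd_meter": dominant_period(odd_meter)}
flagged = [name for name, period in periods.items() if period != 24.0]
print(periods, flagged)
```

On real data an exact equality test against 24 hours would be replaced by a tolerance, and a fast Fourier transform would replace the naive loop.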
SMART METER ANALYTICS – SIGNIFICANT USE CASES
• Load profiling
• Theft prevention
• Demand prediction
• Load forecasting
• Root cause of power failures
• Black-out warning
• Anomaly detection
• Network topology error detection
ELECTRICITY NETWORK LOAD PROFILING AND OUTLIER DETECTION
CUSTOMER
A major smart grid infrastructure provider
BUSINESS PROBLEM
Profile power consumption patterns based on smart meter data and flag anomalous usage
CHALLENGES
• Large volume of smart meter data (several months of data from 100s of thousands of meters) could not be analyzed effectively by legacy system
• Timely business insights on large scale smart grid infrastructure demand fast processing of data
SOLUTION
• Analyze smart meter power data using unsupervised clustering techniques and detect anomalies based on distance metric in clusters
• Reduce time required to monitor and improve grid efficiencies
• Leveraged the MPP architecture of Pivotal GPDB and MADlib in-database machine learning library for fast computation at scale
ELECTRICITY NETWORK LOAD PROFILING AND OUTLIER DETECTION
Dashboards for navigating clusters and outliers
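Clustering load profiles and flagging outliers by their distance to the nearest cluster centre can be sketched with a small k-means loop. The two-feature profiles, the starting centroids and the distance cutoff are made up for illustration; the production work described here ran in-database with MADlib, not in a single Python process.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def outliers(points, centroids, max_distance):
    """Points farther than max_distance from their nearest centroid."""
    return [p for p in points
            if math.dist(p, centroids[nearest(p, centroids)]) > max_distance]

# Each profile is (average day load, average night load) in kWh (synthetic values)
profiles = [(1.0, 0.5), (1.1, 0.4), (0.9, 0.6),
            (5.0, 4.8), (5.2, 5.1), (4.9, 5.0),
            (3.0, 9.0)]
centroids = kmeans(profiles, centroids=[(1.0, 0.5), (5.0, 4.8)])
flagged = outliers(profiles, centroids, max_distance=2.0)
print(flagged)
```

The flagged profiles are the ones a dashboard like the one described above would surface for review.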
Network Topology Error Detection
CUSTOMER
A major utility
BUSINESS PROBLEM
Use load and voltage meter readings to determine errors in transformer network topologies
CHALLENGES
• Time consuming process to detect network topology errors on entire network in legacy system
• Timely detection of network topology errors requires big data infrastructure and analytical capabilities
SOLUTION
• For each transformer network in parallel, solve an LP to determine scale of topology error, which can be used to flag and rank anomalous network topologies
• Reduce time for topology error detection from several days/weeks to a few minutes!
Security and Fraud
Advanced Persistent Threat (APT): APT Kill Chain
1. Phishing & Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens a zero-day payload (CVE-2011-0609)
2. Back Door: The user machine is accessed remotely by the Poison Ivy tool
3. Lateral Movement: Attacker elevates access to important user, service and admin accounts, and specific systems
4. Data Gathering: Data is acquired from target servers and staged for exfiltration
5. Exfiltrate: Data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider
ANOMALOUS USER-TO-RESOURCE ACCESS DETECTION
BUSINESS PROBLEM
Detect anomalous user behaviors in the global enterprise computer network
SUMMARY
Given local-to-local communication data, identify anomalous users within an enterprise.
• Reduce malware dwell time (typically 243 days)
• Signature-based approaches cannot detect such behavior
CHALLENGES
10 billion events in 6 months; 15K+ network devices; no existing SIEM solution can model a baseline of user resource-access behavior and enable anomaly detection in an adaptive and scalable architecture.
SOLUTION
An innovative Graph Mining based algorithmic framework with advanced Machine Learning. Network topology and temporal behaviors are both modeled. (Patent pending). Implemented in MPP and PL/R, enabling parallel model training and behavior risk scoring. Successfully identified DLP-violating anomalous users.
Financial Services
IDENTIFYING AND PRICING CROSS-SELL OPPORTUNITIES
CUSTOMER
A global financial services provider
BUSINESS PROBLEM
Identify cross-sell opportunities between two business arms of a financial institution.
CHALLENGES
• Integration of large-scale data originating from multiple data warehouses.
• Developing predictive models to identify novel cross-sell opportunities within the financial institution.
• Evaluating the identified cross-sell opportunities by their revenue potential.
SOLUTIONS
• Fast integration of data in Pivotal Greenplum Database.
• Predictive models and evaluation of profitability:
– Association rules.
– Logistic regression for each product offered.
– Estimation of revenue opportunity.
• On-demand reporting and visualization via custom dashboards connected to in-database models.
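The association-rule step can be illustrated with pairwise rules scored by support and confidence. The product names, customer holdings and thresholds below are invented for the example, and `association_rules` is a hypothetical helper, not a MADlib or Greenplum function.

```python
from itertools import combinations

def association_rules(baskets, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for product pairs."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    count = {item: sum(1 for b in baskets if item in b) for item in items}
    rules = []
    for a, b in combinations(items, 2):
        both = sum(1 for basket in baskets if a in basket and b in basket)
        support = both / n                     # share of customers holding both
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / count[x]       # share of x-holders who also hold y
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

# Hypothetical product holdings, one set per customer
holdings = [
    {"checking", "mortgage"},
    {"checking", "mortgage", "brokerage"},
    {"checking"},
    {"mortgage", "checking"},
    {"brokerage"},
]
rules = association_rules(holdings, min_support=0.4, min_confidence=0.9)
print(rules)
```

A rule such as "mortgage implies checking" points at customers of one business arm who lack the product from the other; the revenue estimate would then rank those opportunities.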
CREDIT RISK ASSESSMENT AND STRESS TESTING
CUSTOMER
A global financial services provider
BUSINESS PROBLEM
Speed up the process of compliance reporting and stress testing for Basel III.
CHALLENGES
Running the calculation procedures on the customer’s legacy database was time-consuming and therefore had to be done in overnight batch mode.
SOLUTION
• Implement risk asset calculation and stress testing on Pivotal Greenplum Database.
• Three years of data was processed in well under 2 minutes, significantly faster than the customer’s current procedures.
• Connect an “in-database” visualization tool to Pivotal Greenplum Database via ODBC for on-demand reporting and visualization.
FINANCIAL COMPLIANCE
BUSINESS PROBLEM
• Ensure compliance with Dodd-Frank and Basel Committee regulations
• Identify underlying risk and fraud while reducing the compliance department’s overburdened
Data sources: Emails, Chats, Trades, Transactions, Policy, Securities, Phone Calls, Watch Lists, …
Financial compliance Data Lake
Analytics: Data integration, Data clean up, Modeling, Classification and ranking
Analyst user interfaces, Feedback
• Data integration: e.g., append trade information with email and chat communications
• Data cleanup: e.g., identify newsletters and spam emails
• Modeling:
– Predictive modeling to flag messages and trades
– Graph and cohort analysis
• Analyst feedback: Reviewed fraud instances included in periodic model refreshes
SOLUTION
• A data lake platform coupled with cutting edge data science techniques
• Flexible user interface to promote an adaptive, continuously learning compliance framework
PIVOTAL TOPIC & SENTIMENT ANALYSIS ENGINE
External Tables
PXF
HDFS
Source: http Sink: hdfs
Parallel Parsing of JSON (PL/Python)
HAWQ
Nightly Cron Jobs
Topic Analysis through MADlib pLDA
Unsupervised Sentiment Analysis (PL/Python)
D3.js Spring XD
Twitter Decahose (~55 million tweets/day)