1. Copyright 2014 EMC Corporation. All rights reserved.
   Derek Lin, Principal Data Scientist, Pivotal
2. Agenda
   - Information Security Analytics & Use Cases
   - Data Lake
   - Data Science: Extracting Value from Data Lake
   - Demo
3. Information Security Analytics Landscape
   - Application Areas
4. CIO Survey: Top Concerns
   - 54% What to Collect
   - 85% How to Analyze
   Sources: Barclays September 2013 CIO Survey; KPMG January 2014 CIO/CFO Survey
5. A Big Data Analytics Response
   - More sophisticated adversaries and sophisticated methods
   - Limited human capacity combined with massive amounts of events
      - 40% of all survey respondents are overwhelmed with the security data they already collect
      - 35% have insufficient time or expertise to analyze what they collect
   - Security tools, tactics and defenses becoming outdated:
      - Content is static and not as dynamic as the threat landscape
      - Segregated by too many point products, tool interfaces, disparate data sets
   Source: EMA, The Rise of Data-Driven Security, Crawford, Aug 2012. Survey sample size = 200.
6. Enterprise Information Security Analytics
   - Insider Threat
   - Asset Risk
   - Malware Threat
7. APT Kill Chain: Advanced Persistent Threat (APT)
   - Step 1, Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens the zero-day payload (CVE-2011-0609).
   - Step 2, Back Door: The user's machine is accessed remotely with the Poison Ivy tool.
   - Step 3, Lateral Movement: The attacker elevates access to important user, service and admin accounts, and to specific systems.
   - Step 4, Data Gathering: Data is acquired from target servers and staged for exfiltration.
   - Step 5, Exfiltrate: Data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider.
8. Anatomy of an Attack | Anatomy of a Response
9. Technology Landscape in the Kill Chain
   - Kill chain stages: Perimeter penetration; Malware beacon; Lateral movement; Staging & exfiltration
   - Approaches:
      - Single source real-time
      - Proactive, limited-sources rule-based methods over short-range
      - Proactive, multi-sources data-driven methods over long-range
      - Reactive, manual post-incident response
   - Technologies: Host/Network Analysis & Search/Indexing; Data-lake enabled analytics; IDS/IPS; Anti-virus; SIEM; DLP
10. Analytics Opportunities in Threat Defense (Malware Threat)
   - Malware discovery
      - Host infection detection
      - Malware Command & Control beaconing activity detection
   - Perimeter penetration prevention/detection
      - Anomalous VPN login detection
      - Denial of service attack mitigation
      - Local IP black list construction
      - Watering hole attack detection
      - Chat room monitoring
      - Phishing attack detection
      - Web server attack detection
   - In-network anomaly detection
      - Anomalous resource access detection
      - Critical server activity monitoring
   - IR efficiency improvement
      - Semi-automated analysis
   - SIEM efficiency improvement
      - Threat feed normalization
      - Alert prioritization
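The slide names Command & Control beaconing detection without describing a method. One common heuristic is that malware tends to call home at near-constant intervals, so a source/destination pair with very regular gaps between connections is worth a look. The following is a minimal sketch under that assumption only; the function names, the `(src, dst, epoch_seconds)` event shape, and the thresholds are hypothetical and are not taken from the talk.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Coefficient of variation of the gaps between events.

    Lower values mean more regular timing. Returns None when there are
    fewer than three events or when all events share one timestamp.
    """
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return None
    return pstdev(gaps) / avg


def find_beacons(events, max_cv=0.1, min_events=5):
    """Flag (src, dst) pairs whose connection timing is suspiciously regular.

    events: iterable of (src, dst, epoch_seconds) tuples (assumed shape).
    max_cv and min_events are illustrative thresholds, not tuned values.
    """
    by_pair = {}
    for src, dst, t in events:
        by_pair.setdefault((src, dst), []).append(t)

    flagged = []
    for pair, ts in by_pair.items():
        if len(ts) < min_events:
            continue
        cv = beacon_score(ts)
        if cv is not None and cv <= max_cv:
            flagged.append(pair)
    return sorted(flagged)
```

In practice such a score would be one feature among several (destination rarity, payload size, time of day), since legitimate software such as update checkers also polls on a fixed schedule.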
11. Information Security Analytics Areas
   - Insider Threat
   - Asset Risk
   - Malware Threat
12. Malicious Insider Threat
   Source: http://www.logrhythm.com/Portals/0/resources/LogRhythm_Survey_15.42014.pdf
   Survey questions:
   - What do you think is the biggest threat to your organization's confidential data?
   - Does your organization have any systems in place to stop employees accessing confidential information or taking data?
13. Analytics Opportunities in Identity, Access, and Management (Insider Threat)
   - Anomalous user to resource access detection: User-resource access data
   - Role and activity auditing: + Role and provisioning data
   - Privilege escalation auditing: + Privilege escalation data
   - IT support personnel auditing: Support ticket data; Command activity data
14. Information Security Analytics Areas
   - Insider Threat
   - Asset Risk
   - Malware Threat
15. Information Asset Management
16. Analytics Opportunities in Asset Management (Asset Risk)
   - Document categorization for risk labeling: Unstructured data; User access data
   - Asset risk profiling: Vulnerability scanner data; User access data
17. Data Lake
   - Needs and Trend
   - Analytics Support
18. Life Before Data Lake
   - Departmental Warehouse
   - Enterprise Apps
   - Reporting
   - Non-Agile Models
   - Spread marts
   - Prioritized Operational Processes
   - Errant data and marts
   - Departmental Warehouse
   - Siloed Analytics
   - Data Sources
   - Non-Prioritized Data Provisioning
   - Static schemas grow over time
19. Impact of Status Quo
   - High-value data is hard to reach and leverage
   - Data scientists are last in line for data
      - Queued after prioritized operational processes
   - Data is moving in batches from warehouse(s) to data scientists' desktops
      - In-memory analytical work (w/ R, SAS, SPSS, Excel)
      - Sampled, driving model accuracies down
   - There is a cottage industry of analytics, rather than centrally managed harnessing of analytics
      - Non-standardized initiatives
      - Frequently not aligned with corporate business goals
   - Slow time-to-insight & reduced business impact
20. Data-Driven Digital Media Analytics
   Unified data supporting re-usable predictive models
   - Use cases: Targeting & Retention; Social Media Analysis; Campaign optimization
   - Data sources: Transaction History; Purchases; Clickstream; Customer Data
   - Data size: GB, TB, PB
21. Data-Driven Financial Protection Analytics
   Unified data supporting re-usable predictive models
   - Use cases: Web Optimization; Fraud Detection; Product Recommendation
   - Data sources: ATM; Member Data; Transactional Log; Firewall; Clickstream; Phone Channel
   - Data size: GB, TB
22. Data-Driven IT Operation Analytics
   Unified data supporting re-usable predictive models
   - Use cases: Failure Prediction; Root Cause Analysis; Project Risk Forecasting
   - Data sources: Server or VM Performance Metrics; CMDB; Configuration Setting; Alerts & Incident; Server logs; Network Performance Metrics
   - Data size: GB, TB/PB
23. Data-Driven Security Analytics
   Unified data supporting re-usable predictive models
   - Use cases: Insider Threat Detection; Malware Detection; DDoS Mitigation
   - Data sources: AD/Auth; Asset/Role; Netflow; DNS/Firewall/Proxy; Critical Server; Packet Capture
   - Data size: GB, TB/PB
   Defense with breadth in variety and depth in time
24. Pivotal Business Data Lake Architecture
   - Centralized Management: System monitoring; System management
   - Unified Data Management Tier: Data mgmt. services; MDM; RDM; Audit and policy mgmt.
   - Processing Tier: Workflow Management; In-memory; MPP database
   - Unified Sources: Existing Sources; New Data Sources
   - Ingestion: Real-time ingestion; Micro batch ingestion; Batch ingestion
   - Flexible Actions: Real-time insights; Interactive insights; Batch insights
   - HDFS
25. Pivotal Business Data Lake Architecture
   - Centralized Management
   - Unified Data Management Tier: Data Dispatch; MDM; RDM; Data Dispatch
   - Processing Tier: Spring XD; Pivotal GemFire XD; HAWQ
   - Unified Sources (Existing Sources, New Data Sources): Clickstream; Sensor Data; Weblogs; Network Data; CRM Data; ERP Data
   - Flexible Actions
   - Other products shown: Pivotal GemFire; Pivotal RabbitMQ; Redis; Pivotal CF; Pivotal HD; Command Center
26. Data Lake: More Than a Data Repository
   To iterate, experiment, and fail fast, for a fast cycle of value generation
   - Analytics Support
   - Fast Query
   - Data Store
27. Data Science Tools
   - Commercial
   - Open Source (or Free)
   - PL/R, PL/Python, PL/Java
28. MADlib In-Database Functions
   Predictive Modeling Library
   - Linear Systems
      - Sparse and Dense Solvers
   - Matrix Factorization
      - Singular Value Decomposition (SVD)
      - Low-Rank
   - Generalized Linear Models
      - Linear Regression
      - Logistic Regression
      - Multinomial Logistic Regression
      - Cox Proportional Hazards Regression
      - Elastic Net Regularization
      - Sandwich Estimators (Huber-White, clustered, marginal effects)
   - Machine Learning Algorithms
      - Principal Component Analysis (PCA)
      - Association Rules (Affinity Analysis, Market Basket)
      - Topic Modeling (Parallel LDA)
      - Decision Trees
      - Ensemble Learners (Random Forests)
      - Support Vector Machines
      - Conditional Random Field (CRF)
      - Clustering (K-means)
      - Cross Validation
   Descriptive Statistics
   - Sketch-based Estimators
      - CountMin (Cormode-Muthukrishnan)
      - FM (Flajolet-Martin)
      - MFV (Most Frequent Values)
   - Correlation
   - Summary
   Support Modules
   - Array Operations
   - Sparse Vectors
   - Random Sampling
   - Probability Functions
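The sketch-based estimators in this list trade exactness for small, fixed memory, which matters when counting events per IP or per user across very large logs. As an illustration of the Count-Min idea only, here is a minimal pure-Python sketch. It is not MADlib's implementation or API: the class name, the `width` and `depth` parameters, and the SHA-256 row hashing are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter in the style of a Count-Min sketch."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, never reduce it,
        # so the minimum across rows is the tightest available estimate.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )
```

The estimate never undercounts; it can overcount when different items collide in every row, and widening the table makes that less likely.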
29. Data Science: Extracting Value from Data Lake
   - Technology & Tools
   - People
30. Evolution of Data Analytics in Security
   - BI and Compliance-driven: Data goes in, hard to extract value
   - Investigation-driven: Fast queries over large data
   - Behavior-metrics driven: Single source metrics, simple correlation, rule-based, high false positive
   - Data-science driven: Leverage full contextual info, multi-source, automatic, for low false positives
31. What is Data Science?
   The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to:
   - identify correlations and causal relationships
   - classify and predict events
   - identify patterns and anomalies
   - infer probabilities, interest and sentiment
32. Data Science: The Next Security Frontier
   - Beyond signatures
   - Beyond simple metrics for thresholding
   - Beyond manual engineering of rules
   Monitor each and every entity in its environmental context, with a 360-degree view, over a long time window, with advanced mathematics
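One simple reading of "beyond simple metrics for thresholding" is to score each entity against its own history rather than against a single global cutoff. The sketch below illustrates that idea with a per-entity z-score over past daily counts. It is an assumption-laden example, not the approach shown in the talk; the function name, the input dictionaries, and the choice of z-score are all hypothetical.

```python
from statistics import mean, pstdev


def entity_zscores(history, today):
    """Score today's activity for each entity against that entity's own baseline.

    history: dict mapping entity -> list of past daily counts (assumed shape).
    today:   dict mapping entity -> today's count.
    Entities with fewer than two past days, or with no variation in their
    history, are skipped because a z-score is undefined for them.
    """
    scores = {}
    for entity, value in today.items():
        past = history.get(entity, [])
        if len(past) < 2:
            continue
        sd = pstdev(past)
        if sd == 0:
            continue
        scores[entity] = (value - mean(past)) / sd
    return scores
```

With this scoring, a user who normally makes about ten requests a day and suddenly makes sixty stands out, while a backup service that always makes thousands does not, which a single global threshold would get backwards.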
33. Pivotal Network Intelligence Demo
37. What is a Data Scientist?
   - Programming Skills
   - Mathematical/Statistical Skills
38. One Team Member
   - Mathematical/Statistical Skills
   - Programming Skills
39. Another Team Member
   - Mathematical/Statistical Skills
   - Programming Skills
40. Yet Another
   - Programming Skills
   - Mathematical/Statistical Skills
41. Together
   - Programming Skills
   - Mathematical/Statistical Skills
42. Take-Home Messages
   - Information Security = Big Data Problem
   - Data Lake is more than a data store
   - Data Science drives value from Information Security Data Lake