1 © Copyright 2011 EMC Corporation. All rights reserved.
Machine Learning and Big Data in Cyber Security
Eyal Kolman, Ph.D.
Research Scientist
RSA
14.5.2015
** RSA CONFIDENTIAL **
Recent Security Attacks
Neiman Marcus, Jan 2014: 1.1 million credit and debit cards
NY city, 2014: 22.8 million private records were exposed
UPS, August 2014: customer details were exposed
Home Depot, September 2014: 56 million credit and debit cards
JP Morgan Chase, 2014: 76 million consumers and 7 million businesses
Sony, November 2014: 47K social security numbers, 5 movies
Goodwill, September 2014: 900K credit and debit cards
Kmart, 2014: unknown number of credit and debit cards
Dairy Queen, August 2014: 600K credit and debit cards
Today’s Cyber Security Paradigm
[Diagram labels: DB, DB, DB (Variety, Velocity, Volume); Data Science]
Monitor. Analyze. Detect.
Data Science all the Way
Data Extraction
•Where to place the sniffers?
•How to best use my limited storage?
•Detection of failures
Parsing
•Detection of message modifications
•Automatic synchronization and normalization
Feature Extraction
•Automatic feature definition (“deep learning” style)
•Feature selection
Detection
•Anomaly detection
•Pattern recognition
•Behavioral-based analysis
•Low and slow detection
•…
Alerting
•Contextual alerting
•Prioritization
•Alerts filtering
•Grouping
Investigation
•Prediction of analyst next step
•Recommendation systems
•Crowd sourcing
•Feedback generation (explicit and implicit)
Mitigation
•What-if analysis
•Automatic mitigation
Data Science all the Way
Data Extraction
• Where to place the sniffers?
• How to best use my limited storage?
• Detection of failures
Data Science all the Way
Parsing
• Activities aggregation
• Detection of message modifications
• Automatic synchronization and normalization
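As one illustration of the synchronization and normalization item, the sketch below converts timestamps that different sources logged with different UTC offsets into UTC, so that events can be placed on a single timeline. The record layout and the field names (`source`, `ts`, `ts_utc`) are hypothetical and not taken from the slides.

```python
from datetime import datetime, timezone

def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return parsed.astimezone(timezone.utc)

def normalize(events):
    """Return events sorted by UTC time, each with an added 'ts_utc' field."""
    normalized = [{**e, "ts_utc": to_utc(e["ts"])} for e in events]
    return sorted(normalized, key=lambda e: e["ts_utc"])

# Made-up events from three sources that log in different local offsets.
events = [
    {"source": "proxy", "ts": "2015-05-14T12:00:00+03:00"},
    {"source": "firewall", "ts": "2015-05-14T08:30:00+00:00"},
    {"source": "vpn", "ts": "2015-05-14T05:15:00-04:00"},
]
ordered = [e["source"] for e in normalize(events)]
```

Sorting on the raw strings would put the VPN event first; after conversion it is the latest of the three.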
Data Science all the Way
Feature Extraction
• Automatic feature definition (“deep learning” style)
• Feature selection
• Dimensionality reduction
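The slides do not name a feature selection technique. As a minimal sketch, assuming a simple variance filter is acceptable, the code below drops features that never change across observations, since they cannot separate normal from abnormal activity. The feature names and values are invented for illustration.

```python
from statistics import pvariance

def select_by_variance(rows, names, min_variance):
    """Keep features whose population variance across rows exceeds min_variance."""
    columns = list(zip(*rows))
    return [n for n, col in zip(names, columns) if pvariance(col) > min_variance]

# Hypothetical per-session features; protocol_version is constant.
names = ["bytes_out_mb", "protocol_version", "session_hours"]
rows = [
    (68.0, 4, 4.0),
    (72.0, 4, 3.5),
    (1000.0, 4, 15.0),
    (65.0, 4, 4.5),
]
kept = select_by_variance(rows, names, min_variance=0.0)
```

A variance filter is unsupervised and cheap, but it ignores labels and correlations between features, so it is only a first pass.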
Data Science all the Way
Detection
• Anomaly detection
• Pattern recognition
• Behavioral-based analysis
• Low-and-slow detection
• …
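To make the low-and-slow item concrete, here is a hedged sketch that contrasts a per-hour rate rule with a long-window total. A source that stays under the hourly limit but keeps going for two days is missed by the first rule and caught by the second. The thresholds, addresses, and event data are made up, and this is only one simple way to approach the problem.

```python
from collections import Counter

def bursts(events, per_hour_limit):
    """Sources exceeding the limit within any single hour (a classic rate rule)."""
    per_hour = Counter((src, hour) for src, hour in events)
    return {src for (src, _), n in per_hour.items() if n > per_hour_limit}

def low_and_slow(events, per_hour_limit, total_limit):
    """Sources that never burst but accumulate many events over the whole period."""
    totals = Counter(src for src, _ in events)
    noisy = bursts(events, per_hour_limit)
    return {src for src, n in totals.items() if n > total_limit and src not in noisy}

# (source, hour index) pairs for failed logins over 48 hours; made-up data.
events = [("10.0.0.5", h) for h in range(48) for _ in range(2)]  # 2 per hour, 96 total
events += [("10.0.0.9", 3)] * 30                                  # one 30-event burst
events += [("10.0.0.7", 10), ("10.0.0.7", 20)]                    # occasional failures

burst_sources = bursts(events, per_hour_limit=10)
slow_sources = low_and_slow(events, per_hour_limit=10, total_limit=50)
```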
Data Science all the Way
Alerting
• Contextual alerting
• Prioritization
• Alerts filtering
• Grouping
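The filtering, grouping, and prioritization items can be sketched together: drop alerts below a score threshold, group the rest by host, and order the groups by their highest score. The alert fields (`host`, `rule`, `score`) and the threshold are assumptions made for this example.

```python
from collections import defaultdict

def prioritize(alerts, min_score):
    """Filter low-score alerts, group the rest by host, rank groups by top score."""
    groups = defaultdict(list)
    for alert in alerts:
        if alert["score"] >= min_score:
            groups[alert["host"]].append(alert)
    ranked = sorted(groups.items(),
                    key=lambda item: max(a["score"] for a in item[1]),
                    reverse=True)
    return [(host, [a["rule"] for a in items]) for host, items in ranked]

# Hypothetical alerts; hosts and rule names are invented.
alerts = [
    {"host": "ws-12", "rule": "new_country", "score": 92},
    {"host": "ws-12", "rule": "large_upload", "score": 93},
    {"host": "srv-3", "rule": "port_scan", "score": 40},
    {"host": "ws-07", "rule": "new_device", "score": 90},
]
queue = prioritize(alerts, min_score=50)
```

Grouping by host means the analyst sees one entry for `ws-12` carrying both alerts rather than two separate items.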
Data Science all the Way
Investigation
• Prediction of analyst next step
• Recommendation systems
• Crowd sourcing
• Feedback generation (explicit and implicit)
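The slides do not say how the analyst's next step would be predicted. One simple possibility, shown here purely as an assumption, is a first-order transition count over past investigation sessions: suggest whichever step most often followed the current one. The step names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count how often each investigation step is followed by each other step."""
    transitions = defaultdict(Counter)
    for steps in sessions:
        for current, following in zip(steps, steps[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, current):
    """Most frequent follow-up to `current`, or None if it was never seen."""
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]

# Made-up investigation histories.
sessions = [
    ["open_alert", "view_user", "view_device", "close"],
    ["open_alert", "view_user", "query_dns", "escalate"],
    ["open_alert", "view_user", "view_device", "escalate"],
]
model = fit_transitions(sessions)
suggestion = predict_next(model, "view_user")
```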
Data Science all the Way
Mitigation
• What-if analysis
• Automatic mitigation
The RSA Risk Engine
[Diagram labels: Activity details; Risk Engine (Behavior, Device, Fraud); Policy Mgr.; Authenticate / Continue; Step-up Authentication (Challenge, Out-of-band, Knowledge, Others); Feedback; Case Mgmt; values shown: 271, 937]
Impersonation Detection
The model learns the user’s behavior from historical data
• Country (IN, UAE): User logs in from UAE for the 1st time; the user is always located in India. Score: 92
• Device (Device1, Device2, Device3): User logs in from a new, unrecognized device. Score: 90
• Transmitted Data [MB]: User transmits 1GB; the user’s average is 68MB. Score: 93
• Session Duration [Hours]: Session duration is 15 hours; the user’s average is 4 hours. Score: 82
• Score: Final score is an aggregation of the features’ scores. Aggregated Score: 98
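The slide gives per-feature scores and an aggregated score but not the formulas behind them. The sketch below is therefore only an assumption about how such scoring could work: categorical features are scored by rarity in the user's history, numeric features by deviation from the user's mean, and the scores are combined noisy-OR style. It is not the RSA Risk Engine's method. On the slide's four scores this combination comes out close to 100, not the 98 shown.

```python
from collections import Counter
from math import erf, sqrt
from statistics import mean, pstdev

def categorical_score(history, observed):
    """0-100 score: 100 for a never-seen value, lower for frequently seen ones."""
    frequency = Counter(history)[observed] / len(history)
    return 100.0 * (1.0 - frequency)

def numeric_score(history, observed):
    """0-100 score from the two-sided normal probability of the z-score."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if observed == history[0] else 100.0
    z = abs(observed - mean(history)) / sigma
    return 100.0 * erf(z / sqrt(2))

def aggregate(scores):
    """Noisy-OR style combination: any single high score keeps the total high."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s / 100.0
    return 100.0 * (1.0 - remaining)

# Hypothetical history: always India, about 68 MB per session.
countries = ["IN"] * 30
data_mb = [60, 70, 65, 75, 68, 70]
scores = [categorical_score(countries, "UAE"), numeric_score(data_mb, 1000)]
total = aggregate(scores)
```

A noisy-OR combination never falls below the highest individual score, which matches the slide's example where the aggregate (98) exceeds every feature score.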
16 © Copyright 2012 EMC Corporation. All rights reserved.
Suspicious Domains Detection
[Figure: each vertical line represents one feature]
How long is the path in the URL?
Was the site reached through a referrer?
Was the site contacted with a cookie?
Was the site seen by only a few users?
Was the user agent string suspicious?
Is the transmit-to-receive ratio abnormal?
Risk is calculated across multiple features
Risk is scored between 0 and 1: 1 = riskiest (red), 0 = normal (green)
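The slide does not state how the per-feature risks are combined into one value between 0 and 1. A weighted mean is one simple option and is used here as an assumption. The feature names, their polarity (for example, that a missing referrer is risky), and the equal default weights are all hypothetical.

```python
def domain_risk(feature_risks, weights=None):
    """Weighted mean of per-feature risks, each in [0, 1]; result is in [0, 1]."""
    if weights is None:
        weights = {name: 1.0 for name in feature_risks}
    for name, risk in feature_risks.items():
        if not 0.0 <= risk <= 1.0:
            raise ValueError(f"{name} risk out of range: {risk}")
    total_weight = sum(weights[name] for name in feature_risks)
    return sum(weights[n] * r for n, r in feature_risks.items()) / total_weight

# Hypothetical domain: five of the six features look risky.
suspicious = {
    "long_url_path": 1.0,
    "no_referrer": 1.0,
    "no_cookie": 1.0,
    "seen_by_few_users": 1.0,
    "odd_user_agent": 0.0,
    "abnormal_tx_rx_ratio": 1.0,
}
risk = domain_risk(suspicious)
```

With equal weights the example domain scores 5/6, toward the red end of the scale.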
Ranking Top Suspicious Domains
68% of the top 50 domains are malicious
Legend: Red – malicious
Black – benign
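The 68% figure is a precision-at-k measurement: of the 50 highest-ranked domains, 34 were labeled malicious. A minimal sketch of that calculation follows; the label list is synthetic and only reproduces the counts, not the actual ranking.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked items that are labeled malicious (True)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_labels[:k]
    return sum(top) / len(top)

# Synthetic labels: 34 malicious among the top 50, followed by benign domains.
ranked_labels = [True] * 34 + [False] * 16 + [False] * 50
p50 = precision_at_k(ranked_labels, 50)
```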
Data Science is not a position. It’s a Group.
• Data Gurus
• Domain Experts
• Machine Learning Researchers
Data Science
• User Verification
• Device-based risk assessment
• Suspicious Domains Detection
• Suspicious Users Detection
• Alerts Prioritization
• Anomalous Communication Detection
• DNS-based Malware Detection
eMail me: [email protected]