Posted: 11-Apr-2017
Using Data Science for Cybersecurity
Anirudh Kondaveeti, Principal Data Scientist, Pivotal
Jeff Kelly, Principal Product Marketing Manager, Pivotal
Today’s Speakers
Using Data Science for Cybersecurity
Presenter: Anirudh Kondaveeti, Principal Data Scientist, Pivotal
Moderator: Jeff Kelly, Product Marketing, Pivotal
● Cybercrime costs the average US enterprise $17M per year*
● Cost grew at 15% CAGR over last three years
● Any given cybercrime can cost significantly more
● Target’s 2014 hack cost company approximately $162m
● Costs not just financial, also reputational
Cost of Cybercrime on the Rise
*Source: 2016 Cost of Cyber Crime Study & the Risk of Business Innovation, Ponemon Institute
● Amateur hackers giving way to professionals
● Developing new, more sophisticated, methods
● Professional hackers make their services available for a fee
● Costs to commit cybercrime dropping
● Average subscription fee for a one hour/month DDoS package is roughly $38*
Hackers Growing More Sophisticated
*Source: Q2 2015 Global DDoS Threat Landscape, Incapsula
● Defending the perimeter no longer enough
● No 100% fool-proof way to keep bad actors out
● Some threats come from within
● The idea of a perimeter becoming obsolete with mobile, cloud, IoT
● Need better methods for threat detection inside the network
Perimeter Defense Inadequate
Data Science for Cybersecurity
Security must move beyond signature-based matching
• Necessary defense direction: find the unknown
• Need an advanced platform: security is a Big Data problem
• Multiple decentralized sources of traditional or unconventional data
• Need a platform for better BI, reporting, and cross-source correlation
• Develop intelligence: security is an Advanced Analytics problem
[Figure: security analytics maturity progression — BI and compliance-driven → investigation-driven → behavior-metrics and data-science-driven]
Background
Lateral Movement Detection
Advanced Persistent Threat (APT)
1. Phishing and zero-day attack: a handful of users are targeted by two phishing attacks; one user opens a zero-day payload (CVE-2011-0609)
2. Back door: the user’s machine is accessed remotely via the Poison Ivy tool
3. Lateral movement: the attacker elevates access to important user, service, and admin accounts, and to specific systems
4. Data gathering: data is acquired from target servers and staged for exfiltration
5. Exfiltrate: data is exfiltrated as encrypted files over FTP to an external, compromised machine at a hosting provider
APT Kill Chain
What: identify anomalous user-level access to hosts
How: look at people & machines
• Users (User Behavior Models)
• Network, servers (User Peer Models)
Scenarios:
• Network reconnaissance from a remote adversary on a hijacked device
• Ill-intentioned activities by a legitimate employee
• Access policy abuse
Business value:
• Immediate security alert generation
• Enhanced SIEM alert queue prioritization
• Focused monitoring
• Future integration with other analytic models for a 360° attack view
Lateral Movement Detection
[Diagram: LMD data flow — structured sources (logs, Active Directory activity, Active Directory metadata, LDAP activity, server information) and semi-structured sources are loaded via external tables into the Greenplum Data Computing Appliance (DIA); regression-based, cluster-based, and recommendation-system-based user behavioral models then surface anomalous users.]
Lateral Movement Detection (LMD) – Flow Diagram
Model to identify users with unusual variation in the number of servers accessed over time:
• Build a regression model for each user (Y = aX + b), regressing the number of servers accessed each week (Y) on the week index (X)
• Find the slope (a) of each user’s regression line
• Flag users with a high positive or negative slope as showing unusual activity
[Figure: regression plot of number of servers accessed vs. week of the year for one user]
Regression-Based Model
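In the deck this runs in-database with SQL and R; as an illustrative sketch only (all names and the threshold are hypothetical, not from the deck), the per-user slope test can be written in plain Python:

```python
def slope(y):
    """Least-squares slope of y against week index 0..n-1 (the 'a' in Y = aX + b)."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def flag_unusual_users(weekly_counts, threshold=2.0):
    """Flag users whose servers-per-week trend has an extreme slope.

    weekly_counts: {user: [servers accessed in week 1, week 2, ...]}
    """
    slopes = {u: slope(c) for u, c in weekly_counts.items()}
    return {u: a for u, a in slopes.items() if abs(a) >= threshold}

counts = {
    "alice": [3, 3, 4, 3, 4, 3],     # stable access pattern
    "bob":   [2, 5, 9, 14, 18, 22],  # steep upward trend
}
print(flag_unusual_users(counts))    # only bob exceeds the slope threshold
```

A fixed threshold is just a stand-in; in practice the cut-off would come from the distribution of slopes across all users.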
Build a historical behavioral profile for each user based on the following features:
• Servers accessed
• IP addresses logged in from
• Geographical information of logins
Models stress individual user/job log-in frequency.
Multiple feature generations reduce false alarms:
• Aggregate servers into their respective server groups
• Incorporate server criticality
• Assign more weight to less popular servers and IP addresses (e.g., print servers are low-weighted)
• Use a recommendation engine to suggest servers to users based on job roles and peers
[Figure: servers s1–s10 over time — the user typically uses only a few servers, then begins logging into a lot of new servers]
User Behavior Models (UBM)
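The "more weight to less popular servers" idea above resembles IDF weighting; here is a minimal sketch under that assumption (the function and weighting formula are illustrative, not from the deck):

```python
import math

def server_weights(access_log):
    """Inverse-popularity (IDF-style) weight per server: rare servers score higher.

    access_log: {user: set of servers that user has accessed}
    """
    n_users = len(access_log)
    popularity = {}
    for servers in access_log.values():
        for s in servers:
            popularity[s] = popularity.get(s, 0) + 1
    # log(N / document frequency) + 1: a server everyone uses gets weight 1.0
    return {s: math.log(n_users / df) + 1.0 for s, df in popularity.items()}

log = {
    "alice": {"print01", "hr-db"},
    "bob":   {"print01", "build01"},
    "carol": {"print01"},
}
w = server_weights(log)
# The widely used print server gets the lowest weight, as the slide suggests.
assert w["print01"] < w["hr-db"]
```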
User behavior matrix for one user (servers × weeks):

           Week 1  Week 2  …  Week 10 | Week 11  …  Week 15
server1      2       3     …     1    |    0     …     0
server2      4       7     …     1    |    3     …     7
server3      0       2     …     0    |    0     …     0
…
server25     1       3     …     5    |    8     …     1

(Weeks 1–10: training data, used to build the PCA model per user; weeks 11–15: testing data.)
The user behavior matrix is created using ‘x’ weeks of history for each user; the current week is used as test data.
PCA is a dimensionality-reduction technique that captures the set of components of a multidimensional vector accounting for most of the variance.
Principal dimensions are calculated from the training data.
Principal Component Analysis (PCA) Scoring
Reconstruction Error
[Diagram: run PCA on the training data (user behavior matrix) to obtain the principal dimensions; project the test vector (user data for the new week) onto the principal dimensions and reconstruct it; the difference between the test vector and the reconstructed test vector is the anomaly score.]
Ref: A. Lakhina, M. Crovella, C. Diot, “Diagnosing Network-Wide Traffic Anomalies.”
Principal Component Analysis (PCA) Scoring
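The reconstruction-error flow above can be sketched in a few lines of NumPy (the deck does this in R over Greenplum; this Python version, including all names and the synthetic data, is illustrative only):

```python
import numpy as np

def pca_anomaly_score(train, test_vec, k=2):
    """Reconstruction-error anomaly score.

    train: (weeks x servers) user behavior matrix; test_vec: the new week.
    Project the test vector onto the top-k principal dimensions of the
    training data, reconstruct it, and score by the residual norm.
    """
    mean = train.mean(axis=0)
    # SVD of the centered training matrix yields the principal dimensions.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:k]                              # principal dimensions
    centered = test_vec - mean
    recon = components.T @ (components @ centered)   # project, then reconstruct
    return float(np.linalg.norm(centered - recon))

# Synthetic user: weekly activity is a mix of two habitual server groups.
rng = np.random.default_rng(0)
p1 = np.zeros(25); p1[:5] = 1.0        # usual server group A
p2 = np.zeros(25); p2[5:8] = 1.0       # usual server group B
weeks = np.array([a * p1 + b * p2 for a, b in rng.uniform(1, 4, size=(10, 2))])
train = weeks + rng.normal(0, 0.05, size=weeks.shape)

normal = 2 * p1 + 3 * p2                     # same habit, new week
novel = np.zeros(25); novel[15:20] = 3.0     # logs into brand-new servers

# A week that breaks the habitual pattern reconstructs poorly.
assert pca_anomaly_score(train, normal) < pca_anomaly_score(train, novel)
```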
Oversampling PCA
Reference and image source: Y.-R. Yeh, Z.-Y. Lee, Y.-J. Lee, “Anomaly Detection via Over-sampling Principal Component Analysis.”
[Diagram: run PCA on the training data (user behavior matrix) to obtain the first principal vector; append oversampled copies of the test data to the training data and run PCA again to obtain the first principal vector after oversampling; the difference in angle between the two vectors is the anomaly score.]
Principal Component Analysis (PCA) Scoring
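The oversampling variant above can be sketched the same way (an illustrative NumPy sketch with synthetic data, not the deck's R implementation): duplicate the test vector, refit PCA, and measure how far the first principal vector rotates.

```python
import numpy as np

def first_pc(matrix):
    """First principal (right singular) vector of the mean-centered matrix."""
    _, _, vt = np.linalg.svd(matrix - matrix.mean(axis=0), full_matrices=False)
    return vt[0]

def ospca_score(train, test_vec, dup=10):
    """Oversampling-PCA anomaly score: append `dup` copies of the test vector
    to the training data and return the angle (radians) by which the first
    principal vector rotates."""
    v0 = first_pc(train)
    oversampled = np.vstack([train, np.tile(test_vec, (dup, 1))])
    v1 = first_pc(oversampled)
    cos = abs(float(v0 @ v1))          # the sign of a principal vector is arbitrary
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

# Synthetic user whose activity lies along one habitual direction.
rng = np.random.default_rng(1)
direction = np.array([1.0, 0.0, 0.0, 0.0])
train = np.outer(rng.uniform(1, 4, 20), direction) + rng.normal(0, 0.05, (20, 4))

normal = np.array([2.5, 0.0, 0.0, 0.0])   # consistent with history
novel = np.array([0.0, 5.0, 0.0, 0.0])    # orthogonal to history

# Oversampling an anomalous point drags the first PC much further.
assert ospca_score(train, normal) < ospca_score(train, novel)
```

The intuition: a normal point reinforces the existing principal direction, while an outlier, once duplicated, pulls the principal vector toward itself.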
SQL & R: R code to find the principal components (using SVD), with a PL/R wrapper over the R code to run it in parallel
[Diagram: User1–User5 data are scored independently — each user’s data feeds its own model (User1 Model … User5 Model), fitted in parallel]
Parallelized PCA using PL/R
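Greenplum's PL/R distributes the per-user model fits across database segments; the shape of that computation — one independent model per user, mapped in parallel — can be sketched in Python (a toy illustration with hypothetical model features, not the deck's PL/R code):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_user_model(item):
    """Fit one user's (toy) model: mean and max weekly server count."""
    user, counts = item
    return user, {"mean": sum(counts) / len(counts), "max": max(counts)}

user_data = {
    "alice": [3, 4, 3, 5],
    "bob":   [2, 9, 14, 20],
    "carol": [1, 1, 2, 1],
}

# One model per user, fitted in parallel -- the same embarrassingly
# parallel pattern the PL/R wrapper pushes down to Greenplum segments.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(fit_user_model, user_data.items()))

print(models["bob"]["max"])   # -> 20
```

Because each user's fit touches only that user's rows, the work shards cleanly; in-database execution adds the benefit of never moving the raw data out.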
Users rate items. To recommend items to a particular user A:
• Find other users U similar to A
• Identify the set of items I accessed by U
• Recommend these items I to A
Here, users = employees and items = servers accessed.
Image source: http://dataconomy.com/2015/03/an-introduction-to-recommendation-engines/
Recommendation System-Based Model
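The three steps above are user-based collaborative filtering; a minimal sketch (similarity measure, names, and data are illustrative assumptions, not from the deck):

```python
def jaccard(a, b):
    """Set-overlap similarity between two users' accessed-server sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_servers(target, access, k=2):
    """Recommend servers that the k users most similar to `target`
    (by Jaccard overlap) access but the target does not."""
    sims = sorted(
        ((jaccard(access[target], s), u) for u, s in access.items() if u != target),
        reverse=True,
    )
    recs = set()
    for _, peer in sims[:k]:
        recs |= access[peer] - access[target]
    return recs

access = {
    "alice": {"hr-db", "mail", "wiki"},
    "bob":   {"hr-db", "mail", "payroll"},
    "carol": {"build01", "git"},
}
# bob is alice's closest peer, so his extra server is recommended to her.
print(recommend_servers("alice", access, k=1))   # -> {'payroll'}
```

In the LMD setting the recommendations are not shown to users; they are used to down-weight new-but-expected server accesses so they do not raise the outlier score.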
• Historical profile for each user based on the number of days per week on a particular server, weighted by recommendations
• Inputs: AD logs, LDAP data (job title, dept, etc.)
• Heat map (top figure): X-axis = week index, Y-axis = server, value = number of days per week weighted by recommendations
• Outlier plot (bottom figure): X-axis = week index, Y-axis = outlier score
[Figure: heat maps over servers g1–g5, before and after recommendations]
Servers g3 & g4 are recommended, hence their weight is decreased; the outlier score in the test week decreases because the new servers the user accesses are recommended for his job profile.
Recommendation System-Based Model
Use historical Windows events data to build graphs* of typical user behavior:
• Which machines does the user log into?
• Which machines does the user log in from?
• How often? In which order?
Ask whether this behavior is typical:
• Is it typical for this user?
• Is it typical for someone in a particular department?
• Is it typical for someone in the user’s job role?
Graph models are sensitive to direction, order, and frequency.
[Figure: login graphs over hosts 34.23.x.x — typical behavior vs. anomalous behavior, in which an unusual path reaches a DB with financial information]
*Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil, “Authentication Graphs: Analyzing User Behavior within an Enterprise Network.”
Graph Model
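A direction- and frequency-sensitive login graph can be sketched as a counter over directed (source host, destination host) edges (a minimal illustration with a hypothetical typicality threshold, not the authentication-graph model from the cited paper):

```python
from collections import Counter

class AuthGraph:
    """Directed login graph for one user: edges (src_host -> dst_host)
    with counts, built from historical Windows logon events."""

    def __init__(self, events):
        # events: iterable of (src_host, dst_host) logons for this user
        self.edges = Counter(events)
        self.total = sum(self.edges.values())

    def is_typical(self, src, dst, min_frac=0.05):
        """Call a logon 'typical' if that directed edge accounts for at
        least min_frac of the user's historical logons."""
        if self.total == 0:
            return False
        return self.edges[(src, dst)] / self.total >= min_frac

history = ([("laptop", "mail")] * 40
           + [("laptop", "wiki")] * 9
           + [("laptop", "hr-db")])
g = AuthGraph(history)

assert g.is_typical("laptop", "mail")            # 80% of logons
assert not g.is_typical("laptop", "finance-db")  # never seen before
assert not g.is_typical("laptop", "hr-db")       # 2% -- rare enough to flag
```

Because edges are ordered pairs, logging *into* a machine and logging in *from* it are distinct; extending the keys to edge sequences would capture order as well.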
Challenge:
• Cybersecurity threats, data privacy and protection issues, and fraudulent behavior going undetected, leaving the customer vulnerable to security risks and loss of money
• Need to gain timely insight into unusual/suspicious internal behavior to allow for proper action
• Tools in place cannot be customized to leverage historical security data and allow for predictive analytics
Solution:
• Leveraged data science to demonstrate use cases analyzing their Active Directory data, identifying fraud, unapproved file sharing, etc.
• Utilized the Big Data Suite, specifically Greenplum + MADlib + R, to store and analyze data, with the potential to build out a Hadoop data lake with HDB (aka HAWQ)
Pivotal solution includes: Pivotal Greenplum, Pivotal HDB, Apache MADlib
Fortune 100 Companies Leverage Pivotal to Tackle Enterprise-wide Security Risks with Analytics
• Pivotal Data Science expertise and partnership with customers to identify high-value use cases and build a data science center of excellence for security analytics
• Tight integration with analytical tools that run in-database and across all of the data, covering the widest possible range of use cases
• Scalable solution that grows as data needs grow, leveraging commodity hardware to keep costs low as data volume increases
• Join key Pivotal customers in the Security Advisory Council for collaboration and knowledge sharing
Why Pivotal for Security Analytics
Additional Resources & Next Steps
• Read: Pivotal Data Science Blog — https://blog.pivotal.io/channels/data-science-pivotal
• Strategic: Pivotal Data Science Analytics Roadmapping Engagement — https://pivotal.io/contact
• Tune in: next data science webinar, “Using Data Science to Detect Healthcare Fraud, Waste, and Abuse,” March 14, 2017 — https://pivotal.io/resources/1/webinars
• Hands on: Pivotal Greenplum Sandbox — https://network.pivotal.io/products/pivotal-gpdb
• Hands on: Apache MADlib (incubating) — http://madlib.incubator.apache.org/
Questions? Using Data Science for Cybersecurity