Date post: | 26-Dec-2015 |
Security Data Science (SDS)
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
ENEE 759D | ENEE 459D | CMSC 858Z
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD
Introducing Your Instructor
Tudor Dumitraș
Office: AVW 3425
Email: [email protected]
Website: http://ter.ps/759d
Office Hours: Mon 2-3 pm
My Background
• Ph.D. at Carnegie Mellon University
– Research in distributed systems and fault-tolerant middleware
• Worked at Symantec Research Labs
– Built WINE platform for Big Data experiments in security
– WINE currently used by academic researchers and Symantec engineers
• Joined UMD faculty
• Research and teaching on applied security and systems
– Focus on solving security problems with data analysis techniques
SDS In A Nutshell
• Course objectives
– Ability to understand and interpret scholarly publications, to explain their key ideas, and to provide constructive feedback
– Ability to apply some of these ideas in practice
• Topics
– Vulnerabilities and exploits
– Failures of cryptosystems
– Internet worms
– Denial of service
– Botnets
– Spam infrastructures
– Pay per install
– Attacks against physical infrastructure
– Targeted attacks
– Economic implications of cybercrime
• Grading
– 50% paper reviews and class participation
– 50% projects
We Are Swimming in Data
• Data created/reproduced in 2010: 1,200 exabytes
• Data collected to find the Higgs boson: 1 gigabyte / s
• Yahoo: 200 petabytes across 20 clusters
• Security:
– Global spam in 2011: 62 billion / day
– Malware variants created in 2011: 403 million
Why So Much Data?
• We can store it
– 6¢ / GB
– 29¢ / GB (SAS HDD)
• We can generate it
– Most data is machine-generated
– Most malware samples are variants of other malware, generated automatically (repacking, obfuscation)
What to do with all this data?
Three Stories about Data
ONE
WHAT QUESTIONS TO ASK ON A FIRST DATE?
The Power of Big Data
If You Want to Know …
Do my date and I have long-term potential?
… ask:
Data
Q Do you like horror movies?
Q Have you ever traveled around another country alone?
Q Wouldn't it be fun to chuck it all and go live on a sailboat?
3.7× likelihood of coincidence
275,000 user-submitted questions
34,260 real-world couples
Psychology
Top 3 user-rated questions, about:
• God
• Sex
• Smoking
Online Dating and Big Data
• eHarmony
– Analyzes hundreds of behavioral variables, most collected automatically
– CTO: former search engineer at Yahoo!
• OkCupid: “We do math to get you dates”
– Founded by Harvard math & CS majors
• PlentyOfFish: “Building this matching system was harder than [being] cited in the paper that won the Fields Medal”
Source: CNN Money
Early 1900s: Most Factories Had Private Generators
Source: Nicholas Carr
Electricity was critical for business, but not widely available
Source: OkCupid
Is he an engineer?
Does she date engineers?
Data analytics provide remarkable insight
Applications in many disciplines
What Is Data Science?
• Also known as …
… Big Data analytics
… Machine intelligence
… Data-intensive computing
… Data wrangling
… Data munging
… Data jujitsu
Source: Drew Conway
TWO
IMPROVING MACHINE TRANSLATION
The Unreasonable Effectiveness of Data
2005 NIST Machine Translation Competition
• Google’s first entry
– None of the engineers spoke Arabic
• Simple statistical approach
• Trained using United Nations documents
– 200 million translated words
– 1 trillion monolingual words
English-Arabic competition
“For many hard problems there appears to be a threshold of sufficient data” (A. Halevy, et al., CACM 2009)
What is Security Data Science?
• Also known as …
… Security analytics
… Surveillance analytics
• Applying data science methods to security problems
Security Principles in 60 Seconds [J. Saltzer & M. Schroeder, SOSP 1973]
• Economy of mechanism: Keep the protection mechanism as simple and small as possible
• Fail-safe defaults: Base access decisions on permission rather than exclusion
• Complete mediation: Check every access to every object
• Open design: Do not keep the design secret
• Separation of privilege: Require two keys to unlock, not one
• Least privilege: Grant every program/user the least set of privileges necessary to complete the job
• Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users
• Psychological acceptability: Design interfaces for ease of use
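A few of these principles can be made concrete in code. The sketch below is illustrative only: `AccessPolicy`, `read_resource`, and the user and file names are hypothetical, not part of the course material. It shows fail-safe defaults (anything not explicitly granted is denied), complete mediation (the check runs on every access), and least privilege (the user gets read access and nothing more).

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Allowlist policy: anything not explicitly granted is denied."""
    # Maps (user, resource) to the set of actions that user may perform.
    grants: dict = field(default_factory=dict)

    def grant(self, user: str, resource: str, action: str) -> None:
        self.grants.setdefault((user, resource), set()).add(action)

    def is_allowed(self, user: str, resource: str, action: str) -> bool:
        # Fail-safe default: a missing entry means "deny".
        return action in self.grants.get((user, resource), set())


def read_resource(policy: AccessPolicy, user: str, resource: str, store: dict) -> str:
    # Complete mediation: the check runs on every access, with no cached decision.
    if not policy.is_allowed(user, resource, "read"):
        raise PermissionError(f"{user} may not read {resource}")
    return store[resource]


# Hypothetical data for illustration.
policy = AccessPolicy()
policy.grant("alice", "report.txt", "read")  # least privilege: read only, no write
store = {"report.txt": "quarterly numbers"}
```

Here `read_resource(policy, "alice", "report.txt", store)` succeeds, while any request from a user or for an action that was never granted raises `PermissionError`.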
Security in Practice
(Source: C. Nachenberg, Symantec)
• 1986: Simple computer viruses
– Defense: anti-virus
• 1990: Polymorphic viruses (decryption logic + encrypted malicious code)
– Defense: “universal” decoder, emulation
• 1995: Macro viruses
– Defense: AV vendor cooperation, digital signatures for macros
• 1999: Worms
– Defense: Vulnerability-specific signatures
• 2004: Web-based malware
– Defense: behavior blocking
• 2006: Auto-generated malware
– Defense: reputation based security
• 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.)
– Defense: ??
THREE
UNDERSTANDING ZERO-DAY ATTACKS
The Need for Security Data Science
Zero-Day Attacks: Recent Examples
2009: Operation Aurora against Google
2010: Stuxnet
2011: Attack against RSA
Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability
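The definition above reduces to a date comparison. A minimal sketch, assuming we know when an exploit was first seen in attacks and when the vulnerability was publicly disclosed; the function names and the dates are hypothetical and not taken from any real incident.

```python
from datetime import date


def is_zero_day(first_attack: date, disclosure: date) -> bool:
    """True if the exploit was used before the vulnerability was publicly disclosed."""
    return first_attack < disclosure


def zero_day_window_days(first_attack: date, disclosure: date) -> int:
    """Days the attack ran before disclosure (0 if it started on or after day zero)."""
    return max((disclosure - first_attack).days, 0)


# Hypothetical dates for illustration only.
first_attack = date(2010, 1, 4)
disclosure = date(2010, 3, 15)

print(is_zero_day(first_attack, disclosure))
print(zero_day_window_days(first_attack, disclosure))
```

In practice the hard part is not this comparison but establishing `first_attack` at all, since attacks before disclosure are rarely observed.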
Price of Zero-Day Exploits on the Black Market
Source: The Economist, March 2013
The Elderwood Project
Group with “seemingly unlimited” supply of zero-day exploits
(Source: Symantec)
Zero-Day Attacks: Open Questions
Decade-long open questions
• How common are zero-day attacks?
• How long can they remain undiscovered?
• What happens after disclosure?
[Figure: vulnerability timeline. Labels: Creation; Exploit used in attacks; Vulnerability disclosed (“day zero”); Security patch released; All hosts patched. Annotations: Zero-day attack; Prior work [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012]]
Zero-Day Attacks: Open Questions (cont’d)
[Figure: vulnerability timeline, in weeks. Labels: Creation; Vulnerability disclosed (“day zero”); Exploit used in attacks; Security patch released; All hosts patched. Before disclosure: targeted attacks (rare events). After disclosure: large-scale attacks.]
Decade-long questions: Why still open?
• Rare events, hard to observe in small data sets
• Need data analysis at scale
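Why rare events call for large data sets can be seen with a little arithmetic. A rough sketch, assuming (hypothetically) that each monitored host is hit independently with the same small probability; the rate and the sample sizes below are made-up illustration values, not measurements.

```python
def prob_at_least_one(rate: float, n_hosts: int) -> float:
    """Probability that a sample of n_hosts contains at least one affected host,
    assuming each host is affected independently with probability `rate`."""
    return 1.0 - (1.0 - rate) ** n_hosts


# Hypothetical rate: one affected host per million.
rate = 1e-6
for n in (1_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} hosts: {prob_at_least_one(rate, n):.3f}")
```

Under these assumptions a sample of a thousand hosts almost never contains the event, while a sample of ten million almost always does.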
Research in Security Data Science
Challenge 1: Find the needle in the haystack
– Example: Identify and measure zero-day attacks
Challenge 2: Ensure generally applicable and repeatable results
– The threat landscape changes frequently
Challenge 3: Deal with new and advanced threats
– Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation
[…]
[Figure: Variants (axis ticks 10, 10^3, 10^5) versus time in weeks relative to T0 (-100 to 150). Annotations: 403 million new malware variants created in 2011; Targeted attacks before disclosure; Rare events]
Your thesis topic goes here
What is Security Data Science? (re-visited)
• Systems knowledge: develop technologies needed to store and process massive data sets
• Statistics & machine learning knowledge: analyze the data and extract information
• Security knowledge: ask the right questions about cyber attacks
• Data scientists are in high demand in the cybersecurity industry
Booz Allen may be recruiting more [data scientists] than Google or Facebook
The Economist, June 2013
Course Content
• Introduction to Security Data Science
• Hands-on emphasis – this is largely an unexplored research area
– Team-based projects
– Reviews of scholarly publications
– No textbook
• Specific things you can expect to learn
– Selected topics in security
– System skills: Experiment design, data analysis, scalability
– Team skills: Cooperating to achieve your team goals
– Speaking/writing skills: Presenting paper/project findings, providing constructive feedback
This is an Advanced Course
• You are responsible for holding up your end of the educational bargain
– I expect you to attend classes and to complete reading assignments
– I expect you to learn how to analyze data and to try things out for yourself
– I expect you to know how to find research literature on security topics
  • The required readings provide starting points
– I expect you to manage your time
  • In general there will be one written assignment due before each lecture
• Learning material in this course requires participation
– This is not a sit-back-and-listen kind of course; class participation is required for understanding the material and makes up a part of your grade!
• Different grading criteria for graduate and undergraduate students
Reading Assignments
• Readings: 1-2 papers before each lecture
– Not light reading – some papers require several readings to understand
– For next time: C. Kanich et al., 'Spamalytics: An Empirical Analysis of Spam Marketing Conversion,' ACM CCS, 2008.
– Check course web page (still in flux) for next readings and links to papers
• Homeworks: review the papers you read using a defined template
– Submit homework by email to [email protected]
  • We might switch to a Web based submission system in the future
– Due at 6 pm the evening before class
– BibTeX template: Summary, Contributions, Weaknesses, Opinion (optional)
– I will provide feedback on some of your written critiques; no email means your writeup is satisfactory
• In-class discussion: stand up and talk about the papers
– Volunteers are preferred
– Students randomly selected if no volunteers
Discuss …
Do my date and I have long-term potential?
… ask:
Data
Q Do you like horror movies?
Q Have you ever traveled around another country alone?
Q Wouldn't it be fun to chuck it all and go live on a sailboat?
3.7× likelihood of coincidence
275,000 user-submitted questions
34,260 real-world couples
Psychology
Top 3 user-rated questions, about:
• God
• Sex
• Smoking
Course Projects
• Pilot project: two-week individual projects
– Propose a security problem and a data set that you could analyze to solve it
  • Some ideas are available on the web page
– Conduct preliminary data analysis and write a report
– Propose projects by September 9th (soft deadline)
– Submit report by September 18th
• Group project: ten-week group project
– Deeper investigation of promising approaches
– Submit written report and present findings during last week of class
  • 2 checkpoints along the way (schedule on the course web page)
– Form teams and propose projects by September 30th
• Peer reviews: review at least 2 project reports from other students
– Use skills learned from paper reviews
– Post project proposals, reports and reviews on Piazza
Pre-Requisite Knowledge
• Good programming skills
– Knowledge of languages commonly used in data analysis, like Matlab or R, is a plus
– To brush up: ‘Data Analysis and Visualization with MATLAB for Beginners’ seminar, on September 12 at 5pm, Room 1110 Kim Engineering Building
• Ability to come up to speed on advanced security topics
– Covered in the paper readings
– Basic knowledge of security (CMSC 414, ENEE 459C or equivalent) is a plus
• Ability to come up to speed on data analytics
– Lectures provide light-duty tutorials, but you will need to pick up the details as you go along
Policies
• “Showing up is 80% of life” – Woody Allen
– Participation in in-class discussions is required for full credit
– You can get an “A” with a few missed assignments, but reserve these for emergencies (conference trips, waking up sick, etc.)
– Notify the instructor if you need to miss a class, and submit your homework on time
• UMD’s Code of Academic Integrity applies, modified as follows:
– Complete your homework entirely on your own. After you hand in your homework, you are welcome (and encouraged) to discuss it with others
– Discuss the problems and concepts involved in the project, but produce your own project implementation, report and presentation
  • Group projects are the result of team work
• See class web site for the official version
Classroom Protocol
• Please arrive on time; lecture begins promptly
– I also promise to end on time
– Handouts, readings and homework templates are posted on the class web page
• Questions are encouraged
– If you don’t understand, ask; other students are probably struggling too
– Explain the content of your reading assignment, and the underlying reasoning, to the rest of the class
– Your reasons don't have to be “right” – you just have to be able to explain them
• There is no way to cover everything
– If there is an interesting aspect that we do not cover in class, feel free to incorporate that in your projects
Grading Criteria
• Straight scale: A≥90; B≥80; C≥70; D<70
– 50% Written paper critique and class discussion
  • 24 assignments x 2 points each + 2 points for this lecture
– 50% Projects
  • 30 points for group project, 10 points for pilot project, 10 points for project reviews
– 10% Subjective evaluation
• Expectations
– Graduate students: you can explain the contributions and weaknesses of the papers you read
– Undergraduates: you demonstrate a general understanding of the papers
• Unsatisfactory participation means:
– You did not read the papers
– You did not produce a working implementation for your project, or you do not understand how the implementation works
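The point arithmetic on this slide can be checked with a short sketch. It covers only the two point-based components and the straight scale; the slide does not say how the subjective evaluation is combined with them, so it is left out here (an assumption of this sketch). The function names and the sample student are hypothetical.

```python
# Point values as listed on the slide.
PAPER_POINTS = 24 * 2 + 2      # 24 assignments x 2 points + 2 points for this lecture
PROJECT_POINTS = 30 + 10 + 10  # group project + pilot project + project reviews


def letter_grade(score: float) -> str:
    """Map a score to a letter on the straight scale A>=90, B>=80, C>=70, D<70."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "D"


def total_score(paper_points: float, project_points: float) -> float:
    """Sum of the two point-based components; subjective evaluation is NOT modeled."""
    if not 0 <= paper_points <= PAPER_POINTS:
        raise ValueError("paper_points out of range")
    if not 0 <= project_points <= PROJECT_POINTS:
        raise ValueError("project_points out of range")
    return paper_points + project_points


# Hypothetical student: misses three paper reviews and loses 4 project points.
score = total_score(PAPER_POINTS - 3 * 2, PROJECT_POINTS - 4)
print(score, letter_grade(score))
```

Each component works out to 50 points, which matches the two 50% weights.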
Review of Lecture
• What did we learn?
– Data analytics provide real benefits
– Analyzing large data sets allows tackling long-standing hard problems
– Difference between security principles and security in practice
– Examples of security problems that require insights from large data sets
• I want to emphasize
– This is a systems course, not a pen-and-paper course
– You will be expected to build a real, working, data analysis tool
• What’s next?
– Basic statistics and experimental design
– Pilot project: proposal, approach, expectations
• Deadline reminder
– Post pilot project proposal on Piazza by Monday (soft deadline)
– First homework due on Sunday at 6 pm