AppSec USA 2014 Denver, Colorado Catch me if you can Machine Learning, VMs, honeypots and more..


Ph.D. CSE – works at CloudFlare

Anirban Banerjee
• San Francisco
• Web-malware detection
• Machine learning, scalable systems
• Interface with hosting industry
• Co-founder of StopTheHacker
• Post-acquisition at CloudFlare
• Interested in malware detection, RE
• Various talks at HostingCon, Parallels Summit

Introduction

• StopTheHacker
• CloudFlare
• Web Malware – Existing Tools Fail
• Web Malware – Attack Vectors
• Identification
• Scaling honeypots
• Machine Learning

Quick Overview

• StopTheHacker
  – Founded in 2009
  – Funded by NSF
  – Identifies and cleans web malware automatically
  – Partners with hosters
  – Uses machine learning, pattern matching, AVs, VMs
• CloudFlare
  – DDoS protection
  – CDN
  – WAF
  – Cloud solution
  – Contributes to NGINX
  – Uses Lua, Go
  – 5->7% of Internet traffic daily

StopTheHacker - CloudFlare

• AVs
  – Polymorphic malware
  – Malware checks for AV processes
  – Avast, ClamAV, AVG
  – Linux versions seem to be updated less frequently
• Pattern Matching
  – Trivial to change code structure
  – Trivial to change commands
  – Yara, Perl, Grep, Awk

Web Malware - Existing Tools Fail
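To make the "trivial to change code structure" point concrete, here is a minimal sketch of a fixed pattern missing two rewrites of the same injected iframe. The signature, the sample markup and the `bad.example` host are all hypothetical and chosen only for illustration; they are not taken from any real rule set.

```python
import re
from urllib.parse import quote

# Hypothetical signature for one literal form of a hidden-iframe injection.
SIGNATURE = re.compile(r'<iframe src="http://[^"]+" width="0" height="0">')

ORIGINAL = '<iframe src="http://bad.example/x" width="0" height="0"></iframe>'

SAMPLES = {
    "original": ORIGINAL,
    # Same effect in a browser; only attribute order and quoting differ.
    "reordered": "<iframe height='0' width='0' src='http://bad.example/x'></iframe>",
    # Same markup, percent-encoded and written out at runtime by JavaScript.
    "encoded": '<script>document.write(unescape("%s"))</script>'
               % quote(ORIGINAL, safe=""),
}


def signature_hits(samples):
    """Return, per sample name, whether the fixed signature matches it."""
    return {name: bool(SIGNATURE.search(body)) for name, body in samples.items()}


hits = signature_hits(SAMPLES)
```

Only the original form matches. Loosening the pattern helps with the reordered variant, but the encoded variant never contains the literal tag at all, which is the limitation the slide is describing.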

• Via website
  – SQL injection
  – XSS
  – Ads
  – 3rd-party libraries
  – Themes
  – Plugins
• Bypass
  – FTP creds
  – Apache modules
  – SEO poisoning

Web Malware – Attack Vectors

• Making it a bit harder
  – Custom WP packages, e.g. DreamPress
  – Auto upgrades
  – WAFs
  – Proper separation of web server and CMS roles
  – End clients must be educated
  – *Some* default scanning for *every* site
    • Free to end client
  – Web-malware collaboration group (SBW)

Web Malware – Attack Vectors

Web malware
• High churn
  – Iframe targets
  – Fast flux networks
  – Encoded, encrypted, randomly generated domains
  – PHP code changes

Binaries
• Low churn
  – Primarily PE32/Win
  – Target old IE exploits
  – Spyware/adware more than malware
  – FTP sniffers, IRC drop

Identification - Highlights
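One common way to flag the "randomly generated domains" mentioned above is to score how random the leftmost label looks. The sketch below uses Shannon entropy of the characters; the technique, the thresholds and the function names are assumptions made for illustration, not something the talk specifies.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of the given string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_generated(domain: str, threshold: float = 3.5, min_length: int = 10) -> bool:
    """Heuristic: a long, high-entropy first label suggests a generated name.

    Both threshold and min_length are illustrative values, not tuned ones.
    """
    label = domain.lower().split(".")[0]
    return len(label) >= min_length and shannon_entropy(label) >= threshold
```

A score like this is a single weak signal. Short labels and dictionary-based generators slip past it, so it would normally be one feature among several rather than a verdict on its own.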

Web malware
• Detection is hard
  – What is malware? Redirection, binary drop, registry modification...
  – PHP, ASP, Shell, Perl, Python, Ruby...
  – Malware is smart: UA, GeoIP, time of day, only once per IP...
  – Blacklists very outdated
  – AVs have very poor catch rate

Identification – Challenges
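The "malware is smart" point can be illustrated with a toy model of a cloaking page and a scanner that compares responses across visitor profiles. Everything here (the `CloakingSite` class, the targeting rule, the payload string) is a hypothetical simulation, with the time-of-day condition left out for brevity.

```python
from dataclasses import dataclass, field

MALICIOUS = "<script src='http://bad.example/drop.js'></script>"
CLEAN = "<p>Welcome</p>"


@dataclass
class CloakingSite:
    """Toy compromised page: serves the payload once per IP, to targeted visitors only."""
    seen_ips: set = field(default_factory=set)

    def fetch(self, ip: str, user_agent: str, country: str) -> str:
        first_visit = ip not in self.seen_ips
        self.seen_ips.add(ip)
        targeted = "MSIE" in user_agent and country == "US"
        return MALICIOUS if (first_visit and targeted) else CLEAN


def scan(site, profiles):
    """Fetch the page once per (ip, user_agent, country) profile."""
    return {profile: site.fetch(*profile) for profile in profiles}


def is_conditional(responses) -> bool:
    """Differing bodies across profiles suggest visitor-dependent content."""
    return len(set(responses.values())) > 1


PROFILES = [
    ("198.51.100.1", "curl/7.0", "DE"),
    ("198.51.100.2", "Mozilla/4.0 (compatible; MSIE 8.0)", "US"),
]
```

A scanner that only ever fetches with the first profile sees a clean page every time. Re-scanning from the same IPs also comes back clean, which is why a single crawl from a fixed address tends to miss this kind of infection.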

Scaling honeypots

[Diagram: honeypot architecture]
• Bare Metal, OS, Docker, Container
• dev.go.com
• Bad Hacker, Bad Bot
• BL, WAF, Front End, Public API
• IP, file deposited etc.
• Host content, tripwire, analyze binary
• WP 3.6.1, 3.7, 2.8, 3.0, Joomla, Drupal, Django: any flavor we want
• Cuckoo-based VM: Windows binaries and honeypot

Yes
• Docker – common library re-use
• Spawn thousands of instances on one rack
• Any flavor of CMS you like
• Watchdog for file system changes
• Dropped files shipped off to Cuckoo VM
• Complete trace, screenshots with specific IE version

Scaling honeypots - Is this better
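The "watchdog for file system changes" step could be approximated by hashing a container's web root before and after exposure and diffing the two snapshots. This is a minimal polling sketch under that assumption; the function names are hypothetical, and handing the new files to a sandbox is deliberately left out.

```python
import hashlib
import os


def snapshot(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state


def diff(before, after):
    """Report files that appeared, disappeared, or changed between snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            p for p in before.keys() & after.keys() if before[p] != after[p]
        ),
    }
```

Files listed under "added" or "modified" are the candidates for further analysis. A production watchdog would more likely use kernel file-event notifications than polling, but the comparison logic stays the same.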

Constant cat-and-mouse game
• Rotate IPs, avoid customer IPs
• Juicy target for DDoS (400+ Gigs/s)
• Keep up with new variants
• Malware getting smarter, checks for VM
• Malware targets mobile devices

Scaling honeypots - Challenges
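For "rotate IPs, avoid customer IPs", one simple approach is to walk an address pool, skip anything inside a customer range, and cycle over what remains. The sketch below uses Python's standard `ipaddress` module; the pool, the customer range and the function names are made-up examples built on documentation addresses.

```python
import ipaddress
import itertools


def honeypot_addresses(pool_cidr, customer_cidrs):
    """Yield usable host addresses from the pool that fall outside customer ranges."""
    customers = [ipaddress.ip_network(c) for c in customer_cidrs]
    for addr in ipaddress.ip_network(pool_cidr).hosts():
        if not any(addr in net for net in customers):
            yield addr


def rotation(pool_cidr, customer_cidrs):
    """Return an endless iterator over the allowed addresses."""
    allowed = list(honeypot_addresses(pool_cidr, customer_cidrs))
    if not allowed:
        raise ValueError("no honeypot addresses left after excluding customers")
    return itertools.cycle(allowed)


# Hypothetical pool and customer range (RFC 5737 documentation space).
ALLOWED = [str(a) for a in honeypot_addresses("192.0.2.0/29", ["192.0.2.0/30"])]
```

Calling `next()` on the iterator returned by `rotation` hands out the next honeypot address and wraps around when the pool is exhausted.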

Helps identify the unseen
• Need a dataset
  – Offensive Computing, VirusTotal, blacklists...
• Analyze what is important
  – Reduce noise
  – More features is not always better
  – PCA-type experiments
  – Use rules of thumb: forests/trees
  – scikit-learn/PyBrain/Weka is your friend

Machine Learning
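As a small illustration of "reduce noise" and "more features is not always better", the sketch below drops features whose values barely vary across samples. This is plain variance thresholding, a much simpler stand-in for the PCA-type experiments the slide mentions, and the feature names and numbers are invented.

```python
from statistics import pvariance


def drop_low_variance(rows, names, min_variance=0.01):
    """Keep only feature columns whose population variance exceeds min_variance.

    rows  : list of equal-length numeric feature vectors, one per sample
    names : feature names, in column order
    """
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > min_variance]
    kept_names = [names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced


# Hypothetical per-page features: has_html_tag is 1 for every sample, so it
# carries no information for separating clean pages from infected ones.
NAMES = ["iframe_count", "has_html_tag", "eval_count"]
ROWS = [[0, 1, 0], [3, 1, 5], [1, 1, 0], [4, 1, 7]]

KEPT_NAMES, REDUCED = drop_low_variance(ROWS, NAMES)
```

A constant column adds dimensions without adding signal. Removing such columns first keeps later experiments, whether PCA or tree-based importance ranking, focused on features that can actually distinguish samples.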

Toolkit strategy
• PyBrain
  – Use for clustering, neural network
  – Identify what clusters are present
• scikit-learn/Weka
  – Use for classification
  – Constant retraining needed: high recall, precision
  – Feedback-loop-based system is important

Machine Learning
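The feedback loop can be sketched as: compare the classifier's verdicts with labels confirmed later, compute precision and recall, and flag retraining when either drops too low. The 0.9 thresholds and the function names below are assumptions for illustration only.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for binary labels where 1 means malicious."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def needs_retraining(predicted, actual, min_precision=0.9, min_recall=0.9):
    """Signal retraining when either metric falls below its (illustrative) floor."""
    precision, recall = precision_recall(predicted, actual)
    return precision < min_precision or recall < min_recall
```

For example, `needs_retraining([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` returns `True`, because one false positive and one false negative put both metrics at two thirds.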

What is the benefit?
• Fuzzed iframes caught easily
• Fuzzed/encoded PHP/JS caught easily
• Catches ad misbehavior
• Catches binary that is missed by AV but tries to do “obvious” bad things
• Let's move away from signatures

Machine Learning

Is it all roses and honey?
• No: constant retraining needed
• Has to be able to get large dataset
  – Features increase, exponential increase in data
• CPU needed
• Near-real-time very hard
• Toolkits are good, but can be better

Machine Learning

Right now
• PyBrain
  – Use for clustering, neural network
  – Identify what clusters are present
• scikit-learn/Weka
  – Use for classification
  – Constant retraining needed: high recall, precision
  – Feedback-loop-based system is important

Current Status and Future Plans

Future Plans
• Inline ML for WAF
• More focus on mobile malware
• More focus on DDoS malware
• More focus on using ML: traffic anomalies

Current Status and Future Plans

The road ahead
• Make VM detection harder
• Use on-metal type solution: performance!
• Investigate Go for inline traffic processing
• Potentially open source portions of code
• Automated malware collection at massive scale

More work needed

Q&A
[email protected]

That’s it folks

Date post:	02-Jan-2016
