Securerank ping-opendns


Big Data for Security

Ping @opendns.com

Umbrella Security Lab @OpenDNS

•  100+ sensors across 200+ countries
•  200 million unique registered domain names
•  40 million active users
•  50 billion daily DNS requests

The Platform

HDFS, HBase

Kafka, Storm

native MR

Pig, Hive

Python

Production

Backup

Analytics

google.com

labs.kaspersky.com

gpioegjrhsf.ws

pbqxdwwv.ws

usirk.ws

dncdh.nl

hzkfooak.cn

jflyyruea.com

15.83.5.1

128.13.18.67

154.1.32.15

62.8.20.54

Transaction View of DNS Lookups

DNS vs. Retail

– Amazon’s Collaborative Filtering
– Apriori algorithm (frequent item set mining)
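To make the retail analogy concrete, here is a minimal sketch that treats each client IP as a "basket" and the domains it looked up as "items", then counts how many baskets contain each pair of domains. This is only the pair-counting pass that frequent item set mining starts from, not the full Apriori algorithm. The log entries and the function names `build_transactions` and `pair_support` are made up for illustration; the IPs and domains are borrowed from the example graph.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_transactions(query_log):
    """Group (client_ip, domain) pairs into one set of domains per client IP."""
    baskets = defaultdict(set)
    for client_ip, domain in query_log:
        baskets[client_ip].add(domain)
    return dict(baskets)


def pair_support(baskets):
    """Count, for every pair of domains, how many clients looked up both."""
    counts = Counter()
    for domains in baskets.values():
        for pair in combinations(sorted(domains), 2):
            counts[pair] += 1
    return counts


# Hypothetical log entries (assumption: one tuple per DNS lookup).
query_log = [
    ("15.83.5.1", "google.com"),
    ("15.83.5.1", "labs.kaspersky.com"),
    ("128.13.18.67", "google.com"),
    ("128.13.18.67", "pbqxdwwv.ws"),
    ("154.1.32.15", "pbqxdwwv.ws"),
    ("154.1.32.15", "usirk.ws"),
    ("62.8.20.54", "pbqxdwwv.ws"),
    ("62.8.20.54", "usirk.ws"),
]

baskets = build_transactions(query_log)
support = pair_support(baskets)
```

In this toy log, the pair (`pbqxdwwv.ws`, `usirk.ws`) is shared by two clients, which is the kind of co-occurrence a basket analysis would surface.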

Modeling Methodologies

Data abstraction/representation (link graph, social graph …)

Behavior abstraction (random walk)

Reasoning tech (generative, empirical, iterative, recursive …)

Client IP

domain

Page rank

•  The more a page is linked by good pages, the higher it is ranked
•  One type of node
•  One node can have both inlinks and outlinks
•  Most nodes link to a limited number of other nodes
•  Pages are not classified

DNS transactions

•  The less a domain is visited by good clients, the higher the chance it is bad
•  Two types of node
•  A node is either visiting or being visited, but never both
•  There are super nodes that link to millions of other nodes
•  Domains are classified as benign, malicious, or unknown

Page rank

•  Damping factor (the user gets bored)
•  Random sinks and cycles
•  Page ranks are numbers between 0 and 1 that sum to one in total
•  Linkage matrix N×N (N being the total number of pages)

DNS transactions

•  Domains visited by more good visitors are ranked high (inlink)
   – Assign a “positive” initial value
•  Visitors visiting more good domains are ranked high (outlink)
   – Assign a “positive” initial value
•  Linkage matrix N×M (N being the total number of domains, M being the total number of IPs)
•  Potentially, we can consider query count as linkage weight
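As a rough illustration of the N×M linkage matrix with query counts as weights, the sketch below stores only the non-zero cells in a dictionary keyed by (domain index, IP index). The function name `build_linkage`, the `(ip, domain)` tuple layout of the log, and the sample entries are assumptions made for this example.

```python
from collections import Counter


def build_linkage(query_log):
    """Return (domains, ips, weights).

    weights[(i, j)] is the number of queries from ips[j] to domains[i];
    pairs that never occur are left out, so the matrix stays sparse.
    """
    pair_counts = Counter(query_log)  # (ip, domain) -> query count
    domains = sorted({domain for _, domain in pair_counts})
    ips = sorted({ip for ip, _ in pair_counts})
    d_index = {domain: i for i, domain in enumerate(domains)}
    ip_index = {ip: j for j, ip in enumerate(ips)}
    weights = {
        (d_index[domain], ip_index[ip]): count
        for (ip, domain), count in pair_counts.items()
    }
    return domains, ips, weights


# Hypothetical query log: one (ip, domain) tuple per DNS request.
sample_log = [
    ("15.83.5.1", "google.com"),
    ("15.83.5.1", "google.com"),
    ("62.8.20.54", "google.com"),
    ("62.8.20.54", "usirk.ws"),
]
domains, ips, weights = build_linkage(sample_log)
```

Here N = 2 domains and M = 2 IPs, and only three of the four cells are stored because `15.83.5.1` never queried `usirk.ws`.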

Recursive definition

$$r(d_n)_{t+1} = \sum_{ip \text{ visiting } d_n} \frac{r(ip)_t}{L(ip)}$$

•  The sum runs over all ips visiting domain dn
•  r(ip)_t is the rank for ip at time t
•  L(ip) is the total number of domains ip connects to (in a certain time window)

$$r(ip)_{t+1} = \sum_{d_n \text{ visited by } ip} \frac{r(d_n)_t}{L(d_n)}$$

•  The sum runs over all domains dn visited by ip
•  r(dn)_t is the rank for dn at time t
•  L(dn) is the total number of ips visiting domain dn (not variant by time)

The denominator gives the marginal (the sum of the counts of the conditioning variable co-occurring with anything else)

•  Recursive definition
•  Build linkage matrix
•  Initialization
•  Iterating
•  Test for convergence
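The steps above can be sketched in a few lines of single-machine Python. This is a minimal sketch under stated assumptions, not the production implementation: every node starts from the same rank of 1.0, convergence is measured as the largest absolute change of any rank, and the names `secure_rank`, `tol`, and `max_iter` are hypothetical. Because the graph is bipartite, simultaneous updates can oscillate when the two sides start with different total rank, so the loop is also capped at `max_iter` rounds.

```python
def secure_rank(ip_links, initial_rank=1.0, max_iter=100, tol=1e-9):
    """Iterate the recursive rank definition on a bipartite ip/domain graph.

    ip_links maps each client ip to the set of domains it queried.
    Returns (domain_ranks, ip_ranks, converged).
    """
    # Build linkage in both directions: ip -> domains and domain -> ips.
    dn_links = {}
    for ip, domains in ip_links.items():
        for dn in domains:
            dn_links.setdefault(dn, set()).add(ip)

    # Initialization: every node starts from the same rank (an assumption).
    r_ip = {ip: initial_rank for ip in ip_links}
    r_dn = {dn: initial_rank for dn in dn_links}

    for _ in range(max_iter):
        # r(dn)_{t+1}: sum over visiting ips of r(ip)_t / L(ip).
        new_dn = {dn: sum(r_ip[ip] / len(ip_links[ip]) for ip in ips)
                  for dn, ips in dn_links.items()}
        # r(ip)_{t+1}: sum over visited domains of r(dn)_t / L(dn).
        new_ip = {ip: sum(r_dn[dn] / len(dn_links[dn]) for dn in domains)
                  for ip, domains in ip_links.items()}
        # Convergence test: largest absolute change of any rank.
        delta = max(
            max(abs(new_dn[k] - r_dn[k]) for k in r_dn),
            max(abs(new_ip[k] - r_ip[k]) for k in r_ip),
        )
        r_dn, r_ip = new_dn, new_ip
        if delta < tol:
            return r_dn, r_ip, True
    return r_dn, r_ip, False


# Toy graph (made up): ip1 queries d1 only, ip2 queries d1 and d2.
toy_links = {"ip1": {"d1"}, "ip2": {"d1", "d2"}}
domain_ranks, ip_ranks, converged = secure_rank(toy_links)
```

On this toy graph the ranks settle with d1, which is visited by both clients, ranked above d2.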

Link analysis – build sparse linkage matrix (row-wise)

input: query log (each entry: client ip to hostnames)
output:
    dn -> ip ip ip ip
    ip -> dn dn dn dn

//STRIPE DESIGN
//map job: parse query entry, filter bad hostname, convert hostname to domain
    emit [key(domain), value(ip)]
    emit [key(ip), value(domain)]
//reduce job:
    emit [key(domain), value(ip ip ip)]
    emit [key(ip), value(domain domain domain)]
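The map and reduce steps of the link-building job can be imitated in plain Python, with a dictionary standing in for Hadoop's shuffle phase. Everything specific here is an assumption for illustration: the `("dn", ...)` / `("ip", ...)` key tags, the naive `to_domain` rule that keeps the last two labels (it ignores multi-label suffixes such as co.uk), and the placeholder `is_bad_hostname` filter.

```python
from collections import defaultdict


def to_domain(hostname):
    """Naive hostname -> domain: keep the last two labels (assumption)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def is_bad_hostname(hostname):
    """Placeholder filter: reject empty names and names without a dot."""
    return not hostname or "." not in hostname.strip(".")


def map_query(entry):
    """Map: one query-log entry (client_ip, hostname) -> two keyed records."""
    client_ip, hostname = entry
    if is_bad_hostname(hostname):
        return
    domain = to_domain(hostname)
    yield ("dn", domain), client_ip
    yield ("ip", client_ip), domain


def reduce_links(key, values):
    """Reduce: collapse all values seen for a key into one sorted link list."""
    return key, sorted(set(values))


def run_link_job(query_log):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for entry in query_log:
        for key, value in map_query(entry):
            grouped[key].append(value)
    return dict(reduce_links(k, v) for k, v in grouped.items())


# Hypothetical query-log entries.
links = run_link_job([
    ("15.83.5.1", "www.google.com"),
    ("15.83.5.1", "labs.kaspersky.com"),
    ("128.13.18.67", "mail.google.com"),
    ("128.13.18.67", "localhost"),
])
```

Each input entry produces two records, so both directions of the linkage (dn -> ips and ip -> dns) come out of a single pass over the log.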

Iterating – MapReduce iteration #n

map
– input: key (domain), value (pagerank ip ip ip), or key (ip), value (pagerank dn dn dn dn)
– output: key (ip/domain), value (x = pagerank/linklist.size())

reduce
– input: key (domain/ip), values (x) //x as defined above
         key (domain/ip), value (x ip ip ... ip)
– output: key (domain/ip), value (Σx ip ip ip)
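One iteration can be simulated locally as follows. This is a sketch, assuming each record is a `(rank, previous_rank, links)` tuple and that mapper values are tagged `"rank"` or `"links"` so the reducer can tell contributions apart from the pass-through link list. The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def map_rank(key, record):
    """Map: send rank/len(links) to every neighbor and pass the links through."""
    rank, _previous, links = record
    for neighbor in links:
        yield neighbor, ("rank", rank / len(links))
    yield key, ("links", rank, links)


def reduce_rank(key, values):
    """Reduce: sum incoming contributions; keep old rank and link list."""
    new_rank, old_rank, links = 0.0, None, []
    for value in values:
        if value[0] == "rank":
            new_rank += value[1]
        else:
            _, old_rank, links = value
    # old_rank stays None for a key that received rank but has no record.
    return key, (new_rank, old_rank, links)


def run_iteration(records):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for key, record in records.items():
        for out_key, out_value in map_rank(key, record):
            grouped[out_key].append(out_value)
    return dict(reduce_rank(k, v) for k, v in grouped.items())


# Toy records (made up): every node starts at rank 1.0.
records = {
    "ip1": (1.0, 1.0, ["d1", "d2"]),
    "ip2": (1.0, 1.0, ["d1"]),
    "d1": (1.0, 1.0, ["ip1", "ip2"]),
    "d2": (1.0, 1.0, ["ip1"]),
}
next_records = run_iteration(records)
```

Because the output has the same shape as the input, `run_iteration` can be applied repeatedly, which mirrors how the MapReduce job is re-run once per iteration.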

Hadoop Implementation

•  MapReduce job #1 – Building link lists

•  Iterate MapReduce job #2 – Security ranking

•  MapReduce job #3 – Sorting
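The slides do not spell out the sorting job, so the following is only a guess at its simplest form: order domains by their final rank so that the lowest-ranked (presumably most suspicious) ones come first. The function name `sort_domains`, the sort direction, and the rank values are all assumptions for illustration.

```python
def sort_domains(ranks, lowest_first=True):
    """Return (domain, rank) pairs ordered by rank."""
    return sorted(ranks.items(), key=lambda item: item[1],
                  reverse=not lowest_first)


# Hypothetical final ranks.
final_ranks = {"google.com": 24.0, "pbqxdwwv.ws": -9.5, "usirk.ws": -3.2}
ordered = sort_domains(final_ranks)
```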

Hadoop Job 1 – linkage creation, domain (or ip) mappings

Querylog -> Mapper -> Reducer

Mapper output

key      value
Domain   IP
IP       Domain

Reducer output

key      value (rank, previous rank, links)
IP       1.0  1.0  d d d d
Domain   1.0  1.0  ip ip ip ip

Hadoop Job 2 – Security Ranking (SR)

Input

key   value
IP1   2.3, 1.0, d1, d2, d3
IP2   -9.5, 1.0, d1, d3
d1    24, 1.0, IP1, IP2

Mapper output

key   value
d1    “rank” 2.3/(num_of_links=3)
d1    “rank” -9.5/(num_of_links=2)
d3    “rank” -9.5/(num_of_links=2)
IP1   “links” 2.3, 1.0, d1, d2, d3
IP2   “links” -9.5, 1.0, d1, d3

Reducer output

key   value
d1    2.3/3 + -9.5/2, 24, IP1, IP2

Updating security rank

SR = Σ SRi/Ki, summed over each linking entity i, Ki being the number of outlinks of entity i
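As a quick check of the update rule against the Job 2 example, the snippet below recomputes the new rank of d1 from IP1 (rank 2.3, three outlinks) and IP2 (rank -9.5, two outlinks). The helper name `security_rank` is hypothetical.

```python
def security_rank(inlinks):
    """Sum rank_i / k_i over (rank_i, k_i) pairs, k_i being entity i's outlink count."""
    return sum(rank / k for rank, k in inlinks)


# d1 is linked by IP1 (2.3, 3 outlinks) and IP2 (-9.5, 2 outlinks).
d1_rank = security_rank([(2.3, 3), (-9.5, 2)])
```

The result is about -3.98: the strongly negative client IP2 outweighs the mildly positive IP1, so d1's rank drops from its previous value of 24.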

Risks/Issues

•  Behavior changes. A machine can be infected at any minute. Is a day or an hour a good window to measure the “cleanness” of a client?

•  Noise

•  Each individual source is one client IP, which may be a single user or machine or many (e.g., school Wi-Fi, where no consistent client visiting behavior can be obtained). Do these IPs introduce noise, or are they the ones bringing in the most likely malicious connections?

•  Massive detection: is it massive FP (false positives)?

Take-away

•  Graph-based discovery

•  Take a different view of your data

•  Machine Learning at a different scale
