Post on 12-Jul-2015
transcript
Big Data for Security
Ping @opendns.com
Umbrella Security Lab @OpenDNS
• 100+ sensors across 200+ countries
• 200 million unique registered domain names
• 40 million active users
• 50 billion daily DNS requests
The Platform
HDFS, HBase
Kafka, Storm
Native MR
Pig, Hive
Python
R
Production
Backup
Analytics
google.com
labs.kaspersky.com
gpioegjrhsf.ws
pbqxdwwv.ws
usirk.ws
dncdh.nl
hzkfooak.cn
jflyyruea.com
15.83.5.1
128.13.18.67
154.1.32.15
62.8.20.54
Transaction View of DNS Lookups
DNS vs. Retail
– Amazon’s Collaborative Filtering
– Apriori algorithm (frequent item set mining)
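To make the retail analogy concrete, here is a minimal sketch that treats each client IP as a "basket" of looked-up domains and counts how often pairs of domains co-occur. It is a simplified pair-counting step in the spirit of frequent item set mining, not a full Apriori implementation. The function name `domain_pair_support`, the `min_support` threshold, and the sample lookups are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def domain_pair_support(lookups, min_support=2):
    """Count domain pairs that co-occur in at least `min_support` client baskets.

    lookups: iterable of (client_ip, domain) pairs.
    """
    # Build one "basket" per client IP, as a retailer would per customer.
    baskets = defaultdict(set)
    for client_ip, domain in lookups:
        baskets[client_ip].add(domain)

    # Count every unordered domain pair once per basket.
    counts = Counter()
    for domains in baskets.values():
        for pair in combinations(sorted(domains), 2):
            counts[pair] += 1

    # Keep only pairs that meet the (assumed) support threshold.
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Illustrative data only; the pairing of IPs and domains is made up.
lookups = [
    ("15.83.5.1", "google.com"),
    ("15.83.5.1", "usirk.ws"),
    ("15.83.5.1", "dncdh.nl"),
    ("128.13.18.67", "usirk.ws"),
    ("128.13.18.67", "dncdh.nl"),
    ("154.1.32.15", "google.com"),
]
frequent = domain_pair_support(lookups)
```

With this sample, only the pair looked up together by two different clients survives the threshold.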
Modeling Methodologies
Data abstraction/representation (link graph, social graph …)
Behavior abstraction (random walk)
Reasoning tech (generative, empirical, iterative, recursive …)
?
?
Client IP
domain
PageRank
• The more a page is linked to by good pages, the higher it is ranked
• One type of node
• One node can have both inlinks and outlinks
• Most nodes link to a limited number of other nodes
• Pages are not classified
DNS transactions
• The less a domain is visited by good clients, the higher the chance that it is bad
• Two types of node
• A node is either visiting or being visited, but never both
• There are super nodes that link to millions of other nodes
• Domains are classified as benign, malicious, or unknown
PageRank
• Damping factor (users get bored)
• Random sinks and cycles
• PageRank values are numbers between 0 and 1 that sum to one
• Linkage matrix N×N (N being the total number of pages)
DNS transactions
• Domains visited by more good visitors are ranked high (inlink): assign a “positive” initial value
• Visitors visiting more good domains are ranked high (outlink): assign a “positive” initial value
• Linkage matrix N×M (N being the total number of domains, M being the total number of IPs)
• Potentially, we can consider query count as linkage weight
Recursive definition

$$r(d_n)_{t+1} = \sum_{ip \text{ visiting } d_n} \frac{r(ip)_t}{L(ip)}$$

The sum runs over all IPs visiting domain $d_n$; $r(ip)_t$ is the rank of $ip$ at time $t$; $L(ip)$ is the total number of domains $ip$ connects to (in a certain time window).

$$r(ip)_{t+1} = \sum_{d_n \text{ visited by } ip} \frac{r(d_n)_t}{L(d_n)}$$

The sum runs over all domains $d_n$ visited by $ip$; $r(d_n)_t$ is the rank of $d_n$ at time $t$; $L(d_n)$ is the total number of IPs visiting domain $d_n$ (does not vary with time).

The denominator gives the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).
Tasks
• Recursive definition
• Build linkage matrix
• Initialization
• Iterating
• Test for convergence
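The tasks above can be sketched end to end in a few lines of plain Python. This is only an in-memory illustration under stated assumptions, not the Hadoop implementation. The slides do not give a stopping rule, so the tolerance `tol`, the cap `max_iters`, the uniform initial value, and all function and node names are hypothetical. The cap matters because nothing in the recursion guarantees that the ranks settle on every graph.

```python
from collections import defaultdict


def build_links(edges):
    """Build both sparse link lists from (ip, domain) pairs."""
    domains_of = defaultdict(set)  # ip -> domains it connects to; L(ip) is its size
    ips_of = defaultdict(set)      # domain -> ips visiting it; L(dn) is its size
    for ip, dn in edges:
        domains_of[ip].add(dn)
        ips_of[dn].add(ip)
    return domains_of, ips_of


def step(domains_of, ips_of, rank_ip, rank_dn):
    """One synchronous update: every node sums rank/L over its neighbours."""
    new_dn = {dn: sum(rank_ip[ip] / len(domains_of[ip]) for ip in ips)
              for dn, ips in ips_of.items()}
    new_ip = {ip: sum(rank_dn[dn] / len(ips_of[dn]) for dn in dns)
              for ip, dns in domains_of.items()}
    return new_ip, new_dn


def rank(edges, initial=1.0, tol=1e-9, max_iters=100):
    """Initialize, iterate, and test for convergence (assumed criterion)."""
    domains_of, ips_of = build_links(edges)
    rank_ip = {ip: initial for ip in domains_of}
    rank_dn = {dn: initial for dn in ips_of}
    for iteration in range(1, max_iters + 1):
        new_ip, new_dn = step(domains_of, ips_of, rank_ip, rank_dn)
        # Largest absolute change of any rank in this iteration.
        delta = max(
            max(abs(new_ip[k] - rank_ip[k]) for k in rank_ip),
            max(abs(new_dn[k] - rank_dn[k]) for k in rank_dn),
        )
        rank_ip, rank_dn = new_ip, new_dn
        if delta < tol:
            return rank_ip, rank_dn, iteration, True
    return rank_ip, rank_dn, max_iters, False


# Hypothetical toy graph: ip_a visits dn_x and dn_y, ip_b visits dn_y only.
edges = [("ip_a", "dn_x"), ("ip_a", "dn_y"), ("ip_b", "dn_y")]
rank_ip, rank_dn, iterations, converged = rank(edges)
```

On this toy graph the ranks settle at 2/3 for `dn_x` and `ip_b` and 4/3 for `dn_y` and `ip_a`. In a security setting the initial values would come from known benign or malicious labels rather than a uniform constant.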
Link analysis – build sparse linkage matrix (row-wise)
input: query log (each entry: client ip to hostnames)
output:
  dn -> ip ip ip ip
  ip -> dn dn dn dn
// STRIPE DESIGN
// map job: parse query entry, filter bad hostnames, convert hostname to domain
  emit [key(domain), value(ip)]
  emit [key(ip), value(domain)]
// reduce job:
  emit [key(domain), value(ip ip ip)]
  emit [key(ip), value(domain domain domain)]
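The map and reduce steps above can be simulated in plain Python to show the data flow. This is a sketch only. The hostname check and the "keep the last two labels" rule in `to_domain` are naive assumptions standing in for whatever filtering and domain extraction the real job used, and the rule is wrong for suffixes such as `co.uk`. All function names and the sample query log are hypothetical.

```python
import re
from collections import defaultdict

# Assumed notion of a well-formed hostname label (letters, digits, hyphen).
_LABEL = re.compile(r"[a-z0-9-]{1,63}")


def to_domain(hostname):
    """Naive hostname -> domain conversion; returns None for bad hostnames."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2 or not all(_LABEL.fullmatch(label) for label in labels):
        return None
    return ".".join(labels[-2:])  # ASSUMPTION: the last two labels form the domain


def map_entry(client_ip, hostname):
    """Map job: emit the link in both directions."""
    domain = to_domain(hostname)
    if domain is None:
        return  # filtered out as a bad hostname
    yield domain, client_ip
    yield client_ip, domain


def reduce_links(pairs):
    """Reduce job: collect all values seen for each key into one link list."""
    links = defaultdict(set)
    for key, value in pairs:
        links[key].add(value)
    return {key: sorted(values) for key, values in links.items()}


# Illustrative query log; the pairing of IPs and hostnames is made up.
query_log = [
    ("15.83.5.1", "www.google.com"),
    ("15.83.5.1", "usirk.ws"),
    ("128.13.18.67", "google.com"),
    ("128.13.18.67", "bad_host"),
]
pairs = [kv for ip, host in query_log for kv in map_entry(ip, host)]
link_lists = reduce_links(pairs)
```

Emitting each link in both directions lets a single reduce pass produce both the domain-to-IP and the IP-to-domain lists.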
Iterating – MapReduce iteration #n
map
– input: key (domain), value (pagerank ip ip ip), or key (ip), value (pagerank dn dn dn dn)
– output: key (ip/domain), value (x = pagerank / linklist.size())
reduce
– input: key (domain/ip), values (x) // x as defined above
         key (domain/ip), value (x ip ip ... ip)
– output: key (domain/ip), value (Σx ip ip ip)
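One iteration of this map/reduce pair can be imitated in memory as follows. This is a sketch, not the Hadoop job itself. The record layout `(rank, links)` omits the previous-rank field, and the function names and the ranks given to `d2` and `d3` are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_record(key, rank, links):
    """Map: send an equal share of this node's rank to every linked node,
    and pass the link list through so the reducer can rebuild the record."""
    share = rank / len(links)  # every record has at least one link by construction
    for target in links:
        yield target, ("rank", share)
    yield key, ("links", links)


def reduce_key(key, values):
    """Reduce: sum the incoming shares and reattach the link list."""
    new_rank = 0.0
    links = []
    for tag, payload in values:
        if tag == "rank":
            new_rank += payload
        else:
            links = payload
    return key, new_rank, links


def iterate(records):
    """Run one map + shuffle + reduce pass over {key: (rank, links)}."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for key, (rank, links) in records.items():
        for out_key, value in map_record(key, rank, links):
            grouped[out_key].append(value)
    result = {}
    for key, values in grouped.items():
        _, new_rank, links = reduce_key(key, values)
        result[key] = (new_rank, links)
    return result


# Illustrative records; the d2 and d3 ranks are assumed.
records = {
    "IP1": (2.3, ["d1", "d2", "d3"]),
    "IP2": (-9.5, ["d1", "d3"]),
    "d1": (24.0, ["IP1", "IP2"]),
    "d2": (1.0, ["IP1"]),
    "d3": (1.0, ["IP1", "IP2"]),
}
new_records = iterate(records)
```

Here `d1` receives 2.3/3 from `IP1` and -9.5/2 from `IP2`, while `IP1` receives 24/2 + 1/1 + 1/2 = 13.5 from its three domains.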
Hadoop Implementation
• MapReduce job #1 – Building link lists
• Iterate MapReduce job #2 – Security ranking
• MapReduce job #3 – Sorting
Hadoop Job 1 – linkage creation, domain (or ip) mappings
Input
Query log
Mapper output
key      value
Domain   IP
IP       Domain
Reducer output
key      value (rank, previous rank, links)
IP       1.0 1.0 d d d d
Domain   1.0 1.0 ip ip ip ip
Hadoop Job 2 – Security Ranking (SR)
Input
key   value
IP1   2.3, 1.0, d1, d2, d3
IP2   -9.5, 1.0, d1, d3
d1    24, 1.0, IP1, IP2
Mapper output
key   value
d1    “rank” 2.3/(num_of_links=3)
d1    “rank” -9.5/(num_of_links=2)
d2    “rank” 2.3/(num_of_links=3)
d3    “rank” 2.3/(num_of_links=3)
d3    “rank” -9.5/(num_of_links=2)
IP1   “links” 2.3, 1.0, d1, d2, d3
IP2   “links” -9.5, 1.0, d1, d3
Reducer output
key   value
d1    2.3/3 + -9.5/2, 24, IP1, IP2
Updating security rank
SR = Σ SRi/K, for each outlink, K being the number of outlinks of entity i
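As a worked check of this update rule, take the values from the Job 2 example: IP1 has rank 2.3 and 3 outlinks, IP2 has rank -9.5 and 2 outlinks, and both link to d1. The notation $SR(d_1)$ is introduced here only for illustration.

```latex
\[
SR(d_1) = \frac{2.3}{3} + \frac{-9.5}{2} \approx 0.767 - 4.750 = -3.983
\]
```

The strongly negative rank of IP2 outweighs the small positive contribution from IP1, so d1 ends the iteration with a negative rank.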
Risks/Issues
• Behavior changes. A machine can be infected at any minute. Is a day or an hour a good window to measure the “cleanness” of a client?
• Noise
• Each individual source is one client IP, a user, or a machine (e.g., school Wi-Fi, where no consistent client visiting behavior can be obtained). Are these IPs introducing noise, or are they the ones bringing in the most likely malicious connections?
• Massive detection: is it massive false positives (FP)?
Take-away
• Graph-based discovery
• Take a different view of your data
• Machine learning at a different scale