Network-Based Spam Filtering Anirudh Ramachandran Nick Feamster Georgia Tech.

Date post: 27-Mar-2015
Network-Based Spam Filtering

Anirudh Ramachandran, Nick Feamster

Georgia Tech


Spam

• 75-90% of all email traffic
  – PDF spam: ~11% and growing
  – Content filters cannot catch it!

• Late 2006: “there was a significant rise in spammers’ use of botnets, armies of PCs taken over by malware and turned into spam servers without their owners realizing it.”

• August 2007: Botnet-based spam caused volumes to increase 53% from the previous day

Source: NetworkWorld, August 2007


More Than Just a Nuisance

• As of August 2007, one in every 87 emails constituted a phishing attack

• Targeted attacks on the rise
  – 20k-30k unique phishing attacks per month
  – Spam targeted at CEOs, social networks on the rise


One Approach: Filtering

• Prevent traffic from reaching users’ inboxes by distinguishing spam from ham

• Key question: What features best differentiate spam from legitimate mail?
  – Content
  – IP address of sender
  – Behavioral features


Content-Based Filtering is Malleable

• Low cost of evasion: Spammers can easily alter features of an email’s content

• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated


This Talk: Network-Based Filtering

• Filter email based on how it is sent, in addition to simply what is sent.

• Network-level properties are more fixed
  – Hosting or upstream ISP (AS number)
  – Botnet membership
  – Location in the network
  – IP address block

• Challenge: Which properties are most useful for distinguishing spam traffic from legitimate email?

Very little (if anything) is known about these characteristics


Talk Outline

• Study current sending and mitigation techniques
  – Network-level behavior of spammers
  – The effectiveness of IP-based blacklists

• Design behavior-based filtering techniques
  – Behavioral blacklisting
    • General idea
    • First trial of system on basic set of features
  – Joint work with Santosh Vempala

• Deploy distributed monitoring system to learn distinguishing features “on the fly”


Studying Sending Patterns

• Where is the spam coming from?
  – What IP address space?
  – ASes?
  – What are the OSes of the senders?

• What techniques?
  – Botnets
  – Short-lived route announcements
  – Shady ISPs

• Capabilities and limitations?
  – Bandwidth
  – Size of botnet army


BGP Spectrum Agility

• Log IP addresses of SMTP relays
• Join with BGP route advertisements seen at network where spam trap is co-located

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes

Prefix        AS
61.0.0.0/8    4678
66.0.0.0/8    21562
82.0.0.0/8    8717

~ 10 minutes

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
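
The join between logged relay IPs and route advertisements can be sketched roughly as below. This is an illustration only, not the authors' tooling: the record layouts, the function name `spam_from_short_lived_routes`, and the 600-second lifetime threshold (picked to match the ~10 minute annotation) are all assumptions.

```python
import ipaddress


def spam_from_short_lived_routes(spam_events, announcements, max_lifetime=600):
    """Return spam events whose source IP was covered by a short-lived route.

    spam_events:   list of (timestamp, ip_string)            -- hypothetical layout
    announcements: list of (prefix, announce_ts, withdraw_ts) -- hypothetical layout
    max_lifetime:  seconds; an assumed cutoff for "short-lived"
    """
    # Keep only announcements that were withdrawn within max_lifetime seconds.
    short = [
        (ipaddress.ip_network(prefix), start, end)
        for prefix, start, end in announcements
        if end - start <= max_lifetime
    ]
    hits = []
    for ts, ip in spam_events:
        addr = ipaddress.ip_address(ip)
        for net, start, end in short:
            # The spam must arrive while the route is actually announced.
            if addr in net and start <= ts <= end:
                hits.append((ts, ip, str(net)))
                break
    return hits


announcements = [("61.0.0.0/8", 1000, 1500), ("66.0.0.0/8", 0, 100000)]
spam_events = [(1200, "61.1.2.3"), (1200, "66.1.2.3"), (2000, "61.9.9.9")]
hits = spam_from_short_lived_routes(spam_events, announcements)
```

Only the first event qualifies here: the 66.0.0.0/8 route is long-lived, and the third message arrives after the 61.0.0.0/8 route was withdrawn.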


Why Such Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)


Characteristics of IP-Agile Senders

• IP addresses are widely distributed across the /8 space

• IP addresses typically appear only once at our sinkhole

• Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked

• Some IP addresses were in allocated, albeit unannounced space

• Some AS paths associated with the routes contained reserved AS numbers


Lessons for Improving Spam Filters

• IP-Based Blacklists are Becoming Less Effective

• Effective spam filtering requires
  – A better notion of end-host identity
  – Filtering based on features that are more persistent

• Some features may require network-wide monitoring capabilities


Two Parts

• Study the network-level behavior of spammers
  – Majority of spam comes from a very small portion of the Internet address space
  – Most coming from Windows hosts
  – Most senders low volume to our domain
  – Conventional blacklists somewhat ineffective

• Develop behavior-based filtering techniques
  – Behavioral blacklisting


Two Metrics for Evaluating Blacklists

• Completeness: The fraction of spamming IP addresses that are listed in the blacklist

• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
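
The two metrics can be computed along these lines. This is a minimal sketch under assumed inputs: the function names and the dictionary layouts (IP mapped to a timestamp in seconds) are hypothetical, not taken from the talk.

```python
def completeness(spam_ips, listed_ips):
    """Fraction of distinct spamming IPs that appear in the blacklist."""
    spam = set(spam_ips)
    if not spam:
        return 0.0
    return len(spam & set(listed_ips)) / len(spam)


def responsiveness(first_spam_ts, listing_ts):
    """Per-IP delay (seconds) between first observed spam and listing.

    first_spam_ts: {ip: time of first spam seen from that IP}
    listing_ts:    {ip: time the blacklist first listed that IP}
    An IP that was never listed maps to None; a negative delay means the
    IP was already listed before its first observed spam.
    """
    delays = {}
    for ip, first_seen in first_spam_ts.items():
        listed_at = listing_ts.get(ip)
        delays[ip] = None if listed_at is None else listed_at - first_seen
    return delays


frac = completeness(["a", "b", "c", "d"], ["a", "b", "x"])
delays = responsiveness({"a": 100, "b": 200}, {"a": 160})
```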


Completeness of IP Blacklists

[Figure: fraction of all spam received vs. number of DNSBLs listing this spammer]

~80% listed on average

~95% of bots listed in one or more blacklists

Only about half of the IPs spamming from short-lived BGP are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists


Completeness and Responsiveness

• 10-35% of spam is unlisted at the time of receipt
• 8.5-20% of these IP addresses remain unlisted even after one month


Problems with Existing Blacklists

• Based on ephemeral identifier (IP address)
  – More than 10% of all spam comes from IP addresses not seen within the past two months
    • Dynamic renumbering of IP addresses
    • Stealing of IP addresses and IP address space
    • Compromised machines

• IP addresses of senders have considerable churn

• Requires a human to first notice the behavior
  – Spamming is compartmentalized by domain and not analyzed across domains


Problem: Changing IP Addresses

[Figure: y-axis is fraction of IP addresses]


Problem: Low Volume to Each Domain

[Figure: amount of spam vs. lifetime (seconds)]

Most bot IP addresses send very little spam, regardless of how long they have been spamming. Observation from a single domain cannot detect them.


SpamTracker: Main Idea and Intuition

• Idea: Blacklist sending behavior (“Behavioral Blacklisting”)
  – Identify sending patterns that are commonly used by spammers

• Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content


SpamTracker Design

• For each sender, construct a behavioral fingerprint

• Cluster senders with similar fingerprints

• Filter new senders that map to existing clusters

[Figure: approach diagram with stages labeled Email, IP x domain x time, Collapse, Cluster, Classify, Lookup, Score]


Building the Classifier: Clustering

• Feature: Distribution of email sending volumes across recipient domains

• Clustering Approach
  – Build initial seed list of bad IP addresses
  – For each IP address, compute feature vector: volume per domain per time interval
  – Collapse into a single IP x domain matrix
  – Compute clusters
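
The feature construction above might look like the sketch below. It assumes that "collapse" means summing each sender's per-domain volume over all time intervals; the slide does not spell out the operation, so that choice, the record layout, and the name `collapse_to_ip_domain` are assumptions.

```python
from collections import defaultdict


def collapse_to_ip_domain(log_records):
    """Collapse (ip, domain, time_slot, count) records into an IP x domain matrix.

    The matrix is returned as {ip: {domain: total_volume}}. Summing over
    time slots is an assumed interpretation of the "collapse" step.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for ip, domain, _time_slot, count in log_records:
        matrix[ip][domain] += count
    # Convert nested defaultdicts to plain dicts for a stable, comparable result.
    return {ip: dict(row) for ip, row in matrix.items()}


records = [
    ("192.0.2.1", "a.example", 0, 2),
    ("192.0.2.1", "a.example", 1, 3),
    ("192.0.2.1", "b.example", 0, 1),
    ("192.0.2.2", "a.example", 0, 4),
]
matrix = collapse_to_ip_domain(records)
```

A sparse dictionary is used because most senders reach only a handful of recipient domains.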


Clustering: Output and Fingerprint

• For each cluster, compute fingerprint vector:

• New IPs will be compared to this “fingerprint”

IP x IP Matrix: Intensity indicates pairwise similarity
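
The formula for the fingerprint vector did not survive in this transcript. One plausible choice, assumed here purely for illustration, is the element-wise mean of the per-domain volume vectors of a cluster's members. The function name and the sparse row layout are hypothetical.

```python
def cluster_fingerprint(member_rows, domains):
    """Average the per-domain volumes of all senders in one cluster.

    member_rows: list of {domain: volume} dicts, one per sender in the cluster
    domains:     ordered list of domains defining the vector layout
    Using the mean as the fingerprint is an assumption, not the slide's formula.
    """
    if not member_rows:
        raise ValueError("a cluster needs at least one member")
    n = len(member_rows)
    return [sum(row.get(d, 0) for row in member_rows) / n for d in domains]


fingerprint = cluster_fingerprint(
    [{"a.example": 2, "b.example": 0}, {"a.example": 4}],
    ["a.example", "b.example"],
)
```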


Classifying IP Addresses

• Given “new” IP address, build a feature vector based on its sending pattern across domains

• Compute the similarity of this sending pattern to that of each known spam cluster
  – Normalized dot product of the two feature vectors
  – Spam score is maximum similarity to any cluster
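
A sketch of this scoring step follows. It normalizes by the length of both vectors (cosine similarity), which keeps scores between 0 and 1 for non-negative volumes; the exact normalization used in the talk is not stated, and the plotted scores evidently use a different scale, so treat the numbers as illustrative. Function names are hypothetical.

```python
import math


def normalized_dot(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def spam_score(sender_vector, cluster_fingerprints):
    """Maximum similarity between a sender and any known spam cluster."""
    return max(
        (normalized_dot(sender_vector, f) for f in cluster_fingerprints),
        default=0.0,
    )


score = spam_score([1, 1], [[1, 0], [1, 1]])
```

Taking the maximum means a sender only has to resemble one known spam cluster to be scored highly.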


Evaluation

• Emulate the performance of a system that could observe sending patterns across many domains
  – Data: Postfix logs
  – Build clusters / train on given time interval
  – Evaluate classification with subsequent data in trace

• Evaluate classification
  – Relative to labeled logs
  – Relative to IP addresses that were eventually listed


Dataset: Summary and Issues

• 30 days of Postfix logs from large email provider
  – Time, remote IP, receiving domain, accept/reject
  – Allows us to observe sending behavior over a large number of domains
  – Problem: About 15% of accepted mail is also spam
    • Creates problems with validating SpamTracker

• 30 days of SpamHaus database in the month following the Postfix logs
  – Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus


Initial Results

• Many single-domain senders

• Large volumes to just a few domains

[Figure: SpamTracker score for ham and spam; annotation: Problems]


Rejected Mails Have Higher Scores

[Figure: SpamTracker score for ham and spam]


SpamTracker and Early Detection

• Compare SpamTracker scores on “accepted” mail to the SpamHaus database
  – About 15% of accepted mail was later determined to be spam
  – Can SpamTracker catch this?

• Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month
  – 65 emails had a score larger than 5 (85th percentile)


Deployment

• Integration with existing infrastructure
  – Deploy SpamTracker as “yet another DNSBL”
  – Existing spam filters use SpamTracker score as an additional feature
  – Advantage: easy deployment

• On the wire
  – Infer connections/email from traffic flow records in individual domains
  – Advantage: Stop mail closer to the source
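
For the DNSBL option, existing filters would query the service the same way they query any other DNS blacklist: reverse the octets of the sender's IPv4 address and prepend them to the list's zone. The sketch below only builds the query name; the zone `spamtracker.example` is a placeholder, not a real service, and how a score would be encoded in the answer is not covered by the talk.

```python
import ipaddress


def dnsbl_query_name(ip, zone="spamtracker.example"):
    """Build the DNS name a filter would look up for an IPv4 sender.

    Follows the usual DNSBL convention: reversed octets followed by the
    list's zone. The default zone is a hypothetical placeholder.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


name = dnsbl_query_name("192.0.2.99")
```

By convention, a listed address resolves to an A record in 127.0.0.0/8, while an unlisted one returns NXDOMAIN, which is what makes this deployment path easy for existing filters to adopt.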


Improving Classification

• Use additional features, and combine them for more robust classification
  – Temporal: interarrival times, diurnal patterns, etc.
  – Spatial: sending patterns of groups of senders

• Improved similarity computation
  – Better similarity metrics
  – Better metrics for detecting “early onset”


Evasion

• Problem: Malicious senders could add noise to a large feature vector
  – Possibility: Use smaller number of trusted domains

• Problem: Malicious senders could change sending behavior to emulate “normal” senders
  – In doing so, they may limit their own effectiveness


Other Questions and Challenges

• Reactivity: Can the features be observed quickly enough to construct the fingerprints?

• Scalability: How can the data be aggregated and collected without imposing too much overhead?

• Reliability: How can SpamTracker be replicated to better defend against attack or failure?

• Sensor placement: From where should we watch spam to ensure that the clusters can be distinguished?

• Symbiosis between botnet detection and spam filtering


Summary

• Spam is on the rise and becoming more clever
  – 12% of spam now PDF spam. Content filters are falling behind
  – Also becoming more targeted

• IP-Based blacklists are evadable
  – Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month
  – Spammers commonly steal IP addresses

• New approach: Behavioral blacklisting
  – Blacklist how the mail was sent, not what was sent

