Date post: 27-Mar-2015
Network-Based Spam Filtering

Nick Feamster, Georgia Tech

with Anirudh Ramachandran, Nadeem Syed, Alex Gray, Sven Krasser, Santosh Vempala


Spam: More than Just a Nuisance

• 75-95% of all email traffic

– Image and PDF Spam increasing (PDF spam ~12% and growing)

– Content filters cannot catch these!

• As of August 2007, one in every 87 emails constituted a phishing attack

• Targeted attacks on the rise
– 20k-30k unique phishing attacks per month

– Spam targeted at CEOs, social networks on the rise

Source: NetworkWorld, August 2007


One Approach to Mitigation: Filtering

• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham

• Question: What features best differentiate spam from legitimate mail?
– Content-based filtering: What is in the mail?
– IP address of sender: Who is the sender?
– Behavioral features: How is the mail sent?


Content Filtering is Malleable

• Low cost of evasion: Spammers can easily alter features of an email’s content

• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated


Sender Reputation: Ephemeral

• Every day, 10% of senders are from previously unseen IP addresses

• Possible causes
– Dynamic addressing
– New infections
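
The churn figure above can be measured from a sender log. The following is a minimal sketch, assuming the log is available as (day, IP) pairs; the function name and the toy addresses are hypothetical and not from the talk.

```python
from collections import defaultdict


def daily_new_sender_fraction(records):
    """records: iterable of (day, ip) pairs, where day is any sortable label.

    Returns {day: fraction of that day's distinct senders that were not
    seen on any earlier day}.
    """
    senders_by_day = defaultdict(set)
    for day, ip in records:
        senders_by_day[day].add(ip)

    seen = set()
    fractions = {}
    for day in sorted(senders_by_day):
        todays = senders_by_day[day]
        fractions[day] = len(todays - seen) / len(todays)
        seen |= todays
    return fractions


# Toy log using documentation-range addresses (not real data).
log = [
    (1, "192.0.2.1"), (1, "192.0.2.2"),
    (2, "192.0.2.1"), (2, "192.0.2.3"),
]
print(daily_new_sender_fraction(log))
```

The first day is trivially 100% new, so a real measurement would discard a warm-up period before reporting the daily fraction.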


Alternative: Network-Based Filtering

• Filter email based on how it is sent, in addition to simply what is sent.

• Network-level properties are more stable
– Hosting or upstream ISP (AS number)
– Membership in a botnet (spammer, hosting infrastructure)
– Network location of sender and receiver
– Set of target recipients

• Challenge: Which properties are most useful for distinguishing spam traffic from legitimate email?


Talk Outline

• Network-level behavior of spammers
– Data collection
– Highlights

• Performance of existing sender reputation systems

• Network-based behavioral filtering techniques
– Behavioral blacklisting
  • SpamTracker: Spectral analysis of sender behavior
  • SNARE: Classifier based on lightweight network-level features


Data Collection

• Spam Traps: Domains that receive only spam

• BGP Monitors: Watch network-level reachability

[Figure labels: Domain 1, Domain 2]

17-Month Study: August 2004 to December 2005


Mail Collection: MailAvenger

• Highly configurable SMTP server

• Collects many useful statistics


BGP Spectrum Agility

• Log IP addresses of SMTP relays

• Join with BGP route advertisements seen at network where spam trap is co-located

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes:

61.0.0.0/8   AS 4678
66.0.0.0/8   AS 21562
82.0.0.0/8   AS 8717

[Figure annotation: ~10 minutes]

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
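
The join between the SMTP relay log and the BGP advertisements can be sketched as follows. This is an illustration only, assuming the BGP feed has already been reduced to (prefix, announce time, withdraw time) tuples; the function name, the 600-second cutoff (chosen to mirror the ~10 minute annotation), and the timestamps are assumptions, not the study's actual tooling.

```python
import ipaddress


def match_short_lived(smtp_log, announcements, max_lifetime=600):
    """Find mail whose sender IP sat inside a short-lived BGP announcement.

    smtp_log: list of (timestamp, sender_ip) in seconds.
    announcements: list of (prefix, announce_time, withdraw_time).
    Returns (timestamp, sender_ip, prefix) for each message received while a
    prefix announced for at most max_lifetime seconds covered the sender.
    """
    short = [
        (ipaddress.ip_network(prefix), start, end)
        for prefix, start, end in announcements
        if end - start <= max_lifetime
    ]
    hits = []
    for ts, ip in smtp_log:
        addr = ipaddress.ip_address(ip)
        for net, start, end in short:
            if start <= ts <= end and addr in net:
                hits.append((ts, ip, str(net)))
                break
    return hits


# Hypothetical timestamps; the prefixes are the ones named on the slide.
announcements = [("61.0.0.0/8", 1000, 1500), ("66.0.0.0/8", 0, 100000)]
smtp_log = [(1200, "61.10.20.30"), (1200, "66.1.2.3"), (2000, "61.10.20.31")]
print(match_short_lived(smtp_log, announcements))
```

Only the first message matches here: the second prefix is long-lived in this toy feed, and the third message arrives after the short announcement was withdrawn.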


Why Such Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
– Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)


Characteristics of Agile Senders

• IP addresses are widely distributed across the /8 space

• IP addresses typically appear only once at our sinkhole

• Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked

• Some IP addresses were in allocated, albeit unannounced space

• Some AS paths associated with the routes contained reserved AS numbers


Other Findings

• Top senders: Korea, China, Japan
– Still about 40% of spam coming from U.S.

• More than half of sender IP addresses appear less than twice

• ~90% of spam sent to traps from Windows


What about IP-based blacklists?


Two Metrics

• Completeness: The fraction of spamming IP addresses that are listed in the blacklist

• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
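
Both metrics are easy to compute once the trap data and the blacklist snapshots are in hand. A minimal sketch, assuming spamming IPs are known from the trap and that listing times are available per IP; the function names and input shapes are hypothetical.

```python
def completeness(spam_ips, listed_ips):
    """Fraction of distinct spamming IPs that appear in the blacklist."""
    spam = set(spam_ips)
    return len(spam & set(listed_ips)) / len(spam)


def responsiveness(first_spam_time, listing_time):
    """Delay between an IP's first observed spam and its listing.

    first_spam_time, listing_time: dicts mapping ip -> timestamp.
    IPs that were never listed are omitted. A negative delay means the
    IP was already listed before its first spam reached the trap.
    """
    return {
        ip: listing_time[ip] - seen
        for ip, seen in first_spam_time.items()
        if ip in listing_time
    }


# Toy inputs (hypothetical labels standing in for IP addresses).
print(completeness(["a", "b", "c", "d"], ["a", "b", "z"]))
print(responsiveness({"a": 10, "b": 20, "c": 30}, {"a": 15, "b": 50}))
```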


Completeness and Responsiveness

• 10-35% of spam is unlisted at the time of receipt

• 8.5-20% of these IP addresses remain unlisted even after one month

Data: Trap data from March 2007, Spamhaus from March and April 2007


Completeness of IP Blacklists

[Figure: Fraction of all spam received vs. Number of DNSBLs listing this spammer]

~80% listed on average

~95% of bots listed in one or more blacklists

Only about half of the IPs spamming from short-lived BGP announcements are listed in any blacklist

Spam from IP-agile senders tends to be listed in fewer blacklists


What’s Wrong with IP Blacklists?

• Based on ephemeral identifier (IP address)
– More than 10% of all spam comes from IP addresses not seen within the past two months
  • Dynamic renumbering of IP addresses
  • Stealing of IP addresses and IP address space
  • Compromised machines

• IP addresses of senders have considerable churn

• Often require a human to notice/validate the behavior
– Spamming is compartmentalized by domain and not analyzed across domains


Problem: Changing IP Addresses

[Figure: y-axis is Fraction of IP Addresses]

About 10% of IP addresses never seen before in trace


Under the Radar at Each Domain

[Figure: Amount of Spam vs. Lifetime (seconds)]

Most spammers send very little spam, regardless of how long they have been spamming.


Where do we go from here?

• Option 1: Stronger sender identity
– Stronger sender identity/authentication may make reputation systems more effective
– May require changes to hosts, routers, etc.

• Option 2: Filtering based on sender behavior
– Can be done on today’s network
– Identifying features may be tricky, and some may require network-wide monitoring capabilities


SpamTracker

• Idea: Blacklist sending behavior (“Behavioral Blacklisting”)
– Identify sending patterns commonly used by spammers

• Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content


SpamTracker Approach

• Construct a behavioral fingerprint for each sender

• Cluster senders with similar fingerprints

• Filter new senders that map to existing clusters


Building the Classifier: Clustering

• Feature: Distribution of email sending volumes across recipient domains

• Clustering Approach
– Build initial seed list of bad IP addresses
– For each IP address, compute feature vector: volume per domain per time interval
– Collapse into a single IP x domain matrix:
– Compute clusters
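
The collapse step can be sketched as below. This assumes the per-interval volumes are collapsed by summing over time, which the slide does not state explicitly; the function name and record layout are hypothetical, and the clustering step itself is not shown.

```python
from collections import defaultdict


def build_ip_domain_matrix(records):
    """Collapse (ip, domain, time_slot, count) records into an IP x domain matrix.

    Assumption: volumes are summed across time slots.
    Returns (ips, domains, matrix), where matrix[i][j] is the total number
    of messages from ips[i] to domains[j].
    """
    totals = defaultdict(int)
    for ip, domain, _time_slot, count in records:
        totals[(ip, domain)] += count

    ips = sorted({ip for ip, _ in totals})
    domains = sorted({domain for _, domain in totals})
    matrix = [[totals.get((ip, d), 0) for d in domains] for ip in ips]
    return ips, domains, matrix


# Toy records using documentation-range addresses and example domains.
records = [
    ("192.0.2.1", "a.example", 0, 2),
    ("192.0.2.1", "a.example", 1, 3),
    ("192.0.2.1", "b.example", 0, 1),
    ("192.0.2.2", "b.example", 0, 4),
]
ips, domains, matrix = build_ip_domain_matrix(records)
print(ips, domains, matrix)
```

Each row of the resulting matrix is one sender's feature vector: its distribution of sending volume across recipient domains.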


Clustering: Output and Fingerprint

• For each cluster, compute fingerprint vector:

• New IPs will be compared to this “fingerprint”

IP x IP Matrix: Intensity indicates pairwise similarity


Classifying IP Addresses

• Given “new” IP address, build a feature vector based on its sending pattern across domains

• Compute the similarity of this sending pattern to that of each known spam cluster
– Normalized dot product of the two feature vectors
– Spam score is maximum similarity to any cluster
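
The scoring rule above can be written in a few lines. A minimal sketch, assuming each cluster fingerprint is already available as a vector over the same domains as the new sender's feature vector; the function names are hypothetical. With non-negative volumes this yields a score between 0 and 1, so the scale of scores reported elsewhere in the talk may differ.

```python
import math


def normalized_dot(u, v):
    """Normalized dot product (cosine similarity) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a sender with no observed volume matches nothing
    return dot / (norm_u * norm_v)


def spam_score(sender_vector, cluster_fingerprints):
    """Spam score: the maximum similarity to any known spam cluster."""
    return max(normalized_dot(sender_vector, f) for f in cluster_fingerprints)


# Toy vectors over two recipient domains.
print(spam_score([3, 0], [[1, 0], [0, 1]]))
```

Because the dot product is normalized, a sender is matched on the shape of its distribution across domains rather than on its total volume.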


Evaluation

• Emulate the performance of a system that could observe sending patterns across many domains
– Build clusters/train on given time interval

• Evaluate classification
– Relative to labeled logs
– Relative to IP addresses that were eventually listed


Dataset

• 30 days of Postfix logs from email hosting service
– Time, remote IP, receiving domain, accept/reject
– Allows us to observe sending behavior over a large number of domains
– Problem: About 15% of accepted mail is also spam
  • Creates problems with validating SpamTracker

• 30 days of SpamHaus database in the month following the Postfix logs
– Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus


Results: Classification

[Figure: SpamTracker Score for Ham and Spam]


Results: Early Detection

• Compare SpamTracker scores on “accepted” mail to the SpamHaus database
– About 15% of accepted mail was later determined to be spam
– Can SpamTracker catch this?

• Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month
– 65 emails had a score larger than 5 (85th percentile)


Evasion

• Problem: Malicious senders could add noise
– Solution: Use smaller number of trusted domains

• Problem: Malicious senders could change sending behavior to emulate “normal” senders
– Need a more robust set of features…


Improving Classification

• Lower overhead

• Faster detection

• Better robustness (i.e., to evasion, dynamism)

• Use additional features and combine for more robust classification
– Temporal: interarrival times, diurnal patterns
– Spatial: sending patterns of groups of senders


SNARE: Automated Sender Reputation

• Goal: Sender reputation from a single packet? (or at least as little information as possible)
– Lower overhead
– Faster classification
– Less malleable

• Key challenge
– What features satisfy these properties and can distinguish spammers from legitimate senders?


Sender-Receiver Geodesic Distance

90% of legitimate messages travel 2,200 miles or less
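
One way to compute this feature is the haversine great-circle distance. The sketch below assumes the sender and receiver IPs have already been mapped to latitude/longitude by some geolocation source (not shown); the function name and the Earth radius constant are choices made here, not details from the talk.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Two points on the equator, 90 degrees of longitude apart.
print(geodesic_miles(0.0, 0.0, 0.0, 90.0))
```

A classifier could then use the raw distance as a feature, or compare it against the 2,200-mile figure above.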


Density of Senders in IP Space

For spammers, k nearest senders are much closer in IP space
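
A density feature of this kind can be sketched as the average distance to the k nearest other senders. The sketch assumes "distance in IP space" means the absolute difference between IPv4 addresses treated as 32-bit integers; that definition, the function name, and the default k are assumptions.

```python
import ipaddress


def avg_distance_to_k_nearest(ip, other_senders, k=3):
    """Average numeric distance from ip to its k nearest neighbours.

    Distance is the absolute difference of the addresses as integers.
    Returns None when there are no other senders to compare against.
    """
    target = int(ipaddress.IPv4Address(ip))
    distances = sorted(
        abs(target - int(ipaddress.IPv4Address(other)))
        for other in other_senders
        if other != ip
    )
    nearest = distances[:k]
    if not nearest:
        return None
    return sum(nearest) / len(nearest)


# Toy senders in private address space.
print(avg_distance_to_k_nearest("10.0.0.5", ["10.0.0.4", "10.0.0.7", "10.0.1.5"], k=2))
```

A small value means the sender sits in a densely populated neighbourhood of other recent senders, which is the pattern the slide associates with spammers.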


Putting It Together

• Put features into SVM or decision tree (C4.5) classifier

• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider
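
The evaluation procedure can be illustrated with a dependency-free sketch of 10-fold cross validation. A one-feature threshold rule (a decision stump) stands in for the SVM or C4.5 classifier here purely to keep the example short; it is not the classifier used in the talk, and all names and data are hypothetical.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; fold i holds out every k-th sample from i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test


def train_stump(xs, ys):
    """Pick the threshold t that maximizes training accuracy of 'spam if x > t'."""
    best_t, best_correct = None, -1
    for t in sorted(set(xs)):
        correct = sum((x > t) == y for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def cross_validate(xs, ys, k=10):
    """Return overall accuracy across the k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k):
        t = train_stump([xs[j] for j in train], [ys[j] for j in train])
        correct += sum((xs[j] > t) == ys[j] for j in test)
    return correct / len(xs)


# Toy data: one numeric feature, label True (spam) when the feature is >= 10.
xs = list(range(20))
ys = [x >= 10 for x in xs]
print(cross_validate(xs, ys))
```

Each sample is held out exactly once, so the reported accuracy reflects only predictions on data the model did not see during training.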


Additional History: Message Size Variance

Senders of legitimate mail have a much higher variance in sizes of messages they send

[Figure: Message Size Range for Certain Spam, Likely Spam, Likely Ham, Certain Ham]

Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier
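
The message size variance feature can be computed per sender as shown below. A minimal sketch, assuming the history is available as (sender, size in bytes) pairs and using population variance; both the input shape and the choice of variance estimator are assumptions.

```python
from collections import defaultdict
from statistics import pvariance


def size_variance_by_sender(messages):
    """messages: iterable of (sender, size_in_bytes) pairs.

    Returns {sender: population variance of that sender's message sizes}.
    """
    sizes = defaultdict(list)
    for sender, size in messages:
        sizes[sender].append(size)
    return {sender: pvariance(values) for sender, values in sizes.items()}


# Toy history: one sender repeats the same size, the other varies.
history = [
    ("192.0.2.1", 100), ("192.0.2.1", 100), ("192.0.2.1", 100),
    ("192.0.2.2", 100), ("192.0.2.2", 300),
]
print(size_variance_by_sender(history))
```

Unlike the single-packet features, this one needs several messages from the same sender before it carries any signal.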


Deployment Options

• Integration with existing infrastructure
– Deploy SpamTracker as “yet another DNSBL”
– Existing spam filters use SpamTracker score as an additional feature
– Advantage: easy deployment

• On the wire
– Infer connections/email from traffic flow records in individual domains
– Advantage: Stop mail closer to the source


In Progress: Real-Time Blacklist-Style Deployment

• As mail arrives at servers, lookups are received at the blacklist (BL)

• Queries provide proxy for sending behavior

• Train classifier/cluster based on mail

• Return current score

[Figure: Approach. Labels: Email, IP x domain x time, Collapse, Cluster, Classify, Lookup, Score]
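
For context on the lookups mentioned above: a mail server conventionally queries a DNS blacklist by reversing the sender's IPv4 octets and appending the list's zone. The sketch below builds that query name; the zone `bl.example.org` is a placeholder, not a real SpamTracker endpoint.

```python
import ipaddress


def dnsbl_query_name(ip, zone="bl.example.org"):
    """Build a DNSBL query name: reversed IPv4 octets followed by the zone.

    The zone is a hypothetical placeholder for illustration.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


print(dnsbl_query_name("192.0.2.99"))
```

Each such query tells the blacklist operator which server is asking about which sender and when, which is what lets the query stream serve as a proxy for sending behavior.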


Further Challenges

• Reactivity: Which features can be observed quickly enough to construct signatures?

• Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?

• Reliability: How should the system be replicated to better defend against attack or failure?

• Sensor placement: Where should monitors be placed to best observe behavior/construct features?


Conclusion: Network-Based Behavioral Filtering

• Spam increasing, spammers becoming agile
– Content filters are falling behind
– IP-Based blacklists are evadable
  • Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month

• Complementary approach: behavioral blacklisting based on network-level features
– Blacklist based on how messages are sent
– SpamTracker: Spectral clustering
  • catches significant amounts faster than existing blacklists
– SNARE: Automated sender reputation
  • ~90% accuracy of existing with lightweight features


References

• Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006

• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007

• Nadeem Syed, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, GT-CSE-08-02

