Post on 27-Mar-2015
Network-Based Spam Filtering
Nick Feamster, Georgia Tech
Joint work with Anirudh Ramachandran and Santosh Vempala
2
Spam
• 75-90% of all email traffic
  – PDF spam: ~11% and growing
  – Content filters cannot catch it!
• Late 2006: “there was a significant rise in spammers’ use of botnets, armies of PCs taken over by malware and turned into spam servers without their owners realizing it.”
• August 2007: Botnet-based spam caused volumes to increase 53% from previous day
Source: NetworkWorld, August 2007
3
More Than Just a Nuisance
• As of August 2007, one in every 87 emails constituted a phishing attack
• Targeted attacks on the rise
  – 20k-30k unique phishing attacks per month
  – Spam targeted at CEOs, social networks on the rise
4
One Approach: Filtering
• Prevent traffic from reaching users’ inboxes by distinguishing spam from ham
• Key question: What features best differentiate spam from legitimate mail?
  – Content
  – IP address of sender
  – Other “behavioral features”
5
Content-Based Filtering is Malleable
• Content-based properties are malleable
  – Low cost of evasion: Spammers can easily adjust and change features of an email’s content
  – Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.
  – High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
• Content-based filters are applied at the destination
  – Too little, too late: Wasted network bandwidth, storage, etc. Many users receive (and store) the same spam content
6
Complementary Approach: Network-Based Filtering
• Filter email based on how it is sent, in addition to simply what is sent.
• Network-level properties are more fixed
  – Hosting or upstream ISP (AS number)
  – Botnet membership
  – Location in the network
  – IP address block
• Challenge: Which properties are most useful for distinguishing spam traffic from legitimate email?
Very little (if anything) is known about these characteristics!
7
Two Parts
• Study the network-level behavior of spammers
  – Majority of spam comes from a very small portion of the Internet address space
  – Most coming from Windows hosts
  – Most senders low volume to our domain
  – Conventional blacklists somewhat ineffective
• Develop behavioral based filtering techniques
  – Behavioral blacklisting
8
Studying Sending Patterns
• Network-level properties of spam arrival
  – From where?
    • What IP address space?
    • ASes?
    • What OSes?
  – What techniques?
    • Botnets
    • Short-lived route announcements
    • Shady ISPs
  – Capabilities and limitations?
    • Bandwidth
    • Size of botnet army
9
BGP Spectrum Agility
• Log IP addresses of SMTP relays
• Join with BGP route advertisements seen at the network where the spam trap is co-located

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes:
Prefix        AS
61.0.0.0/8    4678
66.0.0.0/8    21562
82.0.0.0/8    8717

Announcement duration: ~10 minutes

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
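The join described above can be sketched roughly as follows. This is only an illustration, not the tooling used in the study: the record types `Announcement` and `MailEvent`, the 15-minute `SHORT_LIVED_SECONDS` threshold, and all timestamps are assumptions made for the example. Only the prefixes and AS numbers come from the slide.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Announcement:          # hypothetical record for one BGP route announcement
    prefix: str              # e.g. "61.0.0.0/8"
    origin_as: int
    announced: float         # time the route appeared (seconds)
    withdrawn: float         # time the route was withdrawn (seconds)

@dataclass
class MailEvent:             # hypothetical record for one SMTP connection at the spam trap
    relay_ip: str
    received: float          # time the mail arrived (seconds)

SHORT_LIVED_SECONDS = 15 * 60  # assumed cutoff for a "short-lived" route

def short_lived_matches(mail_events, announcements, max_lifetime=SHORT_LIVED_SECONDS):
    """Return (relay_ip, prefix, origin_as) for mail that arrived while a
    short-lived route covering the relay's IP address was being announced."""
    matches = []
    for ev in mail_events:
        ip = ipaddress.ip_address(ev.relay_ip)
        for ann in announcements:
            lifetime = ann.withdrawn - ann.announced
            covers = ip in ipaddress.ip_network(ann.prefix)
            active = ann.announced <= ev.received <= ann.withdrawn
            if covers and active and lifetime <= max_lifetime:
                matches.append((ev.relay_ip, ann.prefix, ann.origin_as))
    return matches

# Illustrative data: timestamps are made up.
announcements = [
    Announcement("61.0.0.0/8", 4678, announced=1000, withdrawn=1600),   # 10 minutes
    Announcement("66.0.0.0/8", 21562, announced=0, withdrawn=86400),    # long-lived
]
events = [
    MailEvent("61.12.34.56", received=1200),  # inside the short-lived window
    MailEvent("66.1.2.3", received=500),      # covered, but route is long-lived
    MailEvent("61.9.9.9", received=5000),     # route already withdrawn
]
matches = short_lived_matches(events, announcements)
```

With this data only the first event is flagged. A real analysis would also need to decide how to treat flapping routes, which the slide notes may account for part of the traffic.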
10
Why Such Big Prefixes?
• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses
• Visibility: Route typically won’t be filtered (nice and short)
11
Characteristics of IP-Agile Senders
• IP addresses are widely distributed across the /8 space
• IP addresses typically appear only once at our sinkhole
• Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked
• Some IP addresses were in allocated, albeit unannounced space
• Some AS paths associated with the routes contained reserved AS numbers
12
Lessons for Spam Mitigation
• Blacklists based on IP address alone are becoming less effective
  – Effective spam filtering requires a better notion of end-host identity
• Detection based on network-wide behavior may be more fruitful than focusing on individual IPs
• Critical pieces of the puzzle
  – Botnet detection: Need better monitoring techniques
  – Routing security
13
Two Parts
• Study the network-level behavior of spammers
  – Majority of spam comes from a very small portion of the Internet address space
  – Most coming from Windows hosts
  – Most senders low volume to our domain
  – Conventional blacklists somewhat ineffective
• Develop behavioral based filtering techniques
  – Behavioral blacklisting
14
The Effectiveness of Blacklisting
[Figure: fraction of all spam received vs. number of DNSBLs listing this spammer]
• ~80% listed on average
• ~95% of bots listed in one or more blacklists
• Only about half of the IPs spamming from short-lived BGP are listed in any blacklist
• Spam from IP-agile senders tends to be listed in fewer blacklists
15
Incomplete and Unresponsive
• Incomplete: Up to 35% of spam unlisted by SpamHaus or SpamCop at time of receipt
• Unresponsive: 20% remained unlisted in the blacklists even after one month
16
Problems with Existing Blacklists
• Based on ephemeral identifier (IP address)
  – More than 10% of all spam comes from IP addresses not seen within the past two months
    • Dynamic renumbering of IP addresses
    • Stealing of IP addresses and IP address space
    • Compromised machines
• Requires a human to first notice the behavior
  – Spamming is compartmentalized by domain and not analyzed across domains
17
Problem: Low Volumes of Spam to Any Single Domain
[Figure: amount of spam vs. lifetime (seconds)]
Most bot IP addresses send very little spam, regardless of how long they have been spamming. Observation from a single domain cannot detect them.
18
Main Idea and Intuition
• Idea: Blacklist sending behavior
  – Identify sending patterns that are commonly used by spammers
• Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content
19
SpamTracker: Behavioral Blacklisting
• Observe sending behavior across domains
• Form clusters of behavioral fingerprints of known spammers
• Map new IP addresses to known clusters
Approach
20
Building the Classifier: Clustering
• Feature: Distribution of email sending volumes across recipient domains
• Clustering Approach
  – Build initial seed list of bad IP addresses
  – For each IP address, compute feature vector: volume per domain per time
  – Collapse into a single IP x domain matrix
  – Compute clusters
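The collapse into an IP x domain matrix can be sketched as below. This is a minimal illustration, assuming log records of the form (sender IP, recipient domain, time bucket, message count); the function name `build_ip_domain_matrix` and the record layout are hypothetical, and summing over time is one plausible way to collapse the time dimension.

```python
from collections import defaultdict

def build_ip_domain_matrix(records):
    """Collapse per-time sending volumes into one row per sender IP.

    records: a list of (sender_ip, recipient_domain, time_bucket, message_count).
    Returns (domains, matrix) where matrix[ip][j] is the total volume that
    ip sent to domains[j], summed over all time buckets.
    """
    domains = sorted({domain for _ip, domain, _t, _n in records})
    column = {domain: j for j, domain in enumerate(domains)}
    rows = defaultdict(lambda: [0] * len(domains))
    for ip, domain, _time_bucket, count in records:
        rows[ip][column[domain]] += count
    return domains, dict(rows)

# Illustrative records using documentation-style names and addresses.
records = [
    ("10.0.0.1", "a.example", 0, 3),
    ("10.0.0.1", "a.example", 1, 2),
    ("10.0.0.1", "b.example", 0, 1),
    ("10.0.0.2", "b.example", 0, 4),
]
domains, matrix = build_ip_domain_matrix(records)
```

The rows of `matrix` are the feature vectors that a clustering algorithm would then group; the slide does not name the algorithm, so none is assumed here.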
21
Clustering: Output and Fingerprint
• For each cluster, compute characteristic vector:
• New IPs will be compared to this “fingerprint”
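One simple way to obtain such a fingerprint is to average the feature vectors of a cluster's members. The sketch below assumes that definition; the formula from the original slide is not reproduced in this transcript, so the averaging and the name `cluster_fingerprint` are assumptions for illustration.

```python
def cluster_fingerprint(member_vectors):
    """Average the per-domain volume vectors of all IPs in one cluster.

    Assumption: the characteristic vector is the element-wise mean
    of the members' feature vectors.
    """
    if not member_vectors:
        raise ValueError("a cluster needs at least one member")
    n = len(member_vectors)
    width = len(member_vectors[0])
    return [sum(vec[j] for vec in member_vectors) / n for j in range(width)]

# Two spammers that mostly target the first domain.
fingerprint = cluster_fingerprint([[4, 0], [2, 2]])
```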
22
Classifying “New” IP Addresses
• Given “new” IP address, build a feature vector based on its sending pattern across domains
• Compute the similarity of this sending pattern to that of each known spam cluster
  – Normalized dot product of the two feature vectors
  – Spam score is maximum similarity to any cluster
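The scoring step above can be sketched in a few lines. The function names `normalized_dot` and `spam_score` are illustrative, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math

def normalized_dot(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty sending pattern matches nothing
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def spam_score(new_vector, fingerprints):
    """Maximum similarity between a new IP's vector and any cluster fingerprint."""
    return max((normalized_dot(new_vector, f) for f in fingerprints), default=0.0)

# A new IP that sends only to the first domain, scored against two clusters.
score = spam_score([1, 0], [[1, 0], [0, 1]])
```

Because the dot product is normalized, a low-volume sender whose mail is spread across domains in the same proportions as a known cluster still scores high.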
23
Spam Has Higher SpamTracker Score
• Compare spam score of known spam to that of mail that was accepted for delivery
Rejected mails have higher spam scores
24
Deployment Options
• Integration with existing infrastructure
  – Deploy SpamTracker as “yet another DNSBL”
  – Existing spam filters use SpamTracker score as an additional feature
  – Advantage: easy deployment
• On the wire deployment
  – Infer connections/email from traffic flow records in individual domains
  – Advantage: Stop mail before it even reaches the mail server
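For the DNSBL option, clients conventionally look up a sender by reversing the octets of its IPv4 address and appending the blacklist's zone. The sketch below shows only that name construction; the zone `spamtracker.example` is a placeholder, not a real service, and how a score would be encoded in the DNS answer is not specified in the slides.

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNS name a client would query to check an IPv4 sender.

    Follows the common DNSBL convention: reversed octets, then the zone.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch handles IPv4 addresses only")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# "spamtracker.example" is a hypothetical zone used for illustration.
name = dnsbl_query_name("192.0.2.99", "spamtracker.example")
```

An existing mail filter would resolve this name and treat the answer as one more input, which is why this deployment path requires little change on the receiving side.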
25
Other Questions and Challenges
• Reactivity: Can the features be observed quickly enough to construct the fingerprints?
• Scalability: How can the data be aggregated and collected without imposing too much overhead?
• Reliability: How can SpamTracker be replicated to better defend against attack or failure?
• Sensor placement: From where should we watch spam to ensure that the clusters can be distinguished?
• Symbiosis between botnet detection and spam filtering
26
Summary
• Spam is on the rise and becoming more clever
  – 12% of spam now PDF spam. Content filters are falling behind
  – Also becoming more targeted
• IP-based blacklists are evadable
  – Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month
  – Spammers commonly steal IP addresses
• New approach: Behavioral blacklisting
  – Blacklist how the mail was sent, not what was sent