Page 1: Approximate Object Location and Spam Filtering on Peer-to-Peer …ravenben/talks/ata_middleware... · 2004-07-18 · ACM Middleware 2003 ravenben@eecs.berkeley.edu 2 The Problem of

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz

University of California, Berkeley

ACM Middleware 2003 ravenben@eecs.berkeley.edu

The Problem of Spam
• Spam
  - Unsolicited, automated emails
  - Radicati Group: $20B cost in 2003, $198B in 2007
• Proposed solutions
  - Economic model for spam prevention
    - Attach cost to mass email distribution
    - Weakness: needs widespread deployment; prevents but does not filter
  - Bayesian network / machine learning (independent)
    - "Train" mailer with spam, rely on recognizing words / patterns
    - Weakness: keywords can be masked (images, invisible characters)
  - Collaborative filtering
    - Store / query for spam signatures on central repository
    - Other users query signatures to filter out incoming spam
    - Weakness: central repository limited in bandwidth, computation

Our Contribution
• Can signatures effectively detect modified spam?
  - Goals:
    - Minimize false positives (marking good email as spam)
    - Recognize modified/customized spam as same as original
  - Present signature scheme based on approx. fingerprints
  - Evaluate against random text and real email messages
• Can we build a scalable, resilient signature repository?
  - Leverage structured peer-to-peer networks
  - Constrain query latency and limit bandwidth usage
• Orthogonal issues we do not address:
  - Preprocessing emails to extract content
  - Interpreting collective votes via reputation systems

Outline
• Introduction
• An Approximate Signature Scheme
  - Evaluation using random text and real emails
• Approximate object location
  - Similarity search on P2P systems
  - Constraining latency and bandwidth usage
• Conclusion

Collaborative Spam Filtering
1. Spam sent to user A
2. Signatures stored
3. Spam sent to user B
4. B queries repository
5. B filters out spam

[Figure labels: store signature, query, response]

An Approximate Signature Scheme
• How to match documents with very similar content
• Calculate checksums of all substrings of length L
• Select deterministic set of N checksums
• A matches B iff |sig(A) ∩ sig(B)| > Threshold
• Computation throughput: 13 MByte/s on P-III 1 GHz

[Figure: signature computation; labels: Email Message, checksum, Sort by value, Choose N, Randomize, Vector]
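The scheme on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: CRC32 stands in for the unspecified checksum, "select deterministic set" is taken to mean "keep the N smallest values", and the names `signature`, `shared`, `matches`, `window` and `min_shared` are hypothetical. The match test uses "at least `min_shared`", following the "k of N" phrasing on later slides.

```python
import zlib

def signature(text: bytes, window: int = 50, n: int = 10) -> frozenset:
    """Pick n checksums deterministically from all length-`window` substrings."""
    last_start = max(len(text) - window, 0)
    # Checksum every substring of length `window` (CRC32 is an assumption).
    checksums = {zlib.crc32(text[i:i + window]) for i in range(last_start + 1)}
    # Deterministic selection (assumption): sort by value, keep the n smallest.
    return frozenset(sorted(checksums)[:n])

def shared(a: bytes, b: bytes) -> int:
    """Number of signature checksums the two documents have in common."""
    return len(signature(a) & signature(b))

def matches(a: bytes, b: bytes, min_shared: int = 3) -> bool:
    """Approximate match: at least `min_shared` checksums in common."""
    return shared(a, b) >= min_shared

# A document and a lightly edited copy of it.
original = b" ".join(str(i).encode() for i in range(2000))
edited = original.replace(b" 1000 ", b" one thousand ")
print(shared(original, original), shared(original, edited))
```

Because only windows that overlap the edited region change their checksum, a small edit usually leaves most of the selected checksums intact, which is what lets a modified copy still match its original.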

Accuracy of Signature Vectors

[Figure: Matching Accuracy vs Changes. x-axis: Minimum Matches out of 10 (0 to 10); y-axis: Probability of Match (0 to 1); series: 10/5K, 50/5K, 5x5/5K]

• 10000 random text documents, size = 5KB, calculate 10 signatures
• Compare signatures of before and after modifications
• Analytical results match experimental results

Eliminating False Positives

[Figure: False Positive Rate. x-axis: Size of Documents (Bytes), 1000 to 100000; y-axis: False Positive Rate, 1.00E-09 to 1.00E-03; series: 1/10 Signatures Match, 2/10 Signatures Match]

• Compare pair-wise signatures between 10000 random docs
• None matched 3 of 10 signatures (100,000,000 pairs)

Evaluation on Real Messages
• 29631 Spam Emails from www.spamarchive.org
  - Processed visually by project members
  - 14925 (unique), 86% of spam = 5K
• Robustness to modification test
  - Most popular 39 msgs have 3440 modified copies
  - Examine recognition between copies and originals

THRES   Detected   Failed   %
3/10    3356       84       97.56
4/10    3172       268      92.21
5/10    2967       473      86.25

False Positive Test
• Non-spam emails
  - 9589 messages: 50% newsgroup posts + 50% personal emails
  - Compare against 14925 unique spam messages
• Sweet spot, using threshold of 3/10 signatures
  - Recognition rate > 97.5%
  - False positive rate < 1 in 140 million pairs

THRES   # of pairs   Probability
1/10    270          1.89e-6
2/10    4            2.79e-8
3/10    0            0
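The probabilities in this table can be reproduced from the message counts on the slide. The short check below assumes that every non-spam message was compared against every unique spam message, which is consistent with the "1 in 140 million pairs" figure; the variable names are illustrative.

```python
NON_SPAM = 9589       # non-spam messages (from the slide)
UNIQUE_SPAM = 14925   # unique spam messages (from the slide)

# Assumption: every non-spam message is compared with every spam message.
pairs = NON_SPAM * UNIQUE_SPAM

# "# of pairs" column: falsely matching pairs at each threshold.
matched_pairs = {"1/10": 270, "2/10": 4, "3/10": 0}

rates = {thres: count / pairs for thres, count in matched_pairs.items()}
print(pairs)
for thres, rate in rates.items():
    print(f"{thres}: {rate:.2e}")
```

With no false match at 3/10 among roughly 143 million pairs, the observed rate at that threshold is bounded by one over the number of pairs, which is where the "< 1 in 140 million" claim comes from.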

A Distributed Signature Repository?

[Figure labels: store signature, query, response]

• How do we build a scalable distributed repository?
• How do we limit bandwidth consumption and latency?

Structured Peer-to-Peer Overlays
• Storage / query via structured P2P overlay networks
  - Large sparse ID space N (160 bits: 0 to 2^160)
  - Nodes in overlay network have nodeIDs ∈ N
  - Given k ∈ N, overlay deterministically maps k to its root node (a live node in the network)
  - E.g. Chord, Pastry, Tapestry, Kademlia, Skipnet, etc.
• Decentralized Object Location and Routing (DOLR)
  - Objects identified by Globally Unique IDs (GUIDs) ∈ N
  - Decentralized directory service for endpoints/objects
    - Route messages to nearest available endpoint
  - Object location with locality:
    - routing stretch (overlay location / shortest distance) ≅ O(1)
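As a rough illustration of "deterministically maps k to its root node", the sketch below uses a Chord-style successor rule on a 160-bit ring. This is a swapped-in simplification: Tapestry and Pastry pick the root by prefix or numerical closeness, and real overlays route in several hops without a global node list. `to_id` and `root_node` are hypothetical names, and SHA-1 is assumed only because it produces 160-bit values.

```python
import bisect
import hashlib
from typing import Sequence

ID_BITS = 160  # size of the ID space named on the slide

def to_id(name: str) -> int:
    """Hash an arbitrary name into the 160-bit ID space (SHA-1, an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

def root_node(key: int, live_node_ids: Sequence[int]) -> int:
    """Successor rule: the first live nodeID >= key, wrapping around the ring."""
    ring = sorted(live_node_ids)
    index = bisect.bisect_left(ring, key)
    return ring[index % len(ring)]

nodes = [10, 200, 3000]
print(root_node(150, nodes))          # 200
print(root_node(5000, nodes))         # wraps around to 10
print(root_node(150, [10, 3000]))     # node 200 left, so the root becomes 3000
```

The last call shows why the root is defined as a live node: when a node leaves, the same key is deterministically reassigned to another node without any central coordination.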

DOLR on Tapestry Routing Mesh

[Figure: Tapestry routing mesh; labels: Object O, Server, Client, Root(O), Client]

More Than Just Unique Identifiers
• Objects named by Globally Unique ID (GUID)
  - Application maps secondary characteristics to ID: versioning, modified replicas, app-specific info
• Simplify the search problem
  - Out of m search fields, or "features," find objects matching at least n exactly

[Figure: spectrum from Simpler Queries (DHT/DOLRs), through Search on Features, to More Powerful QLs (Databases)]

A Layered Perspective

ADOLR layer
• Introduce naming mapping from feature vector to GUIDs
• Rely on overlay infrastructure for storage
• Abstraction of feature vectors as approximate names for object(s)

Layer                                    Interface
Approximate DOLR/DHT                     approxPublish(FV,GUID), approxSearch(FV), route (FV, msg)
Structured P2P Overlay Network (DOLR)    Publish (GUID), route (GUID, msg)
IP Network Layer                         route (IPAddr, msg)

Marking a New Spam Message

• Signatures stored as inverted index (feature object) inside overlay
• User on C gets spam E2, calculates signatures S: {S1, S2, S3}
• For each feature in S, if feature object exists, add E2
• If no feature object exists, create one locally and publish

[Figure: node C (E2), node A (E3) and the overlay; operations put (S1, E2), put (S2, E2), put (S3, E2); feature object S1 → E3 becomes S1 → E3, E2; new feature objects S2 → E2 and S3 → E2]

Filtering New Emails for Spam

• User at node D receives new email E2' with signatures {S1, S2, S4}
• Queries overlay for signatures, retrieve matching GUIDs for each
• Threshold = 2/3, contact GUIDs that occur in 2 of 3 result sets
• Contact E2 via overlay for any additional info (votes etc)

[Figure: node D queries the overlay for S1, S2 and S4; S1 → E3, E2 and S2 → E2 are found, S4 is not (X); node C holds E2, node A holds E3; result: E2' ≅ E2]
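The marking and filtering steps on the last two slides amount to maintaining and querying an inverted index. The sketch below replays the slide's example on a single in-memory dictionary that stands in for the overlay; `ToyOverlay`, `approx_publish`, `approx_search` and `min_features` are hypothetical names modelled on approxPublish/approxSearch, and in a real deployment each feature object would live on the overlay node responsible for that feature's ID.

```python
from collections import Counter, defaultdict

class ToyOverlay:
    """In-memory stand-in for the overlay: feature -> set of GUIDs."""

    def __init__(self):
        self.feature_objects = defaultdict(set)

    def approx_publish(self, feature_vector, guid):
        # Add the GUID to each feature object, creating it if it is missing.
        for feature in feature_vector:
            self.feature_objects[feature].add(guid)

    def approx_search(self, feature_vector, min_features):
        # Count in how many per-feature result sets each GUID appears.
        hits = Counter()
        for feature in feature_vector:
            for guid in self.feature_objects.get(feature, ()):
                hits[guid] += 1
        return {guid for guid, count in hits.items() if count >= min_features}

overlay = ToyOverlay()
overlay.approx_publish({"S1"}, "E3")              # existing entry S1 -> E3
overlay.approx_publish({"S1", "S2", "S3"}, "E2")  # user on node C marks E2

# Node D receives E2' with signatures {S1, S2, S4}; threshold is 2 of 3.
print(overlay.approx_search({"S1", "S2", "S4"}, min_features=2))  # {'E2'}
```

E3 shares only S1 with the query and is dropped, while E2 appears in two of the three result sets and is returned as the approximate match.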

Constraining Bandwidth and Latency

• Need to constrain bandwidth and latency
  - Limit signature query to h overlay hops
  - Return null set if h hops reached without result
• Simulation on transit-stub topologies
  - 5K nodes, 4K overlay nodes, diameter = 400ms
  - Each spam message only reaches small group
    - For each message: % of users seen and marked = mark rate
  - Measure tradeoff between latency and success rate of locating known spam, for different mark rates
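The hop limit can be pictured as follows. This is a minimal sketch under assumptions: the query's overlay route is given up front as a list of node names starting at the querying node, and `pointers` maps each node to the signature entries it happens to hold. Both structures, and the function name `locate`, are hypothetical.

```python
def locate(route, pointers, signature, max_hops):
    """Follow `route` for at most `max_hops` overlay hops.

    route[0] is the querying node (0 hops); each later entry costs one hop.
    Returns the GUIDs found at the first node that knows the signature,
    or an empty set if the hop budget runs out first.
    """
    for node in route[:max_hops + 1]:
        guids = pointers.get(node, {}).get(signature)
        if guids:
            return set(guids)
    return set()

route = ["D", "n1", "n2", "root"]        # path from node D toward the root
pointers = {"n2": {"S1": {"E2"}}}        # n2 holds an entry for S1

print(locate(route, pointers, "S1", max_hops=1))  # set(): budget exhausted
print(locate(route, pointers, "S1", max_hops=2))  # {'E2'}
```

A smaller h caps latency and traffic per query, at the price of missing entries stored further along the route; the higher the mark rate, the more likely an entry sits within the budget.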

Simulation Result

[Figure: Trading Accuracy for Latency and Bandwidth. x-axis: Latency to Locate Object (0 to 500); y-axis: Prob. of Finding Match (0 to 1); series: 10%, 4%, 2%, 1% and 0.5% Mark Rate; annotation: network diameter]

Feature-based Queries
• Approximate Text Addressing
  - Text objects: features → hashed signatures
  - Applications: plagiarism detection, replica management
• Other feature-based searching
  - Music similarity search
    - Extract musical characteristics
    - Signatures: {hash(field1=value1), hash(field2=value2)…}
    - E.g. Fourier transform values, # of wavetable entries
  - Image similarity search
    - Locate files with similar geometric properties, etc.
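The {hash(field=value), …} signature form can be illustrated with a small sketch. SHA-1 is assumed here only because its 160-bit output fits the overlay ID space mentioned earlier; the function name `feature_signatures` and the field names are made up for the example.

```python
import hashlib

def feature_signatures(features: dict) -> set:
    """Hash each "field=value" pair into one 160-bit signature (hex string)."""
    return {
        hashlib.sha1(f"{field}={value}".encode("utf-8")).hexdigest()
        for field, value in features.items()
    }

# Two hypothetical music objects that agree on two of three features.
song_a = feature_signatures({"tempo_bpm": 120, "key": "C", "wavetable_entries": 64})
song_b = feature_signatures({"tempo_bpm": 120, "key": "C", "wavetable_entries": 32})

print(len(song_a & song_b))  # 2
```

Because each feature hashes independently, objects that agree on a field produce an identical signature for it, so the same "at least n of m" query used for spam signatures applies unchanged.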

Finally…
• Status
  - ADOLR infrastructure implemented on Tapestry
  - SpamWatch: P2P spam filter implemented, including Microsoft Outlook plug-in
    Available for download: http://www.cs.berkeley.edu/~zf/spamwatch
• Tapestry
  http://www.cs.berkeley.edu/~ravenben/tapestry
• Thank you… Questions?

