A Selective Learning Model for Spam...

Page 1: A Selective Learning Model for Spam Filtering (projects.csail.mit.edu/spamconf/SC2009/Didier_Colin/Colin_slides_final.pdf)

Outline: Motivations of this work; The selective learning model; Experiments and results; Online application; Conclusion

A Selective Learning Model for Spam Filtering

Didier Colin, Catherine Roucairol, Ider Tseveendorj

Prism Laboratory

University of Versailles Saint-Quentin en Yvelines, France

March 26, 2009

Didier Colin, Catherine Roucairol, Ider Tseveendorj, Prism Laboratory, University of Versailles Saint-Quentin en Yvelines, France. A Selective Learning Model for Spam Filtering

Motivations of this work

The selective learning model

Experiments and results

Online application

Conclusion

The spam filtering problem

- Two approaches to spam filtering:
  - Knowledge engineering
  - Machine learning: text classification

- Spam filtering is not a typical text classification problem:
  - Adversarial classification: classifying against an opponent who will try to delude or break the filter
  - Need for autonomy: maintaining accuracy over time with minimal human intervention
  - False-positive issue: no acceptable false-positive rate

Idea

- Learning every message is generally a bad idea

- Assumption: some of the knowledge carried by the messages is harmful to the filter

- Basic idea: identify the messages that carry it and do not learn them

- Formulate the learning process as an optimization problem, and introduce a decision variable

- Purposes:
  - Protect the filter against deluding strategies
  - Provide better behaviour over time by preventing natural degeneration of the filter
  - Give the filter better generalization capability

Why a selective approach?

- Human communications are inherently redundant

- Human languages often contain misleading information

- This is especially true of spam (repetitive commercial strategies, deceptive messages)

- These characteristics may be difficult to capture in a feature selection scheme

Problem formulation

- Problem formulation: find a training subcorpus such that training on it maximizes the resulting filter's accuracy on the evaluation corpus

- A typical corpus: $10^3$ to $10^6$ learning messages

- A typical classifier learns in polynomial time

→ enumerating every subcorpus is intractable, so we opt for a meta-heuristic implementation
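
To make the size of the search space concrete: a corpus of n messages has 2^n possible subcorpora, and each candidate costs one training pass plus one evaluation pass. The sketch below only counts decimal digits of 2^n; the function name is illustrative and does not come from the slides.

```python
import math


def subcorpus_count_digits(corpus_size: int) -> int:
    """Number of decimal digits of 2**corpus_size, the count of possible subcorpora."""
    # Computed via logarithms so that very large corpus sizes stay cheap.
    return math.floor(corpus_size * math.log10(2)) + 1


if __name__ == "__main__":
    # The two corpus sizes quoted on the slide: 10^3 and 10^6 messages.
    print(subcorpus_count_digits(10 ** 3))  # 302
    print(subcorpus_count_digits(10 ** 6))  # 301030
```

Even at the lower end of the quoted range, the number of candidates has hundreds of digits, which is why a search heuristic is used instead of enumeration.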

Implementation

- Genetic implementation

- Data: a set of messages $C$, a classifier $f$

- Representations
  - Solution: Boolean vector $X$ of dimension $|C|$, with $X_i = 1$ if message $i$ is selected
  - Fitness: $A(f_{C(X)}, C)$, the weighted accuracy of the resulting filter on the set $C$, where $C(X) = \{c_i \in C \mid X_i = 1\}$

- Operations
  - Selection: elitist
  - Crossover: one-point
  - Mutation: random bit inversion
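
The representation and operators listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter defaults, the choice of drawing parents from the better half of the population, and the toy fitness function (a stand-in for training the classifier on the selected messages and measuring its weighted accuracy) are all assumptions.

```python
import random
from typing import Callable, List

Solution = List[int]  # X: one 0/1 entry per message of the corpus


def one_point_crossover(a: Solution, b: Solution, rng: random.Random) -> Solution:
    """Child takes a's bits up to a random cut point and b's bits after it."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(x: Solution, rate: float, rng: random.Random) -> Solution:
    """Invert each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in x]


def evolve(fitness: Callable[[Solution], float], n_messages: int,
           pop_size: int = 10, generations: int = 30,
           elite: int = 2, rate: float = 0.05, seed: int = 0) -> Solution:
    """Return the best selection vector found (requires n_messages >= 2, pop_size >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_messages)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:elite]  # elitist selection: the best are kept unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            # Assumption: parents are drawn from the better half of the population.
            a, b = rng.sample(population[:max(2, pop_size // 2)], 2)
            children.append(mutate(one_point_crossover(a, b, rng), rate, rng))
        population = survivors + children
    return max(population, key=fitness)


# Hypothetical stand-in for A(f_{C(X)}, C): messages 3 and 7 are "harmful",
# every other selected message helps.
HARMFUL = {3, 7}


def toy_fitness(x: Solution) -> float:
    return sum(bit * (-1 if i in HARMFUL else 1) for i, bit in enumerate(x))


if __name__ == "__main__":
    best = evolve(toy_fitness, n_messages=12)
    print(best, toy_fitness(best))
```

Because the elite individuals are copied unchanged into the next generation, the best fitness in the population never decreases from one generation to the next.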

Genetic operations

Experimental protocol

- Data sets: lingspam corpus [1] (481 spam, 2412 legitimate messages), SpamAssassin (1897 spam, 4150 legitimate messages)

- Classifier: Bernoulli naive Bayes, 60-word vocabulary

- Parameters:
  - Population size: 10 to 100
  - Mutation rate: 5 to 75
  - Initial solutions: random selection of 10% of the legitimate messages and 50% of the spam

- Metric: Total Cost Ratio, $\mathrm{TCR} = \dfrac{A(f_{C(X)}, C)}{A(f_{\emptyset}, C)}$

[1] Ion Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras and C. D. Spyropoulos, "An evaluation of Naive Bayesian anti-spam filtering", Computing Research Repository, 2000.
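
For reference, the evaluation paper cited above expresses its cost-sensitive measures with raw counts, treating each legitimate message as λ messages. The sketch below follows that paper's definitions; whether the slides' $A$ and $f_{\emptyset}$ correspond exactly to these quantities is an assumption, and the misclassification counts in the usage example are made up for illustration.

```python
def weighted_accuracy(n_legit_ok: int, n_spam_ok: int,
                      n_legit: int, n_spam: int, lam: float = 1.0) -> float:
    """Weighted accuracy: each legitimate message counts `lam` times."""
    return (lam * n_legit_ok + n_spam_ok) / (lam * n_legit + n_spam)


def total_cost_ratio(n_legit_as_spam: int, n_spam_as_legit: int,
                     n_spam: int, lam: float = 1.0) -> float:
    """TCR: weighted error of the no-filter baseline divided by that of the filter.

    With no filter every spam gets through, so the baseline cost is n_spam.
    """
    cost = lam * n_legit_as_spam + n_spam_as_legit
    if cost == 0:
        return float("inf")  # a perfect filter has no finite cost ratio
    return n_spam / cost


if __name__ == "__main__":
    # 481 is the lingspam spam count from the protocol; the error counts are hypothetical.
    print(total_cost_ratio(n_legit_as_spam=14, n_spam_as_legit=56, n_spam=481))
    print(total_cost_ratio(n_legit_as_spam=14, n_spam_as_legit=56, n_spam=481, lam=9))
```

Under this definition a TCR above 1 means the filter does better than no filter at all, and a TCR below 1 means it does worse.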

Results: TCR evolution for various population sizes

Results: TCR evolution for a population of 25 individuals

Results : Overview

Table: Comparison of spam precision and spam recall for the exhaustive and selective learning algorithms

            Exhaustive learning   Selective learning (initial)   Selective learning (best)
Precision   96.82 %               96.85 %                        98.72 %
Recall      88.33 %               89.60 %                        96.47 %

- Better solutions found at the first iteration

- TCR improved by a factor of 4

- Best solutions contain only 1/3 of the lingspam corpus

Results on SpamAssassin

- Bernoulli naive Bayes performs poorly (TCR < 1)

- Initial solutions must be almost exhaustive

- Selective learning does not bring much improvement

Online selective learning

- Initial learning is only half of the job

- Is online selective learning possible?

- Assuming no user feedback

- Corpus → flow

- For each incoming message, a decision problem: shall we learn it?

- Idea: for each incoming message, test whether learning this message improves the filter's precision over the N previous messages (learning window)

Online selective learning algorithm

Input: W_i, the i-th message of the mail flow; f, a classifier; N, an integer

Output: true if W_i should be learned, false otherwise

begin
    f′ ← copy(f)
    if f(W_i) ≥ λ then
        learn(f′, W_i, spam)
    else
        learn(f′, W_i, ham)
    C ← {W_j : i − N ≤ j ≤ i}
    if A(f, C) ≥ A(f′, C) then
        return false
    else
        return true
end

Algorithm 1: Online selective learning
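
Algorithm 1 can be sketched in Python as below. The classifier is a deliberately simple token-count stand-in, not the Bernoulli naive Bayes filter of the experiments, and the window is assumed to hold (message, label) pairs for the most recent messages; the slides do not say where those labels come from when there is no user feedback. All class and function names are hypothetical.

```python
import copy
from collections import Counter
from typing import List, Tuple


class TokenCountFilter:
    """Toy classifier: score = share of a message's tokens seen more often in spam."""

    def __init__(self) -> None:
        self.spam: Counter = Counter()
        self.ham: Counter = Counter()

    def learn(self, message: str, label: str) -> None:
        target = self.spam if label == "spam" else self.ham
        target.update(set(message.lower().split()))

    def score(self, message: str) -> float:
        tokens = set(message.lower().split())
        if not tokens:
            return 0.0
        spammy = sum(1 for t in tokens if self.spam[t] > self.ham[t])
        return spammy / len(tokens)


def accuracy(f: TokenCountFilter, window: List[Tuple[str, str]], lam: float) -> float:
    """Fraction of window messages whose thresholded score matches the stored label."""
    if not window:
        return 1.0
    hits = sum((f.score(m) >= lam) == (label == "spam") for m, label in window)
    return hits / len(window)


def should_learn(f: TokenCountFilter, message: str,
                 window: List[Tuple[str, str]], lam: float = 0.5) -> bool:
    """Return True only if learning `message` strictly improves accuracy on the window."""
    label = "spam" if f.score(message) >= lam else "ham"  # self-labelling, no user feedback
    candidate = copy.deepcopy(f)  # f' in Algorithm 1; f itself is left untouched
    candidate.learn(message, label)
    return accuracy(candidate, window, lam) > accuracy(f, window, lam)


if __name__ == "__main__":
    f = TokenCountFilter()
    f.learn("cheap pills offer", "spam")
    f.learn("meeting agenda notes", "ham")
    window = [("cheap pills now", "spam"),
              ("project meeting notes", "ham"),
              ("win cheap prize now", "spam")]
    for msg in ("cheap pills win prize", "meeting agenda"):
        if should_learn(f, msg, window):
            f.learn(msg, "spam" if f.score(msg) >= 0.5 else "ham")
            print("learned:", msg)
        else:
            print("skipped:", msg)
```

Evaluating a copy keeps the decision side-effect free: the caller commits the update to the real filter only when the function returns true.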

TCR evolution, regular lingspam

- Little to no improvement

- Slight loss for window = 50, 25

- Slight gain for window = 500

- But global evolution is even

- Easy mail flow → conservative learning strategies

TCR evolution, noisy lingspam (5%)

Conclusions

- A learning model specifically designed to address the issues of spam filtering

- Easy to implement...

- Good synergy with existing techniques

- Not tied to a specific classification model

Perspectives and future works

- Efficient heuristics for initial solutions?

- Make use of the data that was not learned

- Dynamic variation of the online selective learning window

Thank you !
