1
E-Mail Filtering
Soonyeon Kim
2
Good Site for Data Mining
http://liinwww.ira.uka.de/bibliography/
- The Collection of Computer Science Bibliographies
Major Conferences in Data Mining
- KDD 2000 of ACM SIGKDD
http://www.acm.org/sigs/sigkdd/kdd2000/
- SIGMOD 2000 of ACM SIGMOD
Other Conferences
- VLDB, IEEE ICDE, PAKDD conference
3
Text Mining: Finding Nuggets in Mountains of Textual Data
Author
- Jochen Dörre, Peter Gerstl, Roland Seiffert
- {doerre,gerstl,seiffert}@de.ibm.com
Method to find this paper
- Searching from The Collection of Computer Science Bibliographies
- key words used : Data mining & Text classification
4
Brief Description
What is Text Mining?
- applies the same analytical functions as data mining to the domain of textual information
How does text mining differ from data mining?
- Data mining : addresses a very limited part of data (structured information available in databases)
- Text mining : helps to dig out the hidden gold from textual information and requires very complex feature extraction functions
The paper describes in more detail the unique technologies that are key to successful text mining
5
Ifile: An Application of Machine Learning to E-mail Filtering
Author
- Jason D. M. Rennie
Artificial Intelligence Lab, MIT
- [email protected]
Method to find this paper
- KDD 2000 of ACM SIGKDD
6
Outline of Paper
Introduction
- need for automated e-mail filtering
- Ishmail
- important issues regarding mail filtering
Mail Filtering
- Classification Efficiency
- Features
- Naïve Bayes algorithm
IFILE
Experiment
Conclusion
7
Introduction
Popular e-mail clients allow users to organize their mail into folders by meaningful topic
- popular e-mail clients : Netscape Messenger, Pine, Microsoft Outlook, Eudora and EXMH
Ishmail
- purpose of a prioritization system
- alerts the user when high-priority mail has arrived or a large number of messages have accumulated in a lower-priority folder
Barriers
- implementation of mail filters (speed efficiency, database size, collection of supervised training data)
- integration into e-mail clients
8
Classification Efficiency
Traditional classification methods
- kNN, C4.5, Naïve Bayes
Recent developments
- SVM (Support Vector Machine), Maximum Entropy Discrimination
Efficiency Problems
- SVM and Maximum Entropy Discrimination provide significant improvement in accuracy, but at the cost of simplicity and time efficiency
- kNN : slow to classify, since each new document is compared against the stored training documents
9
Classification Efficiency(2)
Naïve Bayes
- efficient training, quick classification and extensibility to iterative learning
- training : updating word counts
- classification : normalized sum of the counts corresponding to the words in the document in question
10
Personal E-mail Filtering
Every user has a unique collection of e-mail
Each user organizes e-mail in a unique way
The organization pertains directly to the user's preferences
Key fact for effective personal e-mail filtering
- using the information made available through the user interface of the mail client
11
Learning Architecture
A label is assigned to each newly filtered e-mail message
The message is added to the classification model
Updating the classification model : every filtered e-mail is a training example
- the label is assumed to be correct if the user does not move the message to another folder
- the model is updated if the user moves misclassified mail into the appropriate folder
Update for Naïve Bayes
- shift word counts from one folder to another
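The count-shifting update can be sketched as follows. This is a minimal illustration, not ifile's actual code: the class name `FolderCounts` and its methods are hypothetical, and each message is assumed to be already tokenized into a list of words.

```python
from collections import Counter, defaultdict


class FolderCounts:
    """Hypothetical per-folder word-count store for a Naive Bayes mail filter."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> number of messages

    def add(self, folder, words):
        # Training on one message is just incrementing counts.
        self.word_counts[folder].update(words)
        self.message_counts[folder] += 1

    def move(self, words, src, dst):
        # The user moved a misfiled message: take its words out of src ...
        self.word_counts[src].subtract(words)
        self.word_counts[src] += Counter()  # drops zero and negative entries
        self.message_counts[src] -= 1
        # ... and count them in dst instead.
        self.add(dst, words)


model = FolderCounts()
model.add("A", ["party", "party", "belch"])
model.add("A", ["yellow"])             # filed in A by the filter
model.move(["yellow"], "A", "B")       # user corrects the filing
```

Because the update touches only the words of the moved message, correcting a mistake costs about as much as filing the message in the first place.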
12
Features
A classification model acts as a function f
f : F → C
- F : features, C : classes
A mail filter is a special classifier
f : F → C
- F : characteristics of e-mail messages, C : mail folders
- by considering each e-mail message as a bag of words, the function f maps an unordered set of words to a folder name
13
Features(2)
Naïve Bayes keeps track of word frequency statistics
Reduce the number of features used for classification to make filtering more efficient
Feature selection cutoff
- old, infrequent words are dropped
- words that occur fewer than log(age)-1 times should be discarded from the model
- age : number of e-mail messages added to the model since statistics have been kept for that word
e.g. if “baseball” occurred in the 1st document and occurred 5 or fewer times in the next 63 documents, the word and its statistics would be eliminated from the database.
14
Maintaining Dictionary
Cutoff Algorithm
- words that occur fewer than log(age)-1 times should be discarded from the model
e.g. “datamining” occurred in the 1st document
63 documents arrive after that document
age = 1 + 63 = 64
log(age) – 1 = 6 – 1 = 5 (logarithm to base 2)
if “datamining” appears 5 or fewer times, the word and its statistics would be eliminated from the database.
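The cutoff rule can be written as a small predicate. This is a sketch under two assumptions: the logarithm is base 2 (the only base for which an age of 64 gives 5), and the boundary case is left as a parameter because the rule says "fewer than" while the worked example says "5 or fewer". The function name `should_drop` is hypothetical.

```python
import math


def should_drop(count: int, age: int, inclusive: bool = False) -> bool:
    """Return True if a word seen `count` times over `age` messages is pruned.

    age must be at least 1. With inclusive=False the rule is
    count < log2(age) - 1; with inclusive=True it is count <= log2(age) - 1.
    """
    threshold = math.log2(age) - 1
    return count <= threshold if inclusive else count < threshold


# age = 1 + 63 = 64, so the threshold is log2(64) - 1 = 5
print(should_drop(4, 64))                   # True under either reading
print(should_drop(5, 64))                   # False under the strict reading
print(should_drop(5, 64, inclusive=True))   # True under the "5 or fewer" reading
```

The threshold grows only logarithmically with age, so a word needs very few occurrences to survive, while words seen once long ago are eventually removed.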
15
Maintaining Dictionary(2)
-----------.idata-------------
A B C                list of folders (A:0 B:1 C:2)
5 2 6                total word instances
2 1 1                # of messages
party 4 0:2 1:1      word age folder:frequency
belch 3 0:1
yellow 4 0:2 2:3
kick 2 1:1
peep 1 2:2
two msgs in A - "party party belch yellow yellow"
one msg in B - "party kick"
one msg in C - "peep peep yellow yellow yellow"
16
Word Selection
Header Trimming
- E-mail
1. Body: content
2. Header : list of fields pertaining to the message
From: To: Subject:
- keep this part
Received: Date: Message-id
- remove
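Header trimming as described above can be sketched with Python's standard `email` package. The helper name `trim_headers` and the sample message are illustrative assumptions, and the sketch only handles plain single-part messages.

```python
from email import message_from_string

# Headers the slide says to keep; everything else is dropped.
KEPT_HEADERS = ("From", "To", "Subject")


def trim_headers(raw_message: str) -> str:
    """Return the message text with only the kept headers plus the body."""
    msg = message_from_string(raw_message)
    kept = [f"{name}: {msg[name]}" for name in KEPT_HEADERS if msg[name] is not None]
    # Assumption: single-part text message; multipart bodies are ignored here.
    body = msg.get_payload() if not msg.is_multipart() else ""
    return "\n".join(kept) + "\n\n" + body


raw = (
    "Received: from a.example.com by b.example.com\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: lunch\n"
    "Date: Mon, 1 Jan 2001 00:00:00 +0000\n"
    "Message-id: <1@example.com>\n"
    "\n"
    "See you at noon.\n"
)
print(trim_headers(raw))
```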
17
Tokenizing text
Two techniques
1. Using stop list
- decrease the amount of noise in the data by eliminating uninformative words
e.g. pronoun, modifier, adverb
2. Stemming
- link together words which have the same root
e.g. serve, service, serves, served
=> same root serv
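Both techniques can be illustrated with a toy preprocessor. The stop list and suffix table below are made-up examples chosen to reproduce the "serv" case; this is a crude suffix stripper, not the Porter stemmer or whatever ifile actually uses.

```python
import re

# Illustrative stop list only; a real one is much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "he", "she", "it", "they", "very"}

# Toy suffix table, longest suffix first.
SUFFIXES = ("ice", "es", "ed", "e", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list:
    """Lowercase, split into alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("She served the service and he serves to serve"))
# ['serv', 'serv', 'serv', 'serv']
```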
18
Naïve Bayes
What is Naïve Bayes?
- Simple, yet effective classifier of text documents
- Statistical machine learning algorithm
Assumptions
- each document is considered as a set of words
- each word is independent of the others, given the class
19
Naïve Bayes - 1st step
Probability of d having been generated by ci
\[ P(c_i \mid d) = \frac{P(c_i) \prod_{w_j \in d} P(w_j \mid c_i)}{P(d)} \]
- With the assumption that attribute values are independent,
\[ P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i} P(a_i \mid v_j) \]
20
Naïve Bayes - 2nd step
Compute P(ci|d) for all classes
Find the class to assign to the document
Maximum likelihood
- Probability values are only used for comparison purposes, so P(d) can be dropped
\[ C_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{w_j \in d} P(w_j \mid c_i) \]
21
Naïve Bayes
M-estimate
- purpose : to give a reasonable probability in the case of sparse data
\[ P(w_j \mid c_i) = \frac{n_j + 1}{n + |Vocabulary|} \]
- nj : number of instances of wj in the documents of class ci
- n : total number of words in documents of class ci
- |Vocabulary| : number of distinct words
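Putting the two steps and the m-estimate together, a classifier can be sketched in a few lines. This is an illustrative implementation, not ifile's: the function names `train` and `classify` are hypothetical, the class prior is estimated from message counts (an assumption), and log-probabilities are summed rather than multiplying raw probabilities to avoid underflow. The toy messages reuse words from the dictionary example.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: iterable of (folder, list_of_words) pairs."""
    word_counts, doc_counts = {}, Counter()
    for folder, words in labeled_docs:
        word_counts.setdefault(folder, Counter()).update(words)
        doc_counts[folder] += 1
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(words, word_counts, doc_counts, vocabulary):
    """Return argmax over folders of log P(c) + sum of log P(w | c)."""
    total_docs = sum(doc_counts.values())
    best_folder, best_score = None, -math.inf
    for folder, counts in word_counts.items():
        n = sum(counts.values())                       # words seen in this folder
        score = math.log(doc_counts[folder] / total_docs)
        for w in words:
            # m-estimate: (n_j + 1) / (n + |Vocabulary|)
            score += math.log((counts[w] + 1) / (n + len(vocabulary)))
        if score > best_score:
            best_folder, best_score = folder, score
    return best_folder


model = train([
    ("A", "party party belch yellow yellow".split()),
    ("B", "party kick".split()),
    ("C", "peep peep yellow yellow yellow".split()),
])
print(classify(["peep", "yellow"], *model))   # C
print(classify(["kick"], *model))             # B
```

P(d) never appears in the code because it is the same for every folder and does not change which score is largest.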
22
Experimental Result
Information about the e-mail corpora on which classification experiments were conducted.
Four volunteers, including the author
23
Experimental Result
24
Experimental Result
Individual experiments with different settings
1. Alpha lexer, stoplist used, header trimming, feature selection, no stemming
2. Alpha only lexer replaces alpha lexer
3. White lexer replaces alpha lexer
4. No stoplist is used
5. Stemming is used
6. No feature selection is used
7. All headers are used for classification purposes
25
Experimental Result
Three Lexers
1. Alpha lexer
- default lexer
- tokenizes strings of alphabetic characters
2. Alpha only lexer
- tokenizes only strings of alphabetic characters
- does not lex e-mail addresses into tokens
3. White lexer
- tokenizes strings separated by whitespace
26
Experimental Result
Result
- No single experimental setting provides the best results across all users
Experiment with highest average accuracy
- experiment #1 shows the best average result (89% accuracy)
- ranging from 86% to 91%
27
Experimental Result
Time Efficiency
- Naïve Bayes : “fast enough”
- 27 seconds to build a model of 7000+ e-mail messages (average 259 msg/second)
(tar-gzip of the same messages requires 17 seconds)
Space Efficiency
- classification model built on 7000+ messages across 49 folders requires only 447,090 bytes
28
Filtering Junk E-Mail
Soonyeon Kim
29
A Bayesian Approach to Filtering Junk E-Mail
Authors
- Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz
- Stanford University & Microsoft Research
From
- AAAI 98 (American Association for Artificial Intelligence)
30
Problems of Junk-mail
Wasting time
- Many users must now spend a non-trivial portion of their time dealing with unwanted messages
Content of Material
- Some of these messages contain offensive material such as graphic pornography
Space problem
- Junk mail also quickly fills up file server storage space
31
Machine Learning Approach
Learning
- system S learns from experience E with respect to a class of tasks T and performance measure P
Learning in junk-mail
- S : e-mail classifier
- T : classify an e-mail message as junk/legitimate
- P : fraction of correct predictions
- E : a set of pre-classified e-mail messages
Vector Space Model
- represents mail messages as feature vectors
- each e-mail message has a single fixed-length feature vector
- an individual message can be represented as a binary vector denoting which words are present or absent (1 for present, 0 for absent)
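The binary vector representation can be sketched as below. The helper name `to_binary_vector`, the three-word vocabulary, and whitespace tokenization are all illustrative assumptions.

```python
def to_binary_vector(message: str, vocabulary: list) -> list:
    """Return one 0/1 entry per vocabulary word: 1 if the word occurs in the message."""
    words = set(message.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


vocabulary = ["money", "free", "meeting"]   # fixed order gives a fixed-length vector
print(to_binary_vector("Free money now", vocabulary))   # [1, 1, 0]
```

Every message maps to a vector of the same length regardless of how long the message is, which is what lets the classifier treat position i as feature Xi.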
32
Bayesian Classifier
e-mail message as a vector of N features
X = X1, X2, X3, ..., XN
- For example, X42 might be ‘the e-mail contains “money”’
- x42 = 0 means ‘the message described by x does not contain the word “money”’
classify messages into K classes
C = {c1 , c2} = {junk, legit} (K=2)
Now suppose we see a new e-mail message, with encoding x. We seek the probability that the class C is junk,
Pr[C=junk | X=x]
shorthand for Pr[C=junk | X1=x1 & X2=x2 & … & XN=xN]
33
Bayesian networks
(a) a Naïve Bayesian classifier
(b) a more complex Bayesian classifier with limited dependencies between the features
34
Bayesian Rule
Bayes theorem
\[ P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)} \]
Assume that each Xi is independent given the class
\[ P(X = x \mid C = c_k) = \prod_{i} P(X_i = x_i \mid C = c_k) \]
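For the two-class junk/legit case with binary features, Bayes' theorem and the independence assumption combine into a short computation of Pr[C=junk | X=x]. The sketch below uses made-up probabilities; the function name `posterior_junk` is hypothetical, and every conditional probability is assumed to lie strictly between 0 and 1.

```python
import math


def posterior_junk(x, prior_junk, p_given_junk, p_given_legit):
    """x: list of 0/1 feature values; p_given_*[i] = P(X_i = 1 | class)."""

    def log_likelihood(probs):
        # Independence: log P(X=x | C) is a sum over features.
        return sum(math.log(p if xi else 1.0 - p) for xi, p in zip(x, probs))

    log_junk = math.log(prior_junk) + log_likelihood(p_given_junk)
    log_legit = math.log(1.0 - prior_junk) + log_likelihood(p_given_legit)
    # P(X=x) is the sum of both numerators; subtract the max for numerical stability.
    top = max(log_junk, log_legit)
    junk, legit = math.exp(log_junk - top), math.exp(log_legit - top)
    return junk / (junk + legit)


# Two hypothetical features, e.g. "contains money" and "sender in address book"
p = posterior_junk([1, 0], prior_junk=0.5,
                   p_given_junk=[0.8, 0.1], p_given_legit=[0.1, 0.5])
print(round(p, 4))   # 0.9351
```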
35
Features
1. Words
- fixed width vector <X = X1, X2,…, Xn>
2. Hand-crafted Phrasal Features
- “FREE!”, “only $” (as in “only $4.95”) and “be over 21”
- 35 such hand-crafted phrases are included
3. Domain-specific features
- domain type of sender (.edu or .com)
- junk mail is usually not from .edu domain
Resolving familiar e-mail addresses
- i.e. replace [email protected] with Susan Dumais
Time
- most junk e-mail is sent at night
36
Features(2)
Peculiar punctuation
- percentage of non-alphanumeric characters in the subject of a mail
- “$$$$$ MONEY $$$$$”
Chart axes
- X : subject has peculiar punctuation
- Y : percentage of total messages
37
Feature Selection
Mutual Information
- Mutual information MI(A,B) is a numeric measure of what we can conclude about A if we know B, and vice-versa.
- Example: If A and B are independent, then we can’t conclude anything: MI(A, B) = 0
- Select the 500 features with the greatest value
\[ MI(X_i; C) = \sum_{X_i = x,\; C = c} P(X_i = x, C = c) \log \frac{P(X_i = x, C = c)}{P(X_i = x)\, P(C = c)} \]
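Mutual information between one binary feature and the class can be estimated from a table of message counts. The sketch below assumes probabilities are estimated as plain relative frequencies and uses made-up counts; the function name `mutual_information` is hypothetical.

```python
import math


def mutual_information(joint_counts):
    """joint_counts[(x, c)] = number of messages with feature value x and class c."""
    total = sum(joint_counts.values())
    p_x, p_c = {}, {}
    for (x, c), n in joint_counts.items():
        p_x[x] = p_x.get(x, 0.0) + n / total   # marginal P(X_i = x)
        p_c[c] = p_c.get(c, 0.0) + n / total   # marginal P(C = c)
    mi = 0.0
    for (x, c), n in joint_counts.items():
        if n == 0:
            continue                            # 0 * log(0) is taken as 0
        p_xc = n / total
        mi += p_xc * math.log(p_xc / (p_x[x] * p_c[c]))
    return mi


# Feature independent of the class: MI is 0
independent = {(0, "junk"): 25, (1, "junk"): 25, (0, "legit"): 25, (1, "legit"): 25}
# Feature that determines the class exactly: MI is log 2
dependent = {(1, "junk"): 50, (0, "legit"): 50}
print(mutual_information(independent), mutual_information(dependent))
```

Ranking all candidate features by this score and keeping the top 500 gives the selection step described above.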
38
Evaluation
Three ways
1. Using Domain-specific Features
- Words only
- Words + Phrases
- Words + Phrases + Extra Features
2. Three way Categorization
- 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit}
3. “Real” scenario
39
Using different features
The cost of missing legitimate e-mail is much higher than the cost of inadvertently reading junk.
The authors wanted to make their system very “optimistic”, so that it only predicts “junk” if it is very certain: it uses a threshold of 99.9%.
1789 hand-tagged e-mail messages
- 1578 junk
- 211 legit
Split into…
- 1538 training messages (86%)
- 251 testing messages (14%)
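The effect of the 99.9% threshold on individual decisions can be sketched as a one-line rule. The function name `label_message` is hypothetical, and whether a posterior exactly equal to the threshold counts as junk is an assumption.

```python
def label_message(posterior_junk: float, threshold: float = 0.999) -> str:
    """Call a message junk only when the classifier is at least `threshold` sure."""
    return "junk" if posterior_junk >= threshold else "legit"


print(label_message(0.95))                   # legit: 95% is not certain enough
print(label_message(0.9995))                 # junk
print(label_message(0.95, threshold=0.5))    # junk under an even-odds threshold
```

Raising the threshold trades junk recall for junk precision: more junk slips through, but fewer legitimate messages are discarded.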
40
Using different features
Result of experiment
- words only
- words + 35 phrasal features
- words + phrasal features + 20 non-textual domain-specific features
41
Using different features
Junk Precision = A / (A + C)
Junk Recall = A / (A + B)
Legit Precision = D / (D + B)
Legit Recall = D / (D + C)
Junk precision is of greatest concern to most users, because they would not want their legitimate mail discarded as junk
                 predict Junk           predict Legit
actually Junk    A (true positives)     B (false negatives)
actually Legit   C (false positives)    D (true negatives)
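The four measures follow directly from the cells A, B, C, D of the table. The counts in the example are invented for illustration and are not results from the paper; the function name `junk_legit_metrics` is hypothetical.

```python
def junk_legit_metrics(a: int, b: int, c: int, d: int) -> dict:
    """a: junk predicted junk, b: junk predicted legit,
    c: legit predicted junk, d: legit predicted legit."""
    return {
        "junk_precision": a / (a + c),
        "junk_recall": a / (a + b),
        "legit_precision": d / (d + b),
        "legit_recall": d / (d + c),
    }


# Hypothetical confusion matrix
metrics = junk_legit_metrics(a=90, b=10, c=5, d=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case, 5 of the 95 messages flagged as junk were actually legitimate, which is the error that junk precision penalizes.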
42
Using different features
Precision/recall curves for junk mail
43
Sub-classes of junk E-mail
Three way Categorization
- 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit}
A classification is considered correct if a “junk” message is classified as either “porn-junk” or “other-junk”
Unfortunately, it didn’t work!
- probably because more parameters require (exponentially!) more data to estimate them accurately
- some features may be very clearly indicative of junk versus legitimate mail, but may not be powerful across three categories (they do not distinguish well between the sub-classes of junk)
44
Sub-classes of junk E-mail
Precision/recall curves considering sub-groups of junk mail
45
Real Usage Scenario
Three kinds of messages
1. Read and keep
2. Read and discard (e.g. a joke from a friend)
3. Junk
Result
Misclassified mails: news stories from an e-mail news service that the user subscribes to (no loss of significant information)
46
Real Usage Scenario
Precision/recall curves in real usage scenario