Date post: | 18-Jul-2015 |
Category: |
Data & Analytics |
Upload: | chang-wei-yuan |
+
How Many Folders Do You Really Need? Classifying Email into a Handful of Categories
2014/1/23 (Fri.) Chang Wei-Yuan @ MakeLab Group Meeting
Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek, Yahoo Labs, CIKM '14
+Outline
- Introduction
- Method
  - Discovering Latent Categories
  - Modeling Data
  - Training Data
  - Classification Mechanism
- Experiment
- Conclusion
- Thought
+Introduction
- Traditional email classification is still a mostly manual task.
- Automatic classification has recently started to appear in some Web mail clients, e.g. Inbox.
- Current email traffic is dominated by non-spam, machine-generated email from:
  - social networks
  - commerce sites
  - official institutions
- Goal
  - automatically distinguish between personal and machine-generated email
  - classify messages into latent categories, without requiring users to have defined any folders
+Overview
[Figure: pipeline overview with the components Mail raw data, Extracting Features, Aggregation Level, LDA, Latent categories, Training data, Classifier, and Mail testing data]
+Discovering Latent Categories
- All messages have the potential to be classified.
  - by retrieving the most popular folders from users
- The paper applied LDA to these "document folders" to find latent categories.
  - latent topics map to "latent categories"
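The step of applying LDA to folders treated as documents could look roughly like the sketch below. It uses a collapsed Gibbs sampler, which is one standard way to fit LDA and not necessarily the inference method used in the paper. The toy folders, the hyperparameters `alpha` and `beta`, and the choice of 2 topics are assumptions made for illustration only.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns per-document and per-topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # occurrences of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def topic_proportions(counts, alpha=0.1):
    """Smoothed topic distribution of one document (here: one folder)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical folders, each flattened into a bag of words.
folders = [
    "flight hotel booking itinerary flight".split(),
    "hotel flight boarding pass booking".split(),
    "order shipping discount sale order".split(),
    "sale coupon order shipping receipt".split(),
]
doc_topic, topic_word = lda_gibbs(folders, num_topics=2)
for counts in doc_topic:
    print([round(p, 2) for p in topic_proportions(counts)])
```

Each printed row is the topic mixture of one folder. In the setting described on the slide, each latent topic would then be read as one latent category.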
[Figure: groups of messages (msg) are passed to LDA]
+Discovering Latent Categories
- The objective was to choose a value of K
  - such that each individual topic and the overall set of topics achieve significant coverage
- K = 6 was examined further
  - it gives a good balance between total and individual coverage
[Figure: a msg receives topic proportions (travel %, social %, …) and the category travel]
+Modeling Data
- Original method: each individual message is a single data point
  - various features are extracted from the message header and body
- Extracting features
  - content features
    - the message subject and body
  - address features
    - sender email address, including the subdomain
  - behavioral features
    - sender's and recipient's actions over a given message
[Figure: a msg record with the fields subject, body, action, time, sender, address, domain]
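The three feature groups listed above can be sketched as a single extraction function. This is a minimal illustration, not the paper's feature set: the `Message` class, the feature-name prefixes, and the naive "everything before the last two host labels" subdomain rule are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    subject: str
    body: str
    sender: str                                  # full sender address
    actions: list = field(default_factory=list)  # e.g. ["read", "deleted"]

def extract_features(msg: Message) -> dict:
    """Return a sparse feature dict with content, address, and behavioral features."""
    features = {}

    # Content features: bag of words over subject and body.
    for token in (msg.subject + " " + msg.body).lower().split():
        token = token.strip(".,!?:;")
        if token:
            key = "word=" + token
            features[key] = features.get(key, 0) + 1

    # Address features: full address, host, and (naively) the subdomain.
    address = msg.sender.lower()
    _, _, host = address.partition("@")
    labels = host.split(".")
    features["sender=" + address] = 1
    features["host=" + host] = 1
    if len(labels) > 2:  # assumption: the last two labels form the registered domain
        features["subdomain=" + ".".join(labels[:-2])] = 1

    # Behavioral features: counts of actions taken on the message.
    for action in msg.actions:
        key = "action=" + action
        features[key] = features.get(key, 0) + 1
    return features

msg = Message(
    subject="Your flight is confirmed",
    body="Flight AB123 departs Monday.",
    sender="noreply@mail.airline.example",
    actions=["read"],
)
features = extract_features(msg)
```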
+Modeling Data
- Extended method: aggregate messages at higher levels
  - sender address level and mail domain level
- The paper considers three levels of aggregation: message, sender, and domain.
[Figure: msg fields (subject, body, action, time, address, sender, domain) aggregated at the sender level and at the domain level; example messages are labeled shopping and traveling]
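Aggregation at the sender and domain levels can be illustrated by summing per-message feature counts under a shared key. This is a sketch under assumptions: the `(sender, features)` input shape is hypothetical, and "domain" is taken to be everything after the `@`.

```python
from collections import Counter, defaultdict

def group_key(sender: str, level: str) -> str:
    """Map a sender address to its aggregation key."""
    sender = sender.lower()
    if level == "sender":
        return sender
    if level == "domain":
        return sender.rpartition("@")[2]
    raise ValueError("unknown aggregation level: " + level)

def aggregate(messages, level):
    """Sum the feature counts of all messages that share the same key."""
    totals = defaultdict(Counter)
    for sender, features in messages:
        totals[group_key(sender, level)].update(features)
    return {key: dict(counts) for key, counts in totals.items()}

# Hypothetical per-message features.
messages = [
    ("deals@shop.example", {"word=sale": 2, "action=deleted": 1}),
    ("deals@shop.example", {"word=sale": 1, "action=read": 1}),
    ("orders@shop.example", {"word=receipt": 1, "action=read": 1}),
]
by_sender = aggregate(messages, "sender")
by_domain = aggregate(messages, "domain")
```

Here `by_sender` holds two data points and `by_domain` holds one, so a sparse signal on a single message becomes a denser signal at the higher levels.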
+Training Data
- Labeling techniques
  - the 6 latent categories are used as labels
  - a two-stage classifier is built at the message level and the sender level
[Figure: msg records (subject, action, …, sender, domain, category) and sender records (sender, domain, category); some category labels are known from LDA, others are unknown]
[Figure: a sender is assigned one of the categories human, travel, social, career]
Heuristic-based labeling:
- Domain: gmail.com, yahoo.com
- Sender: <first name>.<last name>
sender
human
travel
social
career
automatic voting
sender msg
msg
msg
folder1
folder2
folder3
travel 96%,
travel 88%,
shopping 70%, travel 20 %
+ 27
sender
human
travel
social
career
automatic voting
sender msg
msg
msg
folder1
folder2
folder3
travel
travel
shopping
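The voting example on the slide can be reproduced with a plain majority vote. This is a sketch of one plausible voting rule: the paper may weight or threshold the votes differently, and the function name and the returned vote share are assumptions.

```python
from collections import Counter

def vote_sender_label(folder_topics):
    """Each folder votes for its top topic; the sender gets the majority label."""
    votes = Counter(max(topics, key=topics.get) for topics in folder_topics)
    label, count = votes.most_common(1)[0]
    return label, count / len(folder_topics)

# Topic proportions of the three folders shown on the slide.
folder_topics = [
    {"travel": 0.96},
    {"travel": 0.88},
    {"shopping": 0.70, "travel": 0.20},
]
label, share = vote_sender_label(folder_topics)
print(label, round(share, 2))
```

With these inputs the sender is labeled travel, backed by two of the three folder votes.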
+Classification Mechanism
- Offline creation of the classified senders table and the message-level classifier
  - The training set is used to train a logistic regression model.
  - For each category, a separate model is trained in a one-vs-all manner.
  - The classification process is run periodically to account for new senders.
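One-vs-all training with logistic regression can be sketched in plain Python as follows. The stochastic gradient descent loop, the learning rate, the epoch count, and the toy samples are assumptions for illustration, and the paper's actual training setup is not described on the slides.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, x):
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in x.items()))

def train_binary(samples, labels, epochs=200, lr=0.5):
    """Logistic regression on sparse feature dicts, trained with SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = score(weights, bias, x) - y  # gradient of the log loss w.r.t. z
            bias -= lr * error
            for f, v in x.items():
                weights[f] = weights.get(f, 0.0) - lr * error * v
    return weights, bias

def train_one_vs_all(samples, categories):
    """Train one binary model per category: that category vs. all the others."""
    return {
        c: train_binary(samples, [1 if y == c else 0 for y in categories])
        for c in sorted(set(categories))
    }

def predict(models, x):
    """Return the category whose model is most confident, with its confidence."""
    scores = {c: score(w, b, x) for c, (w, b) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical sender-level training data.
samples = [
    {"flight": 1, "hotel": 1}, {"flight": 1, "boarding": 1},
    {"sale": 1, "coupon": 1}, {"order": 1, "sale": 1},
    {"friend": 1, "invite": 1}, {"friend": 1, "tagged": 1},
]
categories = ["travel", "travel", "shopping", "shopping", "social", "social"]
models = train_one_vs_all(samples, categories)
print(predict(models, {"sale": 1, "coupon": 1}))
```

Keeping one model per category means each model answers a yes/no question with a confidence, and the final label is the most confident "yes".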
+Classification Mechanism
[Figure: offline stage with the labels 35% sender training data, 65% sender testing data, msg training data, two classifiers, and the senders table]
+Classification Mechanism
- Online light-weight classification
  - the initial classification phase
  - hard-coded rules designed to classify quickly
  - this phase requires very few resources and covers 32% of the email traffic
- Online sender-based classification
  - the second phase of the cascade
  - looks up the sender in the senders table of known categories
  - roughly 8% of the traffic is not covered by this phase
- Online heavy-weight classification
  - only 8% of the traffic ends up in this last phase
  - so slightly heavier computation is affordable for the classifier
  - uses all relevant features, pertaining to the message body, subject line, and sender name
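The three online phases form a cascade in which each message stops at the first phase that can label it. The sketch below shows only that control flow. The rule inside `light_weight_rules` is an invented placeholder (the slides do not list the actual hard-coded rules), and `heavy_classifier` stands in for the trained message-level model.

```python
def light_weight_rules(msg):
    """Phase 1: cheap hard-coded rules. The rule below is a hypothetical example."""
    if msg["subject"].lower().startswith(("re:", "fwd:")):
        return "human"
    return None

def classify(msg, senders_table, heavy_classifier):
    """Return (category, phase) from the first phase that produces a label."""
    label = light_weight_rules(msg)
    if label is not None:
        return label, "light-weight"

    # Phase 2: look the sender up in the offline-built senders table.
    label = senders_table.get(msg["sender"].lower())
    if label is not None:
        return label, "sender-based"

    # Phase 3: fall back to the more expensive message-level classifier.
    return heavy_classifier(msg), "heavy-weight"

# Hypothetical senders table and stand-in for the heavy-weight model.
senders_table = {"deals@shop.example": "shopping"}

def heavy_classifier(msg):
    return "travel" if "flight" in msg["subject"].lower() else "social"

for msg in [
    {"sender": "a@b.example", "subject": "Re: lunch"},
    {"sender": "Deals@shop.example", "subject": "Weekend sale"},
    {"sender": "new@airline.example", "subject": "Your flight"},
]:
    print(classify(msg, senders_table, heavy_classifier))
```

Ordering the phases from cheapest to most expensive is what lets the heavier model run on only the small remainder of the traffic.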
+One-vs-all
[Figure: a msg is passed to one binary classifier per category (social, human, career, shopping, travel, finance); each answers "yes" with a confidence, or "no"]
+Semi-supervised
[Figure: the pipeline overview diagram (Mail raw data, Extracting Features, Aggregation Level, LDA, Latent categories, Training data, Classifier, Mail testing data), repeated over four slides]
+Experiment
- The paper estimated the actual volume of machine-generated messages on a very large Yahoo mail dataset.
- The dataset was built for the purpose of this work.
  - 6 months of email traffic
  - more than 500 billion messages
- 5 sender-based classifiers for the machine latent categories
  - Shopping, Financial, Travel, Career, and Social
- 1 sender-based classifier for the human category
+Conclusion
- The paper presented a Web-scale email categorization approach.
  - offline learning
  - online classification
- It discovers latent categories.
- It discriminates between human and machine-generated email.
- It yields a scalable online system that can be applied in Web mail.
+Future Work
- Discuss how categories should be exposed to users.
+Thought
- Extend the multiclass classification to multi-label classification.
+Overview
[Figure: the pipeline overview diagram, repeated and annotated with two open questions: "k ?" and "threshold ?"]
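One way to read the multi-label idea together with the "threshold ?" question is to keep every category whose one-vs-all confidence clears a cutoff, instead of only the single best one. The sketch below illustrates that reading. The function name, the example confidences, and the threshold values are assumptions, not something proposed in the paper.

```python
def multi_label(confidences, threshold=0.5):
    """Return all categories at or above the threshold, most confident first."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [category for category, p in ranked if p >= threshold]

# Hypothetical per-category confidences for one message.
confidences = {"travel": 0.81, "shopping": 0.64, "social": 0.10}
print(multi_label(confidences, threshold=0.5))
print(multi_label(confidences, threshold=0.9))
```

A low threshold yields more labels per message and a high one may yield none, which is why the choice of threshold is left as an open question.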
+Thanks for listening.
2014/01/23 (Tue.) @ MakeLab Group Meeting