Active Learning to Classify Email

4/22/05

What’s the problem?

How will I ever sort all these new emails?

What’s the problem?

To get an idea of what mail I have received, I will need to sort these new messages.

A great solution would be if I could sort just a few and my computer could sort the rest for me.

To make it really accurate, the assistant could even pick which messages I should sort manually, so that it can learn to do the best job possible (active learning).

What’s the solution?

To solve this problem, we need a way to choose the most informative training examples.

This requires some way of sorting emails by how informative they are for classification.

Email Classification

So, what do we know about email classification?

SVM and Naïve Bayes significantly outperform many other methods. (Brutlag 2000, Kiritchenko 2001)

Both SVM and Naïve Bayes are suitable for the “online” learning needed to solve this problem effectively. (Cauwenberghs 2000)

Classifier accuracy varies more between users than between algorithms. (Kiritchenko 2001)

SVM performs better for users with more email in each folder. (Brutlag 2000)

Users with more email, such as in our example problem, tend to have more email in each folder than other users. (Klimt 2004)

Thus, we have chosen SVM as the basis for this research.

“Bag-of-Words” Model

[Diagram: email data -> “bag of words” -> SVM -> classification decision]
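As a rough illustration of the first step in the diagram above, the sketch below turns raw email text into bag-of-words count vectors, the form an SVM would consume. The tokenizer, the function names `tokenize`, `build_vocabulary`, and `to_vector`, and the sample emails are all assumptions made for illustration; the slides do not say which preprocessing was used.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails):
    """Map each distinct word in the given emails to a column index."""
    words = sorted({w for email in emails for w in tokenize(email)})
    return {w: i for i, w in enumerate(words)}

def to_vector(email, vocabulary):
    """Count occurrences of each vocabulary word; word order is discarded."""
    counts = Counter(tokenize(email))
    # dicts preserve insertion order, so columns follow the vocabulary indices
    return [counts.get(w, 0) for w in vocabulary]

# Hypothetical sample data
emails = ["Meeting moved to Friday", "Friday lunch? Lunch is on me"]
vocab = build_vocabulary(emails)
print(list(vocab))                  # ['friday', 'is', 'lunch', 'me', 'meeting', 'moved', 'on', 'to']
print(to_vector(emails[1], vocab))  # [1, 1, 2, 1, 0, 0, 1, 0]
```

Words that never appeared when the vocabulary was built are simply ignored by `to_vector`, which is one common (but not the only) way to handle unseen words.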

Multiple SVMs

Using separate SVMs for each section

[Diagram: email data; SVMs; LLSF; classification decision]

Active Learning with SVM

In general, examples closer to the decision boundary hyperplane will cause larger displacement of that boundary. (Schohn and Cohn 2000, Tong 2001)
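The idea above suggests asking the user to label the unlabeled email that lies closest to the hyperplane. The following is a minimal sketch of that selection rule for a linear SVM, assuming the weights and bias have already been learned; the names `decision_value` and `pick_most_informative` and the numbers are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b of a linear SVM; its sign is the predicted class."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def pick_most_informative(weights, bias, unlabeled):
    """Return the index of the unlabeled example closest to the hyperplane.

    The true distance is |w.x + b| / ||w||, but ||w|| is the same for every
    example, so comparing |w.x + b| gives the same ordering.
    """
    return min(range(len(unlabeled)),
               key=lambda i: abs(decision_value(weights, bias, unlabeled[i])))

# Hypothetical model and candidate examples
weights, bias = [1.0, -1.0], 0.0
unlabeled = [[3.0, 0.0], [1.0, 0.8], [0.0, 2.0]]
print(pick_most_informative(weights, bias, unlabeled))  # 1
```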

What if our prediction is right?

Labeling the closer example:

Labeling the farther example:

And if our prediction is wrong?

Picking the closer example:

Picking the farther example:

Incorporating Diversity

In this example, the instance near the top is intuitively more likely to be informative. This is known as “diversity” (Brinker 2003).
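One way to make diversity concrete is to pick a batch greedily, trading off closeness to the hyperplane against similarity to examples already chosen. The sketch below follows that general idea in the spirit of Brinker (2003), using cosine similarity on the raw feature vectors; the function names, the trade-off parameter `lam`, and the numbers are assumptions for illustration, not details taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_diverse_batch(margins, vectors, batch_size, lam=0.5):
    """Greedily choose a batch of example indices.

    margins[i] is |f(x_i)|, the unsigned SVM score of example i.
    Each step picks the example minimizing
        lam * margin + (1 - lam) * (max similarity to already chosen examples),
    so lam = 1.0 reduces to plain closest-to-hyperplane selection.
    """
    chosen = []
    candidates = set(range(len(vectors)))
    while candidates and len(chosen) < batch_size:
        def score(i):
            similarity = max((cosine(vectors[i], vectors[j]) for j in chosen),
                             default=0.0)
            return lam * margins[i] + (1 - lam) * similarity
        best = min(sorted(candidates), key=score)  # sorted for stable tie-breaks
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Hypothetical data: examples 0 and 1 are near-duplicates close to the boundary
margins = [0.1, 0.15, 0.4]
vectors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
print(select_diverse_batch(margins, vectors, batch_size=2, lam=0.5))  # [0, 2]
print(select_diverse_batch(margins, vectors, batch_size=2, lam=1.0))  # [0, 1]
```

With `lam=0.5` the near-duplicate example 1 is skipped in favor of example 2, even though example 2 is farther from the boundary.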

Active Learning with SVM

But what about when you have multiple SVMs (like one-vs-rest)? (Yan 2003)
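With one SVM per folder there is no single hyperplane to measure distance from. One common heuristic, shown here only as an illustration and not necessarily the approach of Yan (2003), is to prefer the example whose two highest one-vs-rest scores are closest together. The function names and scores below are hypothetical, and the sketch assumes at least two folders.

```python
def top_two_gap(scores):
    """Gap between the highest and second-highest one-vs-rest scores."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[1]

def pick_most_ambiguous(score_rows):
    """Index of the example whose two best folders are hardest to tell apart.

    score_rows[i][k] is the score the SVM for folder k gives to example i.
    """
    return min(range(len(score_rows)), key=lambda i: top_two_gap(score_rows[i]))

# Hypothetical scores for three emails and three folders
score_rows = [
    [2.0, -1.0, -1.5],   # clearly folder 0
    [0.3, 0.2, -2.0],    # folders 0 and 1 nearly tied
    [-0.5, -0.9, -3.0],  # weak everywhere, but folder 0 still leads
]
print(pick_most_ambiguous(score_rows))  # 1
```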

The Enron Corpus

150+ users

200,000 emails

Initial Results

Trained on 10%, tested on 90%

Chrono-Diverse Algorithm

The way a user sorts email changes over time.

Pick training data that are maximally different from previous data with respect to time.
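The slides do not give an exact definition of "maximally different with respect to time". One plausible reading, sketched below as an assumption, is to pick the unlabeled email whose timestamp is farthest from the nearest already-labeled email. The function name `pick_chrono_diverse`, the use of plain numbers as timestamps, and the fallback for an empty labeled set are all hypothetical.

```python
def pick_chrono_diverse(labeled_times, unlabeled_times):
    """Index of the unlabeled email farthest in time from every labeled email.

    Timestamps are plain numbers (for example, days since the first email).
    If nothing has been labeled yet, fall back to the first candidate
    (an arbitrary choice made for this sketch).
    """
    if not labeled_times:
        return 0

    def gap(i):
        # Distance in time to the nearest labeled email
        return min(abs(unlabeled_times[i] - t) for t in labeled_times)

    return max(range(len(unlabeled_times)), key=gap)

# Hypothetical timestamps in days
labeled_times = [0, 100]
unlabeled_times = [5, 50, 95, 101]
print(pick_chrono_diverse(labeled_times, unlabeled_times))  # 1
```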

Combination Algorithm

Combine the strengths of Standard and Chrono-Diverse.

Take a weighted combination of their results.

Adjust the weighting with the parameter lambda.
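The exact form of the weighted combination is not given in the slides. A minimal sketch, assuming both criteria are first rescaled to [0, 1] and that lambda weights the standard (closest-to-hyperplane) criterion, might look like this; `normalize`, `pick_combined`, and the sample values are hypothetical.

```python
def normalize(values):
    """Rescale values linearly to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def pick_combined(margins, time_gaps, lam):
    """Pick the example with the best weighted score.

    margins[i]:   distance of example i to the hyperplane (small = informative)
    time_gaps[i]: distance in time from the labeled data (large = diverse)
    lam:          assumed weighting; 1.0 uses only the margin criterion,
                  0.0 uses only the time criterion
    """
    closeness = [1.0 - m for m in normalize(margins)]
    diversity = normalize(time_gaps)
    scores = [lam * c + (1 - lam) * d for c, d in zip(closeness, diversity)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values for three candidate emails
margins = [0.1, 0.5, 0.9]
time_gaps = [1, 10, 30]
print(pick_combined(margins, time_gaps, lam=1.0))  # 0 (closest to the hyperplane)
print(pick_combined(margins, time_gaps, lam=0.0))  # 2 (most distant in time)
```

Intermediate values of `lam` interpolate between the two behaviors, which is what makes the parameter worth tuning.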

Results

Trained on 10%, tested on 90%

Parameter Tuning

Conclusions

The state-of-the-art algorithm for active learning with text classification performs horribly on email data!

Choosing emails for time diversity works very well.

Combining the two works best.

Future Work

Improve the efficiency of SVM or find a better alternative

Determine when using chronological diversity performs best and worst

Adapt the algorithm to online classification

