PPM based Spam Filtering in SEWM2008


Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong

llx_2008@yahoo.com.cn, xucongfu@zju.edu.cn, billpengpeng@sohu.com, oillgz@gmail.com

College of Computer Science, Zhejiang University

April 10, 2008

Outline

PPM (Prediction by Partial Matching)
Email Pre-processing
Train PPM Model
Model Classification

PPM

Data Compression

PPM Framework

Email Pre-processing

Source alphabet
Merge consecutive spaces
Truncate long messages

Email Pre-processing

Raw data: Abcd_= - Af?/[]=+ safj =ab fe addfe

Sample settings
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20

After replace: abcd_= ? Af????=? ?af? =ab fe addfe

After merging blanks: abcd_= ? Af????=? ?af? =ab fe addfe

After truncate: abcd_= ? Af????=? ?a
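The three pre-processing steps can be sketched as a single function. This is a minimal illustration, not the authors' code: the function name `preprocess` and its parameters are hypothetical, and the slides do not say how letter case is handled, so this sketch does no case folding and treats any character outside the alphabet (including uppercase letters) as replaceable.

```python
import re


def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Hypothetical helper: replace, merge blanks, then truncate."""
    # 1. Replace every character that is not in the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # 2. Merge runs of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # 3. Truncate long messages to at most max_len characters.
    return merged[:max_len]


# Alphabet from the sample slide; the space character is part of it.
alphabet = set("abcdef_= ")
print(preprocess("ab  cd!!ef", alphabet, "?", 7))  # prints: ab cd??
```

Because nothing is decoded or parsed, the same function can be applied to the raw message text regardless of its encoding or markup.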

Train PPM Model

Use an order-6 PPM* model
Use Method D escape estimation
Train two PPM models: a HAM model and a SPAM model
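To make the training step concrete, here is a simplified sketch of a character-level PPM model with Method D escape estimation. It rests on several assumptions that are not from the slides: the class name `PPMModel` is hypothetical, the context order is bounded (it does not implement the unbounded contexts of PPM*), symbol exclusion is omitted, and a uniform fallback over 256 symbols is assumed. Under Method D, a symbol seen c times in a context with n total occurrences and d distinct symbols gets probability (2c - 1) / (2n), and the escape gets d / (2n).

```python
import math
from collections import defaultdict


class PPMModel:
    """Simplified bounded-order PPM with Method D escapes (illustrative only)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size of the fallback alphabet
        # counts[context][symbol] = times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, history, symbol):
        # Record the symbol under every context length from 0 up to the order.
        for k in range(min(self.order, len(history)) + 1):
            self.counts[history[len(history) - k:]][symbol] += 1

    def train(self, text):
        for i, ch in enumerate(text):
            self.update(text[max(0, i - self.order):i], ch)

    def prob(self, history, symbol):
        p = 1.0
        # Try the longest context first and escape to shorter ones.
        for k in range(min(self.order, len(history)), -1, -1):
            seen = self.counts.get(history[len(history) - k:])
            if not seen:
                continue  # context never observed: back off at no cost
            n = sum(seen.values())
            c = seen.get(symbol, 0)
            if c > 0:
                return p * (2 * c - 1) / (2 * n)  # Method D symbol estimate
            p *= len(seen) / (2 * n)  # Method D escape estimate
        return p / self.alphabet_size  # uniform order -1 fallback

    def bits(self, text):
        """Code length of text under the (static) model, in bits."""
        return -sum(
            math.log2(self.prob(text[max(0, i - self.order):i], ch))
            for i, ch in enumerate(text)
        )


# Train one model per class, as on the slide (toy data for illustration).
ham_model = PPMModel(order=3)
ham_model.train("meeting at noon tomorrow " * 5)
spam_model = PPMModel(order=3)
spam_model.train("buy cheap pills now " * 5)
```

A message that resembles the training text of one class needs fewer bits under that class's model, which is what the classification step relies on.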

Model Classification

MCE (Minimum Cross-entropy)
MDL (Minimum Description Length)
Spam Score
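The difference between the two criteria can be illustrated with a small sketch. As they are usually described for compression-based filters, MCE scores a message with the trained models held fixed, while MDL lets each model keep adapting while it codes the message. Everything named here is an assumption: `BigramModel` is a deliberately simple stand-in (an order-1 character model with add-one smoothing), not PPM, and the slides do not give a formula for the spam score, so the ratio used in `spam_score` is only one plausible choice.

```python
import copy
import math
from collections import Counter

ALPHABET_SIZE = 256  # assumed symbol space for smoothing


class BigramModel:
    """Stand-in for a PPM model: order-1 characters, add-one smoothing."""

    def __init__(self):
        self.pair = Counter()  # (previous char, char) -> count
        self.ctx = Counter()   # previous char -> count

    def update(self, prev, ch):
        self.pair[prev, ch] += 1
        self.ctx[prev] += 1

    def train(self, text):
        prev = ""
        for ch in text:
            self.update(prev, ch)
            prev = ch

    def bits(self, text, adaptive=False):
        """Code length in bits. adaptive=False is MCE-style (static model);
        adaptive=True is MDL-style (a copy of the model learns while coding)."""
        model = copy.deepcopy(self) if adaptive else self
        total, prev = 0.0, ""
        for ch in text:
            p = (model.pair[prev, ch] + 1) / (model.ctx[prev] + ALPHABET_SIZE)
            total -= math.log2(p)
            if adaptive:
                model.update(prev, ch)
            prev = ch
        return total


def spam_score(message, ham, spam, adaptive=False):
    """Assumed score in [0, 1]: above 0.5 means the spam model codes it shorter."""
    h = ham.bits(message, adaptive)
    s = spam.bits(message, adaptive)
    return 0.5 if h + s == 0 else h / (h + s)


ham, spam = BigramModel(), BigramModel()
ham.train("meeting at noon tomorrow " * 20)
spam.train("buy cheap pills now " * 20)

for msg in ("cheap pills", "meeting at noon"):
    label = "spam" if spam_score(msg, ham, spam) > 0.5 else "ham"
    print(msg, "->", label)
```

The adaptive variant works on a copy, so scoring a message never changes the trained models; whether and when to update them with the true label is a separate training decision.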

Advantages

Simple pre-processing
No decoding needed (avoids obfuscation)
Highly self-adaptive
Low false-positive rate

References

"Spam Filtering Using Statistical Data Compression Models"

"Unbounded Length Contexts for PPM"

Questions

Delay Index
ham, Ham and HAM
Active learning 10000
Deliver the filter

Thanks for your attention!

Q&A