meghneel

Date post: 29-May-2018
Upload: khang-nguyen

  • 8/9/2019 meghneel

    1/24

    Text Categorization With Support Vector Machines: Learning With Many Relevant Features

    By Thorsten Joachims

    Presented by Meghneel Gore

    Goal of Text Categorization

    Classify documents into a number of predefined categories.

    Documents can be in multiple categories

    Documents can be in none of the categories

    Applications of Text Categorization

    Categorization of news stories for online retrieval

    Finding interesting information from the WWW

    Guiding a user's search through hypertext

    Representation of Text

    Removal of stop words

    Reduction of each word to its stem

    Preparation of feature vector

    Representation of Text

    [Figure: a document mapped to its vector of stemmed-term counts: 2 comput, 1 process, 2 buy, 3 memory, ...]

    This is a Document Vector
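
    The preprocessing steps listed earlier (stop-word removal, stemming, counting) can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in the paper: the stop-word list, the suffix-stripping `crude_stem` function, and the sample sentence are hypothetical stand-ins. A real system would use a full stop list and a proper stemmer such as Porter's.

```python
import re
from collections import Counter

# Toy stop-word list (assumption); real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for", "more"}

# Suffixes removed by the crude stemmer below, longest first.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix; a stand-in for a real stemmer."""
    if word.endswith("ss"):  # leave words like "process" untouched
        return word
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def document_vector(text: str) -> Counter:
    """Map raw text to a sparse {stem: count} document vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


doc = "Computers process memory. Buying memory for the computer is buying more memory."
vector = document_vector(doc)
print(sorted(vector.items()))
```

    Only the stems that actually occur are stored, which reflects the sparsity of document vectors: most entries of the full vocabulary-sized vector are zero.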

    What's Next...

    Appropriateness of support vector machines for this application

    Support vector machine theory

    Conventional learning methods

    Experiments

    Results

    Conclusions

    Why SVMs?

    High dimensional input space

    Few irrelevant features

    Sparse document vectors

    Most text categorization problems are linearly separable

    Support Vector Machines

    Visualization of a Support Vector Machine

    Support Vector Machines

    We define a structure of hypothesis spaces $H_i$ such that their respective VC dimensions $d_i$ increase.

    Support Vector Machines

    Lemma [Vapnik, 1982]

    Consider hyperplanes $h(\vec{d}) = \mathrm{sign}\{\vec{w} \cdot \vec{d} + b\}$ as hypotheses.

    Support Vector Machines

    If all example vectors $\vec{d}$ are contained in a hypersphere of radius $R$ and it is required that

    $|\vec{w} \cdot \vec{d} + b| \ge 1$, with $\|\vec{w}\| = A$

    Support Vector Machines

    Then this set of hyperplanes has a VC dimension $d$ bounded by

    $d \le \min([R^2 A^2], n) + 1$
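
    To see what the bound implies, the short Python sketch below evaluates it for two hypothetical settings. It assumes that the bracket $[\cdot]$ denotes the integer part and that $n$ is the number of features; the function name and the numbers are illustrative only and do not come from the slides.

```python
import math


def vc_dimension_bound(radius: float, weight_norm: float, n_features: int) -> int:
    """Evaluate min(floor(R^2 * A^2), n) + 1, reading [.] as the integer part (assumption)."""
    return min(math.floor(radius**2 * weight_norm**2), n_features) + 1


# Hypothetical case 1: unit-length document vectors (R = 1), small weight norm (A = 5),
# and 10,000 features. The bound depends on R and A, not on the feature count.
small_norm_bound = vc_dimension_bound(1.0, 5.0, 10_000)

# Hypothetical case 2: a much larger weight norm (A = 200). The feature count now caps the bound.
large_norm_bound = vc_dimension_bound(1.0, 200.0, 10_000)

print(small_norm_bound, large_norm_bound)
```

    With a small weight norm the bound stays far below the number of features, which is why a high dimensional input space need not lead to overfitting.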

    Conventional Learning Methods

    Naive Bayes classifier

    Rocchio algorithm

    k-nearest neighbors

    Decision tree classifier

    Naive Bayes Classifier

    Consider a document vector with attributes $a_1, a_2, \ldots, a_n$ and target values $v_j \in V$.

    Bayesian approach:

    $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$

    Naive Bayes Classifier

    We can rewrite that using Bayes' theorem as

    $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}$

    $= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$

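    Naive Bayes Classifier

    The Naive Bayes method assumes that the attributes are conditionally independent given the target value:

    $v_{NB} = \arg\max_{v_j \in \{\text{like}, \text{dislike}\}} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$

    $= \arg\max_{v_j \in \{\text{like}, \text{dislike}\}} P(v_j)\, P(a_1 = \text{"Mary"} \mid v_j)\, P(a_2 = \text{"had"} \mid v_j) \cdots P(a_{11} = \text{"snow"} \mid v_j)$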
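
    The Naive Bayes decision rule can be sketched in Python as follows. This is an illustrative sketch rather than the classifier evaluated in the paper: the tiny training set is made up, the function names are hypothetical, and two details are assumptions not stated on the slides, namely add-one smoothing and summing log probabilities instead of multiplying raw probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count class frequencies and per-class word frequencies.

    `examples` is a list of (list_of_words, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(words, class_counts, word_counts, vocabulary):
    """Return the label maximizing log P(v) + sum_i log P(a_i | v)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior P(v)
        for word in words:
            # Add-one smoothing (assumption) keeps unseen words from zeroing the product.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data with the two target values used on the slide.
training = [
    ("mary had a little lamb".split(), "like"),
    ("its fleece was white as snow".split(), "like"),
    ("the lamb was lost in the storm".split(), "dislike"),
    ("cold rain and a dark storm".split(), "dislike"),
]
model = train_naive_bayes(training)
print(classify("mary had white snow".split(), *model))
print(classify("dark storm".split(), *model))
```

    Working in log space avoids numerical underflow when a document contains many words, since the product of many small probabilities quickly becomes too small to represent.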

    Experiments

    Datasets

    Performance measures

    Results

    Datasets

    Reuters-21578 dataset: 9603 training documents, 3299 testing documents

    Ohsumed corpus: 10000 training documents, 10000 testing documents

    Performance Measures

    Precision: the probability that a document predicted to be in class x truly belongs to that class

    Recall: the probability that a document belonging to class x is classified into that class

    Precision/recall breakeven point: the value at which precision and recall are equal
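
    These measures can be sketched in Python as follows. The breakeven computation is one common shortcut, not necessarily the procedure used in the paper: if exactly as many documents are predicted positive as there are true positives in the data, precision and recall share the same numerator and denominator and are therefore equal. The document ids and scores are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one category, given two sets of document ids."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def breakeven_point(scores, actual):
    """Precision (equal to recall) when the top len(actual) documents are predicted positive."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[: len(actual)])
    precision, recall = precision_recall(predicted, actual)
    return precision


# Hypothetical classifier scores for five documents and the true members of class x.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.4, "d4": 0.3, "d5": 0.1}
actual = {"d1", "d3", "d4"}

strict = precision_recall({"d1", "d2"}, actual)  # only two documents predicted positive
breakeven = breakeven_point(scores, actual)
print(strict, breakeven)
```

    Reporting the breakeven point gives a single number per category, so classifiers that trade precision against recall differently can be compared directly.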

    Results

    Precision/recall break-even point on Ohsumed dataset

    Results

    Precision/recall break-even point on Reuters dataset

    Conclusions

    Introduces SVMs for text categorization

    Theoretical and empirical evidence that SVMs are well suited for text categorization

    Consistent improvement in accuracy over other methods