UNIVERSITA’ DELLA CALABRIA

Dipartimento di MATEMATICA

A Genetic Algorithm for Text Classification

Rule Induction

A.Pietramala1, V.Policicchio1, P.Rullo1,2, I.Sidhu3

1. Università della Calabria (Rende, Italy) {a.pietramala,policicchio,rullo}@mat.unical.it

2. Exeura Srl (Rende, Italy)

3. Kenetica Ltd (Chicago, IL-USA) {isidhu}@computer.org

ECML PKDD 2008, 15-19 September 2008, Antwerp, Belgium

Outline

– Motivations

– The Olex Hypothesis Language

– The Genetic Algorithm Approach (Olex-GA)

– Experimental Results and Comparative Evaluation

– Discussions

– Conclusions and Future Work

Motivations

• Rule learning algorithms have become a successful strategy for classifier induction.

• Rule-based classifiers provide the desirable property of being readable and, thus, easy to understand (and, possibly, modify).

• Genetic Algorithms (GAs) are stochastic search methods inspired by biological evolution.

• GAs have proven capable of providing good solutions to classical optimization tasks (e.g., TSP and Knapsack).

Rule Induction and GAs

• Rule induction is one of the application fields of GAs.

The basic idea is that:

– Each individual in the population represents a candidate solution (a classification rule or a classifier).

– The fitness of an individual is evaluated in terms of its predictive accuracy.

• We present a GA approach, called Olex-GA, for the induction of rule-based text classifiers.

Olex-GA - The hypothesis language

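• A classifier $H_c(Pos, Neg)$ is of the form:

$$c \leftarrow (t_1 \in d \lor \dots \lor t_n \in d) \land \lnot (t_{n+1} \in d \lor \dots \lor t_{n+m} \in d)$$

where c is a category, d is a document, each $t_i$ is a term (n-gram), $Pos = \{t_1, \dots, t_n\}$ and $Neg = \{t_{n+1}, \dots, t_{n+m}\}$.

“if any of the terms t1,…,tn occurs in d and none of the terms tn+1,…,tn+m occurs in d, then classify d under category c”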
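
As a minimal illustration of the rule quoted above, the following Python sketch applies such a classifier to a document represented as a set of terms. The function name `classify` and the toy terms are assumptions made for the example; they are not part of Olex-GA.

```python
def classify(doc_terms, pos, neg):
    """Return True if the document is assigned to category c.

    The rule fires when at least one positive term occurs in the
    document and no negative term does.
    """
    doc_terms = set(doc_terms)
    has_positive = any(t in doc_terms for t in pos)
    has_negative = any(t in doc_terms for t in neg)
    return has_positive and not has_negative


# Hypothetical toy data, for illustration only.
pos = {"wheat", "grain"}
neg = {"stock"}
print(classify({"wheat", "export"}, pos, neg))  # True
print(classify({"wheat", "stock"}, pos, neg))   # False
print(classify({"oil", "price"}, pos, neg))     # False
```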

Olex-GA - The hypothesis language

• The terms in Pos and Neg are chosen from the vocabulary obtained as the union of the local vocabularies:

$$V(k, f) = \bigcup_{c \in C} V_c(k, f)$$

• Intuitively, $V_c(k, f)$ is the set of the best k terms for category c according to a given scoring function f.
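
The construction of the vocabulary can be sketched in Python as follows. This is only an illustration: the slides do not fix the scoring function f, so the scores below are made-up numbers, and the helper names `local_vocabulary` and `vocabulary` are hypothetical.

```python
def local_vocabulary(terms, score, k):
    """Return V_c(k, f): the k terms with the highest score f for one category."""
    ranked = sorted(terms, key=lambda t: (-score(t), t))  # ties broken alphabetically
    return set(ranked[:k])


def vocabulary(terms, scores_by_category, k):
    """Return V(k, f): the union of V_c(k, f) over all categories c."""
    vocab = set()
    for score in scores_by_category.values():
        vocab |= local_vocabulary(terms, score, k)
    return vocab


# Hypothetical scores, for illustration only.
terms = ["wheat", "grain", "oil", "barrel", "the"]
scores = {
    "grain": {"wheat": 0.9, "grain": 0.8, "oil": 0.1, "barrel": 0.0, "the": 0.0},
    "crude": {"wheat": 0.0, "grain": 0.1, "oil": 0.9, "barrel": 0.7, "the": 0.0},
}
scores_by_category = {c: s.get for c, s in scores.items()}
print(sorted(vocabulary(terms, scores_by_category, k=2)))
# ['barrel', 'grain', 'oil', 'wheat']
```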

Olex-GA - Problem statement

• The Olex-GA learning problem is stated as an optimization problem:

PROBLEM MAX-F

Let a category $c \in C$ and a vocabulary $V(k, f)$ over the training set TS be given.

Then, find two subsets of $V(k, f)$, $Pos = \{t_1, \dots, t_n\}$ and $Neg = \{t_{n+1}, \dots, t_{n+m}\}$, with $Pos \neq \emptyset$, such that $H_c(Pos, Neg)$ applied to TS yields a maximum value of $F_{c,\alpha}$ (over TS), for a given $\alpha \in [0, 1]$.

• Problem MAX-F is NP-Hard.

Olex-GA - A Genetic Algorithm to Solve MAX-F

• Problem MAX-F is a combinatorial optimization problem aimed at finding the best combination of terms taken from a given vocabulary.

• MAX-F is a typical problem for which GAs are known to be good candidate solution methods.

Olex-GA - Our implementation of the GA

• In the following, we describe our choices concerning:

– Population Encoding

– Fitness Function

– Evolutionary Operators

Olex-GA - Population Encoding

• Each individual represents an entire classifier.

• An individual is simply a binary representation of the sets Pos and Neg of a classifier $H_c(Pos, Neg)$.

EXAMPLE

Given a vocabulary $V(k, f) = \{t_1, t_2, t_3, t_4, t_5\}$, the classifier $H_c(Pos, Neg)$ with $Pos = \{t_1, t_3\}$ and $Neg = \{t_2, t_4\}$ is encoded as:

       Pos              Neg
t1 t2 t3 t4 t5   t1 t2 t3 t4 t5
 1  0  1  0  0    0  1  0  1  0
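
A small Python sketch of this binary representation is given below. It assumes a layout with one bit per vocabulary term for Pos, followed by one bit per vocabulary term for Neg; the function names `encode` and `decode` and the five-term vocabulary are illustrative only.

```python
def encode(vocab, pos, neg):
    """Encode (Pos, Neg) as a bit list: one bit per term for Pos, then one per term for Neg."""
    return [int(t in pos) for t in vocab] + [int(t in neg) for t in vocab]


def decode(vocab, chromosome):
    """Recover (Pos, Neg) from a chromosome built by encode()."""
    n = len(vocab)
    pos = {t for t, bit in zip(vocab, chromosome[:n]) if bit}
    neg = {t for t, bit in zip(vocab, chromosome[n:]) if bit}
    return pos, neg


vocab = ["t1", "t2", "t3", "t4", "t5"]
chromosome = encode(vocab, pos={"t1", "t3"}, neg={"t2", "t4"})
print(chromosome)  # [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(decode(vocab, chromosome) == ({"t1", "t3"}, {"t2", "t4"}))  # True
```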

Olex-GA Population Encoding

• We restrict the search for positive and negative terms, respectively, to:

– Pos*, the set of terms belonging to $V_c(k, f)$ (candidate positive terms);

– Neg*, the set of terms occurring in at least one document that contains some candidate positive term and does not belong to the training set $TS_c$ of c (candidate negative terms).

• The reduction of the search space allows:

– an improvement in the efficiency of the algorithm;

– quick convergence toward good solutions.
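
The two candidate sets can be computed along the following lines. This is a hedged sketch: documents are modeled as sets of terms, candidate negative terms are assumed to be restricted to the vocabulary V(k, f) (since Pos and Neg are subsets of it), and the name `candidate_terms` is hypothetical.

```python
def candidate_terms(docs, in_category, local_vocab, vocab):
    """Compute Pos* and Neg* for one category.

    docs        : list of documents, each a set of terms
    in_category : list of booleans, True if the document belongs to TS_c
    local_vocab : V_c(k, f), the candidate positive terms
    vocab       : V(k, f), the full vocabulary (assumed to bound Neg*)
    """
    pos_star = set(local_vocab)
    neg_star = set()
    for terms, is_member in zip(docs, in_category):
        # Only documents outside TS_c that contain a candidate positive
        # term can contribute candidate negative terms.
        if not is_member and terms & pos_star:
            neg_star |= terms & vocab
    return pos_star, neg_star


# Hypothetical toy corpus, for illustration only.
docs = [{"wheat", "farm"}, {"wheat", "stock"}, {"oil", "stock"}]
in_category = [True, False, False]
pos_star, neg_star = candidate_terms(
    docs, in_category, local_vocab={"wheat"}, vocab={"wheat", "stock", "oil", "farm"}
)
print(sorted(pos_star))  # ['wheat']
print(sorted(neg_star))  # ['stock', 'wheat']
```

Read literally, the definition does not exclude candidate positive terms from Neg*, and neither does this sketch; only the second document contributes, because the third contains no candidate positive term.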

Olex-GA - Fitness Function

• The fitness of a chromosome K, representing $H_c(Pos, Neg)$, is the value of the F-measure resulting from applying $H_c(Pos, Neg)$ to the training set TS.

• This choice naturally follows from the formulation of problem MAX-F.
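
A possible fitness computation is sketched below in Python. The slides do not spell out the F-measure formula, so the weighted form Pr·Re / ((1 − α)·Pr + α·Re) used here is an assumption (with α = 0.5 it reduces to the usual F1); the name `fitness` and the toy data are likewise illustrative.

```python
def fitness(pos, neg, docs, labels, alpha=0.5):
    """F-measure of the rule (pos, neg) over a labeled training set.

    docs   : list of documents, each a set of terms
    labels : list of booleans, True if the document belongs to the category
    alpha  : assumed weighting parameter in [0, 1]
    """
    tp = fp = fn = 0
    for terms, is_member in zip(docs, labels):
        predicted = bool(terms & pos) and not (terms & neg)
        if predicted and is_member:
            tp += 1
        elif predicted:
            fp += 1
        elif is_member:
            fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall / ((1 - alpha) * precision + alpha * recall)


# Hypothetical toy training set, for illustration only.
docs = [{"wheat"}, {"wheat", "stock"}, {"grain"}, {"oil"}]
labels = [True, False, True, False]
print(fitness({"wheat"}, set(), docs, labels))               # 0.5
print(fitness({"wheat", "grain"}, {"stock"}, docs, labels))  # 1.0
```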

Olex-GA - Evolutionary Operators

• We perform:

– selection, via the roulette-wheel method;

– crossover, by the uniform crossover scheme;

– mutation, which consists of flipping each single bit with a given (low) probability;

– elitism, in order to ensure that the best individuals of the current generation are passed to the next one without being altered.
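
The four operators listed above can be sketched generically in Python as follows. This is not the authors' implementation: the function names are hypothetical, crossover is always applied, and elitism is modeled as copying a fixed fraction of the best individuals, which is only one possible reading of an elitism setting.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    if sum(fitnesses) == 0:
        return rng.choice(population)
    return rng.choices(population, weights=fitnesses, k=1)[0]


def uniform_crossover(a, b, rng):
    """Build two children by choosing, bit by bit, which parent each inherits from."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return child1, child2


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def next_generation(population, fitness_fn, rng, mutation_rate=0.001, elite_fraction=0.2):
    """Produce a new population of the same size as the current one."""
    fitnesses = [fitness_fn(ind) for ind in population]
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    n_elite = max(1, int(elite_fraction * len(population)))
    new_pop = [list(ind) for _, ind in ranked[:n_elite]]  # elites pass unaltered
    while len(new_pop) < len(population):
        p1 = roulette_select(population, fitnesses, rng)
        p2 = roulette_select(population, fitnesses, rng)
        for child in uniform_crossover(p1, p2, rng):
            if len(new_pop) < len(population):
                new_pop.append(mutate(child, mutation_rate, rng))
    return new_pop


# Toy run: maximize the number of 1-bits (a stand-in for the real fitness).
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
for _ in range(30):
    population = next_generation(population, sum, rng)
print(max(sum(ind) for ind in population))
```

Because the elite individuals are copied without modification, the best fitness in the population can never decrease from one generation to the next.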

Olex-GA - Experimentation

We have experimentally evaluated our algorithm on two standard benchmark corpora:

• REUTERS-21578 (R10)

– It consists of 12,902 documents.

– They are manually classified with respect to 135 categories. We have considered the subset of the 10 most populated categories.

• OHSUMED

– We used the collection consisting of the first 20,000 documents from the 50,216 medical abstracts of the year 1991.

– The classification scheme consisted of the 23 MeSH disease categories.

Experimental settings

• We applied the stratified holdout method:

REUTERS:

– ModApté split: 9,603 documents are used to form the training corpus (seen data) and 3,299 to form the test set (unseen data).

OHSUMED:

– The first 10,000 documents were used as seen data and the second 10,000 as unseen data.

In both cases, we randomly split the set of seen data into

– a training set (70%), on which to run the GA,

– and a validation set (30%), on which to tune the model parameters.
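
The 70/30 split of the seen data can be illustrated with a few lines of Python. This is a plain random split under an assumed fixed seed; it does not show stratification by category, and the name `split_seen_data` is hypothetical.

```python
import random


def split_seen_data(docs, train_fraction=0.7, seed=0):
    """Randomly split the seen data into a training part and a validation part."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Ten document ids stand in for the seen data.
train, validation = split_seen_data(range(10))
print(len(train), len(validation))  # 7 3
```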

Experimental settings

• GA Parameters:

Parameter            Value
Iterations           3
Population Size      500
Num of Generations   200
Cross-over Rate      1.0
Mutation Rate        0.001
Elitism Probability  0.2

• For each chromosome K in the population, we initialized $K^+$ at random, while we set $K^-[t] = 0$ for each $t \in Neg^*$ (thus, K initially encodes a classifier $H_c(Pos, Neg)$ with no negative terms).
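
The initialization described on this slide can be sketched in Python as follows. The parameter values are copied from the table; the dictionary keys, the function name `init_population` and the sizes of Pos* and Neg* used in the example are assumptions.

```python
import random

# Values taken from the parameter table; the key names are illustrative.
GA_PARAMS = {
    "iterations": 3,
    "population_size": 500,
    "generations": 200,
    "crossover_rate": 1.0,
    "mutation_rate": 0.001,
    "elitism_probability": 0.2,
}


def init_population(n_pos, n_neg, size, rng):
    """Create `size` chromosomes: random K+ bits followed by an all-zero K- part."""
    return [
        [rng.randint(0, 1) for _ in range(n_pos)] + [0] * n_neg
        for _ in range(size)
    ]


rng = random.Random(42)
population = init_population(n_pos=5, n_neg=8, size=GA_PARAMS["population_size"], rng=rng)
print(len(population), len(population[0]))       # 500 13
print(all(sum(k[5:]) == 0 for k in population))  # True
```

A random K+ part may happen to be all zeros, which would encode an empty Pos; this sketch does not handle that case.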

Comparative Evaluation

• On both corpora, we carried out a direct comparison with the following systems:

– SVM (both polynomial and radial basis function)

– Ripper (with two optimization steps)

– C4.5

– Naive Bayes

– Olex-Greedy

• Performance was evaluated using the Weka library of ML algorithms (apart from Olex-Greedy).

Performance Comparison on Reuters

• Efficacy

– SVMpoli > SVMrbf > Ripper ≈ Olex-GA > C4.5 > Olex-Greedy > NB

• Efficiency

– NB > Olex-Greedy > SVMpoli > Olex-GA > C4.5 > SVMrbf > Ripper

Performance Comparison on OHSUMED

• Efficacy

– Olex-GA > Ripper > SVMpoli > Olex-Greedy > SVMrbf ≈ NB > C4.5

• Efficiency

– NB > Olex-Greedy > SVMpoli > Olex-GA > C4.5 > SVMrbf > Ripper

Discussions – Relation to other inductive rule learners

• Conventional Rule Learners (Ripper, C4.5):

– Usually rely on a two-stage process: rule induction and rule pruning.

– Each of these stages in turn consists of several steps.

• Olex-GA relies on a single-step process which does not need any post-induction optimization.

• With respect to Olex-Greedy, Olex-GA provides better predictive accuracy, but is less efficient.

Conclusions

• Olex-GA encodes a classifier, in a very natural and compact way, as an individual.

• The fitness of an individual is evaluated as the F-measure of the encoded classifier.

• Experimental results show that:

– Olex-GA quickly converges to very accurate classifiers;

– Olex-GA performs at a competitive level with standard algorithms;

– its time efficiency is lower than that of Olex-Greedy but higher than that of the other rule learning methods, such as Ripper and C4.5.

Future work

• Extension of the proposed technique to deal with classifiers of the form

$$c \leftarrow (T_1 \in d \lor \dots \lor T_n \in d) \land \lnot (T_{n+1} \in d \lor \dots \lor T_{n+m} \in d)$$

where each $T_i$ is a conjunction of “simple” terms:

$$T_i = t_{i1} \land \dots \land t_{ik}$$

