Page 1: Revisiting evolutionary information filtering

Revisiting Evolutionary Information Filtering

Nikolaos Nanas, Centre for Research and Technology Thessaly, GREECE

Stefanos Kodovas, University of Thessaly, GREECE

Manolis Vavalis, University of Thessaly, GREECE

Page 2: Revisiting evolutionary information filtering

Outline

Adaptive Information Filtering – brief introduction

Evolutionary Information Filtering – review

Diversity & dimensionality – theoretical issues

Experimental evaluation

• Methodology – a test-bed

• Results – not a success

• Discussion – interesting observations

Conclusions and future work

Page 3: Revisiting evolutionary information filtering

Information Overload is still around

Page 4: Revisiting evolutionary information filtering

Adaptive Information Filtering in the case of textual information

Page 5: Revisiting evolutionary information filtering

Adaptive Information Filtering (AIF)

challenging problem with no established solution

complex and dynamic

• multiple and changing user interests

• changing information environment

crucial issues for successful AIF

• profile representation

• profile adaptation

Page 6: Revisiting evolutionary information filtering

Evolutionary Information Filtering with the Vector Space Model

Profile adaptation through evolution of user’s profiles.

Page 7: Revisiting evolutionary information filtering

Evolutionary Information Filtering

• “A Review of Evolutionary and Immune-Inspired Information Filtering”, Natural Computing, 2009

• A common vector space with as many dimensions as the number of unique keywords

• A population of profiles that collectively represent the user’s interests

• Both profiles and documents are represented as (weighted) vectors in this space

• Trigonometric measures of similarity for comparing profile vectors to document vectors

• Fitness function based on (explicit or implicit) user feedback

• reward profiles that assigned a high relevance score to relevant documents, and vice versa

• fitness is updated proportionally to user feedback

• average score of relevant documents

• ratio of successful evaluations
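
The scoring and fitness machinery above is only described in words; the following is a minimal Python sketch of the two ideas (a trigonometric similarity measure and feedback-proportional fitness updates), assuming sparse vectors stored as dicts. The function names are illustrative, not from the reviewed systems.

```python
import math

def cosine_similarity(profile, document):
    """Trigonometric similarity between two sparse keyword-weight
    vectors, represented as dicts mapping keyword -> weight."""
    dot = sum(w * document.get(k, 0.0) for k, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_p * norm_d)

def update_fitness(fitness, relevance_score, feedback):
    """Update a profile's fitness proportionally to user feedback:
    feedback is +1 for a relevant document and -1 for a non-relevant
    one, so profiles that score relevant documents highly are rewarded
    and profiles that score non-relevant documents highly are penalised."""
    return fitness + feedback * relevance_score
```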

Page 8: Revisiting evolutionary information filtering

Evolutionary Information Filtering

profile initialisation is not random

selection

• fixed percentage of best individuals

• variable percentage

• roulette wheel

crossover

• single-point, two-point, three-point

• variable percentage

• roulette wheel

mutation

• keyword replacement

• random weight modification

steady-state replacement

• offspring typically replace less fit individuals
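
A sketch of the genetic operators listed above, assuming profiles are fixed-length lists of (keyword, weight) genes; the function names and gene encoding are our own illustration, not a specific system from the review.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Single-point crossover: the two children exchange the parents'
    gene tails at a random cut point."""
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def keyword_replacement_mutation(profile, vocabulary, rate, rng=random):
    """Mutation by keyword replacement: each gene's keyword may be
    swapped for a random word from the vocabulary, keeping its weight."""
    return [(rng.choice(vocabulary), w) if rng.random() < rate else (k, w)
            for k, w in profile]

def roulette_wheel_select(population, fitnesses, rng=random):
    """Roulette-wheel (fitness-proportionate) selection of one profile."""
    pick = rng.uniform(0.0, sum(fitnesses))
    acc = 0.0
    for individual, fitness in zip(population, fitnesses):
        acc += fitness
        if acc >= pick:
            return individual
    return population[-1]
```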

Page 9: Revisiting evolutionary information filtering

Diversity Issues

AIF is not a classic optimisation problem

• online learning problem

• reminiscent of Multimodal Dynamic Optimisation (MDO)

Traditional GAs suffer in the case of MDO due to diversity loss.

Four types of remedies:

1. adjust mutation rate when changes are observed

2. spread the population

3. memory of previous generations

4. multiple subpopulations

• in “Multimodal Dynamic Optimisation: from Evolutionary Algorithms to Artificial Immune Systems”, 2007

• intrinsic diversity problems due to:

• selection based on relative fitness

• no developmental process

• fixed population size
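
Remedy 1 above (adjusting the mutation rate when changes are observed) can be sketched as triggered hypermutation; the trigger condition and factor below are illustrative assumptions, not taken from a specific MDO algorithm.

```python
def adjusted_mutation_rate(base_rate, best_fitness_history, factor=10.0):
    """Triggered hypermutation: when the best fitness drops between
    generations, a change in the environment is assumed and the
    mutation rate is temporarily raised (capped at 1.0)."""
    if (len(best_fitness_history) >= 2
            and best_fitness_history[-1] < best_fitness_history[-2]):
        return min(1.0, base_rate * factor)
    return base_rate
```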

Page 10: Revisiting evolutionary information filtering

Dimensionality Issues

• A vector space with a large number of dimensions (keywords) is required for successful AIF

• In a multi-dimensional space:

• the volume increases exponentially with the number of dimensions

• distance based measures become meaningless as points become equidistant

• the discriminatory power of pair-wise distances is significantly affected

• scalar metrics cannot differentiate between vectors with distributed and concentrated differences

• in a multi-dimensional keyword space the ability of GAs to achieve profile adaptation is affected because:

• the number of possible weighted keyword combinations increases exponentially with the number of dimensions

• crossover and mutation cannot randomly produce the right combination of weighted keywords
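
The claim that points become equidistant can be demonstrated with a small self-contained experiment: the relative contrast (d_max - d_min) / d_min of distances shrinks as the dimensionality grows. The function name and the choice of uniform points in the unit cube are illustrative assumptions.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """(d_max - d_min) / d_min over Euclidean distances from the origin
    to n_points uniformly random points in [0, 1]^dim.  As dim grows
    the distances concentrate and the contrast shrinks, which is what
    undermines distance-based matching in a large keyword space."""
    rng = random.Random(seed)
    dists = [math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)
```

For example, `relative_contrast(2)` is orders of magnitude larger than `relative_contrast(500)`.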

Page 11: Revisiting evolutionary information filtering

Experimental Evaluation: Dataset

Reuters-21578

• 21578 news stories that appeared on the Reuters newswire in 1987

• documents are ordered according to publication date

• 135 topic categories

• experiments concentrate on the 23 topics with at least 100 relevant documents

document pre-processing

• stop word removal

• stemming with Porter’s algorithm

• weighting with Term Frequency Inverse Document Frequency (TFIDF)

words with large average TFIDF are selected to build the keyword space
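
The pre-processing pipeline above can be sketched as follows, assuming documents are already tokenised, stop-worded and stemmed; the plain tf * log(N / df) formulation is an assumption, since the slides do not specify a TFIDF variant.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """TF-IDF weights for pre-processed documents (lists of stems):
    tf is the term count in the document, idf is log(N / df)."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

def top_keywords(vectors, k):
    """Select the k words with the largest average TFIDF weight,
    as done here to build the keyword space."""
    totals = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    avg = {term: total / len(vectors) for term, total in totals.items()}
    return sorted(avg, key=avg.get, reverse=True)[:k]
```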

topic code      size
earn            3987
acq             2448
money-fx         801
crude            634
grain            628
trade            552
interest         513
wheat            306
ship             305
corn             254
dlr              217
oilseed          192
money-supply     190
sugar            184
gnp              163
coffee           145
veg-oil          137
gold             135
nat-gas          130
soybean          120
bop              116
livestock        114
cpi              112

Page 12: Revisiting evolutionary information filtering

Experimental Evaluation: Baseline Experiment

Page 13: Revisiting evolutionary information filtering

Baseline Results

• as the number of extracted words increases, the Average Uninterpolated Precision (AUP) values increase

• for a small number of extracted keywords the results are biased towards topics with a large number of relevant documents

• the best results are achieved when all extracted keywords are used

• if we wish to represent a range of topics then a multidimensional space is required

Page 14: Revisiting evolutionary information filtering

Experimental Evaluation: Evolutionary Experiments

a vector space comprising 31298 keywords

The basic Genetic Algorithm:

• with a population of 100 profiles

• each profile is a weighted keyword vector (randomly initialised)

• the same random initial population is used in all experiments

• documents are evaluated in order using the inner product

• new fitness = old fitness + relevance score

• the 25% fittest profiles are selected for reproduction

• single-point crossover

• mutation through random weight modification

• the offspring replace the 25% worst profiles

two further variations of the basic GA

• GA_init: initialisation using the first 100 relevant documents per topic.

• GA_init + learning: a memetic algorithm (MA) that uses Rocchio’s learning algorithm
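
One generation of the basic GA described above can be sketched as follows. Dict-based sparse profiles, the crossover over sorted keyword lists, and resetting an offspring's fitness to zero are our own assumptions; the population size, 25% selection/replacement, inner-product scoring and weight mutation follow the setup above.

```python
import random

def inner_product(profile, document):
    """Relevance score: inner product of sparse keyword-weight vectors."""
    return sum(w * document.get(k, 0.0) for k, w in profile.items())

def one_generation(population, fitness, documents, feedback, rng,
                   mutation_range=0.05):
    """One steady-state generation: profiles are dicts mapping keyword
    to weight; feedback holds +1/-1 relevance labels for the documents,
    which are evaluated in order."""
    # evaluate documents in order; fitness grows proportionally to feedback
    for doc, label in zip(documents, feedback):
        for i, profile in enumerate(population):
            fitness[i] += label * inner_product(profile, doc)
    # select the 25% fittest profiles for reproduction
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    n_quarter = len(population) // 4
    parents = [population[i] for i in order[-n_quarter:]]
    offspring = []
    while len(parents) >= 2 and len(offspring) < n_quarter:
        # single-point crossover over the parents' sorted keyword lists
        a, b = rng.sample(parents, 2)
        keys_a, keys_b = sorted(a), sorted(b)
        point = rng.randrange(1, min(len(keys_a), len(keys_b)))
        child = {k: a[k] for k in keys_a[:point]}
        child.update({k: b[k] for k in keys_b[point:]})
        # mutation through random weight modification
        offspring.append({k: w * (1.0 + rng.uniform(-mutation_range,
                                                    mutation_range))
                          for k, w in child.items()})
    # the offspring replace the 25% worst profiles (steady-state)
    for i, child in zip(order, offspring):
        population[i] = child
        fitness[i] = 0.0
    return population, fitness
```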

Page 15: Revisiting evolutionary information filtering

Comparative Results: accuracy

• y-axis: best AUP achieved in 50 generations (a bias in favour of the GA)

• baseline results are included

• additional results for ranking by date

Findings:

• the GA performs worse than the baseline

• marginal improvements for non-random initialisation

• significant improvement when learning is introduced

• the MA is better than the baseline only for some topics with few relevant documents

Page 16: Revisiting evolutionary information filtering

Comparative Results: learning

• y-axis: average AUP over all topics after each generation

• x-axis: number of generations

• embedded figure focuses on GA and GA_init

Findings:

• the GA shows essentially no improvement across generations

• better initial performance and learning rate for non-random initialisation (GA_init)

• much steeper learning curve when learning is introduced (GA_init + learning).

Page 17: Revisiting evolutionary information filtering

Conclusions

The basic GA fails to learn the topic of interest.

• the right combination of keyword weights cannot be randomly produced.

• the GA is lacking a mechanism for appropriately updating keyword weights.

• performance depends on the weighted keywords that initialisation produced.

When the GA is initialised based on relevant documents

• then the initial set of weighted keywords produces better filtering results

The introduction of learning allows for further improvements in the initial keyword weights.

• still worse than the baseline experiment despite the 50 generations

• this is possibly due to the negative effect of the genetic operations

Page 18: Revisiting evolutionary information filtering

Discussion

Our experimental results do not agree with the promising results reported in the literature

• we did not re-implement an existing approach, but adopted existing techniques

• AIF is a complex problem that cannot be easily tackled with weighted keywords in a multi-dimensional space

• comparative experiments between GAs and other machine learning algorithms have been missing from AIF

large differences observed between the GA and the baseline algorithm

• despite the biased comparison in favour of the GA

• more fundamental alternatives which are not based on vector representations

• the choice of representation should facilitate the learning task

• external remedies like those adopted for MDO are not practical

we wish to revive the interest of the research community in AIF

• biologically inspired solutions are well suited to the problem

• appropriate experimental methodologies that reflect the complexity and dynamics of AIF are required

