Dynamic modelling of document streams

Date post: 20-Jan-2015
Upload: juan-julian-merelo-guervos
Description:
Presentation for the GECCO conference
Transcript
Page 1: Dynamic modelling of document streams

A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams

Lourdes Araujo, JJ Merelo

[email protected], [email protected]

Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia

Dpto. Arquitectura y Tecnología de Computadores, Universidad de Granada

Spain

A Genetic Algorithm for Dynamic Modelling and Prediction of Activity in Document Streams – p.1/24

Page 2: Dynamic modelling of document streams

Why

• Document metadata, such as arrival time, helps organize document streams.

• Temporal information helps make sense of document streams such as e-mails and news items.

• Its study combines content analysis and time series modelling.

Page 3: Dynamic modelling of document streams

Showing interest

• Hypothesis: Explosions in interest match points in time where arrival intensity increases sharply.

• In general, arrival time is quite irregular.

[Figure: number of document arrivals vs. time]

Page 4: Dynamic modelling of document streams

Regularizing irregularity

• A cost function is introduced that reflects how difficult it is to hike from one state to another.

• Intervals of similar frequency should be grouped in a single state, so changes of state are penalized. But we shouldn't overdo it.

Page 5: Dynamic modelling of document streams

Kleinberg's model

• The document stream is modeled as an infinite-state automaton, A, which emits messages at different frequencies.

• Each state has a frequency assigned.

• Bursts are indicated by transitions from a lower to a higher state.

• Frequency changes are controlled by assigning costs to state changes, which avoids small explosions and makes real explosions easier to identify.

Page 6: Dynamic modelling of document streams

Infinite state automaton model

• Generation of the time sequence is based on an exponential distribution.

• The time interval $x$ between messages $i$ and $i + 1$ follows the exponential density function $f(x) = \alpha e^{-\alpha x}$, for $\alpha > 0$.

• The expected value of the interval is $\alpha^{-1}$.
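
The exponential gap model can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the rate value and the sample size are illustrative assumptions. It draws gaps from $f(x) = \alpha e^{-\alpha x}$ and compares their mean with $\alpha^{-1}$.

```python
import random

def sample_intervals(alpha, n, seed=0):
    """Draw n inter-arrival gaps from f(x) = alpha * exp(-alpha * x)."""
    rng = random.Random(seed)
    return [rng.expovariate(alpha) for _ in range(n)]

def arrival_times(intervals):
    """Cumulative sums of the gaps give the arrival times."""
    times, t = [], 0.0
    for x in intervals:
        t += x
        times.append(t)
    return times

alpha = 4.0                      # hypothetical emission rate
gaps = sample_intervals(alpha, 20000)
mean_gap = sum(gaps) / len(gaps)  # should be close to 1 / alpha
times = arrival_times(gaps)
```

With a large sample the empirical mean gap should sit near $1/\alpha$, which is what the slide's expected value states.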

Page 7: Dynamic modelling of document streams

First things first: two-state model

• Basic model: a 2-state probabilistic automaton A with states $q_0$ (low emission rate) and $q_1$ (high).

[Figure: two-state automaton with states $q_0$ and $q_1$]

• $n + 1$ messages, $n$ intervals: a Bayes procedure is used to fit a conditional probability of a state sequence $q = (q_{i_1}, \cdots, q_{i_n})$:

$$c(q|x) = b \ln\left(\frac{1 - p}{p}\right) + \left(\sum_{t=1}^{n} -\ln f_{i_t}(x_t)\right)$$

where $b$ is the number of state transitions. The 1st term favours a low number of transitions; the 2nd favours states that fit the sequence.
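
The two-state cost can be written out directly. This is a sketch under stated assumptions: `p` is taken to be the probability of changing state, `b` counts changes between consecutive entries of the state list only, and the rates and intervals are made-up values, none of which come from the slides.

```python
import math

def neg_log_density(alpha, x):
    """-ln f(x) for f(x) = alpha * exp(-alpha * x)."""
    return alpha * x - math.log(alpha)

def two_state_cost(states, intervals, alphas, p):
    """c(q|x) = b * ln((1 - p) / p) + sum_t -ln f_{i_t}(x_t)."""
    # Assumption: b counts changes between consecutive listed states.
    b = sum(1 for a, c in zip(states, states[1:]) if a != c)
    fit = sum(neg_log_density(alphas[s], x) for s, x in zip(states, intervals))
    return b * math.log((1 - p) / p) + fit

# Hypothetical stream: long gaps, then a run of short gaps, then a long gap.
intervals = [2.0, 2.0, 0.1, 0.1, 0.1, 2.0]
alphas = (0.5, 5.0)          # assumed low and high emission rates
flat = two_state_cost([0] * 6, intervals, alphas, p=0.1)
bursty = two_state_cost([0, 0, 1, 1, 1, 0], intervals, alphas, p=0.1)
```

For this toy input the sequence that switches to the high state during the short gaps pays for two transitions but fits the data better, so its total cost is lower than staying in the low state throughout.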

Page 8: Dynamic modelling of document streams

To the infinite and beyond

• Given a sequence of intervals $x = (x_1, x_2, \cdots, x_n)$, a sequence $q = (q_{i_1}, \cdots, q_{i_n})$ that minimizes

$$c(q|x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)$$

must be found.

• $f$ is related to the resolution of discrete rates within continuous emission rates, and $\tau$ to the facility of changing state.
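
A sketch of this cost for a finite number of states follows. The slides do not spell out $\tau$ or the state rates, so the forms used here are assumptions borrowed from the usual statement of Kleinberg's burst model: rates $\alpha_i = (n/T)\,s^i$, an upward move from $i$ to $j$ costing $(j - i)\,\gamma \ln n$, downward moves free, and an initial state $i_0 = 0$.

```python
import math

def make_rates(n, total_time, s, k):
    """Assumed rates: alpha_i = (n / T) * s**i for states 0..k-1."""
    base = n / total_time
    return [base * s ** i for i in range(k)]

def tau(i, j, gamma, n):
    """Assumed transition cost: going up costs (j - i) * gamma * ln(n), going down is free."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0

def sequence_cost(states, intervals, rates, gamma):
    """c(q|x) = sum_{t=0}^{n-1} tau(i_t, i_{t+1}) + sum_{t=1}^{n} -ln f_{i_t}(x_t), with i_0 = 0."""
    n = len(intervals)
    cost, prev = 0.0, 0
    for state, x in zip(states, intervals):
        cost += tau(prev, state, gamma, n)                  # transition term
        cost += rates[state] * x - math.log(rates[state])   # -ln f term
        prev = state
    return cost

intervals = [2.0, 2.0, 0.1, 0.1, 0.1, 2.0]   # hypothetical gaps
rates = make_rates(len(intervals), sum(intervals), s=2.0, k=3)
flat_cost = sequence_cost([0] * 6, intervals, rates, gamma=1.0)
```

Raising `gamma` makes upward transitions more expensive, which is the knob that suppresses small explosions.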

Page 11: Dynamic modelling of document streams

Infinite is a bit too much

• $A^{*}_{s,\gamma}$, which minimizes $c(q|x)$, is restricted to $A^{k}_{s,\gamma}$, with $k$ states.

• We will use an evolutionary algorithm to find $A^{k}_{s,\gamma}$.

• Finally!

Page 12: Dynamic modelling of document streams

Individual representation

• A sequence of $n$ integers, $1 < q_{i_j} < E$, each representing an automaton state together with the id $i$ of the last document in its subsequence.

• Document $i$ arrives at $0 \le t_i \le T$ (intervals $x_i = t_i - t_{i-1}$).

t_1  t_2  ···  t_n

| q_{t_1}, t_{k_1} | q_{t_{k_1}+1}, t_{k_2} | ··· | q_{t_f}, t_n |

• Fitness function = cost function.

• Initial population: documents chosen at random that split the document stream into intervals, with random states.
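
One way to picture this encoding is a list of (state, last document) pairs. The sketch below is illustrative only: the tuple layout, the 0-based state numbering (the slide uses $1 < q < E$) and the function names are assumptions.

```python
import random

# A gene is (state, last_doc): the state applies to every document from the
# end of the previous gene up to and including index last_doc.

def random_individual(n_docs, n_states, n_cuts, rng):
    """Split documents 0..n_docs-1 at random cut points, one random state per piece."""
    cuts = sorted(rng.sample(range(n_docs - 1), n_cuts)) + [n_docs - 1]
    return [(rng.randrange(n_states), last) for last in cuts]

def expand(individual):
    """Turn the gene list into one state per document."""
    states, start = [], 0
    for state, last in individual:
        states.extend([state] * (last - start + 1))
        start = last + 1
    return states

rng = random.Random(1)
individual = random_individual(n_docs=50, n_states=5, n_cuts=4, rng=rng)
per_document_states = expand(individual)
```

The expanded list has one state per document, so it can be scored by whatever cost function serves as fitness.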

Page 13: Dynamic modelling of document streams

Crossover

Parent 1:
g_{11} ··· g_{1i} ··· g_{1f_1}
q_{11}, (t_1, ···) ··· q_{1i}, (t − n_1, ···, t, ··· t + m_1) ··· q_{1f_1}, (···, t_n)

Parent 2:
g_{21} ··· g_{2j} ··· g_{2f_2}
q_{21}, (t_1, ···) ··· q_{2j}, (t − n_2, ···, t, ··· t + m_2) ··· q_{2f_2}, (···, t_n)

Offspring 1 (c.p. = crossover point):
g_{11} ··· g_{1i−1}  c.p.  g_{2j+1} ··· g_{2f_2}
q_{11}  q_{1i−1}  q_{2j+1}  q_{2f_2}
(t_1, ···) ··· (···, t − n_1 − 1)  ?  (t + m_2 + 1, ···) ··· (···, t_n)

Offspring 2:
g_{21} ··· g_{2j−1}  c.p.  g_{1i+1} ··· g_{1f_1}
q_{21}  q_{2j−1}  q_{1i+1}  q_{1f_1}
(t_1, ···) ··· (···, t − n_2 − 1)  ?  (t + m_1 + 1, ···) ··· (···, t_n)
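
The crossover diagram can be read as: cut both parents at the genes containing a document $t$, keep the head of one parent and the tail of the other, and fill the gap with a bridging gene (the "?"). The sketch below follows that reading. How the bridging gene gets its state is not stated on the slide, so picking one parent's state at random is an assumption, as are the (state, last document) gene layout and the function names.

```python
import random

def gene_index(individual, doc):
    """Index of the gene whose span contains document index doc."""
    for idx, (_, last) in enumerate(individual):
        if doc <= last:
            return idx
    raise ValueError("doc lies beyond the last gene")

def crossover(head_parent, tail_parent, doc, rng):
    """One child: head_parent genes before the cut, a bridging gene, tail_parent genes after it."""
    i = gene_index(head_parent, doc)
    j = gene_index(tail_parent, doc)
    # Assumption: the bridging gene inherits one of the two cut genes' states at random.
    bridge_state = rng.choice([head_parent[i][0], tail_parent[j][0]])
    bridge = (bridge_state, tail_parent[j][1])
    return head_parent[:i] + [bridge] + tail_parent[j + 1:]

rng = random.Random(0)
parent1 = [(0, 3), (2, 7), (1, 9)]
parent2 = [(1, 5), (3, 9)]
child1 = crossover(parent1, parent2, doc=4, rng=rng)
child2 = crossover(parent2, parent1, doc=4, rng=rng)
```

Because the bridging gene ends where the tail parent's cut gene ends, both children still cover every document exactly once.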

Page 14: Dynamic modelling of document streams

Mutation

• Several mutation operators:

• Increment state by one

• Merge two genes, state taken randomly

• Split a gene in two: one with the original state, another ±1.

Page 15: Dynamic modelling of document streams

Effect of crossover

[Figure: generation number (Generation N.) vs. crossover rate (%), for streams a, b and c]

Page 16: Dynamic modelling of document streams

Effect of mutation

[Figure: generation number (Generation N.) vs. mutation rate (%), for streams a, b and c]

Page 17: Dynamic modelling of document streams

Effect of population size

[Figure: generation number (Generation N.) vs. population size, for streams a, b and c]

Page 18: Dynamic modelling of document streams

Effect of number of generations

[Figure: cost function vs. generation number (Generation N.), for streams a, b and c]

Page 19: Dynamic modelling of document streams

Time results

State n.   Viterbi                Evo. Alg.
           Ex. time    Cost       Ex. time    Cost (Av. cost, Std. dev.)
15         2319.36     277402     1678.61     277712 (279385.6, 980.11)
20         3117.28     277306     2182.12     277528 (278980.4, 1114.91)
25         3835.37     277260     2033.81     277270 (279472.6, 1116.03)

[Figure: Time comparison, time (s.) vs. number of states, for the evolutionary algorithm and Viterbi]
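
For reference, the Viterbi baseline is a dynamic programming search over states. The sketch below is a generic version, not the implementation that was timed: it assumes exponential densities per state, an upward transition cost of $(j - i)\,\gamma \ln n$ with free downward moves, and a start in state 0. All names and parameters are illustrative.

```python
import math

def viterbi_states(intervals, rates, gamma):
    """Dynamic programming search for the state sequence with minimum c(q|x)."""
    n, k = len(intervals), len(rates)
    ln_n = math.log(n)

    def tau(i, j):
        # Assumed transition cost: only upward moves are charged.
        return (j - i) * gamma * ln_n if j > i else 0.0

    # best[j] = cheapest cost of any prefix ending in state j; start in state 0.
    best = [0.0] + [math.inf] * (k - 1)
    back = []
    for x in intervals:
        step, pointers = [], []
        for j in range(k):
            prev = min(range(k), key=lambda i: best[i] + tau(i, j))
            pointers.append(prev)
            step.append(best[prev] + tau(prev, j) + rates[j] * x - math.log(rates[j]))
        best = step
        back.append(pointers)
    # Follow the back-pointers from the cheapest final state.
    state = min(range(k), key=lambda j: best[j])
    total = best[state]
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, total

path, total = viterbi_states([2.0, 2.0, 0.1, 0.1, 0.1, 2.0], [1.0, 2.0, 4.0], gamma=0.5)
```

Each interval requires comparing every pair of states, so the work grows with the square of the number of states times the stream length.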

Page 20: Dynamic modelling of document streams

Predicting the state of new arrivals

• Main point of this work: to predict whether buzz is going up or down.

• Several possible approaches: using the Viterbi algorithm over the whole sequence, and reusing evolutionary algorithms.

• Easy approach for a single state: assume the current trend continues.

Page 21: Dynamic modelling of document streams

Local approximation: results

Previous substream        A. T.   Old s.   New s.   Trend
··· 38 38 39 41 49 49     52      12       0        ↓
··· 41 49 49 52 68 69     69      3        4        ↑
··· 88 89 90 90 91 92     95      0        0        →

Page 22: Dynamic modelling of document streams

But it breaks down after a while

date              GA             approx.
0 (2004-04-02)    7 (0.694669)
···               ···
74 (2004-06-15)   14 (0.797281)
75 (2004-06-16)   24 (0.970706)
76 (2004-06-17)   19 (0.87973)
77 (2004-06-18)   19 (0.87973)   19 (0.87973)
78 (2004-06-19)   0 (0.605263)   19 (0.87973)
79 (2004-06-20)   0 (0.605263)   19 (0.87973)

Page 23: Dynamic modelling of document streams

Fast GA for modelling new arrivals

• Using results of previous fitting.

• Chromosome extended, and last gene mutation probability higher.

[Figure: frequency vs. time, comparing the GA fit and the approx. fit]
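
The seeding idea can be sketched as two small helpers. This is one possible reading, not the authors' procedure: "chromosome extended" is taken to mean stretching the last gene over the new documents, and the higher mutation probability is modelled as a sampling weight. The function names and the weight value are hypothetical.

```python
import random

def extend_individual(individual, n_new_docs):
    """Stretch the last gene so a previously fitted individual covers the new documents."""
    state, last = individual[-1]
    return individual[:-1] + [(state, last + n_new_docs)]

def pick_gene_to_mutate(individual, last_gene_weight, rng):
    """Choose a gene index; the last gene is last_gene_weight times as likely as any other."""
    weights = [1.0] * (len(individual) - 1) + [last_gene_weight]
    return rng.choices(range(len(individual)), weights=weights, k=1)[0]

rng = random.Random(0)
seed_individual = extend_individual([(0, 3), (2, 9)], n_new_docs=5)
picks = [pick_gene_to_mutate([(0, 3), (1, 6), (2, 14)], 10.0, rng) for _ in range(2000)]
```

Starting the population from such extended individuals concentrates the search on the newly arrived tail, where the previous fit says nothing yet.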

Page 24: Dynamic modelling of document streams

Fast GA: Results

Subst. len.   New subs. len.   T. w/out seed   T. w/ seed
219900        100              3895.28         141.45 (79.09)
219000        1000                             144.75 (81.96)
210000        10000                            166.73 (79.32)

Subst. len.   New subs. len.   T. w/out seed   T. w/ seed
3032          100              5048.49         54.6
2632          500                              92.247
2132          1000                             294.97
1132          2000                             570.41

Page 28: Dynamic modelling of document streams

Conclusions

• The presented system dynamically detects changes in the trends of interest in a document stream.

• An EA makes it possible to deal with very large sequences of documents in a reasonable time.

• Extending this EA allows fitting a stream which is an extension of a previously fitted substream in a very short time.

• We plan to study correlations among document streams, to automatically detect the occurrence of new topics composed of multi-word concepts.

Page 30: Dynamic modelling of document streams

The end

• Thanks for your attention

• Any questions?

