
Defensa.V11

Date post: 15-Dec-2014
Upload: promanas
Description:
Navigation behaviour is described by means of biological first principles: the neurophysiology of decision making.
Transcript
Page 1: Defensa.V11

Web User Behavior Analysis

Doctoral Program in Systems Engineering (Doctorado en Sistemas de Ingeniería), Universidad de Chile.

Advisor (Prof. Guía): Juan D. Velásquez

Pablo E. Román

Page 2: Defensa.V11

Outline

• Motivation, Hypothesis, Achievement
• The problems & solutions: Pre-processing, Simulation, Calibration
• Conclusions & Future Work

Page 3: Defensa.V11


Motivation, Hypothesis, Achievement

Page 4: Defensa.V11

The most famous web companies analyze web user browsing behavior.

2009 net profits: Google, US$ 6,520 million; Amazon, US$ 902 million; Netflix, US$ 116 million. (For comparison, Codelco's net profit: 1,262 million.)

Adaptive Web Sites

Page 5: Defensa.V11

Why do we study web user browsing behavior?

• A web user needs to find information quickly and completely.
• To enhance a web site, administrators/owners can only modify:
  - Web pages' contents
  - Web site links
• Hopefully, the modifications suit the target group of users!

Page 6: Defensa.V11

The main problem

Only heuristics are available for analyzing web user browsing behavior in order to enhance the contents and structure of a web site.

We think we can do better…

Page 7: Defensa.V11


Research hypothesis

It is possible to apply neurophysiology’s decision making theories to explain web user navigational behavior by using web data.

Page 8: Defensa.V11

The Thesis Proposal

[Diagram: Web Intelligence taxonomy]

• A.I. in the Web: Web Mining, Knowledge Representation
• Advanced Inf. Tech. in the Web: Agent, Ubiquitous Sys., Wireless Sys., Grid & Cloud Sys., Social Network
• Web Mining: Web Structure Mining, Web Content Mining, Web Usage Mining
• Web user neurocomputing: a neurophysiology model for the analysis of behavior, discovering patterns of web user navigational behavior from the set of users' trails

Page 9: Defensa.V11


Web user neurocomputing in Brief

We use a brain model of decision making to study how people browse a web site.

Based on neurophysiology first principles.

Page 10: Defensa.V11

Machine learning vs. first-principles models

Traditional web mining: machine learning (ML). Generic algorithms that can find, or be trained to reproduce, regularities in data.

First-principles models (FPP): e.g. Newton's laws.

1. Would we use ML or FPP to compute the trajectories of the Apollo mission?

2. The one-million-dollar Netflix contest: achieve a 10% improvement in the accuracy of predicted customer movie preferences. 4 years without a winner!!!

3. If the conditions of the problem change, an ML system must be recalibrated.

Proposed Solution

Page 11: Defensa.V11

Thesis dissertation: Main Contributions

• Novel mechanism for web session extraction from web logs based on Integer Programming.
  - 2008, WI-IAT Int. Conf. R. Dell, P. Román, J. Velásquez. Using a linear objective function.
  - 2010, Submitted to IDA Journal. P. Román, R. Dell, J. Velásquez. Using a network model.
• Application of a Psychology model for describing web user navigation.
  - 2009, AWIC Int. Conf. P. Román, J. Velásquez.
• Simulation of the neurophysiology of decision making.
• Calibration and simulation of a Psychology-based stochastic model.
  - 2009, AAAI BICA Symposium, P. Román, J. Velásquez.
  - 2010, WI-IAT Int. Conf., P. Román, J. Velásquez.

Page 12: Defensa.V11


First problem: Pre-processing

Page 13: Defensa.V11

Basic web operation

Page 14: Defensa.V11

Web data: Web logs (Usage)

# IP Id Access Time Method/URL/Protocol Status Bytes Referer Agent
1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
2 165.182.168.101 - - 16/06/2002:16:24:10 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
3 165.182.168.101 - - 16/06/2002:16:24:57 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
6 204.231.180.195 - - 16/06/2002:16:34:10 GET p1.htm HTTP/1.1 200 3821 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
9 204.231.180.195 - - 16/06/2002:16:38:40 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
10 165.182.168.101 - - 16/06/2002:16:39:02 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
11 165.182.168.101 - - 16/06/2002:16:39:15 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
12 165.182.168.101 - - 16/06/2002:16:39:45 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
13 165.182.168.101 - - 16/06/2002:16:39:58 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
14 165.182.168.101 - - 16/06/2002:16:42:03 GET p3.htm HTTP/1.1 200 4036 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
15 165.182.168.101 - - 16/06/2002:16:42:07 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)
18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)
19 204.231.180.195 - - 16/06/2002:17:35:45 GET p4.htm HTTP/1.1 200 3523 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
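As an illustration of how log records in the layout shown on this slide could be turned into structured data, here is a minimal Python sketch. The regular expression and the function name `parse_log_line` are assumptions made for this example; real server logs differ in layout and would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the simplified layout on the slide:
# "# IP Id Access Time Method URL Protocol Status Bytes Referer Agent"
LOG_LINE = re.compile(
    r"(?P<num>\d+)\s+(?P<ip>\S+)\s+\S+\s+\S+\s+(?P<time>\S+)\s+"
    r"(?P<method>\S+)\s+(?P<url>\S+)\s+(?P<protocol>\S+)\s+"
    r"(?P<status>\d{3})\s+(?P<bytes>\d+)\s+(?P<referer>\S+)\s+(?P<agent>.+)"
)

def parse_log_line(line):
    """Parse one log record into a dict (hypothetical helper)."""
    match = LOG_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    record = match.groupdict()
    record["num"] = int(record["num"])
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    # Timestamps on the slide are day/month/year:hour:minute:second.
    record["time"] = datetime.strptime(record["time"], "%d/%m/%Y:%H:%M:%S")
    return record
```

The IP, agent, time, URL and referer fields extracted this way are the inputs that sessionization works with.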

Page 15: Defensa.V11

Web data: Content (text, object, ...)

You can put anything you want on a Web page, from family to business info…

Hyperlink structure

Page 16: Defensa.V11


Web Data: Hyperlink structure

Page 17: Defensa.V11

Proposal: Data sources

Neurophysiology commonly uses data obtained from subjects wired with neural sensors or from psychological tests (surveys).

I use web data to study human behavior on the web.

Page 18: Defensa.V11

Problem: Web data pre-processing

• Data: hyperlink graph, web page content, web user sessions (sequences of pages).
• Web logs do not directly capture sessions. How can sessions be reconstructed?
• SESSIONIZATION: the process of obtaining sessions.
• If invasive methods are used, privacy rights are violated (forbidden by law in several countries):
  - Cookies
  - Spyware
  - Tracking applications

Page 19: Defensa.V11

Traditional approaches to sessionization

1. Proactive: direct tracking of the web user.
   a) Privacy issues
   b) The most exact
2. Reactive: heuristic reconstruction of the web user's page sequence.
   a) Only an approximation (40% noise)
   b) Uses anonymous activity data sources such as web logs.
   c) Models of behavior are sensitive to noise in the data.

Page 20: Defensa.V11

Traditional heuristic for sessionization

• Filtering: IP + Browser (Agent)
• Timeout of 30 minutes
• Path completion: shortest path backward

How to identify individual web users?
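The first two steps of this heuristic can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes the timeout applies to the gap between consecutive requests of the same IP + agent pair (the slide does not say whether the 30 minutes bound the gap or the whole session), it omits path completion, and the function name `sessionize` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed to bound the gap between requests

def sessionize(records, timeout=TIMEOUT):
    """Group (ip, agent, time, url) tuples into sessions (lists of URLs)."""
    by_user = defaultdict(list)
    for ip, agent, time, url in records:
        by_user[(ip, agent)].append((time, url))  # filtering: IP + agent

    sessions = []
    for requests in by_user.values():
        requests.sort()  # chronological order per presumed user
        current = [requests[0]]
        for previous, request in zip(requests, requests[1:]):
            if request[0] - previous[0] > timeout:
                # Gap too long: close the session and start a new one.
                sessions.append([url for _, url in current])
                current = []
            current.append(request)
        sessions.append([url for _, url in current])
    return sessions
```

Two users behind the same proxy with the same browser are merged by this rule, which is one source of the noise attributed to reactive methods.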

Page 21: Defensa.V11

Sessionization: The proposal

Incorporate all restrictions into a combinatorial optimization problem. Two formulations: maximization of a linear reward, and a network flow model.

Page 22: Defensa.V11

Integer Programming for sessionization (WI-IAT08, R. Dell, P. Román, J. Velásquez)

X_ros: 1 if log register "r" is assigned as the "o-th" request during session "s", and zero otherwise.

It is a labeling problem!

Index r IP Time Method/URL/Protocol Status Bytes Agent
1 "165.182.168.101" 16/04/2008:16:24:06 GET index.htm HTTP/1.1 200 3821 Mozilla/4.0
2 "165.182.168.101" 16/04/2008:16:24:07 GET /informacion/academicos/index.htm HTTP/1.1 200 1345 Mozilla/4.0
3 "165.182.168.101" 16/04/2008:16:24:08 GET /informacion/academicos/Profesores_Titulares/index.htm HTTP/1.1 200 2567 Mozilla/4.0
4 "190.20.216.76" 16/04/2008:16:24:09 GET /images/borde_titulos.jpg HTTP/1.1 200 4678 Mozilla/4.0
5 "190.44.161.57" 16/04/2008:16:24:10 / HTTP/1.1 200 3821 Mozilla/4.0
6 "165.182.168.101" 16/04/2008:16:24:11 GET index.htm HTTP/1.1 200 3821 Mozilla/4.0
7 "165.182.168.101" 16/04/2008:16:24:12 GET informacion/academicos/index.htm HTTP/1.1 200 1345 Mozilla/4.0
8 "165.182.168.101" 16/04/2008:16:24:13 GET index.htm HTTP/1.1 200 3821 Mozilla/4.0
9 "165.182.168.101" 16/04/2008:16:24:14 GET index.htm HTTP/1.1 200 3821 Mozilla/4.0
10 "165.182.168.101" 16/04/2008:16:24:15 GET /exalumnos_empresas/index.htm HTTP/1.1 200 4563 Mozilla/4.0
11 "165.182.168.101" 16/04/2008:16:24:16 GET /proyectos_investigacion/index.htm HTTP/1.1 200 1224 Mozilla/4.0
12 "165.182.168.101" 16/04/2008:16:24:17 GET /publicaciones/index.htm HTTP/1.2 200 2543 Mozilla/4.0
13 "165.182.168.101" 16/04/2008:16:24:18 GET /~cea/ index.html 200 1924 Mozilla/4.0

[Figure: table assigning log registers to sessions; axes: order o within the session, session s]

Page 23: Defensa.V11

Integer Program ~ Maximize the number of sessions. (WI-IAT08, KES09, P. Román et al.)

[Equations: objective function and constraints, labeled as follows]

• Register used once
• One register on o
• Structure and time
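The formulas on this slide did not survive extraction. The block below is a hedged sketch of how the three labeled constraint families could be written with the variables $X_{ros}$ defined earlier. The reward coefficients $C_{ro}$ and the predecessor sets $B_r$ (registers that may immediately precede register $r$: same IP and agent, earlier within the time window, and linked to the page of $r$) are assumptions introduced for illustration, not notation taken from the slides.

```latex
\begin{align*}
\max \quad & \sum_{r,o,s} C_{ro}\, X_{ros}
  && \text{(linear reward)} \\
\text{s.t.} \quad
  & \sum_{o,s} X_{ros} = 1 \quad \forall r
  && \text{(register used once)} \\
  & \sum_{r} X_{ros} \le 1 \quad \forall o, s
  && \text{(one register on } o\text{)} \\
  & X_{r,o+1,s} \le \sum_{r' \in B_r} X_{r',o,s} \quad \forall r, o, s
  && \text{(structure and time)} \\
  & X_{ros} \in \{0, 1\}
\end{align*}
```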

Page 24: Defensa.V11

Network model: minimize the number of sessions. (IDA10, P. Román et al.)

[Figure: flow network from Source to Sink. Each register 1, 2, 3, 4, …, N is duplicated as a pair of nodes (r, r') joined by an arc labeled (1,1); the example shows a total flow Z = 3, with the legend "flow of a session" and the annotation "Now is feasible".]

• An edge indicates register precedence
• A node is a register (duplicated)
• Flow = number of sessions
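Minimizing the number of sessions over a precedence graph with duplicated nodes is closely related to a minimum path cover. The sketch below shows one standard way to compute that number for an acyclic precedence graph, using maximum bipartite matching (minimum paths = registers minus matched pairs). It is an illustration under these assumptions, not necessarily the solver used in the thesis; `min_sessions` and the `precedence` dictionary are hypothetical names.

```python
def min_sessions(n_registers, precedence):
    """Minimum number of vertex-disjoint paths covering registers 1..n.

    precedence[r] lists the registers that may directly follow register r
    in a session; the graph is assumed to be acyclic.
    """
    predecessor_of = {}  # register -> register currently matched before it

    def try_extend(r, visited):
        # Augmenting-path search: find a successor for r, re-routing
        # earlier choices when needed.
        for nxt in precedence.get(r, ()):
            if nxt in visited:
                continue
            visited.add(nxt)
            if nxt not in predecessor_of or try_extend(predecessor_of[nxt], visited):
                predecessor_of[nxt] = r
                return True
        return False

    matched = sum(try_extend(r, set()) for r in range(1, n_registers + 1))
    # Each matched pair glues two registers into the same session.
    return n_registers - matched
```

For example, `min_sessions(3, {1: [2, 3], 2: [3]})` chains the three registers into a single session.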

Page 25: Defensa.V11

Experiment: large scale (15 months), DII departmental web site (http://www.dii.uchile.cl/).

• ~4000 pages
• ~17000 links
• ~15000 visits per month
• Simple: precise information
• Content mainly based on text
• Objective: Academics, Study programs, Projects, …

[Figure: Session size distribution (1 year); x axis: Log Size; y axis: Log N]

Page 26: Defensa.V11

A large-scale experimental evaluation: F-score over cookie-retrieved sessions. (IDA10, P. Román et al.)

• 0<F<1
• Higher F is better

[Figure: F-score curves labeled "Traditional sessionization" and "Both proposals"]

Page 27: Defensa.V11

A large-scale experimental evaluation: F-score over cookie-retrieved sessions.

Method                                     Precision  Recall  F-Score  Time
Sessionization Integer Programming (SIP)   0.7788     0.6696  0.7201   6 hours
Network Flow (BCM)                         0.7777     0.6671  0.7182   4 min
Canonical Sessionization                   0.5091     0.6996  0.5993   1 min

Compared with 15 months of cookie retrieval
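The F-score column can be checked against the precision and recall columns, assuming the usual balanced F-score (the harmonic mean of precision and recall); the slides do not state the exact definition used.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rows taken from the table above: (method, precision, recall).
rows = [
    ("SIP", 0.7788, 0.6696),
    ("BCM", 0.7777, 0.6671),
    ("Canonical", 0.5091, 0.6996),
]
scores = {name: f_score(p, r) for name, p, r in rows}
```

Under this definition the SIP and BCM rows reproduce the reported 0.7201 and 0.7182. The canonical row comes out at about 0.589 rather than the 0.5993 shown, so that entry may have been computed differently or may contain a transcription slip.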

Page 28: Defensa.V11

Summary: Pre-processing

• It is possible to ensure data quality using optimization, even in the worst-case scenario where only web logs are available.
• Main achievement: F = 0.72
• In acceptable processing time: 4 min/month

Ready for Neurocomputing!

Page 29: Defensa.V11


Second problem: Simulation

Page 30: Defensa.V11

Strong regularities: distribution of sessions (WI-IAT08, P. Román et al.)

• An empirical power law for session size has been observed in the literature [Huberman et al. 1998, Science]: the Web Surfer Law.

• The correlation coefficient and standard error of a power-law fit give us a sense of the quality of the sessions.

• Our correlation coefficient is 0.94 and our standard error is 0.3817. A common heuristic has a correlation coefficient of 0.91 and a standard error of 0.64.
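A power-law fit of this kind is commonly done as a straight-line least-squares fit in log-log space. The sketch below assumes that approach (the slides do not specify the fitting procedure); `fit_power_law` is a hypothetical helper, and `sizes` and `counts` stand for session sizes and the number of sessions of each size.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares line through (log10 size, log10 count).

    Returns (slope, intercept, correlation coefficient, standard error).
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Standard error of the regression (n - 2 degrees of freedom).
    std_error = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    return slope, intercept, r, std_error
```

For a decaying distribution the correlation coefficient returned here is negative; the values quoted on the slide are presumably magnitudes.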

Page 31: Defensa.V11

Regularity: presence of an internal rule (2008, CLAIO, P. Román et al.)

• Law of surfing.
• Machine learning algorithms have been applied in order to capture such regularities.
• Today, new directions based on brain informatics are used to explain navigation.

What we need is a theory for explaining such regularities!

Page 32: Defensa.V11

Proposal: to adapt psychological theory to web navigation, using web data.

• Human behavior on the web is the result of processing by the brain's neural network.
• Requires historical data of individuals' trajectories on a web site.
• It is difficult to compute or predict the activity of $10^{11}$ neurons and $10^{14}$ interconnections.
• Diffusion process -> average at the mesoscopic level. This is the point of view of this thesis.

Page 33: Defensa.V11

Biological experiment (1970-2005)

• Rhesus monkeys with sensors placed on the lateral intraparietal (LIP) cortex (2002-2008).
• Screen with moving dots; the decision is to select the correct direction of motion.
• Monkeys are trained to receive a reward if they answer correctly.
• The possible options map onto the LIP cortex, and the point with the highest neural activity corresponds to the decision of the subject.

Page 34: Defensa.V11

Neurophysiology of decision making: First Principles

First hitting time -> time to decide. First hitting coordinate -> the choice.

[Figure: trajectory of neural activity in the (X1, X2) plane, starting at 0, for two options; it decides option 1.]

Page 35: Defensa.V11

LCA Model (Leaky Competing Accumulator) [M. Usher et al., 2001]

• X > 0 → biological condition: neural activity is positive.
• I is considered exogenous and constant.
• The other parameters (κ, λ, σ) of the model are positive.
• The stochastic equation:

$$dX_j = \Big(-\kappa X_j - \lambda \sum_{l \neq j} X_l + I_j\Big)\,dt + \sigma\,dW_j$$

• $I_j$: likelihood of making choice j. It drives the decision! It results from processing in other areas (e.g. visual cortex). Important parameter!!

Page 36: Defensa.V11

Application: the browsing process

• Arrivals (first page) are exogenous to this model.
• Based on historical sessions, the model predicts the probability of following a link.
• Web users are information seekers and respond according to text.

Page 37: Defensa.V11

Modeling the likelihood of choosing each option (vector I)

• $I_j$ is considered the probability of choosing option j.
• Discrete choice theory.
• Text must be represented as numeric entities -> bag-of-words model with TF-IDF (~ vector of word frequencies).

Page 38: Defensa.V11

Likelihood of a decision and web user utility

• Random Utility Model (economics): individuals decide among discrete options {j} with utility $V_j$, with probability $P_j$ of choosing j.

• The likelihood of taking decision j should be proportional to $P_j$:

$$I_j = \frac{e^{V_j}}{\sum_k e^{V_k}}$$

• Web user objectives are modeled as a text vector µ, e.g. µ = [0.3 (mba), 0.1 (mgo), 0.09 (dii), ...]. Web users are information (TEXT) seekers.

• Similarity between texts is measured as the cosine between both vectors, with $L_j$ the text vector of option j:

$$V_j = \frac{\mu \cdot L_j}{\|\mu\|\,\|L_j\|}$$
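The cosine utility and the logit form of the evidence vector can be sketched in a few lines. This is a minimal illustration assuming that both the user's text preference µ and each link's text are plain lists of word weights over the same vocabulary; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def evidence_vector(mu, link_texts):
    """Logit choice probabilities I_j from the cosine utilities V_j."""
    utilities = [cosine(mu, text) for text in link_texts]
    weights = [math.exp(v) for v in utilities]
    total = sum(weights)
    return [w / total for w in weights]
```

A link whose text points in the same direction as µ gets utility 1 and therefore the largest share of the evidence, while an unrelated link gets utility 0.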

Page 39: Defensa.V11

Assumptions & Approximations

• Web browsing is characterized only by jumps.
• Independence of the available choices.
• Utility depends only on text.
• Independence from the previously visited trail.
• No information satiety.
• Rational web user.
• Correctness of web site information.
• Web pages with little content.
• Web pages with simple content.
• Web user information processing time is negligible.

Page 40: Defensa.V11

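Adaptation of the LCA model (WI-IAT10, P. Román et al.)

• It is a Langevin equation.
• Force interpretation of the stochastic evolution of neural activity.
• Opens the way to improving the dynamic system: adding forces.

$$dX_j = \Big(-\kappa X_j - \lambda \sum_{l \neq j} X_l + I_j\Big)\,dt + \sigma\,dW_j$$

$$dX_j = \big(F_j^D + F_j^C + F_j^E\big)\,dt + \sigma\,dW_j$$

• Dissipation: $F_j^D = -\kappa X_j$
• Inhibition: $F_j^C = -\lambda \sum_{k \neq j} X_k$
• Evidence: $F_j^E = I_j$
• Noise: $\sigma\,dW_j$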

Page 41: Defensa.V11

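The Fokker-Planck equation: probability density of not reaching a decision (AWIC09, P. Román et al.).

Stochastic equation: $dX_j = \big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l\big)\,dt + \sigma\,dW_j$, with drift $F(X)$.

• Probability density $\phi(X,t)$:

$$\frac{\partial \phi}{\partial t}(X,t) = \nabla \cdot \Big[ -F(X)\,\phi(X,t) + \frac{\sigma^2}{2} \nabla \phi(X,t) \Big]$$

• Never reach a decision in t' < t (on the decision boundary): $\phi(X,t) = 0$

• Neural activity is positive (on the boundary at zero activity, with outward normal $\hat{n}$): $\hat{n}(X) \cdot \Big[ F(X)\,\phi(X,t) - \frac{\sigma^2}{2} \nabla \phi(X,t) \Big] = 0$

• Neural activity is initially near 0: $\phi(X,0) = \delta(X)$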

Page 42: Defensa.V11

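The probability of reaching a decision in time t.

The probability of deciding option "j" in time "t", where $J$ is the probability flux of the density $\phi$ through the decision surface $S_j$:

$$p(j,t) = \int_{S_j} J \cdot dS_j = \int_0^1 \cdots \int_0^1 J_j \Big|_{X_j = 1} \prod_{k \neq j} dX_k = -\frac{\sigma^2}{2} \int_0^1 \cdots \int_0^1 \frac{\partial \phi}{\partial X_j} \Big|_{X_j = 1} \prod_{k \neq j} dX_k$$

Stochastic equation: $dX_j = \big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l\big)\,dt + \sigma\,dW_j$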

Page 43: Defensa.V11

Unconstrained exact solution

• Hermite polynomials
• Exact solution

$dX_j = \big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l\big)\,dt + \sigma\,dW_j$ (Ornstein-Uhlenbeck)

Page 44: Defensa.V11

Evolution of the exact unconstrained solution

• Nearly a delta at t=0, X=0.
• The large-time solution tends to 0.
• No border condition, but at t=0 the values of the delta on the border are nearly 0.

(Ornstein-Uhlenbeck)

Page 45: Defensa.V11

This approach is threefold

1. The stochastic equation allows simulations for finding probabilities given a web site. But the parameters need to be calibrated. Approximation: constant for all users.

2. Calibration of the model is performed by maximum likelihood. But it requires web data (a set of sessions) and an approximation of the density φ.

3. Sessions need to be obtained with higher accuracy.

$dX_j = \big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l\big)\,dt + \sigma\,dW_j$

Page 46: Defensa.V11

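Simulation: Monte Carlo simulation

• Euler approximation
• Exact simulation

$$X_j^{k+1} = X_j^k + a_j(t_k, X^k)\,\Delta t + b_j(t_k, X^k)\,\Delta W_j^k$$

$$a_j(t, X) = -\kappa X_j - \lambda \sum_{l \neq j} X_l + I_j, \qquad b_j(t, X) = \sigma$$

Stochastic equation: $dX_j = \big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l\big)\,dt + \sigma\,dW_j$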

Page 47: Defensa.V11

Simulation algorithm: deciding which link to follow.
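The algorithm itself was an image on the slide and is not in the transcript. The following is a minimal sketch of how one decision could be simulated with an Euler scheme for the LCA equation, under stated assumptions: the decision threshold is 1.0, activity is clamped at zero to keep it non-negative, and the time step `dt` is arbitrary. The function name and defaults are hypothetical.

```python
import math
import random

def simulate_decision(evidence, kappa, lam, sigma, dt=0.01, threshold=1.0,
                      max_steps=100_000, rng=random):
    """Return (chosen link index, decision time), or (None, elapsed time)."""
    x = [0.0] * len(evidence)           # neural activity starts at 0
    noise_scale = sigma * math.sqrt(dt)  # standard deviation of sigma * dW
    for step in range(1, max_steps + 1):
        total = sum(x)
        new_x = []
        for j, xj in enumerate(x):
            # Drift: evidence - dissipation - lateral inhibition.
            drift = evidence[j] - kappa * xj - lam * (total - xj)
            value = xj + drift * dt + noise_scale * rng.gauss(0.0, 1.0)
            new_x.append(max(value, 0.0))  # activity stays non-negative
        x = new_x
        winner = max(range(len(x)), key=x.__getitem__)
        if x[winner] >= threshold:
            # First hitting coordinate is the choice, hitting time the delay.
            return winner, step * dt
    return None, max_steps * dt
```

Repeating this from page to page, with the evidence vector recomputed from the links of the current page, would yield a simulated session; the first coordinate to hit the threshold gives the link followed and the hitting time gives the time spent deciding.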

Page 48: Defensa.V11

Results: simulated session length distribution (BICA08, AWIC09, BAO10).

Empirical result: session length [1] distribution follows a power law [4,5].

$$V_j = \frac{\mu \cdot L_j}{\|\mu\|\,\|L_j\|}$$

• A kind of average web user.
• µ contains all the text in the web site.
• Sessions with L>20 diverge: users that perform more elaborate processing?
• Sessions with L=1: users that have other text interests?

Page 49: Defensa.V11


Results: Number of visits per page. Fuzzy, but averages remain similar.

Page 50: Defensa.V11

Adjustment of the distribution of time used per session.

[Figure: "Log scale time spent per session", simulated sessions; x axis: Number of sessions; y axis: Time used; linear fit y = -0.3968x + 6.8958, R² = 0.9573]

• Same power law as in the real case.

• Shift in time: changes the time scale, which is used for adjusting the white noise variance.

• The slope represents more structural behavior. It is intended for adjusting the other scalar parameters.

Page 51: Defensa.V11

In Summary

With only an estimate of the parameters, the simulation shows results that are close to the real data.

Calibrating the model should produce better simulations.

Page 52: Defensa.V11


Third problem: Calibration

Page 53: Defensa.V11

Calibration (WI-IAT 2010, P. Román et al.)

• Parameters κ, λ, σ: should correspond to properties of neural tissue. Approximation: constant for all users.

• Parameter I, the evidence vector: corresponds to the intention of the web user. It is distributed.

The density must be approximated!!!

Page 54: Defensa.V11

Parameter Inference

SESSION DATA:
• (i,j): hyperlink from i to j.
• k: indexes the time distribution.
• $n_{ijk}$: the number of observed transitions.
• $t_{ijk}$: the observed time used on this observation.

Maximum log-likelihood:

$$\max_{I,\kappa,\lambda,\sigma} S = \sum_{ijk} n_{ijk}\, \log P\big(i, j, t_{ijk} \mid I, \kappa, \lambda, \sigma\big)$$

• Approximate P by a linear combination of unconstrained exact solutions.

• The approximated probability function must satisfy the restrictions of the LCA model.

Page 55: Defensa.V11

Curse of dimensionality

Many numerical methods for solving differential equations require a partition of the space.

Discretization involves:
• Each coordinate partitioned into 100 intervals.
• A typical number of links on a page is 20.
• The total number of points in the discretization is then about $10^{40}$: unmanageable.
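The count follows from one coordinate per link, each partitioned into 100 values, so the grid size is the product over the 20 coordinates:

```latex
N_{\text{grid}} = \underbrace{100 \times 100 \times \cdots \times 100}_{20 \text{ coordinates}}
               = 100^{20} = \left(10^{2}\right)^{20} = 10^{40}
```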

Page 56: Defensa.V11

Distribution of the number of links per page.

[Figure: histogram; x axis: Number of links (0 to 70); y axis: Number of pages (0 to 140)]

Page 57: Defensa.V11

Proposal (1): use symbolic processing

• Explicit expressions are not manageable by hand.
• Operations involved: integration, differentiation, product, …
• Φ is based on polynomials.
• Instead of evaluating at each step, it is better to perform symbolic manipulation until evaluation is needed.
• A grid is not necessary for intermediate steps.

[Figure: expression tree for $\frac{1}{1 - x^2}$]

Page 58: Defensa.V11

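Proposal (2): use the time propagator of the Cauchy problem

• The initial condition is concentrated on 0: $\phi|_{t=0} = \phi_0$

• Propagator:

$$e^{tL} \approx 1 + tL + \frac{1}{2} t^2 L^2 + \frac{1}{6} t^3 L^3 + \frac{1}{24} t^4 L^4$$

• Operator, with $F(X)$ the drift of the LCA equation:

$$L = \nabla \cdot \Big[ -F(X) + \frac{\sigma^2}{2} \nabla \Big]$$

But L must ensure the border condition!!!!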
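To show how a truncated propagator combines with symbolic (polynomial) processing, here is a one-dimensional sketch. It is an illustration under simplifying assumptions, not the thesis implementation: a single coordinate, a linear drift $I - \kappa x$, polynomials stored as coefficient lists `[c0, c1, c2, ...]`, and no boundary conditions. All function names are hypothetical.

```python
from math import factorial

def poly_derivative(p):
    """Derivative of a polynomial given as coefficients [c0, c1, c2, ...]."""
    return [i * c for i, c in enumerate(p)][1:] or [0.0]

def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_add(p, q):
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

def poly_scale(p, s):
    return [s * c for c in p]

def fokker_planck_op(p, evidence, kappa, sigma):
    """Apply L p = -d/dx[(I - kappa x) p] + (sigma^2 / 2) p''."""
    drift = [evidence, -kappa]
    transport = poly_scale(poly_derivative(poly_mul(drift, p)), -1.0)
    diffusion = poly_scale(poly_derivative(poly_derivative(p)), 0.5 * sigma ** 2)
    return poly_add(transport, diffusion)

def propagate(p, t, evidence, kappa, sigma, order=4):
    """Truncated series: sum over n <= order of t^n L^n p / n!."""
    result, term = list(p), list(p)
    for n in range(1, order + 1):
        term = fokker_planck_op(term, evidence, kappa, sigma)
        result = poly_add(result, poly_scale(term, t ** n / factorial(n)))
    return result

def poly_eval(p, x):
    return sum(c * x ** i for i, c in enumerate(p))
```

With no drift, applying the propagator to $x^2$ gives $x^2 + \sigma^2 t$ exactly, because the series terminates after the first-order term; with drift the series is only an approximation for small t. Only coefficient arithmetic is involved, so no grid is needed until the final evaluation.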

Page 59: Defensa.V11

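Proposal (3): penalization method for ensuring the border condition

A force $F_P$ on the boundary is added to ensure reflection and absorption.

$$dX_j = \Big(I_j - \kappa X_j - \lambda \sum_{l \neq j} X_l + F^P_j\Big)\,dt + \sigma\,dW_j$$

$$F_P(x) = (1-x)^{2n} + x^{2n}$$

$$\tilde{L} = \nabla \cdot \Big[ -\big(F(X) + F_P\big) + \frac{\sigma^2}{2} \nabla \Big]$$

Here $F(X)$ denotes the drift of the LCA equation without the penalty force.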

Page 60: Defensa.V11

Approximating the probability distribution Φ

• The unconstrained case involves a polynomial solution.
• The propagator takes Φ from a small t to a t'.
• The propagator involves only derivatives.
• Symbolic processing of the solution can be performed to build solutions for the required time t'.
• The probability P is built from a derivative of a definite integral, both of which are easily calculated by symbolic processing.
• A solution to the dimensionality problem!!!

Page 61: Defensa.V11

Experiment: DII departmental web site.

• ~4000 pages
• ~17000 links
• ~15000 visits per month
• Simple: concise and precise information.
• Content mainly based on text.
• Objective: Academics, Study programs, Projects, …

[Figure: Session size distribution (1 year); x axis: Log Size; y axis: Log N]

Page 62: Defensa.V11

Calibration of parameters

• Neurophysiology parameters (λ, κ, σ, in the order listed on the slide): 0.4, 0.2, 0.03

• Text vector preference, 1 vector. Highest-ranked words: MBA, Syllabus, Project.

• A distribution of Gaussian vectors: 3 main clusters related to study programs, academics, economics.

Page 63: Defensa.V11

Simulation in the DII site

Average error of only 5% in the distribution of session size; precision: 0.8, recall: 0.74 by number of specific sessions.

[Figure: Empirical vs. simulated web user sessions; x axis: Log(Session Length); y axis: Log(Frequency, number of sessions); series: Simulation, Experimental]

Page 64: Defensa.V11

Comparison with the ML approach

• ML algorithm based on clustering sessions with a text-based measure.

• The simulation approximates 70% of reality.

• ML reaches 60% [J. Borges, 2007, IEEE Trans.] [Ghorbani, 2007, WI-IAT] [J. Velásquez et al., 2007, International Journal of Artificial Intelligence Tools] [J. Velásquez et al., 2007, Journal of Knowledge-Based Systems (Elsevier)]

Page 65: Defensa.V11

Situation after 1 month: stability of the calibration.

• 5% of links were modified.
• 2% of pages are new or deleted.
• 30% of words in documents have changed.

The simulation reaches an F-score of 0.7.

Page 66: Defensa.V11

In Summary

• In spite of changes to the web site configuration (after 1 month), the simulation returns a similar session distribution.

• The complexity of the calibration process is improved by symbolic calculation.

• The density of session length is matched to 95%. But the distribution of sessions is matched to 70%.

Page 67: Defensa.V11


Conclusions & Future Work

Page 68: Defensa.V11

Conclusion (1)

• In spite of the anonymous character of a web log, it is possible to extract sessions in good agreement with an empirical statistical law.
• ~70% F-score.
• Quick pre-processing can be obtained with the use of network models.
• Further exploration using the combinatorial model leads us to retrieve other likely values.

Page 69: Defensa.V11

Conclusions (2)

• Web users are shown, using simulation, to behave like text information seekers.
• Simulating a web user is a straightforward algorithm if the parameters are known.
• The distribution of web user sessions is obtained with notable precision.
• Calibration is notably difficult due to the dimensionality. A method based on symbolic manipulation and semi-group propagation was proposed for density estimation.

Page 70: Defensa.V11

Conclusion (3)

The model is robust to changes to the web site, maintaining 70% accuracy in predicting the distribution of sessions.

This compares with traditional data mining methods, which reach only 60-70% on one-step prediction alone.

Page 71: Defensa.V11

Future Work: Web personalization

• Simulation is cheap and parallelizable once the model is trained (expansion coefficients are fitted).

• Small changes (same semantics) to the web site (hyperlinks and structure) produce changes in web user trails on the web site.

• Simulation predicts web usage, assuming that the same users with the same fitted behaviour will visit the web site.

• Iterating over changes and simulations could find better changes, given a measure of quality.

Page 72: Defensa.V11

Publications: Book Chapters

1. 2010. Web Usage Mining, P. Román, G. L’Huillier, J. Velásquez, in Advanced Techniques in Web Intelligence – 1. J. Velásquez, L. Jain, Springer.

2. 2010. Advanced Techniques in Web Data Pre-Processing and Cleaning, P. Román, R. F. Dell, J. Velásquez, in Advanced Techniques in Web Intelligence – 1. J. Velásquez, L. Jain, Springer.

Publications: International Journal

1. 2010. Optimization Models for Sessionization, submitted to Journal of Intelligent Data Analysis.

2. 2011. Simulation of web user navigation. In preparation.

Page 73: Defensa.V11

International Conferences

1. 2006. Improving a Web Site using Keywords, P. Román, J. Velásquez, CLAIO XIII, Int. Conf. Uruguay.
2. 2008. Markov Chain for modeling the Web User Behavior, P. Román, J. Velásquez, Informs, CLAIO XIV, Int. Conf. Colombia.
3. 2008. Identifying Web User Session using an Integer programming Approach, R. Dell, P. Román, J. Velásquez, CLAIO XIV, Int. Conf. Colombia.
4. 2008. Web User Session Reconstruction Using Integer Programming, R. Dell, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Australia.
5. 2009. A Dynamic Stochastic Model Applied to the Analysis of the Web User Behavior, P. Román, J. Velásquez, IEEE, AWIC Int. Conf. Czech Republic.
6. 2009. Fast Combinatorial Algorithm for Web User Session Reconstruction, R. Dell, P. Román, J. Velásquez, the 24th IFIP TC7 Int. Conf., Argentina.
7. 2009. Analysis of the Web User Behavior with a Psychologically-Based Diffusion Model, P. Román, J. Velásquez, AAAI BICA Int. Conf., USA.
8. 2009. Web User Session Reconstruction with Back Button Browsing, P. Román, R. Dell, J. Velásquez, IEEE LNAI 5711, KES Int. Conf. Chile.
9. 2010. Stochastic Simulation of Web Users, P. Román, J. Velásquez, IEEE/ACM, WI-IAT Int. Conf. Canada.

Page 74: Defensa.V11

Publications: National Conferences

1. 2010. Ant Colony Surfer: Discovering the Distribution of Text Preferences from Web Usage, P. Loyola, P.E. Román and J.D. Velásquez, BAO.

2. 2010. Best Web Site Structure for Users Based on a Genetic Algorithm Approach, E. Andaur, S. Rios, P.E. Román and J.D. Velásquez, BAO.

3. 2010. Artificial Web User Simulation and Web Usage Mining, P.E. Román and J.D. Velásquez, BAO.

4. 2010. Time Course of the Web User, P.E. Román and J.D. Velásquez, TUO2.

Publications: National review

1. 2009. Un método de optimización lineal entera para el análisis de sesiones de usuarios web, Revista de Ingenieria de Sistemas, Vol. 23.

Page 75: Defensa.V11

Thank you for your attention.