Thesis Presentation

Clustering Internet users based on their behavior towards banner ads

Despina Stamkou

14 Feb 2011

Agenda

Introduction

Theoretical Background

Method

Results

Analysis

Conclusions

Future Work

Marketing is an exchange process of values between companies and customers

(Philip, Armstrong, Wong and Saunders, 2010)

Online Marketing

2nd position in advertisement investment (Orbit Scripts, 2011)

Introduction :: Background

Online Advertisements are promoted through Web Sites (Publishers)

The goal is to motivate the internet users to click on the online advertisements

Users with similar profiles click on similar online advertisements (Giuffrida et al., 2001)

Users are more likely to click on personalised advertisements than on non-personalised ones (automatic optimisation)

Introduction :: Background

[Diagram, steps numbered 1 to 7: AdNetwork with Advertisement 1, 2, 3, …, N; automatic optimisation mechanism; advertisement placement; publisher]

Introduction :: Background

Automatic Optimisation Mechanism for personalised online advertisements

Web Site

Company between publishers and clients

Client’s Advertisements

Introduction :: Problem Statement

Problem: AdNetworks need to develop an intelligent automatic optimisation logic to keep a competitive position in the online marketing business

Goal: Evaluate well-known grouping algorithms and use the best-performing one for the automatic optimisation logic

Purpose: To prove that the performance advantage of the dominant algorithm is data-independent

Introduction :: Method & Material

Literature Study: background knowledge on clustering; identify algorithms with significant clustering performance

Empirical Part: compare the identified algorithms

Introduction :: Significance

Automatic optimisation can increase the revenues of an AdNetwork

The thesis topic is part of the automatic optimisation project at Tradedoubler and will use data from that specific AdNetwork

Each AdNetwork has different data but can benefit from the conclusions

The conclusions will reinforce the data-independence of the dominant clustering algorithm

Introduction :: Limitations

Only two clustering algorithms are examined

The number of clusters is predefined

The data set has a specific dimensionality and is not publicly available

The data set represents a snapshot of the users’ behaviour over a specific period

Theoretical Background :: Classification vs Clustering

Data mining is the process of discovering knowledge from data sources (Bing Liu, 2006)

Supervised Classification (Classification): we know the class labels and the number of classes

1. dark blue

2. light green

3. dark orange

…

n. pink

Unsupervised Classification (Clustering): we do not know the class labels and may not know the number of classes

1. ???

2. ???

3. ???

…

?. ???

Groups users with similar characteristics: opportunity to predict future actions

Groups users with the exact same characteristics: impossible to predict future actions

Theoretical Background :: Selecting the clustering method

Clustering

Exclusive / Non-Exclusive

Hierarchical / Partitional

Divisive / Agglomerative

Non-Exclusive: a data object belongs to one or more clusters

Exclusive: a data object belongs to only one cluster

Theoretical Background :: Related Research

The most recent related studies (2011) were selected for examination

These studies compared the clustering performance of the best-performing algorithms from earlier related studies

The K-means algorithm was used as a baseline

The algorithms were examined with a predefined number of clusters

Performance was measured with a fitness function
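As an illustration of the baseline, the following is a minimal sketch of textbook K-means with a predefined number of clusters. It is not the Perl program used in the thesis; the function names, the random initialisation from the data points, the iteration cap and the toy two-dimensional points are all assumptions made for this example.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means with a predefined k; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment step: re-attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # no point changed cluster: converged
        labels = new_labels
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two groups end up in different clusters; on real data the result depends on the initial centroids.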

Theoretical Background :: Selecting the algorithms

Particle Swarm Optimisation (PSO) & K-means

K-means as a baseline

PSO because it outperformed the rest of the clustering algorithms

Limited studies around PSO

Interesting to evaluate PSO performance with the available data set from Tradedoubler and reinforce the data-independence

Method :: Data Selection

The data set consists of real transactions within Tradedoubler’s AdNetwork

254,046 rows

Sampling by time period: 1 month

Information columns:

PROGRAM_ID: ID of the campaign to which the banner belongs

WEBSITE_ID: ID of the website from which the action was generated

BANNER_ID: ID of the banner with which the user interacted

EVENT_ID: ID of the event (click or sale)

USER_AGENT: the visitor’s web browser agent and operating system

TIMESTAMP: time the transaction was made

Advertisement / Campaign info

Internet user info

Method :: Evaluation Criteria

Clustering evaluation is a complex and difficult problem (Liu, 2006)

Types of evaluation

External: with readable and meaningful data, without numbers

Indirect: with an external application which will test the results

Internal: with any distance comparison function

Method :: Fitness Function

The fitness function used takes, for each cluster, the maximum distance between the cluster's centroid and a data vector, and sums these maxima over all clusters

The smaller the value of the sum, the better the clustering algorithm performs

Hypothetical representation of clusters and vectors in two dimensions

Each term of the sum is the maximum distance between a centroid and a data vector
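The description above can be sketched as follows. This is one reading of the slide, not the thesis code: it assumes Euclidean distance, treats an empty cluster as contributing zero, and uses hypothetical names and toy data. In symbols chosen for this example, the value is the sum over clusters j of the largest d(x, c_j) among the data vectors x assigned to centroid c_j.

```python
import math


def max_distance_fitness(points, labels, centroids):
    """Sum over clusters of the largest centroid-to-member distance."""
    worst = [0.0] * len(centroids)  # assumption: empty clusters contribute 0
    for point, label in zip(points, labels):
        d = math.dist(point, centroids[label])  # assumption: Euclidean distance
        worst[label] = max(worst[label], d)
    return sum(worst)


# Toy data (hypothetical): two clusters of two points each.
points = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 0.0), (10.0, 10.0)]
fitness = max_distance_fitness(points, labels, centroids)  # 5.0 + 2.0 = 7.0
```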

Method :: Alternative Fitness Function

Sum of the average distances between the centroid and the data vectors

Sum of the minimum distances between data objects that belong to different clusters

The fitness function selected for this study has been used in related research for the same purpose and with the same algorithms, and was therefore preferred over the alternatives
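The two alternatives could look like the sketch below. The slide does not give formulas, so the exact definitions here are assumptions: Euclidean distance, the first measure summed over clusters, and the second summed over pairs of clusters. Note that the second is a separation measure, for which a larger value would normally be preferred.

```python
import math
from itertools import combinations


def average_distance_fitness(points, labels, centroids):
    """Sum over clusters of the mean centroid-to-member distance."""
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for point, label in zip(points, labels):
        totals[label] += math.dist(point, centroids[label])
        counts[label] += 1
    # Empty clusters are skipped (assumption).
    return sum(t / c for t, c in zip(totals, counts) if c)


def min_separation_fitness(points, labels):
    """Sum over cluster pairs of the smallest distance between their members."""
    by_pair = {}
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        if lp != lq:
            pair = (min(lp, lq), max(lp, lq))
            by_pair[pair] = min(by_pair.get(pair, math.inf), math.dist(p, q))
    return sum(by_pair.values())


# Toy data (hypothetical): two clusters of two points each.
points = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 0.0), (10.0, 10.0)]
average = average_distance_fitness(points, labels, centroids)  # 2.5 + 1.0
separation = min_separation_fitness(points, labels)  # distance (3,4) to (10,10)
```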

Results :: Methods, Tools and Time

Programs were developed in Perl and parameterised for the multidimensional data set

Both algorithms ran for 10 different values of K: 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50

Operating system: Ubuntu Linux. Hardware: 3 GB RAM, Intel Core Duo processor at 2.26 GHz

K-means ran for 1.5 hours in total and PSO for 7 hours, an execution-time ratio of roughly 1:4.7

Results :: Performance Chart

Analysis :: Performance Comparison

PSO >> K-means. Why?

Both algorithms calculate the next position of the clusters and keep moving them within the search space until their positions no longer change, but…

…PSO evaluates each next position in the space by using an internal fitness method

…This method keeps a memory of the previous fitness value of each cluster and compares it with the fitness of the new position

…A decision is then made whether to keep the new position or return the cluster to the previous one
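The memory-and-compare idea can be illustrated with a textbook-style PSO clustering sketch. This is not the thesis's Perl implementation, and it differs in one detail worth naming: in standard PSO a particle keeps moving, and what is kept or discarded is the remembered best position. Each particle here encodes a full set of k centroids; the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size and the toy data are assumed values for illustration.

```python
import math
import random


def fitness(points, centroids):
    """Sum over clusters of the largest centroid-to-member distance."""
    worst = [0.0] * len(centroids)
    for p in points:
        j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        worst[j] = max(worst[j], math.dist(p, centroids[j]))
    return sum(worst)


def pso_cluster(points, k, particles=10, iterations=50,
                w=0.72, c1=1.49, c2=1.49, seed=0):
    """Each particle is a candidate set of k centroids; returns (best, best_fitness)."""
    rng = random.Random(seed)
    dim = len(points[0])
    swarm = [[list(p) for p in rng.sample(points, k)] for _ in range(particles)]
    velocity = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    best = [[c[:] for c in s] for s in swarm]        # remembered positions
    best_fit = [fitness(points, s) for s in swarm]   # remembered fitness values
    g = min(range(particles), key=best_fit.__getitem__)
    g_best, g_fit = [c[:] for c in best[g]], best_fit[g]
    for _ in range(iterations):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Pull towards the particle's own best and the swarm's best.
                    velocity[i][j][d] = (w * velocity[i][j][d]
                                         + c1 * r1 * (best[i][j][d] - swarm[i][j][d])
                                         + c2 * r2 * (g_best[j][d] - swarm[i][j][d]))
                    swarm[i][j][d] += velocity[i][j][d]
            new_fit = fitness(points, swarm[i])
            # Keep the new position in memory only if it beats the remembered one.
            if new_fit < best_fit[i]:
                best[i], best_fit[i] = [c[:] for c in swarm[i]], new_fit
                if new_fit < g_fit:
                    g_best, g_fit = [c[:] for c in swarm[i]], new_fit
    return g_best, g_fit


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, score = pso_cluster(points, k=2)
```

Because every candidate position costs a full fitness evaluation over the data, a swarm of this kind does noticeably more work per iteration than K-means.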

Analysis :: Similarity Evaluation

A basic external evaluation on a small sample of data vectors traced similarities between them, supporting the claim that homogeneous users were grouped within the same clusters

Even though external evaluation will not be used as an argument for the final conclusions, it still gives us confidence that the clustering algorithms were developed properly

Analysis :: Limitations

The fitness function is the main evaluation method. Combining it with indirect evaluation would give more accurate conclusions

Fitness was measured for a predefined number of clusters. Hypothetically, PSO would continue to perform well for higher values of K.

Yet this is not proven by the experiments

The basic external evaluation should not be taken as a criterion for the performance of the algorithms; rather, it gives some assurance that the algorithms were implemented correctly

Conclusions

The experiments reinforce the superiority of PSO in terms of performance, regardless of the nature and the dimensionality of the data

Important fact: the data come from real-life transactions

There are indications that the higher the number of clusters, the better the resulting fitness for PSO. This comes with additional processing effort and memory use

The best number of clusters can be defined based on processing time and fitness

Future Work

Compare different PSO hybrids without a predefined number of clusters

Develop the personalised mechanism to propose relevant advertisements

Users’ actions will define the performance: indirect method of evaluation

Inside a Cluster:

Subgroup 1: has seen Advertisement A

Subgroup 2: has seen Advertisement B

Subgroup 3: has seen Advertisement A and Advertisement B

Show Advertisement B

Show Advertisement A

Show Advertisement from neighbour cluster
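One way to read the diagram is as a simple rule: show a user an advertisement from their own cluster that they have not yet seen, and fall back to a neighbour cluster once they have seen them all. The sketch below encodes that reading. The pairing of subgroups with actions is inferred from the diagram labels, and the function name and the neighbour advertisement "C" are hypothetical.

```python
def next_advertisement(seen, cluster_ads, neighbour_ads):
    """Pick an ad the user has not seen, preferring the user's own cluster."""
    for ad in cluster_ads:
        if ad not in seen:
            return ad
    # Everything in the own cluster has been seen: try the neighbour cluster.
    for ad in neighbour_ads:
        if ad not in seen:
            return ad
    return None  # nothing new to show


cluster_ads = ["A", "B"]
neighbour_ads = ["C"]  # hypothetical advertisement from a neighbour cluster
choice_1 = next_advertisement({"A"}, cluster_ads, neighbour_ads)       # subgroup 1
choice_2 = next_advertisement({"B"}, cluster_ads, neighbour_ads)       # subgroup 2
choice_3 = next_advertisement({"A", "B"}, cluster_ads, neighbour_ads)  # subgroup 3
```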

Questions / Comments

Thank you!

References

Philip, K., Armstrong, G., Wong, V. and Saunders, J., 2010. Principles of Marketing, 5th edition. New Jersey: Pearson Education, p. 7

Giuffrida, G., Reforgiato, D., Tribulato, G. and Zabra, C., 2001. A Banner Recommendation System Based on Web Navigation History. Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium, Paris

Liu, B., 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Chicago: Springer, p. 6
