Clickstream Data Warehouse - Turning clicks into customers

Post on 29-Jun-2015

description

As the web becomes a main channel for reaching customers and prospects, Clickstream data generated by websites has become another important enterprise data source, alongside traditional business data sources such as store transactions, CRM data and call center logs. Simple as it may sound to record every click a customer makes, Clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been underutilized. However, the benefits come with a problem. Amazon records 5 billion clicks a day and the whole US generates 400 billion clicks, equivalent to 3.4 petabytes a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source. This presentation uses big data technology to help solve this big data problem; the presenter will explain Clickstream data, including its benefits, its challenges and the solution. The end-to-end solution includes a proposed data architecture, ETL, and various machine learning algorithms. A successful real-world example will be presented to help the audience grasp the concept and its applications, along with sample code and a demo that the audience can apply in their respective areas.

transcript

_____________________________________________

Clickstream Data Warehouse – turning clicks into customers

Albert Hui

About Me

• Associate Director with EPAM Canada

• Over 12 years with Business Intelligence/Data Warehousing

• Over 7 years with Java and web technologies

• BIDW Architect, Big Data Evangelist

• Conference Speaker at IOUG, TOUG Collaborate 2011, 2012 and 2013

• Technical editor on Oracle 12c Book.

4/19/2013 22

• Master in Engineering in the area of Artificial Intelligence – Fuzzy logic

• MBA, University of Toronto

• Toronto based

• Twitter: @dataeconomist

• Father of twin boys

Agenda

Objective of this Session

What is Clickstream data?

How to collect Clickstream data?

Use Cases

Challenges – what are we trying to solve?

Solutions

Live Demo

How to Start?

Concluding Thoughts

Q&A

Some Leaders Who Chose EPAM.

Objective of this session

• Introduction of Clickstream Data

• Solutions and available technologies

• Start thinking how to fully utilize Clickstream Data

• Get started individually and as an organization - a sample demo

Movie – A Beautiful Mind

Sales – how to sell a lobster

www.bishopbigideas.com

Let’s have a quick quiz

Quick Quiz

• In the US, a 45-year-old male with 3 children, an income of around 150-180K and a postgraduate education wants to buy a car. Which brand?

Quick Quiz

• In the US, a 45-year-old male with 3 children, a 180K income and a graduate school education wants to buy a car. And he lives in Texas. Then which brand?

Quick Quiz

• In the US, a 45-year-old male with 3 children and a graduate school education wants to buy a car. He lives in Texas, he is a single parent with <Unknown> income, but he is looking to travel to Florida ONLY. Then which brand?

Quick Quiz

• But would these preferences change (evolving behaviours) over time? How do we catch up?

@MiamiParking lot

What is Clickstream Data?

Clickstream Data

What is Clickstream?

• A Clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application. As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the web server, as well as possibly the web browser, router, proxy server or ad server. Clickstream analysis is useful for web activity analysis, software testing, market research, and for analyzing employee productivity.

Source: Wikipedia

Clickstream Data

• Clickstream is not just weblogs.

• It can be essentially every interaction that you have with any electronic device:

– TV PVRs

– Smart phones

– Game consoles

– Sensors: security systems, highways

– E-payment cards, loyalty cards

– Geolocation

– Maybe more: alarm clocks, printers, parking etc.

Clickstream Data

• Clickstream data is not new.

– Clickstream Data Warehousing, by Mark Sweiger, was published in January 2002.

• There are essentially two types of Clickstream data:

– Individual site’s Clickstream (click path)

– Internet Clickstream data

• Server weblogs account for 75% of daily data generation according to Gartner.

• Facebook alone captures 1.5PB of weblog data daily.

• Amazon captures 200TB of weblog data daily.

Sample of Clickstream Data

• Web logs

204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"

204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
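The sample web log lines follow the common "combined" access log layout (host, identity, user, timestamp, request, status, bytes, referrer, user agent). A minimal Python sketch of how one such line could be split into fields is shown here; the function name `parse_log_line` and the field names are illustrative assumptions, not part of the original deck.

```python
import re
from datetime import datetime

# Combined log layout:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that are naturally typed.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# One of the sample lines from the slide.
sample = ('204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] '
          '"GET /articles.html HTTP/1.0" 200 7363 '
          '"http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"')
record = parse_log_line(sample)
```

Each parsed record then carries the visitor address, the page requested and the referring page, which are the raw ingredients for click-path analysis.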

Clickstream – Click-path Analytics

• A click path is the sequence of links a site visitor follows.
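As a rough illustration of the click-path idea, the Python sketch below groups individual clicks by visitor and orders them by time to recover each visitor's path. The tuple layout (visitor, timestamp, page) and the visitor identifiers are assumptions made for this example only.

```python
from collections import defaultdict

def build_click_paths(clicks):
    """Group (visitor, timestamp, page) clicks into one time-ordered path per visitor."""
    by_visitor = defaultdict(list)
    for visitor, timestamp, page in clicks:
        by_visitor[visitor].append((timestamp, page))
    # Sort each visitor's clicks by timestamp and keep only the pages.
    return {
        visitor: [page for _, page in sorted(events)]
        for visitor, events in by_visitor.items()
    }

# Hypothetical clicks, deliberately out of order.
clicks = [
    ("visitor-1", 3, "/articles.html"),
    ("visitor-1", 1, "/"),
    ("visitor-2", 2, "/"),
    ("visitor-1", 2, "/logo1.gif"),
]
paths = build_click_paths(clicks)
```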

Let’s take another quick quiz

Quiz 2: Which one is the more frustrated customer?

Customer A

Customer B

What if I tell you the customer is a deal finder?

How is Clickstream Data collected?

Clickstream – how to collect

• Web Logs

– There is no need to use JavaScript code for tracking. The data is collected by the web server independently of a visitor’s browser. It captures all the requests made to your web server, including pages, images and PDFs.

Clickstream – how to collect

• Page Tagging

– Google Analytics is implemented with "page tags". A page tag, in this case called the Google Analytics Tracking Code (GATC), is a snippet of JavaScript code that the website owner adds to every page of the website. The GATC code runs in the client browser when the client browses the page (if JavaScript is enabled in the browser), collects visitor data and sends it to a Google data collection server as part of a request for a web beacon.
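With page tagging, the visitor data travels to the collection server as parameters on the beacon request. The Python sketch below shows how a collector might decode such a request; the host `collector.example.com` and the parameter names `page`, `visitor` and `event` are hypothetical and do not describe the actual Google Analytics beacon format.

```python
from urllib.parse import urlsplit, parse_qs

def parse_beacon(url):
    """Decode the query string of a beacon request into a flat dict."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in query.items()}

# Hypothetical beacon request sent by a page tag.
beacon_url = ("https://collector.example.com/beacon.gif"
              "?page=%2Farticles.html&visitor=abc123&event=pageview")
hit = parse_beacon(beacon_url)
```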

What about some Use Cases for Clickstream?

Clickstream – Use Cases

• Internet traffic analytics is another type of Clickstream data, e.g.

– Google Analytics

– Yandex

– Kontagent

Clickstream – Use case – Google Analytics

• Google Analytics measures how your site is performing:

– Competitor Analytics

– Social Mobile Analytics

– Advertising Analytics

Clickstream – Use Case - Yandex

• Yandex is another big one, based in Russia

Clickstream – Use Cases – make money

Advertising on the Internet

1. Banner Ads

2. Paid Search

3. Email Campaign

Clickstream – Use Cases – make money

Personalized Advertising

Minority Report-style shopping? The billboard that profiles you and then flashes up ads tailored to your tastes

Clickstream – Use Cases – medical field

Medical Science – electronic clicks

Clickstream – Use Cases - games

• Kontagent is the user analytics platform for developers, marketers, product managers, and strategic partners across the social and mobile web. Its kSuite platform provides social data pattern visualization and analysis that delivers actionable insights via an on-demand service.

• San Francisco/Toronto based.

• It focuses on the gaming industry and records every click of the gamers.

• It tries to make gaming sites more sticky.

• Raised US$50M+ in the last 3 years.

Quiz #3

Clickstream – Quiz #3

1. What is the main focus of these analytics?

2. What are they missing?

YOU

SIMILARITY BETWEEN all of YOU

Collective Intelligence

Crowd Sourcing

What are we trying to solve?

Clickstream - Challenges

• Yes, you are right! We have too much data.

Yes, we have a lot of data

Clickstream - Challenges

• User demographics data is hard to get, due to localized privacy laws.

• Users’ sense of privacy.

• User preferences change constantly; there are no one-size-fits-all rules.

Clickstream – What are we trying to solve?

Rules inside the data

Clickstream - Challenges

Gender | Age | Marital status | Occupation | No. of Kids | Income | Region | Race | Own a house | Car brand | Like sport | Like politics | Like business | ... | Click path | Buy
M | 25-35 | M | Engineer | 3 | 80-90K | Toronto | Caucasian | Y | BMW | - | N | Y | ... | ABACBCDE... | Y
M | 25-35 | S | Chemist | 1 | 50-60K | NY | Asian | Y | N/A | N | - | N | ... | AABEBFGHIGSJBA.. | Y
F | 35-45 | D | Chemist | 0 | 50-60K | Toronto | Caucasian | N | TOYOTA | N | N | - | ... | ABAEBFGHIGFSBA.... | N
F | 50-60K | M | Doctor | 6 | - | Minsk | Caucasian | Y | BMW | N | Y | Y | ... | ABAEBFGHIGFSBA..... | Y
F | 35-45 | D | Researcher | 0 | 50-60K | Toronto | Caucasian | Y | N/A | N | Y | N | ... | ABAEBFGHIGFSBA.. | N
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

Clickstream – What are we trying to solve?

Prediction

Solutions here.

Clickstream – Solutions – Clickstream Data Warehouse

Problems:

– Too much data

– Rules inside the data

– Prediction

Solutions:

– Architecture and Schema

– Data Vectorization

Clickstream – Solutions – handling too much data

Apache Hadoop

• Top level Apache project

• Open source

• Software framework, written in Java

• Inspired by Google’s white papers on:

– Map/Reduce (MR)

– Google File System (GFS)

– Big Table

• Originally developed to support Apache Nutch

• Designed for

– Large scale data processing

– Batch processing

– Sophisticated analysis

– Dealing with structured and unstructured data
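To make the Map/Reduce idea concrete, here is a single-machine Python sketch of the three phases (map, shuffle, reduce) applied to counting clicks per page. It only imitates the programming model; it is not Hadoop code, and the record layout with a `page` key is an assumption for illustration.

```python
from collections import defaultdict

def map_phase(log_records):
    """Map: emit a (page, 1) pair for every click record."""
    for record in log_records:
        yield record["page"], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical click records.
records = [{"page": "/"}, {"page": "/articles.html"}, {"page": "/"}]
counts = reduce_phase(shuffle(map_phase(records)))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle, which is what lets the same logic scale to very large log volumes.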

Clickstream – Solutions – Data Vectorization

Clustering: Understanding data as vectors

[Figure: the point (5, 3) plotted on X and Y axes; X = 5, Y = 3]

• The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])

Mahout Vector Implementation

1. DenseVector

2. RandomAccessSparseVector

3. SequentialAccessSparseVector

The sparse implementations store only non-zero values in memory.

Vectors must implement the Java interfaces java.io.Serializable and org.apache.mahout.math.VectorWritable.
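The difference between the array form and the hash-map form of a vector can be sketched in a few lines of Python. This is an analogy to the dense and sparse representations, not Mahout's API; the helper names `to_sparse` and `to_dense` are made up for the example.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by their index."""
    return {index: value for index, value in enumerate(dense) if value != 0}

def to_dense(sparse, size):
    """Expand an index-to-value mapping back into a full list."""
    return [sparse.get(index, 0) for index in range(size)]

# The point (5, 3): both forms hold the same information.
point = [5, 3]
sparse_point = to_sparse(point)

# A mostly-zero vector is where the sparse form saves memory.
mostly_zero = [0, 0, 7, 0, 0, 0, 2, 0]
sparse_clicks = to_sparse(mostly_zero)
```

Clickstream vectors tend to be mostly zeros (a visitor touches only a few of the possible pages or answers), which is why the sparse forms matter in practice.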

Clickstream – Solutions – Data as n-dimensional vectors

Clustering: Understanding data as vectors

• Imagine one dimension for each feature for user, product, geography, time etc.

• Each dimension is also called a feature or label

• Support Vector Machine (SVM)

[Figure: axes labelled age, income, occupation]
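One possible way to turn a user record into such an n-dimensional vector is sketched below in Python: numeric features are used directly and a categorical feature such as occupation becomes one dimension per category (one-hot encoding). The occupation list, the function name `encode_user` and the sample values are assumptions for illustration only.

```python
# Hypothetical category list; each occupation becomes its own dimension.
OCCUPATIONS = ["Engineer", "Chemist", "Doctor", "Researcher"]

def encode_user(age, income_k, occupation):
    """Encode one user as [age, income, one-hot occupation...]."""
    one_hot = [1.0 if occupation == name else 0.0 for name in OCCUPATIONS]
    return [float(age), float(income_k)] + one_hot

# An illustrative user: age 30, income 85K, engineer.
vector = encode_user(age=30, income_k=85, occupation="Engineer")
```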

Clickstream – Solutions – Predictive Algorithms

Four major steps

1. Collect and model the data

2. Select/build a model

3. Train/test the model

4. Then predict what will happen

Clickstream – Solutions – Predictive Algorithms

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License

• http://mahout.apache.org

• Why Mahout?

– Many open source ML libraries either:

• Lack community

• Lack documentation and examples

• Lack scalability

• Lack the Apache License

• Or are research-oriented

“Mahout” is a Hindi word that stands for elephant driver

Clickstream – Solutions – Algorithms

Algorithms and Applications

[Diagram labels: Recommenders; Clustering; Classification; Freq. Pattern Mining; Math (Vectors/Matrices/SVD); Utilities (Lucene/Solr); Statistics; Probability; Apache Hadoop]

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Clickstream – Solutions – Algorithms – Mahout

Command line launcher

bin/mahout list (This shows the list of algorithms)

Valid program names are:

1. canopy: : Canopy clustering

2. cleansvd: : Cleanup and verification of SVD output

3. clusterdump: : Dump cluster output to text

4. dirichlet: : Dirichlet Clustering

5. fkmeans: : Fuzzy K-means clustering

6. fpg: : Frequent Pattern Growth

7. itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering

8. kmeans: : K-means clustering

9. lda: : Latent Dirchlet Allocation

10. ldatopics: : LDA Print Topics

11. lucene.vector: : Generate Vectors from a Lucene index

12. matrixmult: : Take the product of two matrices

13. meanshift: : Mean Shift clustering

14. recommenditembased: : Compute recommendations using item-based collaborative filtering

…..

Clickstream – Solutions – Algorithms – build a model

• Learn a model from a manually classified (labeled) dataset

• Predict the class of an unseen object based on its features

• E.g. use features of the user profile, product and click path to predict users’ preferences.

Clickstream – Solutions – Clickstream Data Warehouse

Traditional Clickstream Data Warehouse Schema

Common Dimensions:

1. Customer

2. Product

3. Time

4. Geography

5. Page

6. Content (meta-data)

7. User

Facts:

1. Sales

2. User Activities

Design:

Schema design depends on the data we have and the measures we have
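As a small sketch of how a few of these dimensions and a user-activity fact could fit together, the Python example below builds a tiny star schema in an in-memory SQLite database. All table names, column names and rows are hypothetical; a real schema depends on the data and measures available, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (a subset of the common dimensions).
CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE dim_page (page_key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, click_date TEXT);

-- Fact table: one row per user, page and day, with a click count measure.
CREATE TABLE fact_user_activity (
    user_key INTEGER REFERENCES dim_user(user_key),
    page_key INTEGER REFERENCES dim_page(page_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    clicks   INTEGER
);

-- Hypothetical sample rows.
INSERT INTO dim_user VALUES (1, 'visitor-1');
INSERT INTO dim_page VALUES (1, '/'), (2, '/articles.html');
INSERT INTO dim_time VALUES (1, '2001-02-26');
INSERT INTO fact_user_activity VALUES (1, 1, 1, 2), (1, 2, 1, 1);
""")

# Typical warehouse query: total clicks per page.
rows = conn.execute("""
    SELECT p.url, SUM(f.clicks)
    FROM fact_user_activity AS f
    JOIN dim_page AS p ON p.page_key = f.page_key
    GROUP BY p.url
    ORDER BY p.url
""").fetchall()
```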

Clickstream – Solutions – Clickstream Data Warehouse

[Figure] Source: Clickstream Data Warehousing by Mark Sweiger

[Figure] Source: Clickstream Data Warehousing by Mark Sweiger

[Figure] Source: Clickstream Data warehouse by Albert H

Clickstream – Solutions – technology stack

[Diagram: technology stack. Labels: Clickstream logs; Apache Hadoop; Apache Hive, HBase; ZooKeeper; Data Movement: ETL (INFA, Talend); Data warehouse: RDBMS (Oracle, MySQL); Algorithms: statistical, Mahout; Model; Model Application; Hosting Models; Reporting: BI tool, Reports, Web App]

What about a Case study - demo?

Clickstream – Case Demo

• An Asia-based hotspot Wi-Fi provider, with wireless routers throughout China/Hong Kong.

• Revenue model: advertising

– Advertisers place ads when users browse the Net.

• Data

– Survey data: users are required to fill in a survey before logging in.

– Click logs, including ad click-through

• Data size:

– 12GB+ compressed a day.

– 150M+ clicks and 2.4M click-throughs a day.

• Problem definition: click-through rate is too low
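The click-through rate implied by these volumes can be worked out directly. The short Python calculation below treats "150M+" and "2.4M" as round figures, so the result is only an approximation.

```python
# Daily volumes quoted on the slide, taken as round numbers.
clicks_per_day = 150_000_000
click_throughs_per_day = 2_400_000

# Click-through rate = click-throughs / total clicks.
ctr = click_throughs_per_day / clicks_per_day
ctr_percent = round(ctr * 100, 2)
```

That works out to roughly 1.6 click-throughs per 100 clicks, which is the rate the case study sets out to improve.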

Clickstream – Case Demo

Hadoop – running Cloudera CDH4

Clickstream – Case Demo

• Meet the Clickstream logs

[Screenshot: log fields are MAC Addr, AD Site Clicked, Router Location, and when the click is recorded]

Clickstream – Case Demo

• Meet the survey questions

Some sample survey questions

Clickstream – Case Demo

• Meet the answers and survey results

Options for survey answers

Clickstream – Case Demo

• Vectorize the data for users who click weibo.com

[Screenshot columns: MAC Addr, Data Vectors]

Clickstream – Case Demo

Training data set

Resultant vector

“Cosine value distance”
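The slide does not spell out how the resultant vector and the cosine value are computed. One plausible reading, sketched below in Python, is to sum the training vectors of users who clicked into a single resultant vector and then score any other user by the cosine of the angle between that user's vector and the resultant. Both the summation and the toy vectors are assumptions, not the presenter's confirmed method.

```python
import math

def resultant(vectors):
    """Sum the training vectors component by component."""
    return [sum(column) for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vectors for three users who clicked the ad.
training = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
profile = resultant(training)

# Score two test users against the resultant vector.
score_similar = cosine_similarity([1, 0, 1], profile)
score_empty = cosine_similarity([0, 0, 0], profile)
```

A score close to 1 means the test user points in nearly the same direction as the users who clicked; a user with no survey answers at all scores 0 under this convention.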

Clickstream – Case Demo

Test data set

Area under the curve

A confusion matrix is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.

Clickstream – Case Demo

Test Results (> 0.5 is good)

macaddr | q16 | q17 | q18 | q19 | q20 | q21 | q22 | q23 | q24 | q25 | AUC Value | Actually clicked
00:22:5f:34:54:3e | 116 | 166 | 135 | 146 | 157 | 169 | 172 | 177 | 183 | 193 | 0.76 | Y
00:1f:5b:b3:26:6d | 117 | 125 | 136 | 144 | 162 | 0 | 0 | 0 | 0 | 197 | 0.65 | N
00:1a:73:e8:56:c6 | 117 | 122 | 137 | 152 | 159 | 169 | 172 | 177 | 190 | 195 | 0.65 | N
00:18:de:1f:fe:c0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 193 | 0.61 | Y
00:1e:65:51:34:80 | 0 | 0 | 137 | 141 | 157 | 0 | 0 | 0 | 0 | 210 | 0.59 | N
00:17:c4:a9:16:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.53 | N
00:1f:3b:06:87:3d | 118 | 131 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 201 | 0.41 | Y
00:21:19:a4:8d:ea | 0 | 0 | 134 | 151 | 157 | 170 | 172 | 177 | 184 | 211 | 0.32 | N
00:1e:65:7d:2d:d2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.29 | N
00:16:44:c7:80:35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.24 | Y
00:16:44:d4:11:9a | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.22 | Y
00:13:02:a4:33:9c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 | Y
00:21:19:9a:64:ad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.18 | N
00:1f:df:75:0a:8e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | N
00:25:d3:50:37:92 | 118 | 127 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.13 | Y
00:21:00:d6:98:2c | 118 | 123 | 0 | 0 | 0 | 169 | 172 | 176 | 187 | 192 | 0.11 | Y
00:17:c4:9b:2c:e2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | N
00:0d:f0:6d:fc:47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.11 | Y
00:1e:65:3f:e1:6c | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 177 | 188 | 0 | 0.1 | N
00:21:00:e3:a5:f1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.08 | N

2 out of 6 are predicted right
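The test results can be summarized as a confusion matrix. The Python sketch below copies the score and "actually clicked" columns from the test results table and treats the "AUC Value" column as a per-user score, predicting a click whenever the score is above 0.5; that reading of the column and the threshold rule are assumptions based on the "> 0.5 is good" note.

```python
# (score, actually clicked) pairs copied from the test results table.
results = [
    (0.76, "Y"), (0.65, "N"), (0.65, "N"), (0.61, "Y"), (0.59, "N"),
    (0.53, "N"), (0.41, "Y"), (0.32, "N"), (0.29, "N"), (0.24, "Y"),
    (0.22, "Y"), (0.20, "Y"), (0.18, "N"), (0.16, "N"), (0.13, "Y"),
    (0.11, "Y"), (0.11, "N"), (0.11, "Y"), (0.10, "N"), (0.08, "N"),
]

def confusion_matrix(rows, threshold=0.5):
    """Count true/false positives and negatives at the given score threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, actual in rows:
        predicted_click = score > threshold
        clicked = actual == "Y"
        if predicted_click and clicked:
            counts["TP"] += 1
        elif predicted_click and not clicked:
            counts["FP"] += 1
        elif not predicted_click and clicked:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

matrix = confusion_matrix(results)
```

Under these assumptions the six users scoring above 0.5 include two who actually clicked, which agrees with the "2 out of 6" note on the slide.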

Clickstream – Case Demo

• Meet the ETL process with Talend BD V 5.2

Clickstream – Case Demo

• Meet some sample reports

Objective of this session

• Introduction of Clickstream Data

• Solutions and available technologies

• Start thinking how to fully utilize Clickstream Data

• Get started individually and as an organization - a sample demo

Thank you!

Albert Hui, MBA, MASc., P.Eng, CSM

EPAM Canada, Associate Director

Email: albert_hui@epam.com

Follow me on Twitter: @dataeconomist

Please help by filling in an evaluation form

www.ioug.org/eval

Session # 353