Predictive Models and Analysis for Webpage Depth-level Dwell Time
Chong Wang, Information Systems, New Jersey Institute of Technology, Newark, NJ 07003, USA
Shuai Zhao, Martin Tuchman School of Management, New Jersey Institute of Technology, Newark, NJ 07003, USA
Achir Kalra, Forbes Media, 499 Washington Blvd, Jersey City, NJ 07310, USA
Cristian Borcea, Computer Science, New Jersey Institute of Technology, Newark, NJ 07003, USA
Yi Chen, Martin Tuchman School of Management, New Jersey Institute of Technology, Newark, NJ 07003, USA
About half of online display ads are not rendered viewable
because users do not scroll deep enough or do not spend
sufficient time at the page depth where the ads are placed.
In order to increase marketing efficiency and ad effectiveness,
there is a strong demand for viewability prediction from both
advertisers and publishers. This paper aims to predict the
dwell time for a given ⟨user, page, depth⟩ triplet based on
historic data collected by publishers. This problem is difficult
because of user behavior variability and data sparsity. To
solve it, we propose predictive models based on Factorization
Machines and Field-aware Factorization Machines in order
to overcome the data sparsity issue and provide flexibility to
add auxiliary information such as the visible area of a user’s
browser. In addition, we leverage the prior dwell time
behavior of the user within the current page view, that is,
time series information, to further improve the proposed
models. Experimental results using data from a large web
publisher demonstrate that the proposed models outperform
comparison models. Also, the results show that adding time
series information further improves the performance.
Introduction
Online display advertising provides many benefits that
traditional marketing channels do not, such as fast brand
building and effective targeting. In online display advertis-
ing, an advertiser pays a publisher for space on webpages to
display ads while a user is viewing the webpage. There are
two main existing advertising pricing models, pay-by-action
and pay-by-impression. In pay-by-action, advertisers are
charged only when the ads are clicked (i.e., converted).
Such actions may directly bring profits to advertisers. But
the rates of conversion are very low, in which case adver-
tisers often receive little feedback from users. Also, some
advertisers, for example, car vendors, do not expect users to
click and make purchases through their ads. They just want
to increase brand awareness and make more users aware of
their logos or products. In pay-by-impression, an impression
is counted once the ad is sent to a user’s browser, that is,
served. Thus, user actions are not required in this model.
However, recent studies (Google, 2014; Holf, 2014)
show that about half of the served ads are not viewed by
users. There are two reasons for this problem: users do not
scroll to the depths where the ads are displayed on screen,
and users do not spend sufficient time at the page depth
where the ads are placed. In such cases, advertisers still have
to pay for the served ads, and they lose money without any
return on investment. To alleviate this issue, a new model is
emerging: pricing ads by the number of impressions viewed
by users for a certain time, instead of just being served.
Dwell time is used to measure how long an ad is shown in a
user’s screen. The Interactive Advertising Bureau (IAB
2013) defines a viewable impression as one that is at least
50% shown on the user’s screen for at least one continuous
second. This new model is attractive for advertisers because
they can specify dwell time requirements for their ad
campaigns to prevent investment waste and enhance
advertising effectiveness.

Received September 6, 2016; revised October 7, 2017; accepted February 4, 2018
© 2018 ASIS&T. Published online May 20, 2018 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.24025
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 69(8):1007–1022, 2018
Publishers, however, demand predictions of ad dwell
time, in particular, the dwell time of each depth where ads
are placed. Depth-level dwell time prediction for a (user,
webpage) pair helps publishers to fulfill the contracts with
advertisers and satisfy advertisers’ dwell time requirements.
In addition, it can help publishers to maximize their revenue
by placing the most profitable ads to the most viewable pla-
ces. Further, publishers can dynamically generate different
webpage layouts to different users based on viewability pre-
diction so that both user experience and overall ad viewabil-
ity are optimized. Therefore, ad viewability prediction is
essential to fulfill the marketing requirements and thus maxi-
mize the return on investment for advertisers. Moreover, it
can also boost the advertising revenue for publishers.
Despite its importance, page depth-level dwell time
prediction is still an open problem. Existing work (Liu, White,
& Dumais, 2010; Kim, Hassan, White, & Zitouni, 2014; Yi,
Hong, Zhong, Liu, & Rajan, 2014) focuses on predicting the
time a user will spend on the whole page, rather than at a
particular page depth, as discussed in Related Work. However,
working at a finer granularity, depth-level dwell time predic-
tion is more challenging. The problem is non-trivial because
of the variability of user behavior and data sparsity, that is,
most users read only a few webpages, while a webpage is
visited by a small subset of users. It is also difficult to
explicitly model user interests as well as the characteristics
of entire pages and depths.
In this study, we investigate how to predict webpage
depth-level dwell time. We develop predictive models based
on Factorization Machines (FM) and Field-aware Factoriza-
tion Machines (FFM) because they are able to capture the
interaction between input features, overcome the data spar-
sity issue, and provide flexibility to add auxiliary informa-
tion. The proposed predictive models can be applied to
predict the dwell time of any items on a page. Our models
consider the basic factors (i.e., user, page, and page depth)
and other auxiliary information. We also propose a smooth-
ing technique to further improve the performance. In addi-
tion, we leverage the prior dwell time behavior of the user
within the current page view, that is, time series information,
and integrate them into the FM-based models. We evaluated
our models using real data from a large web publisher. The
experimental results demonstrate that our models outper-
form comparison models. The FM model with viewport,
channel, and Doc2Vec vector obtains the best performance.
In addition, the performance is further improved by adding
the time series information of prior dwell time behavior. We
also present the analysis of feature combinations by extract-
ing the latent feature vectors from the proposed models,
which provides insights on publishers’ business strategies
for ad placement.
Our contributions are summarized as follows: (a) To the
best of our knowledge, this is the first work that studies page
depth-level dwell time prediction. We define a new problem
in viewability prediction in order to build new ad pricing
standards. (b) We are the first to propose predictive models
based on FM and FFM to solve this new problem, and we
show that our proposed models outperform other compari-
son methods. (c) We demonstrate that adding time series
information can further improve the model performance. (d)
We also present the analysis of feature combinations and
provide insights on publishers’ ad business strategies.
The rest of the paper is organized as follows. Related
Work discusses the related work. Page Depth Dwell Time
Prediction describes the proposed models for webpage
depth-level dwell time prediction. Experimental results and
feature analysis are presented in Experimental Evaluation
and Feature Analysis. An FM model with time-series infor-
mation is presented in Time Series Model for Page Depth-
level Dwell Time Prediction. The paper concludes in
Conclusion.
Related Work
The problem of dwell time prediction is challenging
because of the variability of user behaviors and the data
sparsity. Liu, White, and Dumais (2010) investigate the fea-
sibility of predicting from features the Weibull distribution
of page-level dwell time. They use Multiple Additive
Regression Trees (MART). The features include the fre-
quencies of HTML tags, webpage keywords, page size, the
number of secondary URLs, and so on. They find that page-
level dwell time is highly related to webpage length and
topics. Yi, Hong, Zhong, Liu, and Rajan (2014) view the
average dwell time of a webpage as one of the item’s inherent
characteristics, which provides important average user
engagement information on how much time the user will
spend on this item. The authors present a machine learning
method to predict dwell time of article stories using simple
features. The features they consider are content length, topi-
cal category of the article (e.g., finance), and the context in
which the article would be shown (e.g., devices). The
authors use Support Vector Regression models to predict
page-level dwell time. Kim, Hassan, White, and Zitouni
(2014) present a regression method to estimate the parameters
of the Gamma distributions of click dwell time (i.e., the time
that the user spends on a clicked result). The features they
adopt are similar to those used in Liu et al. In contrast, this
proposed research predicts dwell time at a specific depth in
a page, which is still an open question. Working at a finer
granularity, depth-level dwell time prediction is more chal-
lenging than page-level dwell time prediction. Yin, Luo,
Lee, and Wang (2013) find that the dwell time satisfies a
log-Gaussian distribution. They claim that viewing an item is
such a casual behavior that people may terminate the viewing
process at any time. The dwell time varies a lot because of
the factors from both items and persons: (a) items differ in
their form and volume; (b) there are many
subjective human factors that affect dwell time. For example,
different people receive information at different speeds; the
time to consume the same item differs from person to person.
The authors develop a model which estimates how
much a user likes the viewed item according to the item-
level dwell time.
From a methodological point of view, there is similarity
between page-level and depth-level dwell time prediction.
However, compared with page-level dwell time prediction,
depth-level dwell time prediction is more challenging
because of user behavior variability. Existing work on
page-level dwell time prediction does not consider individual
users. Furthermore, these methods do not achieve good
performance when modified to include user information (Liu,
White, & Dumais, 2010; Yi, Hong, Zhong, Liu, & Rajan,
2014; Kim, Hassan, White, & Zitouni, 2014). However,
users’ heterogeneity should be considered because their
behaviors vary a lot. Different users have different reading
habits and interests, which largely affect their reading
behavior. However, zooming into individual-level behavior
inevitably leads to the problem of data sparsity (i.e., the
interaction between users and pages is highly sparse). Also,
detailed user profile information is not accessible to publishers,
in which case we have to rely just on user IDs. Therefore,
in this paper, we propose to adopt the FM and FFM
models, which consider individual users and pages, and
overcome data sparsity. Our experiments demonstrate that
FM and FFM outperform an existing method (Yi, Hong,
Zhong, Liu, & Rajan, 2014), which was modified by adding
individual user information and depth information.
The only existing work that narrows down to dwell time
of a part of a page is (Lagun & Lalmas, 2016), in which
Lagun et al. define and measure viewport time in order to
infer user interest. However, our work focuses on the prediction
of the dwell time at each individual page depth. We
preliminarily studied depth-level dwell time prediction in
our previous work (C. Wang, Kalra, Borcea, & Chen, 2016).
This paper, however, provides a substantial extension, with
new prediction models and additional features.
Another area of study, which is less related to our work,
is the prediction of the likelihood that a user scrolls to a
given depth in a page. Wang et al. (C. Wang, Kalra,
Borcea, & Chen, 2015; C. Wang, Kalra, Zhou, Borcea, &
Chen, 2017) propose probabilistic latent class models that
predict the probability that a user scrolls to a given page
depth where an ad may be placed. In contrast, this paper
presents models to predict how long a user may stay at a
given page depth. Compared to prediction of scroll depths,
webpage depth-level dwell time prediction can better satisfy
publishers’ need for detailed estimations of ad viewability.
Page Depth Dwell Time Prediction
We define the problem of depth-level dwell time predic-
tion as below.
Problem Definition 1. Given a page view, that is, a user u
and a webpage a, the goal is to predict the dwell time of a
given page depth X, that is, the time duration that X is shown
on the screen. The dwell time of X is denoted as T_uaX.
Data Set
A large web publisher (i.e., Forbes Media) provides user
browsing logs collected from real website visits in 1 week of
Dec 2015 and webpage metadata. The data set contains 2
million page views. For each page view, it records the user
id, page url, state-level user geo location, user agent, and
browsing events, for example the user opened/left/read the
page. Each event stores the event time stamp and the page
depths where the top and bottom of the user screen are.
Once a user scrolls to a page depth and stays for one second,
an event is recorded. The page depth is represented as the
percentage of the page. The reason that we adopted page
percentage rather than pixels is because it provides a relative
measure independent of device screen size. If a user reads
50% of a page on a mobile device, whereas another user
reads 50% of the same page on a desktop, it can be assumed
that they read the same content.
Table 1 is a simplified example of the user log. Each
event has a time stamp so that the time that a user spent on a
part of the page can be calculated. To infer the current part of a
page that a user is looking at, the user log also records the
page depths at which the first and the last rows of pixels of
the screen are. Thus, we are able to infer when a user
scrolled to which part of a page and how long the user
stayed.
In Table 1, the user scrolled to 30–60% of the page after
reading 20–50% of the page for 1 minute. Thus, the dwell
time of the page depths that have been scrolled past can be
determined. For example, the dwell time of 20% - 30% is 1
minute at this moment.
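The dwell-time computation described above can be sketched as follows. This is an illustrative sketch, not the publisher's actual pipeline; the event keys (ts, first_row, last_row) are hypothetical simplifications of the log format in Table 1.

```python
from collections import defaultdict

def depth_dwell_times(events):
    """Accumulate dwell time per page depth (in % of the page) from a
    chronologically ordered list of "read page" events. Each event
    records a timestamp in seconds and the page depths of the first
    and last rows of pixels visible on the screen."""
    dwell = defaultdict(float)
    for cur, nxt in zip(events, events[1:]):
        elapsed = nxt["ts"] - cur["ts"]  # time until the next scroll event
        # every depth visible in the current viewport accumulates this time
        for depth in range(cur["first_row"], cur["last_row"] + 1):
            dwell[depth] += elapsed
    return dict(dwell)

# Example mirroring Table 1: the viewport covers 20-50% for one minute,
# then the user scrolls to 30-60%
events = [
    {"ts": 0,  "first_row": 20, "last_row": 50},
    {"ts": 60, "first_row": 30, "last_row": 60},
]
dwell = depth_dwell_times(events)  # dwell[20] == 60.0 seconds
```

As in the text, depths the user has not yet scrolled past (here, 51–60%) have no dwell time until a later event closes their interval.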
Factorization Machines (FM)
It is intuitive that the dwell time of a page depth is highly
related to the user’s interests and reading habits, the topic of
the article in the page, the design at that page depth, etc.
More importantly, the interactions of these three factors
must be modeled so that their joint effect is captured: (a)
The interaction of users and pages captures a user’s interest
in a page. (b) The interaction of users and page depths can
reflect individual users’ browsing habits. For example, some
users read entire pages carefully, but some only read the
upper half. (c) The interaction of pages and depths models
TABLE 1. A simplified example of the user log.

User  URL   Time                 ...  Event       User Behavior
001   /abc  2/1/2015 10:00:00    ...  Read Page   {“first row”: 20, “last row”: 50, ...}
001   /abc  2/1/2015 10:01:00    ...  Read Page   {“first row”: 30, “last row”: 60, ...}
the design of individual pages at individual page depths. For
example, pages that have a picture at a depth may receive
relatively short dwell time at that depth because people usu-
ally can understand a picture more quickly than text. How-
ever, it is non-trivial to explicitly model user interests, page
characteristics, the attractiveness of page depths, and their
interactions. Also, although implicit feedback, for example,
reading dwell time, is more abundant than explicit feedback,
for example, ratings, it often has higher variability (Yin
et al., 2013), which makes prediction more challenging.
Therefore, we adopt Factorization Machines (FM) (Rendle,
2012), a generic approach that combines the high prediction
accuracy of factorization models with the flexibility of
feature engineering. The FM model has been used in
applications such as context-aware rating prediction (Rendle, 2012),
retweeting (Hong, Doumith, & Davison, 2013), and microblog
ranking (Qiang, Liang, & Yang, 2013). The reason that we
adopt the FM model is that it can capture the interaction of
multiple inter-related factors, overcome the data sparsity, and
provide the flexibility to add auxiliary information.
According to the problem definition, the basic FM model
requires three factors: user, page, and page depth. The input
is derived from the user-page-depth matrix built from the
user logs: In the basic form of depth-level dwell time predic-
tion, we have a three-dimensional cube containing nu users,
na pages, and nd page depths. Thus, each dwell time is
associated with a unique ⟨user, page, depth⟩ triplet. Such a 3D
matrix can be converted into a list of rows with (nu + na + nd)
columns. The target variable for each row corresponds to an
observed dwell time represented by the triplet. N training
page views lead to N × 100 rows, as each page view contains
100 observed dwell time values (one for each percent from 1 to
100% page depth). This input is similar to what is prepared
for regressions. However, regressions would not work well
because the data are very sparse and they are unable to cap-
ture the interaction between the input variables.
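The conversion of the user-page-depth cube into a sparse design matrix can be illustrated with a toy sketch; the sizes and indices below are made up for illustration.

```python
import numpy as np

def encode_row(u, a, d, n_u, n_a, n_d):
    """One-hot encode a <user, page, depth> triplet into one row of the
    FM input: a vector with n_u + n_a + n_d binary indicator columns."""
    x = np.zeros(n_u + n_a + n_d)
    x[u] = 1.0               # user block
    x[n_u + a] = 1.0         # page block
    x[n_u + n_a + d] = 1.0   # depth block
    return x

# toy example: user 2 of 5, page 0 of 3, depth 19 of 100
row = encode_row(2, 0, 19, n_u=5, n_a=3, n_d=100)
```

Each row is extremely sparse (exactly three nonzero entries), which is precisely the regime factorization models are designed to handle.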
The basic idea of FM is to model each target variable as
a linear combination of interactions between input variables.
Formally, it is defined as follows:

\hat{y}_{FM}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j  (1)

where \hat{y}_{FM}(x) is the prediction outcome given an input x. w_0 is
a global bias, that is, the overall average depth-level dwell
time. \sum_{i=1}^{n} w_i x_i is the bias of individual input variables.
For example, some users tend to read more carefully
than others; some pages can attract users to spend more time
on them; some page depths, for example, very bottom of a
page, usually receive little dwell time. The first two terms
are the same as in linear regression. The third term captures
the sparse interaction between each pair of input variables.
The FM model uses a factorized parametrization to cap-
ture the feature interaction effects (Equation [2]). That infor-
mation is difficult to learn for linear regressions because
standard regression models learn the weight of each
interaction using only a single real number w_ij. The latent feature
vectors, on the other hand, allow the FM models to estimate
reliable parameters even in sparse data. This is because the
latent features are learned from all feature pairs instead of
only one pair, which provides a better representation of the
influence on the output variable and the combined effects
with other features.
\langle v_i, v_j \rangle = \sum_{k=1}^{K} v_{ik} v_{jk}  (2)
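For concreteness, the FM prediction of Equations (1) and (2) can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the O(nK) reformulation of the pairwise term follows Rendle (2012).

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction (Equations (1)-(2)).
    x: input row (n,), w0: global bias, w: per-feature biases (n,),
    V: latent vectors (n, K). The pairwise term uses the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_k [(sum_i v_ik x_i)^2 - sum_i v_ik^2 x_i^2],
    which avoids the explicit double loop over feature pairs."""
    s = V.T @ x                  # (K,): per-factor weighted sums
    s2 = (V ** 2).T @ (x ** 2)   # (K,): diagonal correction
    return float(w0 + w @ x + 0.5 * np.sum(s * s - s2))
```

Because the one-hot input rows have only a handful of nonzero entries, this evaluation is cheap even when n (users + pages + depths) is large.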
Field-aware Factorization Machines (FFM)
Very recently, Juan et al. (Juan, Zhuang, Chin, & Lin,
2016) proposed a variant of FM, field-aware factorization
machines (FFM), which has shown its superiority over exist-
ing models in machine learning competitions.
FM builds one vector for each individual feature. The
latent vector of a feature is used to compute the interaction
with any other feature in the input data. If n is the number of
features and k the dimensionality of latent vectors, then the
number of parameters to learn is n × k.
FFM assumes that features can be grouped into different
fields. For instance, the feature “lifestyle” belongs to the field
“channel,” and the feature “320 × 480” belongs to the field
“viewport.” The intuition in FFM is that a feature should use
different representations (i.e., latent vectors) when interacting
with features which belong to different fields because
they may emphasize different aspects of the feature. For
example, “depth_20%” should use different latent vectors,
v_{depth_20%, channel} and v_{depth_20%, viewport}, to calculate the
interactions of (depth_20%, lifestyle) and (depth_20%, 320 × 480)
because “lifestyle” and “320 × 480” belong to two different
fields. FM builds a single vector per feature, which is
used to compute the interaction effects with all other features.
In contrast, FFM builds multiple vectors per feature, each of
which is used to compute the interaction with features from
the corresponding field. Thus, FFM has a better capability to
fit the input data. It is also a more complex model than FM
in terms of the number of parameters: if the number of fields
in the input data is f, each feature has f vectors. In this
case, the number of parameters to learn is n × f × k.
Thus, FFM is formally defined as below:

\hat{y}_{FFM}(x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1, f_2} \cdot w_{j_2, f_1}) x_{j_1} x_{j_2}  (3)

where f_1 and f_2 are, respectively, the fields of features j_1 and
j_2, and n is the number of features.
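A naive sketch of the field-aware interaction term in Equation (3) follows; the field layout and sizes are toy assumptions, not the paper's configuration.

```python
import numpy as np

def ffm_predict(x, fields, W):
    """FFM interaction term (Equation (3)), computed naively.
    x: input row (n,); fields[j]: field index of feature j;
    W: latent vectors of shape (n, f, K) -- n * f * K parameters,
    one K-dim vector per (feature, field) pair."""
    n = len(x)
    y = 0.0
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            if x[j1] == 0.0 or x[j2] == 0.0:
                continue  # one-hot inputs make most pairs vanish
            f1, f2 = fields[j1], fields[j2]
            # feature j1 uses its vector for j2's field, and vice versa
            y += float(W[j1, f2] @ W[j2, f1]) * x[j1] * x[j2]
    return y
```

The skip over zero entries reflects why FFM stays tractable on the sparse one-hot rows described earlier.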
In this study, we apply both FM and FFM to the applica-
tion of page depth-level dwell time prediction.
Feature Engineering
The basic FM model works with only three factors: user,
page, and depth. However, context information can also
1010 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi
help improve the prediction performance. Thus, we identify
three context features: viewport (i.e., the part of a user’s
browser visible on the screen), local hour, and local day of
the week (denoted by weekday in the experiments), which
are likely related to user reading behavior. The viewport
indicates the device used by the user (e.g., a mobile device
usually has a much smaller visible browser area than a
desktop) and can directly determine the user experience.
Specifically, one viewport value consists of the height and
the width of a browser, for example, 1855 × 1107. To
reduce sparsity, both heights and widths are put into buckets
of 100 pixels. For instance, 1855 × 1107 is discretized
into 18 × 11.
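The bucketing step described above amounts to integer division by the bucket size; a minimal sketch:

```python
def bucket_viewport(height_px, width_px, bucket=100):
    """Discretize a viewport (height x width, in pixels) into 100-pixel
    buckets to reduce feature sparsity, e.g. 1855 x 1107 -> "18x11"."""
    return f"{height_px // bucket}x{width_px // bucket}"
```

The resulting string is then treated as a single categorical feature value in the "viewport" field.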
For user demographics, we consider user geo locations
because this is the only explicit feature about users that can
be easily obtained by publishers without violation of user
privacy. User geo, inferred from IPs, may reflect a user’s
interests and education, and it may determine the user’s net-
work condition. Specifically, geo is the country name if the
user is outside the United States or a state name if she is
within the United States.
For page attributes, we consider article length, channel,
and freshness. Article length is represented by the word
count of the article in the page, and it has been proven to be
a significant factor impacting page-level dwell time (Yi,
Hong, Zhong, Liu, & Rajan, 2014). However, its influence
on page-depth-level dwell time is still unclear. The channel
of the article in a page is its topical category on the publisher’s
website, for example, finance and lifestyle. Freshness is
the time span between when the page is read and when it was
first published on the website; it is measured in days.
The freshness of an article may affect a user’s interest
in it. Fresh news may receive more user engagement.
The viewport content is also modeled by several state-of-
the-art models because it is believed that the content shown
in a user’s browser affects the time that the user spends on
it. The user log records the position of each viewport and the
article metadata include the content of each article. So it is
possible to obtain the textual content shown in the user’s
browser.
Several of the most popular existing models are used to
model the semantics of each viewport content: TF-IDF,
LDA, and Doc2Vec. TF-IDF (Wu, Luk, Wong, & Kwok,
2008), short for term frequency-inverse document frequency,
is a commonly used method to weight words based on
their importance to a textual document in a collection. The
TF-IDF value increases proportionally to the number of
times a word appears in the document, but is offset by
the frequency of the word in the corpus, which helps to
adjust for the fact that some words appear more frequently
in general. LDA (C. Wang & Blei, 2011), short for Latent
Dirichlet Allocation, is an unsupervised process for inferring
the topics in a textual document. It outputs a well-defined
topic probability distribution for arbitrary documents. Because
the webpage articles in our corpus are relatively long (compared
to short text, e.g., tweets) and thus exhibit abundant word
co-occurrence, LDA is suitable for modeling their topic
distributions. Thus, all training articles are
fed into the LDA model. The learned model can be used to
infer the topic distribution of each test viewport content. In
the experiments, we compare two different ways to incorporate
the LDA outcome into the FM model. The first is to only
consider the latent topic with the highest probability and
concatenate it with the other features using one-hot encoding.
The second strategy is to consider all latent topics; the topic
distribution vector is concatenated with the other features.
In addition, we evaluate different pre-specified numbers of
latent topics in the experiments.
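As an illustration, a bare-bones TF-IDF weighting over tokenized viewport contents might look like the sketch below; the paper does not specify which TF-IDF variant it uses, so this is one common formulation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by its in-document frequency, offset by the
    fraction of documents in the corpus that contain it."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["ad", "viewability", "ad"], ["ad", "dwell", "time"]]
w = tf_idf(docs)  # "ad" occurs in every document, so its weight is 0
```

Terms that appear in every viewport get zero weight, which is exactly the corpus-frequency offset described above.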
Doc2Vec (Le & Mikolov, 2014) is an unsupervised method
for learning continuous representations of variable-length
pieces of text, such as sentences, paragraphs, or entire documents.
Unlike TF-IDF, Doc2Vec takes into account the ordering
and semantics of the words. Given a chunk of text, Doc2Vec
provides a fixed-length feature vector to represent the mean-
ing of the text. The vector can be used as an input in the FM
model. The Doc2Vec vectors of two pieces of text which
have close meaning should be very close to each other.
Given an unseen piece of text, a fully trained Doc2Vec model
can infer a vector to represent its meaning. The Doc2Vec
used in this project is developed based on Gensim.1 All
training articles are fed into the Doc2Vec model. The
learned model can be used to infer the feature vector of each
test viewport content. We evaluate different dimensionalities
of the feature vector in the experiments.
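Once inferred, a viewport's Doc2Vec vector is simply appended to the sparse one-hot blocks to form the FM input row. A sketch with made-up dimensionalities (the 150-dim vector mirrors the doc2vec_150 setting evaluated later):

```python
import numpy as np

def fm_row_with_doc2vec(user_onehot, page_onehot, depth_onehot, doc_vec):
    """Concatenate the one-hot ID blocks with the dense Doc2Vec vector
    of the viewport content into a single FM input row."""
    return np.concatenate([user_onehot, page_onehot, depth_onehot, doc_vec])

# toy sizes: 5 users, 3 pages, 100 depths, 150-dim Doc2Vec vector
row = fm_row_with_doc2vec(np.eye(5)[2], np.eye(3)[0], np.eye(100)[19],
                          np.zeros(150))
```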
Smoothing Technique
In our preliminary experiments, we observe that the pre-
diction of depth-level dwell time for each page view often
exhibits a see-saw behavior. In contrast, the ground truth of
depth-level dwell time of a page view is rather stable and
continuous. This is because adjacent depths tend to be
shown on a screen at the same time. Thus, the performance
can be further improved if the prediction outcome is
smoothed.
Our solution divides the 100 page depths into consecutive
intervals of size d; all depths within an interval are adjacent
to each other and are assigned the same smoothed dwell
time. The smoothed dwell time is calculated by a function
f, which takes as input the d predicted dwell times in the
interval. The result of any f with d = 1 is the same as the
original predictions. Candidate functions include mean,
median, min, max, quartile, etc. The optimal f and d can be
determined based on the data set. Mathematically, smoothing
is defined as below.
y'_i = f(S), \quad S = \{ y_j \mid y_j \in y,\ j \in (\lfloor i/d \rfloor, \lceil i/d \rceil] \}  (4)

where y is the set of all 100 predictions of a page view, i is a
page depth, d is the interval size, f is the pooling function, y_i is
the original prediction, and y'_i is the smoothed prediction.
The interval that includes i is defined by (\lfloor i/d \rfloor, \lceil i/d \rceil].
1 https://radimrehurek.com/gensim/models/doc2vec.html
This method is inspired by the pooling layer (Schmidhuber,
2015) used in convolutional neural networks. The
main difference is that the pooling layer reduces the dimension
of the input, whereas smoothing does not. Smoothing
can also be used in other models because machine learning
models may tend to always predict higher or lower values
than the ground truth. Smoothing can learn this characteristic
from a validation set and align the predictions on the test
set.
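Under these definitions, smoothing can be sketched as follows; this is an illustrative implementation with mean pooling as the default f.

```python
import numpy as np

def smooth(predictions, d, f=np.mean):
    """Pool 100 per-depth predictions over consecutive intervals of
    size d (Equation (4)); every depth in an interval receives f of the
    interval's raw predictions. With d = 1 the output equals the input."""
    y = np.asarray(predictions, dtype=float)
    out = np.empty_like(y)
    for start in range(0, len(y), d):
        out[start:start + d] = f(y[start:start + d])
    return out

preds = np.arange(1.0, 101.0)   # a dummy page view's 100 raw predictions
smoothed = smooth(preds, d=10)  # smoothed[0:10] all equal mean(1..10) = 5.5
```

Swapping f for np.median, np.min, etc. and varying d over a validation set reproduces the model-selection procedure described above.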
Experimental Evaluation
Settings
A 1-week user log is collected as described in Data Set.
To avoid the cold-start problem, we iteratively remove the
page views whose users and pages occur fewer than 10 times
in the data set. In this way, we guarantee that all users and
pages occur a sufficient number of times in the training data.
All users and pages in the test data occur in the training data.
The final data set is randomly shuffled into three sets: training
(160K+ depths), validation (10K+ depths), and test (15K+
depths). The validation data are used to determine the
optimal smoothing technique and the number of iterations for
early stopping. The experimental results are reported by taking
the average over the sets.
Comparison Models
Several comparison systems are developed as follows:
GlobalAverage. In dwell time prediction, that is, Page
Depth-level Dwell Time Prediction, it computes the average
dwell time of each page depth X in all training page views.
If a user did not scroll to X before leaving the page, its dwell
time in the page view is zero. In viewability prediction, that
is, Viewability Prediction, it computes the fraction of train-
ing page depths whose dwell times are no less than the
required dwell time. In both tests, 100 constant numbers are
obtained after iterating over all training page views. They are
used to make a deterministic prediction for the corresponding
page depth.
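For dwell time prediction, the GlobalAverage baseline reduces to a per-depth mean over the training page views; a sketch (depths not reached before the user left contribute zero, as stated above):

```python
import numpy as np

def global_average(train_dwell):
    """GlobalAverage baseline: one constant prediction per page depth,
    the mean training dwell time of that depth. train_dwell: (N, 100),
    with 0 for depths the user did not scroll to before leaving."""
    return np.asarray(train_dwell, dtype=float).mean(axis=0)

consts = global_average([[2.0] * 100, [4.0] * 100])  # 100 constants
```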
UserAverage. It is like GlobalAverage, but it computes
the average dwell time of each depth X based on each user’s
reading history (rather than all training page views). In
viewability prediction, for a depth of a training page view,
whether or not it is viewed for at least a certain amount of
time is recorded, that is, 0 or 1. The probabilistic prediction
is made based on the average over all binary outcomes of a
page depth of a user.
PageAverage. Like UserAverage, it computes the average
dwell time of each depth X based on each page’s history.
Regression. We select the regression model in (Yi, Hong,
Zhong, Liu, & Rajan, 2014) as a baseline because it is repre-
sentative for page-level dwell time prediction models. We
modify it by adding individual user information and depth
information in order to apply it in our application. We also
use another regression model which uses the feature combination
that works best in the FM model. In particular, two
regression models are built. The first, Regress_bc, is
developed based on (Yi, Hong, Zhong, Liu, & Rajan, 2014).
To apply it to depth-level prediction, one more feature, that
is, page depth, is added. For viewability prediction, logistic
regression with the same features is adopted. The second,
Regress_feat, is developed based on the finding in Section
4.4 that user, page, depth, viewport, doc2vec_150,
and channel are the best features for FM.
Metrics
RMSD. It measures the differences between the values
predicted by a model, ŷ_i, and the values actually observed,
y_i. For depth-level dwell time prediction, it is defined as the
square root of the mean square error:
RMSD = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{100} (\hat{y}_{ij} - y_{ij})^2 }{ N \times 100 } }

where N is the number of test page views. The second sum
accumulates the errors at all 100 page depths in the ith page
view. y_{ij} is the actual dwell time at the jth page depth in the
ith page view, and \hat{y}_{ij} is the corresponding predicted dwell
time.
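Equivalently, in code (an illustrative helper; shapes assumed to be N page views by 100 depths):

```python
import numpy as np

def rmsd(y_true, y_pred):
    """Root-mean-square deviation over all N x 100 depth-level
    predictions, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```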
Logistic loss. It penalizes a method more for being both
confident and wrong. For example, if for a particular
observation a classification model assigns a very small
probability to the correct class, then the corresponding
contribution to the logloss will be very large. In our case, the
probability is interpreted as how likely it is that the dwell
time of a page depth is at least a certain amount of time.
logloss = -\frac{1}{N \times 100} \sum_{i=1}^{N} \sum_{j=1}^{100} \left[ y_{ij} \log\left(\hat{y}_{ij}\right) + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}\right) \right]
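A small stdlib-only sketch of this metric, using the same N-by-100 in-memory layout as above; the eps clipping that avoids log(0) is our addition, not something the paper describes:

```python
import math

def logloss(actual, predicted, eps=1e-15):
    """Mean logistic loss over N page views x 100 depths.

    actual[i][j] is the 0/1 viewability outcome; predicted[i][j] is the
    predicted probability that the depth is viewable.
    """
    n = len(actual)
    total = 0.0
    for view_y, view_p in zip(actual, predicted):
        for y, p in zip(view_y, view_p):
            # clip probabilities away from 0 and 1 so log() is defined
            p = min(max(p, eps), 1 - eps)
            total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / (n * 100)
```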
Comparison of Feature Combinations
The basic FM and FFM models contain the user, the
page, and the depth. We then add context and auxiliary fea-
tures, including user features, page features, and depth fea-
tures, to the basic models in order to evaluate the effect of
different combinations. The models are applied to predict
the dwell time of every page depth in each test page view.
To find the best feature combination, we first add one feature to the basic models and then keep adding features to the best combination, one at a time. The results of adding one additional feature are presented in Table 2. The performance is measured by RMSD. We also vary the dimension of the 2-way interactions, K, which is the length of the latent vector v for each variable (Equation [2]). We test K = 10, 20, and 30 and find that no single K value dominates the results; different feature combinations have different best K.

1012 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi
The results show that some features can significantly improve the prediction performance of the basic FM and FFM. Viewport is one of the most significant context features. Intuitively, viewport indicates the type of device, which influences the reading experience and thus the way users engage with webpages. Channel is also significant: representing the topic of the whole page, it directly determines the user's interest in the page. The channel information is provided by the creators of the article metadata, and thus it can be considered 100% correct.
We observe that some features do not help the prediction. For instance, adding weekday or hour of the day does not decrease the RMSD of the FM model and leads to only a small improvement in FFM. This indicates that the time one user spends on a page depth does not significantly vary with the hour of the day or the day of the week. Also, the user's location does not enhance the performance. The possible reason is that the granularity of the user geo location is too coarse: in the user log, it is state-level for the USA and country-level otherwise.
Four methods are adopted to model the content of a viewport. "TF-IDF keywords" considers all non-stopwords with high TF-IDF scores in the text of a viewport. "topic_n" considers the most probable topic computed by LDA with n topics. "topic_group_n" considers the topic distribution computed by LDA with n topics. In contrast to "topic_n," "topic_group_n" takes into account all latent topics whose probability is greater than 0; the value of each such topic is its probability, so the most probable topic is still weighted higher than the others. Hence, "topic_group_n" is expected to provide more detail about the topic of the viewport content. Lastly, "doc2vec_m" uses a Doc2Vec vector to model the viewport text; we vary the length of the vector, m, to see its impact on performance.
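As one illustration, the "TF-IDF keywords" method could be sketched as follows using only the standard library; the tokenized input, the stopword handling, and all names are our assumptions, since the paper does not give its implementation:

```python
import math
from collections import Counter

def tfidf_keywords(viewport_texts, stopwords, top_k=5):
    """Return the top_k highest-TF-IDF non-stopwords for each viewport.

    viewport_texts: list of token lists, one per viewport (assumed
    already tokenized and non-empty).
    """
    n = len(viewport_texts)
    # document frequency: in how many viewports each term appears
    df = Counter()
    for tokens in viewport_texts:
        df.update(set(tokens))
    results = []
    for tokens in viewport_texts:
        tf = Counter(t for t in tokens if t not in stopwords)
        scores = {t: (c / len(tokens)) * math.log(n / df[t])
                  for t, c in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return results
```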
To further explore the best performance, more than one feature is incorporated into the basic models at the same time. Because FM (doc2vec_150) with K = 20 and FFM (viewport) with K = 30 reach the lowest RMSD, additional features are added to these models. We select one of the four adopted methods to model the viewport content: as doc2vec already models viewport content, the other features in the same category, that is, TF-IDF keywords and LDA, are not considered in these experiments.
Table 3 shows that the FM model with doc2vec_150, channel, and viewport as additional features achieves the lowest RMSD, that is, the best performance. In other words, the dwell time of a given depth is determined by the content around that depth (captured by doc2vec), the topic of the whole article (captured by channel), and the size of the browser (captured by viewport). Similarly, according to Table 4, the best feature combination for FFM is viewport, channel, freshness, and topic20.
Page Depth-level Dwell Time Prediction
We compare the best models obtained from the previous experiment, that is, FM (doc2vec_150 + channel + viewport) with K = 20 and FFM (viewport + channel + freshness + topic20) with K = 30, with the other comparison systems. All models are applied to predict the exact dwell time of each page depth in test page views. We also add the smoothing technique for both FM and FFM. The smoothing settings used on the test sets are determined on the validation sets: the best smoothing for FM is mean with d = 3, and the best for FFM is the 75% quartile with d = 7. The results in Table 5 demonstrate that FM and FFM significantly outperform the comparison systems, with the
TABLE 2. RMSD comparison by adding one additional feature. The results are reported by selecting the best from K = 10, 20, and 30.

Feature group      Feature          FM        FFM
Basic              Basic            12.4680   12.6550
Context            Weekday          12.6575   12.4759
                   Hour             12.7563   12.4542
                   Viewport         12.2909   12.3435
User               Geo              12.5465   12.6499
Article            Length           12.6360   12.4400
                   Channel          12.3489   12.3610
                   Freshness        12.4770   12.3487
Viewport content   TF-IDF           12.6239   12.3648
                   Topic_10         12.7308   12.4205
                   Topic_20         12.3929   12.4080
                   Topic_30         12.5168   12.4911
                   Topic_group_10   12.2912   12.4134
                   Topic_group_20   12.4393   12.3831
                   Topic_group_30   12.3899   12.3773
                   Doc2vec_50       12.3042   12.5161
                   Doc2vec_150      12.2065   12.4075
TABLE 3. RMSD comparison by adding more additional features to FM (doc2vec_150).

Models                                              K = 20
FM (doc2vec_150 + viewport)                         12.2301
FM (doc2vec_150 + channel)                          12.0733
FM (doc2vec_150 + freshness)                        12.3487
FM (doc2vec_150 + channel + viewport)               12.0419
FM (doc2vec_150 + channel + freshness)              12.1985
FM (doc2vec_150 + channel + viewport + freshness)   12.1827
TABLE 4. RMSD comparison by adding more additional features to FFM (viewport).

Models                                                      K = 30
FFM (viewport + channel)                                    12.2775
FFM (viewport + freshness)                                  12.2703
FFM (viewport + topic20)                                    12.2992
FFM (viewport + doc2vec_150)                                12.4524
FFM (viewport + channel + freshness)                        12.2588
FFM (viewport + channel + topic20)                          12.2743
FFM (viewport + freshness + topic20)                        12.2599
FFM (viewport + channel + freshness + topic20)              12.2542
FFM (viewport + channel + freshness + topic20 + keywords)   12.2863
best model being FM + smoothing. This is because these models are able to overcome sparsity and capture pairwise interactions between features. The RMSDs of PageAverage and UserAverage are better than that of GlobalAverage because their predictions are tailored to each page or each user. Also, the results indicate that controlling the user variables seems to be more effective than controlling the page variables, because dwell time is influenced more by individual users' subjective behaviors. The RMSD of Regress_bc is not as low as that of UserAverage, which indicates that methods for page-level dwell time prediction cannot be easily applied to depth-level prediction. Without capturing the interactions of features, Regress_feat does not obtain predictions as good as those of FM and FFM.
The results shown in Table 5 are calculated over all test
page depths. In order to look into the performance at differ-
ent areas of pages and evaluate the robustness of the pro-
posed method, page depths are split into different buckets:
bucket1: [1%, 25%], bucket2: [26%, 50%], bucket3: [51%,
75%], and bucket4: [76%, 100%]. According to the results
shown in Figure 1, the proposed FM and FFM methods consistently outperform the others in all buckets. With smoothing, their performance is further enhanced. Unexpectedly, FFM does not outperform FM because FFM suffers from overfitting: it contains many more parameters than FM. In particular, for doc2vec_50 and doc2vec_150, which are very dense features, FFM builds multiple latent vectors for each doc2vec latent feature, and inaccurate latent vectors may impact the prediction at all depths.
Generally, the prediction error decreases with the
increase of the page depth. The reason is that most users
only read the first half of the page. Therefore, the dwell time
of the page depths near the bottom of the page is mostly
zero. Because it is easier to predict at the bottom of the
pages, the performances of all methods are closer in
bucket4, while the proposed methods are still the best.
Viewability Prediction
Viewability can be regarded as the probability that an
item (e.g., an ad) at a page depth will be viewable. This can
be treated as probabilistic classification. Therefore, we run
an experiment to evaluate whether the FM/FFM models can
handle this problem.
We vary the dwell time threshold of a viewable impression from 1s (IAB standard) to 10s. The target variable of each page depth in the data set is 1 if its dwell time is at least T seconds and 0 otherwise. In this way, the prediction problem is converted from regression to classification. The prediction outcome for each test page depth is the probability that its dwell time is at least T seconds.
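The conversion from regression targets to classification labels is a one-liner; the function name is ours:

```python
def to_viewability_labels(dwell_times, t):
    """Convert per-depth dwell times into binary viewability targets:
    1 if the depth was in view for at least t seconds, else 0."""
    return [1 if dwell >= t else 0 for dwell in dwell_times]
```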
Figure 2 shows that the FM and FFM models clearly outperform the baselines. The best smoothing function for FM is mean, with d varying from 2 to 6 depending on the minimum dwell time threshold. The best smoothing function for FFM is min, with d varying from 3 to 5. We observe that the FM + smoothing model achieves the best performance at the two ends (1s and 10s). Given a page depth, it is more challenging to predict whether the dwell time is at least 5s. The reason is that the numbers of page depths with dwell time of at least 5s and with dwell time less than 5s are very close (about 50% each); in contrast, about 70% of page depths have a dwell time of at least 1s. GlobalAverage and LogisticRegress_bc have similar performance. Also, LogisticRegress with the significant features is better than the other baselines.
TABLE 5. Depth dwell time prediction comparison (RMSD).

Approaches      RMSD
GlobalAverage   13.6971
PageAverage     13.5243
Regress_bc      13.2643
UserAverage     13.1482
Regress_feat    12.9043
FM              12.0419
FM + smooth     11.8808
FFM             12.2542
FFM + smooth    12.2510
FIG. 1. Depth dwell time prediction comparison (Buckets).
One interesting observation is that, although UserAver-
age and PageAverage outperform GlobalAverage by
RMSD, as shown in Table 5, they are much worse than
GlobalAverage by logistic loss in viewability prediction.
Also, they do not have as stable performance as the other
methods. The main reason is that most users and pages in
the test data have few historical page views in the training
data. Also, most page views have a sparse dwell time distribution, that is, the dwell times of many page depths are 0. In this case, for individual users or pages, the viewability predictions for a depth are close to 0 or 1. Once the prediction is incorrect on the test data, the penalty by logistic loss is large, because logistic loss heavily penalizes classifiers that are confident about an incorrect classification. For instance, suppose the dwell times at the 10% depth of all of a user's historical page views are 0s, 0s, 0s, and 3s. In a test page view, the user spent 1s at that page depth, which is the ground truth. Given T = 1, the prediction for the user at this depth will always be (0 + 0 + 0 + 1)/4 = 0.25. This means the classifier thinks that the depth will very likely be viewed for less than 1s. However, the logistic loss of the prediction is logloss(0.25, 1) = 5.9047, a huge penalty. This is because in classification problems it is better to be somewhat wrong than emphatically wrong. This characteristic is very important for publishers because it can help them avoid large decision-making errors.
Effects of Smoothing
To investigate how smoothing impacts prediction perfor-
mance, different smoothing settings are applied to the pre-
diction outcome of the best FM model by varying the
smoothing function f and the interval size d. Figures 3 and 4 show the results for dwell time prediction and viewability prediction (1s), respectively. For dwell time prediction, smoothing with f = mean and d = 4 obtains the best performance. For viewability prediction, smoothing with f = quartile(75) and d = 2 obtains the best performance. Both figures show that performance generally decreases for d = 5 or higher, which means that smoothing with coarse granularity tends to hurt performance. The best performance is often obtained when d is between 2 and 4. For f, min and max always produce worse performance because they use the extreme value in each interval to make the final prediction. The best smoothing setting for FM learned from the validation sets is mean with d = 3; the best for FFM is quartile(75) with d = 5.
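The smoothing step can be sketched as follows; the function names and the nearest-rank quartile convention are our assumptions, since the paper does not give pseudocode:

```python
def smooth(predictions, f, d):
    """Smooth per-depth predictions by applying aggregate f over
    consecutive intervals of d depths; every depth in an interval
    receives the aggregated value. Interval alignment (starting at
    depth 1) is assumed."""
    out = []
    for start in range(0, len(predictions), d):
        chunk = predictions[start:start + d]
        out.extend([f(chunk)] * len(chunk))
    return out

def mean(xs):
    return sum(xs) / len(xs)

def quartile75(xs):
    # 75th percentile via the nearest-rank method (one of several
    # conventions; the paper does not say which it uses)
    xs = sorted(xs)
    return xs[max(0, -(-3 * len(xs) // 4) - 1)]
```

With d = 2 and f = mean, the predictions [1, 2, 3, 4] become [1.5, 1.5, 3.5, 3.5]; min and max can be passed directly as f.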
Feature Analysis
We also look into some features and investigate how user reading behaviors are related to the feature values. This may influence advertisers' bidding behaviors as well as publishers' ad allocation strategies and website design.
Weekday and Hours
We investigate whether user reading behavior varies with
time. The long-term data are provided by Google Analytics
(GA).2 Because the time recorded in GA is the visit time converted to the timezone configured for the GA profile (the Forbes profile uses US Eastern Time), we fix the region of the visits to New York State.
Figure 5 shows website traffic and the mean page-level
dwell time on different days of the week. Although website traffic varies by day of the week, the mean page-level dwell time shows almost no fluctuation. Users spend the
FIG. 2. Viewability prediction comparison.
FIG. 3. Comparison of smoothing techniques for dwell time
prediction.
2https://analytics.google.com/
same time on pages on different days of the week. Moreover, Figure 6 presents an even clearer pattern: the page-level dwell time does not vary by hour of the day.
The dwell time distributions over all page depths on representative weekdays/hours are plotted in Figures 7 and 8. These weekdays/hours are the time points with either the longest or the shortest mean page-level dwell time. Similar to the page-level dwell time, the depth-level dwell time is not much influenced by time either.
Studies (Yuan, Wang, & Zhao, 2013; J. Wang & Yuan, 2015) discover that, in the current pay-by-impression pricing model, the winning bid prices vary significantly by hour of the day. However, our research finds that page depth-level dwell time does not vary much: the chance that an ad is exposed on screen for long enough at midnight is the same as in the daytime. Through this research, advertisers can hopefully realize that impressions at midnight do not have much lower viewability. Hence, they do not need to compete with each other in the daytime and consequently pay higher prices for marketing opportunities that they could also get during non-peak time.
Channels
In each of the six primary channels on the Forbes website, 2,000 page views are randomly sampled. For each page view, the dwell time the user spent on every page depth is calculated; thus, each page view has a vector of length 100, in which each value is the time the user spent on the corresponding page depth. For each channel, the centroid of its 2,000 vectors is calculated by averaging. The six centroids can be considered summaries of the dwell time patterns of the corresponding channels and are plotted in Figure 9.
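The centroid computation above amounts to a column-wise average; a minimal sketch, with the dict layout as our assumption:

```python
def channel_centroids(pageviews_by_channel):
    """Average the per-depth dwell-time vectors of the sampled page
    views in each channel (length 100 in the paper) to get one
    centroid vector per channel.

    pageviews_by_channel: dict mapping channel name -> list of
    equal-length dwell-time vectors.
    """
    centroids = {}
    for channel, vectors in pageviews_by_channel.items():
        n = len(vectors)
        # column-wise mean across the sampled page views
        centroids[channel] = [sum(col) / n for col in zip(*vectors)]
    return centroids
```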
All six plots indicate that users usually spend more time on the first half of the page than on the second half. Also, the top few percent of the page are usually skipped because this area is always the menu bar. However, the patterns of individual channels are not identical. Users tend to spend less time on the lifestyle channel, which usually publishes web articles about travel, sports, and autos; intuitively, users may not read every single sentence on these pages. On the other hand, users spend a long time on the opinion channel, which publishes updated analyses of popular news. This is reasonable in that these opinion articles are original and can attract users to read the authors' points. Likewise, as the most well-known product, the lists channel, which usually publishes rankings, receives high engagement on the first half of the page, whereas users quickly lose attention on the second half. The possible reason is that most users only focus on the top positions when reading a list. In addition, although business and Asia share very similar patterns across page depths, Asia receives slightly longer dwell times. Publishing articles about the economy and billionaires of Asia, the Asia channel has significantly more Asian visitors. Because of the language barrier and relatively slow network connections, Asian visitors usually spend relatively more time on pages.
Viewports
We also investigate user reading behaviors on different viewports. Because users may adjust their browsers to many different sizes, we group viewport sizes by every 100 pixels. For example, "320x520" is represented as "3x5."
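The bucketing follows directly from the example; integer division by 100 (our reading of it) reproduces the paper's "320x520" -> "3x5" mapping:

```python
def viewport_bucket(width, height):
    """Group a viewport size into 100-pixel buckets,
    e.g. 320x520 -> "3x5"."""
    return f"{width // 100}x{height // 100}"
```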
Only popular viewport sizes are considered in this experiment. According to an online public resource,3 viewport sizes are grouped into four categories, which represent the four main display devices: (a) Mobile: "3x6," "3x3," and "3x5"; (b) Tablet: "7x9," "7x10," "10x7," and "10x9"; (c) Laptop: "13x6," "13x7," "12x6," and "12x7"; (d) Big screens: "25x12" and "25x13." In each category, 2,000 page views are randomly sampled. The results are shown in Figure 10.
People generally spend less time on mobile devices. Existing research shows that people usually use mobile devices for casual reading (Cui & Roto, 2008); in this case, users may not stay long on pages. Also, the dwell time distribution on mobile devices seemingly has two peaks: one near the 30% depth and the other near the 60% depth. The reason may be that flicking fingers on the screen is as easy as scrolling the wheel of a mouse (Kyle, 2013). In contrast, the dwell time distribution on tablet devices is smoother because a tablet has a bigger screen than a mobile device. Thus, when a user is reading the first/last paragraph, the middle part is also in view, so the dwell time in the middle of a page is not significantly lower than that in the two tails. According to Figure 10, dwell time increases with viewport size. The main reason is that bigger viewports can display more content, so users stay longer reading the depths displayed in the viewport without much scrolling.
Feature Interactions
FM and FFM build low-rank matrices that consist of latent vectors for each feature. Using these latent vectors, they compute pairwise interactions via dot products. The assumption is that the relationship between the independent variables and the target variable is not linear: the value of the
FIG. 4. Comparison of smoothing techniques for viewability prediction
(1s).
3http://viewportsizes.com/
FIG. 5. Day of week vs. traffic and mean page-level dwell time (New York State; 05Sunday).
FIG. 6. Hour of day vs. traffic and mean page-level dwell time (New York State).
FIG. 7. The comparison of mean depth-level dwell time on Wednesday and Saturday (in seconds).
FIG. 8. The comparison of mean depth-level dwell time on different hours of day (in seconds).
target variable is determined by the interactions of the independent variables. We accessed the trained low-rank matrix of FM to investigate the interaction of users and channels. This study is meaningful because: (a) our experiments show that channel is a significant feature in both the FM and FFM models; (b) the interaction of users and channels can help publishers understand user interest and thus recommend web articles.
We first use the FM model with the best feature set and K = 20 to predict depth-level viewability. We then store the final latent vectors of all users and channels. For each user, we calculate the dot products of its latent vector with all channel latent vectors. There is a positive correlation between these dot products and the engagement of the user on a page from a channel: a large dot product may lead to high viewability and dwell time. Therefore, each user is represented by a vector consisting of dot products. This matrix can be used to cluster similar users based on dwell time behavior and then make recommendations. We observe from the resulting matrix of dot products that several channels tend to always have high engagement, for example, Asia and opinions, or low engagement, for example, lifestyle. We remove the columns of these
FIG. 9. The comparison of mean depth-level dwell time across channels (in seconds).
channels from the matrix to eliminate their overall bias. We also remove the dummy column, which represents the "unknown" channel.
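Building the user-by-channel matrix of dot products can be sketched as follows; the dict containers for the trained FM latent vectors are a hypothetical layout, not the paper's code:

```python
def interaction_matrix(user_vecs, channel_vecs):
    """Dot products between every user latent vector and every channel
    latent vector learned by FM. Larger values suggest higher engagement
    of the user with pages from that channel.

    user_vecs / channel_vecs: dicts mapping id -> latent vector
    (list of K floats).
    Returns (sorted channel ids, {user: row of dot products}).
    """
    channels = sorted(channel_vecs)
    matrix = {}
    for user, u in user_vecs.items():
        matrix[user] = [sum(a * b for a, b in zip(u, channel_vecs[c]))
                        for c in channels]
    return channels, matrix
```

Dropping the columns of uniformly high- or low-engagement channels, as the text describes, is then just a matter of deleting those channel positions from every row.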
Figure 11 presents a sample heatmap based on the matrix, for randomly selected users. In the figure, a darker color means a larger dot product value. It shows that users have different tastes in channels: controlling the user variable, it is hard to say which channel leads to a large dot product. For example, the 9th user has medium interest in the other channels but very high interest in technology; the situation is the opposite for the 11th user, who seems to have no interest in technology. Hence, viewability does not change linearly with the channels. The same conclusion also holds for users. Therefore, the interaction of users and channels should be captured to predict viewability. This explains why the FM-based models, which consider feature interactions, outperform the baselines in our experiments.
As Section 3.4 explains, the FM models represent each user as a latent vector, and the similarity between two user latent vectors describes the similarity between their interests. Figure 12 visualizes two randomly chosen users whose latent vectors have a small cosine distance. As in Figure 11, a darker color means a larger dot product value of user, depth, and channel. Figure 12 shows that both users are interested in the business and investing channels, but are not interested in the leadership, lists, and technology channels. There are a few shared-interest depths in the articles of the business channel, such as depths 3, 20, and 50. The possible reason is that it is easy to get the main points of business articles, and thus users view them fast. Publishers can place ads on the segments of high values in the business channel in order to increase ad viewability.
FIG. 10. The comparison of mean depth-level dwell time across viewport categories (in seconds).
FIG. 11. Sample heatmap of the FM dot product matrix for latent vec-
tors of users and channels.
But because there are many segments of high value in the investing channel, publishers can choose to balance user experience and ad viewability there. Therefore, publishers can use this type of visualization in ad placement decision-making.
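The cosine measure used to pick similar users is standard; a minimal stdlib version:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two user latent vectors; values near 1
    indicate users with similar channel interests (cosine distance is
    1 minus this value)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm
```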
Time Series Model for Page Depth-level Dwell Time Prediction
So far, our FM and FFM models have been designed to
predict depth-level dwell time before a page is loaded.
Another way to further improve the prediction is to leverage
user browsing behavior while reading the page. For exam-
ple, if we know the user behavior in the first half of the
page, we could attempt to predict the depth-level dwell time
in the second half of the page.
Making predictions during page reading is significant and feasible. To guarantee ad viewability, publishers may choose to hold off selling an ad until a user scrolls to the position of the ad. A new ad viewability prediction algorithm can then be used in real time to predict how long the user will stay at that position. Based on discussions with a large publisher, we know that such an algorithm is feasible if a prediction can be made within 100ms.
In order to capture the user reading behavior on a page, we define a browsing action as a triplet (top, bottom, dwell), where top and bottom are the positions of the first and the last line of the viewport, respectively, and dwell is the dwell time the user spends at this position. In our data set, once a user stays at a position for one second, an action is recorded in the user log; a page view must contain at least one action. Figure 13 shows an example of a page view that contains three user browsing actions. The actions occur sequentially based on user scrolling. In this action-level application, the increment is not 1% of the page depth but is driven by user scrolling. Therefore, for example, we can predict the dwell time of the third action based on the previous one or two actions, which have been observed.
Formally, adding time series information, we define a
new problem setting based on browsing actions:
FIG. 12. Sample heatmap of the FM dot product matrix for latent vectors of depths and channels for two users with similar interests.
FIG. 13. An example of a page view which contains three browsing
actions.
TABLE 6. Results of the time series model.

Window size   Metric         FM
h = 0         RMSD           10.4663
              Logloss (3s)   0.4740
              Logloss (7s)   0.6560
h = 1         RMSD           10.1801
              Logloss (3s)   0.4713
              Logloss (7s)   0.6520
h = 2         RMSD           10.0935
              Logloss (3s)   0.4649
              Logloss (7s)   0.6425
Problem Definition 2. Given the previous h browsing actions A = (top_{i-h}, bottom_{i-h}, dwell_{i-h}), ..., (top_{i-1}, bottom_{i-1}, dwell_{i-1}) that a user u just conducted on a webpage a, and the ith position (i.e., top_i and bottom_i), the goal is to predict the dwell time dwell_i that u will stay at the ith position. In practice, the ith position can be the current position to which the user just scrolled.
The FM model that performs best in the Experimental Evaluation section can be extended to this action-level prediction. In particular, the extended model should take into account the previous user actions, the current position of the user, and the change between adjacent actions. Therefore, in addition to the best feature combination, we add features for the time series setting: the current position of the viewport top and bottom (i.e., top_i and bottom_i), the delta distance (i.e., top_i - top_{i-1}), and the dwell time of the previous action (i.e., dwell_{i-1}). These features can be easily measured or calculated at the time the prediction is made. Note that the first action of a page view, which has no previous action, can be predicted by the models proposed in the Factorization Machines section.
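The extra time-series features can be assembled as follows; the feature names and the dict output are our illustration of the features listed above, not the paper's code:

```python
def action_features(actions, i, h):
    """Extra features for predicting dwell_i from the previous h actions.

    actions: list of (top, bottom, dwell) triplets in scroll order.
    Returns the current viewport position, the delta distance from the
    previous action, and the previous h positions and dwell times.
    Requires i >= h >= 1.
    """
    top_i, bottom_i, _ = actions[i]
    feats = {"top": top_i, "bottom": bottom_i,
             "delta": top_i - actions[i - 1][0]}
    for k in range(1, h + 1):
        p_top, p_bottom, p_dwell = actions[i - k]
        feats[f"prev{k}_top"] = p_top
        feats[f"prev{k}_bottom"] = p_bottom
        feats[f"prev{k}_dwell"] = p_dwell
    return feats
```

Each row of the h = 2 data set is then the h = 1 row plus the prev2_* features of the same action, as described below.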
In the experiments, we consider h = 1 and h = 2. The data set we collect contains only the page views that have more than 2 actions. The data set is then transformed into an appropriate input format, in which each row is an action in a page view. The data sets for h = 1 and h = 2 have the same set of actions; the only difference is that each row of the data set for h = 2 has additional features about the (i - 2)th action. The data sets contain about 400K actions from about 70K page views. They are partitioned into training, validation, and test sets by 8:1:1. Note that, because one test action has only one output, the proposed smoothing technique is not applicable here.
Table 6 presents the experimental results. The FM model is capable of handling time series information. FM with h = 0 does not use any time series information; it predicts the dwell time or viewability of an action based only on the information at that action. FMs with h = 1 and h = 2 use time series information from past actions in the same page views. The results show that FM with h = 1 and h = 2 performs better than FM without time series information by both RMSD and logloss. This reflects that adding more information about user engagement within the page view leads to better performance. In addition, all results for h = 2 are better than the corresponding results for h = 1, which indicates that adding more previous actions may further improve the performance. However, in practice, a large h limits the usage of the model because the model requires at least h previous actions to make a prediction. In this case, separate models need to be built for actions that have fewer than h preceding actions.
Conclusions
The emerging ad pricing model prices ads by the number of impressions viewed by users, instead of the number of ads served to webpages. The publishers and advertisers that use this model demand prediction of the dwell time at each depth where an ad is placed. This prediction can help maximize the publishers' profit and improve the advertisers' return on investment. However, it was an open problem until now.
This paper presents the first study of depth-level dwell time prediction. We propose predictive models based on FM and FFM, and add a smoothing technique to further improve the performance. In addition, we use the prior dwell time behavior of the user within the current page view, that is, time series information, and apply it in the FM models. Using real-world data in our experiments, we show that our models outperform the comparison models. In particular, the FM model with viewport, channel, and Doc2Vec features obtains the best performance. Our experiments also demonstrate that adding time series information further improves the performance.
Finally, we extracted the latent feature vectors and pro-
vided an analysis of some feature combinations. The insights
gained from this analysis can be applied to help a publisher
understand user behavior patterns and enhance its business
strategies.
Acknowledgement
This work is partially supported by NSF under grants No.
CAREER IIS-1322406, CNS 1409523, and DGE 1565478,
by a Google Research Award, and by an endowment from
the Leir Charitable Foundations. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
References
Cui, Y., & Roto, V. (2008). How people use the web on mobile devices.
In WWW’08: Proceedings of the 17th international conference on
World Wide Web (pp. 905–914).
Google. (2014). The importance of being seen. https://think.storage.googleapis.com/docs/the-importance-of-being-seen_study.pdf.
Hof, R. (2014). Digital ad fraud is improving - but many ads still aren't seen by real people. https://www.forbes.com/sites/roberthof/2014/01/29/digital-ad-fraud-is-improving-but-many-ads-still-arent-seen-by-real-people/.
Hong, L., Doumith, A. S., & Davison, B. D. (2013). Co-factorization
machines: modeling user interests and predicting individual decisions
in twitter. In WSDM’13: Proceedings of the 6th ACM International
Conference on Web Search and Data Mining (pp. 557–566).
Juan, Y., Zhuang, Y., Chin, W.-S., & Lin, C.-J. (2016). Field-aware fac-
torization machines for ctr prediction. In RecSys’16: Proceedings of
the 10th ACM Conference on Recommender Systems (pp. 43–50).
Kim, Y., Hassan, A., White, R. W., & Zitouni, I. (2014). Modeling
dwell time to predict click-level satisfaction. In WSDM ’14: Proceed-
ings of the 7th ACM International Conference on Web Search and
Data Mining (pp. 193–202).
Kyle, S. (2013). Experimenting in Loyalty Conversion with WNYC:
Achieving Mobile-Desktop Parity. http://blog.chartbeat.com/2013/10/
07/experimenting-loyalty-conversion-wnyc-achieving-mobile-desktop-
parity/.
Lagun, D., & Lalmas, M. (2016). Understanding user attention and
engagement in online news reading. In WSDM ’16: Proceedings of
the 9th ACM International Conference on Web Search and Data Min-
ing (pp. 113–122).
Le, Q. V., & Mikolov, T. (2014). Distributed representations of senten-
ces and documents. In Proceedings of the 31st International Confer-
ence on Machine Learning (pp. 1188–1196).
Liu, C., White, R. W., & Dumais, S. (2010). Understanding web brows-
ing behaviors through weibull analysis of dwell time. In SIGIR '10:
Proceedings of the 33rd International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 379–386).
Qiang, R., Liang, F., & Yang, J. (2013). Exploiting ranking factorization
machines for microblog retrieval. In CIKM’13: Proceedings of the
22nd ACM International Conference on Information and Knowledge
Management (pp. 1783–1788).
Rendle, S. (2012). Factorization machines with libFM. ACM Transac-
tions on Intelligent Systems and Technology (TIST), 3(3), 57:1–57:22.
Schmidhuber, J. (2015). Deep learning in neural networks: an overview.
Neural Networks, 61, 85–117.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for rec-
ommending scientific articles. In KDD'11: Proceedings of the 17th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (pp. 448–456).
Wang, C., Kalra, A., Borcea, C., & Chen, Y. (2015). Viewability predic-
tion for online display ads. In CIKM'15: Proceedings of the 24th
ACM International Conference on Information and Knowledge Man-
agement (pp. 413–422).
Wang, C., Kalra, A., Borcea, C., & Chen, Y. (2016). Webpage depth-
level dwell time prediction. In CIKM’16: Proceedings of the 25th
ACM International Conference on Information and Knowledge Man-
agement (pp. 1937–1940).
Wang, C., Kalra, A., Zhou, L., Borcea, C., & Chen, Y. (2017). Probabilis-
tic models for ad viewability prediction on the web. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 29(9), 2012–2025.
Wang, J., & Yuan, S. (2015). Real-time bidding: a new frontier of computa-
tional advertising research. In WSDM’15: Proceedings of the 8th ACM
International Conference on Web Search and Data Mining (pp. 415–416).
Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Inter-
preting tf-idf term weights as making relevance decisions. ACM
Transactions on Information Systems (TOIS), 26(3), 13:1–13:37.
Yi, X., Hong, L., Zhong, E., Liu, N. N., & Rajan, S. (2014). Beyond
clicks: dwell time for personalization. In RecSys’14: Proceedings of
the 8th ACM Conference on Recommender Systems (pp. 113–120).
Yin, P., Luo, P., Lee, W.-C., & Wang, M. (2013). Silence is also evi-
dence: interpreting dwell time for recommendation from psychologi-
cal perspective. In KDD’13: Proceedings of the 19th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining
(pp. 989–997).
Yuan, S., Wang, J., & Zhao, X. (2013). Real-time bidding for online
advertising: measurement and analysis. In Proceedings of the Seventh
International Workshop on Data Mining for Online Advertising (pp.
3:1–3:8).
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi