Predictive Models and Analysis for Webpage Depth-level Dwell Time
Chong Wang, Information Systems, New Jersey Institute of Technology, Newark, NJ 07003, USA
Shuai Zhao, Martin Tuchman School of Management, New Jersey Institute of Technology, Newark, NJ 07003, USA
Achir Kalra, Forbes Media, 499 Washington Blvd, Jersey City, NJ 07310, USA
Cristian Borcea, Computer Science, New Jersey Institute of Technology, Newark, NJ 07003, USA
Yi Chen, Martin Tuchman School of Management, New Jersey Institute of Technology, Newark, NJ 07003, USA
About half of online display ads are not rendered viewable
because users do not scroll deep enough or do not spend
sufficient time at the page depth where the ads are placed.
In order to increase marketing efficiency and ad effectiveness,
there is a strong demand for viewability prediction from both
advertisers and publishers. This paper aims to predict the
dwell time for a given ⟨user, page, depth⟩ triplet based on
historic data collected by publishers. This problem is difficult
because of user behavior variability and data sparsity. To
solve it, we propose predictive models based on Factorization
Machines and Field-aware Factorization Machines in order
to overcome the data sparsity issue and provide flexibility to
add auxiliary information such as the visible area of a user’s
browser. In addition, we leverage the prior dwell time
behavior of the user within the current page view, that is,
time series information, to further improve the proposed
models. Experimental results using data from a large web
publisher demonstrate that the proposed models outperform
comparison models. Also, the results show that adding time
series information further improves the performance.
Introduction
Online display advertising provides many benefits that
traditional marketing channels do not, such as fast brand
building and effective targeting. In online display advertis-
ing, an advertiser pays a publisher for space on webpages to
display ads while a user is viewing the webpage. There are
two main existing advertising pricing models, pay-by-action
and pay-by-impression. In pay-by-action, advertisers are
charged only when the ads are clicked (i.e., converted).
Such actions may directly bring profits to advertisers. But
the rates of conversion are very low, in which case adver-
tisers often receive little feedback from users. Also, some
advertisers, for example, car vendors, do not expect users to
click and make purchases through their ads. They just want
to increase brand awareness and make more users aware of
their logos or products. In pay-by-impression, an impression
is counted once the ad is sent to a user’s browser, that is,
served. Thus, user actions are not required in this model.
However, recent studies (Google, 2014; Holf, 2014)
show that about half of the served ads are not viewed by
users. There are two reasons for this problem: users do not
scroll to the depths where the ads are displayed on screen,
and users do not spend sufficient time at the page depth
where the ads are placed. In such cases, advertisers still have
to pay for the served ads, and they lose money without any
return on investment. To alleviate this issue, a new model is
emerging: pricing ads by the number of impressions viewed
by users for a certain time, instead of just being served.
Dwell time is used to measure how long an ad is shown in a
user’s screen. The Interactive Advertising Bureau (IAB
2013) defines a viewable impression as one that is at least
50% shown on the user’s screen for at least one continuous
second. This new model is attractive for advertisers because
they can specify dwell time requirements for their ad
campaigns to prevent investment waste and enhance
advertising effectiveness.

Received September 6, 2016; revised October 7, 2017; accepted February 4, 2018
© 2018 ASIS&T. Published online May 20, 2018 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.24025
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 69(8):1007–1022, 2018
Publishers, however, demand predictions of ad dwell
time, in particular, the dwell time of each depth where ads
are placed. Depth-level dwell time prediction for a (user,
webpage) pair helps publishers to fulfill the contracts with
advertisers and satisfy advertisers’ dwell time requirements.
In addition, it can help publishers to maximize their revenue
by placing the most profitable ads to the most viewable pla-
ces. Further, publishers can dynamically generate different
webpage layouts to different users based on viewability pre-
diction so that both user experience and overall ad viewabil-
ity are optimized. Therefore, ad viewability prediction is
essential to fulfill the marketing requirements and thus maxi-
mize the return on investment for advertisers. Moreover, it
can also boost the advertising revenue for publishers.
Despite its importance, page depth-level dwell time
prediction is still an open problem. Existing work (Liu, White,
& Dumais, 2010; Kim, Hassan, White, & Zitouni, 2014; Yi,
Hong, Zhong, Liu, & Rajan, 2014) focuses on predicting the
time a user will spend on the whole page, rather than at a
particular page depth, as discussed in Related Work. However,
working at a finer granularity, depth-level dwell time predic-
tion is more challenging. The problem is non-trivial because
of the variability of user behavior and data sparsity, that is,
most users read only a few webpages, while a webpage is
visited by a small subset of users. It is also difficult to
explicitly model user interests as well as the characteristics
of entire pages and depths.
In this study, we investigate how to predict webpage
depth-level dwell time. We develop predictive models based
on Factorization Machines (FM) and Field-aware Factoriza-
tion Machines (FFM) because they are able to capture the
interaction between input features, overcome the data spar-
sity issue, and provide flexibility to add auxiliary informa-
tion. The proposed predictive models can be applied to
predict the dwell time of any items on a page. Our models
consider the basic factors (i.e., user, page, and page depth)
and other auxiliary information. We also propose a smooth-
ing technique to further improve the performance. In addi-
tion, we leverage the prior dwell time behavior of the user
within the current page view, that is, time series information,
and integrate them into the FM-based models. We evaluated
our models using real data from a large web publisher. The
experimental results demonstrate that our models outper-
form comparison models. The FM model with viewport,
channel, and Doc2Vec vector obtains the best performance.
In addition, the performance is further improved by adding
the time series information of prior dwell time behavior. We
also present the analysis of feature combinations by extract-
ing the latent feature vectors from the proposed models,
which provides insights on publishers’ business strategies
for ad placement.
Our contributions are summarized as follows: (a) To the
best of our knowledge, this is the first work that studies page
depth-level dwell time prediction. We define a new problem
in viewability prediction in order to build new ad pricing
standards. (b) We are the first to propose predictive models
based on FM and FFM to solve this new problem, and we
show that our proposed models outperform other compari-
son methods. (c) We demonstrate that adding time series
information can further improve the model performance. (d)
We also present the analysis of feature combinations and
provide insights on publishers’ ad business strategies.
The rest of the paper is organized as follows. Related
Work discusses the related work. Page Depth Dwell Time
Prediction describes the proposed models for webpage
depth-level dwell time prediction. Experimental results and
feature analysis are presented in Experimental Evaluation
and Feature Analysis. An FM model with time-series infor-
mation is presented in Time Series Model for Page Depth-
level Dwell Time Prediction. The paper concludes in
Conclusion.
Related Work
The problem of dwell time prediction is challenging
because of the variability of user behaviors and the data
sparsity. Liu, White, and Dumais (2010) investigate the fea-
sibility of predicting from features the Weibull distribution
of page-level dwell time. They use Multiple Additive
Regression Trees (MART). The features include the fre-
quencies of HTML tags, webpage keywords, page size, the
number of secondary URLs, and so on. They find that page-
level dwell time is highly related to webpage length and
topics. Yi, Hong, Zhong, Liu, and Rajan (2014) view the
average dwell time of a webpage as one of the item’s inherent
characteristics, which provides important average user
engagement information on how much time the user will
spend on this item. The authors present a machine learning
method to predict dwell time of article stories using simple
features. The features they consider are content length, topi-
cal category of the article (e.g., finance), and the context in
which the article would be shown (e.g., devices). The
authors use Support Vector Regression models to predict
page-level dwell time. Kim, Hassan, White, and Zitouni
(2014) present a regression method to estimate the parameters
of the Gamma distributions of click dwell time (i.e., the time
that the user spends on a clicked result). The features they
adopt are similar to those used in Liu et al. In contrast, this
proposed research predicts dwell time at a specific depth in
a page, which is still an open question. Working at a finer
granularity, depth-level dwell time prediction is more chal-
lenging than page-level dwell time prediction. Yin, Luo,
Lee, and Wang (2013) find that the dwell time satisfies a
log-Gaussian distribution. They claim that viewing an item is
such a casual behavior that people may terminate the viewing
process at any time. The dwell time varies a lot because of
the factors from both items and persons: (a) items differ in
their form and volume; (b) there are many
subjective human factors that affect dwell time. For example,
different people receive information at different speeds; the
time to consume the same item differs from person to person.
The authors develop a model which estimates how
much a user likes the viewed item according to the item-
level dwell time.
From a methodological point of view, there is similarity
between page-level and depth-level dwell time prediction.
However, compared with page-level dwell time prediction,
depth-level dwell time prediction is more challenging
because of user behavior variability. Existing work on
page-level dwell time prediction does not consider individual
users. Furthermore, these methods do not achieve good
performance when modified to include user information (Liu,
White, & Dumais, 2010; Yi, Hong, Zhong, Liu, & Rajan,
2014; Kim, Hassan, White, & Zitouni, 2014). However,
users’ heterogeneity should be considered because their
behaviors vary a lot. Different users have different reading
habits and interests, which largely affect their reading
behavior. However, zooming into individual-level behavior
inevitably leads to the problem of data sparsity (i.e., the
interaction between users and pages is highly sparse). Also,
detailed user profile information is not accessible to publishers,
in which case we have to rely just on user IDs. Therefore,
in this paper, we propose to adopt the FM and FFM
models, which consider individual users and pages, and
overcome data sparsity. Our experiments demonstrate that
FM and FFM outperform an existing method (Yi, Hong,
Zhong, Liu, & Rajan, 2014), which was modified by adding
individual user information and depth information.
The only existing work that narrows down to dwell time
of a part of a page is (Lagun & Lalmas, 2016), in which
Lagun et al. define and measure viewport time in order to
infer user interest. However, our work focuses on the prediction
of the dwell time at each individual page depth. We
preliminarily studied depth-level dwell time prediction in
our previous work (C. Wang, Kalra, Borcea, & Chen, 2016).
This paper, however, provides a substantial extension, with
new prediction models and additional features.
Another area of study, which is less related to our work,
is the prediction of the likelihood that a user scrolls to a
given depth in a page. Wang et al. (C. Wang, Kalra,
Borcea, & Chen, 2015; C. Wang, Kalra, Zhou, Borcea, &
Chen, 2017) propose probabilistic latent class models that
predict the probability that a user scrolls to a given page
depth where an ad may be placed. In contrast, this paper
presents models to predict how long a user may stay at a
given page depth. Compared to prediction of scroll depths,
webpage depth-level dwell time prediction can better satisfy
publishers’ need for detailed estimations of ad viewability.
Page Depth Dwell Time Prediction
We define the problem of depth-level dwell time predic-
tion as below.
Problem Definition 1. Given a page view, that is, a user u
and a webpage a, the goal is to predict the dwell time of a
given page depth X, that is, the time duration that X is shown
on the screen. The dwell time of X is denoted as T_uaX.
Data Set
A large web publisher (i.e., Forbes Media) provides user
browsing logs collected from real website visits in 1 week of
Dec 2015 and webpage metadata. The data set contains 2
million page views. For each page view, it records the user
id, page url, state-level user geo location, user agent, and
browsing events, for example the user opened/left/read the
page. Each event stores the event time stamp and the page
depths where the top and bottom of the user screen are.
Once a user scrolls to a page depth and stays for one second,
an event is recorded. The page depth is represented as the
percentage of the page. The reason that we adopted page
percentage rather than pixels is because it provides a relative
measure independent of device screen size. If a user reads
50% of a page on a mobile device, whereas another user
reads 50% of the same page on a desktop, it can be assumed
that they read the same content.
Table 1 is a simplified example of the user log. Each
event has a time stamp so that the time that a user spent on a
part of the page can be calculated. To infer the current part of a
page that a user is looking at, the user log also records the
page depths at which the first and the last rows of pixels of
the screen are. Thus, we are able to infer when a user
scrolled to which part of a page and how long the user
stayed.
In Table 1, the user scrolled to 30–60% of the page after
reading 20–50% of the page for 1 minute. Thus, the dwell
time of the page depths that have been scrolled past can be
determined. For example, the dwell time of 20% - 30% is 1
minute at this moment.
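The dwell-time computation described above can be sketched as follows. This is an illustrative sketch, not the publisher's actual pipeline; the event keys (ts, first_row, last_row) are hypothetical simplifications of the log format in Table 1.

```python
from collections import defaultdict

def depth_dwell_times(events):
    """Accumulate dwell time per page depth (in % of the page) from a
    chronologically ordered list of "read page" events. Each event
    records a timestamp in seconds and the page depths of the first
    and last rows of pixels visible on the screen."""
    dwell = defaultdict(float)
    for cur, nxt in zip(events, events[1:]):
        elapsed = nxt["ts"] - cur["ts"]  # time until the next scroll event
        # every depth visible in the current viewport accumulates this time
        for depth in range(cur["first_row"], cur["last_row"] + 1):
            dwell[depth] += elapsed
    return dict(dwell)

# Example mirroring Table 1: the viewport covers 20-50% for one minute,
# then the user scrolls to 30-60%
events = [
    {"ts": 0,  "first_row": 20, "last_row": 50},
    {"ts": 60, "first_row": 30, "last_row": 60},
]
dwell = depth_dwell_times(events)  # dwell[20] == 60.0 seconds
```

As in the text, depths the user has not yet scrolled past (here, 51–60%) have no dwell time until a later event closes their interval.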
Factorization Machines (FM)
It is intuitive that the dwell time of a page depth is highly
related to the user’s interests and reading habits, the topic of
the article in the page, the design at that page depth, etc.
More importantly, the interactions of these three factors
must be modeled so that their joint effect is captured: (a)
The interaction of users and pages captures a user’s interest
in a page. (b) The interaction of users and page depths can
reflect individual users’ browsing habits. For example, some
users read entire pages carefully, but some only read the
upper half. (c) The interaction of pages and depths models
TABLE 1. A simplified example of the user log.

User  URL   Time                 ...  Event       User Behavior
001   /abc  2/1/2015 10:00:00    ...  Read Page   {“first row”: 20, “last row”: 50, ...}
001   /abc  2/1/2015 10:01:00    ...  Read Page   {“first row”: 30, “last row”: 60, ...}
the design of individual pages at individual page depths. For
example, pages that have a picture at a depth may receive
relatively short dwell time at that depth because people usu-
ally can understand a picture more quickly than text. How-
ever, it is non-trivial to explicitly model user interests, page
characteristics, the attractiveness of page depths, and their
interactions. Also, although implicit feedback, for example,
reading dwell time, is more abundant than explicit feedback,
for example, ratings, it often has higher variability (Yin
et al., 2013), which makes prediction more challenging.
Therefore, we adopt Factorization Machines (FM) (Rendle,
2012), a generic approach that combines the high prediction
accuracy of factorization models with the flexibility of
feature engineering. The FM model has been used in
applications such as context-aware rating prediction (Rendle, 2012),
retweeting (Hong, Doumith, & Davison, 2013), and microblog
ranking (Qiang, Liang, & Yang, 2013). The reason that we
adopt the FM model is that it can capture the interaction of
multiple inter-related factors, overcome the data sparsity, and
provide the flexibility to add auxiliary information.
According to the problem definition, the basic FM model
requires three factors: user, page, and page depth. The input
is derived from the user-page-depth matrix built from the
user logs: In the basic form of depth-level dwell time predic-
tion, we have a three-dimensional cube containing nu users,
na pages, and nd page depths. Thus, each dwell time is
associated with a unique ⟨user, page, depth⟩ triplet. Such a 3D
matrix can be converted into a list of rows with (nu + na + nd)
columns. The target variable for each row corresponds to an
observed dwell time represented by the triplet. N training
page views lead to N × 100 rows, as each page view contains
100 observed dwell time values (one for each percent from 1 to
100% page depth). This input is similar to what is prepared
for regressions. However, regressions would not work well
because the data are very sparse and they are unable to cap-
ture the interaction between the input variables.
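The conversion of the user-page-depth cube into a sparse design matrix can be illustrated with a toy sketch; the sizes and indices below are made up for illustration.

```python
import numpy as np

def encode_row(u, a, d, n_u, n_a, n_d):
    """One-hot encode a <user, page, depth> triplet into one row of the
    FM input: a vector with n_u + n_a + n_d binary indicator columns."""
    x = np.zeros(n_u + n_a + n_d)
    x[u] = 1.0               # user block
    x[n_u + a] = 1.0         # page block
    x[n_u + n_a + d] = 1.0   # depth block
    return x

# toy example: user 2 of 5, page 0 of 3, depth 19 of 100
row = encode_row(2, 0, 19, n_u=5, n_a=3, n_d=100)
```

Each row is extremely sparse (exactly three nonzero entries), which is precisely the regime factorization models are designed to handle.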
The basic idea of FM is to model each target variable as
a linear combination of interactions between input variables.
Formally, it is defined as follows:

\hat{y}_{FM}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j  (1)

where \hat{y}_{FM}(x) is the prediction outcome given an input x. w_0 is
a global bias, that is, the overall average depth-level dwell
time. \sum_{i=1}^{n} w_i x_i is the bias of individual input variables.
For example, some users tend to read more carefully
than others; some pages can attract users to spend more time
on them; some page depths, for example, very bottom of a
page, usually receive little dwell time. The first two terms
are the same as in linear regression. The third term captures
the sparse interaction between each pair of input variables.
The FM model uses a factorized parametrization to cap-
ture the feature interaction effects (Equation [2]). That infor-
mation is difficult to learn for linear regressions because
standard regression models learn the weight of each
interaction using only a single real number w_ij. The latent feature
vectors, on the other hand, allow the FM models to estimate
reliable parameters even in sparse data. This is because the
latent features are learned from all feature pairs instead of
only one pair, which provides a better representation of the
influence on the output variable and the combined effects
with other features.
\langle v_i, v_j \rangle = \sum_{k=1}^{K} v_{ik} v_{jk}  (2)
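For concreteness, the FM prediction of Equations (1) and (2) can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the O(nK) reformulation of the pairwise term follows Rendle (2012).

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction (Equations (1)-(2)).
    x: input row (n,), w0: global bias, w: per-feature biases (n,),
    V: latent vectors (n, K). The pairwise term uses the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_k [(sum_i v_ik x_i)^2 - sum_i v_ik^2 x_i^2],
    which avoids the explicit double loop over feature pairs."""
    s = V.T @ x                  # (K,): per-factor weighted sums
    s2 = (V ** 2).T @ (x ** 2)   # (K,): diagonal correction
    return float(w0 + w @ x + 0.5 * np.sum(s * s - s2))
```

Because the one-hot input rows have only a handful of nonzero entries, this evaluation is cheap even when n (users + pages + depths) is large.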
Field-aware Factorization Machines (FFM)
Very recently, Juan et al. (Juan, Zhuang, Chin, & Lin,
2016) proposed a variant of FM, field-aware factorization
machines (FFM), which has shown its superiority over exist-
ing models in machine learning competitions.
FM builds one vector for each individual feature. The
latent vector of a feature is used to compute the interaction
with any other feature in the input data. If n is the number of
features and k the dimensionality of latent vectors, then the
number of parameters to learn is n × k.
FFM assumes that features can be grouped into different
fields. For instance, the feature “lifestyle” belongs to the field
“channel,” and the feature “320 × 480” belongs to the field
“viewport.” The intuition in FFM is that a feature should use
different representations (i.e., latent vectors) when interacting
with features which belong to different fields because
they may emphasize different aspects of the feature. For
example, “depth_20%” should use different latent vectors,
v_{depth_20%, channel} and v_{depth_20%, viewport}, to calculate the
interactions of (depth_20%, lifestyle) and (depth_20%, 320 × 480)
because “lifestyle” and “320 × 480” belong to two different
fields. FM builds a single vector per feature, which is
used to compute the interaction effects with all other features.
In contrast, FFM builds multiple vectors per feature, each of
which is used to compute the interaction with features from
the corresponding field. Thus, FFM has a better capability to
fit the input data. It is also a more complex model than FM
in terms of the number of parameters: if the number of fields
in the input data is f, each feature has f vectors. In this
case, the number of parameters to learn is n × f × k.
Thus, FFM is formally defined as below:

\hat{y}_{FFM}(x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1, f_2} \cdot w_{j_2, f_1}) x_{j_1} x_{j_2}  (3)

where f_1 and f_2 are, respectively, the fields of features j_1 and
j_2, and n is the number of features.
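A naive sketch of the field-aware interaction term in Equation (3) follows; the field layout and sizes are toy assumptions, not the paper's configuration.

```python
import numpy as np

def ffm_predict(x, fields, W):
    """FFM interaction term (Equation (3)), computed naively.
    x: input row (n,); fields[j]: field index of feature j;
    W: latent vectors of shape (n, f, K) -- n * f * K parameters,
    one K-dim vector per (feature, field) pair."""
    n = len(x)
    y = 0.0
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            if x[j1] == 0.0 or x[j2] == 0.0:
                continue  # one-hot inputs make most pairs vanish
            f1, f2 = fields[j1], fields[j2]
            # feature j1 uses its vector for j2's field, and vice versa
            y += float(W[j1, f2] @ W[j2, f1]) * x[j1] * x[j2]
    return y
```

The skip over zero entries reflects why FFM stays tractable on the sparse one-hot rows described earlier.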
In this study, we apply both FM and FFM to the applica-
tion of page depth-level dwell time prediction.
Feature Engineering
The basic FM model works with only three factors: user,
page, and depth. However, context information can also
1010 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi
help improve the prediction performance. Thus, we identify
three context features: viewport (i.e., the part of a user’s
browser visible on the screen), local hour, and local day of
the week (denoted by weekday in the experiments), which
are likely related to user reading behavior. The viewport
indicates the device used by the user (e.g., a mobile device
usually has a much smaller visible browser area than a
desktop) and can directly determine the user experience.
Specifically, one viewport value consists of the height and
the width of a browser, for example, 1855 × 1107. To
reduce sparsity, both heights and widths are put into buckets
of 100 pixels. For instance, 1855 × 1107 is discretized
into 18 × 11.
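The bucketing step described above amounts to integer division by the bucket size; a minimal sketch:

```python
def bucket_viewport(height_px, width_px, bucket=100):
    """Discretize a viewport (height x width, in pixels) into 100-pixel
    buckets to reduce feature sparsity, e.g. 1855 x 1107 -> "18x11"."""
    return f"{height_px // bucket}x{width_px // bucket}"
```

The resulting string is then treated as a single categorical feature value in the "viewport" field.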
For user demographics, we consider user geo locations
because this is the only explicit feature about users that can
be easily obtained by publishers without violation of user
privacy. User geo, inferred from IPs, may reflect a user’s
interests and education, and it may determine the user’s net-
work condition. Specifically, geo is the country name if the
user is outside the United States or a state name if she is
within the United States.
For page attributes, we consider article length, channel,
and freshness. Article length is represented by the word
count of the article in the page, and it has been proven to be
a significant factor impacting page-level dwell time (Yi,
Hong, Zhong, Liu, & Rajan, 2014). However, its influence
on page-depth-level dwell time is still unclear. The channel
of the article in a page is its topical category on the publisher’s
website, for example, finance and lifestyle. Freshness is
the time span between when the page is read and when it was
first published on the website; it is measured in days.
The freshness of an article may affect a user’s interest
in it. Fresh news may receive more user engagement.
The viewport content is also modeled by several state-of-
the-art models because it is believed that the content shown
in a user’s browser affects the time that the user spends on
it. The user log records the position of each viewport and the
article metadata include the content of each article. So it is
possible to obtain the textual content shown in the user’s
browser.
Several of the most popular existing models are used to
model the semantics of each viewport content: TF-IDF,
LDA, and Doc2Vec. TF-IDF (Wu, Luk, Wong, & Kwok,
2008), short for term frequency-inverse document frequency,
is a commonly used method to weight words based on
their importance to a textual document in a collection. The
TF-IDF value increases proportionally to the number of
times a word appears in the document, but is offset by
the frequency of the word in the corpus, which helps to
adjust for the fact that some words appear more frequently
in general. LDA (C. Wang & Blei, 2011), short for Latent
Dirichlet Allocation, is an unsupervised process for inferring
the topics in a textual document. It outputs a well-defined
topic probability distribution for arbitrary documents. Because
the webpage articles in our corpus are relatively long (compared
to short text, e.g., tweets) and thus exhibit abundant word
co-occurrence, LDA is suitable for modeling their topic
distributions. Thus, all training articles are
fed into the LDA model. The learned model can be used to
infer the topic distribution of each test viewport content. In
the experiments, we compare two different ways to incorporate
the LDA outcome into the FM model. The first is to only
consider the latent topic with the highest probability and
concatenate it with the other features using one-hot encoding.
The second strategy is to consider all latent topics; the topic
distribution vector is concatenated with the other features.
In addition, we evaluate different pre-specified numbers of
latent topics in the experiments.
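As an illustration, a bare-bones TF-IDF weighting over tokenized viewport contents might look like the sketch below; the paper does not specify which TF-IDF variant it uses, so this is one common formulation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by its in-document frequency, offset by the
    fraction of documents in the corpus that contain it."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["ad", "viewability", "ad"], ["ad", "dwell", "time"]]
w = tf_idf(docs)  # "ad" occurs in every document, so its weight is 0
```

Terms that appear in every viewport get zero weight, which is exactly the corpus-frequency offset described above.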
Doc2Vec (Le & Mikolov, 2014) is an unsupervised method
for learning continuous representations of variable-length
pieces of text, such as sentences, paragraphs, or entire documents.
Unlike TF-IDF, Doc2Vec takes into account the ordering
and semantics of the words. Given a chunk of text, Doc2Vec
provides a fixed-length feature vector to represent the mean-
ing of the text. The vector can be used as an input in the FM
model. The Doc2Vec vectors of two pieces of text which
have close meaning should be very close to each other.
Given an unseen piece of text, a fully trained Doc2Vec model
can infer a vector to represent its meaning. The Doc2Vec
used in this project is developed based on Gensim.1 All
training articles are fed into the Doc2Vec model. The
learned model can be used to infer the feature vector of each
test viewport content. We evaluate different dimensionalities
of the feature vector in the experiments.
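Once inferred, a viewport's Doc2Vec vector is simply appended to the sparse one-hot blocks to form the FM input row. A sketch with made-up dimensionalities (the 150-dim vector mirrors the doc2vec_150 setting evaluated later):

```python
import numpy as np

def fm_row_with_doc2vec(user_onehot, page_onehot, depth_onehot, doc_vec):
    """Concatenate the one-hot ID blocks with the dense Doc2Vec vector
    of the viewport content into a single FM input row."""
    return np.concatenate([user_onehot, page_onehot, depth_onehot, doc_vec])

# toy sizes: 5 users, 3 pages, 100 depths, 150-dim Doc2Vec vector
row = fm_row_with_doc2vec(np.eye(5)[2], np.eye(3)[0], np.eye(100)[19],
                          np.zeros(150))
```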
Smoothing Technique
In our preliminary experiments, we observe that the pre-
diction of depth-level dwell time for each page view often
exhibits a see-saw behavior. In contrast, the ground truth of
depth-level dwell time of a page view is rather stable and
continuous. This is because adjacent depths tend to be
shown on a screen at the same time. Thus, the performance
can be further improved if the prediction outcome is
smoothed.
Our solution divides the 100 page depths into consecutive
intervals of size d; all depths within an interval are adjacent
to each other and are assigned the same smoothed dwell
time. The smoothed dwell time is calculated by a function
f, which takes as input the d predicted dwell times in the
interval. The result of any f with d = 1 is the same as the
original predictions. Candidate functions include mean,
median, min, max, quartile, etc. The optimal f and d can be
determined based on the data set. Mathematically, smoothing
is defined as below.
y'_i = f(S), \quad S = \{ y_j \mid y_j \in y,\ j \in (\lfloor i/d \rfloor, \lceil i/d \rceil] \}  (4)

where y is the set of all 100 predictions of a page view, i is a
page depth, d is the interval size, f is the pooling function, y_i is
the original prediction, and y'_i is the smoothed prediction.
The interval that includes i is defined by (\lfloor i/d \rfloor, \lceil i/d \rceil].
1 https://radimrehurek.com/gensim/models/doc2vec.html
This method is inspired by the pooling layer (Schmidhuber,
2015) used in convolutional neural networks. The
main difference is that the pooling layer reduces the dimension
of the input, whereas smoothing does not. Smoothing
can also be used in other models because machine learning
models may tend to always predict higher or lower values
than the ground truth. Smoothing can learn this characteristic
from a validation set and align the predictions on the test
set.
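Under these definitions, smoothing can be sketched as follows; this is an illustrative implementation with mean pooling as the default f.

```python
import numpy as np

def smooth(predictions, d, f=np.mean):
    """Pool 100 per-depth predictions over consecutive intervals of
    size d (Equation (4)); every depth in an interval receives f of the
    interval's raw predictions. With d = 1 the output equals the input."""
    y = np.asarray(predictions, dtype=float)
    out = np.empty_like(y)
    for start in range(0, len(y), d):
        out[start:start + d] = f(y[start:start + d])
    return out

preds = np.arange(1.0, 101.0)   # a dummy page view's 100 raw predictions
smoothed = smooth(preds, d=10)  # smoothed[0:10] all equal mean(1..10) = 5.5
```

Swapping f for np.median, np.min, etc. and varying d over a validation set reproduces the model-selection procedure described above.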
Experimental Evaluation
Settings
A 1-week user log is collected as described in Data Set.
To avoid the cold-start problem, we iteratively remove the
page views whose users and pages occur fewer than 10 times
in the data set. In this way, we guarantee that all users and
pages occur a sufficient number of times in the training data.
All users and pages in the test data occur in the training data.
The final data set is randomly shuffled into three sets: training
(160K+ depths), validation (10K+ depths), and test (15K+
depths). The validation data are used to determine the
optimal smoothing technique and the number of iterations for
early stopping. The experimental results are reported by taking
the average over the sets.
Comparison Models
Several comparison systems are developed as follows:
GlobalAverage. In dwell time prediction, that is, Page
Depth-level Dwell Time Prediction, it computes the average
dwell time of each page depth X in all training page views.
If a user did not scroll to X before leaving the page, its dwell
time in the page view is zero. In viewability prediction, that
is, Viewability Prediction, it computes the fraction of train-
ing page depths whose dwell times are no less than the
required dwell time. In both tests, 100 constant numbers are
obtained after iterating over all training page views. They are
used to make a deterministic prediction for the corresponding
page depth.
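For dwell time prediction, the GlobalAverage baseline reduces to a per-depth mean over the training page views; a sketch (depths not reached before the user left contribute zero, as stated above):

```python
import numpy as np

def global_average(train_dwell):
    """GlobalAverage baseline: one constant prediction per page depth,
    the mean training dwell time of that depth. train_dwell: (N, 100),
    with 0 for depths the user did not scroll to before leaving."""
    return np.asarray(train_dwell, dtype=float).mean(axis=0)

consts = global_average([[2.0] * 100, [4.0] * 100])  # 100 constants
```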
UserAverage. It is like GlobalAverage, but it computes
the average dwell time of each depth X based on each user’s
reading history (rather than all training page views). In
viewability prediction, for a depth of a training page view,
whether or not it is viewed for at least a certain amount of
time is recorded, that is, 0 or 1. The probabilistic prediction
is made based on the average over all binary outcomes of a
page depth of a user.
PageAverage. Like UserAverage, it computes the average
dwell time of each depth X based on each page’s history.
Regression. We select the regression model in (Yi, Hong,
Zhong, Liu, & Rajan, 2014) as a baseline because it is repre-
sentative for page-level dwell time prediction models. We
modify it by adding individual user information and depth
information in order to apply it in our application. We also
use another regression model which uses the feature combination
that works best in the FM model. In particular, two
regression models are built. The first, Regress_bc, is
developed based on (Yi, Hong, Zhong, Liu, & Rajan, 2014).
To apply it to depth-level prediction, one more feature, that
is, page depth, is added. For viewability prediction, logistic
regression with the same features is adopted. The second,
Regress_feat, is developed based on the finding in Section
4.4 that user, page, depth, viewport, doc2vec_150,
and channel are the best features for FM.
Metrics
RMSD. It measures the differences between the values
predicted by a model, ŷ_i, and the values actually observed,
y_i. For depth-level dwell time prediction, it is defined as the
square root of the mean square error:
RMSD = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{100} (\hat{y}_{ij} - y_{ij})^2 }{ N \times 100 } }

where N is the number of test page views. The second sum
accumulates the errors at all 100 page depths in the ith page
view. y_{ij} is the actual dwell time at the jth page depth in the
ith page view, and \hat{y}_{ij} is the corresponding predicted dwell
time.
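Equivalently, in code (an illustrative helper; shapes assumed to be N page views by 100 depths):

```python
import numpy as np

def rmsd(y_true, y_pred):
    """Root-mean-square deviation over all N x 100 depth-level
    predictions, as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```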
Logistic loss. It penalizes a method more for being both
confident and wrong. For example, if for a particular
observation a classification model assigns a very small
probability to the correct class, then the corresponding
contribution to the logloss will be very large. In our case, the
probability is interpreted as how likely it is that the dwell
time of a page depth is at least a certain amount of time.
logloss = -\frac{1}{N \times 100} \sum_{i=1}^{N} \sum_{j=1}^{100} \left[ y_{ij} \log\left(\hat{y}_{ij}\right) + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}\right) \right]
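A small stdlib-only sketch of this metric, using the same N-by-100 in-memory layout as above; the eps clipping that avoids log(0) is our addition, not something the paper describes:

```python
import math

def logloss(actual, predicted, eps=1e-15):
    """Mean logistic loss over N page views x 100 depths.

    actual[i][j] is the 0/1 viewability outcome; predicted[i][j] is the
    predicted probability that the depth is viewable.
    """
    n = len(actual)
    total = 0.0
    for view_y, view_p in zip(actual, predicted):
        for y, p in zip(view_y, view_p):
            # clip probabilities away from 0 and 1 so log() is defined
            p = min(max(p, eps), 1 - eps)
            total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / (n * 100)
```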
Comparison of Feature Combinations
The basic FM and FFM models contain the user, the
page, and the depth. We then add context and auxiliary fea-
tures, including user features, page features, and depth fea-
tures, to the basic models in order to evaluate the effect of
different combinations. The models are applied to predict
the dwell time of every page depth in each test page view.
To find the best feature combination, we first add one feature to the basic models and then keep adding features to the best combination, one at a time. The results of adding one additional feature are presented in Table 2. The performance is measured by RMSD. We also vary the dimension of the 2-way interactions, K, which is the length of the latent vector v for each variable (Equation [2]). We test K = 10, 20, and 30 and find that no single K value dominates the results; different feature combinations have different best K.

1012 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi
The results show that some features can significantly improve the prediction performance of the basic FM and FFM. Viewport is one of the most significant context features. Intuitively, viewport indicates the type of device, which influences the reading experience and thus the way users engage with webpages. Channel is also significant: representing the topic of the whole page, it directly determines the user's interest in the page. The channel information is provided by the creators of the article metadata, and thus it can be considered 100% correct.
We observe that some features do not help the prediction. For instance, adding weekday or hour of the day does not decrease the RMSD of the FM model and leads to only a small improvement in FFM. This indicates that the time one user spends on a page depth does not significantly vary with the hour of the day or the day of the week. Also, the user's location does not enhance the performance. The possible reason is that the granularity of the user geo location is too coarse: in the user log, it is state-level for the USA and country-level otherwise.
Four methods are adopted to model the content of a viewport. "TF-IDF keywords" considers all non-stopwords with high TF-IDF scores in the text of a viewport. "topic_n" considers the most probable topic computed by LDA with n topics. "topic_group_n" considers the topic distribution computed by LDA with n topics. In contrast to "topic_n," "topic_group_n" takes into account all latent topics whose probability is greater than 0; the value of each such topic is its probability, so the most probable topic is still weighted higher than the others. Hence, "topic_group_n" is expected to provide more detail about the topic of the viewport content. Lastly, "doc2vec_m" uses a Doc2Vec vector to model the viewport text; we vary the length of the vector, m, to see its impact on performance.
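As one illustration, the "TF-IDF keywords" method could be sketched as follows using only the standard library; the tokenized input, the stopword handling, and all names are our assumptions, since the paper does not give its implementation:

```python
import math
from collections import Counter

def tfidf_keywords(viewport_texts, stopwords, top_k=5):
    """Return the top_k highest-TF-IDF non-stopwords for each viewport.

    viewport_texts: list of token lists, one per viewport (assumed
    already tokenized and non-empty).
    """
    n = len(viewport_texts)
    # document frequency: in how many viewports each term appears
    df = Counter()
    for tokens in viewport_texts:
        df.update(set(tokens))
    results = []
    for tokens in viewport_texts:
        tf = Counter(t for t in tokens if t not in stopwords)
        scores = {t: (c / len(tokens)) * math.log(n / df[t])
                  for t, c in tf.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return results
```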
To further explore the best performance, more than one feature is incorporated into the basic models at the same time. Because FM (doc2vec_150) with K = 20 and FFM (viewport) with K = 30 reach the lowest RMSD, additional features are added to these models. We select one of the four adopted methods to model the viewport content: as doc2vec already models viewport content, the other features in the same category, that is, TF-IDF keywords and LDA, are not considered in these experiments.
Table 3 shows that the FM model with doc2vec_150, channel, and viewport as additional features achieves the lowest RMSD, that is, the best performance. In other words, the dwell time of a given depth is determined by the content around that depth (captured by doc2vec), the topic of the whole article (captured by channel), and the size of the browser (captured by viewport). Similarly, according to Table 4, the best feature combination for FFM is viewport, channel, freshness, and topic20.
Page Depth-level Dwell Time Prediction
We compare the best models obtained from the previous experiment, that is, FM (doc2vec_150 + channel + viewport) with K = 20 and FFM (viewport + channel + freshness + topic20) with K = 30, with the other comparison systems. All models are applied to predict the exact dwell time of each page depth in test page views. We also add the smoothing technique for both FM and FFM. The smoothing settings used on the test sets are determined on the validation sets: the best smoothing for FM is mean with d = 3, and the best for FFM is the 75% quartile with d = 7. The results in Table 5 demonstrate that FM and FFM significantly outperform the comparison systems, with the
TABLE 2. RMSD comparison by adding one additional feature. The results are reported by selecting the best from K = 10, 20, and 30.

Feature group      Feature          FM        FFM
Basic              Basic            12.4680   12.6550
Context            Weekday          12.6575   12.4759
                   Hour             12.7563   12.4542
                   Viewport         12.2909   12.3435
User               Geo              12.5465   12.6499
Article            Length           12.6360   12.4400
                   Channel          12.3489   12.3610
                   Freshness        12.4770   12.3487
Viewport content   TF-IDF           12.6239   12.3648
                   Topic_10         12.7308   12.4205
                   Topic_20         12.3929   12.4080
                   Topic_30         12.5168   12.4911
                   Topic_group_10   12.2912   12.4134
                   Topic_group_20   12.4393   12.3831
                   Topic_group_30   12.3899   12.3773
                   Doc2vec_50       12.3042   12.5161
                   Doc2vec_150      12.2065   12.4075
TABLE 3. RMSD comparison by adding more additional features to FM (doc2vec_150).

Models                                              K = 20
FM (doc2vec_150 + viewport)                         12.2301
FM (doc2vec_150 + channel)                          12.0733
FM (doc2vec_150 + freshness)                        12.3487
FM (doc2vec_150 + channel + viewport)               12.0419
FM (doc2vec_150 + channel + freshness)              12.1985
FM (doc2vec_150 + channel + viewport + freshness)   12.1827
TABLE 4. RMSD comparison by adding more additional features to FFM (viewport).

Models                                                      K = 30
FFM (viewport + channel)                                    12.2775
FFM (viewport + freshness)                                  12.2703
FFM (viewport + topic20)                                    12.2992
FFM (viewport + doc2vec_150)                                12.4524
FFM (viewport + channel + freshness)                        12.2588
FFM (viewport + channel + topic20)                          12.2743
FFM (viewport + freshness + topic20)                        12.2599
FFM (viewport + channel + freshness + topic20)              12.2542
FFM (viewport + channel + freshness + topic20 + keywords)   12.2863
best model being FM + smoothing. This is because these models are able to overcome sparsity and capture pairwise interactions between features. The RMSDs of PageAverage and UserAverage are better than that of GlobalAverage because their predictions are tailored to each page or each user. Also, the results indicate that controlling the user variables seems to be more effective than controlling the page variables, because dwell time is influenced more by individual users' subjective behaviors. The RMSD of Regress_bc is not as low as that of UserAverage, which indicates that methods for page-level dwell time prediction cannot be easily applied to depth-level prediction. Without capturing the interactions of features, Regress_feat does not obtain predictions as good as those of FM and FFM.
The results shown in Table 5 are calculated over all test
page depths. In order to look into the performance at differ-
ent areas of pages and evaluate the robustness of the pro-
posed method, page depths are split into different buckets:
bucket1: [1%, 25%], bucket2: [26%, 50%], bucket3: [51%,
75%], and bucket4: [76%, 100%]. According to the results
shown in Figure 1, the proposed FM and FFM methods consistently outperform the others in all buckets. With smoothing, their performance is further enhanced. Unexpectedly, FFM does not outperform FM because FFM suffers from overfitting: it contains many more parameters than FM. In particular, for doc2vec_50 and doc2vec_150, which are very dense features, FFM builds multiple latent vectors for each doc2vec latent feature, and inaccurate latent vectors may impact the prediction at all depths.
Generally, the prediction error decreases with the
increase of the page depth. The reason is that most users
only read the first half of the page. Therefore, the dwell time
of the page depths near the bottom of the page is mostly
zero. Because it is easier to predict at the bottom of the
pages, the performances of all methods are closer in
bucket4, while the proposed methods are still the best.
Viewability Prediction
Viewability can be regarded as the probability that an
item (e.g., an ad) at a page depth will be viewable. This can
be treated as probabilistic classification. Therefore, we run
an experiment to evaluate whether the FM/FFM models can
handle this problem.
We vary the dwell time threshold of a viewable impression from 1s (IAB standard) to 10s. The target variable of each page depth in the data set is 1 if its dwell time is at least T seconds and 0 otherwise. In this way, the prediction problem is converted from regression to classification. The prediction outcome for each test page depth is the probability that its dwell time is at least T seconds.
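The conversion from regression targets to classification labels is a one-liner; the function name is ours:

```python
def to_viewability_labels(dwell_times, t):
    """Convert per-depth dwell times into binary viewability targets:
    1 if the depth was in view for at least t seconds, else 0."""
    return [1 if dwell >= t else 0 for dwell in dwell_times]
```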
Figure 2 shows that the FM and FFM models clearly outperform the baselines. The best smoothing function for FM is mean, with d varying from 2 to 6 depending on the minimum dwell time threshold. The best smoothing function for FFM is min, with d varying from 3 to 5. We observe that the FM + smoothing model achieves the best performance at the two ends (1s and 10s). Given a page depth, it is more challenging to predict whether the dwell time is at least 5s. The reason is that the numbers of page depths with dwell time of at least 5s and with dwell time less than 5s are very close (about 50% each); in contrast, about 70% of page depths have a dwell time of at least 1s. GlobalAverage and LogisticRegress_bc have similar performance. Also, LogisticRegress with the significant features is better than the other baselines.
TABLE 5. Depth dwell time prediction comparison (RMSD).

Approaches      RMSD
GlobalAverage   13.6971
PageAverage     13.5243
Regress_bc      13.2643
UserAverage     13.1482
Regress_feat    12.9043
FM              12.0419
FM + smooth     11.8808
FFM             12.2542
FFM + smooth    12.2510
FIG. 1. Depth dwell time prediction comparison (Buckets).
One interesting observation is that, although UserAver-
age and PageAverage outperform GlobalAverage by
RMSD, as shown in Table 5, they are much worse than
GlobalAverage by logistic loss in viewability prediction.
Also, they do not have as stable performance as the other
methods. The main reason is that most users and pages in
the test data have few historical page views in the training
data. Also, most page views have a sparse dwell time distribution, that is, the dwell times of many page depths are 0. In this case, for individual users or pages, the viewability predictions for a depth are close to 0 or 1. Once the prediction is incorrect on the test data, the penalty by logistic loss is large, because logistic loss heavily penalizes classifiers that are confident about an incorrect classification. For instance, suppose the dwell times at the 10% depth of all of a user's historical page views are 0s, 0s, 0s, and 3s. In a test page view, the user spent 1s at that page depth, which is the ground truth. Given T = 1, the prediction for the user at this depth will always be (0 + 0 + 0 + 1)/4 = 0.25. This means the classifier thinks that the depth will very likely be viewed for less than 1s. However, the logistic loss of the prediction is logloss(0.25, 1) = 5.9047, a huge penalty. This is because in classification problems it is better to be somewhat wrong than emphatically wrong. This characteristic is very important for publishers because it can help them avoid large decision-making errors.
Effects of Smoothing
To investigate how smoothing impacts prediction perfor-
mance, different smoothing settings are applied to the pre-
diction outcome of the best FM model by varying the
smoothing function f and the interval size d. Figures 3 and 4 show the results for dwell time prediction and viewability prediction (1s), respectively. For dwell time prediction, smoothing with f = mean and d = 4 obtains the best performance. For viewability prediction, smoothing with f = quartile(75) and d = 2 obtains the best performance. Both figures show that performance generally decreases for d = 5 or higher, which means that smoothing with coarse granularity tends to hurt performance. The best performance is often obtained when d is between 2 and 4. For f, min and max always produce worse performance because they use the extreme value in each interval to make the final prediction. The best smoothing setting for FM learned from the validation sets is mean with d = 3; the best for FFM is quartile(75) with d = 5.
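The smoothing step can be sketched as follows; the function names and the nearest-rank quartile convention are our assumptions, since the paper does not give pseudocode:

```python
def smooth(predictions, f, d):
    """Smooth per-depth predictions by applying aggregate f over
    consecutive intervals of d depths; every depth in an interval
    receives the aggregated value. Interval alignment (starting at
    depth 1) is assumed."""
    out = []
    for start in range(0, len(predictions), d):
        chunk = predictions[start:start + d]
        out.extend([f(chunk)] * len(chunk))
    return out

def mean(xs):
    return sum(xs) / len(xs)

def quartile75(xs):
    # 75th percentile via the nearest-rank method (one of several
    # conventions; the paper does not say which it uses)
    xs = sorted(xs)
    return xs[max(0, -(-3 * len(xs) // 4) - 1)]
```

With d = 2 and f = mean, the predictions [1, 2, 3, 4] become [1.5, 1.5, 3.5, 3.5]; min and max can be passed directly as f.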
Feature Analysis
We also look into some features and investigate how user reading behaviors are related to the feature values. This may influence advertisers' bidding behaviors as well as publishers' ad allocation strategies and website design.
Weekday and Hours
We investigate whether user reading behavior varies with
time. The long-term data are provided by Google Analytics
(GA).2 Because the time recorded in GA is the visit time converted to the timezone configured for the GA profile (the Forbes profile uses US Eastern Time), we fix the region of the visits to New York State.
Figure 5 shows website traffic and the mean page-level
dwell time on different days of the week. Although website traffic varies by day of the week, the mean page-level dwell time shows almost no fluctuation. Users spend the
FIG. 2. Viewability prediction comparison.
FIG. 3. Comparison of smoothing techniques for dwell time
prediction.
2https://analytics.google.com/
same time on pages on different days of the week. Moreover, Figure 6 presents an even clearer pattern: the page-level dwell time does not vary by hour of the day.
The dwell time distributions over all page depths on representative weekdays/hours are plotted in Figures 7 and 8. These weekdays/hours are the time points with either the longest or the shortest mean page-level dwell time. Similar to the page-level dwell time, the depth-level dwell time is not much influenced by time either.
Studies (Yuan, Wang, & Zhao, 2013; J. Wang & Yuan, 2015) discover that, in the current pay-by-impression pricing model, the winning bid prices vary significantly by hour of the day. However, our research finds that page depth-level dwell time does not vary much: the chance that an ad is exposed on screen for long enough at midnight is the same as in the daytime. Through this research, advertisers can hopefully realize that impressions at midnight do not have much lower viewability. Hence, they do not need to compete with each other in the daytime and consequently pay higher prices for marketing opportunities that they could also get during non-peak time.
Channels
In each of the six primary channels on the Forbes website, 2,000 page views are randomly sampled. For each page view, the dwell time the user spent on every page depth is calculated; thus, each page view has a vector of length 100, in which each value is the time the user spent on the corresponding page depth. For each channel, the centroid of its 2,000 vectors is calculated by averaging. The six centroids can be considered summaries of the dwell time patterns of the corresponding channels and are plotted in Figure 9.
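The centroid computation above amounts to a column-wise average; a minimal sketch, with the dict layout as our assumption:

```python
def channel_centroids(pageviews_by_channel):
    """Average the per-depth dwell-time vectors of the sampled page
    views in each channel (length 100 in the paper) to get one
    centroid vector per channel.

    pageviews_by_channel: dict mapping channel name -> list of
    equal-length dwell-time vectors.
    """
    centroids = {}
    for channel, vectors in pageviews_by_channel.items():
        n = len(vectors)
        # column-wise mean across the sampled page views
        centroids[channel] = [sum(col) / n for col in zip(*vectors)]
    return centroids
```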
All six plots indicate that users usually spend more time on the first half of the page than on the second half. Also, the top few percent of the page are usually skipped because this area is always the menu bar. However, the patterns of individual channels are not identical. Users tend to spend less time on the lifestyle channel, which usually publishes web articles about travel, sports, and autos; intuitively, users may not read every single sentence on these pages. On the other hand, users spend a long time on the opinion channel, which publishes updated analyses of popular news. This is reasonable in that these opinion articles are original and can attract users to read the authors' points. Likewise, as the most well-known product, the lists channel, which usually publishes rankings, receives high engagement on the first half of the page, whereas users quickly lose attention on the second half. The possible reason is that most users only focus on the top positions when reading a list. In addition, although business and Asia share very similar patterns across page depths, Asia receives slightly longer dwell times. Publishing articles about the economy and billionaires of Asia, the Asia channel has significantly more Asian visitors. Because of the language barrier and relatively slow network connections, Asian visitors usually spend relatively more time on pages.
Viewports
We also investigate user reading behaviors on different viewports. Because users may adjust their browsers to many different sizes, we group viewport sizes by every 100 pixels. For example, "320x520" is represented as "3x5."
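The bucketing follows directly from the example; integer division by 100 (our reading of it) reproduces the paper's "320x520" -> "3x5" mapping:

```python
def viewport_bucket(width, height):
    """Group a viewport size into 100-pixel buckets,
    e.g. 320x520 -> "3x5"."""
    return f"{width // 100}x{height // 100}"
```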
Only popular viewport sizes are considered in this experiment. According to an online public resource,3 viewport sizes are grouped into four categories, which represent the four main display devices: (a) Mobile: "3x6," "3x3," and "3x5"; (b) Tablet: "7x9," "7x10," "10x7," and "10x9"; (c) Laptop: "13x6," "13x7," "12x6," and "12x7"; (d) Big screens: "25x12" and "25x13." In each category, 2,000 page views are randomly sampled. The results are shown in Figure 10.
People generally spend less time on mobile devices. Existing research shows that people usually use mobile devices for casual reading (Cui & Roto, 2008); in this case, users may not stay long on pages. Also, the dwell time distribution on mobile devices seemingly has two peaks: one near the 30% depth and the other near the 60% depth. The reason may be that flicking fingers on the screen is as easy as scrolling the wheel of a mouse (Kyle, 2013). In contrast, the dwell time distribution on tablet devices is smoother because a tablet has a bigger screen than a mobile device. Thus, when a user is reading the first/last paragraph, the middle part is also in view, so the dwell time in the middle of a page is not significantly lower than that in the two tails. According to Figure 10, dwell time increases with viewport size. The main reason is that bigger viewports can display more content, so users stay longer reading the depths displayed in the viewport without much scrolling.
Feature Interactions
FM and FFM build low-rank matrices that consist of latent vectors for each feature. Using these latent vectors, they compute pairwise interactions via dot products. The assumption is that the relationship between the independent variables and the target variable is not linear: the value of the
FIG. 4. Comparison of smoothing techniques for viewability prediction
(1s).
3http://viewportsizes.com/
FIG. 5. Day of week vs. traffic and mean page-level dwell time (New York State; 05Sunday).
FIG. 6. Hour of day vs. traffic and mean page-level dwell time (New York State).
FIG. 7. The comparison of mean depth-level dwell time on Wednesday and Saturday (in seconds).
FIG. 8. The comparison of mean depth-level dwell time on different hours of day (in seconds).
target variable is determined by the interactions of the independent variables. We accessed the trained low-rank matrix of FM to investigate the interaction of users and channels. This study is meaningful because: (a) our experiments show that channel is a significant feature in both the FM and FFM models; (b) the interaction of users and channels can help publishers understand user interest and thus recommend web articles.
We first use the FM model with the best feature set and K = 20 to predict depth-level viewability. We then store the final latent vectors of all users and channels. For each user, we calculate the dot products of its latent vector with all channel latent vectors. There is a positive correlation between these dot products and the engagement of the user on a page from a channel: a large dot product may lead to high viewability and dwell time. Therefore, each user is represented by a vector consisting of dot products. This matrix can be used to cluster similar users based on dwell time behavior and then make recommendations. We observe from the resulting matrix of dot products that several channels tend to always have high engagement, for example, Asia and opinions, or low engagement, for example, lifestyle. We remove the columns of these
FIG. 9. The comparison of mean depth-level dwell time across channels (in seconds).
channels from the matrix to eliminate their overall bias. We also remove the dummy column, which represents the "unknown" channel.
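Building the user-by-channel matrix of dot products can be sketched as follows; the dict containers for the trained FM latent vectors are a hypothetical layout, not the paper's code:

```python
def interaction_matrix(user_vecs, channel_vecs):
    """Dot products between every user latent vector and every channel
    latent vector learned by FM. Larger values suggest higher engagement
    of the user with pages from that channel.

    user_vecs / channel_vecs: dicts mapping id -> latent vector
    (list of K floats).
    Returns (sorted channel ids, {user: row of dot products}).
    """
    channels = sorted(channel_vecs)
    matrix = {}
    for user, u in user_vecs.items():
        matrix[user] = [sum(a * b for a, b in zip(u, channel_vecs[c]))
                        for c in channels]
    return channels, matrix
```

Dropping the columns of uniformly high- or low-engagement channels, as the text describes, is then just a matter of deleting those channel positions from every row.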
Figure 11 presents a sample heatmap based on the matrix, for randomly selected users. In the figure, a darker color means a larger dot product value. It shows that users have different tastes in channels: controlling the user variable, it is hard to say which channel leads to a large dot product. For example, the 9th user has medium interest in the other channels but very high interest in technology; the situation is the opposite for the 11th user, who seems to have no interest in technology. Hence, viewability does not change linearly with the channels. The same conclusion also holds for users. Therefore, the interaction of users and channels should be captured to predict viewability. This explains why the FM-based models, which consider feature interactions, outperform the baselines in our experiments.
As Section 3.4 explains, the FM models represent each user as a latent vector, and the similarity between two user latent vectors describes the similarity between their interests. Figure 12 visualizes two randomly chosen users whose latent vectors have a small cosine distance. As in Figure 11, a darker color means a larger dot product value of user, depth, and channel. Figure 12 shows that both users are interested in the business and investing channels, but are not interested in the leadership, lists, and technology channels. There are a few shared-interest depths in the articles of the business channel, such as depths 3, 20, and 50. The possible reason is that it is easy to get the main points of business articles, and thus users view them fast. Publishers can place ads on the segments of high values in the business channel in order to increase ad viewability.
FIG. 10. The comparison of mean depth-level dwell time across viewport categories (in seconds).
FIG. 11. Sample heatmap of the FM dot product matrix for latent vec-
tors of users and channels.
But because there are many segments of high value in the investing channel, publishers can choose to balance user experience and ad viewability there. Therefore, publishers can use this type of visualization in ad placement decision-making.
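The cosine measure used to pick similar users is standard; a minimal stdlib version:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two user latent vectors; values near 1
    indicate users with similar channel interests (cosine distance is
    1 minus this value)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm
```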
Time Series Model for Page Depth-level Dwell Time Prediction
So far, our FM and FFM models have been designed to
predict depth-level dwell time before a page is loaded.
Another way to further improve the prediction is to leverage
user browsing behavior while reading the page. For exam-
ple, if we know the user behavior in the first half of the
page, we could attempt to predict the depth-level dwell time
in the second half of the page.
Making predictions during page reading is significant and feasible. To guarantee ad viewability, publishers may choose to hold off selling an ad until a user scrolls to the position of the ad. A new ad viewability prediction algorithm can then be used in real time to predict how long the user will stay at that position. Based on discussions with a large publisher, we know that such an algorithm is feasible if a prediction can be made within 100ms.
In order to capture the user reading behavior on a page, we define a browsing action as a triplet (top, bottom, dwell), where top and bottom are the positions of the first and the last line of the viewport, respectively, and dwell is the dwell time the user spends at this position. In our data set, once a user stays at a position for one second, an action is recorded in the user log; a page view must contain at least one action. Figure 13 shows an example of a page view that contains three user browsing actions. The actions occur sequentially based on user scrolling. In this action-level application, the increment is not 1% of the page depth but is driven by user scrolling. Therefore, for example, we can predict the dwell time of the third action based on the previous one or two actions, which have been observed.
Formally, adding time series information, we define a
new problem setting based on browsing actions:
FIG. 12. Sample heatmap of the FM dot product matrix for latent vectors of depths and channels for two users with similar interests.
FIG. 13. An example of a page view which contains three browsing
actions.
TABLE 6. Results of the time series model.

Window size   Metric         FM
h = 0         RMSD           10.4663
              Logloss (3s)   0.4740
              Logloss (7s)   0.6560
h = 1         RMSD           10.1801
              Logloss (3s)   0.4713
              Logloss (7s)   0.6520
h = 2         RMSD           10.0935
              Logloss (3s)   0.4649
              Logloss (7s)   0.6425
Problem Definition 2. Given the previous h browsing actions A = (top_{i-h}, bottom_{i-h}, dwell_{i-h}), ..., (top_{i-1}, bottom_{i-1}, dwell_{i-1}) that a user u just conducted on a webpage a, and the ith position (i.e., top_i and bottom_i), the goal is to predict the dwell time dwell_i that u will stay at the ith position. In practice, the ith position can be the current position to which the user just scrolled.
The FM model that performs best in the Experimental Evaluation section can be extended to this action-level prediction. In particular, the extended model should take into account the previous user actions, the current position of the user, and the change between adjacent actions. Therefore, in addition to the best feature combination, we add features for the time series setting: the current position of the viewport top and bottom (i.e., top_i and bottom_i), the delta distance (i.e., top_i - top_{i-1}), and the dwell time of the previous action (i.e., dwell_{i-1}). These features can be easily measured or calculated at the time the prediction is made. Note that the first action of a page view, which has no previous action, can be predicted by the models proposed in the Factorization Machines section.
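The extra time-series features can be assembled as follows; the feature names and the dict output are our illustration of the features listed above, not the paper's code:

```python
def action_features(actions, i, h):
    """Extra features for predicting dwell_i from the previous h actions.

    actions: list of (top, bottom, dwell) triplets in scroll order.
    Returns the current viewport position, the delta distance from the
    previous action, and the previous h positions and dwell times.
    Requires i >= h >= 1.
    """
    top_i, bottom_i, _ = actions[i]
    feats = {"top": top_i, "bottom": bottom_i,
             "delta": top_i - actions[i - 1][0]}
    for k in range(1, h + 1):
        p_top, p_bottom, p_dwell = actions[i - k]
        feats[f"prev{k}_top"] = p_top
        feats[f"prev{k}_bottom"] = p_bottom
        feats[f"prev{k}_dwell"] = p_dwell
    return feats
```

Each row of the h = 2 data set is then the h = 1 row plus the prev2_* features of the same action, as described below.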
In the experiments, we consider h = 1 and h = 2. The data set we collect contains only the page views that have more than 2 actions. The data set is then transformed into an appropriate input format, in which each row is an action in a page view. The data sets for h = 1 and h = 2 have the same set of actions; the only difference is that each row of the data set for h = 2 has additional features about the (i - 2)th action. The data sets contain about 400K actions from about 70K page views. They are partitioned into training, validation, and test sets by 8:1:1. Note that, because one test action has only one output, the proposed smoothing technique is not applicable here.
Table 6 presents the experimental results. The FM model is capable of handling time series information. FM with h = 0 does not use any time series information; it predicts the dwell time or viewability of an action based only on the information at that action. FMs with h = 1 and h = 2 use time series information from past actions in the same page views. The results show that FM with h = 1 and h = 2 performs better than FM without time series information by both RMSD and logloss. This reflects that adding more information about user engagement within the page view leads to better performance. In addition, all results for h = 2 are better than the corresponding results for h = 1, which indicates that adding more previous actions may further improve the performance. However, in practice, a large h limits the usage of the model because the model requires at least h previous actions to make a prediction. In this case, separate models need to be built for actions that have fewer than h preceding actions.
Conclusions
The emerging ad pricing model prices ads by the number of impressions viewed by users, instead of the number of ads served to webpages. The publishers and advertisers that use this model demand prediction of the dwell time at each depth where an ad is placed. This prediction can help maximize the publishers' profit and improve the advertisers' return on investment. However, it was an open problem until now.
This paper presents the first study of depth-level dwell time prediction. We propose predictive models based on FM and FFM, and add a smoothing technique to further improve the performance. In addition, we use the prior dwell time behavior of the user within the current page view, that is, time series information, and apply it in the FM models. Using real-world data in our experiments, we show that our models outperform the comparison models. In particular, the FM model with viewport, channel, and Doc2Vec features obtains the best performance. Our experiments also demonstrate that adding time series information further improves the performance.
Finally, we extracted the latent feature vectors and pro-
vided an analysis of some feature combinations. The insights
gained from this analysis can be applied to help a publisher
understand user behavior patterns and enhance its business
strategies.
Acknowledgement
This work is partially supported by NSF under grants No.
CAREER IIS-1322406, CNS 1409523, and DGE 1565478,
by a Google Research Award, and by an endowment from
the Leir Charitable Foundations. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
References
Cui, Y., & Roto, V. (2008). How people use the web on mobile devices.
In WWW’08: Proceedings of the 17th international conference on
World Wide Web (pp. 905–914).
Google. (2014). The importance of being seen. https://think.storage.googleapis.com/docs/the-importance-of-being-seen_study.pdf.
Hof, R. (2014). Digital ad fraud is improving - but many ads still aren't seen by real people. https://www.forbes.com/sites/roberthof/2014/01/29/digital-ad-fraud-is-improving-but-many-ads-still-arent-seen-by-real-people/.
Hong, L., Doumith, A. S., & Davison, B. D. (2013). Co-factorization
machines: modeling user interests and predicting individual decisions
in twitter. In WSDM’13: Proceedings of the 6th ACM International
Conference on Web Search and Data Mining (pp. 557–566).
Juan, Y., Zhuang, Y., Chin, W.-S., & Lin, C.-J. (2016). Field-aware fac-
torization machines for ctr prediction. In RecSys’16: Proceedings of
the 10th ACM Conference on Recommender Systems (pp. 43–50).
Kim, Y., Hassan, A., White, R. W., & Zitouni, I. (2014). Modeling
dwell time to predict click-level satisfaction. In WSDM ’14: Proceed-
ings of the 7th ACM International Conference on Web Search and
Data Mining (pp. 193–202).
Kyle, S. (2013). Experimenting in Loyalty Conversion with WNYC:
Achieving Mobile-Desktop Parity. http://blog.chartbeat.com/2013/10/
07/experimenting-loyalty-conversion-wnyc-achieving-mobile-desktop-
parity/.
Lagun, D., & Lalmas, M. (2016). Understanding user attention and
engagement in online news reading. In WSDM ’16: Proceedings of
the 9th ACM International Conference on Web Search and Data Min-
ing (pp. 113–122).
Le, Q. V., & Mikolov, T. (2014). Distributed representations of senten-
ces and documents. In Proceedings of the 31st International Confer-
ence on Machine Learning (pp. 1188–1196).
Liu, C., White, R. W., & Dumais, S. (2010). Understanding web brows-
ing behaviors through weibull analysis of dwell time. In SIGIR '10:
Proceedings of the 33rd International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 379–386).
Qiang, R., Liang, F., & Yang, J. (2013). Exploiting ranking factorization
machines for microblog retrieval. In CIKM’13: Proceedings of the
22nd ACM International Conference on Information and Knowledge
Management (pp. 1783–1788).
Rendle, S. (2012). Factorization machines with libFM. ACM Transac-
tions on Intelligent Systems and Technology (TIST), 3(3), 57:1–57:22.
Schmidhuber, J. (2015). Deep learning in neural networks: an overview.
Neural Networks, 61, 85–117.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for rec-
ommending scientific articles. In KDD'11: Proceedings of the 17th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (pp. 448–456).
Wang, C., Kalra, A., Borcea, C., & Chen, Y. (2015). Viewability predic-
tion for online display ads. In CIKM'15: Proceedings of the 24th
ACM International Conference on Information and Knowledge Man-
agement (pp. 413–422).
Wang, C., Kalra, A., Borcea, C., & Chen, Y. (2016). Webpage depth-
level dwell time prediction. In CIKM’16: Proceedings of the 25th
ACM International Conference on Information and Knowledge Man-
agement (pp. 1937–1940).
Wang, C., Kalra, A., Zhou, L., Borcea, C., & Chen, Y. (2017). Probabilis-
tic models for ad viewability prediction on the web. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 29(9), 2012–2025.
Wang, J., & Yuan, S. (2015). Real-time bidding: a new frontier of computa-
tional advertising research. In WSDM’15: Proceedings of the 8th ACM
International Conference on Web Search and Data Mining (pp. 415–416).
Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Inter-
preting tf-idf term weights as making relevance decisions. ACM
Transactions on Information Systems (TOIS), 26(3), 13:1–13:37.
Yi, X., Hong, L., Zhong, E., Liu, N. N., & Rajan, S. (2014). Beyond
clicks: dwell time for personalization. In RecSys’14: Proceedings of
the 8th ACM Conference on Recommender Systems (pp. 113–120).
Yin, P., Luo, P., Lee, W.-C., & Wang, M. (2013). Silence is also evi-
dence: interpreting dwell time for recommendation from psychologi-
cal perspective. In KDD’13: Proceedings of the 19th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining
(pp. 989–997).
Yuan, S., Wang, J., & Zhao, X. (2013). Real-time bidding for online
advertising: measurement and analysis. In Proceedings of the Seventh
International Workshop on Data Mining for Online Advertising (pp.
3:1–3:8).
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2018
DOI: 10.1002/asi