A novel point of interest (POI) location-based recommender system utilizing user location and web interactions
Mayy Habayeb1, Behjat Soltanifar2, Bora Caglayan3, Ayse Bener4
Data Science Lab
Department of Mechanical and Industrial Engineering
Ryerson University, Canada
(mayy.habayeb1, behjat.soltanifar2, bora.caglayan3, ayse.bener4)@ryerson.ca
Abstract—Location-aware mobile devices have increased the availability of user trajectory information, making point of interest recommenders a popular service on mobile devices. However, one of the main challenges in this area is the sparsity of historical trajectory data. So far, most recommender systems take users' historical trajectory information into consideration to recommend different places. Web interactions reveal rich information on user interests, and hence a recommender system should take such data into consideration.
In this study, we present a model that combines and associates users' interest/taste information, obtained from their web interactions, with location information obtained from the Open Street Map (OSM). Then, we combine this information with the users' real-time trajectory information (longitude, latitude and timestamp) to present a list of recommended points of interest close to the current location.
Keywords—Location-based recommender system, Recommender System, Big Data, Point of Interest.
I. INTRODUCTION
Mobile devices detect various contexts in the environment,
such as the location, the surroundings, people in the vicinity,
or the weather forecast. The user could be at the train station,
in a shopping mall or even in a different city. These shifts
in interaction, as well as the availability of vast amounts of
user data and context data, bring an opportunity for
the company to serve its customers better by anticipating
their preferences in a dynamic manner. Once location opt-in
function is enabled on mobile devices, several companies
collect user location data based on user consent in order
to better understand user behaviour and therefore provide
contextual personalization. They also collect user data from
many different sources such as browsing history, application
usage, device usage and social graph. Combining both user
location information and user web interactions and further
integrating with external data sources is a challenge. This
is a big data problem not only because the data volume
is massive, but also because data comes in both structured
and unstructured formats and the complexity requires new
algorithms to tackle the personalization problem.
In the past few years we have seen the rise of Location-
Based Social Networks (LBSNs) [8][3]. LBSNs can gener-
ally be classified into three major groups: a) Geo-tagged-
media-based (Flickr, Geo-twitter), b) Point location-based
(Foursquare, Google Latitude), c) Trajectory based (Bikely,
Geo-life). Many interesting research projects [14][15][17]
have been done around these LBSNs for location-based
recommendation. Common approaches to the recommendation
problem include constructing the User-User association
graph, the User-Location association graph and the Location-
Location graph¹.
Our case study relates to a large North American Tele-
communications company. In the past two years, the Data
Science team at the company has built a personalized
interest graph for each individual user to drive personalized
ads targeting and application recommendation services. The
interest graph leverages user browsing data, application
usage data and many other data sources. With social graph,
interest graph and the location data as valuable data assets,
the company sees a big opportunity to leverage these data
to drive location-based recommendations to the end users.
However, the challenges are manifold: a) the company
has little previous experience with LBSNs; b) harnessing
the heterogeneous data for recommendation is a challenging
problem; c) the interest graph, social graph and location graph
are complex graph problems in their own right, and combining
these graphs to solve the business problem poses new
challenges; d) the data volume is huge and requires scalable
algorithms that run on distributed platforms such as Hadoop
or graph databases such as Neo4j or Titan.
Information filtering systems gained their current popu-
larity alongside the growth of the internet. Popular sites like
Stack Overflow, Amazon and Google employ information fil-
tering systems that guide user decisions with minimal effect
on the core functionality. We can group information filtering
systems as collaborative, context-based, and hybrid [11].
Some problems in these systems include data sparsity, cold
start and rating subjectivity. The information given by the
system depends on the user’s situation. The trend today is to
recommend relevant information to users, using supervised
machine learning techniques. However, there is a need to
address the concerns of cold start, data sparsity, and rating
¹http://research.microsoft.com/en-us/projects/lbsn/
2016 IEEE Second International Conference on Big Data Computing Service and Applications
978-1-5090-2251-9/16 $31.00 © 2016 IEEE
DOI 10.1109/BigDataService.2016.42
121
subjectivity problems, and to adapt a collaborative information
filtering system to the user in a ubiquitous environment.
We can draw inspiration from the models of human reasoning
developed in robotics, combining reinforcement learning and
case-based reasoning to define a contextual recommendation
process based on different context dimensions (cognitive,
social, temporal, geographic).
The flow of this paper is as follows: In section II we
provide a summary of related work. In section III we
describe our methodology, results of our exploratory analysis
and details of our proposed model. In section IV we evaluate
our model. In section V we describe threats to our work and
finally in section VI we summarize and conclude our work.
II. RELATED WORK
In 2008 Li et al. [5] investigated measuring user similari-
ties based on user location histories. They proposed a frame-
work “Hierarchical Graph-based Similarity Measurement”
(HGSM), that took into consideration the sequence property
of people’s movement behaviours and the hierarchy property
of geographic spaces. For their research they monitored 65
volunteers for a period of six months. They compared their
framework (HGSM) with other similarity measures, such as
cosine similarity and Pearson correlation, and reported more
accurate results using the HGSM measure. Similarly, Zheng
et al. [18] presented a model to mine interesting locations
and classical travel sequences by modelling user location
histories in a tree-based hierarchical graph, then proposed a
Hypertext Induced Topic Search (HITS)-based inference model
to relate the user to a location.
Xiao et al.[14] further continued in the same field by
trying to identify user similarity based on semantic location
history instead of low-level geographic locations; they use
a Maximum Travel Match algorithm to calculate the user
semantic location history similarity. We differ from their
approach as we focus on user taste/interest similarity that
is mapped back to physical Points of Interest.
Zheng et al. [16] investigated the possibility of recommend-
ing activities related to the recommended location in addition
to the recommended location. They used collective matrix
factorization to mine interesting locations and activities; their
work is similar to ours in terms of modelling the semantic
categories of POIs. In 2010, Y. Zanh et al. [17] proposed a
personalized location recommendation approach that uses the
correlation between locations, considering the sequentiality
of the locations visited by a specific user. They evaluated
their algorithm using GPS dataset information of 112 users
over a time period of 1.5 years. In 2012, L.-Y. Wei et al. [12]
proposed a framework,
called the Route Inference framework based on Collective
Knowledge (RICK), which helps to extract popular routes
from uncertain trajectories. Given a time span and a location
sequence, this model recommends the top-k popular routes,
inferred from the historical check-in records of those places
generated by other tourists. Also in 2012, Bao et al. [1]
tackled the problem of sparse geo-social data in the
user-location matrix. They used a Foursquare dataset to build
a location recommender that models each individual's personal
preference with a weighted category hierarchy (WCH). They
use users' check-in data to build the category hierarchy and
to compute user location history similarity.
In our case study we differ from them as we use the Open
Street Map and we do not have check-in data from other
users.
Other studies, carried out by the Foursquare research group
[10], [9], focus on utilizing users' check-in data to introduce
various recommender applications, such as recommending
interesting events in real time [10]. In [15], the authors
estimated user similarity by modelling the semantic location
history of a user's GPS trajectories and using a maximum
travel match algorithm.
Our work differs from the work described above mainly
in two aspects. First, we do not build on the history of user
trajectories [5][18] nor on route trajectories [12], as this
is an expensive operation to maintain and is subject to
sparsity of data and signal transmissions. Instead, we use the
user transmitted location and the points of interest within
a certain diameter combined with user web interactions
to provide recommendations. Second, we capitalize on the
internal customer data available within the organization and
model the similarities between the user web interactions
and the points of interest around the user location for our
recommender system; in addition, if user web interactions
are not available we use user-user similarities to provide
recommendations.
III. METHODOLOGY
In this section we describe the steps of the methodology we
used to address the research problem. First, we carried
out an exploratory analysis exercise; then we identified key
strengths; and finally we built a high-level model.
A. Data collection & exploratory data analysis
We examined three dimensions for this step:
1) User trajectory information: We collected users’ his-
torical trajectory data and investigated it in order to identify
any trends. We looked at the number of transmitted signals.
For this exercise we agreed with the company to retrieve
data related to all users subscribed in one city, Waterloo, in
Canada, and to extract all their historical trajectory data over
a period of one month, from 27/4/2014 to 29/5/2014. In
total there were 284,744 observations. Table I shows
all the countries visited by this city's users:
We noted that most of the signals (97.5%) were transmitted
from Canada, which is not a surprising result; yet 60 other
countries were also visited, although the time period was
not a seasonal vacation period. Figure 1 visually illustrates
our findings. We further analysed the data in terms of the
most frequently visited cities. Table II illustrates the
Table I
COUNTRIES VISITED DURING 25/4/2014 TO 29/5/2014

Country  AE  AG  AU  AZ  BB  BE  BR  BS  CA       CH  CL  CN  CO  CR  CU  CZ
Count    10  38  16   5   1   5  63  30  277,786  13   2  44   1  53   2   7

Country  DE  DK  DO   EG  ES  FI  FR  GB  GR  HK  HR  HT  HU  ID  IE  IL
Count    32   4  155   1  11   4  31  78   4   6   7  41   4   2  57   3

Country  IN  IR  IS  IT  JM  JP  KN  KR  MT  MU  MX  MY  NG  NL  null  PA
Count    23   1   1  62  18   2  37  10   3   2  78  13  13   4    69   2

Country  PL  PR  PT  RU  SA  SI  TH  TR  US     VE  VI  VN  YE
Count     7   2  21   1   1   2   9   3  5,813  20   1   9   1
Figure 1. GPS signals transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
top 20 cities visited. As can be seen from Table II, the home
city was where most signals were transmitted, followed by
the nearby cities. This indicates that most of the users live
and work in their home city.
Table II
MOST VISITED CITIES WITHIN CANADA, PERIOD 25/4/2014 TO 29/5/2014

Rank  City         No. of signals    Rank  City         No. of signals
1     Waterloo     191,346           11    Brampton     1,773
2     Kitchener    27,168            12    Orangeville  1,624
3     Cambridge    9,918             13    London       1,620
4     Guelph       5,986             14    Hamilton     1,275
5     Stratford    4,770             15    Woodstock    1,141
6     Toronto      2,998             16    Burlington   1,026
7     Owen Sound   2,969             17    Ottawa       842
8     Etobicoke    2,503             18    Huntsville   811
9     Milton       2,419             19    Collingwood  786
10    Mississauga  1,917             20    Willowdale   775

Total: 263,667
We followed this by taking a closer look at the trajectory
signals transmitted per user. Figure 2 below shows a box
plot of all the user signals transmitted during the period of
32 days from 27/4/2014 to 29/5/2014.
Figure 2. Frequency of signals (GPS and WiFi) transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
From Figure 2 we note that the median number of signals
transmitted per user is 63 over the period of 32 days. On
average, each user transmits 2 signals per day; users in the
first quartile transmit only 1 signal per day, and those in the
third quartile only 3 signals per day. This indicates sparsity
of the data. To confirm this further, we wrote code to
calculate the time difference between the signals emitted by
each user. Our calculations reveal that the median time
difference between signals per user is 5.999 hours, while
the mean is 8.744 hours.
These results indicate sparse location history data which
would not enable us to further carry out analysis at the level
of location history similarity as per the previous research
[5], [14], [16].
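The inter-signal gap analysis above can be sketched in a few lines of Python; the timestamps below are hypothetical and only illustrate the computation, not the company's data:

```python
from datetime import datetime
from statistics import median

def signal_gaps_hours(timestamps):
    """Return the gaps (in hours) between consecutive signals of one user."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]

# Hypothetical signals for a single user
signals = [
    datetime(2014, 4, 27, 8, 0),
    datetime(2014, 4, 27, 14, 0),
    datetime(2014, 4, 28, 9, 30),
]
gaps = signal_gaps_hours(signals)       # -> [6.0, 19.5]
print(round(median(gaps), 2))           # -> 12.75
```

Running this per user and taking the median of the gaps is how a sparsity figure such as the 5.999-hour median above can be obtained.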
We selected the most active user, i.e., the user who
transmitted the highest number of signals during the 32-day
time period, and plotted all 1,592 of their trajectory points
on a heat map, as shown in Figure 3. As can be noticed,
there were six areas, marked in red, that this user visited.
This result confirms previous research results that people
tend to follow a certain pattern in their movements and that
there is a lot of duplicate data in location history data.
We looked at the temporal aspect of the data in order to
see whether there were certain time spans, such as weekends,
in which users would transmit additional information. Table III
illustrates the outcome of this exercise. Column one is the
Figure 3. Most active user trajectory movements during the period 27/4/2014 to 29/5/2014
day of the week, column two reflects the total number of
signals emitted on that day, column three shows the daily
average. As a result, there were no clear or obvious days
of the week where users would transmit additional GPS
trajectories.
Table III
SIGNAL DISTRIBUTION BY DAY OF THE WEEK

Day        # Observations  Average
Thursday   35,787          7,157.4
Friday     37,920          7,584
Saturday   39,127          7,825.4
Wednesday  40,754          8,150.8
Sunday     42,830          8,566
Monday     43,562          8,712.4
Tuesday    44,764          8,952.8
We looked at the type of signals transmitted, GPS or
WiFi. We noted that most of the signals came from
WiFi; this indicates that users prefer to connect to WiFi
and transmit location information, most likely because they
won't incur charges. Figure 4 illustrates the split. We further
investigated the time intervals between signal transmissions
per user, in order to observe whether there is a difference in
the frequency of transmission between WiFi and GPS.
Figure 4. Type of signals (GPS and WiFi) transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
Figure 5 displays these results.

Figure 5. Time intervals between signal transmission per user, per type of signal (GPS and WiFi), transmitted by Waterloo users, period 27/4/2014 to 29/5/2014

As can be noted, the median interval for GPS is 16,250 seconds
(about 4.5 hours), while for WiFi it is 21,600 seconds (6 hours),
about a third higher. Because most of the signals are transmitted
through WiFi, the overall median remains close to 6 hours.
2) Open Street Map: We investigated the Open Street
Map (OSM) [6] because it would be the main source for
the point of interest (POI) recommendations. The Open
Street Map is a collaborative project with the purpose of
creating a free, editable map of the world. Volunteers collect
the data across the world; in addition, some government
agencies have released official data under appropriate licenses
[6]. Much of this data has come from the United States.
Over the past few years the map has gained a lot of
popularity, and many leading companies, such as Apple and
Flickr, have used OSM data as infrastructure
for their geo-applications. OSM uses a topological data
structure, with four core elements [6]. First: nodes are points
with a geographic position, stored as coordinates (pairs
of a latitude and a longitude). Outside of their use in
ways, they are used to represent map features without a
size, such as points of interest or mountain peaks. Second:
ways are ordered lists of nodes, representing a polyline, or
possibly a polygon if they form a closed loop. They are
used both for representing linear features, such as streets
and rivers, and areas, like forests, parks, parking areas and
lakes. Third: relations are ordered lists of nodes, ways and
relations (together called "members"), where each member
can optionally have a "role" (a string). Relations are used
for representing the relationships between existing nodes and
ways. Examples include turn restrictions on roads, routes that
span several existing ways (for instance, a long-distance
motorway), and areas with holes. Fourth: tags are key-value
pairs (both arbitrary strings). They are used to store metadata
about the map objects (such as their type, their name and
their physical properties). Tags are not free-standing, but are
always attached to an object: to a node, a way, a relation,
or to a member of a relation. A recommended ontology
of map features (the meaning of tags) is maintained on
a wiki. OSM represents physical features on the ground
(e.g., roads or buildings) using tags attached to its basic
data structures (its nodes, ways, and relations). Each tag
describes a geographic attribute of the feature being shown
by that specific node, way or relation [8].
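As a minimal sketch of the tag model described above, a node and its tags can be represented with plain Python dictionaries; the element id, coordinates and tag values below are invented for illustration:

```python
# A hypothetical OSM node for a POI, as a plain dict:
node = {
    "id": 123456,      # hypothetical element id
    "lat": 43.4668,
    "lon": -80.5164,
    "tags": {"amenity": "cafe", "name": "Example Cafe", "cuisine": "coffee_shop"},
}

def is_poi(element, wanted_keys=frozenset({"amenity", "shop", "tourism", "leisure"})):
    """True if the element carries at least one tag key we treat as a POI."""
    return bool(wanted_keys & element["tags"].keys())

print(is_poi(node))  # -> True
```

A node with only, say, a `highway` tag would be rejected by the same check, which is the essence of the category filtering described later in this section.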
The logical structure (the features of the map) is grouped
into 26 main categories, each of which has further sub-
categories [6]. Table IV illustrates the main feature groups.
For the purpose of this project we investigated each of
Table IV
CATEGORIES OF OSM

1.1  Aerialway    1.14  Man Made
1.2  Aeroway      1.15  Military
1.3  Amenity      1.16  Natural
1.4  Barrier      1.17  Office
1.5  Boundary     1.18  Places
1.6  Building     1.19  Power
1.7  Craft        1.20  Public Transport
1.8  Emergency    1.21  Railway
1.9  Geological   1.22  Route
1.10 Highway      1.23  Shop
1.11 Historic     1.24  Sport
1.12 Landuse      1.25  Tourism
1.13 Leisure      1.26  Waterway
the above categories and subcategories and found that a
lot of the information is irrelevant to our project, for example
roads, highways, railways and waterways. Accordingly, we
selected the main categories that would be relevant for
point of interest (POI) recommendation and filtered them
for two geographic regions, in order to carry out some
initial data explorations and test the quality of the data. The
regions were identified with the company, and we agreed
to investigate one region in Canada (Waterloo) and one
region in the United States (New York - Manhattan). The
reasons behind these selections were severalfold. First,
the company had another source for POIs, only for
US users, so it made sense to look at this new source.
Second, the company had built an off-line recommender
based on the last signal received from the user, showing the
nearest ten locations from the second US source; these
recommendations were based on the previous end-of-day data.
Third, the company had a lot of users in North America,
and these users frequently travelled between Canada and the
USA. Fourth, despite the regions being in two countries,
the proximity between the locations was interesting. Fifth,
the nature of the two regions was quite different and worth
comparing. For the purpose of this investigation we had
to first identify a set of meaningful key values, then
carry out data extractions using the Pig Latin programming
language, and finally plot the retrieved data points using a
heat map package in R. Figure 6
illustrates different category selections for the Manhattan region:
one focusing only on food and drink POIs, versus food and drink,
shops, leisure, sports, art and history POIs. As can be noted,
the number of POIs increases drastically with the inclusion of
extra categories. This might confuse the end user and provide
Figure 6. POIs for Manhattan: left side, only food and drink; right side, food and drink and other categories
irrelevant information. In Figure 7 we compare the presence
of POIs in the two regions, Waterloo and Manhattan, for the
same set of categories.

Figure 7. POIs in Waterloo versus POIs in Manhattan

As noted before, the density pattern between the two regions
under study is quite different. In Waterloo we noted several
focal points, while in Manhattan the POIs seemed continuous.
This is due to the nature of the regions: while Waterloo is
considered a business-residential region, Manhattan has more
of a touristic character, and thus POIs may appear every few
meters.
These facts need to be considered while designing the
recommender system. Also, as we carried out the exercise
we noted that the U.S. data was more complete than the
Canadian data. This can be attributed to the fact that many
contributions have come from some U.S. government agencies.
3) User taste/interest data: We investigated the user
information that was available at the company. Basically,
they have only one demographic feature (age), which is learned
through the different social databases, together with the taste
graph data that the company had collected. The taste taxonomy
is a big data initiative in which the company has invested
and dedicated a lot of resources. User information is
gathered during their interactions via their mobile devices
by looking at several dimensions, such as the browsing history,
the most frequently visited websites, online purchases,
application usage, device usage and the social graph. Thus,
for each user a taste graph is developed, reflective of his/her
interests and tastes. The taste graph is currently being utilized
for recommendation purposes through the user network channel.
Accordingly, for the city under study, we investigated the
user data at three levels: 1) taste, 2) age, 3) location history.
Our analysis showed that the city of Waterloo had 9,988
subscribers, of which 90.4% had age data and 61.3% had
taste graph data, while only 40% had location history data.
Figure 8 below illustrates the user data for the three
dimensions under study.

Total Waterloo users: 9,988
Total Waterloo users with age info: 9,033
Total unique Waterloo users with taste: 6,129
Total Waterloo user tastes: 83,676
Total Waterloo users with location info: 4,000+ (note: one month)

Figure 8. Data related to the city under study

Taking a closer look at the taste data, we observed
that the taste taxonomy graph had several layers (tiers);
as per the company's recommendation we only looked at
tier one and tier two. Tier one is a high-level category of
interest, while tier two can be considered a sub-category of
tier one. For example, if tier one indicates "Sports", tier two
would be "Swimming" or "Ice hockey". Upon investigating
the taste graph for our study city, we noted that not all users
have tier 1 and tier 2 data. Also, the majority of users have a
taste of type "Technology and Computing" (around 8,992
observations). Similar observations were noted with regard to
other tier 1 categories: for example, there were 5,282 "Hobbies &
Interests" and 5,102 "Arts & Entertainment" observations. Upon
discussing these facts with the company, we were advised that
any user who carries out browsing activities gets assigned to the
"Technology and Computing" tier until a certain browsing
pattern is identified. As for the other two categories, the
explanation was that the company had reached a point where it
could identify that the user is interested in the major category,
for instance arts, while his/her further activities had not shown
a clear sub-category. Based on these explanations, our model
needs to cater for these cases and give a weighted average for
the user taste/interest.
B. Summary of findings
The location history data available is very sparse, im-
posing a challenge in terms of identifying a user location
pattern. The OSM data structure covers much more
than what is required for the project scope. The OSM data
is rich, yet the quality of the data varies depending on the
country/region. The OSM and the taste graph have a
lot of similarity in terms of their logical structure, which
is a good opportunity to take into consideration
while building the model. Based on the algorithms the
company currently adopts to build the taste graph, our
model should take a weighted approach in
determining the appropriate taste for each user. The current
model (algorithm) available covers only one country, the U.S.,
and builds on the previous day's data based on geographic
closeness. The taste data available is of good quality and
is currently being used successfully for other recommender
applications. There is minimal demographic user data other
than "Age", which can be utilized in our model. The majority
of signals come from WiFi, indicating users' preference
to transmit location data from locations where WiFi is
available rather than via GPS.
C. Model
Based on our initial findings, we focused on the strong
points identified in order to build our model. The strengths
were: 1) the data quality of the taste graph information
per user is high; 2) there are daily processes enhancing the
quality of this taste graph; 3) there is a high degree of similarity
between the categorical structure of the OSM map and the
taxonomy of the taste graph structure; 4) age information
is available and being maintained. Our high-level model
comprises two modules, one offline and one online, as
per Figure 9. The offline module prepares the data that
requires high computational time and can be run on a
periodic basis. Three main files are prepared in the offline
module: 1) the POIs from the OSM, filtered based
on a certain subset of categories that are relevant to our
location recommender project; 2) the tagging of the above
filtered file with the respective mapping category against the
categories of the taste taxonomy (we call this file POICT);
3) the user-to-user similarity based on taste and age. The
online module receives the trajectories from the user's device,
indicating where the user is, and retrieves from the files
prepared by the offline module the relevant nearest points
of interest, based on our proposed algorithm, illustrated in
Figure 10. As per Figure 10, the first step in the online
module is to check whether the user has sent a signal within
a reasonable distance from the last place he/she was at; to
control this we define distance and time thresholds,
to be set by the company, and calculate the distance/time
difference between the last two locations the user was at.
Then the module retrieves all the POIs that are within
a certain diameter of the user location; this diameter is
another parameter controlling the module. Following that, the
module gives high priority to the POIs that have
a tag matching the user's tastes. To enhance the accuracy of the
recommender system, we also used a user similarity approach,
Figure 11. Model example
Data preparation (offline): Model POICT (OSM & Taste) and Model User Similarity (Taste & Age) produce the POICT and User Similarity files, which feed the Recommender Logic.

Figure 9. High-level model
which allows the system to consider other users with tastes
similar to the active user's when we do not have sufficient
information on the active user for a particular taste.
In this case, user similarity is defined as two users with
similar interests/tastes and close ages. The last alternative
is to recommend POIs by applying geographic closeness to
the current user location. This option ignores user profile
information, and gives the same recommendations to
different users at the same place. Figure 11 shows
Figure 10. Pseudo code for algorithm
an example.
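The online-module logic described above (distance filter around the current location, taste-matched POIs first, geographic closeness as the fallback) can be sketched as follows. This is a simplified sketch, not the company's implementation: the radius, POIs, coordinates and taste tags are all hypothetical, and each POI is assumed to carry a single taste tag.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def recommend(user_loc, user_tastes, pois, radius_km=1.0, k=10):
    """Rank nearby POIs, preferring those whose taste tag matches the user."""
    dist = lambda p: haversine_km(user_loc[0], user_loc[1], p["lat"], p["lon"])
    nearby = [p for p in pois if dist(p) <= radius_km]
    # Taste matches sort first; geographic closeness breaks ties and
    # serves as the fallback when nothing matches the user's tastes.
    nearby.sort(key=lambda p: (p["taste_tag"] not in user_tastes, dist(p)))
    return nearby[:k]

pois = [
    {"name": "pool", "lat": 43.47,  "lon": -80.52,  "taste_tag": "T1637"},
    {"name": "cafe", "lat": 43.466, "lon": -80.516, "taste_tag": "T0421"},
]
top = recommend((43.4668, -80.5164), {"T1637"}, pois)
print([p["name"] for p in top])  # -> ['pool', 'cafe']
```

The swimming pool ranks first despite being farther away, because its tag matches the user's taste; the cafe follows on geographic closeness alone.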
Below we further clarify the details of the model; this
comes in two steps: first, modelling the points of interest,
and second, creating the user similarity model, as mentioned
in Section III-C.
1) Modelling points of interest (POI): Modelling POI
categories with the taste graph (POICT) is an offline process
which contains three main steps. Step 1: filtering the
respective POI categories; step 2: creating a mapping standard
between POI categories and taste categories; step 3: tagging
POI data with taste category standards. For step one, we
filtered the OSM based on a subset of categories that relate
to the points of interest relevant to the location recommender.
Table V shows these categories; this step enabled us to obtain
a manageable size of the map. Then we tagged each taste with
a unique ID ("Tag"), as per the example illustrated in Figure 12.
This "Tag" was then put in a parameter file mapping the relevant
key/values from the OSM to the tag.
Taste Graph: Sports > Swimming  →  Tag: T1637  →  OSM Key/Value

Figure 12. Example of mapping OSM to Taste Graph
Table V
CATEGORIES OF OSM USED IN OUR PROJECT

shop, cuisine, craft, diets-*, amenity, tourism, sport, leisure, natural, building, office
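Steps 1 and 3 of the POICT process can be sketched as a lookup over a key/value-to-tag parameter table. Apart from T1637 (taken from Figure 12), the taste tag ids below are invented for illustration, as are the element ids:

```python
# Hypothetical mapping (step 2) from OSM key/value pairs to taste-graph tags.
OSM_TO_TASTE = {
    ("leisure", "swimming_pool"): "T1637",  # Sports > Swimming (cf. Figure 12)
    ("amenity", "restaurant"): "T0421",     # invented tag id
    ("shop", "books"): "T0977",             # invented tag id
}

def tag_pois(elements):
    """Steps 1 and 3: keep elements whose tags map to a taste tag, attach it."""
    out = []
    for el in elements:
        for (key, value), taste in OSM_TO_TASTE.items():
            if el.get("tags", {}).get(key) == value:
                out.append({**el, "taste_tag": taste})
                break
    return out

pois = tag_pois([
    {"id": 1, "tags": {"leisure": "swimming_pool"}},
    {"id": 2, "tags": {"highway": "residential"}},  # filtered out in step 1
])
print([p["taste_tag"] for p in pois])  # -> ['T1637']
```

Elements whose tags have no entry in the parameter table simply drop out, which is what keeps the filtered map at a manageable size.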
2) Modelling user similarity: The other major offline
part of the model is to model user similarity. The purpose
of this stage is to find one hundred similar users for each
user and store them in a file. We then refer to this file when
we need to investigate similar user behaviour for an active
user, regarding the tastes for which we do not have sufficient
taste information. As mentioned earlier, the reason for this is
to cover cases where the user taste is missing.
1. We first built a user-taste matrix $X$, where each row
represents a user and each column indicates a taste:

$$X = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}$$

where $X_{ij} = 1$ if and only if user $i$ is interested in taste $j$,
and $X_{ij} = 0$ otherwise. Accordingly, each user can be
represented by a vector which shows the user's interest in all
given tastes:

$$\vec{u}^{\,*}_i = [1\ 1\ 0\ 0\ 1\ 1\ 0]$$
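The construction of this binary user-taste matrix can be sketched as follows; the user names, taste names and interest sets are hypothetical:

```python
# Build the binary user-taste matrix X (rows = users, columns = tastes).
users = ["u1", "u2", "u3"]
tastes = ["sports", "music", "travel"]
interests = {"u1": {"sports", "travel"}, "u2": {"music"}, "u3": {"sports"}}

X = [[1 if t in interests[u] else 0 for t in tastes] for u in users]
print(X[0])  # u1's taste vector -> [1, 0, 1]
```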
2. Since each taste has a different effect on the similarity
score between two users (more popular tastes are less
important in the similarity measure [11]), we assign a weight
to each taste to control this effect, as per Equation 1:

$$w_j = \log\left(\frac{n}{n_j}\right) \qquad (1)$$

where $n$ indicates the total number of users and $n_j$ indicates
the number of users with taste $j$.
3. We apply the taste weights to the user-taste matrix in
two steps:

a) Transform the weight vector $\vec{w}$ into an $(n \times n)$
diagonal matrix:

$$W_{n \times n} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix}$$

b) Apply the weights to the user-taste vector:

$$(\vec{u}_i)_{n \times 1} = W_{n \times n}\,(\vec{u}^{\,*}_i)_{n \times 1} \qquad (2)$$

$$\vec{u}_i = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix} \begin{bmatrix} u^*_{i1} \\ u^*_{i2} \\ \vdots \\ u^*_{in} \end{bmatrix} = \begin{bmatrix} w_1 u^*_{i1} \\ w_2 u^*_{i2} \\ \vdots \\ w_n u^*_{in} \end{bmatrix}$$
4. To add the effect of users' age on similarity, we normalized the age values, as per Equation 3:

$N_{Age}(u_i) = \frac{Age(u_i)}{\max(Ages)}$ (3)
5. Calculate the difference between user $i$ and user $j$ using the Euclidean distance [7][2], as per Equation 4:

$d(u_i, u_j) = \sqrt{\lambda \left(N_{Age}(u_i) - N_{Age}(u_j)\right)^2 + (1 - \lambda) \sum_{k=1}^{n} (x_{ik} - x_{jk})^2 / n}$ (4)
6. Measure the similarity score, according to Equation 5:

$S(u_i, u_j) = \max_{\forall i,j} \left(d(u_i, u_j)\right) - d(u_i, u_j) = 1 - d(u_i, u_j)$ (5)
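The six steps above can be sketched compactly in Python. This is an illustrative sketch, not the authors' code: the user matrix, ages, and $\lambda = 0.5$ are hypothetical, and, following Equation 5, the maximum distance is assumed to be 1.

```python
import math

def taste_weights(matrix):
    """Eq. 1: w_j = log(n / n_j), down-weighting popular tastes."""
    n = len(matrix)
    return [math.log(n / sum(row[j] for row in matrix))
            for j in range(len(matrix[0]))]

def similarity(matrix, ages, i, j, lam=0.5):
    """Eqs. 2-5: age term plus weighted-taste Euclidean distance."""
    w = taste_weights(matrix)
    n_tastes = len(w)
    max_age = max(ages)
    # Eq. 3: normalized ages, combined with weight lambda (Eq. 4)
    age_term = lam * (ages[i] / max_age - ages[j] / max_age) ** 2
    # Eq. 2 applied inside Eq. 4: compare weighted taste vectors
    taste_term = (1 - lam) * sum(
        (w[k] * matrix[i][k] - w[k] * matrix[j][k]) ** 2
        for k in range(n_tastes)) / n_tastes
    d = math.sqrt(age_term + taste_term)  # Eq. 4
    return 1 - d                          # Eq. 5, assuming max distance = 1

# Three hypothetical users over four tastes
users = [[1, 1, 0, 0],
         [1, 0, 1, 1],
         [1, 1, 0, 0]]
ages = [25, 40, 27]
print(similarity(users, ages, 0, 2))  # near 1: same tastes, similar ages
print(similarity(users, ages, 0, 1))  # lower: different tastes and ages
```

In the offline stage, this score would be computed pairwise and the top one hundred most similar users per user stored to file, as described above.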
Table VI
COMPARISON OF THE PERFORMANCE OF THE APPROACH IN THE SIMULATIONS FOR TWO LOCATIONS

                  Location 1   Location 2
Top 1 Hit Rate       0.5          0.6
Top 2 Hit Rate       0.5          0.6
Top 3 Hit Rate       0.6          0.7
Top 4 Hit Rate       0.6          0.6
Top 5 Hit Rate       0.7          0.7
Top 6 Hit Rate       0.7          0.7
Top 7 Hit Rate       0.7          0.8
Top 8 Hit Rate       0.7          0.8
Top 9 Hit Rate       0.7          0.8
Top 10 Hit Rate      0.7          0.8
IV. EVALUATION OF THE MODEL
There are several approaches to evaluating recommender systems. One common approach is an off-line evaluation based on historical data [4]. In our case, we could not apply this approach due to the sparsity of the location history data and the lack of check-in information for visited places. Another option is an on-line evaluation, i.e. putting the system into production, logging the responses and feedback from users, and then applying a measure of satisfaction. We could not apply this method either, as the company had not yet developed the system.
A third option is to simulate a user scenario and measure user satisfaction through a prototype; we adopted this approach to evaluate our proposed algorithm. Accordingly, we developed a simulator in Python. The simulator receives the spatiotemporal information (latitude, longitude, timestamp) for a user and returns a list of recommended locations around the active user.
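The simulator interface can be sketched as follows. This is a hypothetical reconstruction, not the paper's code: the POI list, taste sets, ranking rule, and 2 km radius are all illustrative assumptions, and temporal filtering on the timestamp is omitted.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def recommend(lat, lon, timestamp, user_tastes, pois,
              radius_km=2.0, top_n=10):
    """Return up to top_n POIs within radius_km, ranked by taste overlap.
    The timestamp is accepted but unused in this simplified sketch."""
    nearby = [p for p in pois
              if haversine_km(lat, lon, p["lat"], p["lon"]) <= radius_km]
    nearby.sort(key=lambda p: len(user_tastes & p["tags"]), reverse=True)
    return nearby[:top_n]

# Hypothetical tagged POIs near an active user
pois = [
    {"name": "City Pool", "lat": 43.658, "lon": -79.380, "tags": {"T1637"}},
    {"name": "Cafe",      "lat": 43.657, "lon": -79.381, "tags": {"T0412"}},
]
result = recommend(43.6577, -79.3788, "2015-06-01T12:00", {"T1637"}, pois)
print(result[0]["name"])  # City Pool: matches the user's swimming taste
```

The actual system additionally falls back on similar users' tastes when the active user's profile is incomplete, as described in the offline stage.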
In order to carry out the simulation, the company identified and provided us with 10 volunteer users ready to participate in the experiment. Each volunteer was presented with a scenario of being at two well-known locations in his or her home town. Two lists of points of interest (POI) were then provided to each volunteer, each containing 10 locations. The volunteers were asked to evaluate whether the points of interest presented to them were relevant, based on their preferences. In our survey, we checked whether the top X recommended locations were relevant to the volunteers [4]. The summary of the top X relevance analysis is given in Table VI. 80% of the volunteers identified at least one relevant POI in our recommendation list.
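The top-X hit rates reported in Table VI can be computed as the fraction of volunteers who marked at least one of the top X recommendations as relevant. The sketch below illustrates this with hypothetical relevance judgments, not the actual survey responses.

```python
def top_x_hit_rates(relevance, max_x=10):
    """relevance: one boolean list per volunteer, in ranked order.
    Returns {x: fraction of volunteers with >=1 relevant POI in top x}."""
    n = len(relevance)
    return {x: sum(any(r[:x]) for r in relevance) / n
            for x in range(1, max_x + 1)}

# Two hypothetical volunteers judging a ranked 10-item list
judgments = [
    [False, False, True, False, False, False, False, False, False, False],
    [True, False, False, False, False, False, False, False, False, False],
]
rates = top_x_hit_rates(judgments)
print(rates[1], rates[3])  # 0.5 1.0
```

Because the metric asks only for at least one relevant item, the hit rate is non-decreasing in X, which matches the shape of the columns in Table VI.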
V. THREATS TO VALIDITY
In this section, we describe certain threats to the validity
of our study. We classify threats into four groups: external
validity, internal validity, construct validity and conclusion
validity [13].
A threat to external validity is that our work is specific to one company as a case study. We have identified a clear relationship between the categories of user tastes and the categories of the OSM; however, this relationship might not apply to other companies. Despite this, we believe the framework and methodology proposed may easily be applied to other companies, and differences can be captured and addressed during the mapping process.
A threat to internal validity may exist in the implementation, because an incorrect implementation can influence the output. For example, we wrote Pig scripts to retrieve the data and Python scripts to build the simulator and the similarity matrix. In our project, this threat is mitigated by manually investigating the outputs. Another threat exists when user taste profiles are incomplete; in such cases, the mapping into OSM categories would not retrieve any relevant points of interest. To mitigate this threat, we fall back on similar users' tastes and the nearest locations.
A threat to conclusion validity relates to our recommender output: our assumption that the recommender provides relevant points of interest has been evaluated by only ten users in a simulated environment, as a proof of concept. Once our approach is deployed in real life, results may differ. For this reason we intend to run an on-line feedback evaluation for the first three months after the system is developed and deployed, to measure the level of overall satisfaction and amend the model as necessary.
A threat to construct validity relates to the accuracy of the OSM data: since the data is constructed by volunteers across the world, inaccurate data would lead our recommender to give inaccurate results. Our methodology does not address this threat, yet reports indicate that the data quality of OSM is improving year on year.
VI. CONCLUSIONS
The scope of the project was to develop an algorithm to be
used as an infrastructure for a location based recommender.
Our project has passed through several validation steps, the
methodology we adopted was an interactive and iterative
methodology whereby each step of investigation and devel-
opment was presented to the company for review and feed-
back. The feedback was taken into account in the next step.
Our key strategy was to identify points of strength in the
company’s processes and data and build up-on those points.
As the scope of the project involved massive data in various
structured and unstructured formats we considered that our
algorithm is scalable and further optimized to allow for
speedy recommendations. Thus we recommended a two step
approach whereby all high computational activities involving
large data that does not change frequently, to be carried out
off-line. Through a structured review of the available literature and the company's data sources, we delivered a personalized, optimized algorithm that has been accepted by the company. We evaluated the proposed algorithm through a simulation program that we coded, using feedback from 10 users.
VII. ACKNOWLEDGMENTS
This research is supported in part by NSERC DG Engage
Grant (EG) no. 461900 2013
REFERENCES
[1] J. Bao, Y. Zheng, and M. F. Mokbel. Location-based and preference-aware recommendation using sparse geo-social networking data. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 199-208. ACM, 2012.
[2] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez. Recommender systems survey. Knowledge-Based Systems, 46:109-132, 2013.
[3] Foursquare. https://foursquare.com/.
[4] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5-53, 2004.
[5] Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.-Y. Ma. Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 34. ACM, 2008.
[6] OpenStreetMap. Map features. http://wiki.openstreetmap.org/wiki/MapFeatures.
[7] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325-341. Springer, 2007.
[8] S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo. Socio-spatial properties of online location-based social networks. ICWSM, 11:329-336, 2011.
[9] B. Shaw, J. Shea, S. Sinha, and A. Hogue. Learning to rank for spatiotemporal search. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 717-726. ACM, 2013.
[10] M. Sklar, B. Shaw, and A. Hogue. Recommending interesting events in real-time with foursquare check-ins. In Proceedings of the Sixth ACM Conference on Recommender Systems, pages 311-312. ACM, 2012.
[11] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.
[12] L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195-203. ACM, 2012.
[13] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in Software Engineering. Springer Science & Business Media, 2012.
[14] X. Xiao, Y. Zheng, Q. Luo, and X. Xie. Finding similar users using category-based location history. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 442-445. ACM, 2010.
[15] X. Xiao, Y. Zheng, Q. Luo, and X. Xie. Inferring social ties between users with human location history. Journal of Ambient Intelligence and Humanized Computing, 5(1):3-19, 2014.
[16] V. W. Zheng, Y. Zheng, X. Xie, and Q. Yang. Collaborative location and activity recommendations with GPS history data. In Proceedings of the 19th International Conference on World Wide Web, pages 1029-1038. ACM, 2010.
[17] Y. Zheng and X. Xie. Learning location correlation from GPS trajectories. In Mobile Data Management (MDM), 2010 Eleventh International Conference on, pages 27-32. IEEE, 2010.
[18] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of the 18th International Conference on World Wide Web, pages 791-800. ACM, 2009.