A novel point of interest (POI) location-based recommender system utilizing user location and web interactions
Mayy Habayeb1, Behjat Soltanifar2, Bora Caglayan3, Ayse Bener4
Data Science Lab
Department of Mechanical and Industrial Engineering
Ryerson University, Canada
(mayy.habayeb1, behjat.soltanifar2, bora.caglayan3, ayse.bener4)@ryerson.ca
Abstract—Location-aware mobile devices have increased the availability of user trajectory information, making point of interest recommenders a popular service on mobile devices. However, one of the main challenges in this area is the sparsity of historical trajectory data. So far, most recommender systems take users' historical trajectory information into consideration to recommend different places. Web interactions reveal rich information on user interests, and hence a recommender system should take such data into consideration.
In this study, we present a model that combines and associates users' interest/taste information, obtained from their web interactions, with location information obtained from the Open Street Map (OSM). Then, we combine this information with the users' real-time trajectory information (longitude, latitude and timestamp) to present a list of recommended points of interest close to the current location.
Keywords—Location-based recommender system, Recommender System, Big Data, Point of Interest.
I. INTRODUCTION
Mobile devices detect various contexts in the environment,
such as the location, the surroundings, people in the vicinity,
or the weather forecast. The user could be at the train station,
in a shopping mall or even in a different city. These shifts
in interaction, as well as the availability of vast amounts of
user data and context data, bring an opportunity for
the company to serve its customers better by anticipating
their preferences in a dynamic manner. Once location opt-in
function is enabled on mobile devices, several companies
collect user location data based on user consent in order
to better understand user behaviour and therefore provide
contextual personalization. They also collect user data from
many different sources such as browsing history, application
usage, device usage and social graph. Combining both user
location information and user web interactions and further
integrating with external data sources is a challenge. This
is a big data problem not only because the data volume
is massive, but also because data comes in both structured
and unstructured formats and the complexity requires new
algorithms to tackle the personalization problem.
In the past few years we have seen the rise of Location-
Based Social Networks (LBSNs) [8][3]. LBSNs can gener-
ally be classified into three major groups: a) Geo-tagged-
media-based (Flickr, Geo-twitter), b) Point location-based
(Foursquare, Google Latitude), c) Trajectory based (Bikely,
Geo-life). Many interesting research projects [14][15][17]
have been done around these LBSNs for location-based
recommendation. Common approaches to the recommendation
problem include constructing the User-User association
graph, the User-Location association graph and the Location-
Location graph¹.
Our case study relates to a large North American Tele-
communications company. In the past two years, the Data
Science team at the company has built a personalized
interest graph for each individual user to drive personalized
ads targeting and application recommendation services. The
interest graph leverages user browsing data, application
usage data and many other data sources. With social graph,
interest graph and the location data as valuable data assets,
the company sees a big opportunity to leverage these data
to drive location-based recommendations to the end users.
However, the challenges are manifold: a) the company
has little previous experience with LBSNs; b) harnessing
the heterogeneous data for recommendation is a challenging
problem; c) the interest graph, social graph and location graph
are complex graph problems in their own right, and combining
these graphs to solve the business problem poses new
challenges; d) the data volume is huge and requires scalable
algorithms that run on distributed platforms such as Hadoop
or graph databases such as Neo4j or Titan.
Information filtering systems gained their current popu-
larity alongside the growth of the internet. Popular sites like
Stack Overflow, Amazon and Google employ information fil-
tering systems that guide user decisions with minimal effect
on the core functionality. We can group information filtering
systems as collaborative, context-based, and hybrid [11].
Some problems in these systems include data sparsity, cold
start and rating subjectivity. The information given by the
system depends on the user’s situation. The trend today is to
recommend relevant information to users, using supervised
machine learning techniques. However, there is a need to
address the concerns of cold start, data sparsity, and rating
¹http://research.microsoft.com/en-us/projects/lbsn/
2016 IEEE Second International Conference on Big Data Computing Service and Applications
978-1-5090-2251-9/16 $31.00 © 2016 IEEE
DOI 10.1109/BigDataService.2016.42
121
subjectivity problems, and to adapt a collaborative information
filtering system to the user in a ubiquitous environment.
We can draw inspiration from the models of human reasoning
developed in robotics, combining reinforcement learning and
case-based reasoning to define a contextual recommendation
process based on different context dimensions (cognitive,
social, temporal, geographic).
The flow of this paper is as follows: In section II we
provide a summary of related work. In section III we
describe our methodology, results of our exploratory analysis
and details of our proposed model. In section IV we evaluate
our model. In section V we describe threats to our work and
finally in section VI we summarize and conclude our work.
II. RELATED WORK
In 2008 Li et al. [5] investigated measuring user similari-
ties based on user location histories. They proposed a frame-
work “Hierarchical Graph-based Similarity Measurement”
(HGSM), that took into consideration the sequence property
of people’s movement behaviours and the hierarchy property
of geographic spaces. For their research they monitored 65
volunteers for a period of six months. They compared their
framework (HGSM) with other similarity measures, such as
cosine similarity and Pearson correlation, and reported more
accurate results using the HGSM measure. Similarly, Zheng
et al. [18] presented a model to mine interesting locations
and classical travel sequences by modelling user location
histories in a tree-based hierarchical graph, then proposed a
Hypertext Induced Topic Search (HITS)-based inference model
to relate the user to a location.
Xiao et al.[14] further continued in the same field by
trying to identify user similarity based on semantic location
history instead of low-level geographic locations; they use
a Maximum Travel Match algorithm to calculate the user
semantic location history similarity. We differ from their
approach as we focus on user taste/interest similarity that
is mapped back to physical Points of Interest.
Zheng et al. [16] investigated the possibility of recommend-
ing activities related to the recommended location in addition
to the recommended location. They used collective matrix
factorization to mine interesting locations and activities; their
work is similar to ours in terms of modelling the semantic
categories of POIs. In 2010, Y. Zanh et al. [17] proposed a
personalized location recommendation approach that uses the
correlation between locations, considering the sequentiality
of the locations visited by a specific user. They evaluated
their algorithm using GPS dataset information of 112 users
over a time period of 1.5 years. In 2012, L.-Y. Wei et al. [12]
proposed a framework,
called the Route Inference framework based on Collective
Knowledge (RICK), which helps to extract popular routes
from uncertain trajectories. Given a time span and a location
sequence, this model recommends the top-k popular routes,
inferred from the historical check-in records of those places
generated by other tourists. Also in 2012, Bao et al. [1]
tackled the problem of sparse geo-social data in the
user-location matrix. They used a Foursquare dataset to build
a location recommender that models each individual's personal
preference with a weighted category hierarchy (WCH). They
use users' check-in data to build the category hierarchy and
to compute user location history similarity.
In our case study we differ from them as we use the Open
Street Map and we do not have check-in data from other
users.
Other studies, carried out by the Foursquare research group
[10], [9], focus on utilizing users' check-in data to introduce
various recommender applications, such as recommending
interesting events in real time [10]. In [15], the authors
estimated user similarity by modelling the semantic location
history of a user's GPS trajectories and using a maximum
travel match algorithm.
Our work differs from the work described above mainly
in two aspects. First, we do not build on the history of user
trajectories [5][18] nor on route trajectories [12], as this
is an expensive operation to maintain and is subject to
sparsity of data and signal transmissions. Instead, we use the
user transmitted location and the points of interest within
a certain diameter combined with user web interactions
to provide recommendations. Second, we capitalize on the
internal customer data available within the organization and
model the similarities between the user web interactions
and the points of interest around the user location for our
recommender system; in addition, if user web interactions
are not available we use user-user similarities to provide
recommendations.
III. METHODOLOGY
In this section we describe the steps of the methodology we
used to address the research problem. First, we carried
out an exploratory analysis exercise; then we identified key
strengths; and finally we built a high-level model.
A. Data collection & exploratory data analysis
We examined three dimensions for this step:
1) User trajectory information: We collected users’ his-
torical trajectory data and investigated it in order to identify
any trends. We looked at the number of transmitted signals.
For this exercise we agreed with the company to retrieve
data related to all users subscribed in one city, Waterloo, in
Canada, and to extract all their historical trajectory data over
a period of one month, from 27/4/2014 to 29/5/2014. In
total there were 284,744 observations. Table I shows
all the countries visited by this city's users:
We noted that most of the signals (97.5%) were transmitted
from Canada, which is not a surprising result; yet 60 other
countries were also visited, although the time period was
not a seasonal vacation period. Figure 1 visually illustrates
our findings. We further analysed the data in terms of the
most frequently visited cities. Table II illustrates the
Table I
COUNTRIES VISITED DURING 25/4/2014 TO 29/5/2014

Country  AE  AG  AU  AZ  BB  BE  BR  BS  CA       CH  CL  CN  CO  CR  CU  CZ
Count    10  38  16   5   1   5  63  30  277,786  13   2  44   1  53   2   7

Country  DE  DK  DO   EG  ES  FI  FR  GB  GR  HK  HR  HT  HU  ID  IE  IL
Count    32   4  155   1  11   4  31  78   4   6   7  41   4   2  57   3

Country  IN  IR  IS  IT  JM  JP  KN  KR  MT  MU  MX  MY  NG  NL  null  PA
Count    23   1   1  62  18   2  37  10   3   2  78  13  13   4    69   2

Country  PL  PR  PT  RU  SA  SI  TH  TR  US     VE  VI  VN  YE
Count     7   2  21   1   1   2   9   3  5,813  20   1   9   1
Figure 1. GPS signals transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
top 20 cities visited. As can be seen from Table II, the home
city was where most signals were transmitted, followed by
the nearby cities. This indicates that most of the users live
and work in their home city.
Table II
MOST VISITED CITIES WITHIN CANADA, PERIOD 25/4/2014 TO 29/5/2014

Rank  City         No. of signals    Rank  City         No. of signals
1     Waterloo     191,346           11    Brampton     1,773
2     Kitchener    27,168            12    Orangeville  1,624
3     Cambridge    9,918             13    London       1,620
4     Guelph       5,986             14    Hamilton     1,275
5     Stratford    4,770             15    Woodstock    1,141
6     Toronto      2,998             16    Burlington   1,026
7     Owen Sound   2,969             17    Ottawa       842
8     Etobicoke    2,503             18    Huntsville   811
9     Milton       2,419             19    Collingwood  786
10    Mississauga  1,917             20    Willowdale   775

Total: 263,667
We followed this by taking a closer look at the trajectory
signals transmitted per user. Figure 2 below shows a box
plot of all the user signals transmitted during the period of
32 days from 27/4/2014 to 29/5/2014.
Figure 2. Frequency of signals (GPS and WiFi) transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
From Figure 2 we note that the median number of signals
transmitted per user is 63 over the period of 32 days. On
average, each user transmits 2 signals per day; users in the
first quartile transmit only 1 signal per day, and those in the
third quartile only 3 signals per day. This indicates sparsity
of the data. To confirm this further, we wrote code to
calculate the time difference between the signals emitted by
each user. Our calculations reveal that the median time
difference between signals per user is 5.999 hours, while
the mean is 8.744 hours.
These results indicate sparse location history data which
would not enable us to further carry out analysis at the level
of location history similarity as per the previous research
[5], [14], [16].
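The inter-signal gap analysis above can be sketched in a few lines of Python; the timestamps below are hypothetical and only illustrate the computation, not the company's data:

```python
from datetime import datetime
from statistics import median

def signal_gaps_hours(timestamps):
    """Return the gaps (in hours) between consecutive signals of one user."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]

# Hypothetical signals for a single user
signals = [
    datetime(2014, 4, 27, 8, 0),
    datetime(2014, 4, 27, 14, 0),
    datetime(2014, 4, 28, 9, 30),
]
gaps = signal_gaps_hours(signals)       # -> [6.0, 19.5]
print(round(median(gaps), 2))           # -> 12.75
```

Running this per user and taking the median of the gaps is how a sparsity figure such as the 5.999-hour median above can be obtained.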
We selected the most active user, i.e., the user who
transmitted the highest number of signals during the 32-day
time period, and plotted all 1,592 of their trajectory points
on a heat map, as shown in Figure 3. As can be noticed,
there were six areas, marked in red, that this user visited.
This result confirms previous research results that people
tend to follow a certain pattern in their movements and that
there is a lot of duplicate data in location history data.
We looked at the temporal aspect of the data in order to
see whether there were certain time spans, such as weekends,
in which users would transmit additional information. Table III
illustrates the outcome of this exercise. Column one is the
Figure 3. Most active user trajectory movements during the period 27/4/2014 to 29/5/2014
day of the week, column two reflects the total number of
signals emitted on that day, column three shows the daily
average. As a result, there were no clear or obvious days
of the week where users would transmit additional GPS
trajectories.
Table III
SIGNAL DISTRIBUTION BY DAY OF THE WEEK

Day        # Observations  Average
Thursday   35,787          7,157.4
Friday     37,920          7,584
Saturday   39,127          7,825.4
Wednesday  40,754          8,150.8
Sunday     42,830          8,566
Monday     43,562          8,712.4
Tuesday    44,764          8,952.8
We looked at the type of signals transmitted, GPS or
WiFi. We noted that most of the signals came from
WiFi; this indicates that users prefer to connect to WiFi
and transmit location information, most likely because they
won't incur charges. Figure 4 illustrates the split. We further
investigated the time intervals between signal transmissions
per user, in order to observe whether there is a difference in
the frequency of transmission between WiFi and GPS.
Figure 4. Type of signals (GPS and WiFi) transmitted by Waterloo users, period 27/4/2014 to 29/5/2014
Figure 5 displays these results.

Figure 5. Time intervals between signal transmission per user, per type of signal (GPS and WiFi), transmitted by Waterloo users, period 27/4/2014 to 29/5/2014

As can be noted, the median interval for GPS is 16,250 seconds
(about 4.5 hours), while for WiFi it is 21,600 seconds (6 hours),
about a third higher. Because most of the signals are transmitted
through WiFi, the overall median remains close to 6 hours.
2) Open Street Map: We investigated the Open Street
Map (OSM) [6] because it would be the main source for
the point of interest (POI) recommendations. The Open
Street Map is a collaborative project with the purpose of
creating a free, editable map of the world. Volunteers collect
the data across the world; in addition, some government
agencies have released official data under appropriate licenses
[6]. Much of this data has come from the United States.
Over the past few years the map has gained a lot of
popularity, and many leading companies, such as Apple and
Flickr, have used OSM data as infrastructure
for their geo-applications. OSM uses a topological data
structure, with four core elements [6]. First: nodes are points
with a geographic position, stored as coordinates (pairs
of a latitude and a longitude). Outside of their use in
ways, they are used to represent map features without a
size, such as points of interest or mountain peaks. Second:
ways are ordered lists of nodes, representing a polyline, or
possibly a polygon if they form a closed loop. They are
used both for representing linear features, such as streets
and rivers, and areas, like forests, parks, parking areas and
lakes. Third: relations are ordered lists of nodes, ways and
relations (together called "members"), where each member
can optionally have a "role" (a string). Relations are used
for representing the relationships between existing nodes and
ways. Examples include turn restrictions on roads, routes that
span several existing ways (for instance, a long-distance
motorway), and areas with holes. Fourth: tags are key-value
pairs (both arbitrary strings). They are used to store metadata
about the map objects (such as their type, their name and
their physical properties). Tags are not free-standing, but are
always attached to an object: to a node, a way, a relation,
or to a member of a relation. A recommended ontology
of map features (the meaning of tags) is maintained on
a wiki. OSM represents physical features on the ground
(e.g., roads or buildings) using tags attached to its basic
data structures (its nodes, ways, and relations). Each tag
describes a geographic attribute of the feature being shown
by that specific node, way or relation [8].
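As a minimal sketch of the tag model described above, a node and its tags can be represented with plain Python dictionaries; the element id, coordinates and tag values below are invented for illustration:

```python
# A hypothetical OSM node for a POI, as a plain dict:
node = {
    "id": 123456,      # hypothetical element id
    "lat": 43.4668,
    "lon": -80.5164,
    "tags": {"amenity": "cafe", "name": "Example Cafe", "cuisine": "coffee_shop"},
}

def is_poi(element, wanted_keys=frozenset({"amenity", "shop", "tourism", "leisure"})):
    """True if the element carries at least one tag key we treat as a POI."""
    return bool(wanted_keys & element["tags"].keys())

print(is_poi(node))  # -> True
```

A node with only, say, a `highway` tag would be rejected by the same check, which is the essence of the category filtering described later in this section.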
The logical structure (the features of the map) is grouped
into 26 main categories, each of which has further sub-
categories [6]. Table IV illustrates the main feature groups.
For the purpose of this project we investigated each of
Table IV
CATEGORIES OF OSM

1.1  Aerialway    1.14  Man Made
1.2  Aeroway      1.15  Military
1.3  Amenity      1.16  Natural
1.4  Barrier      1.17  Office
1.5  Boundary     1.18  Places
1.6  Building     1.19  Power
1.7  Craft        1.20  Public Transport
1.8  Emergency    1.21  Railway
1.9  Geological   1.22  Route
1.10 Highway      1.23  Shop
1.11 Historic     1.24  Sport
1.12 Landuse      1.25  Tourism
1.13 Leisure      1.26  Waterway
the above categories and subcategories and found that a
lot of the information is irrelevant to our project, for example
roads, highways, railways and waterways. Accordingly, we
selected the main categories that would be relevant for
point of interest (POI) recommendation and filtered them
for two geographic regions, in order to carry out some
initial data explorations and test the quality of the data. The
regions were identified with the company, and we agreed
to investigate one region in Canada (Waterloo) and one
region in the United States (New York - Manhattan). The
reasons behind these selections were severalfold. First,
the company had another source for POIs, only for
US users, so it made sense to look at this new source.
Second, the company had built an off-line recommender
based on the last signal received from the user, showing the
nearest ten locations from the second US source; these
recommendations were based on the previous end-of-day data.
Third, the company had a lot of users in North America,
and these users frequently travelled between Canada and the
USA. Fourth, despite the regions being in two countries,
the proximity between the locations was interesting. Fifth,
the nature of the two regions was quite different and worth
comparing. For the purpose of this investigation we had
to first identify a set of meaningful key values, then
carry out data extractions using the Pig Latin programming
language, and finally plot the retrieved data points using a
heat map package in R. Figure 6
illustrates different category selections for the Manhattan region:
one focusing only on food and drink POIs, versus food and drink,
shops, leisure, sports, art and history POIs. As can be noted,
the number of POIs increases drastically with the inclusion of
extra categories. This might confuse the end user and provide
Figure 6. POIs for Manhattan: left side, only food and drink; right side, food and drink and other categories
irrelevant information. In Figure 7 we compare the presence
of POIs in the two regions, Waterloo and Manhattan, for the
same set of categories.

Figure 7. POIs in Waterloo versus POIs in Manhattan

As noted before, the density pattern between the two regions
under study is quite different. In Waterloo we noted several
focal points, while in Manhattan the POIs seemed continuous.
This is due to the nature of the regions: while Waterloo is
considered a business-residential region, Manhattan has more
of a touristic character, and thus POIs may appear every few
meters.
These facts need to be considered while designing the
recommender system. Also, as we carried out the exercise
we noted that the U.S. data was more complete than the
Canadian data. This can be attributed to the fact that many
contributions have come from some U.S. government agencies.
3) User taste/interest data: We investigated the user
information that was available at the company. Basically,
they have only one demographic feature (age), which is learned
through the different social databases, together with the taste
graph data that the company had collected. The taste taxonomy
is a big data initiative in which the company has invested
and dedicated a lot of resources. User information is
gathered during their interactions via their mobile devices
by looking at several dimensions, such as the browsing history,
the most frequently visited websites, online purchases,
application usage, device usage and the social graph. Thus,
for each user a taste graph is developed, reflective of his/her
interests and tastes. The taste graph is currently being utilized
for recommendation purposes through the user network channel.
Accordingly, for the city under study, we investigated the
user data at three levels: 1) taste, 2) age, 3) location history.
Our analysis showed that the city of Waterloo had 9,988
subscribers, of which 90.4% had age data and 61.3% had
taste graph data, while only 40% had location history data.
Figure 8 below illustrates the user data for the three
dimensions under study.

Total Waterloo users: 9,988
Total Waterloo users with age info: 9,033
Total unique Waterloo users with taste: 6,129
Total Waterloo user tastes: 83,676
Total Waterloo users with location info: 4,000+ (note: one month)

Figure 8. Data related to the city under study

Taking a closer look at the taste data, we observed
that the taste taxonomy graph had several layers (tiers);
as per the company's recommendation we only looked at
tier one and tier two. Tier one is a high-level category of
interest, while tier two can be considered a sub-category of
tier one. For example, if tier one indicates "Sports", tier two
would be "Swimming" or "Ice hockey". Upon investigating
the taste graph for our study city, we noted that not all users
have tier 1 and tier 2 data. Also, the majority of users have a
taste of type "Technology and Computing" (around 8,992
observations). Similar observations were noted with regard to
other tier 1 categories: for example, there were 5,282 "Hobbies &
Interests" and 5,102 "Arts & Entertainment" observations. Upon
discussing these facts with the company, we were advised that
any user who carries out browsing activities gets assigned to the
"Technology and Computing" tier until a certain browsing
pattern is identified. As for the other two categories, the
explanation was that the company had reached a point where it
could identify that the user is interested in the major category,
for instance arts, while his/her further activities had not shown
a clear sub-category. Based on these explanations, our model
needs to cater for these cases and give a weighted average for
the user taste/interest.
B. Summary of findings
The location history data available is very sparse, im-
posing a challenge in terms of identifying a user location
pattern. The OSM data structure covers much more
than what is required for the project scope. The OSM data
is rich, yet the quality of the data varies depending on the
country/region. The OSM and the taste graph have a
lot of similarity in terms of their logical structure, which
is a good opportunity to take into consideration
while building the model. Based on the algorithms the
company currently adopts to build the taste graph, our
model should take a weighted approach in
determining the appropriate taste for each user. The current
model (algorithm) available covers only one country, the U.S.,
and builds on the previous day's data based on geographic
closeness. The taste data available is of good quality and
is currently being used successfully for other recommender
applications. There is minimal demographic user data other
than "Age", which can be utilized in our model. The majority
of signals come from WiFi, indicating users' preference
to transmit location data from locations where WiFi is
available rather than via GPS.
C. Model
Based on our initial findings, we focused on the strong
points identified in order to build our model. The strengths
were: 1) the data quality of the taste graph information
per user is high; 2) there are daily processes enhancing the
quality of this taste graph; 3) there is a high degree of similarity
between the categorical structure of the OSM map and the
taxonomy of the taste graph structure; 4) age information
is available and being maintained. Our high-level model
comprises two modules, one offline and one online, as
per Figure 9. The offline module prepares the data that
requires high computational time and can be run on a
periodic basis. Three main files are prepared in the offline
module: 1) the POIs from the OSM, filtered based
on a certain subset of categories that are relevant to our
location recommender project; 2) the tagging of the above
filtered file with the respective mapping category against the
categories of the taste taxonomy (we call this file POICT);
3) the user-to-user similarity based on taste and age. The
online module receives the trajectories from the user's device,
indicating where the user is, and retrieves from the files
prepared by the offline module the relevant nearest points
of interest, based on our proposed algorithm, illustrated in
Figure 10. As per Figure 10, the first step in the online
module is to check whether the user has sent a signal within
a reasonable distance from the last place he/she was at; to
control this we define distance and time thresholds,
to be set by the company, and calculate the distance/time
difference between the last two locations the user was at.
Then the module retrieves all the POIs that are within
a certain diameter of the user location; this diameter is
another parameter controlling the module. Following that, the
module gives high priority to the POIs that have
a tag matching the user's tastes. To enhance the accuracy of the
recommender system, we also used a user similarity approach,
Figure 11. Model example
Data preparation (offline): Model POICT (OSM & Taste) and Model User Similarity (Taste & Age) produce the POICT and User Similarity files, which feed the Recommender Logic.

Figure 9. High-level model
which allows the system to consider other users with tastes
similar to the active user's when we do not have sufficient
information on the active user for a particular taste.
In this case, user similarity is defined as two users with
similar interests/tastes and close ages. The last alternative
is to recommend POIs by applying geographic closeness to
the current user location. This option ignores user profile
information, and gives the same recommendations to
different users at the same place. Figure 11 shows
Figure 10. Pseudo code for algorithm
an example.
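The online-module logic described above (distance filter around the current location, taste-matched POIs first, geographic closeness as the fallback) can be sketched as follows. This is a simplified sketch, not the company's implementation: the radius, POIs, coordinates and taste tags are all hypothetical, and each POI is assumed to carry a single taste tag.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def recommend(user_loc, user_tastes, pois, radius_km=1.0, k=10):
    """Rank nearby POIs, preferring those whose taste tag matches the user."""
    dist = lambda p: haversine_km(user_loc[0], user_loc[1], p["lat"], p["lon"])
    nearby = [p for p in pois if dist(p) <= radius_km]
    # Taste matches sort first; geographic closeness breaks ties and
    # serves as the fallback when nothing matches the user's tastes.
    nearby.sort(key=lambda p: (p["taste_tag"] not in user_tastes, dist(p)))
    return nearby[:k]

pois = [
    {"name": "pool", "lat": 43.47,  "lon": -80.52,  "taste_tag": "T1637"},
    {"name": "cafe", "lat": 43.466, "lon": -80.516, "taste_tag": "T0421"},
]
top = recommend((43.4668, -80.5164), {"T1637"}, pois)
print([p["name"] for p in top])  # -> ['pool', 'cafe']
```

The swimming pool ranks first despite being farther away, because its tag matches the user's taste; the cafe follows on geographic closeness alone.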
Below we further clarify the details of the model; this
comes in two steps: first, modelling the points of interest,
and second, creating the user similarity model, as mentioned
in Section III-C.
1) Modelling points of interest (POI): Modelling POI
categories with the taste graph (POICT) is an offline process
which contains three main steps. Step 1: filtering the
respective POI categories; step 2: creating a mapping standard
between POI categories and taste categories; step 3: tagging
POI data with taste category standards. For step one, we
filtered the OSM based on a subset of categories that relate
to the points of interest relevant to the location recommender.
Table V shows these categories; this step enabled us to obtain
a manageable size of the map. Then we tagged each taste with
a unique ID ("Tag"), as per the example illustrated in Figure 12.
This "Tag" was then put in a parameter file mapping the relevant
key/values from the OSM to the tag.
Taste Graph: Sports > Swimming  →  Tag: T1637  →  OSM Key/Value

Figure 12. Example of mapping OSM to Taste Graph
Table V
CATEGORIES OF OSM USED IN OUR PROJECT

shop, cuisine, craft, diets-*, amenity, tourism, sport, leisure, natural, building, office
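Steps 1 and 3 of the POICT process can be sketched as a lookup over a key/value-to-tag parameter table. Apart from T1637 (taken from Figure 12), the taste tag ids below are invented for illustration, as are the element ids:

```python
# Hypothetical mapping (step 2) from OSM key/value pairs to taste-graph tags.
OSM_TO_TASTE = {
    ("leisure", "swimming_pool"): "T1637",  # Sports > Swimming (cf. Figure 12)
    ("amenity", "restaurant"): "T0421",     # invented tag id
    ("shop", "books"): "T0977",             # invented tag id
}

def tag_pois(elements):
    """Steps 1 and 3: keep elements whose tags map to a taste tag, attach it."""
    out = []
    for el in elements:
        for (key, value), taste in OSM_TO_TASTE.items():
            if el.get("tags", {}).get(key) == value:
                out.append({**el, "taste_tag": taste})
                break
    return out

pois = tag_pois([
    {"id": 1, "tags": {"leisure": "swimming_pool"}},
    {"id": 2, "tags": {"highway": "residential"}},  # filtered out in step 1
])
print([p["taste_tag"] for p in pois])  # -> ['T1637']
```

Elements whose tags have no entry in the parameter table simply drop out, which is what keeps the filtered map at a manageable size.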
2) Modelling user similarity: The other major offline
part of the model is to model user similarity. The purpose
of this stage is to find one hundred similar users for each
user and store them in a file. We then refer to this file when
we need to investigate similar user behaviour for an active
user, regarding the tastes for which we do not have sufficient
taste information. As mentioned earlier, the reason for this is
to cover cases where the user taste is missing.
1. We first built a user-taste matrix $X$, where each row
represents a user and each column indicates a taste:

$$X = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix}$$

where $X_{ij} = 1$ if and only if user $i$ is interested in taste $j$,
and $X_{ij} = 0$ otherwise. Accordingly, each user can be
represented by a vector which shows the user's interest in all
given tastes:

$$\vec{u}^{\,*}_i = [1\ 1\ 0\ 0\ 1\ 1\ 0]$$
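The construction of this binary user-taste matrix can be sketched as follows; the user names, taste names and interest sets are hypothetical:

```python
# Build the binary user-taste matrix X (rows = users, columns = tastes).
users = ["u1", "u2", "u3"]
tastes = ["sports", "music", "travel"]
interests = {"u1": {"sports", "travel"}, "u2": {"music"}, "u3": {"sports"}}

X = [[1 if t in interests[u] else 0 for t in tastes] for u in users]
print(X[0])  # u1's taste vector -> [1, 0, 1]
```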
2. Since each taste has a different effect on the similarity
score between two users (more popular tastes are less
important in the similarity measure [11]), we assign a weight
to each taste to control this effect, as per Equation 1:

$$w_j = \log\left(\frac{n}{n_j}\right) \qquad (1)$$

where $n$ indicates the total number of users and $n_j$ indicates
the number of users with taste $j$.
3. We apply the taste weights to the user-taste matrix in
two steps:

a) Transform the weight vector $\vec{w}$ into an $(n \times n)$
diagonal matrix:

$$W_{n \times n} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix}$$

b) Apply the weights to the user-taste vector:

$$(\vec{u}_i)_{n \times 1} = W_{n \times n}\,(\vec{u}^{\,*}_i)_{n \times 1} \qquad (2)$$

$$\vec{u}_i = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix} \begin{bmatrix} u^*_{i1} \\ u^*_{i2} \\ \vdots \\ u^*_{in} \end{bmatrix} = \begin{bmatrix} w_1 u^*_{i1} \\ w_2 u^*_{i2} \\ \vdots \\ w_n u^*_{in} \end{bmatrix}$$
4. To add the effect of users' age on similarity, we normalized the age values, as per Equation 3:

$N_{Age}(u_i) = \frac{Age(u_i)}{\max(Ages)}$ (3)
5. Calculate the difference between user $i$ and user $j$ using the Euclidean distance [7][2], as per Equation 4:

$d(u_i, u_j) = \sqrt{\lambda \left(N_{Age}(u_i) - N_{Age}(u_j)\right)^2 + (1 - \lambda) \sum_{k=1}^{n} (x_{ik} - x_{jk})^2 / n}$ (4)
6. Measure the similarity score, according to Equation 5:

$S(u_i, u_j) = \max_{\forall i,j} \left(d(u_i, u_j)\right) - d(u_i, u_j) = 1 - d(u_i, u_j)$ (5)
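The six steps above can be sketched compactly in Python. This is an illustrative sketch, not the authors' code: the user matrix, ages, and $\lambda = 0.5$ are hypothetical, and, following Equation 5, the maximum distance is assumed to be 1.

```python
import math

def taste_weights(matrix):
    """Eq. 1: w_j = log(n / n_j), down-weighting popular tastes."""
    n = len(matrix)
    return [math.log(n / sum(row[j] for row in matrix))
            for j in range(len(matrix[0]))]

def similarity(matrix, ages, i, j, lam=0.5):
    """Eqs. 2-5: age term plus weighted-taste Euclidean distance."""
    w = taste_weights(matrix)
    n_tastes = len(w)
    max_age = max(ages)
    # Eq. 3: normalized ages, combined with weight lambda (Eq. 4)
    age_term = lam * (ages[i] / max_age - ages[j] / max_age) ** 2
    # Eq. 2 applied inside Eq. 4: compare weighted taste vectors
    taste_term = (1 - lam) * sum(
        (w[k] * matrix[i][k] - w[k] * matrix[j][k]) ** 2
        for k in range(n_tastes)) / n_tastes
    d = math.sqrt(age_term + taste_term)  # Eq. 4
    return 1 - d                          # Eq. 5, assuming max distance = 1

# Three hypothetical users over four tastes
users = [[1, 1, 0, 0],
         [1, 0, 1, 1],
         [1, 1, 0, 0]]
ages = [25, 40, 27]
print(similarity(users, ages, 0, 2))  # near 1: same tastes, similar ages
print(similarity(users, ages, 0, 1))  # lower: different tastes and ages
```

In the offline stage, this score would be computed pairwise and the top one hundred most similar users per user stored to file, as described above.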
Table VI
COMPARISON OF THE PERFORMANCE OF THE APPROACH IN THE SIMULATIONS FOR TWO LOCATIONS

                  Location 1   Location 2
Top 1 Hit Rate       0.5          0.6
Top 2 Hit Rate       0.5          0.6
Top 3 Hit Rate       0.6          0.7
Top 4 Hit Rate       0.6          0.6
Top 5 Hit Rate       0.7          0.7
Top 6 Hit Rate       0.7          0.7
Top 7 Hit Rate       0.7          0.8
Top 8 Hit Rate       0.7          0.8
Top 9 Hit Rate       0.7          0.8
Top 10 Hit Rate      0.7          0.8
IV. EVALUATION OF THE MODEL
There are several approaches to evaluating recommender systems. One common approach is an off-line evaluation based on historical data [4]. In our case, we could not apply this approach due to the sparsity of the location history data and the lack of check-in information for visited places. Another option is an on-line evaluation, i.e. putting the system into production, logging the responses and feedback from users, and then applying a measure of satisfaction. We could not apply this method either, as the company had not yet developed the system.
A third option is to simulate a user scenario and measure user satisfaction through a prototype; we adopted this approach to evaluate our proposed algorithm. Accordingly, we developed a simulator in Python. The simulator receives the spatiotemporal information (latitude, longitude, timestamp) for a user and returns a list of recommended locations around the active user.
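The simulator interface can be sketched as follows. This is a hypothetical reconstruction, not the paper's code: the POI list, taste sets, ranking rule, and 2 km radius are all illustrative assumptions, and temporal filtering on the timestamp is omitted.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def recommend(lat, lon, timestamp, user_tastes, pois,
              radius_km=2.0, top_n=10):
    """Return up to top_n POIs within radius_km, ranked by taste overlap.
    The timestamp is accepted but unused in this simplified sketch."""
    nearby = [p for p in pois
              if haversine_km(lat, lon, p["lat"], p["lon"]) <= radius_km]
    nearby.sort(key=lambda p: len(user_tastes & p["tags"]), reverse=True)
    return nearby[:top_n]

# Hypothetical tagged POIs near an active user
pois = [
    {"name": "City Pool", "lat": 43.658, "lon": -79.380, "tags": {"T1637"}},
    {"name": "Cafe",      "lat": 43.657, "lon": -79.381, "tags": {"T0412"}},
]
result = recommend(43.6577, -79.3788, "2015-06-01T12:00", {"T1637"}, pois)
print(result[0]["name"])  # City Pool: matches the user's swimming taste
```

The actual system additionally falls back on similar users' tastes when the active user's profile is incomplete, as described in the offline stage.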
In order to carry out the simulation, the company identified and provided us with 10 volunteer users ready to participate in the experiment. Each volunteer was presented with a scenario of being at two well-known locations in his or her home town. Two lists of points of interest (POI) were then provided to each volunteer, each containing 10 locations. The volunteers were asked to evaluate whether the points of interest presented to them were relevant, based on their preferences. In our survey, we checked whether the top X recommended locations were relevant to the volunteers [4]. The summary of the top X relevance analysis is given in Table VI. 80% of the volunteers identified at least one relevant POI in our recommendation list.
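The top-X hit rates reported in Table VI can be computed as the fraction of volunteers who marked at least one of the top X recommendations as relevant. The sketch below illustrates this with hypothetical relevance judgments, not the actual survey responses.

```python
def top_x_hit_rates(relevance, max_x=10):
    """relevance: one boolean list per volunteer, in ranked order.
    Returns {x: fraction of volunteers with >=1 relevant POI in top x}."""
    n = len(relevance)
    return {x: sum(any(r[:x]) for r in relevance) / n
            for x in range(1, max_x + 1)}

# Two hypothetical volunteers judging a ranked 10-item list
judgments = [
    [False, False, True, False, False, False, False, False, False, False],
    [True, False, False, False, False, False, False, False, False, False],
]
rates = top_x_hit_rates(judgments)
print(rates[1], rates[3])  # 0.5 1.0
```

Because the metric asks only for at least one relevant item, the hit rate is non-decreasing in X, which matches the shape of the columns in Table VI.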
V. THREATS TO VALIDITY
In this section, we describe certain threats to the validity
of our study. We classify threats into four groups: external
validity, internal validity, construct validity and conclusion
validity [13].
A threat to external validity is that our work is specific to one company as a case study. We have identified a clear relationship between the categories of user tastes and the categories of the OSM; however, this relationship might not apply to other companies. Despite this, we believe the framework and methodology proposed may easily be applied to other companies, and differences can be captured and addressed during the mapping process.
A threat to internal validity may exist in the implementation, because an incorrect implementation can influence the output. For example, we wrote Pig scripts to retrieve the data and Python scripts to build the simulator and the similarity matrix. In our project, this threat is mitigated by manually investigating the outputs. Another threat exists when user taste profiles are incomplete; in such cases, the mapping into OSM categories would not retrieve any relevant points of interest. To mitigate this threat, we fall back on similar users' tastes and the nearest locations.
A threat to conclusion validity relates to our recommender output: our assumption that the recommender provides relevant points of interest has been evaluated by only ten users in a simulated environment, as a proof of concept. Once our approach is deployed in real life, results may differ. For this reason we intend to run an on-line feedback evaluation for the first three months after the system is developed and deployed, to measure the level of overall satisfaction and amend the model as necessary.
A threat to construct validity relates to the accuracy of the OSM data: since the data is constructed by volunteers across the world, inaccurate data would lead our recommender to give inaccurate results. Our methodology does not address this threat, yet reports indicate that the data quality of OSM is improving year on year.
VI. CONCLUSIONS
The scope of the project was to develop an algorithm to be
used as an infrastructure for a location based recommender.
Our project has passed through several validation steps, the
methodology we adopted was an interactive and iterative
methodology whereby each step of investigation and devel-
opment was presented to the company for review and feed-
back. The feedback was taken into account in the next step.
Our key strategy was to identify points of strength in the
company’s processes and data and build up-on those points.
As the scope of the project involved massive data in various
structured and unstructured formats we considered that our
algorithm is scalable and further optimized to allow for
speedy recommendations. Thus we recommended a two step
approach whereby all high computational activities involving
large data that does not change frequently, to be carried out
off-line. Through a structured review of the available literature and the company's data sources, we delivered a personalized, optimized algorithm that has been accepted by the company. We evaluated the proposed algorithm through a simulation program that we coded, using feedback from 10 users.
VII. ACKNOWLEDGMENTS
This research is supported in part by NSERC DG Engage
Grant (EG) no. 461900 2013
REFERENCES
[1] J. Bao, Y. Zheng, and M. F. Mokbel. Location-based and preference-aware recommendation using sparse geo-social networking data. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 199-208. ACM, 2012.
[2] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez. Recommender systems survey. Knowledge-Based Systems, 46:109-132, 2013.
[3] Foursquare. https://foursquare.com/.
[4] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5-53, 2004.
[5] Q. Li, Y. Zheng, X. Xie, Y. Chen, W. Liu, and W.-Y. Ma. Mining user similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 34. ACM, 2008.
[6] OpenStreetMap. Map features. http://wiki.openstreetmap.org/wiki/MapFeatures.
[7] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325-341. Springer, 2007.
[8] S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo. Socio-spatial properties of online location-based social networks. ICWSM, 11:329-336, 2011.
[9] B. Shaw, J. Shea, S. Sinha, and A. Hogue. Learning to rank for spatiotemporal search. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 717-726. ACM, 2013.
[10] M. Sklar, B. Shaw, and A. Hogue. Recommending interesting events in real-time with foursquare check-ins. In Proceedings of the Sixth ACM Conference on Recommender Systems, pages 311-312. ACM, 2012.
[11] X. Su and T. M. Khoshgoftaar. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009:4, 2009.
[12] L.-Y. Wei, Y. Zheng, and W.-C. Peng. Constructing popular routes from uncertain trajectories. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 195-203. ACM, 2012.
[13] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in Software Engineering. Springer Science & Business Media, 2012.
[14] X. Xiao, Y. Zheng, Q. Luo, and X. Xie. Finding similar users using category-based location history. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 442-445. ACM, 2010.
[15] X. Xiao, Y. Zheng, Q. Luo, and X. Xie. Inferring social ties between users with human location history. Journal of Ambient Intelligence and Humanized Computing, 5(1):3-19, 2014.
[16] V. W. Zheng, Y. Zheng, X. Xie, and Q. Yang. Collaborative location and activity recommendations with GPS history data. In Proceedings of the 19th International Conference on World Wide Web, pages 1029-1038. ACM, 2010.
[17] Y. Zheng and X. Xie. Learning location correlation from GPS trajectories. In Mobile Data Management (MDM), 2010 Eleventh International Conference on, pages 27-32. IEEE, 2010.
[18] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of the 18th International Conference on World Wide Web, pages 791-800. ACM, 2009.