Geoinformatica
DOI 10.1007/s10707-017-0306-1

Knowledge extraction from crowdsourced data for the enrichment of road networks

Gregor Jossé 1 · Klaus Arthur Schmid 1 · Andreas Züfle 2 · Georgios Skoumas 3 · Matthias Schubert 1 · Matthias Renz 2 · Dieter Pfoser 2 · Mario A. Nascimento 4

Received: 28 April 2016 / Revised: 3 July 2017 / Accepted: 31 July 2017
© Springer Science+Business Media, LLC 2017

1 Ludwig-Maximilians-Universität München, 80538, München, Germany
2 George Mason University, Fairfax, 22030, USA
3 National Technical University of Athens, Zografou 15773, Greece
4 University of Alberta, Edmonton T6G2R3, Canada

Mario A. Nascimento’s work was partially supported by NSERC, Canada.

Abstract In current navigation systems quantitative metrics such as distance, time and energy are used to determine optimal paths. Yet, a “best path”, as judged by users, might take qualitative features into account, for instance the scenery or the touristic attractiveness of a path. Machines are unable to quantify such “soft” properties. Crowdsourced data provides us with a means to record user choices and opinions. In this work, we survey heterogeneous sources of spatial, spatio-temporal and textual crowdsourced data as a proxy for qualitative information of users in movement. We (i) explore the process of extracting qualitative information from uncertain crowdsourced data sets employing different techniques, (ii) investigate the enrichment of road networks with the extracted information by adjusting its properties and by building a meta-network, and (iii) show how to use the enriched networks for routing purposes. An extensive experimental evaluation of our proposed methods on real-world data sets shows that qualitative properties as captured by crowdsourced data can indeed be used to improve the quality of routing suggestions while not sacrificing their quantitative aspects.

Keywords Crowdsourced data · Routing · Data mining · Path computation · Knowledge discovery · Road networks

1 Introduction

Crowdsourced data is paramount in a plethora of scientific and real-world applications. Technological progress, such as the ubiquity of smartphones and GPS receivers, has greatly facilitated contributing information. Nowadays, a large share of crowdsourced data (also called user-generated content) contains spatial information, often referred to as volunteered geographic information (VGI), as introduced in [1]. VGI may refer to geo-tags which have been explicitly or implicitly added to a tweet, picture or status update. Also, a check-in at a registered location or a review for a restaurant’s menu in a social network can be considered VGI. Other examples include the shared record of a user’s favorite cycling route or a textual description of a museum in a blog entry.

In this work, we explore how different user-generated data sources may be used to enrich road networks so that they can be used for routing purposes. It is our hypothesis that crowdsourced data conveys semantic knowledge and that it expresses sentiment. In contrast, commercial solutions and established algorithms almost exclusively rely on quantitative measures, i.e., “hard” metrics. Using crowdsourced data as a proxy, we aim to bring routing suggestions in line with users’ preferences. For example, assuming that Flickr users take photos of particularly appealing places, a routing algorithm which prefers areas where photos are dense generates paths which are likely to be more appealing.

The diversity of spatial crowdsourced data constitutes one of the main challenges of this work. Some data sources are inherently noisy, others provide particular depth of information. For example, check-ins at curated locations are more reliable than geo-tagged tweets, which depend on the quality of the user’s GPS signal and the density of possible Points of Interest (POI) in their environment. A plain textual mention of a “Tibetan Yoga Studio”, for instance, may be ambiguous and might not be mapped to a unique location. As textual sources are particularly ambiguous, we develop specific methods. The general topic of data quality, however, is beyond the scope of this paper. In general, we trust the surveyed data sources and aim at extracting valuable information for routing purposes.

Another phenomenon addressed in this paper is the relationship between multiple POIs. Especially for the application of routing, it makes sense to consider not only singular POIs but also multiple connected and related POIs. For instance, when considering the sequence of check-ins of one user during a day, the sequence may indicate a recommendable itinerary.

Extracting knowledge from crowdsourced data, we want to enrich the underlying road network such that it may be used for conventional shortest path algorithms. Given source and destination, we provide paths which serve as a trade-off between quantitative routing criteria such as distance and the qualitative information provided by the crowd. Allowing for slight detours from cost-optimal paths, we generate paths which reflect the underlying qualitative criteria. While many related problems, such as the family of Orienteering Problems, have been proved to be NP-hard, we aim for easy integration and polynomial runtime at query time. We explore different methods for the enrichment of the road network. In addition to network modification, we introduce a meta-network which ties path computation to the POIs of the network, generating valid itineraries. Our evaluation is based on real-world crowdsourced data sets and real-world tourist trip recommendations.
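The trade-off described above can be illustrated with a small sketch. It is not the enrichment method of this paper (those methods are introduced in Section 5); it merely assumes, for illustration, that every edge carries a length and a popularity score in [0, 1] and that popular edges are discounted by a hypothetical factor `alpha`. The toy graph and all function names are assumptions.

```python
import heapq

def shortest_path(graph, source, target, cost):
    """Dijkstra's algorithm; `cost` maps an edge record to a non-negative weight."""
    dist = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, edge in graph.get(u, {}).items():
            nd = d + cost(edge)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(queue, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy network (assumed data): length in meters, popularity in [0, 1].
graph = {
    "A": {"B": {"length": 100, "popularity": 0.0},
          "C": {"length": 110, "popularity": 0.9}},
    "B": {"D": {"length": 100, "popularity": 0.0}},
    "C": {"D": {"length": 110, "popularity": 0.9}},
    "D": {},
}

def by_length(edge):
    return edge["length"]

def enriched(edge, alpha=0.5):
    # Assumed enrichment rule: popular edges appear up to `alpha` cheaper.
    return edge["length"] * (1.0 - alpha * edge["popularity"])

print(shortest_path(graph, "A", "D", by_length))
print(shortest_path(graph, "A", "D", enriched))
```

With the plain length cost the search returns the route via B (200 m); with the assumed enriched cost it prefers the slightly longer but more popular route via C (220 m). The query itself remains an ordinary shortest path computation.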

The methodology of this paper is summarized in Fig. 1. Processing heterogeneous data sources, we propose extraction methods to mine (sequential) occurrences of POIs. These occurrences are used for different approaches of network enrichment. Finally, conventional routing algorithms may be applied to the enriched networks in order to produce alternative path results.

As an example of the query output, consider the routing scenario in Fig. 2, which is set in the city of Paris, France. The solid line represents the conventional shortest path from starting point “Gare du Nord” to the destination at “Quai de la Rapée”. The dot-dashed and dotted lines represent alternative paths computed in enriched networks as proposed in this paper. The triangles in this example mark POIs as extracted from travel blogs; in this case the POIs correspond to landmarks and sights. For instance, the path represented by the dot-dashed line passes locations like “Place de la République”, “Cirque d’Hiver” and “la Bastille” and justifiably satisfies the requirement of being touristically appealing. Compared to the conventional shortest path, indicated by the solid line, it will yield greater value for travelers.

Fig. 1 Flow chart illustrating the methods presented in this paper

Fig. 2 Shortest (solid line) and alternative paths (dot-dashed and dotted lines) along POIs in Paris, France. This result is output of the methods presented in this paper. [Originally published in [2]]

This work is divided into three parts. First, we give an overview of different data sources, categorize them, and mention advantages as well as possible drawbacks. Second, we explore how crowdsourced knowledge can be extracted from different data sources. We present different approaches for singular, pairwise or sequential occurrences of POIs. Third, we investigate how the knowledge may be used to enrich the underlying road network and may be employed in routing algorithms. In our experiments, we show that routing algorithms indeed benefit from incorporating crowdsourced knowledge. More precisely, we are able to generate paths similar to professional tourist trip recommendations. These paths reflect the qualitative criteria but only incur a minimal increase in path length. This work extends the research presented in [3], which focused on spatial relations between pairwise occurrences of POIs. Incorporating ideas published in [4, 5], we additionally present means for enrichment based on singular occurrences of POIs. A further novel extension in this work is considering sequences of POIs (triples or longer). Finally, our evaluation integrates further datasets.

The rest of this paper is organized as follows: In Section 2, we summarize related research. Section 3 classifies data sources. Extracting knowledge from these sources is detailed in Section 4. Section 5 introduces several methods for enriching road networks with the extracted knowledge. The whole process is evaluated experimentally in Section 6. Section 7 concludes this work with a summary of its contributions.

2 Related work

2.1 Semantic mining and enrichment of trajectories

The discovery of semantic places through the analysis of raw trajectory data has been investigated thoroughly over the course of the last years. The objective of this field of research is to analyze user trajectories and, in combination with POI databases, to extract semantically relevant places based on spatio-temporal patterns (number of times a POI is visited and the time spent there). The authors in [6–8] provide solutions for the semantic place recognition problem and categorize the extracted POIs into pre-defined types such as “home”, “work”, “education” and “shopping”. Moreover, the concept of “semantic behavior” has recently been introduced. This refers to the use of semantic abstractions of the raw mobility data, including not only geometric patterns but also knowledge extracted jointly from the mobility data as well as the underlying geographic and application domains in order to understand the actual behavior of users in movement. Several approaches like [9–13] have been introduced during the last decade. The core contribution of these articles lies in the development of a semantic approach that progressively transforms the raw mobility data into semantic trajectories enriched with POIs, segmentation and annotations. Finally, a recent work, [14], extracts and transforms the aforementioned semantic information into a text description in the form of a diary. The difference between our work and these approaches is that the mentioned approaches neither integrate the extracted semantic information into the road network nor use it for routing purposes. Instead, they combine trajectories with POI databases to extract semantic information on POIs and possibly enrich other trajectories (e.g., paths computed by an arbitrary algorithm) with semantic information. For instance, a sequence of timestamped geo-coordinates might be mapped to the semantic sequence: home → work → kindergarten → supermarket → home → restaurant.

2.2 Qualitative routing

The term qualitative routing is not defined precisely. We use it to describe two approaches to routing which do not solely rely on absolute measures and are therefore more “qualitative” rather than purely quantitative. First, the computation of routes which are easier to memorize, describe and follow. Second, the computation of routes which are particularly scenic, interesting or popular.

The first problem has been tackled from various angles and even research disciplines. Following a rather cognitive line of argument, the authors of [15] minimize the complexity of a route description. This is done by finding a trade-off between distance and weights which describe the complexity of the routing description at every intersection. Different cognitive models for the complexity can be employed. Later works explore the role of landmarks in route descriptions [16] and strategies on how to create compact route descriptions [17]. In [18], the authors explore the concept of route descriptions in detail by defining and evaluating agent models and deriving an agent-centric qualitative representation of the graph. A less cognitive and more spatial approach is chosen by the authors of [19]. They introduce cost criteria that allow for a trade-off between distance and complexity based on network properties. The cognitive agent models might benefit from the extraction process presented in this work; however, this goes beyond the scope of the paper. Our methods are based on crowdsourced knowledge such that ideally the crowd becomes its own agent, possibly making complex models obsolete.

The second problem has for instance been examined in [20] where a method is presented for computing beautiful, quiet and happy paths, as the authors phrase it. In order to quantify these qualities, the authors rely on explicit statements about the beauty of locations, obtained from a platform which asked users to specifically rate photos according to three categories. This is a valid approach; however, it does not scale well. In contrast, we argue for mining this kind of information from existing crowdsourced data, in order to avoid acquiring ratings in the manner of mechanical turks for locations on a global scale.

Two recent works which address a similar problem but rely on existing crowdsourced data are [21, 22]. Similar to the work in this paper, the authors of [21] propose to enrich a road network with a notion of popularity mined from geo-tagged photos. Note that we review heterogeneous data sources while [21] is limited to photos. Also, the problem and its solution differ from what is presented in this work. The user in [21] is asked to specify a number of landmarks for which connecting paths are computed which take the popularity of road segments into account. For each route segment connecting two consecutive landmarks, the optimal connecting path (the one with maximum popularity) is computed. Maximizing the popularity is equivalent to the Longest Path Problem, which is NP-hard in general. Instead, our work aims at integrating crowdsourced knowledge into polynomial routing algorithms for point-to-point queries with no additional input parameters (such as landmarks to be visited along the way). An interesting use of crowdsourced data is presented in [22] where the authors grid the road network and sample a location from each cell. Then, using a neural network pretrained on ImageNet [23], the authors classify the Google Street View pictures of each sample location according to six categories of scenicness (“sky, river, coast, . . .”). At query time, for a user input of start, target, maximum travel time budget and one of the six categories, a route is computed which maximizes the scenicness score of the chosen category while abiding by the budget. However, the problem is NP-hard (it is an instance of the Arc Orienteering Problem, see below). An approximate solution is given but no runtime numbers and no comparison to optimal results are presented.

2.3 Touristic trip planning

There are numerous variations of touristic trip planning problems and most are related to the original trip planning query (TPQ) [24]. The input of a TPQ are start and target as well as a set of categories (of POIs). The output is the cost-optimal path from start to target visiting exactly one instance of each category. The NP-hard traveling salesman problem can easily be reduced to the TPQ by assuming that every POI belongs to its own category. Hence, efficient solutions to the TPQ or further constrained modifications are heuristic. However, in real-world scenarios the candidate sets of POIs may often be narrowed down drastically (e.g., by spatial pruning or additional constraints), allowing for a full enumeration of all solutions within seconds.
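To make the TPQ definition and the remark on full enumeration concrete, the following minimal sketch enumerates every choice of one POI per category and every visiting order. It assumes planar coordinates and straight-line distances instead of network distances, and the function name and toy data are hypothetical; it is exponential in the number of categories and only sensible for small candidate sets.

```python
from itertools import permutations, product
from math import dist

def trip_planning_query(start, target, pois_by_category):
    """Brute-force TPQ: one POI per category, all visiting orders."""
    best_cost, best_route = float("inf"), None
    candidate_lists = [pois_by_category[c] for c in pois_by_category]
    for choice in product(*candidate_lists):      # one POI per category
        for order in permutations(choice):        # every visiting order
            stops = [start, *order, target]
            cost = sum(dist(a, b) for a, b in zip(stops, stops[1:]))
            if cost < best_cost:
                best_cost, best_route = cost, stops
    return best_cost, best_route

# Toy instance (assumed data): planar coordinates.
pois = {
    "cafe": [(2, 0), (5, 5)],
    "museum": [(8, 0), (1, 9)],
}
cost, route = trip_planning_query((0, 0), (10, 0), pois)
print(cost, route)
```

In this toy instance both selected POIs lie on the straight line between start and target, so the best route incurs no detour at all.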

Early variations of the TPQ, [25, 26], allowed for a particular order of some of the categories and for further constraints regarding the entities of the categories. Further constrained modifications of the query often focus on the use case of touristic trip planning, occasionally also referred to as itinerary planning, largely summarized in [27]. For example, the query objective may be to maximize the subset of a set of predefined POIs which can be visited in a tour with a certain time-constraint [28]. More recent works in this area exploit crowdsourced data for POI categorization [29], for POI-popularity estimation [30], for POI-recommender systems [31], for the determination of opening hours or recommended visiting times [32], for deriving the average duration of stay at each POI [33] or for combinations of the above. In contrast to these works, we do not focus solely on the purpose of itinerary planning but follow a more general approach. We emphasize the process of knowledge extraction with regard to the diversity of the crowdsourced data, its quality and properties. Also, we investigate various data types and sources and give insight into the aspects of related POIs and sequences of POIs. It is our belief that itinerary planning methods like the above could be refined using the contributions of this work. However, we choose not to adopt the rather rigid structure and the NP-hard nature of trip or itinerary planning queries.

2.4 Orienteering problems

Another relevant family of NP-hard problems consists of the orienteering problem (OP), the arc-orienteering problem (AOP) and variants thereof. The input of both are start and target locations and a cost budget, usually a maximum travel time. Additionally, there is a non-negative value assigned to the nodes of the graph of the OP, while for the AOP there is a value assigned to the edges of the graph (or arcs, hence the name). The output of both queries is the path from start to target which has the greatest accumulated value among all paths from start to target not exceeding the cost budget. The OP and numerous modifications have been thoroughly studied [34], especially in the field of Operations Research. Lately, the AOP has gained attention [35, 36], also in the database research community [37]. The contributions of this work have great relevance for the OP and AOP, as the extracted knowledge can be conceived as a value associated with nodes or edges. Using the techniques proposed and compared in this paper, crowdsourced information may be integrated into existing solutions, adding another facet to the perception of “value” in OPs and AOPs. As mentioned before, we focus on incorporating crowdsourced knowledge into conventional shortest path algorithms with polynomial runtime. We therefore distinguish the ideas presented in this paper from OP and AOP solutions.
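The OP definition above can be illustrated by an exhaustive search on a tiny graph. The sketch below is only a didactic illustration of the problem statement, not one of the solutions cited above; it restricts itself to simple paths, and the graph, node values and function name are assumed for the example.

```python
def orienteering(graph, values, start, target, budget):
    """Exhaustive OP search over simple paths; exponential, tiny graphs only.

    graph:  node -> {neighbor: edge cost}
    values: node -> non-negative value collected when the node is visited
    Returns (best value, best path) or (-inf, None) if the budget is too small.
    """
    best = (float("-inf"), None)

    def extend(node, path, cost, value):
        nonlocal best
        if node == target:
            if value > best[0]:
                best = (value, list(path))
            return  # a simple path ends once the target is reached
        for nxt, edge_cost in graph.get(node, {}).items():
            if nxt not in path and cost + edge_cost <= budget:
                path.append(nxt)
                extend(nxt, path, cost + edge_cost, value + values.get(nxt, 0))
                path.pop()

    extend(start, [start], 0, values.get(start, 0))
    return best

# Toy instance (assumed data).
graph = {"s": {"a": 2, "t": 3}, "a": {"t": 2, "b": 2}, "b": {"t": 2}, "t": {}}
values = {"a": 5, "b": 4}

print(orienteering(graph, values, "s", "t", budget=4))
print(orienteering(graph, values, "s", "t", budget=6))
```

Raising the budget from 4 to 6 lets the search include node b as well, which shows how the budget controls the admissible detour.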

3 Data types and processing

In the following, we give a brief overview of different types of data sources. All of the sources are explored theoretically and practically. Subsequent to this introduction, we introduce our methodology of knowledge extraction and network enrichment.

3.1 Spatially enriched data

By spatially enriched data we refer to data with attached or inherent geo-tags, e.g., check-ins at pre-defined locations, geo-tagged tweets or geo-tagged images. The great benefit of basic spatial point data is its wide availability and large coverage. Most services relying on spatially enriched data provide APIs and exemplary data sets, such as Foursquare, Yelp, Twitter and Flickr.¹ While check-ins at curated locations provide precise spatial information, geo-tagged data is more error-prone due to imprecise GPS sensors. The authors of [38] show that the deviation of geo-tags from their actual positions is sufficiently small when the locations are popular locations such as tourist hotspots. Interestingly, [39] shows that the deviation differs for different photo sharing services. In addition, [39] shows that deviation is greater in Latin America and Asia than in North America and Europe, the latter two with an average deviation not exceeding 60 meters. Given these results and the fact that GPS signals are not exactly but very close to being normally distributed around the actual position [40], we choose to trust geo-tags as provided by our data sources.

¹ www.foursquare.com, www.yelp.com, www.twitter.com, www.flickr.com

Of course, the spatial information is just an additional component to the actual content of the service, e.g., Twitter’s tweets, Yelp’s reviews or Flickr’s photos. Depending on the content, different notions of information, value or popularity are reflected. For example, the number of Flickr photos in the vicinity of a particular POI may be interpreted as a measure for its appeal or, more generally, its popularity (as in [3–5, 37]). Similarly, the number of check-ins in location-based networks like Foursquare and Yelp may be considered a measure for trendiness or popularity (as in [29, 31, 32]). While Flickr photos tend to describe aesthetic appeal, accumulated check-ins rather reflect the popularity of restaurants, bars or clubs (according to the users of the particular service). This underlines the diversity of information provided by different data sources.

3.2 Spatio-temporal data

We define spatio-temporal data as sequences of timestamped locations, possibly enriched with additional data, such as textual descriptions of a trip, of locations visited along the trip or meta-information like the vehicle used for the trip or the weather on that particular day. The prime example of spatio-temporal data are trajectories as collected by contributors to the open source map service OpenStreetMap, as contributed by users of shared mobility services such as Capital Bikeshare, or as recorded and uploaded by runners, cyclists or others to platforms like Endomondo.² However, any temporally ordered sequence of geo-locations can be classified as spatio-temporal data, like consecutive check-ins of a particular user of a location-based network. In this case, if a significant number of users have visited the same locations within the time frame of, for example, a day, one may recommend visiting these locations as part of a day’s trip (as in [30, 31, 33]).

² www.openstreetmap.org, www.capitalbikeshare.com, www.endomondo.com

3.3 Textual data

From data which contains explicit spatial information, we now turn to implicitly spatial data. Pure textual narrative, for instance blog entries or other text corpora, often contains mentions of POIs. Aside from the lack of geo-locations, the ambiguity of language and duplicate identifiers raise difficulties. For instance, “Chinatown” is not a unique identifier, neither is “Gelateria Bella Italia” (it does not even relate locally) and a “Tibetan Yoga Studio” is most likely not in Tibet. Nevertheless, we want to stress the importance of textual data as it is a particularly rich source of crowdsourced information. Although authoring explicit spatial data has been facilitated by technical progress, it sometimes still requires special applications and/or special knowledge, e.g., when contributing to OpenStreetMap. Hence, many users are more comfortable using narrative when generating content. Especially when evaluating visited places, users often generate narrative using qualitative adjectives such as “beautiful”, “interesting” and “cool”. Similarly, many users describe their movement using toponyms and spatial relations (“near to”, “next to”, “close by”, etc.) rather than using geo-coordinates. Hence, there is a largely unused abundance of crowdsourced knowledge in the form of blog entries or other narratives (freely) available on the internet. In the following, we describe how we mine toponyms and in turn geo-locations from travel blogs as an example for crowdsourced textual data. This is a prerequisite for the knowledge extraction methods presented in Sections 4.1 and 4.3.

In order to gather travel blog entries, we use web-crawling techniques and compile a database consisting of 250,000 blog entries from 20 different travel blogs³ as presented in [2, 41]. Extracting qualitative information from text requires the detection of toponyms, i.e., place names within the raw text. By geoparsing, candidate phrases containing references to POIs are identified. Let us note that geoparsing is a method well-established in the field of Natural Language Processing, and we used the Natural Language Toolkit [42] in our implementation. Subsequently, we geocode these POIs, i.e., we map the POIs onto geo-locations. This is done using the geographical gazetteer database GeoNames⁴ which contains over ten million POI names, their synonyms and their coordinates worldwide. Of the 500,000 POIs mined from the text corpus, we were able to geocode 480,000. For further details, we refer the reader to [43] and [41].
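The two steps, geoparsing and geocoding, can be sketched with a toy gazetteer. The sketch deliberately uses plain string matching as a stand-in for the NLTK-based geoparsing and the GeoNames lookup of the actual implementation; the mini-gazetteer, its approximate coordinates and all function names are assumptions made for illustration.

```python
import re

# Hypothetical mini-gazetteer: canonical name -> (synonyms, approximate (lat, lon)).
GAZETTEER = {
    "Cologne Cathedral": (["Cologne Cathedral", "Kölner Dom"], (50.9413, 6.9583)),
    "Place de la Bastille": (["Place de la Bastille", "la Bastille"], (48.8532, 2.3692)),
}

def build_index(gazetteer):
    """Map every lower-cased synonym to its canonical POI name."""
    return {syn.lower(): name
            for name, (synonyms, _) in gazetteer.items()
            for syn in synonyms}

def geoparse(text, index):
    """Return canonical names of POIs mentioned in `text`, in order of appearance."""
    if not index:
        return []
    # Longest synonyms first, so the most specific name is tried first.
    alternatives = sorted(index, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in alternatives)
    return [index[m.group(0).lower()]
            for m in re.finditer(pattern, text, re.IGNORECASE)]

def geocode(names, gazetteer):
    """Map canonical POI names onto their gazetteer coordinates."""
    return {name: gazetteer[name][1] for name in names}

index = build_index(GAZETTEER)
mentions = geoparse("We walked from la Bastille and later saw the Kölner Dom.", index)
print(mentions)
print(geocode(mentions, GAZETTEER))
```

A real pipeline additionally has to disambiguate toponyms that map to several gazetteer entries, which this sketch ignores.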

4 Knowledge extraction

This section describes the extraction process of qualitative knowledge from crowdsourced data with increasing complexity. First, singular occurrences of POIs are examined, then relationships and sequences are investigated. Particular attention is paid to textual data due to its ambiguity and noise.

4.1 Single occurrences

Accumulation of separate occurrences of POIs is a basic but effective method for quantifying spatial information resulting in numeric scores which reflect some notion of value or popularity. The number of photos taken in the vicinity of a POI or the number of check-ins at particular locations have both been employed as a measure for popularity in research. When considering check-ins, the basic approach is to sum up the number of check-ins at each location and directly use this score as a measure. A straightforward extension is to factor in the location-specific rating provided by the corresponding location-based social network, e.g., the ten-point-rating of Foursquare or the five-star-rating of Yelp.
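The check-in aggregation can be sketched as follows. How exactly a rating is factored in is not specified above; the sketch assumes, purely for illustration, that the check-in count is multiplied by the rating normalized to [0, 1] and that unrated locations keep their raw count. Function and variable names are hypothetical.

```python
from collections import Counter

def checkin_scores(checkins, ratings=None, max_rating=10.0):
    """Score locations by their number of check-ins, optionally weighted by rating.

    checkins: iterable of (user, location) pairs
    ratings:  optional mapping location -> rating on a 0..max_rating scale
    """
    counts = Counter(location for _user, location in checkins)
    if ratings is None:
        return dict(counts)
    # Assumption: unrated locations keep their raw count (factor 1.0).
    return {loc: n * ratings.get(loc, max_rating) / max_rating
            for loc, n in counts.items()}

checkins = [("u1", "cafe"), ("u2", "cafe"), ("u1", "bar")]
print(checkin_scores(checkins))
print(checkin_scores(checkins, ratings={"cafe": 8.0, "bar": 5.0}))
```

For a five-star scale such as Yelp's, `max_rating=5.0` would be passed instead of the ten-point default.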

In contrast, geo-tagged photos are not necessarily taken exactly at the POI they depict but in a more or less strict vicinity. One basic approach is to consult a POI database (e.g., GeoNames) to define a distance threshold ε and to count the photos in the ε-range of each POI. Alternatively, one may assign all pictures to a certain POI which have this POI as their nearest neighbor, as suggested in [37]. Another means of aggregation is presented in [44]. The authors propose to give logarithmically decreasing weight to subsequent photos of the same POI by the same user.

³ www.travelblog.com, www.traveljournal.com, www.travelpod.com
⁴ www.geonames.org
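The three photo aggregation strategies described above (ε-range counting, nearest-neighbor assignment and damped weights for repeated photos by the same user) might be sketched as follows. The exact weighting function of [44] is not given here, so the weight 1 / log2(k + 1) for a user's k-th photo of a POI is an assumption, as are the toy coordinates and all names.

```python
from collections import defaultdict
from math import asin, cos, log2, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def nearest_poi(pois, loc):
    return min(pois, key=lambda name: haversine_m(pois[name], loc))

def count_in_range(pois, photos, eps_m):
    """Count the photos within eps_m meters of each POI."""
    return {name: sum(haversine_m(loc, ph["loc"]) <= eps_m for ph in photos)
            for name, loc in pois.items()}

def count_nearest(pois, photos):
    """Assign every photo to its nearest POI and count per POI."""
    counts = dict.fromkeys(pois, 0)
    for ph in photos:
        counts[nearest_poi(pois, ph["loc"])] += 1
    return counts

def damped_scores(pois, photos):
    """Nearest-POI assignment; a user's k-th photo of a POI weighs 1 / log2(k + 1)."""
    seen = defaultdict(int)
    scores = dict.fromkeys(pois, 0.0)
    for ph in photos:
        poi = nearest_poi(pois, ph["loc"])
        seen[ph["user"], poi] += 1
        scores[poi] += 1 / log2(seen[ph["user"], poi] + 1)  # assumed damping
    return scores

# Toy data (assumed): two POIs and four photos by three users.
pois = {"tower": (48.8584, 2.2945), "arch": (48.8738, 2.2950)}
photos = [
    {"user": "a", "loc": (48.8585, 2.2946)},
    {"user": "a", "loc": (48.8583, 2.2944)},
    {"user": "b", "loc": (48.8737, 2.2951)},
    {"user": "c", "loc": (48.8660, 2.2950)},  # far from both POIs
]
print(count_in_range(pois, photos, eps_m=100))
print(count_nearest(pois, photos))
print(damped_scores(pois, photos))
```

The photo of user c illustrates the difference between the strategies: it lies outside the ε-range of both POIs but is still assigned to its nearest neighbor.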

Results of any of these means of aggregation may be improved when considering the text tags of the given photos. Using a gazetteer database which provides synonyms (e.g., “Cologne Cathedral”, “Kölner Dom”, “Hohe Domkirche St. Petrus”), the text tags of all photos in a greater vicinity of the POI may be scanned for word matches and assigned accordingly.

Besides check-ins and geo-tagged photos we propose another measure for popularity based on single occurrences of POIs in crowdsourced data. While photos tend to measure appeal and check-ins indicate popularity, we also exploit a combination of two crowdsourced data sets to infer sentiment according to Twitter. As described before, we have access to the names and geo-locations of touristically relevant POIs. Crawling Twitter’s public feed for mentions of each POI’s synonymous toponyms, it is possible to obtain a significant number of tweets in the area of interest. Finally, we apply the sentiment analysis feature of the Natural Language Toolkit [42] to each tweet and aggregate the output score for each POI separately. This gives us another notion of popularity in terms of a numeric score for each POI reflecting the sentiment of Twitter users in regard to that POI, which will be examined in our experimental evaluation.
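The aggregation of per-tweet sentiment into one score per POI can be sketched as below. To keep the example self-contained, a tiny word-list scorer stands in for the sentiment analysis of the Natural Language Toolkit; any function mapping a text to a numeric score could be plugged in via the `score` parameter. Averaging is one possible aggregation and is an assumption here, as are the word lists and names.

```python
from collections import defaultdict

# Stand-in lexicon (assumed); not the scorer used in the paper.
POSITIVE = {"beautiful", "great", "love", "amazing"}
NEGATIVE = {"crowded", "boring", "dirty", "awful"}

def toy_sentiment(text):
    """Number of positive words minus number of negative words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def poi_sentiment(tweets, synonyms, score=toy_sentiment):
    """Average sentiment of all tweets mentioning a POI by any of its synonyms."""
    totals, counts = defaultdict(float), defaultdict(int)
    for tweet in tweets:
        lowered = tweet.lower()
        value = score(tweet)
        for poi, names in synonyms.items():
            if any(name.lower() in lowered for name in names):
                totals[poi] += value
                counts[poi] += 1
    return {poi: totals[poi] / counts[poi] for poi in counts}

synonyms = {"Cologne Cathedral": ["Cologne Cathedral", "Kölner Dom"]}
tweets = [
    "The Kölner Dom is beautiful!",
    "Cologne Cathedral was crowded and boring.",
    "Great coffee today",
]
print(poi_sentiment(tweets, synonyms))
```

POIs that are never mentioned do not appear in the result, so a caller has to decide on a default score for them.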

4.2 Pairwise occurrences

We now investigate how to mine sequences of POIs rather than separate occurrences. By considering pairs of POIs, for example, it is possible to model connections between the POIs. Depending on the nature of the pairings, different knowledge will be reflected in the connections. If a significant number of users checked in at the same two locations consecutively, it might make sense to recommend the pair of locations rather than each location separately when planning an itinerary or a route. Check-ins at pre-defined locations are hardly subject to errors (compared to GPS data) because checking in is only possible at a curated set of locations. Therefore, it usually suffices to simply sum and normalize occurrences of pairs of check-ins in order to score their value when visited in combination. As before, additional information such as specific ratings may easily be taken into account. More importantly, when considering pairs of POIs visited in conjunction, spatial connectivity will become a factor. We explore this aspect in the following using purely textual data to show that even intricate data sets can be mined for knowledge about pairwise occurrences.
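Summing and normalizing consecutive check-in pairs can be sketched in a few lines. The sketch assumes that each user's check-ins are already ordered by time, keeps the direction of each pair, and normalizes by the most frequent pair; the normalization choice and all names are assumptions.

```python
from collections import Counter

def pair_scores(sequences):
    """Score ordered pairs of consecutively visited locations in [0, 1].

    sequences: one time-ordered list of locations per user (and day).
    """
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            if a != b:  # ignore repeated check-ins at the same location
                counts[a, b] += 1
    top = max(counts.values(), default=1)
    return {pair: n / top for pair, n in counts.items()}

sequences = [
    ["museum", "cafe", "park"],
    ["museum", "cafe"],
    ["cafe", "park"],
    ["museum", "cafe", "bar"],
]
print(pair_scores(sequences))
```

Here the pair (museum, cafe) occurs three times and receives the top score of 1.0, while (cafe, bar) occurs once and scores one third.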

We make use of the text corpus described in Section 3.3. More precisely, we consider toponyms (mapped to POIs) which are linked by spatial relations describing closeness, such as “nearby”, “next to”, “at”, “in” or “in front of”. We refer to these spatial relations as closeness relations. If a pair of POIs is mentioned significantly often linked by a closeness relation, one may deduce that the two POIs are in fact close to each other. From the text corpus of 250,000 travel blog entries, we were able to mine 660,000 triplets of the form $(P_i, \text{closeness relation}, P_j)$. Here and in the following, we denote POIs by $P_k$ with varying index. A sample of POIs in London, UK, as well as New York City, US, and their respective closeness relations are visualized in Fig. 3.

The details of our probabilistic approach are given in the Appendix. For each triplet of the form $(P_i, \text{closeness relation}, P_j)$ we compute a spatial feature vector reflecting distance and orientation of the POIs' locations. The entirety of feature vectors for a specific closeness relation is used to train a Gaussian Mixture Model (GMM). Given a particular GMM and the spatial feature vector of a pair of POIs, we obtain the (posterior) probability that these POIs stand in the given closeness relation to each other. From the posterior probabilities we derive a closeness score for each pair of POIs. This score reflects a confidence in the notion of closeness as reflected by the relations in the underlying data. Hence, this method can be used to evaluate occurrences of pairs of POIs in textual data w.r.t. their closeness. In the following, when speaking of a pairwise occurrence (of POIs), we mean a mined pair of POIs with non-zero or sufficient score.

Fig. 3 POIs (nodes) and their relations (edges) extracted from travel blogs. Visualized are two samples from London, UK (left), and New York City, US (right). [The right illustration was originally published in [41]]

4.3 Sequential occurrences

Extending the concept introduced above, we now consider sequential occurrences of POIs having more than two elements. By the same logic as above, general sequences of POIs might bear additional information compared to single or pairwise occurrences. In particular, itineraries might be refined by incorporating triples, quadruples or even longer sequences of POIs. In the use case of itinerary planning, combining two separate pairs of occurrences might yield unwanted categorical duplicates. For instance, consider two distinct pairs of POIs both containing a restaurant. In this case, concatenating the two pairs will guide the user to two restaurants, possibly within a short time-span. When mining longer sequences of POIs, we gain valuable information about combinations of POIs which the crowd found to be valid. For example, longer sequences might also contain two restaurants. However, when such a sequence is mined frequently, this implies a validation by the crowd, e.g., one restaurant might be a great lunch spot, the other a nice dinner place. Accumulating the significant sequences, we obtain a measure for the relevance of a certain combination of POIs. Clearly, this also holds for similar sequences in other types of crowdsourced data, like, for instance, Flickr photos. Mining sequential photos of Flickr users to extract knowledge about their movement patterns is an established approach in the research community [31, 33].

From a theoretical point of view, the process of knowledge extraction is similar to that described above. Consider, for example, consecutive Foursquare check-ins by one user during a day. If the user checks in at two places, the combination forms a pair; if they check in at three places, a triple is formed. In this interpretation of the data, the two pairs contained in a triple need not necessarily be counted as pairs themselves. Another way of looking at the data might be to imply that every relevant triple also generates two pairs. Besides consecutive Foursquare check-ins and Flickr pictures, one may also extract sequences of POIs from textual sources, for example from travel blog entries, following the approach introduced above. The challenge now is how to enrich the network with the obtained scores. Consider the following example: Assume you have mined and quantified the score of two POI pairs $(P_i, P_j)$ and $(P_j, P_k)$ and the score of their concatenation $(P_i, P_j, P_k)$. The score of the triple might surpass the score of the pairs. This might be because relevant triples are not considered to generate pairs or simply because triples are scored higher than pairs. The question of how to use the quantified knowledge in order to enrich the underlying road network is tackled in the next section. As before, when speaking of a sequential occurrence (of POIs), we mean a mined sequence of POIs with non-zero or sufficiently significant score (greater than a given threshold).

5 Enrichment

In this section, we show how the popularity scores presented in the previous section are made actionable for routing purposes. We explore two different approaches: For the first approach, presented in Section 5.1, we pursue the idea that a gain in score corresponds to a reduction in cost, i.e., gain and cost are assumed to be reciprocal. Conventional routing algorithms expand paths with promising edges, i.e., edges with low cost values. By reducing the cost of edges with non-zero score, conventional routing algorithms favor these edges, steering the exploration in the direction of such areas. Following this first approach, we obtain what we refer to as enriched graphs. For the second approach, presented in Section 5.2, we introduce what we refer to as POI graphs, i.e., a meta-network where nodes correspond to POIs and edges correspond to shortest paths between them. This allows for the computation of paths which are ensured to visit POIs provided by the data source.

Both approaches utilize path computation algorithms which minimize given cost criteria, such as Dijkstra's algorithm [45] and state-of-the-art path skyline algorithms [46, 47]. We refer to these algorithms as conventional routing algorithms.

5.1 Enriched graphs

We model any road network as a directed and weighted graph derived from OpenStreetMap (OSM, www.openstreetmap.org) data, as illustrated in Fig. 4a. We denote the graph by $G = (V, E, c)$, where the vertices (or nodes) $v \in V$ correspond to crossroads, dead ends etc., and the edges $e \in E \subseteq V \times V$ represent streets connecting vertices. Furthermore, let $c : E \to \mathbb{R}^+_0$ denote the function which maps every edge onto its cost criterion. We introduce the following notations: $e_{uv} = (u, v)$ and $c_{uv} = c(e_{uv})$. If not stated otherwise, the cost criterion is distance. Other possible criteria are, for instance, travel time or energy consumption. If multiple cost criteria are used, they are denoted by $c_1, \dots, c_k$. Furthermore, let $\mathcal{P}$ denote the set of POIs. We assume every POI to be a node in the graph, i.e., $\mathcal{P} \subseteq V$. This is a minor constraint as we can easily map each POI to the nearest node in the graph or introduce pseudo-nodes at the POI's location. Finally, a sequence of consecutive edges is referred to as a path. Clearly, the notion of a cost criterion $c$ extends to any path $p$. By $c(p)$ we denote the summed cost of the edges of path $p$.


Fig. 4 Example road network. (a) Part of a road network with two POIs (black circles). (b) Illustration of two possible skyline paths connecting a pair of related POIs.

Single occurrences In Section 4.1 we describe the knowledge extraction process considering single occurrences of POIs. The output of this process is a normalized, node-associated score. Thus, POIs are assigned a numeric score in the range [0, 1], describing some notion of value or popularity. A score of 0 corresponds to POIs with no score, and 1 corresponds to the highest possible score. In order for this score to be used in conventional routing algorithms, it has to be offset against the underlying cost criterion. Otherwise, the optimal path would be the solution to a traveling salesman problem visiting all POIs to acquire the maximum possible score. We consider the score to be a reduction in cost, and, in the following, we convert the node-associated score into an edge-associated cost.

Let $s(v)$ denote the score we associate with a node $v \in V$. Note that $s(v) = 0$ for $v \in V \setminus \mathcal{P}$, and $s(P) \geq 0$ for $P \in \mathcal{P}$. For each edge $e = (u, v) \in E$, we define the score-discounted cost (sdc) $\mathrm{sdc}(e)$ as

$$\mathrm{sdc}(e) := c(e) \cdot \varphi^{\,s(u)+s(v)} \tag{1}$$

where $\varphi \in (0, 1)$ is a scaling factor for the influence of the respective score. Intuitively, if $e$ connects two vertices $u$ and $v$ with score 0, $\mathrm{sdc}(e)$ equals $c(e)$. As the total score $s(u)+s(v)$ of nodes $u$ and $v$ increases, the score-discounted cost of $e$ decreases exponentially. Thus, for an exceptionally large score of $u$ and $v$, the adjusted cost $\mathrm{sdc}(e)$ of edge $e$ approaches zero. The parameter $\varphi$ controls how quickly the score-discounted cost $\mathrm{sdc}(e)$ converges to zero for an increasing score $s(u) + s(v)$ of the two vertices connected by $e$, making the discounted edge $e$ more attractive for routing algorithms.

If a given road network graph $G$ is enriched with a score function sdc reflecting the popularity of single occurrences of POIs, we refer to $G^1 = (V, E, \mathrm{sdc})$ as the enriched (road network) graph. Since each edge in $G^1$ is given the score-discounted cost $\mathrm{sdc}(e)$, the notion of optimal paths will change according to the new cost measure. This way, we achieve our goal of incorporating the notion of crowdsourced qualitative information using conventional shortest path algorithms.
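To make the effect of Eq. (1) concrete, the following sketch enriches a toy graph and runs a plain Dijkstra search on both versions. The adjacency-dict representation, the helper names (`score_discounted_cost`, `enrich`, `shortest_path`), and all numbers are assumptions chosen for illustration; this is not the MARiO implementation used in the experiments.

```python
import heapq


def score_discounted_cost(cost, s_u, s_v, phi):
    """Eq. (1): sdc(e) = c(e) * phi ** (s(u) + s(v)), with 0 < phi < 1."""
    return cost * phi ** (s_u + s_v)


def enrich(graph, score, phi):
    """Return a copy of graph (dict u -> {v: cost}) with score-discounted costs."""
    return {
        u: {v: score_discounted_cost(c, score.get(u, 0.0), score.get(v, 0.0), phi)
            for v, c in nbrs.items()}
        for u, nbrs in graph.items()
    }


def shortest_path(graph, source, target):
    """Plain Dijkstra; returns (cost, node list) or (inf, []) if unreachable."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            path = [u]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, c in graph.get(u, {}).items():
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    return float("inf"), []


# Toy network: a-b-d is the shortest route, a-p-d is a detour past a POI p.
road = {"a": {"b": 1.0, "p": 1.5}, "b": {"d": 1.0}, "p": {"d": 1.5}, "d": {}}
score = {"p": 1.0}  # only POIs carry a non-zero score
g1 = enrich(road, score, phi=0.5)

plain_cost, plain_path = shortest_path(road, "a", "d")
rich_cost, rich_path = shortest_path(g1, "a", "d")
```

In the unmodified graph the search returns a-b-d; after enrichment both edges adjacent to p are halved, so the same algorithm now prefers a-p-d. Smaller values of φ make the detour attractive even for low scores.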

Pairwise occurrences Now, let us consider pairwise occurrences of POIs such as consecutive check-ins of a user of a location-based network. The actual path the user chose between those check-ins is not recorded. Although there is a (spatial) connection between the check-in locations, it is unclear which part of the network should be enriched. We make the assumption that users generally prefer cost-optimal paths, i.e., paths which are optimal under the given cost criteria, such as the shortest path, fastest path or Pareto-optimal paths.


We propose to discount all edges along these paths, such that routing algorithms will prefer sections of the network with many optimal paths between crowd-favored pairs of POIs. To summarize and illustrate this, reconsider Fig. 4b. The set of optimal paths w.r.t. multiple cost criteria is highlighted. Our approach decreases the cost of all edges located on any of the highlighted paths connecting the POIs. This approach is described in greater detail in the following.

If the underlying road network uses one cost criterion only, usually distance or travel time, we discount the cost-optimal path connecting the pair of POIs. In multicriteria road networks, e.g., with additional cost values for energy consumption or toll fees, there are several possibilities of enrichment. A straightforward approach is to discount the cost-optimal paths according to each criterion, i.e., enriching up to $d$ paths with crowdsourced knowledge when the network has $d$ criteria. In order to increase the density of enrichment, we propose to enrich the path skyline or linear path skyline [46, 48, 49]. If the number of cost criteria is relatively high (more than three) and/or the average extent between query end points is relatively large (> 100 km), it is advisable to use the linear path skyline instead of the conventional path skyline to restrict the influence of enrichment. Once we have selected the set of paths to enrich, we proceed as follows.

Let $(P_i, P_j)$ denote a pair of POIs mined from the data, for instance two frequent consecutive check-in locations or two POIs which are connected by spatial closeness relations. According to one of the definitions above, we compute a set of optimal paths, denoted by $S_{i,j}$. Although the paths contained in $S_{i,j}$ differ from one another, they often share edges. If an edge occurs in more than one optimal path, we only discount its cost once. Let $E_{i,j} \subset E$ denote the set of all distinct edges which are part of at least one optimal path from $P_i$ to $P_j$. Then we define the score-discounted cost of cost criterion $c_i$ of an edge as

$$\mathrm{sdc}_i(e) = c_i(e) \cdot \prod_{e \in E_{i,j}} (1 - \varphi\, s_{ij})$$

The product ranges over all mined pairs $(P_i, P_j)$ whose edge set $E_{i,j}$ contains $e$; the index of the cost criterion $c_i$ is unrelated to the POI index $i$.

As before, $\varphi \in (0, 1)$ is a scaling factor, and $s_{ij}$ denotes the score of the pair of POIs. For example, $s_{ij}$ might be the number of consecutive check-ins at $P_i$ and $P_j$ normalized by the maximum number of check-ins at a pair of POIs. Alternatively, $s_{ij} = cs_{ij}$ might be the closeness score defined in Section 4.2. Analogous to before, given a road network graph $G$, we define the enriched (road network) graph for a score $s_{ij}$ reflecting connections between pairs of POIs as $G^2 = (V, E, \mathrm{sdc}_1, \dots, \mathrm{sdc}_k)$.
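The pairwise discounting can be sketched for a single cost criterion as follows. The sketch assumes that the optimal paths of every mined pair have already been computed and are passed in as node lists; the function name `discount_pairs`, the dict layouts, and the numbers are hypothetical. It also assumes the reading that each pair whose optimal paths contain an edge contributes one factor $(1 - \varphi\, s_{ij})$ to that edge.

```python
def discount_pairs(cost, pair_paths, pair_score, phi):
    """Apply one factor (1 - phi * s_ij) per mined pair to every edge in E_ij.

    cost:       dict edge (u, v) -> cost under one criterion
    pair_paths: dict (Pi, Pj) -> list of optimal paths, each a node list
    pair_score: dict (Pi, Pj) -> s_ij in [0, 1]
    """
    sdc = dict(cost)
    for pair, paths in pair_paths.items():
        # E_ij: distinct edges on at least one optimal path of this pair,
        # so an edge shared by several paths is discounted only once per pair.
        edges = {(u, v) for path in paths for u, v in zip(path, path[1:])}
        factor = 1.0 - phi * pair_score[pair]
        for edge in edges:
            sdc[edge] *= factor
    return sdc


cost = {("a", "b"): 2.0, ("b", "c"): 2.0, ("b", "d"): 1.0,
        ("d", "c"): 1.0, ("c", "e"): 4.0}
pair_paths = {
    ("a", "c"): [["a", "b", "c"], ["a", "b", "d", "c"]],  # two skyline paths
    ("b", "c"): [["b", "c"]],
}
pair_score = {("a", "c"): 1.0, ("b", "c"): 0.4}
sdc = discount_pairs(cost, pair_paths, pair_score, phi=0.5)
```

Edge (a, b) lies on both paths of the pair (a, c) but is discounted once, to 1.0. Edge (b, c) belongs to both pairs and receives both factors, 0.5 and 0.8, while (c, e) lies on no optimal path and keeps its original cost.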

Discussion The above approaches directly modify the costs in the underlying road network. The cost of an edge is reduced if a relevant amount of qualitative information regarding that edge (or its vicinity) is found. More precisely, if edges $e$ and $f$ have the same cost but $e$ has a higher score than $f$, then $\mathrm{sdc}(e) < \mathrm{sdc}(f)$. When mining single occurrences of POIs from crowdsourced data, this adequately reflects the local influence of the score of each POI on the cost of the adjacent edges. When considering pairs of POIs, the cost of an edge is reduced if it is part of an optimal path connecting the pair. This, however, does not “force” routing algorithms to compute a path which actually visits both of these POIs. This can be ensured by introducing a meta-network.

5.2 POI graphs

For some applications, visiting POIs may be desirable (e.g., when planning touristic itineraries). For these applications, we introduce a meta-network which also allows us to consider sequential occurrences of POIs. We refer to this meta-network as POI graph because the nodes of the graph correspond to POIs and the edges correspond to paths connecting these POIs.

Enrichment of pairwise occurrences In the following, we build a POI graph $G^2_{\mathrm{POI}}$ using pairwise occurrences. Each POI mentioned in a significant number of occurrences forms a vertex of $G^2_{\mathrm{POI}}$. For any pairwise occurrence between two POIs $P_i$ and $P_j$, an edge $e_{ij}$ is added to $G^2_{\mathrm{POI}}$. This edge holds a reference to the cost-optimal path connecting the pair. Figure 4b depicts an example: A new edge is added as a shortcut between the two depicted POIs. Note that the precomputation of cost-optimal paths is a static preprocessing step which does not aim to accelerate subsequent path searches but to bind path computation more strongly to the set of mined POI pairs.

When issuing a query, the user inputs start and target nodes. Of course, these are not necessarily POIs, i.e., part of $G^2_{\mathrm{POI}}$. Thus, we propose a slight modification of Dijkstra's algorithm, as pseudo-coded in Algorithm 1. The idea is to find entry and exit POIs which are close to the start and target node, respectively (for details about this rather technical step we refer the interested reader to [3]). Subsequently, two paths are computed in the original road network: the cost-optimal path from $s$ to the entry POI and the cost-optimal path from the exit POI to $t$. Between entry and exit POI, routing is executed in $G^2_{\mathrm{POI}}$, i.e., following cost-optimal paths between pairs of POIs which have been mined. Note that according to the inherent cost criteria, the result path is unlikely to be optimal. This is a desired effect: We explicitly want to compute paths which trade off between quantitative aspects and qualitative knowledge. As we will show in our experiments, cost-optimal paths cannot reflect qualitative knowledge while paths in enriched graphs and POI graphs can and do.
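The three-stage query described above can be sketched as follows. This is not a reproduction of Algorithm 1: the selection of entry and exit POIs is deferred to [3] in the text, so the sketch simply assumes the POI nearest to the start (by road cost) as entry and the POI from which the target is cheapest to reach as exit. All names (`route_via_pois`, `poi_paths`, and so on) are hypothetical, and the sketch assumes the exit POI is reachable from the entry POI in the POI graph.

```python
import heapq


def dijkstra(graph, source):
    """Return (dist, prev) for all nodes reachable from source."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, c in graph.get(u, {}).items():
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    return dist, prev


def walk(prev, source, target):
    """Rebuild the node list source..target from predecessor links."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def reverse(graph):
    """Flip all edges so that a search from t yields distances *to* t."""
    rev = {}
    for u, nbrs in graph.items():
        for v, c in nbrs.items():
            rev.setdefault(v, {})[u] = c
    return rev


def route_via_pois(road, poi_graph, poi_paths, s, t):
    """road: dict u -> {v: cost}; poi_graph: dict Pi -> {Pj: cost};
    poi_paths: dict (Pi, Pj) -> node list of the referenced road path."""
    pois = set(poi_graph) | {p for nbrs in poi_graph.values() for p in nbrs}
    dist_s, prev_s = dijkstra(road, s)
    dist_t, prev_t = dijkstra(reverse(road), t)
    # Assumption: nearest POIs by road cost serve as entry and exit.
    entry = min((p for p in pois if p in dist_s), key=dist_s.get)
    exit_ = min((p for p in pois if p in dist_t), key=dist_t.get)
    head = walk(prev_s, s, entry)            # s .. entry in the road network
    tail = walk(prev_t, t, exit_)[::-1]      # exit .. t in the road network
    _, prev_p = dijkstra(poi_graph, entry)   # routing inside the POI graph
    hops = walk(prev_p, entry, exit_)
    middle = [entry]
    for a, b in zip(hops, hops[1:]):
        middle += poi_paths[(a, b)][1:]      # expand each POI edge to its road path
    return head + middle[1:] + tail[1:]


road = {"s": {"A": 1.0, "t": 3.5}, "A": {"m": 1.0}, "m": {"B": 1.0},
        "B": {"t": 1.0}, "t": {}}
poi_graph = {"A": {"B": 2.0}, "B": {}}
poi_paths = {("A", "B"): ["A", "m", "B"]}
result = route_via_pois(road, poi_graph, poi_paths, "s", "t")
```

In this toy network the direct edge from s to t costs 3.5, whereas the returned path via both POIs costs 4.0. The result is therefore not cost-optimal, which mirrors the trade-off described in the text.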

Enrichment of sequential occurrences So far we discussed the enrichment of single and pairwise occurrences. The reason is that in order to enrich the network with crowdsourced knowledge from sequential occurrences of POIs, a meta-network is needed. Assume that in one scenario only the two POI pairs $(P_i, P_j)$ and $(P_j, P_k)$ were mined from the data, and in another scenario the triple $(P_i, P_j, P_k)$ was mined from the data. In the former case, users frequently travel from $P_i$ to $P_j$, and from $P_j$ to $P_k$, but do not take the full journey from $P_i$ to $P_k$ (via $P_j$). In the latter case, the full journey was mined as a frequent sequence of POIs, and, thus, $P_j$ might be a simple stopover in-between two popular POIs $P_i$ and $P_k$. When mining pairwise occurrences of POIs, the distinction between these scenarios is not possible. The information that the sequence of all three POIs is popular is lost. We now extend the concept of the POI graph to triples of occurrences, then to arbitrary sequences.

We denote the extension of a POI graph $G^2_{\mathrm{POI}}$ to triples of POIs as $G^3_{\mathrm{POI}}$. Before, POIs which occurred in sufficiently many pairs according to the data formed a node. In $G^3_{\mathrm{POI}}$, each POI forms a node which occurs in sufficiently many pairs or in sufficiently many triples with greater score than its two consecutive pairs. This means we only introduce edges for triples which bear greater score when visited in sequence than when visiting its two pairs separately, i.e., $s(P_i, P_j, P_k) > s(P_i, P_j) + s(P_j, P_k)$. As before, each pair of POIs mined from the data is connected by an edge, also called direct edge for distinction. Additionally, for any triple for which $s(P_i, P_j, P_k) > s(P_i, P_j) + s(P_j, P_k)$ holds, we introduce an indirect edge (or shortcut) in $G^3_{\mathrm{POI}}$ from $P_i$ to $P_k$ holding a reference to the concatenated cost-optimal path from $P_i$ to $P_j$ and from $P_j$ to $P_k$. This is illustrated in Fig. 5, where the doubled lines represent shortcuts visiting triples of POIs and the single lines represent paths between POIs. We refer to the middle node of a shortcut, as seen in Fig. 5a, as an intermediate node. In the following, we denote the direct edge from $P_i$ to $P_j$ by $e_{ij}$ and the indirect edge $(P_i, P_j, P_k)$ by $e_{ijk}$.

The idea of introducing shortcuts to model particularly valuable sequences of POIs is appealing. However, conventional routing algorithms may generate result paths with cycles. This is due to two reasons. First, intermediate nodes are not explicitly visited by a routing algorithm and can therefore not be flagged appropriately. Second, in order to promote the usage of sequences of POIs, we discount shortcuts, which may lead to violations of the triangle inequality, e.g., $c_{ijk} < c_{ij} + c_{jk}$. Before we decide which cycles to avoid and how, we introduce definitions to distinguish different types of cycles.
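The selection of indirect edges described above can be sketched as a simple filter over the mined triples. The function name `select_shortcuts` and the dict layouts are assumptions for illustration. How the cost of a shortcut is discounted is not specified at this point, so the sketch only records the concatenated path that each indirect edge references.

```python
def select_shortcuts(pair_score, triple_score, pair_path):
    """Pick triples whose score exceeds the summed score of their two pairs.

    pair_score:   dict (Pi, Pj) -> s(Pi, Pj)
    triple_score: dict (Pi, Pj, Pk) -> s(Pi, Pj, Pk)
    pair_path:    dict (Pi, Pj) -> node list of the cost-optimal road path
    Returns a dict (Pi, Pj, Pk) -> concatenated node path Pi .. Pj .. Pk.
    """
    shortcuts = {}
    for (pi, pj, pk), s in triple_score.items():
        first, second = (pi, pj), (pj, pk)
        if first not in pair_path or second not in pair_path:
            continue  # no referenced road paths to concatenate
        if s > pair_score.get(first, 0.0) + pair_score.get(second, 0.0):
            # Drop the duplicated intermediate node when concatenating.
            shortcuts[(pi, pj, pk)] = pair_path[first] + pair_path[second][1:]
    return shortcuts


pair_score = {("A", "B"): 0.3, ("B", "C"): 0.2, ("B", "D"): 0.6}
triple_score = {("A", "B", "C"): 0.7, ("A", "B", "D"): 0.8}
pair_path = {("A", "B"): ["A", "x", "B"], ("B", "C"): ["B", "C"],
             ("B", "D"): ["B", "y", "D"]}
shortcuts = select_shortcuts(pair_score, triple_score, pair_path)
```

Only (A, B, C) qualifies here, since 0.7 exceeds 0.3 + 0.2, whereas (A, B, D) with score 0.8 does not exceed 0.3 + 0.6. In both cases B would be the intermediate node of the shortcut.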

Definition 1 In a directed graph, a cycle is a path (oriented and consecutive set of edges) where no node is visited twice except for the start/end node. We distinguish between cycles where the start/end node is a conventional node or an intermediate node (of an indirect edge). We refer to the former as (simple) cycles and to the latter as inter (intermediate) cycles. Furthermore, we distinguish between cycles where the last edge, i.e., the “returning” edge, is a direct edge and where it is an indirect edge. We refer to these cycles as having direct return or indirect return, respectively.

Fig. 5 Illustration of the introduced terminology (a), of cycles with direct and indirect return (b, c), and of inter cycles with direct and indirect return (d, e, f)

The possible occurrences of these cycles are depicted in Fig. 5: Fig. 5b and c show simple cycles with direct and indirect return, respectively. Figure 5d, e and f show inter cycles, where the cycle is closed at a node which is not explicitly visited by the routing algorithm, as it is “hidden” by a shortcut.

In the following, we examine which type of cycle may be part of a result path generated by a cost-minimizing routing algorithm and how this affects the result. We first consider simple cycles, which require no handling at all, as stated by the following lemma.

Lemma 1 Simple cycles cannot occur in result paths.

Proof This lemma is a direct consequence of the Dijkstra property: When visited, every node is reached through the cost-optimal path. Formally, the cost of any path $p = (\dots, e_{i,j}, \dots, e_{i,k}, \dots)$ will never be less than that of $p' = (\dots, e_{i,k}, \dots)$ as all edge costs are non-negative, i.e., $c_{i,k} \leq c_{i,j} + \dots + c_{i,k}$.

Next, consider inter cycles with direct and indirect return, as illustrated in Fig. 5d, e and f. Let us first consider inter cycles with indirect return. Clearly, such cycles imply a detour. Yet, this detour may be compensated by the increased popularity incurred by a valuable sequence of POIs. The gain of visiting the sequence $(P_i, P_j, P_k)$ may justify revisiting $P_j$, reflected in a sufficiently significant score-discounted cost. However, this is only valid for inter cycles with indirect return, not for those with direct return. This is because inter cycles with direct return offer no additional gain for revisiting a POI. Figuratively speaking, if the gain of the sequence has been collected, returning to a previously visited POI bears no gain. Hence, we want to ensure that inter cycles with direct return cannot occur when employing a routing algorithm on a POI graph with triples $G^3_{\mathrm{POI}}$. If $G^3_{\mathrm{POI}}$ fulfills a specific requirement, we can prove this property. The left side of Fig. 6 exemplifies the occurrence and illustrates notation.

Lemma 2 Let $G^3_{\mathrm{POI}}$ be a POI graph for pairs and triples of POIs. If for every indirect edge $(P_i, P_j, P_k)$ the following holds:

$$\delta_{ijk} < c_{jk} + c_{kj} \tag{$\star$}$$

then no result path has inter cycles with direct return. Here, $\delta_{ijk}$ denotes the additional discount of the POI triple over the concatenation of the two pairs, i.e., $\delta_{ijk} := c_{ij} + c_{jk} - c_{ijk}$.

Fig. 6 Illustration of an inter cycle with direct return in a triple of POIs (left). Visualization of both possible inter cycles with direct return in a 4-sequence of POIs (right)


Proof We first prove the statement for cycles of the form $(P_i, P_j, P_k), (P_k, P_j)$. Assume there was an inter cycle with direct return:

$$c_{ijk} + c_{kj} \leq c_{ij}$$

By the property $(\star)$ we have:

$$c_{ij} = c_{ij} + c_{jk} + c_{kj} - c_{jk} - c_{kj} \overset{(\star)}{<} c_{ij} + c_{jk} + c_{kj} - \delta_{ijk} = c_{ijk} + c_{kj}$$

which is a contradiction to the assumption. Hence, there cannot exist an inter cycle with direct return in a result path. For any longer cycles $(P_i, P_j, P_k), \dots, (P_k, P_j)$ the statement holds because edge costs are non-negative.

Note that if one of the POI pairs is not connected, the property is fulfilled trivially. If $(P_i, P_j)$ are not connected, the property is always fulfilled; if $(P_k, P_j)$ are not connected, then no direct return to an intermediate node is possible. It should be noted that in the data set for our experiments with triples of POIs, the property was rarely violated, and enforcing it bears minor computational overhead.
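Since the property of Lemma 2 only involves edge costs, it can be checked for every indirect edge while the POI graph is built. The following is a minimal sketch of such a check; the function name `star_violations` and the dict layouts are hypothetical, and triples with a missing direct edge are skipped, following the remark that the property is then fulfilled trivially.

```python
def star_violations(direct, indirect):
    """List the triples (Pi, Pj, Pk) that violate delta_ijk < c_jk + c_kj.

    direct:   dict (Pa, Pb) -> cost of the direct edge from Pa to Pb
    indirect: dict (Pi, Pj, Pk) -> cost c_ijk of the indirect edge
    """
    violations = []
    for (i, j, k), c_ijk in indirect.items():
        needed = [(i, j), (j, k), (k, j)]
        if any(edge not in direct for edge in needed):
            continue  # a pair is not connected: nothing to check
        delta = direct[(i, j)] + direct[(j, k)] - c_ijk
        if not delta < direct[(j, k)] + direct[(k, j)]:
            violations.append((i, j, k))
    return violations


direct = {("A", "B"): 4.0, ("B", "C"): 3.0, ("C", "B"): 3.0}
ok = star_violations(direct, {("A", "B", "C"): 5.0})    # delta = 2.0 < 6.0
bad = star_violations(direct, {("A", "B", "C"): 0.5})   # delta = 6.5 >= 6.0
```

In the second call the shortcut is discounted so heavily that taking it and then returning directly from C to B (cost 0.5 + 3.0) is cheaper than the direct edge from A to B (cost 4.0). This is exactly the inter cycle with direct return that the lemma rules out.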

In the following, we extend the lemma to sequences of POIs. The problem with sequences of $n$ POIs is the possible direct return to any of its $n - 2$ intermediate nodes. For a sequence of four POIs this is illustrated on the right side of Fig. 6. Lemma 2 forbids the direct return to the last intermediate node, i.e., the $(n - 1)$-st POI. In order to also forbid the other direct returns, we have to formulate similar conditions for all other intermediate nodes. With increasing length of sequences, however, it becomes less likely that cases occur where these conditions have to be enforced. Also, the conditions can be checked when constructing the POI graph, i.e., only once during a preprocessing phase of graph extraction.

For this purpose, we introduce the following new notation. Sequences of POIs will be indexed by numbers instead of letters. Additionally, by $\delta_{r,s}$, $r > s$, we denote the difference in cost between the $s$-prefix of an $r$-sequence and the full $r$-sequence. For instance, for a sequence of four POIs $(P_1, P_2, P_3, P_4)$, $\delta_{4,2} = c_{12} - c_{1234}$ and $\delta_{4,3} = c_{123} - c_{1234}$. Generally, $\delta_{r,s} := c_{1,\dots,s} - c_{1,\dots,r}$. Note that these values are not necessarily positive. If they are, this implies a high discount of the full sequence compared to the prefix sequence. Obviously, since all edges, direct or indirect, have a non-negative cost, the cost of visiting additional POIs increases monotonically, despite any discount. Hence, the greater the interval between $s$ and $r$, the less likely the value is positive. Bearing this in mind, we now extend the above statement inductively. This means that for any POI graph with sequences of POIs up to length $r$, we assume the statement holds for sequences up to length $r - 1$.

Lemma 3 Let $G^r_{\mathrm{POI}}$ be a POI graph for sequences of POIs up to length $r$. If for every indirect edge $(P_1, \dots, P_r)$ the following statements hold

$$\delta_{r,r-1} < c_{r,r-1}, \quad \delta_{r,r-2} < c_{r,r-2}, \quad \dots, \quad \delta_{r,3} < c_{r,3}, \quad \delta_{r,2} < c_{r,2}$$

then no result path has inter cycles with direct return.

Proof As before, it suffices to prove the statement for cycles of the form

$$(P_1, \dots, P_s, \dots, P_r), (P_r, P_s)$$

which directly follows from the above:

$$c_{1,\dots,r} + c_{r,s} > c_{1,\dots,s}$$

Any inter cycle with direct return to POI $P_s$ has greater cost than the shortcut of length $s$, $e_{1,\dots,s}$. As before, the cycle cannot be part of a result path as it has greater cost than the cycle-free counterpart. The argument holds analogously for longer cycles.
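For longer sequences, the conditions of Lemma 3 can be checked per indirect edge in the same preprocessing pass. The sketch below assumes that the prefix costs $c_{1,\dots,s}$ and the costs $c_{r,s}$ of the direct return edges are available as dicts keyed by $s$. This layout and the function name `lemma3_violations` are hypothetical, and missing return edges are skipped because no direct return is possible in that case.

```python
def lemma3_violations(prefix_cost, return_cost):
    """Return the intermediate indices s with delta_{r,s} >= c_{r,s}.

    prefix_cost: dict s -> c_{1,...,s} for s = 2..r (r is the largest key)
    return_cost: dict s -> c_{r,s}, cost of the direct edge from P_r to P_s
    """
    r = max(prefix_cost)
    bad = []
    for s in range(2, r):
        if s not in return_cost:
            continue  # no direct edge back to P_s
        delta = prefix_cost[s] - prefix_cost[r]  # delta_{r,s}
        if not delta < return_cost[s]:
            bad.append(s)
    return bad


# A 4-sequence whose full shortcut (cost 5.0) is cheaper than its 3-prefix (6.0).
prefix_cost = {2: 4.0, 3: 6.0, 4: 5.0}
return_cost = {2: 1.0, 3: 0.5}
bad = lemma3_violations(prefix_cost, return_cost)
```

Here $\delta_{4,2} = -1.0$ satisfies its condition, while $\delta_{4,3} = 1.0$ is not smaller than $c_{4,3} = 0.5$, so the direct return to $P_3$ would have to be handled before the graph is used for routing.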

In the following, we assume POI graphs for triples and longer sequences to fulfill the above properties. Therefore, it is ensured that we do not produce result paths with cycles (with direct return). For routing purposes, we may use Algorithm 1, replacing $G^2_{\mathrm{POI}}$ by $G^3_{\mathrm{POI}}$ or $G^r_{\mathrm{POI}}$ depending on the POI graph at hand. For a given user query, i.e., start and target nodes, close entry and exit POIs are retrieved. From start to entry POI and from exit POI to target, the cost-optimal paths in the underlying road network are computed. Between entry and exit POIs, routing is executed in $G^3_{\mathrm{POI}}$ or $G^r_{\mathrm{POI}}$, i.e., hopping along pairs, triples or longer sequences of POIs. Finally, the three subpaths are concatenated, yielding a path from start to target.

This concludes the section on enrichment of road networks with knowledge mined from crowdsourced data. In the next section, we investigate the whole pipeline experimentally. We examine different data types, different data sets and different means of enrichment.

6 Experimental evaluation

In this section, we evaluate the presented methods experimentally. Finding measures for evaluation, however, is not straightforward. Obviously, if there existed an absolute measure for popularity, quality or value, this work would be obsolete. Nevertheless, first, we deliver a proof of concept that paths generated in networks enriched with crowdsourced knowledge indeed provide a valuable trade-off between quantitative metrics and qualitative information. We compare professional tourist trip recommendations to the paths generated by our proposed approaches when using consecutive Foursquare check-ins as a data source. Second, we examine the influence of parameter $\varphi$ for the enrichment of networks, as presented in Section 5.1. We also examine the actual distance of pairs of POIs perceived as “close” according to the travel blog data. Third, we compare our proposed enriched network (Section 5.1) to our proposed meta-network (Section 5.2). We show that paths generated in our meta-network indeed steer the trade-off between quantitative metrics and qualitative information in the direction of qualitative information. Fourth and finally, we investigate the impact of taking sequences of POIs instead of pairs of POIs into account. We substantiate the claim that with marginal overhead, it is possible to improve results further. Before we go into detail on the results, we briefly explain the setup.

In order to draw comparisons between the different settings, we locate all our experiments in the city of Paris, France, which has high data density for all our sources. For road network data, we use OpenStreetMap, which according to the recent results presented in [50, 51] has particularly high data quality in Europe. The road network extracted from the raw data has about 1M nodes and around 1.8M edges. All language processing was implemented in Python; the modeling of pairwise occurrences was implemented in Matlab. The tasks of network enrichment and path computation were conducted in the Java-based framework MARiO [52] on an Intel(R) Core(TM) i7-3770 CPU at 3.40 GHz and 32 GB RAM running Linux (64 bit).

For all experiments, we rely on three data sources from which we extract different notions of popularity. First, a Flickr data set, provided by the authors of [53], consisting of 14M photos worldwide and over 40K photos in the metropolitan area of Paris, France. Second, a Foursquare data set, provided by the authors of [54, 55], consisting of 33M check-ins by over 250K users at more than 3.5M venues worldwide, of which 7.6K venues are in the area of Paris. Third, the aforementioned textual data set extracted from travel blogs. This textual data set contains 200 significant POIs and 2K occurrences of closeness relations (see Section 4.2) in the area of Paris. Using Twitter's public feed, we obtain around 200K tweets regarding these POIs. From each data source, we may derive different enriched networks and meta-networks. For instance, mining Foursquare check-in data for single occurrences, we obtain $G^1(\mathrm{FSQ})$. Mining the same data set for pairwise occurrences, we may construct the enriched graph $G^2(\mathrm{FSQ})$ as well as the POI graph $G^2_{\mathrm{POI}}(\mathrm{FSQ})$. Each of these networks is utilized to compute paths, and in general result paths are different when computed in different networks. Table 1 gives an overview of which type of (POI) graph is derived from which source and introduces the notation used (FLR, FSQ, TXT). Note that this selection is not exhaustive.

Evidently, the density of the data sets varies considerably. Therefore, we restrict ourselves to single, pairwise and triple occurrences of POIs and omit longer sequences. In order to obtain meaningful results, we empirically determine eleven hotspot regions in Paris where all data sets are particularly dense. A hotspot is a circular region with a 500 meter radius centered such that it contains significantly many data points of all sources. The hotspot centers are visualized in Fig. 7. Together, the eleven hotspots cover over 5K nodes of the underlying road network. For a query, two random nodes are drawn from this set as start and target. The results of each setting are grouped in distance brackets corresponding to the Euclidean distance between start and target. For each setting, 6K runs are executed.
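The query setup described above can be sketched as follows. This is a minimal illustration and not the code used for the experiments: node coordinates are assumed to be planar (x, y) values in meters, the two nodes of a query are assumed to be distinct, and the function names as well as the bracket width of 500 meters are hypothetical.

```python
import math
import random


def sample_query(nodes, rng):
    """Draw two distinct nodes (start, target) from the hotspot node set."""
    start, target = rng.sample(nodes, 2)
    return start, target


def distance_bracket(start, target, width_m=500.0):
    """Index of the Euclidean distance bracket (the bracket width is an assumption)."""
    return int(math.dist(start, target) // width_m)


# Toy node set with planar coordinates in meters (hypothetical data).
nodes = [(0.0, 0.0), (300.0, 400.0), (1200.0, 0.0), (0.0, 2600.0)]
rng = random.Random(42)

# Count how many sampled queries fall into each distance bracket.
runs_per_bracket = {}
for _ in range(100):
    s, t = sample_query(nodes, rng)
    b = distance_bracket(s, t)
    runs_per_bracket[b] = runs_per_bracket.get(b, 0) + 1
```

Grouping by bracket index makes it straightforward to report averages per distance range, as done in the figures of this section.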

Table 1 This table shows which type of enriched graph or POI graph can be derived from the different sources used in this work

Graph                                         G1        G2        G2POI      G3POI
Graph type                                    Enriched  Enriched  POI graph  POI graph
Number of occurrences                         single    pairs     pairs      triples
Flickr (FLR)                                  ✓         ✗         ✗          ✗
Foursquare (FSQ)                              ✓         ✓         ✓          ✓
Travel Blogs + Sentiment Analysis (TXT+SA)    ✓         ✗         ✗          ✗
Travel Blogs + Closeness Relations (TXT+CR)   ✗         ✓         ✓          ✗

Fig. 7 Visualization of the eleven hotspot centers in Paris, France

As a measure for evaluation of crowd popularity, we use Flickr picture density (as, for instance, used in [4, 31, 33, 37]). For a given path p, the Flickr score SFLR corresponds to the summed number of Flickr photos within a 40 meter radius of the course of p. Clearly, the measure is biased towards graphs enriched with Flickr data. However, for lack of an objective measure for quality, we argue that SFLR is a well-established indicator for popularity.
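To make the score SFLR concrete, the following sketch counts the photos within 40 meters of a path given as a polyline. It rests on assumptions not stated in the text: coordinates are planar and given in meters, each photo is counted at most once, and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates in meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def flickr_score(path, photos, radius_m=40.0):
    """Number of photos within radius_m of any path segment (each photo counted once)."""
    segments = list(zip(path, path[1:]))
    return sum(
        1
        for photo in photos
        if any(point_segment_distance(photo, a, b) <= radius_m for a, b in segments)
    )


# Toy path and photo positions (hypothetical data).
path = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
photos = [(50.0, 10.0), (50.0, 60.0), (130.0, 50.0), (100.0, 141.0)]
score = flickr_score(path, photos)  # photos 1 and 3 lie within 40 m of the path
```

For real data, geographic coordinates would first have to be projected to a metric coordinate system, and a spatial index would replace the quadratic scan over photos and segments.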

6.1 Proof of concept

Using crowdsourced data for network enrichment, we propose a scalable approach to compute paths which do not sacrifice quantitative aspects but reflect the qualitative information provided by the crowd. Of course, quality is inherently subjective and will always be subject to discussion. Hence, there cannot be an absolute measure for quality; a fact that impedes evaluation of our methods. Nevertheless, we provide an exemplary proof of concept. We consider tourist trip recommendations as presented in a tourist guide for Paris (footnote 6). Particularly, we consider some of the tours where start and target are distinct. For a given start and target pair we compute the conventional shortest path as a baseline. Additionally, we employ the meta-network G2POI(FSQ), i.e., the POI graph enriched with pairwise Foursquare check-ins. Note that all the examples are located in inner-city Paris where many areas are pedestrian precincts or parks and many streets are one-way. Our road network extracted from OpenStreetMap data, however, is optimized for driving and therefore not applicable here. Thus, we utilized Algorithm 1 to generate a sequence of POIs and determine the shortest paths between these POIs in a pedestrian network. We compare the derived walking paths to the professional tourist trip recommendations.

6 Insight Guides: Explore Paris, https://www.insightguides.com/shop/product/insight-guides-explore-paris/9781786716590, see Google Books for excerpts.

Fig. 8 Comparison of professional tourist trip recommendations (green), conventional shortest paths (purple) and POI sequences provided by Algorithm 1 and connected by walking paths (red). Markers indicate POIs recommended in the tourist guide

The results are displayed in Fig. 8 where POIs are indicated by markers, the green lines correspond to the tourist trips, the purple lines correspond to shortest paths and the red lines correspond to the output of our algorithm as just described. For these cases of popular tourist routes recommended in a professional guide, our approach is capable of returning paths that are very similar. This becomes even more evident when we compare these paths to conventional shortest paths, depicted as purple lines in Fig. 8. A clear example is shown in the top-left panel, where the shortest path completely avoids the (subjectively and qualitatively speaking) beautiful island Ile de la Cite, containing famous touristic attractions such as the bridge Pont Neuf and the cathedral of Notre-Dame. Yet, we still see some differences to the professional tourist routes, which point to limitations of our approach. Many parts of the professional tourist routes involve cycles, for instance going to a dead end and returning. Since our approaches employ shortest path algorithms (albeit on modified networks), such cycles are not possible: in networks with non-negative costs, a cycle always reflects suboptimality.

Summarizing, the examples in Fig. 8 show that our methodology is indeed able to produce paths which greatly resemble professional tourist trip recommendations, but without having to purchase and carry this information. In all examples, the shortest path hardly visits any POIs and cannot compete. The professional tourist trips visit slightly more POIs than our paths but mostly by allowing for cycles in the path. Nevertheless, the POI sequences provided by our algorithm yield competitive paths in terms of touristic appeal. This shows that networks enriched with crowdsourced data are indeed able to reflect underlying metrics as well as the qualitative knowledge provided by the crowd.

6.2 Parameter and closeness evaluation

In the following, we evaluate the influence of the parameter φ. Recall that φ ∈ [0, 1] scales the impact of the scores on the costs of a road network edge. For the same score, a smaller φ will yield greater reduction in cost (cf. Eq. 1). Figure 9a shows the SFLR score results for paths in three different graphs, G1(FLR), G1(FSQ) and G1(TXT+SA) (see Section 5.1). The score results are indicated by the size of the circles and given relative to the SFLR score of the conventional shortest path. As expected, the scores increase for decreasing φ. For φ values lower than 0.5 the result paths become unreasonably long, i.e., they sacrifice their quantitative aspects, which we want to avoid. As mentioned before, the measure SFLR is biased towards graphs enriched with Flickr data. Hence, it is not surprising that G1(FLR) attains the highest scores. While G1(FSQ) scores slightly less, G1(TXT+SA) scores significantly lower. This implies that the notions of popularity induced by the Flickr and Foursquare data sets are more parallel to each other while that of travel blog data is rather orthogonal. Overall, we may conclude that parameter φ serves its purpose as designed. The default value, which has been shown to be most efficient and is used in all other experiments, is φ = 0.7.

Next, we examine the notion of closeness as provided by closeness relations mined from travel blog data. As explained in Section 4.2 (and detailed in the Appendix), we mine travel blogs for pairwise mentions of POIs which are linked by words implying spatial closeness such as “next to”. Training a probabilistic model on the positions of the POIs, we obtain a closeness score for each pair of POIs. We claim that this score reflects a confidence whether a given pair of POIs is close as perceived by the crowd. In particular, a higher closeness score does not imply that POIs are likely to be closer to each other but that the confidence that they are close is higher. We will now substantiate the claim that this notion of closeness is congruent with intuition. Figure 9b displays the actual distance between pairs of POIs considered close as suggested by our method (i.e., closeness score greater than 0). The median of distances is below 2.5 kilometers, 75% are closer than 5 kilometers. Outliers may be attributed to false text mining (e.g., “white is as close to blue as the Eiffel Tower is close to the Statue of Liberty”) or to tourists traveling by bike or car rather than walking and therefore using a different understanding of closeness. For comparison, we also show the distance between two consecutive Foursquare check-ins. It is easily observed that the notion of closeness derived from travel blogs is stricter than that of consecutive check-ins. Hence, our method may, in fact, be applied to mine a representative notion of closeness from noisy textual narrative.
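Distance statistics of the kind reported above (median and upper quartile of the actual distances between POI pairs) can be computed along the following lines. The sketch is illustrative only: the coordinates are approximate positions of three Paris landmarks, the list of pairs is hypothetical and not the mined data, and the great-circle distance is used as a stand-in for whatever distance was measured in the experiments.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_summary(pairs):
    """Median and upper quartile of the pairwise distances."""
    dists = sorted(haversine_km(a, b) for a, b in pairs)
    _, median, upper_quartile = statistics.quantiles(dists, n=4, method="inclusive")
    return median, upper_quartile


# Approximate coordinates (assumed values, for illustration only).
eiffel_tower = (48.8584, 2.2945)
louvre = (48.8606, 2.3376)
notre_dame = (48.8530, 2.3499)

# Hypothetical set of POI pairs with positive closeness score.
close_pairs = [(louvre, notre_dame), (eiffel_tower, louvre), (eiffel_tower, notre_dame)]
median_km, upper_quartile_km = distance_summary(close_pairs)
```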

Fig. 9 Parameter and closeness evaluation. (a) Parameter evaluation for scaling parameter φ in graphs G1(FLR), G1(FSQ), G1(TXT+SA) w.r.t. the measure SFLR. (b) Actual distances between pairwise Foursquare check-ins compared to POI pairs with positive closeness score mined from travel blogs.


6.3 Comparing enriched graphs and POI graphs

In the following, we compare the different graph types presented in Sections 5.1 and 5.2. When mining pairwise occurrences of POIs, we may derive two types of graphs, the enriched graph and the meta-network referred to as POI graph. Consider a pair of POIs Pi, Pj mined with significant score from the data source at hand. In an enriched graph the cost of the shortest path from Pi to Pj will be reduced according to the score of the pair. Hence, applying a conventional shortest path algorithm, the discounted edges along this path are likely to be favored. In the POI graph, Pi and Pj will become nodes connected by an edge. Applying Algorithm 1, the “whole” path between Pi and Pj will be considered as a single edge, forcing the algorithm to actually visit the POIs. Paths generated in POI graphs are therefore expected to perform worse on quantitative metrics and instead earn higher qualitative scores. For pairwise occurrences we rely on two data sources, consecutive Foursquare check-ins and closeness relations in travel blogs. From these sources four graphs are derived, two enriched graphs (G2(FSQ), G2(TXT+CR)) and two POI graphs (G2POI(FSQ), G2POI(TXT+CR)). In analogy to the Flickr score measure SFLR, we define two measures tailored to the data sources FSQ and TXT+CR. The Foursquare score SFSQ is the total number of check-ins at POIs within a 40 meter radius of the course of the given path. The textual score STXT+SA is the total number of positive mentions of all POIs within a 40 meter radius of the route of the given path. SFSQ and STXT+SA are obviously biased towards graphs enriched with Foursquare data and textual data respectively. However, in this section, we compare different graph types enriched with the same data source. Therefore, the bias equally applies to the different graph types.

Figure 10a and b show the results for Foursquare graphs G2(FSQ) and G2POI(FSQ) relative to path lengths and SFSQ scores of conventional shortest paths. Figure 10c and d show the results for textual closeness graphs G2(TXT+CR) and G2POI(TXT+CR) relative to path lengths and STXT+SA scores of conventional shortest paths. In terms of path length, both enriched networks, G2(FSQ) and G2(TXT+CR), create results hardly longer than the shortest paths. Of course, when routing in a POI graph, paths increase considerably in length. Compared to the shortest paths, we observe roughly doubled path lengths in G2POI(FSQ) and G2POI(TXT+CR). Interestingly, the relative increase in length is greater when start and target are closer. The detours which have to be made due to the structure of the POI graph carry less weight as the overall distance grows.

For the marginal increase in length using enriched graphs G2(FSQ) and G2(TXT+CR), results attain considerably higher scores as shown in Fig. 10b and d. SFSQ scores are around 30% higher than those of shortest paths, STXT+SA scores are roughly double those of shortest paths. As expected, the POI graphs intensify this effect. By roughly doubling path length, three (G2POI(FSQ)) to five times (G2POI(TXT+CR)) the scores of shortest paths are attained. Of course, this effect is dependent on parametrization and data density, but the results substantiate the claim that routing in POI graphs yields higher qualitative scores. For an application where quantitative metrics are less important, POI graphs may be employed. This holds, for example, for tourist route recommendation systems where path length is a minor criterion. Also, in the case of sportive routing, e.g., for cyclists and runners, path length might be a limiting factor, but attractiveness of the path itself is usually the primary factor. In contrast, when only a minor increase in length is tolerated, enriched networks yield the better trade-off. For negligible additional path length, considerable gain in score is achieved. This may be beneficial for car navigation systems. For drivers, travel time is usually the highest ranked criterion. However, some drivers might be willing to accept a minor detour for a more scenic trip.


Fig. 10 Path lengths and scores of optimal paths in Foursquare graphs (top) and textual closeness graphs (bottom) relative to shortest path (in G). (a) Path lengths in G2(FSQ) and G2POI(FSQ). (b) SFSQ scores in G2(FSQ) and G2POI(FSQ). (c) Path lengths in G2(TXT+CR) and G2POI(TXT+CR). (d) STXT+SA scores in G2(TXT+CR) and G2POI(TXT+CR).

6.4 Comparing pairs and triples

For our final setting, we examine the benefit of enriching POI graphs with sequential occurrences. We rely on consecutive Foursquare check-ins (of a user during one day), i.e., the POI graphs G2POI(FSQ) and G3POI(FSQ) as defined in Sections 5.1 and 5.2. As mentioned before, we were able to extract just about 28K consecutive check-in pairs by the same user during one day. Extending the sequence of consecutive check-ins to triples, we were able to mine 14K triple occurrences. Combining pairs and triples of POIs, we build a POI graph G3POI(FSQ) which, in addition to the links between POI pairs, holds shortcuts for relevant occurrences of POI triples. Algorithm 1 can be applied to this kind of graph without modification. However, the structure of the graph can lead to cycles during path computation. In Section 5.2, we detailed which kinds of cycles in result paths are problematic. We explained which are to be avoided (indirect cycles with direct return) and how to ensure this by asserting a particular property (recall Lemma 2). Building G3POI(FSQ), we encountered violations of this property in a negligible number of triples (< 100). It can therefore be stated that ensuring the correctness of POI graphs for triples is not at all a limiting constraint.

Fig. 11 Path lengths and scores of optimal paths in graphs G2POI(FSQ) and G3POI(FSQ) relative to shortest path (in G). (a) Path lengths. (b) SFSQ scores.

Figure 11a shows the path lengths, and Fig. 11b shows the Foursquare scores SFSQ. Additionally using triples of POIs has hardly any increasing effect on path lengths. This can be attributed to the presence of edges between pairs as well as triples in G3POI(FSQ). Therefore, if the trade-off between length and score of a triple is suboptimal, the routing algorithm will often choose the path corresponding to only one of the pairs. Thus, the POI graph for triples offers additional knowledge without degrading the knowledge extracted from pairwise occurrences. At the cost of virtually no increase in path length, we are able to attain 15% to 30% higher scores in G3POI(FSQ) compared to G2POI(FSQ). This emphasizes the value of POI sequences for routing purposes. Seeing as the programming overhead is low, we recommend integrating this kind of knowledge whenever it can be extracted from the corresponding data source.

6.5 Runtime evaluation

This section describes the query-time performance of the proposed approaches G1, G2, G2POI, and GrPOI for various distances between path start and destination. As is shown in Fig. 12, overall runtime scales slightly superlinearly, with variations between approaches. All times are given in milliseconds per query.

Fig. 12 Runtime evaluation for different distances between path start and destination

Baseline approaches G1 (orange) and G2 (green) operate as traditional routing on an enriched, pre-processed graph, where Dijkstra’s algorithm was used for path finding. Note that any routing algorithm could be used here instead. Due to the lack of additional overhead, paths for short distances are found in comparatively short time. However, as the distance increases, the runtime of G2 remains relatively stable, whereas G1 takes significantly longer to return results. We argue this is caused by the distribution of enrichment data sources: G1 is based on single-instance Flickr images that have diverse distributions on a local level, but can be found in almost every neighborhood. Hence, a Dijkstra search over longer distances explores large sections of the network where cost has dropped due to enrichment discounts. In contrast, G2 is based on the materialization of POI-pair relationships, which typically affects corridors and subsets of the road network, keeping overall edge cost up. This behaviour can be easily adjusted by fine-tuning enrichment parameters.

Meta-network approaches G2POI (light blue) and GrPOI (dark blue) employ a two-step routing process and therefore take more time to compute results. More specifically, for every connection of two relevant POIs along the path, an entirely new Dijkstra search has to be issued to find this connection in the actual road network. In direct comparison, GrPOI takes slightly longer since it uses more available sequences of POIs that ultimately need to be translated into road paths.
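The source of the additional runtime can be illustrated with a small sketch of such a two-step process. It is a simplification under stated assumptions: a plain Dijkstra search stands in for Algorithm 1 (which is not reproduced here), start and target are assumed to be nodes of the POI graph, and all graphs, costs and function names are hypothetical.

```python
import heapq


def dijkstra(graph, source, target):
    """Shortest path in a dict-of-dicts graph {u: {v: cost}}; returns (cost, node list)."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def two_step_route(poi_graph, road_graph, start, target):
    """Step 1: route in the POI graph. Step 2: expand every POI-to-POI hop in the road network."""
    _, poi_sequence = dijkstra(poi_graph, start, target)
    road_path = [start] if poi_sequence else []
    searches = 0
    for u, v in zip(poi_sequence, poi_sequence[1:]):
        _, leg = dijkstra(road_graph, u, v)  # one additional Dijkstra search per hop
        road_path.extend(leg[1:])
        searches += 1
    return poi_sequence, road_path, searches


# Toy road network and POI graph with discounted meta-edges (hypothetical costs).
road_graph = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"p": 1.0},
    "b": {"t": 1.0},
    "p": {"c": 1.0},
    "c": {"t": 1.0},
    "t": {},
}
poi_graph = {"s": {"p": 0.5, "t": 5.0}, "p": {"t": 0.5}, "t": {}}

poi_sequence, road_path, searches = two_step_route(poi_graph, road_graph, "s", "t")
```

The number of road-network searches grows with the number of hops in the POI sequence, which is consistent with the longer runtimes observed for the meta-network approaches.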

7 Conclusions and outlook

This work covers the topics of knowledge extraction and network enrichment. Our goal was to incorporate qualitative information into routing algorithms in order to bring path computation more in line with human intuition. To achieve this goal, we extracted knowledge from heterogeneous sources of crowdsourced (i.e., user-generated) spatial data, including spatially enriched data, spatio-temporal data and purely textual data. By using this knowledge to enrich existing networks or to construct abstract meta-networks, we map the qualitative preferences of users onto the actual road network. As a result, we are indeed able to compute paths which correspond to the respective sentiment of the crowd, and it is possible to represent varying sentiments using different data sources.

In our extensive overview, we investigate different data types and sources, examine various occurrences of inherent knowledge and present different means for network enrichment. Furthermore, our broad theoretical foundation is substantiated by a number of experiments using real-world datasets with different characteristics. Our results clearly show that incorporating qualitative information into road networks is not only feasible but leads to enhanced solutions for the task of path computation, i.e., paths which reflect qualitative knowledge without sacrificing their quantitative aspect.

Appendix: Mining Spatial Closeness from Textual Data

As presented in [3], we rely on a probabilistic model to counter the ambiguity inherent in textual data. This is done in three steps. First, we create feature vectors for each occurrence of a particular spatial closeness relation. Second, employing the feature vectors we train a Gaussian Mixture Model (GMM) for each kind of closeness relation. Third, we use the derived GMMs to infer a closeness score which can be used to enrich the underlying road network.

For each occurrence of a closeness relation (e.g., “next to”) mined from the text (e.g., the triplet ($P_i$, “next to”, $P_j$)), a spatial feature vector $v_{ij} = (r, \phi)$ is created. $v_{ij}$ describes the Euclidean distance $r$ between $P_i$ and $P_j$ and the orientation $\phi$ as the counterclockwise rotation of the x-axis, centered at $P_j$, to $P_i$. Thus, for any triplet extracted from the corpus (sufficiently often) we obtain a two-dimensional feature vector. We group these vectors by the relation they represent, obtaining a set of vectors $V_{REL_k}$ for each of the closeness relations $REL_k$. For each relation, we propose to train a probabilistic model. Following the argumentation in [43] where similar relations were examined, we propose employing GMMs, a well-proven and extensively studied method for many supervised and unsupervised learning problems [56].
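The construction of the spatial feature vectors and their grouping by relation can be sketched as follows. The sketch assumes planar POI coordinates and reports the orientation as an angle in radians starting at 0; the function name and the example triplets are hypothetical.

```python
import math
from collections import defaultdict


def spatial_feature(p_i, p_j):
    """Return (r, phi): the distance between the POIs and the counterclockwise angle
    from the x-axis, centered at p_j, to p_i (planar coordinates assumed)."""
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    r = math.hypot(dx, dy)
    phi = math.atan2(dy, dx) % (2 * math.pi)  # shift negative angles to positive ones
    return r, phi


# Hypothetical triplets (P_i, relation, P_j) with planar coordinates in meters.
triplets = [
    ((0.0, 100.0), "next to", (0.0, 0.0)),
    ((50.0, 0.0), "next to", (0.0, 0.0)),
    ((0.0, -200.0), "close to", (0.0, 0.0)),
]

# One set of feature vectors per closeness relation.
vectors_by_relation = defaultdict(list)
for p_i, relation, p_j in triplets:
    vectors_by_relation[relation].append(spatial_feature(p_i, p_j))
```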

A GMM is a weighted sum of a fixed number $M$ of Gaussian component densities

$$P_\theta(v) = \sum_{i=1}^{M} w_i \, g(v; \mu_i, \Sigma_i)$$

where $v$ is an $l$-dimensional vector, $w_i$ are the mixture weights (summing to 1) and

$$g(v; \mu_i, \Sigma_i) = \frac{1}{\left((2\pi)^l \det(\Sigma_i)\right)^{1/2}} \exp\left(-\frac{1}{2} (v - \mu_i)^T \Sigma_i^{-1} (v - \mu_i)\right)$$

is an $l$-variate Gaussian density function with mean vector $\mu_i \in \mathbb{R}^l$ and covariance matrix $\Sigma_i \in \mathbb{R}^{l \times l}$. The model is fully characterized by the weights, mean vectors and covariance matrices, collectively represented in $\theta = \{w_i, \mu_i, \Sigma_i\}$, $i = 1, \ldots, M$. In our case, $l = 2$, the dimensionality of the spatial feature vectors $v$. For the parameter estimation of each Gaussian component, Expectation Maximization (EM) [57] is the state-of-the-art technique. It updates the parameters of the components iteratively w.r.t. a given (feature) vector set until a convergence threshold is reached.
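For the two-dimensional case, evaluating such a mixture density is short enough to write out by hand. The sketch below uses the standard bivariate Gaussian density with the inverse covariance matrix in the exponent; it does not perform EM, so the weights, means and covariances are assumed to be already fitted, and the parameter values shown are hypothetical.

```python
import math


def gaussian_2d(v, mu, cov):
    """Bivariate Gaussian density g(v; mu, cov) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = v[0] - mu[0], v[1] - mu[1]
    # Quadratic form (v - mu)^T cov^{-1} (v - mu).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** 2 * det)


def gmm_density(v, weights, means, covs):
    """Weighted sum of the Gaussian component densities."""
    return sum(w * gaussian_2d(v, mu, cov) for w, mu, cov in zip(weights, means, covs))


# Hypothetical two-component model over feature vectors (r, phi).
weights = [0.7, 0.3]
means = [(100.0, 1.0), (400.0, 4.0)]
covs = [((2500.0, 0.0), (0.0, 0.5)), ((10000.0, 0.0), (0.0, 1.0))]
density = gmm_density((120.0, 1.2), weights, means, covs)
```

In practice, a library implementation of EM would be used to fit the parameters; the hand-written density above only serves to make the formula concrete.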

Employing EM, we obtain a set of probabilistic models $P(\cdot \mid \theta_k)$, one for each closeness relation $REL_k \in \{REL_1, \ldots, REL_m\}$ in the set of all closeness relations. For a pair of POIs $P_i$ and $P_j$ with spatial feature vector $v_{ij}$, $P_{\theta_k}(v_{ij})$ is the probability that $P_i$ and $P_j$ stand in spatial relation $REL_k$ to each other. Based on this information, we now derive a closeness score for pairs of POIs by Bayesian inference.

For two distinct POIs, let $REL_{ij}$ denote the set of all closeness relations existing between $P_i$ and $P_j$. Note that $REL_k$ denotes an abstract relation, while $REL_{ij}$ denotes the set of occurrences of relations between a pair of POIs. Furthermore, let $v_{ij}$ denote the spatial feature vector of $P_i$ and $P_j$. In order to determine a numeric score describing the closeness of the POIs, we estimate the posterior probability of all closeness relations $REL_k$ and accumulate them. The posterior probability of relation $REL_k$ is given by

$$P(REL_k \mid v_{ij}) = \frac{P(v_{ij} \mid \theta_k)\, P(\theta_k)}{\sum_{l=1}^{m} P(v_{ij} \mid \theta_l)\, P(\theta_l)}$$

where $P(\theta_k) = P(REL_k)$ denotes the prior probability of relation $REL_k$ given by the trained model represented by $\theta_k$.
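The posterior computation is a direct application of Bayes' rule, as the following sketch shows. The likelihood and prior values are hypothetical, and returning zeros when the evidence vanishes is an assumption made only to keep the example total.

```python
def relation_posteriors(likelihoods, priors):
    """Bayes' rule over m relations: likelihood times prior, normalized by the evidence."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)
    if evidence == 0.0:
        # Assumption: no relation is supported at all, so every posterior is set to 0.
        return [0.0] * len(joint)
    return [j / evidence for j in joint]


# Hypothetical values for m = 3 relations and one feature vector v_ij.
likelihoods = [0.02, 0.01, 0.01]  # P(v_ij | theta_k)
priors = [0.5, 0.3, 0.2]          # P(theta_k)
posteriors = relation_posteriors(likelihoods, priors)
```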

In a traditional classification problem the task would be to choose the closeness relation with the highest posterior and to assign the pair of POIs to this class. We, in contrast, consider each posterior probability $P(REL_k \mid v_{ij})$ as a measure of confidence of the existence of $REL_k$ between $P_i$ and $P_j$. Stressing that all considered relations represent spatial closeness, we combine all posteriors into one measure which we call closeness score $cs_{ij}$ of the pair of POIs $P_i$ and $P_j$:

$$cs_{ij} = \frac{1}{m} \sum_{k=1}^{m} \frac{P(REL_k \mid v_{ij})}{\max\{P(REL_k \mid v_{ij}) \mid \forall\, i \neq j\}}$$

This closeness score can be used to enrich the underlying road network.
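A minimal sketch of the closeness score follows: for every relation, the posterior of a pair is normalized by the largest posterior observed for that relation over all pairs, and the normalized values are averaged over the m relations. The posterior values and POI names are hypothetical, and skipping relations whose maximum is 0 is an assumption to avoid division by zero.

```python
def closeness_scores(posteriors_by_pair):
    """Average, over all relations, of the posterior normalized by its maximum over all pairs."""
    m = len(next(iter(posteriors_by_pair.values())))
    max_per_relation = [
        max(post[k] for post in posteriors_by_pair.values()) for k in range(m)
    ]
    return {
        pair: sum(
            post[k] / max_per_relation[k] for k in range(m) if max_per_relation[k] > 0
        ) / m
        for pair, post in posteriors_by_pair.items()
    }


# Hypothetical posteriors P(REL_k | v_ij) for three POI pairs and m = 3 relations.
posteriors_by_pair = {
    ("A", "B"): [0.6, 0.3, 0.1],
    ("A", "C"): [0.2, 0.3, 0.5],
    ("B", "C"): [0.3, 0.6, 0.1],
}
scores = closeness_scores(posteriors_by_pair)
```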


References

1. Goodchild MF (2007) Citizens as sensors: the world of volunteered geography. GeoJournal 69:211–221

2. Skoumas G, Schmid KA, Josse G, Zufle A, Nascimento MA, Renz M, Pfoser D (2014) Towards knowledge-enriched path computation. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems, Dallas/Fort Worth, TX, USA, November 4-7, 2014, pp 485–488

3. Skoumas G, Schmid KA, Josse G, Schubert M, Nascimento MA, Zufle A, Renz M, Pfoser D (2015) Knowledge-enriched route computation. In: Advances in spatial and temporal databases - 14th international symposium, SSTD 2015, Hong Kong, China, August 26-28, 2015. Proceedings, pp 157–176

4. Josse G, Franzke M, Skoumas G, Zufle A, Nascimento MA, Renz M (2015) A framework for computation of popular paths from crowdsourced data. In: 31st IEEE international conference on data engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015, pp 1428–1431

5. Josse G, Schmid KA, Zufle A, Skoumas G, Schubert M, Pfoser D (2015) Tourismo: A user-preference tourist trip search engine. In: Advances in spatial and temporal databases - 14th international symposium, SSTD 2015, Hong Kong, China, August 26-28, 2015. Proceedings, pp 514–519

6. Lv M, Chen L, Chen G (2012) Discovering personally semantic places from gps trajectories. In: ACM CIKM12, pp 1552–1556

7. Yan Z, Chakraborty D, Parent C, Spaccapietra S, Aberer K (2011) Semitri: A framework for semantic annotation of heterogeneous trajectories. In: Proceedings of the 14th international conference on extending database technology, pp 259–270

8. Palma AT, Bogorny V, Kuijpers B, Alvares LO (2008) A clustering-based approach for discovering interesting places in trajectories. In: Proceedings of the ACM symposium on applied computing, pp 863–868

9. Alvares LO, Bogorny V, Kuijpers B, de Macedo JAF, Moelans B, Vaisman A (2007) A model for enriching trajectories with semantic geographical information. In: Proceedings of the 15th annual ACM international symposium on advances in geographic information systems. ACM, pp 22:1–22:8

10. Parent C, Spaccapietra S, Renso C, Andrienko G, Andrienko N, Bogorny V, Damiani ML, Gkoulalas-Divanis A, Macedo J, Pelekis N, Theodoridis Y, Yan Z (2013) Semantic trajectories modeling and analysis. ACM Comput Surv ’13 45:42:1–42:32

11. Spaccapietra S, Parent C (2011) Adding meaning to your steps. In: Proceedings of the 30th international conference on conceptual modeling, pp 13–31

12. Yan Z, Chakraborty D, Parent C, Spaccapietra S, Aberer K (2013) Semantic trajectories: Mobility data computation and annotation. ACM Trans Intell Syst Technol 4:49:1–49:38

13. Yan Z, Spremic L, Chakraborty D, Parent C, Spaccapietra S, Aberer K (2010) Automatic construction and multi-level visualization of semantic trajectories. In: Proceedings of the 18th international conference on advances in geographic information systems, pp 524–525

14. Feldman D, Sugaya A, Sung C, Rus D (2013) idiary: From gps signals to a text-searchable diary. In: Proceedings of the 11th ACM conference on embedded networked sensor systems. ACM, pp 6:1–6:12

15. Duckham M, Kulik L (2003) Simplest paths: Automated route selection for navigation. In: Proceedings of the international conference on spatial information theory COSIT 2003. Springer, pp 169–185

16. Duckham M, Winter S, Robinson M (2010) Including landmarks in routing instructions. Journal on Location Based Services 4(4):28–52

17. Richter KF, Duckham M (2008) Simplest instructions: Finding easy-to-describe routes for navigation. In: Proceedings of 5th international conference geographic information science 2008, pp 274–289

18. Westphal M, Renz J (2011) Evaluating and minimizing ambiguities in qualitative route instructions. In: Proceedings of the 19th ACM international conference on advances in geographic information systems, pp 171–180

19. Sacharidis D, Bouros P (2013) Routing directions: Keeping it fast and simple. In: Proceedings of the 21st ACM international conference on advances in geographic information systems, pp 164–173

20. Quercia D, Schifanella R, Aiello LM (2014) The shortest path to happiness: Recommending beautiful, quiet, and happy routes in the city. CoRR abs/1407.1031

21. Sun Y, Fan H, Bakillah M, Zipf A (2015) Road-based travel recommendation using geo-tagged images. Comput Environ Urban Syst 53:110–122

22. Runge N, Samsonov P, Degraen D, Schoning J (2016) No more autobahn!: Scenic route generation using googles street view. In: Proceedings of the 21st international conference on intelligent user interfaces. ACM, pp 147–151

23. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105


24. Li F, Cheng D, Hadjieleftheriou M, Kollios G, Teng SH (2005) On trip planning queries in spatialdatabases. In: Proceedings of the 9th international conference on advances in spatial and temporaldatabases. SSTD’05. Springer-Verlag, Berlin, Heidelberg, pp 273–290

25. Kanza Y, Safra E, Sagiv Y, Doytsher Y (2008) Heuristic algorithms for route-search queries overgeographical data. In: ACM SIGSPATIAL GIS, p 11

26. Chen H, KuWS, Sun MT, Zimmermann R (2011) The partial sequenced route query with traveling rulesin road networks. Geoinformatica 15:541–569

27. Gavalas KC, Mastakas K, Pantziou G (2014) A survey on algorithmic approaches for solving tourist tripdesign problems. J Heuristics 20:291–328

28. Garcia A, Arbelaitz O, Linaza MT, Vansteenwegen P, Souffriau W (2010) Personalized tourist routegeneration. Springer

29. Chen C, Zhang D, Guo B, Ma X, Pan G, Wu Z (2015) Tripplanner: personalized trip planning leveragingheterogeneous crowdsourced digital footprints. IEEE Trans Intell Transp Syst 16:1259–1273

30. Hsieh HP, Li CT, Lin SD (2012) Exploiting large-scale check-in data to recommend time-sensitiveroutes. In: Proceedings of the ACM SIGKDD international workshop on urban computing. ACM

31. Brilhante I, Macedo JA, Nardini FM, Perego R, Renso C (2013) Where shall we go today?: Planningtouristic tours with tripbuilder. In: Proceedings of the 22nd ACM international conference on information& knowledge management. CIKM ’13. ACM, New york, NY, USA, pp 757–762

32. Hsieh H, Li C (2014) Mining and planning time-aware routes from check-in data. In: Proceedings of the23rd ACM international conference on conference on information and knowledge management, CIKM2014, Shanghai, China, November 3-7, 2014, pp 481–490

33. De Choudhury M, Feldman M, Amer-Yahia S, Golbandi N, Lempel R, Yu C (2010) Automatic con-struction of travel itineraries using social breadcrumbs. In: Proceedings of the 21st ACM conference onhypertext and hypermedia. HT ’10. ACM, New york, NY, USA, pp 35–44

34. Vansteenwegen P, Souffriau W, Oudheusden DV (2011) The orienteering problem: a survey. Eur J OperRes 209:1–10

35. Souffriau W, Vansteenwegen P, Berghe GV, Oudheusden DV (2011) The planning of cycle trips in theprovince of east flanders. Omega 39:209–213

36. Verbeeck C, Vansteenwegen P, Aghezzaf EH (2014) An extension of the arc orienteering problem andits application to cycle trip planning. Transport Res E-Log 68:64–78

37. Lu Y, Shahabi C (2015) An arc orienteering algorithm to find the most scenic path on a large-scale road network. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems. GIS ’15. ACM, New York, NY, USA, pp 46:1–46:10

38. Hauff C (2013) A study on the accuracy of Flickr’s geotag data. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 1037–1040

39. Zielstra D, Hochmair HH (2013) Positional accuracy analysis of Flickr and Panoramio images for selected world regions. J Spat Sci 58:251–273

40. Zandbergen PA (2008) Positional accuracy of spatial data: Non-normal distributions and a critique of the national standard for spatial data accuracy. Trans GIS 12:103–130

41. Skoumas G, Pfoser D, Kyrillidis A, Sellis T (2016) Location estimation using crowdsourced spatial relations. ACM Trans Spatial Algorithms Syst 2:5:1–5:23

42. Loper E, Bird S (2002) NLTK: The natural language toolkit. In: Proceedings of the ACL-02 workshop on effective tools and methodologies for teaching natural language processing and computational linguistics, vol 1, pp 63–70

43. Skoumas G, Pfoser D, Kyrillidis A (2013) On quantifying qualitative geospatial data: A probabilistic approach. In: Proceedings of the second ACM international workshop on crowdsourced and volunteered geographic information, pp 71–78

44. Schlieder C, Matyas C (2009) Photographing a city: an analysis of place concepts based on spatial choices. Spat Cogn Comput 9:212–228

45. Dijkstra EW (1959) A note on two problems in connexion with graphs. Numer Math 1:269–271

46. Schubert M, Renz M, Kriegel HP (2010) Route skyline queries: a multi-preference path planning approach. In: Proceedings ICDE, pp 261–272

47. Shekelyan M, Jossé G, Schubert M (2015) Paretoprep: Efficient lower bounds for path skylines and fast path computation. SSTD15

48. Shekelyan M, Jossé G, Schubert M, Kriegel H (2014) Linear path skyline computation in bicriteria networks. In: Database systems for advanced applications - 19th international conference, DASFAA 2014, Bali, Indonesia, April 21-24 2014. Proceedings, Part I, pp 173–187

49. Shekelyan M, Jossé G, Schubert M (2015) Linear path skylines in multicriteria networks. In: ICDE15, pp 459–470

50. Neis P, Zielstra D (2014) Recent developments and future trends in volunteered geographic information research: The case of OpenStreetMap. Future Internet 6:76–106

51. Neis P, Zielstra D, Zipf A (2013) Comparison of volunteered geographic information data contributions and community development for selected world regions. Future Internet 5:282–300

52. Graf F, Kriegel HP, Renz M, Schubert M (2011) Mario: multi-attribute routing in open street map. In: Advances in spatial and temporal databases, vol 6849 of Lecture notes in computer science, pp 486–490

53. Mousselly-Sergieh H, Watzinger D, Huber B, Döller M, Egyed-Zsigmond E, Kosch H (2014) Worldwide scale geotagged image dataset for automatic image annotation and reverse geotagging. In: Proceedings of the 5th ACM multimedia systems conference, pp 47–52

54. Yang D, Zhang D, Chen L, Qu B (2015) Nationtelescope: monitoring and visualizing large-scale collective behavior in LBSNs. J Netw Comput Appl 55:170–180

55. Yang D, Zhang D, Qu B (2015) Participatory cultural mapping based on collective behavior in location based social networks. ACM Transactions on Intelligent Systems and Technology. In press

56. Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA

57. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38

Gregor Jossé received his degree in mathematics and his PhD degree in computer science from the Ludwig-Maximilians-Universität Munich, Germany, in 2012 and 2016, respectively. His PhD thesis titled “Efficient Query Processing in Complex Modern Traffic Networks” addressed several theoretical and practical problems occurring in modern traffic networks. Gregor Jossé is currently a researcher at the Database Systems Group of the Ludwig-Maximilians-Universität in Munich and has been a co-author in 13 peer-reviewed publications.

Klaus Arthur Schmid is a postdoctoral researcher at Ludwig-Maximilians-Universität Munich, Germany. He received his doctorate in computer science in 2016 with a thesis entitled “Searching and Mining in Enriched Geo-Spatial Data”. His research is focused on spatio-temporal query processing, spatio-temporal data mining, and uncertain data management, and he has co-authored fifteen publications.

Andreas Züfle is an assistant professor at the Department of Geography and Geoinformation Science at George Mason University. He received his PhD in Computer Science and his Diploma in Statistics and Computer Science from Ludwig-Maximilians-Universität (LMU Munich) in Munich, Germany. Dr. Züfle’s research quest is to bridge the gap between data science and geo-science, two fields working independently on often identical research problems. To bring these communities together, Dr. Züfle has been co-organizing and program-co-chairing the International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich) for the last three years. His research is focused on spatio-temporal query processing, spatio-temporal data mining, and uncertain data management. Since 2011, he has published more than 50 papers in refereed conferences and journals and has an h-index of 14.

Georgios Skoumas received his degree in electronic and computer engineering from the Technical University of Crete in Chania, Greece, the M.Phil. degree in computer science from the University of Cambridge in Cambridge, U.K., in 2010, and the Ph.D. degree in electrical and computer engineering from the National Technical University of Athens in Athens, Greece, in 2015. He was a Research Assistant with Idiap Research Institute in Martigny, Switzerland, in 2011 and a visiting Research Assistant with Ludwig-Maximilians-Universität in Munich, Germany, in 2014. His research interests lie in the areas of data mining and machine learning. Dr. Skoumas was a recipient of a Marie Curie Fellowship in 2011.

Matthias Schubert is a professor at the Ludwig-Maximilians-Universität (LMU Munich) and one of the founders of the Data Science Lab at LMU Munich. He received his doctoral degree in Computer Science in 2004 and finished his habilitation in 2009. According to Google Scholar (as of July 2017) his publications have been cited more than 1,700 times, resulting in an H-index of 22. His research areas are spatial information systems, similarity search and data mining. He has carried out applied research projects in the areas of web crawling, medical imaging, fleet optimization for electric vehicles and game analytics.

Matthias Renz is Associate Professor at the Computational and Data Science Department at the George Mason University (GMU), Fairfax, VA, USA. His main research topics are data science, scientific and spatial databases, data mining and uncertain databases. To date, he has more than 100 peer-reviewed publications that in total received over 1800 citations with an h-index of 19. He served as general chair and program chair for several international conferences including SSTD, ACM SIGSPATIAL, and DASFAA, founded and organized workshops at ACM SIGMOD and ACM SIGSPATIAL, and gave several invited tutorials, seminars, and keynotes.

Dieter Pfoser is an associate professor at the Department of Geography and Geoinformation Science, George Mason University, USA. He received his PhD in Computer Science from Aalborg University, Denmark in 2000. His research interests include spatiotemporal databases, routing algorithms and crowdsourcing geospatial data. He has co-authored over 90 papers in peer-reviewed journals and conferences. He is an editorial board member of Transactions in GIS and an associate editor of Geoinformatica. He has been a PC co-chair of SSTD 2011 and W2GIS 2014, and organized the SIGSPATIAL GEOCROWD Workshops in 2012 and 2013.

Mario A. Nascimento is a Professor at the University of Alberta’s Department of Computing Science and currently serves as the department’s Chair.

Before joining the University of Alberta in 1999, he was a researcher with the Brazilian Agency for Agricultural Research and also an adjunct faculty member with the Institute of Computing of the University of Campinas. Mario has also been a visiting professor at the National University of Singapore’s School of Computing, Aalborg University’s Department of Computer Science, LMU’s Institute for Informatics and at the Federal University of Ceará in Brazil.

His main research interests lie in the areas of spatial and spatiotemporal data management, and according to Google Scholar his publications have been cited ∼3,300 times (∼2,950 according to Semantic Scholar), earning him an H-index of 28. In 2007 he was recognized as a Senior Member of the ACM.

Besides often serving as a program committee member for the main database conferences, and as (co-) chair of several workshops and symposia, Mario has also served as ACM SIGMOD’s Information Director (2002–2005) and ACM SIGMOD Record’s Editor-In-Chief (2005–2007). He is currently a member of the VLDB Journal’s Editorial Board and of the SSTD Endowment’s Board of Directors. He also serves as Chair of the NSERC-Computer Science Liaison Committee and CS-Can/Info-Can representative at CRA’s Board of Directors.

Finally, he finds it rather amusing writing about himself in the third person.

