Multimed Tools Appl
DOI 10.1007/s11042-013-1588-4

A rank aggregation framework for video multimodal geocoding

Lin Tzy Li · Daniel Carlos Guimarães Pedronette · Jurandy Almeida · Otávio A. B. Penatti · Rodrigo Tripodi Calumby · Ricardo da Silva Torres

© Springer Science+Business Media New York 2013

Abstract This paper proposes a rank aggregation framework for video multimodal geocoding. Textual and visual descriptions associated with videos are used to define ranked lists. These ranked lists are later combined, and the resulting ranked list is used to define appropriate locations for videos. An architecture that implements the proposed framework is designed. In this architecture, there are specific modules for each modality (e.g., textual and visual) that can be developed and evolved independently. Another component is a data fusion module responsible for seamlessly combining the ranked lists defined for each modality. We have validated the proposed framework in the context of the MediaEval 2012 Placing Task, whose objective is to automatically assign geographical coordinates to videos. Obtained results show how our multimodal approach improves the geocoding results when compared to methods that rely on a single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields comparable results to the best submissions to the Placing Task in 2012 using no extra information besides the available development/training data. Another contribution of this work is related to the proposal of a new effectiveness evaluation measure. The proposed measure is based on distance scores that summarize how effective a designed/tested approach is, considering its overall result for a test dataset.

L. T. Li (B) · D. C. G. Pedronette · J. Almeida · O. A. B. Penatti · R. T. Calumby · R. da S. Torres
RECOD Lab, Institute of Computing, University of Campinas (UNICAMP), Campinas, SP 13083-852, Brazil
e-mail: [email protected]

D. C. G. Pedronette
e-mail: [email protected]

J. Almeida
e-mail: [email protected]

O. A. B. Penatti
e-mail: [email protected]

R. T. Calumby
e-mail: [email protected]

R. da S. Torres
e-mail: [email protected]

L. T. Li
Telecommunications Res. & Dev. Center, CPqD Foundation, Campinas, SP 13086-902, Brazil

D. C. G. Pedronette
Department of Statistics, Applied Mathematics and Computing, Universidade Estadual Paulista (UNESP), Rio Claro, SP 13506-900, Brazil
e-mail: [email protected]

R. T. Calumby
Department of Exact Sciences, University of Feira de Santana (UEFS), Feira de Santana, BA 44036-900, Brazil


Keywords Video geotagging · Multimodal retrieval · Rank aggregation · Effectiveness measure

1 Introduction

The large amount of geographical entities available on the Web has created great interest in locating them on maps. Geographical information is often enclosed in digital objects (like documents, images, and videos), and using it to support geographical queries is of great interest.

Nowadays, there are many devices with an embedded GPS unit, such as cellphones and cameras, that associate location tags with photos and other published content like Twitter updates, Facebook posts, and other posts on social media. On the Web, tools like Google Maps1 and Google Earth2 are very popular and partially meet the needs of Web users for geospatial information. By using these tools, users can, for example, find an address on a map, look for directions from one place to another, find nearby points of interest (e.g., restaurants, coffee shops, museums), and list the nearby streets. Other common queries usually desired by users include finding documents, videos, and photos that refer to a certain location's vicinity. Additionally, large collections of digital objects can be browsed based on the location to which they are related.

The development of those spatially-aware services (e.g., search and browse), on the other hand, demands that digital objects be geocoded or geotagged, i.e., the location of digital objects in terms of their latitude and longitude needs to be defined in advance. Geocoding is a common expression used in the Geographic Information Retrieval (GIR) community. Other existing designations, like geotagging and georeferencing, usually appear in the multimedia domain [36]. In the Geographic Information System (GIS) area, georeferencing is a term largely used to refer to a given location where something exists, in a physical space, in terms of a coordinate system (i.e., latitude and longitude).

1 http://maps.google.com/ (As of Apr. 2013).
2 http://www.google.com/earth/ (As of Apr. 2013).

Most of the initiatives for video geocoding are based only on textual information [18, 30, 36]. Even some works that claim to be multimodal are in fact reducing the problem to a textual geocoding problem. In the work proposed in [39], for example, other modalities, such as sound/speech, are converted into textual transcripts that are used in text-based geocoding methods.

One problem commonly found in approaches based on textual information is the lack of objectivity and completeness, in the sense that the understanding of the visual content of a multimedia object may change according to the experience and perception of each subject. Other challenges include lexical and geographical ambiguities in recognizing place names [31]. In this scenario, a promising alternative is to use the image/video visual content. The objective is to explore image/video visual properties (such as texture, color, and movement) as alternative cues for geotagging. Furthermore, having multiple (and usually complementary) sources of information for multimedia geocoding also opens the opportunity of using existing fusion approaches to combine them.

In fact, some methods have been proposed to handle the video geocoding problem by exploiting multiple modalities [22, 51]. In these methods, however, the geocoding process consists of ad hoc methods (usually one per modality) that are used in a sequential manner to define the location of videos. In these methods, each modality works as a filter that refines the results of previous steps.

In this article, we propose a rank aggregation framework for video multimodal geocoding. Textual and visual descriptions associated with videos are used to define ranked lists. These ranked lists are later combined, and the resulting ranked list is used to define appropriate locations for videos. An architecture that implements the devised framework is proposed. In this architecture, there are specific modules for each modality (e.g., textual and visual) that can be developed and evolved independently. Another component is a data fusion module responsible for seamlessly combining the ranked lists defined for each modality. To the best of our knowledge, this is the first attempt to address the geocoding problem using this kind of solution.

We have validated the proposed framework in the context of the MediaEval 2012 Placing Task, whose objective is to automatically assign geographical coordinates to videos. Obtained results show that our multimodal approach improves the geocoding results when compared to methods that rely on a single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields comparable results to the best submissions to the Placing Task in 2012 using no extra information besides the available development/training data.

Another contribution of this work is related to the proposal of a new effectiveness evaluation measure. The proposed measure is based on distance scores that summarize how effective a designed/tested approach is, considering its overall result for a test dataset.

The paper is organized as follows. Section 2 presents related work on video geocoding and overviews data fusion work. Section 3 formalizes the video geocoding problem and details our proposed rank aggregation framework for video multimodal geocoding. Next, Section 4 details the architecture that implements this framework, showing how we instantiated each of its modules to extract textual and visual features from videos, besides the proposed strategies to aggregate them into a more effective geocoding approach. Section 5 presents the experiments for validating the proposed architecture, the datasets, the common evaluation system, our new evaluation proposal, the experimental setup, and the results. Finally, Section 6 draws our conclusions and discusses future work.

2 Related work

This section presents an overview of related work dedicated to video geocoding and data fusion.

2.1 Multimedia geocoding

The most common solutions for geocoding multimedia material rely on textual information [30, 36]. Recently, however, more attention has been given to methods that use image/video content in the geocoding process.

Hays and Efros [17] present a strategy to predict a location based on a given image by finding a probability distribution of images over the globe, using a dataset of over 6 million geotagged images (their knowledge base) from all over the world. Unknown images are described by selected image descriptors (e.g., color histograms, GIST) and compared to this large knowledge base. The top-k most similar returned geotagged images are used to estimate the location of a given unknown image. Although this strategy is not precise in finding an exact location most of the time, it indicates roughly where an image was captured. For 16 % of the cases, their method correctly predicted an image location to within a 200 km radius. Extensions of this approach rely only on the use of the text tags associated with the images [49]. Other work on photo geotagging based only on visual content has been proposed in the context of landmark recognition [36]. Kalantidis et al. [19], in turn, propose geotagging non-landmark images using a large geotagged and clustered dataset as a knowledge base.

Research on video geocoding has been done for the Placing Task, which was launched in 2010 as part of MediaEval [30], a benchmarking initiative (a spin-off of VideoCLEF) created to evaluate new algorithms for multimedia access and retrieval. This task aims to automatically assign latitude and longitude coordinates to each of the provided test videos. The most recent approaches for video geocoding were submitted to the Placing Task at MediaEval 2010, 2011, and 2012, as we overview below.

2.1.1 Placing task 2010 and 2011

The submitted approaches can be basically divided into methods based on textual information and methods based on visual information. In this work, we are interested in methods for combining different modalities of video data to improve video geocoding results.

In the 2010 edition of the Placing Task, there were three main approaches, as summarized in [30]: (a) geoparsing and geocoding textual descriptions extracted from available metadata, assisted by a gazetteer of geographic names, such as Geonames;3 (b) propagation of the georeference of a similar video in the development database to the test video; (c) division of the training set into geographical regions determined by clustering or a fixed-size grid, using a model to assign items to each group. The model estimation is based on textual metadata and visual clues. The best result in 2010 for this task was observed for the method proposed by Van Laere et al. [53], which used only metadata of images and videos, combining approaches (b) and (c): first, a language model was employed to identify the most likely area of a video and then the most similar resources from the training set were used to determine its exact coordinates. Just one research team reported one of their results using only visual content in 2010 [21].

3 http://www.geonames.org (As of Apr. 2012).

Kelm et al. also presented a hierarchical approach to geocode videos automatically based on both textual and visual information [20]. The proposed method can be divided into the following steps: (1) geographic boundaries are extracted based on Natural Language Processing (NLP) for toponym recognition and are filtered by Geonames- and Wikipedia-based filters; (2) a textual region model based on a document classification method, which selects regions with a higher probability of being assigned, is employed; (3) a visual model based on the similarity of all frames of a video with regard to the mean feature vectors of the training-set regions (videos and photos) is used. The results are then combined based on their rank sum and, finally, the most similar videos from training data contained in the selected regions are determined and their lat/long are assigned to the test video. In summary, a geographical boundary extraction reduces the number of possible regions in a first stage. Then the textual model returns the log-likelihoods of the remaining regions based on the tags of each test video. Next, the visual model returns the similarities considering the feature vectors of the region model and the test video. Their approach is based on different and well-defined stages, with fusion done at the rank level using the rank sum algorithm.

In 2011, six groups submitted their results, but only four of them submitted a run in which only visual features are used to predict the location of test videos: the ICSI team [6], the WISTUD team [16], the UGENT team [28], and the UNICAMP team [33]. However, most of them considered visual features as a backup predicting approach for the cases in which no tags or textual description associated with a test video are available. Text-based video geocoding still yields better results than the visual-based ones, with UGENT achieving the best results when considering the run in which they were allowed to use additional crawled data.

Choi et al. [6] (ICSI team) use the top-three results of searches based on textual metadata as anchor points for a 1-NN search using visual feature matching (GIST). Each test video (its temporal mid-point frame) is compared to the whole development set (photos and video frames) that is within a 1 km radius of those anchor points. They also considered video acoustic clues in the video geocoding process when the matches of textual- and visual-based results were too low.

Using the 2011 database, Kelm et al. [23] extended their previous work [20], introducing a spatial segmentation at different hierarchy levels with a probabilistic model to determine the most likely location at these levels. The world map was iteratively divided into segments of different sizes, and those spatial segments at each level were used as classes for their probabilistic model. They used additional external resources like GeoNames and Wikipedia for toponym detection when generating hierarchical segments (e.g., national border detection). They combined modalities in a sequential mode: text is used first for geo-prediction; then, in the absence of metadata, the visual approach is applied.

2.1.2 Placing task 2012

In 2012, the best results were accomplished by the CEALIST group [45] using textual and additional external data. Its approach combines a language model that divides the Earth into cells of 1 km and a user model based on tagging probability, which exploits users' past geotagging behavior. To construct the user model, this team downloaded 3,000 geotagged metadata records per user for training purposes. The visual-only submission of the CEALIST team is based on the bag-of-words model with the SURF descriptor. Bags are associated with video frames and are later used to execute 50-NN video searches with the aim of performing spatial clustering within 5 km [45].

The IRISA team [50, 51] approach is based on tag analysis with a fallback system that relies on user information (upload history, social information like friends and their current/prior locations, home town). For the run based on visual properties, the team used their proposed descriptors, which are based on SIFT and VLAD. Those descriptors are used to index training videos using product quantization. The final step of the proposed geocoding approach relies on performing a NN-search on the created index, aiming at generating a list of candidate videos. That list is then used to define a list of coordinates, whose medoid lat/long is assigned to the test video [50]. They used the visual content as one of the last resources to geocode a video in their sequential pipeline.

Extending their approach proposed in the previous year, the Ghent and Cardiff team (Ghent) [29] relied on clustering tags found in photos of the training set. Additional information was crawled and used in the geocoding process. Test videos are then classified into the most probable cluster according to χ2 feature selection. As the 2012 test dataset had a shortage of tags, they used other information (title, description) as tags when tags were not found. As a fallback system, they used a default location (either the user home location or the center of London). Their visual-only solution relied on extracting SIFT features from photos of both the training and test sets. SIFT feature vectors associated with frames are then compared to find the most similar training photos. The runs in which textual and visual features were used together did not improve the results of their approach that relies on videos' metadata.

The ICSI team [5] proposed an approach based on a graph model created by using textual tags to infer the location of a given test video. For the 1 km precision level, the proposed graph model was not able to outperform the results obtained by their previous year's approach [6] combined with gazetteers. For other precision levels, however, its graph model yields better results. The visual-only submission of this team was based on GIST features.

The TUD team [35] presented an exploratory study using only visual features associated with regions defined by partitioning the Earth based on different external resources, such as climate and biome data. Their best results (visual) were those in which the world was divided into regions based on biome data, over which the training photos were distributed and then clustered into subregions. They used the visual features provided by the organizers of the Placing Task 2012 [46].

In 2012, Kelm et al. represented the TUB team [22]. They tackled the geocoding task as a classification problem that considers different hierarchies or spatial segments, as explained in [23] and discussed previously in Section 2.1.1. Their visual-only approach uses visual features extracted from 3.2 million images and from video keyframes of the development/training set. Based on their spatial segmentation at different levels, a k-d tree is built iteratively for different image descriptors and segmentation levels. The most similar spatial segment is determined by traversing the created k-d tree, using the Euclidean norm [22]. Their best results were accomplished using additional resources.

In our work, we do not use any additional or external resource (e.g., gazetteers, further crawling, etc.) and make only minimal use of the ≈3.2M Flickr image dataset. We apply a late fusion approach to combine video features. A rank aggregation method combines scores generated by various features (from different modalities, e.g., textual and visual). Therefore, the features are homogeneously and seamlessly combined, representing an important advantage of our approach. Additionally, other new features can be easily added to the fusion step, and this approach opens new research opportunities related to the development and use of rank fusion methods in video geocoding tasks.

Regarding our previous publications, two of them [32, 33] are the two-page working notes for the 2011 and 2012 Placing Task. They introduce our geocoding method and discuss preliminary results, while reference [34] is a four-page short paper that briefly reports the results of our geocoding method for the 2011 dataset. This journal paper reports results for the most current (2012) dataset and additionally presents, explains, and discusses in depth additional experiments conducted to validate our approach. The current version also has an extensive overview of related work and a formalized framework proposal. Another contribution of this work is the proposal of a new effectiveness evaluation measure for geocoding tasks.

2.2 Data fusion

Rank aggregation methods combine scores/rankings generated by different features (from different modalities) to obtain a more accurate one. In many situations, rank aggregation has been seen as a way of obtaining a consensus ranking when multiple ranked lists are provided for a set of objects. In multimedia retrieval tasks, an accurate information fusion of the different modalities is essential for the system's overall effectiveness performance [26]. The main reasoning behind information fusion systems is based on the conjecture [27] that, by combining features, it is possible to achieve a more precise representation of the data being analyzed.

Given the wide range of applications, information fusion has established itself as an independent research area over the last decades [26]. Most of the approaches fall into three broad categories: early fusion, late fusion, and transmedia fusion. The early fusion approach consists in representing the multimedia objects in a multimodal feature space designed via a joint model that attempts to map the different features onto each other. On the contrary, late fusion and transmedia fusion strategies consider the different features independently. Late fusion techniques mainly consist in merging the similarity information encoded by each single modality by means of aggregation functions. In transmedia approaches, one of the modalities is used to obtain relevant objects. Next, the retrieval system exploits the other available modalities with the aim of improving their results [7].

A common approach for performing information fusion tasks in multimedia retrieval systems consists in using rank aggregation methods. Different modalities, or even assorted descriptors of the same modality, may produce different rankings (or similarity scores). Thus, these distinct views of the same data may provide different but complementary information about the multimedia objects. The best combinations occur when all systems being combined have good effectiveness performance, although it is possible to get improvements when only one of the systems is effective. This observation is consistent with the statement that the combinations with the lowest error rate are those whose inputs are independent and non-correlated [10].

More formally, rank aggregation can be seen as the task of finding a permutation that minimizes the Kendall-tau distance to the input rankings. The Kendall-tau distance can be defined as the sum, over all input rankings, of the number of pairs of elements that are in a different order in the input ranking than in the output ranking. If the input rankings are permutations, this problem is known as the Kemeny rank aggregation problem [47].

Rank aggregation methods have been exploited for a large number of multimedia applications, since there has been an explosion of this type of digital content in recent years [7]. Unsupervised approaches considering similarity scores [14] or rank positions [8] have been used. Score-based rank aggregation approaches are very common, being applied in different scenarios, from multimodal image retrieval [42] to the association of photos with georeferenced textual documents [4].

Different strategies have been used, considering mainly: (i) the score computed for an item in a ranked list; and (ii) the position (or rank) assigned to an item in a ranked list. The CombSum and CombMNZ algorithms [14], for example, consider the sum of the normalized relevance scores computed by different systems to compute a new relevance score. The Borda count method [8] uses rank information in voting procedures. Rank scores are assigned linearly to documents in ranked lists according to their positions and are summed directly. Although very simple and of linear complexity, these approaches have been used as baselines for many works over the decades. Another traditional approach used for rank aggregation tasks is the Condorcet criterion. The Condorcet voting algorithm defines that the winner of the election is the candidate that beats or ties with every other candidate in pairwise comparisons. In other words, given a distance between two ranked lists defined as the number of pairs whose elements are ranked in reverse order, the Condorcet result is the one that minimizes the total distance [11]. Another common approach is based on Markov chains. Items of the various lists are represented as nodes in a graph, with transition probabilities from node to node defined by the relative rankings of the items in the various lists. The aggregate rankings are computed from a stationary distribution of the Markov chain, by determining which nodes would be visited most often in a random walk on the graph [48].
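To make the Markov chain formulation concrete, the sketch below (in Python, our own illustration of one common variant rather than the exact construction of [48]) builds a transition matrix in which the probability of moving from item i to item j grows with the number of input ranked lists that rank j above i, and approximates the stationary distribution by power iteration.

# Minimal, illustrative sketch of Markov-chain rank aggregation (one common variant).
# Items are graph nodes; transitions favor items ranked above the current one.
def markov_chain_aggregate(ranked_lists, damping=0.15, iterations=100):
    items = sorted({v for lst in ranked_lists for v in lst})
    index = {v: k for k, v in enumerate(items)}
    n = len(items)

    def position(lst, v):
        # Rank of v in lst; items absent from a list get the worst possible rank.
        return lst.index(v) if v in lst else len(lst)

    # Transition weights: from item i, prefer items ranked above i in many lists.
    matrix = [[0.0] * n for _ in range(n)]
    for i in items:
        for j in items:
            if i != j:
                wins = sum(1 for lst in ranked_lists if position(lst, j) < position(lst, i))
                matrix[index[i]][index[j]] = float(wins)

    # Row-normalize; an item that is never beaten keeps a self-loop.
    for r in range(n):
        total = sum(matrix[r])
        if total == 0.0:
            matrix[r][r] = 1.0
            total = 1.0
        matrix[r] = [w / total for w in matrix[r]]

    # Power iteration (with damping) to approximate the stationary distribution.
    prob = [1.0 / n] * n
    for _ in range(iterations):
        prob = [damping / n + (1.0 - damping) *
                sum(prob[r] * matrix[r][c] for r in range(n)) for c in range(n)]

    # Nodes visited more often in the random walk are ranked first.
    return sorted(items, key=lambda v: -prob[index[v]])

# Example: aggregating three ranked lists over four videos.
print(markov_chain_aggregate([["a", "b", "c", "d"],
                              ["b", "a", "c", "d"],
                              ["a", "c", "b", "d"]]))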

Taking the traditional initial methods as a starting point, many variations have been proposed, and rank aggregation approaches have remained in constant evolution. Although still following the unsupervised strategy, initial rank aggregation approaches have evolved into more sophisticated algorithms [11, 25, 38]. At the same time, however, even simple new approaches have been proposed with good effectiveness results [9]. Reciprocal Rank Fusion (RRF) [9], for example, is a simple method for combining the document rankings from multiple IR systems. RRF sorts the documents according to a naïve scoring formula, and the reasoning behind it is that, while highly ranked documents are more important, the importance of lower-ranked documents does not vanish.

In this paper, we use an unsupervised score-based rank aggregation approach for combining ranked lists defined by features of different modalities to improve geocoding results. It is inspired by works that successfully combined textual and visual evidence to improve multimedia retrieval [7, 56]. To the best of our knowledge, the use of rank aggregation methods in video geocoding tasks has not been investigated in the literature yet.

3 Proposed framework for video multimodal geocoding

This section presents the proposed framework for video multimodal geocoding. Section 3.1 formalizes the proposed geocoding process, while Section 3.2 presents an architecture that has been implemented to validate the proposed framework.

3.1 Formalization

Let Cdev = {v1, v2, ..., v|Cdev|} be a video collection named development set, such that each video vi ∈ Cdev has its location (xvi, yvi) defined. Let Ctest = {v1, v2, ..., v|Ctest|} be a video collection named test set, such that the location (xvq, yvq) of vq ∈ Ctest is unknown.

The objective of the geocoding process is to assign a proper location to videos vq ∈ Ctest given the known locations available in the development set, i.e., the development set is used as a knowledge base. Our solution to this problem exploits a multimodal video retrieval paradigm in which the location of a test video is determined according to its distance to videos in the development set.

Let D = {D1, D2, ..., D|D|} be a set of video descriptors, such that each video descriptor Dk ∈ D defines a distance function ρ : Ctest × Cdev → R, where R denotes the real numbers. Consider ρ(x, y) ≥ 0 for all (x, y) and ρ(x, y) = 0 if x = y. The distance ρ(vq, vi) among all videos vq ∈ Ctest, vi ∈ Cdev can be computed to obtain a |Ctest| × |Cdev| distance matrix A.

The proposed framework is multimodal if we assume that the video descriptors in D define distance functions that exploit different modalities (e.g., visual properties, textual descriptions). Examples of different video descriptors are presented in Section 4.

Given a query video vq ∈ Ctest, we can compute a ranked list Rq in response to the query by taking into account the distance matrix A. The ranked list Rq = {v1, v2, ..., v|Cdev|} can be defined as a permutation of the collection Cdev, such that, if vi is ranked at a lower position than vj, i.e., vi is ranked before vj, then ρ(vq, vi) < ρ(vq, vj). In this way, videos of the development set are ranked according to their distance to the query video vq. Note that the proposed formalism can be easily extended to deal with similarity scores, as they can be defined in terms of distance functions.

We can also take each video descriptor Dk ∈ D in order to obtain a set Rvq = {R1, R2, ..., R|D|} of ranked lists for the query video vq.

The geocoding function G : R → R^2 is used to define the location of a query video vq, given its ranked lists Rvq:

(x_{v_q}, y_{v_q}) = G(R_{v_q})    (1)

The implementation of G requires the use of an appropriate rank aggregation method to combine the ranked lists defined in Rvq, as well as a strategy to define a location given the final ranked list generated. The rank aggregation methods evaluated in our experiments are presented in Section 4.3. The assigned location, in turn, is defined, in our current implementation, in terms of the top-ranked video vi ∈ Cdev in the final ranked list.
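As a minimal Python sketch of this pipeline (the function and variable names below are ours and purely illustrative), G can be implemented by aggregating the per-descriptor ranked lists and transferring the known location of the top-ranked development video:

# Illustrative sketch: geocode a test video from its per-descriptor ranked lists.
# Each ranked list is a sequence of development-video ids, best match first.
def geocode(ranked_lists, dev_locations, aggregate):
    # aggregate: any rank aggregation function that merges the ranked lists into a
    # single ranked list (e.g., one of the methods described in Section 4.3).
    final_ranking = aggregate(ranked_lists)
    top_video = final_ranking[0]           # top-ranked development video
    return dev_locations[top_video]        # transfer its known (lat, long)

# A trivial aggregation used only to make the example runnable:
def best_position_aggregate(ranked_lists):
    best = {}
    for lst in ranked_lists:
        for pos, vid in enumerate(lst):
            best[vid] = min(best.get(vid, pos), pos)
    return sorted(best, key=best.get)

dev_locations = {"v1": (-22.81, -47.07), "v2": (48.86, 2.35)}
textual_list = ["v1", "v2"]                # ranked list from a textual descriptor
visual_list = ["v2", "v1"]                 # ranked list from a visual descriptor
print(geocode([textual_list, visual_list], dev_locations, best_position_aggregate))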

3.2 Proposed architecture

The proposed architecture for video multimodal geocoding combines the video visual and textual descriptions, defined in terms of descriptors. It is composed of three modules (Fig. 1):

1. Text-based geocoding: it is responsible for all text processing and uses GIR geocoding techniques to predict a location based on the available textual metadata;

2. Content-based geocoding: this module predicts a location based on the visual similarity of the test images/videos with regard to the knowledge database (available training dataset); and

3. Data fusion: this module combines the geocoding results generated by the previous modules and computes the final geocoding result. The idea is to rely on both textual and visual descriptions whenever possible.

The modules of this architecture can be developed and evolved independently, and later their individual results can be combined seamlessly by the fusion module. The idea behind this is that the better the results provided by the individual modules, the better the final combined results should be.

Fig. 1 Proposed architecture for video multimodal geocoding

In this work, the modules of the proposed architecture were implemented based on existing content-based and textual information retrieval methods. In fact, the architecture can handle as many different modalities as desired. The final result is a combination of the results from each modality, which is treated by a data fusion module that takes advantage of rank aggregation methods.

In the following section, we present how each module of the proposed architecture has been implemented.

4 Architecture implementation

This section describes the components of the proposed architecture that are considered in our experiments.

4.1 Text retrieval & GIR

For processing the available textual information of videos, we propose the use of (1) Geographic Information Retrieval (GIR) techniques to recognize and associate a location with digital objects based on their textual content; and (2) classical Information Retrieval (IR) matching functions to retrieve similar digital objects.

In the context of the Placing Task, we exploit the following metadata associated with videos: title, description, and keywords. In our current implementation, text processing is based on classical IR text matching using the vector space model and traditional similarity functions [37]: cosine, bag-of-words (normalized document term intersection), dice, okapi, and tf-idf-sum.

Let C be a collection of documents with t distinct index terms tj. According to the vector space model, a document di is represented as a vector di = (wi1, wi2, ..., wit), where wij is the weight of the term tj in the document di. The term weights for a document are often calculated as the tf × idf value, where tf is the term frequency and idf is the inverse document frequency of the term in the collection. The idf value is calculated as log(N/nt), where N is the number of documents in the collection and nt is the number of documents that have at least one occurrence of the term tj.
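As a small illustration of this weighting scheme (the toy collection below is hypothetical and unrelated to the Placing Task data):

import math
from collections import Counter

# Toy collection: each "document" is the textual metadata of one video.
docs = ["beach sunset rio", "rio carnival samba", "mountain sunset hike"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# n_t: number of documents containing term t (document frequency).
df = Counter(t for terms in tokenized for t in set(terms))

# w_ij = tf x idf, with idf = log(N / n_t).
weights = []
for terms in tokenized:
    tf = Counter(terms)
    weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})

print(weights[0])   # tf-idf weights of the first document's terms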

Several similarity functions have been proposed in the literature to compare two vectors. Examples of widely used functions include cosine, bow, okapi, dice, and tf-idf-sum. Those functions, used in our framework, are defined next.

cosine(d_1, d_2) = \frac{\sum_{i=1}^{t} w_{1i} \times w_{2i}}{\sqrt{\sum_{i=1}^{t} w_{1i}^2 \times \sum_{i=1}^{t} w_{2i}^2}}    (2)

where wij is the term weight as previously defined. Equation (2) basically calculates the cosine between the vectors of each document. The closer the cosine is to 1, the more similar the documents are.

bow(d_1, d_2) = \frac{|\{d_1\} \cap \{d_2\}|}{|d_1 + d_2|}    (3)

where {di} is the set of terms that occur in the document di. This is a simple measure of the percentage of common words between two documents.

dice(c, d) = \frac{2 \times |c \cap d|}{|c| + |d|}    (4)

The dice equation measures the similarity of a document d with regard to a query c based on the number of common terms in relation to the total number of terms in both the document and the query.

okapi(d_1, d_2) = \sum_{t \in d_1 \cap d_2} \frac{3 + tf_{d_2}}{0.5 + 1.5 \times \frac{tam_{d_2}}{tam_{med}} + tf_{d_2}} \times \log\frac{N - df + 0.5}{df + 0.5} \times tf_{d_1}    (5)

where tf is the term frequency in the document, df is the term frequency in the collection, N is the number of documents in the collection, tam_{d_i} is the size of document i, and tam_{med} is the average document size in the collection.

Given a query c with n terms ti, the tf-idf-sum is given by the sum of the tf-idf values for each query term in relation to the document d of the collection:

tf\text{-}idf\text{-}sum(c, d) = \sum_{i=1}^{n} tf(t_i, d) \times idf(t_i)    (6)
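The sketch below illustrates how two of these functions can be computed from such weight vectors (a simplified illustration; our actual implementation relies on Apache Lucene, as described in Section 5.3):

import math

# d1, d2: dicts mapping terms to their tf x idf weights (as in the toy example above).
def cosine(d1, d2):
    num = sum(d1[t] * d2[t] for t in set(d1) & set(d2))
    den = math.sqrt(sum(w * w for w in d1.values()) * sum(w * w for w in d2.values()))
    return num / den if den else 0.0

# tf-idf-sum of a query against a document: sum of tf(t, d) * idf(t) over the query terms.
def tf_idf_sum(query_terms, doc_tf, idf):
    return sum(doc_tf.get(t, 0) * idf.get(t, 0.0) for t in query_terms)

# Example with made-up weights, for illustration only:
d1 = {"beach": 1.10, "sunset": 0.41, "rio": 0.41}
d2 = {"rio": 0.41, "carnival": 1.10, "samba": 1.10}
print(cosine(d1, d2))
print(tf_idf_sum(["rio", "sunset"], {"rio": 1, "carnival": 1, "samba": 1},
                 {"rio": 0.41, "sunset": 0.41}))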

4.2 Visual information retrieval

To encode video visual properties, we have used two main approaches. One is based on video frames and does not consider transitions between them, which is called bag-of-scenes [43]. The other approach specifically encodes motion information by using a histogram of motion patterns [1].

4.2.1 Bag-of-Scenes (BoS)

One of our approaches to encode video visual properties is based on a dictionary of scenes [43]. The main motivation for using the bag-of-scenes model is that video frames can be considered as a set of pictures from places. Pictures may contain important information regarding place location. Therefore, if we have a dictionary of pictures from places (scenes), we can assign each video frame to the most similar pictures of the dictionary. The final video representation will then be a place activation vector, making it representative for the geocoding task.

An important advantage of the bag-of-scenes model is that the dictionary is composed of visual words carrying more semantic information than the traditional dictionaries based on local descriptions. In the dictionary of scenes, each visual word is associated with pictures of a place [43]. A consequence of this property is that the bag-of-scenes feature space has one dimension for each semantic concept, making it easier to detect the presence or absence of the concept in the video feature vector.

The process of creating a dictionary of scenes is similar to the one used to create a dictionary of local descriptions. The main difference is that the feature vectors represent whole images and not local patches. Practically speaking, instead of quantizing the SIFT space, we quantize the bag-of-words space, for example. Thus, each visual word is an image feature vector and not a local patch feature vector.

After creating the dictionary of scenes, the steps to represent a video are the same as those employed when a dictionary of local descriptions is used to represent an image. In the former, a video is a set of frames. In the latter, an image is a set of local patches. Therefore, we use popular approaches, like hard and soft assignment [52], to assign a video frame to the scenes in the dictionary. Next, we apply pooling strategies, like average and max pooling [3], to summarize the assignments and create the video feature vector, which is called the bag-of-scenes. Comparisons between two bag-of-scenes vectors are performed using the Euclidean distance function.
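A minimal sketch of this representation step is given below, assuming hard assignment with max pooling and precomputed frame features (e.g., 144-dimensional CEDD vectors); the shapes and values are illustrative only.

import numpy as np

def bag_of_scenes(frame_features, scene_dictionary):
    # frame_features: (n_frames, d) array, one feature vector per video frame.
    # scene_dictionary: (n_scenes, d) array, one feature vector per scene (visual word).
    activation = np.zeros(scene_dictionary.shape[0])
    for frame in frame_features:
        # Hard assignment: activate the closest scene (smallest Euclidean distance).
        distances = np.linalg.norm(scene_dictionary - frame, axis=1)
        activation[np.argmin(distances)] = 1.0   # max pooling over binary assignments
    return activation

# Videos are then compared by the Euclidean distance between their BoS vectors.
dictionary = np.random.rand(50, 144)             # e.g., 50 scenes, 144-d CEDD features
video_a = bag_of_scenes(np.random.rand(20, 144), dictionary)
video_b = bag_of_scenes(np.random.rand(15, 144), dictionary)
print(np.linalg.norm(video_a - video_b))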

In [43], the bag-of-scenes model is evaluated considering two possibilities to create the dictionary. One uses the video frames of the training set as scenes (BoF) and the other uses an external image dataset (BoS). The results for both representations are very similar. We refer the reader to [43] for details concerning the evaluation of different parameters of the bag-of-scenes model, like the dictionary size, the coding, and the pooling strategies, as well as the use of different low-level descriptors to represent each video frame.

4.2.2 Histogram of Motion Patterns (HMP)

Besides encoding visual properties using a dictionary of scenes from places of interest, we also adopted a simple and fast algorithm to compare video sequences, described in [1]. It consists of three main steps: (1) partial decoding; (2) feature extraction; and (3) signature generation.

For each frame of an input video, motion features are extracted from the video stream. For that, 2 × 2 ordinal matrices are obtained by ranking the intensity values of the four luminance (Y) blocks of each macroblock. This strategy is employed for computing both the spatial feature of the 4 blocks of a macroblock and the temporal feature of corresponding blocks in three frames (previous, current, and next). Each possible combination of the ordinal measures is treated as an individual 16-bit pattern (i.e., 2 bits for each element of the ordinal matrices). Finally, the spatio-temporal patterns of all the macroblocks of the video sequence are accumulated to form a normalized histogram. For a detailed discussion of this procedure, refer to [1].

The comparison of histograms can be performed by any vectorial distance function, like the Manhattan (L1) or Euclidean (L2) distances. In this work, we compare video sequences by using the histogram intersection, which is defined as

d(H_{V_1}, H_{V_2}) = \frac{\sum_i \min(H^i_{V_1}, H^i_{V_2})}{\sum_i H^i_{V_1}},

where H_{V_1} and H_{V_2} are the histograms extracted from the videos V_1 and V_2, respectively. This function returns a real value ranging from 0, for situations in which those histograms are not similar at all, to 1 when they are identical.
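A direct sketch of this comparison, assuming the HMP signatures have already been extracted and normalized as described above:

# Histogram intersection between two HMP signatures (dicts: 16-bit pattern -> frequency).
def histogram_intersection(h1, h2):
    # Both histograms are assumed normalized, so the sum of h1's bins equals 1.
    patterns = set(h1) | set(h2)
    return sum(min(h1.get(p, 0.0), h2.get(p, 0.0)) for p in patterns) / sum(h1.values())

# Example with two toy signatures: identical histograms would yield 1.0.
hv1 = {0x1B2D: 0.6, 0x2D1B: 0.4}
hv2 = {0x1B2D: 0.5, 0x3C3C: 0.5}
print(histogram_intersection(hv1, hv2))   # 0.5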

4.3 Data fusion

The data fusion module aims at combining the similarity scores of different modalities, producing a more accurate one. Given a query video (whose location is unknown), it is compared with all videos of the knowledge dataset (training set), considering different features associated with different modalities. Each feature, in turn, produces a different score. The goal of the data fusion module is to combine the scores produced by features of different modalities in order to produce a more effective score. In this work, we evaluated three rank aggregation methods in the video geocoding task.

The first one is based on a multiplicative combination of scores, initially proposed in [42] for multimodal image retrieval. The method was evaluated in several image retrieval tasks related to the combination of image descriptors and the combination of visual and textual descriptors. That experimental evaluation considered fifteen visual descriptors (covering shape, color, and texture) and six textual descriptors, with good results.

Let vq be a query video that is compared to another video vi in the dataset. Let sim(vq, vi) be a function defined in the interval [0, 1] that computes a similarity score between the videos vq and vi, where 1 denotes perfect similarity. Let S = {sim1, sim2, ..., simm} be a set of m similarity functions defined for the different features considered. The new aggregated score sima is computed by multiplying the individual feature scores as follows:

sim_a(v_q, v_i) = \frac{\sqrt[m]{\prod_{k=1}^{m} \left( sim_k(v_q, v_i) + 1 \right)}}{m}    (7)

By multiplying the different similarity scores, high scores obtained by one feature are propagated to the others, leading to high aggregated values. The reasoning behind the multiplication approach is inspired by Naïve Bayes classifiers [40, 42, 55]. In a general way, Naïve Bayes classifiers work based on the probability of an instance belonging to a class, given a set of features and assuming conditional independence among features. In a simplified manner, the classifier assumes that the presence of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Under the independence assumption, the probabilities of each feature belonging to a given class are multiplied. In this case, as an analogy, the proposed multiplication approach can be seen as the computation of the probability of videos vq and vi being similar, considering independent features.
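A sketch of (7) for a single pair of videos, in our notation and assuming the individual scores are already normalized to [0, 1]:

# Multiplicative aggregation of m similarity scores (Eq. 7), illustrative sketch.
def multiplicative_aggregation(scores):
    # scores: one similarity value in [0, 1] per feature/modality.
    m = len(scores)
    product = 1.0
    for s in scores:
        product *= (s + 1.0)   # the +1 shift keeps a zero score from nullifying the rest
    return (product ** (1.0 / m)) / m   # m-th root of the product, divided by m

# Example: one textual score and two visual scores for the same candidate video.
print(multiplicative_aggregation([0.9, 0.4, 0.6]))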

We also evaluated the traditional Borda [54] approach and the recently proposed Reciprocal Rank Fusion (RRF) [9]. Both methods consider the rank information, i.e., the positions of items in the ranked lists produced by different descriptors.

Let D = {D1, D2, ..., Dm} be a set of descriptors and let vq be a query video. For each descriptor Dj ∈ D, we can compute a different ranked list τq,Dj for the query video vq. A given video vi is ranked at different positions (defined by τq,Dj(i)) according to each descriptor Dj ∈ D. The objective is to use these different rank data to compute a new distance between videos vq and vi.

The Borda [54] method considers the rank information directly for computing the new distance FBorda(q, i) between videos vq and vi. Specifically, the distance is scored by the number of videos not ranked higher than it in the different ranked lists [24]. The new distance can be computed as follows:

F_{Borda}(q, i) = \sum_{j=1}^{m} \tau_{q,D_j}(i)    (8)

The Reciprocal Rank Fusion also uses the rank information for computing a similarity score between videos vq and vi. The scores are computed according to a naïve scoring formula:

F_{Reciprocal}(q, i) = \sum_{j=1}^{m} \frac{1}{k + \tau_{q,D_j}(i)}    (9)

where k is a constant. In our implementation, we used k = 60, as suggested in the original paper.
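A compact sketch of both rank-based fusions (Eqs. 8 and 9), where each τ is represented as a dictionary mapping a video id to its position in the ranked list of one descriptor:

# Borda sum of rank positions (Eq. 8): lower aggregated value means better.
def borda(ranks_per_descriptor, video):
    return sum(tau[video] for tau in ranks_per_descriptor)

# Reciprocal Rank Fusion (Eq. 9): higher aggregated value means better; k = 60 as in [9].
def rrf(ranks_per_descriptor, video, k=60):
    return sum(1.0 / (k + tau[video]) for tau in ranks_per_descriptor)

# Example: positions of video "v7" in the text-based and visual-based ranked lists.
tau_text = {"v7": 1, "v3": 2}
tau_visual = {"v7": 4, "v3": 1}
print(borda([tau_text, tau_visual], "v7"))   # 1 + 4 = 5
print(rrf([tau_text, tau_visual], "v7"))     # 1/61 + 1/64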

5 Experiments & results

The aim of the conducted experiments is to evaluate our approach based on the proposed framework for multimodal geocoding, instantiating it for video geocoding in the Placing Task at MediaEval 2012. The next section describes this task, as well as the available datasets used in the experiments. In the following subsections, we present our strategies to address the proposed task and discuss the obtained results.

5.1 MediaEval 2012

This section introduces the Placing Task and its data in the MediaEval 2012 initiative.

5.1.1 Datasets

The datasets provided by the MediaEval 2012 organizers for the Placing Task are composed of a development set and a test set [46]. The development set contains 15,563 videos and 3,185,258 CC-licensed images from Flickr.4 All of them are accompanied by their latitude and longitude information, as well as the title, tags, and descriptions provided by the owner of that resource, comments of her/his friends, users' contact lists and home location, and other uploaded resources on Flickr. Videos are provided with their extracted keyframes and corresponding pre-extracted low-level visual features, and metadata. The more than 3 million available images are uniformly sampled from all parts of the world. Also, pre-extracted low-level visual features of each image are available. The test set comprises 4,182 videos, their keyframes with extracted visual features, and related metadata (without geographic location).

The keyframes were extracted by the organizers at 4-second intervals from the videos and saved as individual JPEG-format images. The following visual feature descriptors for keyframes and photos were extracted and provided: Color and Edge Directivity Descriptor (CEDD), Gabor Texture, Fuzzy Color and Texture Histogram (FCTH), Color Histogram, Scalable Color, Auto Color Correlogram, Tamura Texture, Edge Histogram, and Color Layout.

Participants in the Placing Task at MediaEval 2012 were allowed to use image/video metadata, audio and visual features, as well as external resources, depending on the run submitted. At least one run had to use only audio/visual features.

4 http://www.flickr.com/ (As of Apr. 2012).

The experiments reported here validate our architecture composed of existing tools. Our team used only resources provided by the Placing Task 2012 organizers and did not make use of any external resource like gazetteers, Wikipedia, or additional crawling. Thus, it is fair to compare these results with other teams' equivalent results, which we highlight in the last subsection.

In fact, the image database was only used by BoS^5000_CEDD to sample images for its bag-of-scenes dictionary. Besides that, all the other methods relied only on the 15,563 videos (development set) as their geo-profile database.

5.1.2 Placing task evaluation criteria

According to the evaluation criterion defined in the Placing Task 2012, the effectiveness of a method is based on the great circle distance (Haversine) of the estimated geographic coordinate of a video to its corresponding ground truth location, in a series of widening circles of radius (in km): 1, 10, 100, 1000, 10000. Thus, an estimated location is counted as correct at a given quality or precision level if it lies within the corresponding circle radius.

The results are often reported using a table with an accumulative count of correctly assigned videos at each precision level. This table shows a given method's behavior at different precision levels, for example, at which radius an evaluated method is able to perform satisfactorily. However, when comparing methods, the participants of the Placing Task usually prefer to emphasize the results for smaller circle radii. In that case, we are more interested in determining the location of a video as accurately as possible. More details about the Placing Task at MediaEval 2012 are given in the organizers' working notes [46].
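For reference, a short sketch of this evaluation (great-circle distance via the haversine formula, followed by an accumulative count over the widening circles); the coordinates below are made up:

import math

def haversine_km(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    # Great-circle distance between two (lat, lon) points, in km.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius * math.asin(math.sqrt(a))

def accumulative_counts(predictions, ground_truth, radii=(1, 10, 100, 1000, 10000)):
    # predictions and ground_truth: lists of (lat, lon) pairs, one per test video.
    distances = [haversine_km(p[0], p[1], g[0], g[1])
                 for p, g in zip(predictions, ground_truth)]
    return {r: sum(d <= r for d in distances) for r in radii}

print(accumulative_counts([(52.52, 13.40)], [(52.50, 13.42)]))   # correct within 10 km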

5.2 Weighted Average Score (WAS): new evaluation criterion

Two evaluation criteria are considered in our experiments. The first one (Section 5.1.2) is the most commonly used method to assess the effectiveness of the results submitted to the Placing Task. The second evaluation measure (Section 5.2) is one of the contributions of this work and has not been used to evaluate video geocoding methods before.

Besides the way each approach is evaluated by MediaEval 2012, in this work we propose a new scoring method whose goal is to assess the overall performance of a method based on those geographic distances. The Weighted Average Score (WAS) gives higher weights to the predictions with higher precision.

In other words, instead of a table with an accumulative count, we propose a score between 0 and 1 to indicate an overall precision level for the geocoding method being evaluated. This proposed score follows the principles of utility theory [13]. According to utility theory, there is a utility function (a user's preference function) that assigns a utility value (the gained value from a user's perspective) to each item. These values vary from item to item. The item can be a book, a product, or a video, as in our case. In general, we assume the utility of a relevant video decreases with its ranking order. More formally, given a utility function U(x) and two ranks x1, x2, with x1 < x2, according to this assumption, we expect the following condition to hold: U(x1) > U(x2). There are many possible functions that can be used to model this utility function satisfying the order-preserving condition given above.

Table 1 WAS(a) vs. accumulative count. WAS(a): 0.625888

Precision levels (km)   Average distance (km)   Average score(i)   Accum. count
<= 1                    0.50                    0.959066           30
<= 10                   5.00                    0.819113           50
<= 100                  50.00                   0.603063           65
<= 1000                 500.00                  0.372403           80
<= 10000                5000.00                 0.140127           100

Let d(i) be the geographic distance between the predicted location and the ground truth location of the video i. The proposed score for the result of a given test query i is defined as follows:

score(i) = 1 - \frac{\log(1 + d(i))}{\log(1 + R_{max})}    (10)

where Rmax is the maximum distance between any two points on the Earth's surface. The length of half of the Earth's circumference at the Equator is 20,027.5 km, thus Rmax = 20,027.5. The log function is used to reduce the impact of the different distances observed in the interval d(i) ∈ [0, Rmax]. Observe that score(i) ranges from 0 to 1, where 1 indicates a perfect estimation (d(i) = 0) and 0 an incorrect prediction (d(i) = Rmax). The other score values give a sense of how good the location estimation was with regard to the ground truth.

Let D be a test dataset with n videos whose locations need to be predicted. The overall score for the predictions of a method m is defined as:

WAS(m) = \frac{\sum_{i=1}^{n} score(i)}{n}    (11)
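A direct sketch of (10) and (11), reusing the great-circle distances computed as in Section 5.1.2:

import math

R_MAX = 20027.5   # half of the Earth's circumference at the Equator, in km

def score(distance_km):
    # Eq. 10: 1 for a perfect prediction, 0 for the maximum possible error.
    return 1.0 - math.log(1.0 + distance_km) / math.log(1.0 + R_MAX)

def was(distances_km):
    # Eq. 11: mean of the per-video scores over the whole test set.
    return sum(score(d) for d in distances_km) / len(distances_km)

# Example: three test videos geocoded 0.5 km, 50 km, and 5000 km from the ground truth.
print(was([0.5, 50.0, 5000.0]))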

This scoring system helps to tackle some issues that might arise when we look at an accumulative count table for each of the 1, 10, 100, 1000, and 10000 km classes (the distance from the estimated geographic coordinate of a video to its ground truth). Consider the examples in Tables 1, 2, and 3 with the hypothetical results of three methods (a, b, and c) evaluated on a test set with 100 queries.

The first and last columns are usually reported when the results are shown in Placing Task style. However, to help the understanding, we added two more columns to those tables. The second column gives the average distance for the test queries in the corresponding precision level. The third column (Average score(i)) presents the average score(i) for the results within that radius/distance.

Table 2 WAS(b) vs. accumulative count. WAS(b): 0.640495

Precision levels (km)   Average distance (km)   Average score(i)   Accum. count
<= 1                    0.50                    0.959066           25
<= 10                   5.00                    0.819113           60
<= 100                  50.00                   0.603063           65
<= 1000                 500.00                  0.372403           80
<= 10000                5000.00                 0.140127           100

Table 3 WAS(c) vs. accumulative count. WAS(c): 0.659540

Precision levels (km)   Average distance (km)   Average score(i)   Accum. count
<= 1                    0.50                    0.959066           25
<= 10                   2.50                    0.873527           60
<= 100                  50.00                   0.603063           65
<= 1000                 500.00                  0.372403           80
<= 10000                5000.00                 0.140127           100

Tables 1 and 2 show two methods whose results differ in terms of the number of correctly geocoded test queries within the 1 km and 10 km radii. If one cares only about results within 1 km, surely one would consider method a the best one because of its higher count at that precision level. Nonetheless, for 10 km, method b is better for the same reason, while for the 100 km radius they are both tied with the same number of geocoded items. In this case, WAS indicates that the results from method b are better because: (i) the count difference between them at 1 km is smaller than at the 10 km radius, and (ii) the difference in terms of score(i) for items at 1 km or 10 km is small.

Tables 2 and 3 show an example in which the accumulative count at the different precision levels is exactly the same for both methods b and c. The difference shows up when we look into the actual distances between the predicted points and their ground truth. Taking a closer look at the tables, we can notice that at the 10 km radius, the average distance for items at that level is 5 km for method b and 2.5 km for method c. Precisely this difference makes WAS(c) higher than WAS(b). In conclusion, not only does WAS consider these multiple levels of the results, but the proposed effectiveness measure also takes into account every single result of the whole test set to indicate and summarize the precision level of an evaluated method.

In this work, we present results based on both WAS and the accumulative count at the different precision levels adopted by the Placing Task at MediaEval 2012. However, to compare the distinct methods proposed in this paper, analyses based on WAS will be preferred.

5.3 Experimental setup

Our method to geocode a test query video is composed of three steps: text processing, visual content processing, and data fusion. We used the 15,563 videos from the development set (training set) released by the organizers of the Placing Task as the geo-profiles to which each test video is compared.

In order to assess our proposed framework, we first evaluate our results using only one modality of the video content (textual or visual). In this phase, different (textual and visual) descriptors are used, so that the descriptors yielding better results can be used in the information fusion module.

The visual processing module encodes visual content properties of each provided video. Next, the distances between each video in the test set and all videos in the training set are computed. Finally, for each test video, a ranked list of training videos is produced. The textual processing module works similarly, except for the feature extraction step, which is based on the video textual metadata. In summary, each module produces ranked lists of videos that are then processed by the information fusion module.

In our geocoding scheme, we consider that the top-ranked training video is the one that should transfer its known lat/long to the query video.

We also report results for the development set, that is, we perform experiments considering videos of the development set as query videos. In this case, given that the query video is always the best match to itself (thus it will be the first in its ranked list), we use the second video of the available ranked lists to define the final location.

Regarding the implementation, for textual processing, we set up a Solr5 server and then used a Python Solr API6 to index (with stemmer and tokenizer) and generate the corresponding term vectors. Later, they were accessed using a Java program, applying the corresponding version of Apache Lucene Core,7 to calculate the textual similarities described in Section 4.1. The other modules and algorithms were implemented using C, shell scripts, and Python.

5 http://lucene.apache.org/solr/ (As of Apr. 2012).
6 http://github.com/tow/sunburnt (As of Apr. 2012).
7 http://lucene.apache.org/core/ (As of Apr. 2012).

5.4 Results

In this section, we present the results of our experiments. Section 5.4.1 presents the results obtained when using a single modality to describe the videos in both the development and test sets, providing insights about the most suitable descriptors for the geocoding task. Section 5.4.2 performs a correlation analysis on the results of the evaluated methods, showing their potential for combination, bearing in mind that good results with low correlation are more likely to produce good fused results. Finally, Section 5.4.3 presents the results obtained by combining the distinct methods.

5.4.1 Single modality results

We have performed experiments on the development and test sets using each modality (textual and visual) in isolation from the other. The objective of this experiment is to determine which descriptor/approach is most appropriate for video geocoding.

For textual data, we applied the similarity functions described in Section 4.1 over the metadata associated with the available videos: title, description, and keywords. It is worth noting that deeper analyses of the bag-of-scenes (BoS) and the histogram of motion patterns (HMP) approaches are presented in [43] and [33], respectively. In our experiments, we have used their best parameters.

For the bag-of-scenes method, we performed experiments with dictionaries of 50, 500, and 5000 scenes. Additionally, we also considered two different inputs for creating the scene dictionaries: the Flickr photo collection and the frames of the videos of the development set. We named BoS50CEDD, BoS500CEDD, and BoS5000CEDD the bag-of-scenes variants with dictionaries based on Flickr photos, and BoF50CEDD, BoF500CEDD, and BoF5000CEDD those with dictionaries based on frames.

5 http://lucene.apache.org/solr/ (As of Apr. 2012).
6 http://github.com/tow/sunburnt (As of Apr. 2012).
7 http://lucene.apache.org/core/ (As of Apr. 2012).

Figures 2 and 3 show stacked bars associated with the evaluated methods, considering the development and test sets, respectively. Each rectangle in a stack represents one radius in the set of widening circles (1, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, and 10000 km) traditionally used by the organizers to measure the performance of video geocoding methods in the Placing Task. In those figures, the results of textual descriptors are colored in red, while those of visual descriptors are in green. Darker colors mean smaller radii; therefore, the larger the darker rectangles are, the more accurate the evaluated method is. For example, the first rectangle at the bottom of each stack refers to the 1 km radius. Bars related to predictions that are more than 10,000 km from the ground-truth location are not shown.
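As an illustration of this evaluation protocol, the sketch below bins the geodesic error of each test video into the widening circles; it assumes the great-circle (haversine) distance, which may differ slightly from the exact geodesic formula adopted by the task organizers.

```python
import math

RADII_KM = [1, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between two lat/long points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def widening_circle_fractions(predicted, ground_truth):
    """Fraction of videos whose prediction lies within each radius.
    Both arguments map a video id to a (lat, long) pair; these fractions
    are what the stacked bars accumulate."""
    errors = [haversine_km(*predicted[v], *ground_truth[v]) for v in predicted]
    return {r: sum(d <= r for d in errors) / len(errors) for r in RADII_KM}
```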

In the development set (Fig. 2), we can clearly see a better performance of text-based approaches in relation to visual-based approaches. As we can observe, for those methods, more video locations are predicted correctly within the more precise (lower) radii. The Okapi distance function considering the title, description, and keywords associated with a video (OKPa), or just the keywords (OKPk), yields the best results at 1 km precision, followed by Dice using only keywords (DICEk). Considering only the visual-based approaches, HMP is slightly better than BoF5000CEDD.

In the test set (Fig. 3), OKPa is again the best method. For visual-based approaches, there are very small differences among the methods, but HMP is still slightly better.

As expected, the results for the test set are worse than those observed for the development set. Most of the text-based approaches are able to geocode about 50 % of the videos of the development set within the 1 km widening circle. For the test set, however, none of them is able to predict the correct locations very accurately: less than 10 % of the videos are geocoded within the 1 km radius (first rectangle of each stack).

Fig. 2 Stacked bars showing the isolated performances of each method in the development set


Fig. 3 Stacked bars showing the isolated performances of each method in the test set

Figure 4a and b summarize the performance of the evaluated methods, now using our proposed score WAS(m) (Eq. (11)), for the development and test sets, respectively. Again, those figures represent the results of textual and visual descriptors with red and green bars, respectively. The results are arranged in descending order. As we can observe, the conclusions are similar to those drawn from Figs. 2 and 3: Okapi (OKPa and OKPk) is the best text-based method, followed by Dice (DICEk and DICEa); while HMP is the best visual-based method, followed by BoF5000CEDD and BoS5000CEDD.

5.4.2 Correlation analysis

This section analyzes the correlation among different features and modalities. Our objective is to assess how those features co-vary with each other. In many situations, correlation analysis provides additional cues that are very useful for selecting methods to be combined [44]. We have performed a correlation analysis to evaluate the most promising combinations of the text- and visual-based methods. Figure 5 shows the correlation graph (corrgram) for the development set results. In this case, we consider the distances between each predicted point and its ground truth.
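A minimal sketch of how such pairwise correlation values can be obtained is shown below; it assumes Pearson correlation over the per-video error distances (the exact coefficient used is defined earlier in the paper, not restated here) and relies on numpy.

```python
import numpy as np

def correlation_matrix(errors_by_method):
    """Pairwise correlation between methods, computed over the per-video
    distances from each predicted point to its ground truth.
    `errors_by_method` maps a method name to a list of distances, all
    aligned on the same set of development videos."""
    names = sorted(errors_by_method)
    data = np.array([errors_by_method[name] for name in names])
    return names, np.corrcoef(data)  # one row/column per method
```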

This kind of plot, also known as a correlogram, is presented in [15] and shows the correlation values for each pair of methods as a square matrix. The darker the color, the higher the correlation value. In the lower triangle, the correlation value is denoted by the intensity of the cell color. In the upper triangle of the matrix, the correlation value is given by the size of the painted area in the circles, as well as by their colors. The diagonal of the matrix holds the names of the methods corresponding to each row and column.


Fig. 4 WAS(m) measure (ordered by score) for isolated methods: (a) development set, (b) test set


Fig. 5 Correlation values for each pair of methods evaluated in the development set

The correlogram indicates a higher correlation among the different textual descriptors, given the darker color of their cells and the bigger painted areas in the corresponding circles. The same behavior can be observed for the correlation scores among the different visual descriptors. However, the correlation between textual and visual descriptors is very low (the lightest colors and smallest painted areas in their corresponding circles). As we stated before, the best combinations occur when the inputs are independent and non-correlated [10]. Therefore, textual and visual-based methods are very suitable for combination. The next section discusses the results of multimodal combinations and the reasoning behind the combination choices we made based on the descriptors' individual results and their correlation.

5.4.3 Fusion results

The choice of the best textual and visual methods was based on the correlation analysis and on the effectiveness results of each descriptor. Promising results have been reported using similar approaches [42].

Since our work here is focused on data fusion, we detail our submissions [32] to the Placing Task at MediaEval 2012 considering combined results for:

– only textual (Ftext) combines results from the textual descriptors Okapi and Dice, considering the three implementations that yielded the best results (Fig. 4a) and that have low correlation: Okapi applied to the three textual metadata fields (title, description, keywords) associated with a video (OKPa); Okapi applied only to the keywords field (OKPk); and Dice applied to the keywords (DICEk). These text-based methods have the best scores and low correlation among themselves.

– only visual (Fvisual) combines results from the three best visual descriptors, as shown in Fig. 4a: HMP, BoS5000CEDD, and BoF5000CEDD.

– text & visual (FTxVis) combines two textual and two visual features: OKPa, OKPk, HMP, and BoS5000CEDD. For the textual descriptors, the highest scores are those of the two versions of Okapi and of Dice. Looking at the correlation of OKPa (the best version) with the other best textual descriptors, DICEk and OKPk were tied with the lowest correlation (Fig. 5). Thus, OKPk was paired with OKPa due to its higher score. Using the same reasoning for the visual descriptors, the best ones are HMP, BoF5000CEDD, and BoS5000CEDD, with a similar correlation between HMP and the last two. We chose BoS5000CEDD because it is based on Flickr photos, which might be a better complement to the HMP approach (based on videos).

In the next subsection, we evaluate the three rank aggregation methods considering the features defined by Ftext, Fvisual, and FTxVis.

Evaluation of rank aggregation methods In order to choose among the three implemented fusion methods, we used our proposed scoring system WAS to analyze the overall performance. Each of the fusion methods was applied to combine text (Ftext), visual (Fvisual), and text and visual descriptors (FTxVis).

In Fig. 6, the WAS score for each fusion method and the respective error intervals are shown. In the graphic, the result of each fusion method is suffixed with ".M", ".B", or ".R" to indicate that it was generated, respectively, by the Multiplication, Borda, or Reciprocal Rank Fusion (RRF) method detailed in Section 4.3.

Fig. 6 Results of rank aggregation methods evaluated using WAS(m) and their standard error (SE) interval

As we can observe, the multiplication method yields statistically significantly better results, with 95 % confidence, when compared to the other fusion methods (no intersection of their confidence intervals). Due to these results, from now on we consider the use of the multiplication approach when we refer to the rank aggregation step of our geocoding framework.
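The three rank aggregation methods are formally defined in Section 4.3; the sketch below uses common textbook formulations (Borda count, Reciprocal Rank Fusion with the usual constant k = 60 [9], and a multiplicative rule assumed here to multiply rank positions), so the details may differ from the exact implementation. All lists are assumed to rank the same set of training videos.

```python
from collections import defaultdict

def rank_positions(ranked_list):
    """Map each video id to its 1-based position in one ranked list."""
    return {vid: pos for pos, vid in enumerate(ranked_list, start=1)}

def borda(ranked_lists):
    """Borda count: each list awards (N - rank) points; higher is better."""
    scores = defaultdict(float)
    for rl in ranked_lists:
        for vid, pos in rank_positions(rl).items():
            scores[vid] += len(rl) - pos
    return sorted(scores, key=scores.get, reverse=True)

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum of 1 / (k + rank); higher is better."""
    scores = defaultdict(float)
    for rl in ranked_lists:
        for vid, pos in rank_positions(rl).items():
            scores[vid] += 1.0 / (k + pos)
    return sorted(scores, key=scores.get, reverse=True)

def multiplicative(ranked_lists):
    """Multiplicative fusion (assumed as the product of rank positions);
    smaller products are better, so the fused list is sorted ascending."""
    scores = defaultdict(lambda: 1.0)
    for rl in ranked_lists:
        for vid, pos in rank_positions(rl).items():
            scores[vid] *= pos
    return sorted(scores, key=scores.get)
```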

Combined versus single modality Figures 7 and 8 show the stacked histograms comparing the methods for each widening circle used in the Placing Task evaluation. Those figures show the best methods of each modality (red bars for textual and green ones for visual) used in the combination experiments, as well as the results of their combination (blue bars). Both figures show that fusion methods yield better results than the use of single descriptors (either visual or textual).

In order to take a closer look at the text results, Fig. 9a and b compare the results of the fusion method that combines textual descriptors with the results of single features, using our proposed WAS(m) score, considering both the development and the test sets. As can be observed, the fusion of textual descriptors (Ftext) is better (higher score) than the use of the features in isolation (OKPa, OKPk, and DICEk), in both the development and test sets, with statistical significance at the 95 % confidence level.

Fig. 7 Stacked histograms showing the performances, in the development set, of the best methods for each modality and their fusion


Fig. 8 Stacked histograms showing the performances, in the test set, of the best methods for each modality and their fusion

Figure 10a shows the results for the combination of visual features (Fvisual). It shows that the combination of HMP, BoF5000CEDD, and BoS5000CEDD improved significantly (95 % confidence limits) over the best individual visual result (HMP) in the development set (0.3323 against 0.2733). However, for the test set, no statistical difference is identified, as shown in Fig. 10b (0.2845 over 0.2826).

Fig. 9 WAS(m) general score and standard error (SE) interval: fusion and individual textual descriptors results in the (a) development set and (b) test set

Fig. 10 WAS(m) general score and standard error interval: fusion and individual visual descriptors results in the (a) development set and (b) test set

The fusion taking into account visual and textual features (FTxVis) also yields better results than the use of a single modality (OKPa). Additionally, the improvement of FTxVis over Ftext is more visible for the test set (0.4445 over 0.4292) than for the development set (0.7403 over 0.7388), although, in both data sets, the Fvisual and FTxVis results are not significantly different (statistically speaking), as shown in Fig. 9a and b. One of the reasons might be the use of a balanced strategy for combining textual and visual features.

Consider the case in which the textual approach provides a perfect estimation and the visual method makes an incorrect prediction. Since the textual and visual features have the same contribution to the final result, their combination may not improve the overall performance of the individual strategies. Therefore, there is room for improvement in the fusion module, for example by incorporating new strategies for assigning different relevance weights to each method being combined. We will address this research avenue in future work.
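As an illustration of this idea (not part of the submitted system), a weighted variant of the multiplicative rule could raise each rank position to a per-method weight before multiplying, so that more reliable methods influence the fused order more strongly.

```python
def weighted_multiplicative(ranked_lists, weights):
    """Illustrative weighted fusion: rank positions raised to per-method
    weights before multiplication; smaller products are better."""
    scores = {}
    for ranked, weight in zip(ranked_lists, weights):
        for pos, vid in enumerate(ranked, start=1):
            scores[vid] = scores.get(vid, 1.0) * (pos ** weight)
    return sorted(scores, key=scores.get)
```

With equal weights this reduces to the balanced multiplicative rule used above.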

In summary, we can see better results when combining methods of different modalities or descriptors. The fusion of the three text methods (Ftext), as well as the fusion of visual and textual descriptors (FTxVis), outperforms the best single-descriptor method (OKPa), as shown in Fig. 9a and b.

Incorporating user-related data This section describes experiments to evaluate the impact of using user-related data. We analyze different geocoding strategies based on combining our best selected visual and textual features with ranked lists defined in terms of (U) just the user names found in the videos' metadata, (UH) the user names and the video owner's declared home location, and (UHC) the concatenation of user names, home location, and comments related to each video.

We treat the user-related data as additional textual information and use the textual descriptors described in Section 4.1 to index and process them. We also compare our best single "conventional" textual results (DICEa, DICEk, OKPa, and OKPk) with the different strategies to incorporate user information (U, UH, and UHC).
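For illustration, the three user-related text fields could be assembled from a video's metadata as in the sketch below (the dictionary keys are hypothetical; the actual field names come from the Placing Task metadata release).

```python
def user_fields(meta):
    """Build the U, UH, and UHC text fields from a video's metadata."""
    u = meta.get("user_name", "")
    uh = " ".join([u, meta.get("home_location", "")]).strip()
    uhc = " ".join([uh] + meta.get("comments", [])).strip()
    return {"U": u, "UH": uh, "UHC": uhc}
```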

Figure 11 shows the results for the training set, in terms of WAS scores and confidence intervals. Notice that the features based on user information yield worse results when compared to the "conventional" textual features. The best results for user-related features are observed for DiceUH (0.6819), OKPuh (0.6825), and TfIdfUH (0.6692), that is, the geocoding strategies that consider user names and owner location (UH).


Fig. 11 Effectiveness results in the development set for ranked lists defined by conventional textual (using OKPa, OKPk, DICEa, and DICEk similarity functions) and user-related properties (using OKP, Dice, and TfIdf)

We also conducted a correlation analysis in order to decide which user-related data should be used in the rank aggregation module. Figure 12 shows that user-related features have low correlation (light colors) with both conventional text and visual descriptors (HMP and BoS5000CEDD), which indicates that better results can be produced when they are combined.

Fig. 12 Correlogram in the development set for conventional text (OKPa, OKPk, DICEa, and DICEk), user-related (TfIdfUH, OKPuh, and DiceUH) features, and two best visual features (BoS5000CEDD and HMP)


Fig. 13 WAS(m) general score and standard error interval: fusion results for three different geocoding strategies (FTx, FTxVis, and TxVisUL) in the test set

Figure 13 shows the geocoding results of three different strategies: FTx, which considers the three best textual features; FTxVis, which uses the two best visual and textual features; and finally TxVisUL, which combines the two best visual, textual, and user-related features. These results consider the use of the multiplicative rank aggregation method. As expected, the user-related features improve the geocoding results: TxVisUL is significantly better than the other strategies with 95 % confidence.

These results support our hypothesis that fusing results of different modalities can improve the final geocoding results.

5.4.4 Comparisons with other video geocoding initiatives

This section compares the proposed geocoding method with the ones provided by other participants of the Placing Task 2012. We have not used WAS as the evaluation measure here, since the distance scores for each query video are not available for the geocoding methods of the other participants.

First, we compare the submissions that only consider the use of visual content. Figure 14 presents the results. As can be observed, our results are the best at 1 km precision (15.93 %). Our solution refers to the combination of our best visual descriptor results (Fvisual). Note that the other teams only achieve this level of accuracy at the 1,000 km radius. In fact, even at this radius, our method (UNICAMP) is still ahead, with 25.47 % of the test videos being correctly geocoded.

Figure 15 presents the best performance (external information allowed) of all participants at the 1 km radius. Our results (UNICAMP) consider a geocoding strategy based on combining visual and textual descriptions, but without using any user-related data.

Fig. 14 Only-visual submission: correctly geocoded test videos for different precision levels

Although UNICAMP achieved the third place in the overall best performance, the obtained results show how promising our proposed multimodal framework is for geocoding videos. We have even outperformed some other submissions that used extra information. Our textual and visual processing modules use straightforward information retrieval techniques (KNN searches) to query the test videos against the development set and then assign a location. The other participants of the Placing Task at MediaEval 2012, on the other hand, implemented ad hoc methods to define the location of a video (Section 2.1). The CEALIST team (first place), for example, exploits additional information, not provided by the organizers, to learn the tagging patterns of users and improve its results.

Fig. 15 Overall best submissions to the Placing Task 2012, considering correctly geocoded test videos within different precision radii

Fig. 16 Effectiveness performance for different precision levels. Results for methods that do not use any additional resources, nor gazetteers

Figure 16 shows the same results, but now eliminating the submissions that used additional information. In this figure, we also consider the implementation of our framework with two different strategies regarding the incorporation of user-related data. The UNICAMP results are the same reported in Fig. 15 and do not include user-related data. UNICAMP-UL, in turn, stands for the strategy that considers user-related data. As can be observed, UNICAMP-UL yields results comparable to the best submission (IRISA1) that does not use additional/external resources.

6 Conclusions

This paper proposes a flexible framework to perform multimodal geocoding by combining ranked lists defined in terms of different modalities. In our approach, textual and visual descriptors are combined by a rank aggregation approach. To the best of our knowledge, this is the first attempt to address this problem using this kind of solution.

An architecture is proposed to implement the geocoding framework. This architecture was validated in the context of the Placing Task of the MediaEval 2012 initiative. The conducted experiments demonstrate that the proposed fusion approach yields better results when compared with those based on a single clue (either a textual or a visual descriptor). The obtained results also demonstrate that, despite the simple textual description methods used, the performance of the proposed method is comparable to that of the best submissions of the Placing Task. The potential of this framework relies on the fact that each module can be improved separately, opening new opportunities for further investigation related to the development and use of novel rank aggregation methods.

Another contribution of this paper is the proposal of a new effectiveness measure to assess the performance of geocoding methods. Instead of counting the videos correctly assigned within various predefined precision radii, as Placing Task participants usually do, each method is evaluated in terms of a score between 0 (poor) and 1 (perfect), based on the geographical distances between predictions and ground-truth locations.


Future work includes the investigation of other strategies for combining different modalities and exploring the strength of each modality for geocoding videos. Some promising alternatives rely on the use of rank aggregation methods based on re-ranking approaches [41, 42]. In addition, we want to evaluate the use of supervised methods for feature fusion [2, 12]. Finally, we also plan to consider other information sources, such as user profiles, Geonames, and Wikipedia, to filter out noisy data from ranked lists.

Acknowledgements The authors thank CAPES (Brazilian Federal Agency for Support and Evaluation of Graduate Education), FAPESP (São Paulo Research Foundation) grants 2011/11171-5 and 2009/10554-8, and CNPq (National Council for Scientific and Technological Development) grants 306580/2012-8 and 484254/2012-0, as well as the CPqD Foundation (Telecommunications Research and Development Center), for their support. Additionally, we would like to thank the anonymous reviewers for their suggestions and questions, which gave us the chance to improve our paper.

References

1. Almeida J, Leite NJ, Torres R da S (2011) Comparison of video sequences with histograms of motion patterns. In: International conference on image processing, pp 3673–3676
2. Andrade FSP, Almeida J, Pedrini H, Torres R da S (2012) Fusion of local and global descriptors for content-based image and video retrieval. In: Iberoamerican congress on pattern recognition (CIARP), pp 845–853
3. Boureau YL, Bach F, LeCun Y, Ponce J (2010) Learning mid-level features for recognition. In: Conference on computer vision and pattern recognition, pp 2559–2566. doi:10.1109/CVPR.2010.5539963
4. Candeias R, Martins B (2011) Associating relevant photos to georeferenced textual documents through rank aggregation. In: Terra Cognita 2011 workshop, in conjunction with the 10th international semantic web conference
5. Choi J, Ekambaram VN, Friedland G, Ramchandran K (2012) The 2012 ICSI/Berkeley video location estimation system. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
6. Choi J, Lei H, Friedland G (2011) The 2011 ICSI video location estimation system. In: Working notes proceedings of the MediaEval workshop, vol 807
7. Clinchant S, Ah-Pine J, Csurka G (2011) Semantic combination of textual and visual information in multimedia retrieval. In: International conference on multimedia retrieval, pp 44:1–44:8
8. Coppersmith D, Fleischer LK, Rurda A (2010) Ordering by weighted number of wins gives a good ranking for weighted tournaments. ACM Trans Algorithm 6(3):55:1–55:13
9. Cormack GV, Clarke CLA, Buettcher S (2009) Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: ACM SIGIR conference on research and development in information retrieval, pp 758–759
10. Croft WB (2002) Combining approaches to information retrieval. In: Croft WB (ed) Advances in information retrieval, The Information Retrieval Series, vol 7. Springer US, pp 1–36
11. Ding D, Zhang B (2007) Probabilistic model supported rank aggregation for the semantic concept detection in video. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, pp 587–594. doi:10.1145/1282280.1282364
12. Faria FA, Veloso A, de Almeida HM, Valle E, Torres R da S, Gonçalves MA, Jr WM (2010) Learning to rank for content-based image retrieval. In: International conference on multimedia information retrieval, pp 285–294
13. Fishburn PC (1988) Nonlinear preference and utility theory. Johns Hopkins University Press, Baltimore


14. Fox EA, Shaw JA (1994) Combination of multiple searches. In: Text REtrieval Conference (TREC-2), vol 500–215, pp 243–252
15. Friendly M (2002) Corrgrams: exploratory displays for correlation matrices. Am Stat 56(4):316–324
16. Hauff C, Houben GJ (2011) WISTUD at MediaEval 2011: placing task. In: Working notes proceedings of the MediaEval workshop, vol 807
17. Hays J, Efros AA (2008) im2gps: estimating geographic information from a single image. In: Conference on computer vision and pattern recognition
18. Jones CB, Purves RS (2008) Geographical information retrieval. Int J Geogr Inf Sci 22(3):219–228
19. Kalantidis Y, Tolias G, Avrithis Y, Phinikettos M, Spyrou E, Mylonas P, Kollias S (2011) Viral: visual image retrieval and localization. Multimed Tools Appl 51:555–592
20. Kelm P, Schmiedeke S, Sikora T (2011) A hierarchical, multi-modal approach for placing videos on the map using millions of Flickr photographs. In: Workshop on Social and Behavioural Networked Media Access, SBNMA '11, pp 15–20

21. Kelm P, Schmiedeke S, Sikora T (2011) Multi-modal, multi-resource methods for placing Flickr videos on the map. In: International conference on multimedia retrieval
22. Kelm P, Schmiedeke S, Sikora T (2012) How spatial segmentation improves the multimodal geo-tagging. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
23. Kelm P, Schmiedeke S, Sikora T (2012) Multimodal geo-tagging in social media websites using hierarchical spatial segmentation. In: LBSN '12. ACM, New York, NY, pp 32–39. doi:10.1145/2442796.2442805
24. Khudyak KA, Kurland O (2011) Cluster-based fusion of retrieved lists. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval, SIGIR '11, pp 893–902
25. Klementiev A, Roth D, Small K (2008) A framework for unsupervised rank aggregation. In: Proc. of the ACM SIGIR conference (SIGIR) workshop on learning to rank for information retrieval, pp 32–39. http://cogcomp.cs.illinois.edu/papers/KlementievRoSm08a.pdf
26. Kludas J, Bruno E, Marchand-Maillet S (2008) Information fusion in multimedia information retrieval. In: Boujemaa N, Detyniecki M, Nürnberger A (eds) Adaptive multimedial retrieval: retrieval, user, and semantics. Springer, New York, pp 147–159
27. Kokar MM, Tomasik JA, Weyman J (2004) Formalizing classes of information fusion systems. Inform Fusion 5(3):189–202
28. Laere OV, Schockaert S, Dhoedt B (2011) Ghent university at the 2011 placing task. In: Working notes proceedings of the MediaEval workshop, vol 807
29. Laere OV, Schockaert S, Quinn JA, Langbein FC, Dhoedt B (2012) Ghent and Cardiff university at the 2012 placing task. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
30. Larson M, Soleymani M, Serdyukov P, Rudinac S, Wartena C, Murdock V, Friedland G, Ordelman R, Jones GJF (2011) Automatic tagging and geotagging in video collections and communities. In: International conference on multimedia retrieval, pp 51:1–51:8
31. Larson RR (2009) Geographic information retrieval and digital libraries. In: European conference on research and advanced technology for digital libraries, vol 5714, pp 461–464
32. Li LT, Almeida J, Pedronette DCG, Penatti OAB, Torres R da S (2012) A multimodal approach for video geocoding. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
33. Li LT, Almeida J, Torres R da S (2011) RECOD working notes for placing task MediaEval 2011. In: Working notes proceedings of the MediaEval workshop, vol 807
34. Li LT, Pedronette DCG, Almeida J, Penatti OAB, Calumby RT, Torres R da S (2012) Multimedia multimodal geocoding. In: ACM SIGSPATIAL international conference on advances in geographic information systems, pp 474–477


35. Li X, Hauff C, Larson M, Hanjalic A (2012) Preliminary exploration of the use of geographical information for content-based geo-tagging of social video. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
36. Luo J, Joshi D, Yu J, Gallagher A (2011) Geotagging in multimedia and computer vision–a survey. Multimed Tools Appl 51:187–211
37. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York, NY
38. Montague M, Aslam JA (2002) Condorcet fusion for improved retrieval. In: Proceedings of the 11th international Conference on Information and Knowledge Management, CIKM '02, pp 538–548. doi:10.1145/584792.584881
39. Olligschlaeger AM, Hauptmann AG (1999) Multimodal information systems and GIS: the Informedia digital video library. In: 1999 ESRI user conference. http://www.informedia.cs.cmu.edu/documents/ESRI99.html
40. Pedronette DCG (2012) Exploiting contextual information for image re-ranking and rank aggregation in image retrieval tasks. Ph.D. thesis, University of Campinas (UNICAMP), Campinas, SP, Brazil
41. Pedronette DCG, Torres R da S (2011) Exploiting clustering approaches for image re-ranking. J Vis Lang Comput 22(6):453–466
42. Pedronette DCG, Torres R da S, Calumby RT (2012) Using contextual spaces for image re-ranking and rank aggregation. Multimed Tools Appl:1–28. doi:10.1007/s11042-012-1115-z
43. Penatti OAB, Li LT, Almeida J, Torres R da S (2012) A visual approach for video geocoding using bag-of-scenes. In: International conference on multimedia retrieval
44. Poh N, Bengio S (2005) How do correlation and variance of base-experts affect fusion in biometric authentication tasks? IEEE Trans Signal Process 53(11):4384–4396
45. Popescu A, Ballas N (2012) CEA LIST's participation at MediaEval 2012 placing task. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
46. Rae A, Kelm P (2012) Working notes for the placing task at MediaEval 2012. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
47. Schalekamp F, Zuylen A (1998) Rank aggregation: together we're strong. In: Workshop on Algorithm Engineering and Experiments (ALENEX), pp 38–51
48. Sculley D (2007) Rank aggregation for similar items. In: SIAM international conference on Data Mining (SDM 2007), pp 587–592
49. Serdyukov P, Murdock V, van Zwol R (2009) Placing Flickr photos on a map. In: ACM SIGIR, pp 484–491. doi:10.1145/1571941.1572025
50. Trevisiol M, Delhumeau J, Jégou H, Gravier G (2012) How INRIA/IRISA identifies geographic location of a video. In: Larson MA, Schmiedeke S, Kelm P, Rae A, Mezaris V, Piatrik T, Soleymani M, Metze F, Jones GJF (eds) Working notes proceedings of the MediaEval 2012 workshop, Santa Croce in Fossabanda, Pisa, Italy, 4–5 October, 2012, CEUR Workshop Proceedings, vol 927. CEUR-WS.org
51. Trevisiol M, Jégou H, Delhumeau J, Gravier G (2013) Retrieving geo-location of videos with a divide & conquer hierarchical multimodal approach. In: International conference on multimedia retrieval
52. van Gemert JC, Veenman CJ, Smeulders AWM, Geusebroek JM (2010) Visual word ambiguity. IEEE Trans Pattern Anal Mach Intell 32:1271–1283
53. Van Laere O, Schockaert S, Dhoedt B (2011) Finding locations of Flickr resources using language models and similarity search. In: International conference on multimedia retrieval, pp 48:1–48:8. doi:10.1145/1991996.1992044
54. Young HP (1974) An axiomatization of Borda's rule. J Econ Theory 9(1):43–52
55. Zhang H, Jiang L, Su J (2005) Augmenting naive Bayes for ranking. In: International conference on machine learning, pp 1020–1027
56. Zhou X, Depeursinge A, Müller H (2010) Information fusion for combining visual and textual image retrieval in ImageCLEF@ICPR. In: Proceedings of the 20th International Conference on Recognizing Patterns in Signals, Speech, Images, and Videos, ICPR '10. Springer-Verlag, Berlin, Heidelberg, pp 129–137


Lin Tzy Li is a Ph.D. candidate in the Institute of Computing (IC) at UNICAMP, Brazil, advised by Dr. Ricardo da S. Torres. She was a Ph.D. intern at Virginia Tech (Digital Library Research Laboratory), supervised by Dr. Edward A. Fox, from August 2010 to July 2011. She is also a researcher at the Telecommunications Research and Development Center, CPqD Foundation, in Brazil. Lin holds M.Sc. and B.Sc. degrees in Computer Science, and an M.B.A. degree. Her current research interests include GIS, databases, data fusion, and the integration of image/video, text, and geographic information retrieval. Her Ph.D. work is on Multimodal Retrieval of Geographic Information.

Daniel Carlos Guimarães Pedronette received a B.Sc. in Computer Science (2005) from the State University of São Paulo (Brazil) and the M.Sc. degree in Computer Science (2008) from the University of Campinas (Brazil). He got his doctorate in Computer Science at the same university in 2012. He is currently an assistant professor at the Department of Statistics, Applied Mathematics and Computing, Universidade Estadual Paulista (UNESP), in Brazil. His research interests involve content-based image retrieval, re-ranking, rank aggregation, digital libraries, and image analysis.


Jurandy Almeida received his B.Sc. in Computer Science from Sao Paulo State University (UNESP, 2004) and his M.Sc. (2007) and Ph.D. (2011) degrees in Computer Science from the University of Campinas (UNICAMP). Currently, he is an associate researcher at the Institute of Computing, UNICAMP. He has developed research on databases, image processing, and computer vision in applications of visual information retrieval.

Otávio A. B. Penatti is currently an associate researcher at the Recod lab in the Institute of Computing, University of Campinas, Brazil. He got his doctorate in computer science in 2012 and received his M.Sc. degree in 2009, both from the University of Campinas. His research interests include computer vision, content-based image retrieval, and machine learning applications.


Rodrigo Tripodi Calumby received a B.Sc. in Computer Science from the University of Santa Cruz, Brazil, in 2007. He was awarded the M.Sc. degree by the University of Campinas in 2010. He is an assistant professor at the University of Feira de Santana. Calumby is also a Ph.D. candidate at the University of Campinas. His main research interests include content-based information retrieval and classification, multimodal fusion, databases, and machine learning applications.

Ricardo da Silva Torres received a B.Sc. in Computer Engineering from the University of Campinas, Brazil, in 2000. He got his doctorate in Computer Science at the same university in 2004. He is an associate professor at the Institute of Computing, University of Campinas. His research interests include image analysis, content-based image retrieval, databases, digital libraries, and geographic information systems.

